From eitan at mellanox.co.il  Fri Apr  1 01:29:12 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 1 Apr 2005 12:29:12 +0300 
Subject: [openib-general] [RMPP] RMPP formatting assumptions
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF062@mtlex01.yok.mtl.com>

Seems ok to me.


> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Friday, April 01, 2005 2:16 AM
> To: openib-general
> Subject: Re: [openib-general] [RMPP] RMPP formatting assumptions
> 
> So far, here are my assumptions regarding the formatting of the RMPP MADs.
> 
> The following fields in the RMPP header are set by the user:
> Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0
> 
> The RMPP code will set the SegNum and update the Flags, but uses the
> ACTIVE bit to determine if the user requires RMPP for a given transfer.
>   I could easily have the RMPP code set some of these fields, but
> thought that the caller might be able to initialize them more efficiently.
> 
> The WR length of a transfer should equal the size of the MAD header,
> the RMPP header, class specific header for SA or vendor, plus a data
> buffer that is evenly divisible by the size of the class' Data field.
> This requirement is needed to prevent the RMPP code from allocating and
> copying data segments.
> 
> The payload field in the RMPP header should be set to the size of the
> class specific header plus the number of valid bytes of user data in
> the data buffer.  The RMPP code will adjust the payload value to
> account for multiple headers.
> 
> Comments?
> 
> - Sean
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com  Fri Apr  1 04:08:29 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 07:08:29 -0500
Subject: [openib-general] [RMPP] RMPP formatting assumptions
In-Reply-To: <424C92CE.7040709@ichips.intel.com>
References: <42488FDF.2050608@ichips.intel.com>
	<424C92CE.7040709@ichips.intel.com>
Message-ID: <1112357309.4490.63.camel@localhost.localdomain>

On Thu, 2005-03-31 at 19:16, Sean Hefty wrote:
> So far, here are my assumptions regarding the formatting of the RMPP MADs.
> 
> The following fields in the RMPP header are set by the user:
> Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0

Should RMPP set the status rather than the user or is this an efficiency
thing ?

> The RMPP code will set the SegNum and update the Flags, but uses the 
> ACTIVE bit to determine if the user requires RMPP for a given transfer. 
>   I could easily have the RMPP code set some of these fields, but 
> thought that the caller might be able to initialize them more efficiently.
> 
> The WR length of a transfer should equal the size of the MAD header, 
> the RMPP header, class specific header for SA or vendor, plus a data 
> buffer that is evenly divisible by the size of the class' Data field. 
> This requirement is needed to prevent the RMPP code from allocating and 
> copying data segments.
> 
> The payload field in the RMPP header should be set to the size of the 
> class specific header plus the number of valid bytes of user data in 
> the data buffer.  The RMPP code will adjust the payload value to 
> account for multiple headers.

So it sounds like the streaming mode of RMPP is not handled on transmit.
It is optional on that side.

What about receive ? Can it/will it handle streaming mode ?

-- Hal


From roland at topspin.com  Fri Apr  1 08:27:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 08:27:19 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com> (Fab
	Tillier's message of "Thu, 31 Mar 2005 21:59:02 -0800")
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com>
Message-ID: <528y42laxk.fsf@topspin.com>

    Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in
    Fab> PRM 1.0.0).  It states that a length of zero in a data
    Fab> segment indicates a 2GB transfer (MSb is used as a flag to
    Fab> indicate normal vs. inline data segments).  A zero-byte
    Fab> request must not reference any data segments.

Yup, that must be the problem.  I guess mthca can skip over 0-length
data segments.  Another option would be to say that such work requests
aren't allowed.  Not sure which way I think we should go.  I need to
talk to Libor and find out why SDP is generating such requests.

 - R.


From libor at topspin.com  Fri Apr  1 09:03:31 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 1 Apr 2005 09:03:31 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <52is37kuz4.fsf@topspin.com>;
	from roland@topspin.com on Thu, Mar 31, 2005 at 07:59:43PM -0800
References: <52acojmt34.fsf@topspin.com> <20050331231023.GC6807@mellanox.co.il>
	<52is37kuz4.fsf@topspin.com>
Message-ID: <20050401090331.A2870@topspin.com>

On Thu, Mar 31, 2005 at 07:59:43PM -0800, Roland Dreier wrote:
> 
> SDP is generating the 0-length RDMA by posting an RDMA READ with a
> single scatter entry whose length is zero, which may behave
> differently from posting an RDMA READ with no scatter entries.  I need
> to check this out, and also test on Tavor.

  I'll look into why SDP is generating a 0 length RDMA read, this
should not be happening.


-Libor


From ftillier at infiniconsys.com  Fri Apr  1 09:34:43 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Fri, 1 Apr 2005 09:34:43 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <528y42laxk.fsf@topspin.com>
Message-ID: <001401c536e1$18303080$1802a8c0@infiniconsys.com>

> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Friday, April 01, 2005 8:27 AM
> 
>     Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in
>     Fab> PRM 1.0.0).  It states that a length of zero in a data
>     Fab> segment indicates a 2GB transfer (MSb is used as a flag to
>     Fab> indicate normal vs. inline data segments).  A zero-byte
>     Fab> request must not reference any data segments.
> 
> Yup, that must be the problem.  I guess mthca can skip over 0-length
> data segments.  Another option would be to say that such work requests
> aren't allowed.  Not sure which way I think we should go.  I need to
> talk to Libor and find out why SDP is generating such requests.
> 

If the overhead of checking for zero-length is negligible, I would recommend
trapping this in mthca.  My reasoning is that unless the IB spec states that
0-length operations can't have data segments, this is a HW specific
limitation and should be handled within the driver.
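
A minimal sketch of what such a driver-side check could look like: when building the hardware data segments, copy only the scatter/gather entries with a non-zero length, so that a zero-byte request ends up referencing no data segments at all.  The names below (struct sge, struct hw_data_seg, build_data_segs) are hypothetical stand-ins, not mthca's actual types, and the segment layout is simplified.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter/gather entry as posted by the consumer. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Hypothetical, simplified hardware data segment. */
struct hw_data_seg {
	uint32_t byte_count;
	uint32_t lkey;
	uint64_t addr;
};

/*
 * Copy the non-empty entries of sg_list into segs and return how many
 * segments were written.  Zero-length entries are skipped, because a
 * byte count of 0 would be read by the hardware as a 2GB transfer.
 */
static size_t build_data_segs(const struct sge *sg_list, size_t num_sge,
			      struct hw_data_seg *segs)
{
	size_t i, n = 0;

	for (i = 0; i < num_sge; ++i) {
		if (sg_list[i].length == 0)
			continue;
		segs[n].byte_count = sg_list[i].length;
		segs[n].lkey       = sg_list[i].lkey;
		segs[n].addr       = sg_list[i].addr;
		++n;
	}

	return n;
}
```

The extra cost is one compare per scatter entry, which is the overhead in question.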

- Fab


From mshefty at ichips.intel.com  Fri Apr  1 09:35:39 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 01 Apr 2005 09:35:39 -0800
Subject: [openib-general] [RMPP] RMPP formatting assumptions
In-Reply-To: <1112357309.4490.63.camel@localhost.localdomain>
References: <42488FDF.2050608@ichips.intel.com>	
	<424C92CE.7040709@ichips.intel.com>
	<1112357309.4490.63.camel@localhost.localdomain>
Message-ID: <424D866B.50600@ichips.intel.com>

Hal Rosenstock wrote:
> On Thu, 2005-03-31 at 19:16, Sean Hefty wrote:
> 
>>So far, here are my assumptions regarding the formatting of the RMPP MADs.
>>
>>The following fields in the RMPP header are set by the user:
>>Version, Type = DATA, RTime, Flags = ACTIVE, and Status = 0
> 
> Should RMPP set the status rather than the user or is this an efficiency
> thing ?

I was trying to limit the fields that the RMPP header would need to 
touch.  The RMPP layer would change the status if an error occurred.


> So it sounds like the streaming mode of RMPP is not handled on transmit.
> It is optional on that side.

I'm assuming that you're referring to the case where the payload length 
is set to 0.  This is not handled.  I'm not even sure how you could 
handle such a transfer without changes to the MAD API and having the 
client be aware of the RMPP implementation.

> What about receive ? Can it/will it handle streaming mode ?

The receive side uses the LAST bit to check for the end of the data 
transfer, so should work if payload length is 0.
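
To make the length rules from the formatting assumptions concrete, here is a rough sketch for the SA class.  The sizes (24-byte MAD header, 12-byte RMPP header, 20-byte SA header, 200-byte SA Data field) follow the standard 256-byte MAD layout.  The helper names are made up for illustration, and the "adjusted" payload is only one reading of "account for multiple headers" (one class header per segment), not a statement of what the RMPP code actually does.

```c
#include <stddef.h>

enum {
	MAD_HDR_LEN  = 24,	/* common MAD header */
	RMPP_HDR_LEN = 12,	/* RMPP header */
	SA_HDR_LEN   = 20,	/* SA class-specific header */
	SA_DATA_LEN  = 200	/* SA Data field per segment */
};

/* Number of segments needed for data_len bytes (at least one). */
static size_t sa_seg_count(size_t data_len)
{
	size_t segs = (data_len + SA_DATA_LEN - 1) / SA_DATA_LEN;

	return segs ? segs : 1;
}

/* WR length: headers plus a data buffer padded to whole Data fields. */
static size_t sa_wr_len(size_t data_len)
{
	return MAD_HDR_LEN + RMPP_HDR_LEN + SA_HDR_LEN +
	       sa_seg_count(data_len) * SA_DATA_LEN;
}

/* Payload as the caller would set it: one class header plus valid data. */
static size_t sa_user_payload(size_t data_len)
{
	return SA_HDR_LEN + data_len;
}

/* Assumed adjustment: one class header for every segment sent. */
static size_t sa_adjusted_payload(size_t data_len)
{
	return sa_seg_count(data_len) * SA_HDR_LEN + data_len;
}
```

For 450 bytes of SA data this gives three segments, a WR length of 656 bytes, a caller-set payload of 470 and an adjusted payload of 510.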

- Sean


From roland at topspin.com  Fri Apr  1 09:45:33 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 09:45:33 -0800
Subject: [openib-general] [PATCH][4/3] IPoIB: document conversion to debugfs
In-Reply-To: <20053311936.XaQmN4N9new7dTCP@topspin.com> (Roland Dreier's
	message of "Thu, 31 Mar 2005 19:36:12 -0800")
References: <20053311936.XaQmN4N9new7dTCP@topspin.com>
Message-ID: <52r7hujsqq.fsf@topspin.com>

Update IPoIB documentation now that multicast debugging files have
moved from ipoib_debugfs to debugfs.

Signed-off-by: Roland Dreier <roland at topspin.com>

--- linux-export.orig/Documentation/infiniband/ipoib.txt	2005-03-31 19:07:01.000000000 -0800
+++ linux-export/Documentation/infiniband/ipoib.txt	2005-04-01 09:43:27.122520190 -0800
@@ -32,14 +32,13 @@
   mcast_debug_level to 1.  These parameters can be controlled at
   runtime through files in /sys/module/ib_ipoib/.
 
-  CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs"
+  CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
   virtual filesystem.  By mounting this filesystem, for example with
 
-    mkdir -p /ipoib_debugfs
-    mount -t ipoib_debugfs none /ipoib_debufs
+    mount -t debugfs none /sys/kernel/debug
 
-  it is possible to get statistics about multicast groups from the
-  files /ipoib_debugfs/ib0_mcg and so on.
+  it is possible to get statistics about multicast groups from the
+  files /sys/kernel/debug/ipoib/ib0_mcg and so on.
 
   The performance impact of this option is negligible, so it
   is safe to enable this option with debug_level set to 0 for normal


From roland at topspin.com  Fri Apr  1 10:23:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:50 -0800
Subject: [openib-general] [PATCH][2/6] IB: remove unneeded includes
In-Reply-To: <2005411023.BIKgS4OLfFzZN9qI@topspin.com>
Message-ID: <2005411023.AERMWYHGiX8V5KDM@topspin.com>

From: Hal Rosenstock <halr at voltaire.com>

Eliminate no longer needed include files

Signed-off-by: Hal Rosenstock <halr at voltaire.com>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/core/mad.c	2005-04-01 10:08:54.939957801 -0800
+++ linux-export/drivers/infiniband/core/mad.c	2005-04-01 10:08:56.473624910 -0800
@@ -33,9 +33,6 @@
  */
 
 #include <linux/dma-mapping.h>
-#include <linux/interrupt.h>
-
-#include <ib_mad.h>
 
 #include "mad_priv.h"
 #include "smi.h"


From roland at topspin.com  Fri Apr  1 10:23:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:51 -0800
Subject: [openib-general] [PATCH][3/6] IB: Fix FMR pool crash
In-Reply-To: <2005411023.AERMWYHGiX8V5KDM@topspin.com>
Message-ID: <2005411023.09JoUTQ2SAMPiKPQ@topspin.com>

Mask bits correctly from jhash result in ib_fmr_hash() so that the
computed bucket index is within our hash table.  This fixes an SDP
crash.

Signed-off-by: Roland Dreier <roland at topspin.com>
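
The effect of the mask can be seen in a small standalone sketch.  mix32() below is only a stand-in for jhash_2words() (which is not available outside the kernel), and HASH_SIZE is an assumed table size.  The point is that masking with HASH_SIZE - 1 keeps the index in range only because the size is a power of two.

```c
#include <stdint.h>

#define HASH_BITS 8
#define HASH_SIZE (1u << HASH_BITS)	/* must be a power of two */

/* Stand-in 32-bit mixer; NOT the kernel's jhash_2words(). */
static uint32_t mix32(uint32_t a, uint32_t b)
{
	uint32_t h = a * 0x9e3779b1u;

	h ^= b + 0x7f4a7c15u + (h << 6) + (h >> 2);
	h ^= h >> 15;
	return h;
}

/* Map a 64-bit page address to a bucket index in [0, HASH_SIZE). */
static uint32_t fmr_bucket(uint64_t first_page)
{
	return mix32((uint32_t) first_page, (uint32_t) (first_page >> 32)) &
		(HASH_SIZE - 1);
}
```

Without the final mask the full 32-bit hash would be used as an array index, which is the out-of-range access described above.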


--- linux-export.orig/drivers/infiniband/core/fmr_pool.c	2005-03-31 19:07:05.000000000 -0800
+++ linux-export/drivers/infiniband/core/fmr_pool.c	2005-04-01 10:08:58.240241456 -0800
@@ -103,9 +103,8 @@
 
 static inline u32 ib_fmr_hash(u64 first_page)
 {
-	return jhash_2words((u32) first_page,
-			    (u32) (first_page >> 32),
-			    0);
+	return jhash_2words((u32) first_page, (u32) (first_page >> 32), 0) &
+		(IB_FMR_HASH_SIZE - 1);
 }
 
 /* Caller must hold pool_lock */


From roland at topspin.com  Fri Apr  1 10:23:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:50 -0800
Subject: [openib-general] [PATCH][1/6] IB: Keep MAD work completion valid
Message-ID: <2005411023.BIKgS4OLfFzZN9qI@topspin.com>

From: Sean Hefty <sean.hefty at intel.com>

Replace the *wc field in ib_mad_recv_wc from pointing to a structure
on the stack to one allocated with the received MAD buffer.  This
allows a client to access the *wc field after their receive completion
handler has returned.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
Signed-off-by: Roland Dreier <roland at topspin.com>
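
The lifetime issue can be illustrated outside the kernel.  In the sketch below, struct wc, struct recv_buf and setup_recv() are simplified, hypothetical stand-ins for the real MAD structures: the completion is copied into the receive buffer and the pointer handed to the client refers to that copy, so it stays valid after the caller's stack copy is gone.

```c
#include <stdint.h>

/* Simplified stand-in for a work completion. */
struct wc {
	uint64_t wr_id;
	uint32_t byte_len;
};

/* Simplified stand-in for the private receive buffer. */
struct recv_buf {
	struct wc  wc;		/* storage owned by the buffer */
	struct wc *wc_ptr;	/* what the client dereferences */
};

/* Copy the completion into the buffer and point at the copy. */
static void setup_recv(struct recv_buf *recv, const struct wc *wc)
{
	recv->wc     = *wc;
	recv->wc_ptr = &recv->wc;
}
```

The cost is one structure copy per received MAD plus the extra storage in each receive buffer.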


--- linux-export.orig/drivers/infiniband/core/mad.c	2005-03-31 19:07:01.000000000 -0800
+++ linux-export/drivers/infiniband/core/mad.c	2005-04-01 10:08:54.939957801 -0800
@@ -1600,7 +1600,8 @@
 			 DMA_FROM_DEVICE);
 
 	/* Setup MAD receive work completion from "normal" work completion */
-	recv->header.recv_wc.wc = wc;
+	recv->header.wc = *wc;
+	recv->header.recv_wc.wc = &recv->header.wc;
 	recv->header.recv_wc.mad_len = sizeof(struct ib_mad);
 	recv->header.recv_wc.recv_buf.mad = &recv->mad.mad;
 	recv->header.recv_wc.recv_buf.grh = &recv->grh;
--- linux-export.orig/drivers/infiniband/core/mad_priv.h	2005-03-31 19:07:14.000000000 -0800
+++ linux-export/drivers/infiniband/core/mad_priv.h	2005-04-01 10:08:54.961953027 -0800
@@ -69,6 +69,7 @@
 struct ib_mad_private_header {
 	struct ib_mad_list_head mad_list;
 	struct ib_mad_recv_wc recv_wc;
+	struct ib_wc wc;
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 } __attribute__ ((packed));
 

From roland at topspin.com  Fri Apr  1 10:23:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:51 -0800
Subject: [openib-general] [PATCH][4/6] IB: Trivial FMR printk cleanup
In-Reply-To: <2005411023.09JoUTQ2SAMPiKPQ@topspin.com>
Message-ID: <2005411023.5oEZz0iawuKxVyay@topspin.com>

From: Libor Michalek <libor at topspin.com>

Add missing newline in printk.

Signed-off-by: Libor Michalek <libor at topspin.com>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/core/fmr_pool.c	2005-04-01 10:08:58.240241456 -0800
+++ linux-export/drivers/infiniband/core/fmr_pool.c	2005-04-01 10:08:59.539959345 -0800
@@ -442,7 +442,7 @@
 		list_add(&fmr->list, &pool->free_list);
 		spin_unlock_irqrestore(&pool->pool_lock, flags);
 
-		printk(KERN_WARNING "fmr_map returns %d",
+		printk(KERN_WARNING "fmr_map returns %d\n",
 		       result);
 
 		return ERR_PTR(result);


From roland at topspin.com  Fri Apr  1 10:23:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:51 -0800
Subject: [openib-general] [PATCH][5/6] IB: Fix user MAD registrations with
	class 0
In-Reply-To: <2005411023.5oEZz0iawuKxVyay@topspin.com>
Message-ID: <2005411023.Wt2K1CXaZGIHp9sH@topspin.com>

Fix handling of MAD agent registrations with mgmt_class == 0.  In this
case ib_umad should pass a NULL registration request to the MAD core
rather than a request with mgmt_class set to 0.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/core/user_mad.c	2005-03-31 19:06:42.000000000 -0800
+++ linux-export/drivers/infiniband/core/user_mad.c	2005-04-01 10:09:01.250588043 -0800
@@ -389,15 +389,17 @@
 	goto out;
 
 found:
-	req.mgmt_class         = ureq.mgmt_class;
-	req.mgmt_class_version = ureq.mgmt_class_version;
-	memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask);
-	memcpy(req.oui,         ureq.oui,         sizeof req.oui);
+	if (ureq.mgmt_class) {
+		req.mgmt_class         = ureq.mgmt_class;
+		req.mgmt_class_version = ureq.mgmt_class_version;
+		memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask);
+		memcpy(req.oui,         ureq.oui,         sizeof req.oui);
+	}
 
 	agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num,
 				      ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI,
-				      &req, 0, send_handler, recv_handler,
-				      file);
+				      ureq.mgmt_class ? &req : NULL,
+				      0, send_handler, recv_handler, file);
 	if (IS_ERR(agent)) {
 		ret = PTR_ERR(agent);
 		goto out;


From roland at topspin.com  Fri Apr  1 10:23:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 10:23:51 -0800
Subject: [openib-general] [PATCH][6/6] IB: Remove incorrect comments
In-Reply-To: <2005411023.Wt2K1CXaZGIHp9sH@topspin.com>
Message-ID: <2005411023.sEUedyez566a4lDQ@topspin.com>

From: Hal Rosenstock <halr at voltaire.com>

Eliminate unneeded and misleading comments

Signed-off-by: Hal Rosenstock <halr at voltaire.com>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/core/agent.c	2005-03-31 19:06:48.000000000 -0800
+++ linux-export/drivers/infiniband/core/agent.c	2005-04-01 10:09:02.621290525 -0800
@@ -129,7 +129,6 @@
 		goto out;
 	agent_send_wr->mad = mad_priv;
 
-	/* PCI mapping */
 	gather_list.addr = dma_map_single(mad_agent->device->dma_device,
 					  &mad_priv->mad,
 					  sizeof(mad_priv->mad),
@@ -261,7 +260,6 @@
 	list_del(&agent_send_wr->send_list);
 	spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
 
-	/* Unmap PCI */
 	dma_unmap_single(mad_agent->device->dma_device,
 			 pci_unmap_addr(agent_send_wr, mapping),
 			 sizeof(agent_send_wr->mad->mad),
--- linux-export.orig/drivers/infiniband/core/mad.c	2005-04-01 10:08:56.473624910 -0800
+++ linux-export/drivers/infiniband/core/mad.c	2005-04-01 10:09:02.768258624 -0800
@@ -2283,7 +2283,6 @@
 		/* Remove from posted receive MAD list */
 		list_del(&mad_list->list);
 
-		/* Undo PCI mapping */
 		dma_unmap_single(qp_info->port_priv->device->dma_device,
 				 pci_unmap_addr(&recv->header, mapping),
 				 sizeof(struct ib_mad_private) -


From tduffy at sun.com  Fri Apr  1 10:24:13 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 01 Apr 2005 10:24:13 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha
	code WAS:
	[openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
Message-ID: <1112379853.18939.11.camel@duffman>

On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote:
> Author: roland
> Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005)
> New Revision: 2101
> 
> Added:
>    gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff
> Log:
> Add patch adding Sinai device IDs for 2.6.11 kernel.

Roland, please consider applying this for svn ease of use:

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- drivers/infiniband/hw/mthca/mthca_dev.h	(revision 2102)
+++ drivers/infiniband/hw/mthca/mthca_dev.h	(working copy)
@@ -49,6 +49,14 @@
 #define DRV_VERSION	"0.06-pre"
 #define DRV_RELDATE	"November 8, 2004"
 
+/* XXX remove once SINAI defines make it into kernel.org */
+#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD
+#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
+#endif
+#ifndef PCI_DEVICE_ID_MELLANOX_SINAI
+#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
+#endif
+
 enum {
 	MTHCA_FLAG_DDR_HIDDEN = 1 << 1,
 	MTHCA_FLAG_SRQ        = 1 << 2,


From peter at pantasys.com  Fri Apr  1 11:37:18 2005
From: peter at pantasys.com (Peter Buckingham)
Date: Fri, 01 Apr 2005 11:37:18 -0800
Subject: [openib-general] uverbs and OSU MPI/MPI in general?
Message-ID: <424DA2EE.7050802@pantasys.com>

Hi All,

How does gen2's uverbs compare to VAPI? Is it meant to be the same API? 
Should OSU's MPI run on top of this or is there some other MPI 
implementation that will be able to run 'natively' over IB?

thanks,

peter


From roland at topspin.com  Fri Apr  1 11:39:59 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 11:39:59 -0800
Subject: [openib-general] uverbs and OSU MPI/MPI in general?
In-Reply-To: <424DA2EE.7050802@pantasys.com> (Peter Buckingham's message of
	"Fri, 01 Apr 2005 11:37:18 -0800")
References: <424DA2EE.7050802@pantasys.com>
Message-ID: <523buajng0.fsf@topspin.com>

    Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant
    Peter> to be the same API? Should OSU's MPI run on top of this or
    Peter> is there some other MPI implementation that will be able to
    Peter> run 'natively' over IB?

The basic functionality is the same but the API is different.  For
example completion events are handled in a different way that allows
better performance.

None of the current MPI implementations that use IB will run
unmodified, but everyone (including OSU) is porting to the new API.

 - R.


From panda at cse.ohio-state.edu  Fri Apr  1 12:13:52 2005
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri, 1 Apr 2005 15:13:52 -0500 (EST)
Subject: [openib-general] uverbs and OSU MPI/MPI in general?
In-Reply-To: <523buajng0.fsf@topspin.com> from "Roland Dreier" at Apr 01,
	2005 11:39:59 AM
Message-ID: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu>

Peter, 

>     Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant
>     Peter> to be the same API? Should OSU's MPI run on top of this or
>     Peter> is there some other MPI implementation that will be able to
>     Peter> run 'natively' over IB?
> 
> The basic functionality is the same but the API is different.  For
> example completion events are handled in a different way that allows
> better performance.
> 
> None of the current MPI implementations that use IB will run
> unmodified, but everyone (including OSU) is porting to the new API.

We have already started working on porting OSU MPI to the Gen2 stack.

We plan to release MVAPICH 0.9.5 (on VAPI stack) during the next 1-2
weeks.  After that we will make a subsequent release of 0.9.5 on the
OpenIB Gen2 stack.

Hope this helps. 

Thanks, 

DK

>  - R.


From iod00d at hp.com  Fri Apr  1 10:43:46 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 10:43:46 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha
	code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <1112379853.18939.11.camel@duffman>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
Message-ID: <20050401184346.GD11094@esmail.cup.hp.com>

On Fri, Apr 01, 2005 at 10:24:13AM -0800, Tom Duffy wrote:
> On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote:
> > Author: roland
> > Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005)
> > New Revision: 2101
> > 
> > Added:
> >    gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff
> > Log:
> > Add patch adding Sinai device IDs for 2.6.11 kernel.
> 
> Roland, please consider applying this for svn ease of use:

No - I think Roland is doing the right thing with a separate patch.
I ran into the same issue since I'm still poking at 2.6.11.

By keeping "backport" patches separate, distros will have an
easier time figuring out which backport cruft they will
need for their release. And Roland won't have to remember to
clean it out later, and it won't cause pain for distros when they
grab a new version of openib but are still shipping the same base release.

grant


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][2/27] IB/mthca: fill in more device query
	fields
In-Reply-To: <2005411249.NCfupdZrkMmfcKnV@topspin.com>
Message-ID: <2005411249.WCbW5NdE7NBIkIcr@topspin.com>

Implement more of the device_query method in mthca.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-03-31 19:07:00.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:20.843436141 -0800
@@ -987,6 +987,8 @@
 	if (dev->hca_type == ARBEL_NATIVE) {
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET);
 		dev_lim->hca.arbel.resize_srq = field & 1;
+		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
+		dev_lim->max_sg = min_t(int, field, dev_lim->max_sg);
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET);
 		dev_lim->mtt_seg_sz = size;
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-03-31 19:07:00.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:20.839437009 -0800
@@ -52,6 +52,8 @@
 	if (!in_mad || !out_mad)
 		goto out;
 
+	memset(props, 0, sizeof *props);
+
 	props->fw_ver              = mdev->fw_ver;
 
 	memset(in_mad, 0, sizeof *in_mad);
@@ -71,14 +73,26 @@
 		goto out;
 	}
 
-	props->device_cap_flags = mdev->device_cap_flags;
-	props->vendor_id        = be32_to_cpup((u32 *) (out_mad->data + 36)) &
+	props->device_cap_flags    = mdev->device_cap_flags;
+	props->vendor_id           = be32_to_cpup((u32 *) (out_mad->data + 36)) &
 		0xffffff;
-	props->vendor_part_id   = be16_to_cpup((u16 *) (out_mad->data + 30));
-	props->hw_ver           = be16_to_cpup((u16 *) (out_mad->data + 32));
+	props->vendor_part_id      = be16_to_cpup((u16 *) (out_mad->data + 30));
+	props->hw_ver              = be16_to_cpup((u16 *) (out_mad->data + 32));
 	memcpy(&props->sys_image_guid, out_mad->data +  4, 8);
 	memcpy(&props->node_guid,      out_mad->data + 12, 8);
 
+	props->max_mr_size         = ~0ull;
+	props->max_qp              = mdev->limits.num_qps - mdev->limits.reserved_qps;
+	props->max_qp_wr           = 0xffff;
+	props->max_sge             = mdev->limits.max_sg;
+	props->max_cq              = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
+	props->max_cqe             = 0xffff;
+	props->max_mr              = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
+	props->max_pd              = mdev->limits.num_pds - mdev->limits.reserved_pds;
+	props->max_qp_rd_atom      = 1 << mdev->qp_table.rdb_shift;
+	props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift;
+	props->local_ca_ack_delay  = mdev->limits.local_ca_ack_delay;
+
 	err = 0;
  out:
 	kfree(in_mad);


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][1/27] IB/mthca: map MPT/MTT context in
	mem-free mode
Message-ID: <2005411249.NCfupdZrkMmfcKnV@topspin.com>

In mem-free mode, when allocating memory regions, make sure that the
HCA has context memory mapped to cover the virtual space used for the
MPT and MTTs being used.

Signed-off-by: Roland Dreier <roland at topspin.com>
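
The range helpers follow a common acquire-with-rollback pattern.  A user-space sketch of the idea, with a hypothetical reference-counted chunk table standing in for the ICM table (chunk_get() can be made to fail on a chosen chunk to exercise the unwind path):

```c
#define NCHUNKS 8

/* Hypothetical table: one reference count per chunk. */
struct chunk_table {
	int refcount[NCHUNKS];
	int fail_at;		/* chunk index that fails to map, or -1 */
};

static int chunk_get(struct chunk_table *t, int i)
{
	if (i == t->fail_at)
		return -1;	/* simulate an allocation failure */
	++t->refcount[i];
	return 0;
}

static void chunk_put(struct chunk_table *t, int i)
{
	--t->refcount[i];
}

/* Take a reference on chunks start..end; undo everything on failure. */
static int chunk_get_range(struct chunk_table *t, int start, int end)
{
	int i, err;

	for (i = start; i <= end; ++i) {
		err = chunk_get(t, i);
		if (err)
			goto fail;
	}
	return 0;

fail:
	/* Drop only the references taken before the failing chunk. */
	while (i > start)
		chunk_put(t, --i);
	return err;
}
```

On failure the caller sees the table exactly as it was before the call, so it needs no cleanup of its own.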


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-03-31 19:06:51.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:19.884644268 -0800
@@ -390,7 +390,7 @@
 	}
 
 	mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base,
-							 init_hca->mtt_seg_sz,
+							 dev_lim->mtt_seg_sz,
 							 mdev->limits.num_mtt_segs,
 							 mdev->limits.reserved_mtts, 1);
 	if (!mdev->mr_table.mtt_table) {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-03-31 19:06:42.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:19.911638409 -0800
@@ -192,6 +192,38 @@
 	up(&table->mutex);
 }
 
+int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table,
+			  int start, int end)
+{
+	int inc = MTHCA_TABLE_CHUNK_SIZE / table->obj_size;
+	int i, err;
+
+	for (i = start; i <= end; i += inc) {
+		err = mthca_table_get(dev, table, i);
+		if (err)
+			goto fail;
+	}
+
+	return 0;
+
+fail:
+	while (i > start) {
+		i -= inc;
+		mthca_table_put(dev, table, i);
+	}
+
+	return err;
+}
+
+void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table,
+			   int start, int end)
+{
+	int i;
+
+	for (i = start; i <= end; i += MTHCA_TABLE_CHUNK_SIZE / table->obj_size)
+		mthca_table_put(dev, table, i);
+}
+
 struct mthca_icm_table *mthca_alloc_icm_table(struct mthca_dev *dev,
 					      u64 virt, int obj_size,
 					      int nobj, int reserved,
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-03-31 19:06:56.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-04-01 12:38:19.895641881 -0800
@@ -85,6 +85,10 @@
 void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table);
 int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj);
 void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj);
+int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table,
+			  int start, int end);
+void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table,
+			   int start, int end);
 
 static inline void mthca_icm_first(struct mthca_icm *icm,
 				   struct mthca_icm_iter *iter)
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-03-31 19:07:06.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:19.903640145 -0800
@@ -38,6 +38,7 @@
 
 #include "mthca_dev.h"
 #include "mthca_cmd.h"
+#include "mthca_memfree.h"
 
 /*
  * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits.
@@ -71,7 +72,7 @@
  * through the bitmaps)
  */
 
-static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order)
+static u32 __mthca_alloc_mtt(struct mthca_dev *dev, int order)
 {
 	int o;
 	int m;
@@ -105,7 +106,7 @@
 	return seg;
 }
 
-static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order)
+static void __mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order)
 {
 	seg >>= order;
 
@@ -122,6 +123,32 @@
 	spin_unlock(&dev->mr_table.mpt_alloc.lock);
 }
 
+static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order)
+{
+	u32 seg = __mthca_alloc_mtt(dev, order);
+
+	if (seg == -1)
+		return -1;
+
+	if (dev->hca_type == ARBEL_NATIVE)
+		if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg,
+					  seg + (1 << order) - 1)) {
+			__mthca_free_mtt(dev, seg, order);
+			seg = -1;
+		}
+
+	return seg;
+}
+
+static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order)
+{
+	__mthca_free_mtt(dev, seg, order);
+
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_table_put_range(dev, dev->mr_table.mtt_table, seg,
+				      seg + (1 << order) - 1);
+}
+
 static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind)
 {
 	if (dev->hca_type == ARBEL_NATIVE)
@@ -141,7 +168,7 @@
 int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd,
 			   u32 access, struct mthca_mr *mr)
 {
-	void *mailbox;
+	void *mailbox = NULL;
 	struct mthca_mpt_entry *mpt_entry;
 	u32 key;
 	int err;
@@ -155,11 +182,17 @@
 		return -ENOMEM;
 	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
 
+	if (dev->hca_type == ARBEL_NATIVE) {
+		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
+		if (err)
+			goto err_out_mpt_free;
+	}
+
 	mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA,
 			  GFP_KERNEL);
 	if (!mailbox) {
-		mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
-		return -ENOMEM;
+		err = -ENOMEM;
+		goto err_out_table;
 	}
 	mpt_entry = MAILBOX_ALIGN(mailbox);
 
@@ -180,16 +213,27 @@
 	err = mthca_SW2HW_MPT(dev, mpt_entry,
 			      key & (dev->limits.num_mpts - 1),
 			      &status);
-	if (err)
+	if (err) {
 		mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err);
-	else if (status) {
+		goto err_out_table;
+	} else if (status) {
 		mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n",
 			   status);
 		err = -EINVAL;
+		goto err_out_table;
 	}
 
 	kfree(mailbox);
 	return err;
+
+err_out_table:
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_table_put(dev, dev->mr_table.mpt_table, key);
+
+err_out_mpt_free:
+	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
+	kfree(mailbox);
+	return err;
 }
 
 int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd,
@@ -213,6 +257,12 @@
 		return -ENOMEM;
 	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
 
+	if (dev->hca_type == ARBEL_NATIVE) {
+		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
+		if (err)
+			goto err_out_mpt_free;
+	}
+
 	for (i = dev->limits.mtt_seg_size / 8, mr->order = 0;
 	     i < list_len;
 	     i <<= 1, ++mr->order)
@@ -220,7 +270,7 @@
 
 	mr->first_seg = mthca_alloc_mtt(dev, mr->order);
 	if (mr->first_seg == -1)
-		goto err_out_mpt_free;
+		goto err_out_table;
 
 	/*
 	 * If list_len is odd, we add one more dummy entry for
@@ -307,13 +357,17 @@
 	kfree(mailbox);
 	return err;
 
- err_out_mailbox_free:
+err_out_mailbox_free:
 	kfree(mailbox);
 
- err_out_free_mtt:
+err_out_free_mtt:
 	mthca_free_mtt(dev, mr->first_seg, mr->order);
 
- err_out_mpt_free:
+err_out_table:
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_table_put(dev, dev->mr_table.mpt_table, key);
+
+err_out_mpt_free:
 	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
 	return err;
 }
@@ -338,6 +392,9 @@
 	if (mr->order >= 0)
 		mthca_free_mtt(dev, mr->first_seg, mr->order);
 
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_table_put(dev, dev->mr_table.mpt_table,
+				key_to_hw_index(dev, mr->ibmr.lkey));
 	mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey));
 }
 

From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][3/27] IB/mthca: fix calculation of RDB
	shift
In-Reply-To: <2005411249.WCbW5NdE7NBIkIcr@topspin.com>
Message-ID: <2005411249.ETBNcLeftemLukfd@topspin.com>

Fix calculation of rdb_shift by using original number of QPs, not
their slot in profile[] (which will be rearranged when we sort it).
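
For illustration only, here is a minimal user-space sketch of the shift
search that the hunk below performs.  The helper name rdb_shift_for()
and its parameters are hypothetical and not part of the driver; the
point is that the QP count must be passed in directly rather than read
back from a profile[] slot that may have moved.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): smallest shift such that
 * (num_qp << shift) >= num_rdb, mirroring the for loop in the hunk.
 */
static unsigned int rdb_shift_for(uint32_t num_qp, uint32_t num_rdb)
{
	unsigned int shift = 0;

	/* Guard added for the sketch: a zero QP count would never terminate. */
	if (num_qp == 0)
		return 0;

	while (((uint64_t) num_qp << shift) < num_rdb)
		++shift;

	return shift;
}
```

With 65536 QPs and 262144 RDB entries this yields a shift of 2.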

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c	2005-03-31 19:07:14.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:21.237350633 -0800
@@ -208,8 +208,7 @@
 			break;
 		case MTHCA_RES_RDB:
 			for (dev->qp_table.rdb_shift = 0;
-			     profile[MTHCA_RES_QP].num << dev->qp_table.rdb_shift <
-				     profile[i].num;
+			     request->num_qp << dev->qp_table.rdb_shift < profile[i].num;
 			     ++dev->qp_table.rdb_shift)
 				; /* nothing */
 			dev->qp_table.rdb_base    = (u32) profile[i].start;


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][4/27] IB/mthca: fix posting sends with
	immediate data
In-Reply-To: <2005411249.ETBNcLeftemLukfd@topspin.com>
Message-ID: <2005411249.dKg4ijljsqXo1Rt6@topspin.com>

When posting a work request with immediate data, put the immediate
data in the immediate data field of the hardware's work request
(rather than overwriting the flags field).

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-03-31 19:06:41.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:21.580276194 -0800
@@ -1465,7 +1465,7 @@
 			cpu_to_be32(1);
 		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
 		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
-			((struct mthca_next_seg *) wqe)->flags = wr->imm_data;
+			((struct mthca_next_seg *) wqe)->imm = wr->imm_data;
 
 		wqe += sizeof (struct mthca_next_seg);
 		size = sizeof (struct mthca_next_seg) / 16;
@@ -1769,7 +1769,7 @@
 			cpu_to_be32(1);
 		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
 		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
-			((struct mthca_next_seg *) wqe)->flags = wr->imm_data;
+			((struct mthca_next_seg *) wqe)->imm = wr->imm_data;
 
 		wqe += sizeof (struct mthca_next_seg);
 		size = sizeof (struct mthca_next_seg) / 16;


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][6/27] IB/mthca: allocate correct number of
	doorbell pages
In-Reply-To: <2005411249.cEJmE9mY2eziJTR6@topspin.com>
Message-ID: <2005411249.VaroeECWUvqcGQCD@topspin.com>

Doorbell record pages are allocated in HCA page size chunks (always
4096 bytes), so we need to divide by 4096 and not PAGE_SIZE when
figuring out how many pages we'll need space for.
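
A small sketch of the arithmetic, assuming only what the text above
states (HCA pages are always 4096 bytes).  HCA_PAGE_SIZE and
db_npages() are names invented for this example.

```c
#include <stddef.h>

/* Assumption for illustration: the fixed HCA page size named above. */
#define HCA_PAGE_SIZE 4096

/* Hypothetical helper: doorbell record pages needed for a UAR context area. */
static size_t db_npages(size_t uarc_size)
{
	return uarc_size / HCA_PAGE_SIZE;
}
```

On a kernel built with 64 KB pages, dividing by PAGE_SIZE instead would
undercount by a factor of 16, which is the case the patch fixes.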

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:19.911638409 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:22.274125578 -0800
@@ -446,7 +446,7 @@
 
 	init_MUTEX(&dev->db_tab->mutex);
 
-	dev->db_tab->npages     = dev->uar_table.uarc_size / PAGE_SIZE;
+	dev->db_tab->npages     = dev->uar_table.uarc_size / 4096;
 	dev->db_tab->max_group1 = 0;
 	dev->db_tab->min_group2 = dev->db_tab->npages - 1;
 

From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][8/27] IB/mthca: fix MR allocation error
	path
In-Reply-To: <2005411249.i5VdQJiPqpmwTj3T@topspin.com>
Message-ID: <2005411249.mKyALgAB0GbtFnjH@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Fix error handling in MR allocation for mem-free mode:
mthca_free must get an MR index, not a key.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:19.903640145 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:22.968974746 -0800
@@ -231,7 +231,7 @@
 		mthca_table_put(dev, dev->mr_table.mpt_table, key);
 
 err_out_mpt_free:
-	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
+	mthca_free(&dev->mr_table.mpt_alloc, key);
 	kfree(mailbox);
 	return err;
 }
@@ -368,7 +368,7 @@
 		mthca_table_put(dev, dev->mr_table.mpt_table, key);
 
 err_out_mpt_free:
-	mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey);
+	mthca_free(&dev->mr_table.mpt_alloc, key);
 	return err;
 }
 

From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][5/27] IB/mthca: allow unaligned memory
	regions
In-Reply-To: <2005411249.dKg4ijljsqXo1Rt6@topspin.com>
Message-ID: <2005411249.cEJmE9mY2eziJTR6@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

The first buffer of a memory region is not required to be
page-aligned, so don't return an error if it's not.
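
The resulting rule can be sketched in user space as follows.  The
struct phys_buf type, SKETCH_PAGE_SIZE and check_phys_bufs() are
assumptions made for this example; the second check is taken from the
unchanged context lines of the hunk below.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size for this sketch */

struct phys_buf {
	uint64_t addr;
	uint64_t size;
};

/* Returns 0 if the buffer list is acceptable, -1 otherwise. */
static int check_phys_bufs(const struct phys_buf *list, size_t n)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		/* Every buffer except the first must start on a page boundary. */
		if (i != 0 && (list[i].addr & (SKETCH_PAGE_SIZE - 1)))
			return -1;
		/* Interior buffers must also be a whole number of pages long. */
		if (i != 0 && i != n - 1 &&
		    (list[i].size & (SKETCH_PAGE_SIZE - 1)))
			return -1;
	}

	return 0;
}
```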

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:20.839437009 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:21.926201103 -0800
@@ -494,7 +494,7 @@
 	mask = 0;
 	total_size = 0;
 	for (i = 0; i < num_phys_buf; ++i) {
-		if (buffer_list[i].addr & ~PAGE_MASK)
+		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
 			return ERR_PTR(-EINVAL);
 		if (i != 0 && i != num_phys_buf - 1 &&
 		    (buffer_list[i].size & ~PAGE_MASK))


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][9/27] IB/mthca: release mutex on doorbell
	alloc error path
In-Reply-To: <2005411249.mKyALgAB0GbtFnjH@topspin.com>
Message-ID: <2005411249.XnosdnfHawyDkITW@topspin.com>

Release mutex on error return path from mthca_alloc_db().

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:22.274125578 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:23.500859288 -0800
@@ -337,7 +337,8 @@
 		break;
 
 	default:
-		return -1;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	for (i = start; i != end; i += dir)


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][7/27] IB/mthca: clean up mthca_dereg_mr()
In-Reply-To: <2005411249.VaroeECWUvqcGQCD@topspin.com>
Message-ID: <2005411249.i5VdQJiPqpmwTj3T@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

It's cleaner to kfree mthca_mr, and not rely on the fact
that ib_mr is the first field in mthca_mr.
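
To show why the layout assumption matters, here is a sketch with
made-up types (struct base_mr, struct wrapped_mr and to_wrapped() are
hypothetical, and it assumes to_mmr() is a container_of-style
conversion).  The embedded member is deliberately not the first field.

```c
#include <stddef.h>

struct base_mr {
	int lkey;
};

struct wrapped_mr {
	int            order;
	struct base_mr ibmr;	/* deliberately not the first member */
};

/* container_of-style conversion from the embedded member to its owner. */
static struct wrapped_mr *to_wrapped(struct base_mr *mr)
{
	return (struct wrapped_mr *)
		((char *) mr - offsetof(struct wrapped_mr, ibmr));
}
```

Here the pointer to ibmr differs from the pointer that was allocated,
so only the converted pointer is safe to hand to the free routine.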

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:21.926201103 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:22.630048317 -0800
@@ -568,8 +568,9 @@
 
 static int mthca_dereg_mr(struct ib_mr *mr)
 {
-	mthca_free_mr(to_mdev(mr->device), to_mmr(mr));
-	kfree(mr);
+	struct mthca_mr *mmr = to_mmr(mr);
+	mthca_free_mr(to_mdev(mr->device), mmr);
+	kfree(mmr);
 	return 0;
 }
 

From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][11/27] IB/mthca: only free doorbell
	records in mem-free mode
In-Reply-To: <2005411249.tAq0qtfjGbz3oHeg@topspin.com>
Message-ID: <2005411249.0RpxZQTVnbUL56cR@topspin.com>

On the error path, only free doorbell records if we're in mem-free mode.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2005-03-31 19:06:42.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-01 12:38:24.207705852 -0800
@@ -817,10 +817,12 @@
 err_out_mailbox:
 	kfree(mailbox);
 
-	mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
 
 err_out_ci:
-	mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
 
 err_out_icm:
 	mthca_table_put(dev, dev->cq_table.table, cq->cqn);


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][10/27] IB/mthca: print assigned IRQ when
	interrupt test fails
In-Reply-To: <2005411249.XnosdnfHawyDkITW@topspin.com>
Message-ID: <2005411249.tAq0qtfjGbz3oHeg@topspin.com>

Print IRQ number when NOP command interrupt test fails to help debugging.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:19.884644268 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:23.852782896 -0800
@@ -672,7 +672,10 @@
 
 	err = mthca_NOP(dev, &status);
 	if (err || status) {
-		mthca_err(dev, "NOP command failed to generate interrupt, aborting.\n");
+		mthca_err(dev, "NOP command failed to generate interrupt (IRQ %d), aborting.\n",
+			  dev->mthca_flags & MTHCA_FLAG_MSI_X ?
+			  dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector :
+			  dev->pdev->irq);
 		if (dev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X))
 			mthca_err(dev, "Try again with MSI/MSI-X disabled.\n");
 		else


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][13/27] IB/mthca: implement RDMA/atomic
	operations for mem-free mode
In-Reply-To: <2005411249.mBxBGEwdeob5Gy84@topspin.com>
Message-ID: <2005411249.0FJpqa4lTtcUTWSU@topspin.com>

Add code to support RDMA and atomic send work requests in mem-free mode.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:21.580276194 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:25.023528759 -0800
@@ -1775,6 +1775,53 @@
 		size = sizeof (struct mthca_next_seg) / 16;
 
 		switch (qp->transport) {
+		case RC:
+			switch (wr->opcode) {
+			case IB_WR_ATOMIC_CMP_AND_SWP:
+			case IB_WR_ATOMIC_FETCH_AND_ADD:
+				((struct mthca_raddr_seg *) wqe)->raddr =
+					cpu_to_be64(wr->wr.atomic.remote_addr);
+				((struct mthca_raddr_seg *) wqe)->rkey =
+					cpu_to_be32(wr->wr.atomic.rkey);
+				((struct mthca_raddr_seg *) wqe)->reserved = 0;
+
+				wqe += sizeof (struct mthca_raddr_seg);
+
+				if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) {
+					((struct mthca_atomic_seg *) wqe)->swap_add =
+						cpu_to_be64(wr->wr.atomic.swap);
+					((struct mthca_atomic_seg *) wqe)->compare =
+						cpu_to_be64(wr->wr.atomic.compare_add);
+				} else {
+					((struct mthca_atomic_seg *) wqe)->swap_add =
+						cpu_to_be64(wr->wr.atomic.compare_add);
+					((struct mthca_atomic_seg *) wqe)->compare = 0;
+				}
+
+				wqe += sizeof (struct mthca_atomic_seg);
+				size += sizeof (struct mthca_raddr_seg) / 16 +
+					sizeof (struct mthca_atomic_seg) / 16;
+				break;
+
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_WRITE_WITH_IMM:
+			case IB_WR_RDMA_READ:
+				((struct mthca_raddr_seg *) wqe)->raddr =
+					cpu_to_be64(wr->wr.rdma.remote_addr);
+				((struct mthca_raddr_seg *) wqe)->rkey =
+					cpu_to_be32(wr->wr.rdma.rkey);
+				((struct mthca_raddr_seg *) wqe)->reserved = 0;
+				wqe += sizeof (struct mthca_raddr_seg);
+				size += sizeof (struct mthca_raddr_seg) / 16;
+				break;
+
+			default:
+				/* No extra segments required for sends */
+				break;
+			}
+
+			break;
+
 		case UD:
 			memcpy(((struct mthca_arbel_ud_seg *) wqe)->av,
 			       to_mah(wr->wr.ud.ah)->av, MTHCA_AV_SIZE);


From roland at topspin.com  Fri Apr  1 12:49:52 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:52 -0800
Subject: [openib-general] [PATCH][12/27] IB/mthca: fix format of CQ number
	for CQ events
In-Reply-To: <2005411249.0RpxZQTVnbUL56cR@topspin.com>
Message-ID: <2005411249.mBxBGEwdeob5Gy84@topspin.com>

CQ numbers are only 24 bits, so only print 6 hex digits and mask off
reserved part when reporting a CQ event.
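
As a tiny sketch of the masking (cqn_from_event() is a hypothetical
name, and the input is assumed to be already converted to CPU byte
order):

```c
#include <stdint.h>

/* Keep the low 24 bits that hold the CQ number; drop the reserved byte. */
static uint32_t cqn_from_event(uint32_t raw)
{
	return raw & 0xffffff;
}
```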

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2005-03-31 19:06:55.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c	2005-04-01 12:38:24.575625986 -0800
@@ -344,10 +344,10 @@
 			break;
 
 		case MTHCA_EVENT_TYPE_CQ_ERROR:
-			mthca_warn(dev, "CQ %s on CQN %08x\n",
+			mthca_warn(dev, "CQ %s on CQN %06x\n",
 				   eqe->event.cq_err.syndrome == 1 ?
 				   "overrun" : "access violation",
-				   be32_to_cpu(eqe->event.cq_err.cqn));
+				   be32_to_cpu(eqe->event.cq_err.cqn) & 0xffffff);
 			break;
 
 		case MTHCA_EVENT_TYPE_EQ_OVERFLOW:


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][14/27] IB/mthca: fix MTT allocation in
	mem-free mode
In-Reply-To: <2005411249.0FJpqa4lTtcUTWSU@topspin.com>
Message-ID: <2005411249.E7CWkenJFFkWDs2q@topspin.com>

Fix bug in MTT allocation in mem-free mode.

I misunderstood the MTT size value returned by the firmware -- it is
really the size of a single MTT entry, since mem-free mode does not
segment the MTT as the original firmware did.  As a result, our MTT
addresses ended up being off by a factor of 8, so our MTT allocations
might overlap, and we could overwrite and corrupt earlier memory
regions when writing new MTT entries.

We fix this by always using our 64-byte MTT segment size.  This allows
some simplification of the code as well, since there's no reason to
put the MTT segment size in a variable -- we can always use our enum
value directly.
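
The address arithmetic can be sketched as below.  The 8-byte entry size
is inferred from the "factor of 8" above together with the 64-byte
segment size; the enum names and mtt_addr() are invented for this
example.

```c
#include <stdint.h>

enum {
	SKETCH_MTT_SEG_SIZE   = 64,	/* segment size the driver allocates in */
	SKETCH_MTT_ENTRY_SIZE = 8	/* inferred size of one MTT entry */
};

/* Byte address of the first MTT entry of segment first_seg. */
static uint64_t mtt_addr(uint64_t mtt_base, uint32_t first_seg,
			 uint32_t seg_size)
{
	return mtt_base + (uint64_t) first_seg * seg_size;
}
```

With seg_size mistakenly set to the entry size, segments 0 and 1 start
only 8 bytes apart even though each holds 64 bytes of entries, so
filling segment 0 runs into segment 1.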

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:20.843436141 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:25.574409178 -0800
@@ -990,7 +990,6 @@
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
 		dev_lim->max_sg = min_t(int, field, dev_lim->max_sg);
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET);
-		dev_lim->mtt_seg_sz = size;
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET);
 		dev_lim->mpt_entry_sz = size;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET);
@@ -1018,7 +1017,6 @@
 	} else {
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET);
 		dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f);
-		dev_lim->mtt_seg_sz   = MTHCA_MTT_SEG_SIZE;
 		dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE;
 	}
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h	2005-03-31 19:06:42.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h	2005-04-01 12:38:25.578408310 -0800
@@ -162,7 +162,6 @@
 	int cqc_entry_sz;
 	int srq_entry_sz;
 	int uar_scratch_entry_sz;
-	int mtt_seg_sz;
 	int mpt_entry_sz;
 	union {
 		struct {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-03-31 19:06:41.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:25.561412000 -0800
@@ -121,7 +121,6 @@
 	int      reserved_eqs;
 	int      num_mpts;
 	int      num_mtt_segs;
-	int      mtt_seg_size;
 	int      reserved_mtts;
 	int      reserved_mrws;
 	int      reserved_uars;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:23.852782896 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:25.566410914 -0800
@@ -390,7 +390,7 @@
 	}
 
 	mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base,
-							 dev_lim->mtt_seg_sz,
+							 MTHCA_MTT_SEG_SIZE,
 							 mdev->limits.num_mtt_segs,
 							 mdev->limits.reserved_mtts, 1);
 	if (!mdev->mr_table.mtt_table) {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:22.968974746 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:25.582407442 -0800
@@ -263,7 +263,7 @@
 			goto err_out_mpt_free;
 	}
 
-	for (i = dev->limits.mtt_seg_size / 8, mr->order = 0;
+	for (i = MTHCA_MTT_SEG_SIZE / 8, mr->order = 0;
 	     i < list_len;
 	     i <<= 1, ++mr->order)
 		; /* nothing */
@@ -286,7 +286,7 @@
 	mtt_entry = MAILBOX_ALIGN(mailbox);
 
 	mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base +
-				   mr->first_seg * dev->limits.mtt_seg_size);
+				   mr->first_seg * MTHCA_MTT_SEG_SIZE);
 	mtt_entry[1] = 0;
 	for (i = 0; i < list_len; ++i)
 		mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] |
@@ -330,7 +330,7 @@
 	memset(&mpt_entry->lkey, 0,
 	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey));
 	mpt_entry->mtt_seg   = cpu_to_be64(dev->mr_table.mtt_base +
-					   mr->first_seg * dev->limits.mtt_seg_size);
+					   mr->first_seg * MTHCA_MTT_SEG_SIZE);
 
 	if (0) {
 		mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:21.237350633 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:25.570410046 -0800
@@ -95,7 +95,7 @@
 	profile[MTHCA_RES_RDB].size  = MTHCA_RDB_ENTRY_SIZE;
 	profile[MTHCA_RES_MCG].size  = MTHCA_MGM_ENTRY_SIZE;
 	profile[MTHCA_RES_MPT].size  = dev_lim->mpt_entry_sz;
-	profile[MTHCA_RES_MTT].size  = dev_lim->mtt_seg_sz;
+	profile[MTHCA_RES_MTT].size  = MTHCA_MTT_SEG_SIZE;
 	profile[MTHCA_RES_UAR].size  = dev_lim->uar_scratch_entry_sz;
 	profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE;
 	profile[MTHCA_RES_UARC].size = request->uarc_size;
@@ -229,10 +229,9 @@
 			break;
 		case MTHCA_RES_MTT:
 			dev->limits.num_mtt_segs = profile[i].num;
-			dev->limits.mtt_seg_size = dev_lim->mtt_seg_sz;
 			dev->mr_table.mtt_base   = profile[i].start;
 			init_hca->mtt_base       = profile[i].start;
-			init_hca->mtt_seg_sz     = ffs(dev_lim->mtt_seg_sz) - 7;
+			init_hca->mtt_seg_sz     = ffs(MTHCA_MTT_SEG_SIZE) - 7;
 			break;
 		case MTHCA_RES_UAR:
 			dev->limits.num_uars       = profile[i].num;


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][15/27] IB/mthca: fill in opcode field for
	send completions
In-Reply-To: <2005411249.E7CWkenJFFkWDs2q@topspin.com>
Message-ID: <2005411249.qipkNNwvZYuE2KBu@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Fill in missing fields in send completions.

Signed-off-by: Itamar Rabenstein <itamar at mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-01 12:38:24.207705852 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-01 12:38:26.177278312 -0800
@@ -473,7 +473,41 @@
 	}
 
 	if (is_send) {
-		entry->opcode = IB_WC_SEND; /* XXX */
+		entry->wc_flags = 0;
+		switch (cqe->opcode) {
+		case MTHCA_OPCODE_RDMA_WRITE:
+			entry->opcode    = IB_WC_RDMA_WRITE;
+			break;
+		case MTHCA_OPCODE_RDMA_WRITE_IMM:
+			entry->opcode    = IB_WC_RDMA_WRITE;
+			entry->wc_flags |= IB_WC_WITH_IMM;
+			break;
+		case MTHCA_OPCODE_SEND:
+			entry->opcode    = IB_WC_SEND;
+			break;
+		case MTHCA_OPCODE_SEND_IMM:
+			entry->opcode    = IB_WC_SEND;
+			entry->wc_flags |= IB_WC_WITH_IMM;
+			break;
+		case MTHCA_OPCODE_RDMA_READ:
+			entry->opcode    = IB_WC_RDMA_READ;
+			entry->byte_len  = be32_to_cpu(cqe->byte_cnt);
+			break;
+		case MTHCA_OPCODE_ATOMIC_CS:
+			entry->opcode    = IB_WC_COMP_SWAP;
+			entry->byte_len  = be32_to_cpu(cqe->byte_cnt);
+			break;
+		case MTHCA_OPCODE_ATOMIC_FA:
+			entry->opcode    = IB_WC_FETCH_ADD;
+			entry->byte_len  = be32_to_cpu(cqe->byte_cnt);
+			break;
+		case MTHCA_OPCODE_BIND_MW:
+			entry->opcode    = IB_WC_BIND_MW;
+			break;
+		default:
+			entry->opcode    = MTHCA_OPCODE_INVALID;
+			break;
+		}
 	} else {
 		entry->byte_len = be32_to_cpu(cqe->byte_cnt);
 		switch (cqe->opcode & 0x1f) {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:25.561412000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:26.173279180 -0800
@@ -88,6 +88,19 @@
 	MTHCA_NUM_EQ
 };
 
+enum {
+	MTHCA_OPCODE_NOP            = 0x00,
+	MTHCA_OPCODE_RDMA_WRITE     = 0x08,
+	MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09,
+	MTHCA_OPCODE_SEND           = 0x0a,
+	MTHCA_OPCODE_SEND_IMM       = 0x0b,
+	MTHCA_OPCODE_RDMA_READ      = 0x10,
+	MTHCA_OPCODE_ATOMIC_CS      = 0x11,
+	MTHCA_OPCODE_ATOMIC_FA      = 0x12,
+	MTHCA_OPCODE_BIND_MW        = 0x18,
+	MTHCA_OPCODE_INVALID        = 0xff
+};
+
 struct mthca_cmd {
 	int                       use_events;
 	struct semaphore          hcr_sem;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:25.023528759 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:26.181277444 -0800
@@ -171,19 +171,6 @@
 };
 
 enum {
-	MTHCA_OPCODE_NOP            = 0x00,
-	MTHCA_OPCODE_RDMA_WRITE     = 0x08,
-	MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09,
-	MTHCA_OPCODE_SEND           = 0x0a,
-	MTHCA_OPCODE_SEND_IMM       = 0x0b,
-	MTHCA_OPCODE_RDMA_READ      = 0x10,
-	MTHCA_OPCODE_ATOMIC_CS      = 0x11,
-	MTHCA_OPCODE_ATOMIC_FA      = 0x12,
-	MTHCA_OPCODE_BIND_MW        = 0x18,
-	MTHCA_OPCODE_INVALID        = 0xff
-};
-
-enum {
 	MTHCA_NEXT_DBD       = 1 << 7,
 	MTHCA_NEXT_FENCE     = 1 << 6,
 	MTHCA_NEXT_CQ_UPDATE = 1 << 3,


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][16/27] IB/mthca: allow address handle
	creation in interrupt context
In-Reply-To: <2005411249.qipkNNwvZYuE2KBu@topspin.com>
Message-ID: <2005411249.gEJosMqrkm8KOH4C@topspin.com>

Make address handle verbs usable from interrupt context.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c	2005-03-31 19:07:01.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c	2005-04-01 12:38:26.648176093 -0800
@@ -63,7 +63,7 @@
 	ah->type = MTHCA_AH_PCI_POOL;
 
 	if (dev->hca_type == ARBEL_NATIVE) {
-		ah->av   = kmalloc(sizeof *ah->av, GFP_KERNEL);
+		ah->av   = kmalloc(sizeof *ah->av, GFP_ATOMIC);
 		if (!ah->av)
 			return -ENOMEM;
 
@@ -77,7 +77,7 @@
 		if (index == -1)
 			goto on_hca_fail;
 
-		av = kmalloc(sizeof *av, GFP_KERNEL);
+		av = kmalloc(sizeof *av, GFP_ATOMIC);
 		if (!av)
 			goto on_hca_fail;
 
@@ -89,7 +89,7 @@
 on_hca_fail:
 	if (ah->type == MTHCA_AH_PCI_POOL) {
 		ah->av = pci_pool_alloc(dev->av_table.pool,
-					SLAB_KERNEL, &ah->avdma);
+					SLAB_ATOMIC, &ah->avdma);
 		if (!ah->av)
 			return -ENOMEM;
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:22.630048317 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:26.644176961 -0800
@@ -315,7 +315,7 @@
 	int err;
 	struct mthca_ah *ah;
 
-	ah = kmalloc(sizeof *ah, GFP_KERNEL);
+	ah = kmalloc(sizeof *ah, GFP_ATOMIC);
 	if (!ah)
 		return ERR_PTR(-ENOMEM);
 

From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][17/27] IB/mthca: encapsulate MTT buddy
	allocator
In-Reply-To: <2005411249.gEJosMqrkm8KOH4C@topspin.com>
Message-ID: <2005411249.S2hhmQaEpM8vK71i@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Encapsulate the buddy allocator used for MTT segments.  This cleans up
the code and also gets us ready to add FMR support.
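
For readers unfamiliar with the scheme, here is a user-space sketch of
the same split-and-merge logic.  It is not the driver code: it uses
plain byte arrays instead of kernel bitmaps, has no locking, fixes the
maximum order at compile time, and every sketch_* name is hypothetical.
Callers are assumed to pass 0 <= order.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ORDER 4	/* 16 segments in total */

struct sketch_buddy {
	/* free_map[o][i] != 0 means block i of order o is free */
	unsigned char free_map[SKETCH_MAX_ORDER + 1][1 << SKETCH_MAX_ORDER];
};

static void sketch_buddy_init(struct sketch_buddy *b)
{
	memset(b, 0, sizeof *b);
	b->free_map[SKETCH_MAX_ORDER][0] = 1;	/* one block covers everything */
}

/* Returns the first segment of a free 2^order block, or (uint32_t) -1. */
static uint32_t sketch_buddy_alloc(struct sketch_buddy *b, int order)
{
	int o;
	uint32_t seg, m;

	for (o = order; o <= SKETCH_MAX_ORDER; ++o) {
		m = 1u << (SKETCH_MAX_ORDER - o);
		for (seg = 0; seg < m; ++seg)
			if (b->free_map[o][seg])
				goto found;
	}
	return (uint32_t) -1;

found:
	b->free_map[o][seg] = 0;
	/* Split down to the requested order, freeing the unused half. */
	while (o > order) {
		--o;
		seg <<= 1;
		b->free_map[o][seg ^ 1] = 1;
	}
	return seg << order;
}

static void sketch_buddy_free(struct sketch_buddy *b, uint32_t seg, int order)
{
	seg >>= order;
	/* Merge with the buddy for as long as it is also free. */
	while (order < SKETCH_MAX_ORDER && b->free_map[order][seg ^ 1]) {
		b->free_map[order][seg ^ 1] = 0;
		seg >>= 1;
		++order;
	}
	b->free_map[order][seg] = 1;
}
```

Allocating two order-0 blocks and one order-2 block from a fresh
allocator returns segments 0, 1 and 4; freeing all three merges
everything back into a single order-4 block.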

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:26.173279180 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:27.068084943 -0800
@@ -170,10 +170,15 @@
 	struct mthca_alloc alloc;
 };
 
+struct mthca_buddy {
+	unsigned long **bits;
+	int             max_order;
+	spinlock_t      lock;
+};
+
 struct mthca_mr_table {
 	struct mthca_alloc      mpt_alloc;
-	int                     max_mtt_order;
-	unsigned long         **mtt_buddy;
+	struct mthca_buddy	mtt_buddy;
 	u64                     mtt_base;
 	struct mthca_icm_table *mtt_table;
 	struct mthca_icm_table *mpt_table;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:25.582407442 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:27.075083423 -0800
@@ -72,60 +72,108 @@
  * through the bitmaps)
  */
 
-static u32 __mthca_alloc_mtt(struct mthca_dev *dev, int order)
+static u32 mthca_buddy_alloc(struct mthca_buddy *buddy, int order)
 {
 	int o;
 	int m;
 	u32 seg;
 
-	spin_lock(&dev->mr_table.mpt_alloc.lock);
+	spin_lock(&buddy->lock);
 
-	for (o = order; o <= dev->mr_table.max_mtt_order; ++o) {
-		m = 1 << (dev->mr_table.max_mtt_order - o);
-		seg = find_first_bit(dev->mr_table.mtt_buddy[o], m);
+	for (o = order; o <= buddy->max_order; ++o) {
+		m = 1 << (buddy->max_order - o);
+		seg = find_first_bit(buddy->bits[o], m);
 		if (seg < m)
 			goto found;
 	}
 
-	spin_unlock(&dev->mr_table.mpt_alloc.lock);
+	spin_unlock(&buddy->lock);
 	return -1;
 
  found:
-	clear_bit(seg, dev->mr_table.mtt_buddy[o]);
+	clear_bit(seg, buddy->bits[o]);
 
 	while (o > order) {
 		--o;
 		seg <<= 1;
-		set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]);
+		set_bit(seg ^ 1, buddy->bits[o]);
 	}
 
-	spin_unlock(&dev->mr_table.mpt_alloc.lock);
+	spin_unlock(&buddy->lock);
 
 	seg <<= order;
 
 	return seg;
 }
 
-static void __mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order)
+static void mthca_buddy_free(struct mthca_buddy *buddy, u32 seg, int order)
 {
 	seg >>= order;
 
-	spin_lock(&dev->mr_table.mpt_alloc.lock);
+	spin_lock(&buddy->lock);
 
-	while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) {
-		clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]);
+	while (test_bit(seg ^ 1, buddy->bits[order])) {
+		clear_bit(seg ^ 1, buddy->bits[order]);
 		seg >>= 1;
 		++order;
 	}
 
-	set_bit(seg, dev->mr_table.mtt_buddy[order]);
+	set_bit(seg, buddy->bits[order]);
 
-	spin_unlock(&dev->mr_table.mpt_alloc.lock);
+	spin_unlock(&buddy->lock);
 }
 
-static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order)
+static int __devinit mthca_buddy_init(struct mthca_buddy *buddy, int max_order)
 {
-	u32 seg = __mthca_alloc_mtt(dev, order);
+	int i, s;
+
+	buddy->max_order = max_order;
+	spin_lock_init(&buddy->lock);
+
+	buddy->bits = kmalloc((buddy->max_order + 1) * sizeof (long *),
+			      GFP_KERNEL);
+	if (!buddy->bits)
+		goto err_out;
+
+	memset(buddy->bits, 0, (buddy->max_order + 1) * sizeof (long *));
+
+	for (i = 0; i <= buddy->max_order; ++i) {
+		s = BITS_TO_LONGS(1 << (buddy->max_order - i));
+		buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL);
+		if (!buddy->bits[i])
+			goto err_out_free;
+		bitmap_zero(buddy->bits[i],
+			    1 << (buddy->max_order - i));
+	}
+
+	set_bit(0, buddy->bits[buddy->max_order]);
+
+	return 0;
+
+err_out_free:
+	for (i = 0; i <= buddy->max_order; ++i)
+		kfree(buddy->bits[i]);
+
+	kfree(buddy->bits);
+
+err_out:
+	return -ENOMEM;
+}
+
+static void __devexit mthca_buddy_cleanup(struct mthca_buddy *buddy)
+{
+	int i;
+
+	for (i = 0; i <= buddy->max_order; ++i)
+		kfree(buddy->bits[i]);
+
+	kfree(buddy->bits);
+}
+
+static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order,
+			   struct mthca_buddy *buddy)
+{
+	u32 seg = mthca_buddy_alloc(buddy, order);
 
 	if (seg == -1)
 		return -1;
@@ -133,16 +181,17 @@
 	if (dev->hca_type == ARBEL_NATIVE)
 		if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg,
 					  seg + (1 << order) - 1)) {
-			__mthca_free_mtt(dev, seg, order);
+			mthca_buddy_free(buddy, seg, order);
 			seg = -1;
 		}
 
 	return seg;
 }
 
-static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order)
+static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order,
+			   struct mthca_buddy *buddy)
 {
-	__mthca_free_mtt(dev, seg, order);
+	mthca_buddy_free(buddy, seg, order);
 
 	if (dev->hca_type == ARBEL_NATIVE)
 		mthca_table_put_range(dev, dev->mr_table.mtt_table, seg,
@@ -268,7 +317,8 @@
 	     i <<= 1, ++mr->order)
 		; /* nothing */
 
-	mr->first_seg = mthca_alloc_mtt(dev, mr->order);
+	mr->first_seg = mthca_alloc_mtt(dev, mr->order,
+				       	&dev->mr_table.mtt_buddy);
 	if (mr->first_seg == -1)
 		goto err_out_table;
 
@@ -361,7 +411,7 @@
 	kfree(mailbox);
 
 err_out_free_mtt:
-	mthca_free_mtt(dev, mr->first_seg, mr->order);
+	mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy);
 
 err_out_table:
 	if (dev->hca_type == ARBEL_NATIVE)
@@ -390,7 +440,7 @@
 			   status);
 
 	if (mr->order >= 0)
-		mthca_free_mtt(dev, mr->first_seg, mr->order);
+		mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy);
 
 	if (dev->hca_type == ARBEL_NATIVE)
 		mthca_table_put(dev, dev->mr_table.mpt_table,
@@ -401,7 +451,6 @@
 int __devinit mthca_init_mr_table(struct mthca_dev *dev)
 {
 	int err;
-	int i, s;
 
 	err = mthca_alloc_init(&dev->mr_table.mpt_alloc,
 			       dev->limits.num_mpts,
@@ -409,53 +458,24 @@
 	if (err)
 		return err;
 
-	err = -ENOMEM;
-
-	for (i = 1, dev->mr_table.max_mtt_order = 0;
-	     i < dev->limits.num_mtt_segs;
-	     i <<= 1, ++dev->mr_table.max_mtt_order)
-		; /* nothing */
-
-	dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) *
-					  sizeof (long *),
-					  GFP_KERNEL);
-	if (!dev->mr_table.mtt_buddy)
-		goto err_out;
-
-	for (i = 0; i <= dev->mr_table.max_mtt_order; ++i)
-		dev->mr_table.mtt_buddy[i] = NULL;
-
-	for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) {
-		s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i));
-		dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long),
-						     GFP_KERNEL);
-		if (!dev->mr_table.mtt_buddy[i])
-			goto err_out_free;
-		bitmap_zero(dev->mr_table.mtt_buddy[i],
-			    1 << (dev->mr_table.max_mtt_order - i));
-	}
-
-	set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]);
-
-	for (i = 0; i < dev->mr_table.max_mtt_order; ++i)
-		if (1 << i >= dev->limits.reserved_mtts)
-			break;
+	err = mthca_buddy_init(&dev->mr_table.mtt_buddy,
+			       fls(dev->limits.num_mtt_segs - 1));
+	if (err)
+		goto err_mtt_buddy;
 
-	if (i == dev->mr_table.max_mtt_order) {
-		mthca_err(dev, "MTT table of order %d is "
-			  "too small.\n", i);
-		goto err_out_free;
+	if (dev->limits.reserved_mtts) {
+		if (mthca_alloc_mtt(dev, fls(dev->limits.reserved_mtts - 1),
+				    &dev->mr_table.mtt_buddy) == -1) {
+			mthca_warn(dev, "MTT table of order %d is too small.\n",
+				  dev->mr_table.mtt_buddy.max_order);
+			err = -ENOMEM;
+			goto err_mtt_buddy;
+		}
 	}
 
-	(void) mthca_alloc_mtt(dev, i);
-
 	return 0;
 
- err_out_free:
-	for (i = 0; i <= dev->mr_table.max_mtt_order; ++i)
-		kfree(dev->mr_table.mtt_buddy[i]);
-
- err_out:
+err_mtt_buddy:
 	mthca_alloc_cleanup(&dev->mr_table.mpt_alloc);
 
 	return err;
@@ -463,11 +483,7 @@
 
 void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev)
 {
-	int i;
-
 	/* XXX check if any MRs are still allocated? */
-	for (i = 0; i <= dev->mr_table.max_mtt_order; ++i)
-		kfree(dev->mr_table.mtt_buddy[i]);
-	kfree(dev->mr_table.mtt_buddy);
+	mthca_buddy_cleanup(&dev->mr_table.mtt_buddy);
 	mthca_alloc_cleanup(&dev->mr_table.mpt_alloc);
 }


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][18/27] IB/mthca: add SYNC_TPT firmware
	command
In-Reply-To: <2005411249.S2hhmQaEpM8vK71i@topspin.com>
Message-ID: <2005411249.Wiedh3QohPRJi9Sp@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Add code for the SYNC_TPT firmware command, which will be used by the
FMR implementation.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:25.574409178 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:27.495992056 -0800
@@ -1404,6 +1404,11 @@
 	return err;
 }
 
+int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status)
+{
+	return mthca_cmd(dev, 0, 0, 0, CMD_SYNC_TPT, CMD_TIME_CLASS_B, status);
+}
+
 int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap,
 		 int eq_num, u8 *status)
 {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.h	2005-04-01 12:38:25.578408310 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.h	2005-04-01 12:38:27.500990971 -0800
@@ -276,6 +276,7 @@
 		    int mpt_index, u8 *status);
 int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry,
 		    int num_mtt, u8 *status);
+int mthca_SYNC_TPT(struct mthca_dev *dev, u8 *status);
 int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap,
 		 int eq_num, u8 *status);
 int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context,


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][19/27] IB/mthca: add mthca_write64_raw()
	for writing to MTT table directly
In-Reply-To: <2005411249.Wiedh3QohPRJi9Sp@topspin.com>
Message-ID: <2005411249.t0DdCtarOabubO3D@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Add the mthca_write64_raw() function, which will be used to write FMR
entries that are in ioremapped PCI memory.
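
As a rough user-space illustration of the 32-bit variant (not the driver
code itself), the sketch below stores a 64-bit value that is already in
device byte order as two 32-bit halves, lower address first.  The
model_* names are hypothetical stand-ins for the kernel's raw MMIO
accessors, and plain memory is assumed in place of ioremapped memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a raw 32-bit MMIO store (no byte swapping). */
static void model_raw_writel(uint32_t val, uint32_t *dest)
{
	*(volatile uint32_t *) dest = val;
}

/*
 * Store a 64-bit value as two 32-bit halves so that the byte image at
 * the destination matches what a single 64-bit store would leave.
 */
static void model_write64_raw(uint64_t val, uint32_t *dest)
{
	uint32_t half[2];

	assert(((uintptr_t) dest & 3) == 0);	/* need 4-byte alignment */
	memcpy(half, &val, sizeof half);	/* split, bytes untouched */
	model_raw_writel(half[0], dest);
	model_raw_writel(half[1], dest + 1);
}
```

Note that the two halves are separate stores, so a reader of the
destination could observe a half-written entry between them.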

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h	2005-03-31 19:06:52.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_doorbell.h	2005-04-01 12:38:27.898904595 -0800
@@ -51,6 +51,11 @@
 #define MTHCA_INIT_DOORBELL_LOCK(ptr)    do { } while (0)
 #define MTHCA_GET_DOORBELL_LOCK(ptr)      (NULL)
 
+static inline void mthca_write64_raw(__be64 val, void __iomem *dest)
+{
+	__raw_writeq((__force u64) val, dest);
+}
+
 static inline void mthca_write64(u32 val[2], void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
@@ -74,6 +79,12 @@
 #define MTHCA_INIT_DOORBELL_LOCK(ptr)     spin_lock_init(ptr)
 #define MTHCA_GET_DOORBELL_LOCK(ptr)      (ptr)
 
+static inline void mthca_write64_raw(__be64 val, void __iomem *dest)
+{
+	__raw_writel(((__force u32 *) &val)[0], dest);
+	__raw_writel(((__force u32 *) &val)[1], dest + 4);
+}
+
 static inline void mthca_write64(u32 val[2], void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][20/27] IB/mthca: add mthca_table_find()
	function
In-Reply-To: <2005411249.t0DdCtarOabubO3D@topspin.com>
Message-ID: <2005411249.Tkvt1lzz8zEHUMmz@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Add the mthca_table_find() function, which returns the lowmem address
of an entry in a mem-free HCA's context tables.  This will be used by
the FMR implementation.
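
The lookup can be pictured with a small user-space sketch: turn the
object number into a byte offset, then walk the pieces of memory that
back the table until the piece containing that offset is found.  The
model_* types and functions are hypothetical and only mirror the shape
of the lookup; they are not the driver's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one contiguous piece of table memory. */
struct model_piece {
	unsigned char *base;
	size_t length;
};

/* Byte offset of object 'obj' in a table of 'num_obj' fixed-size entries. */
static size_t model_entry_offset(size_t obj, size_t num_obj, size_t obj_size)
{
	assert(num_obj && (num_obj & (num_obj - 1)) == 0);	/* power of two */
	return (obj & (num_obj - 1)) * obj_size;
}

/*
 * Walk the pieces in order and return the address at 'offset', or
 * NULL if the offset lies beyond the last piece.
 */
static void *model_find(const struct model_piece *pieces, size_t npieces,
			size_t offset)
{
	size_t i;

	for (i = 0; i < npieces; ++i) {
		if (offset < pieces[i].length)
			return pieces[i].base + offset;
		offset -= pieces[i].length;
	}
	return NULL;
}
```

The sketch stops at the first piece that contains the offset, and an
offset equal to a piece's length is treated as belonging to the next
piece.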

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:23.500859288 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:28.285820606 -0800
@@ -192,6 +192,40 @@
 	up(&table->mutex);
 }
 
+void *mthca_table_find(struct mthca_icm_table *table, int obj)
+{
+	int idx, offset, i;
+	struct mthca_icm_chunk *chunk;
+	struct mthca_icm *icm;
+	struct page *page = NULL;
+
+	if (!table->lowmem)
+		return NULL;
+
+	down(&table->mutex);
+
+	idx = (obj & (table->num_obj - 1)) * table->obj_size;
+	icm = table->icm[idx / MTHCA_TABLE_CHUNK_SIZE];
+	offset = idx % MTHCA_TABLE_CHUNK_SIZE;
+
+	if (!icm)
+		goto out;
+
+	list_for_each_entry(chunk, &icm->chunk_list, list) {
+		for (i = 0; i < chunk->npages; ++i) {
+			if (chunk->mem[i].length > offset) {
+				page = chunk->mem[i].page;
+				goto out;
+			}
+			offset -= chunk->mem[i].length;
+		}
+	}
+
+out:
+	up(&table->mutex);
+	return page ? lowmem_page_address(page) + offset : NULL;
+}
+
 int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table,
 			  int start, int end)
 {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-04-01 12:38:19.895641881 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-04-01 12:38:28.280821691 -0800
@@ -85,6 +85,7 @@
 void mthca_free_icm_table(struct mthca_dev *dev, struct mthca_icm_table *table);
 int mthca_table_get(struct mthca_dev *dev, struct mthca_icm_table *table, int obj);
 void mthca_table_put(struct mthca_dev *dev, struct mthca_icm_table *table, int obj);
+void *mthca_table_find(struct mthca_icm_table *table, int obj);
 int mthca_table_get_range(struct mthca_dev *dev, struct mthca_icm_table *table,
 			  int start, int end);
 void mthca_table_put_range(struct mthca_dev *dev, struct mthca_icm_table *table,


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][21/27] IB/mthca: split MR key munging
	routines
In-Reply-To: <2005411249.Tkvt1lzz8zEHUMmz@topspin.com>
Message-ID: <2005411249.VplL6XJIvCp9HHyP@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Split the Tavor and Arbel/mem-free index<->hw key munging routines, so
that the FMR implementation can call the correct routine without
testing the HCA type (which it already knows).
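
For reference, the two conversions can be exercised in user space.  The
sketch below copies the arithmetic from the patch (identity for Tavor,
an 8-bit rotation for Arbel/mem-free); the model_* names and the
round-trip helper are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Tavor: the key is the hardware index, unchanged. */
static uint32_t model_tavor_index_to_key(uint32_t ind) { return ind; }
static uint32_t model_tavor_key_to_index(uint32_t key) { return key; }

/* Arbel/mem-free: rotate the index left by 8 bits to form the key. */
static uint32_t model_arbel_index_to_key(uint32_t ind)
{
	return (ind >> 24) | (ind << 8);
}

/* Arbel/mem-free: rotate the key right by 8 bits to recover the index. */
static uint32_t model_arbel_key_to_index(uint32_t key)
{
	return (key << 24) | (key >> 8);
}

/* Each pair must be an exact inverse for any 32-bit index. */
static void model_check_roundtrip(uint32_t ind)
{
	assert(model_tavor_key_to_index(model_tavor_index_to_key(ind)) == ind);
	assert(model_arbel_key_to_index(model_arbel_index_to_key(ind)) == ind);
}
```

With this arithmetic, an Arbel index of 0x12345678 becomes the key
0x34567812, and converting that key back yields the original index.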

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:27.075083423 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:28.676735749 -0800
@@ -198,20 +198,40 @@
 				      seg + (1 << order) - 1);
 }
 
+static inline u32 tavor_hw_index_to_key(u32 ind)
+{
+	return ind;
+}
+
+static inline u32 tavor_key_to_hw_index(u32 key)
+{
+	return key;
+}
+
+static inline u32 arbel_hw_index_to_key(u32 ind)
+{
+	return (ind >> 24) | (ind << 8);
+}
+
+static inline u32 arbel_key_to_hw_index(u32 key)
+{
+	return (key << 24) | (key >> 8);
+}
+
 static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind)
 {
 	if (dev->hca_type == ARBEL_NATIVE)
-		return (ind >> 24) | (ind << 8);
+		return arbel_hw_index_to_key(ind);
 	else
-		return ind;
+		return tavor_hw_index_to_key(ind);
 }
 
 static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key)
 {
 	if (dev->hca_type == ARBEL_NATIVE)
-		return (key << 24) | (key >> 8);
+		return arbel_key_to_hw_index(key);
 	else
-		return key;
+		return tavor_key_to_hw_index(key);
 }
 
 int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd,


From roland at topspin.com  Fri Apr  1 12:49:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:54 -0800
Subject: [openib-general] [PATCH][23/27] IB/mthca: tweaks to mthca_cmd.c
In-Reply-To: <2005411249.CxF3RBWpNJELwaqL@topspin.com>
Message-ID: <2005411249.5GDmFAellTSOT0Ai@topspin.com>

Minor tweaks to firmware command handling: kill off an unused get of a
value, and add a little more info to debug output.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:27.495992056 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:30.084430178 -0800
@@ -989,7 +989,6 @@
 		dev_lim->hca.arbel.resize_srq = field & 1;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
 		dev_lim->max_sg = min_t(int, field, dev_lim->max_sg);
-		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET);
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET);
 		dev_lim->mpt_entry_sz = size;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET);
@@ -1297,8 +1296,8 @@
 	pci_free_consistent(dev->pdev, 16, inbox, indma);
 
 	if (!err)
-		mthca_dbg(dev, "Mapped page at %llx for ICM.\n",
-			  (unsigned long long) virt);
+		mthca_dbg(dev, "Mapped page at %llx to %llx for ICM.\n",
+			  (unsigned long long) dma_addr, (unsigned long long) virt);
 
 	return err;
 }


From roland at topspin.com  Fri Apr  1 12:49:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:53 -0800
Subject: [openib-general] [PATCH][22/27] IB/mthca: add fast memory region
	implementation
In-Reply-To: <2005411249.VplL6XJIvCp9HHyP@topspin.com>
Message-ID: <2005411249.CxF3RBWpNJELwaqL@topspin.com>

From: Michael S. Tsirkin <mst at mellanox.co.il>

Implement fast memory regions (FMRs), where the driver writes directly
into the HCA's translation tables rather than requiring a firmware
command.  For Tavor, MTTs for FMR are separate from regular MTTs, and
are reserved at driver initialization. This is done to limit the
amount of virtual memory needed to map the MTTs.  For Arbel, there's
no such limitation, and all MTTs and MPTs may be used for FMR or for
regular MR.
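
One detail of the map and unmap routines in this patch is how the key
changes: each map adds num_mpts to the hardware index, and unmap masks
it back down to the MPT slot.  The user-space sketch below models only
that arithmetic, assuming num_mpts is a power of two; the model_* names
are hypothetical, and on Arbel the result would still pass through the
index<->key rotation.

```c
#include <assert.h>
#include <stdint.h>

/* MPT slot addressed by a hardware index: its low log2(num_mpts) bits. */
static uint32_t model_fmr_slot(uint32_t hw_index, uint32_t num_mpts)
{
	assert(num_mpts && (num_mpts & (num_mpts - 1)) == 0);	/* power of two */
	return hw_index & (num_mpts - 1);
}

/* Map: bump the bits above the slot so the previous key goes stale. */
static uint32_t model_fmr_index_on_map(uint32_t hw_index, uint32_t num_mpts)
{
	assert(num_mpts && (num_mpts & (num_mpts - 1)) == 0);
	return hw_index + num_mpts;
}

/* Unmap: drop the bumped bits and return to the bare slot index. */
static uint32_t model_fmr_index_on_unmap(uint32_t hw_index, uint32_t num_mpts)
{
	return model_fmr_slot(hw_index, num_mpts);
}
```

Because only the bits above the slot change, every key handed out for
one FMR refers to the same MPT entry, while a key from an earlier
mapping no longer matches the current one.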

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:27.068084943 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:29.460565601 -0800
@@ -61,7 +61,8 @@
 	MTHCA_FLAG_SRQ        = 1 << 2,
 	MTHCA_FLAG_MSI        = 1 << 3,
 	MTHCA_FLAG_MSI_X      = 1 << 4,
-	MTHCA_FLAG_NO_LAM     = 1 << 5
+	MTHCA_FLAG_NO_LAM     = 1 << 5,
+	MTHCA_FLAG_FMR        = 1 << 6
 };
 
 enum {
@@ -134,6 +135,7 @@
 	int      reserved_eqs;
 	int      num_mpts;
 	int      num_mtt_segs;
+	int      fmr_reserved_mtts;
 	int      reserved_mtts;
 	int      reserved_mrws;
 	int      reserved_uars;
@@ -178,10 +180,17 @@
 
 struct mthca_mr_table {
 	struct mthca_alloc      mpt_alloc;
-	struct mthca_buddy	mtt_buddy;
+	struct mthca_buddy      mtt_buddy;
+	struct mthca_buddy     *fmr_mtt_buddy;
 	u64                     mtt_base;
+	u64                     mpt_base;
 	struct mthca_icm_table *mtt_table;
 	struct mthca_icm_table *mpt_table;
+	struct {
+		void __iomem   *mpt_base;
+		void __iomem   *mtt_base;
+		struct mthca_buddy mtt_buddy;
+	} tavor_fmr;
 };
 
 struct mthca_eq_table {
@@ -380,7 +389,17 @@
 			u64 *buffer_list, int buffer_size_shift,
 			int list_len, u64 iova, u64 total_size,
 			u32 access, struct mthca_mr *mr);
-void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr);
+void mthca_free_mr(struct mthca_dev *dev,  struct mthca_mr *mr);
+
+int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd,
+		    u32 access, struct mthca_fmr *fmr);
+int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
+			     int list_len, u64 iova);
+void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr);
+int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
+			     int list_len, u64 iova);
+void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr);
+int mthca_free_fmr(struct mthca_dev *dev,  struct mthca_fmr *fmr);
 
 int mthca_map_eq_icm(struct mthca_dev *dev, u64 icm_virt);
 void mthca_unmap_eq_icm(struct mthca_dev *dev);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:25.566410914 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:29.466564299 -0800
@@ -73,14 +73,15 @@
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
 static struct mthca_profile default_profile = {
-	.num_qp     = 1 << 16,
-	.rdb_per_qp = 4,
-	.num_cq     = 1 << 16,
-	.num_mcg    = 1 << 13,
-	.num_mpt    = 1 << 17,
-	.num_mtt    = 1 << 20,
-	.num_udav   = 1 << 15,	/* Tavor only */
-	.uarc_size  = 1 << 18,	/* Arbel only */
+	.num_qp		   = 1 << 16,
+	.rdb_per_qp	   = 4,
+	.num_cq		   = 1 << 16,
+	.num_mcg	   = 1 << 13,
+	.num_mpt	   = 1 << 17,
+	.num_mtt	   = 1 << 20,
+	.num_udav	   = 1 << 15,	/* Tavor only */
+	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
+	.uarc_size	   = 1 << 18,	/* Arbel only */
 };
 
 static int __devinit mthca_tune_pci(struct mthca_dev *mdev)
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:28.676735749 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:29.493558440 -0800
@@ -66,6 +66,9 @@
 
 #define MTHCA_MTT_FLAG_PRESENT       1
 
+#define MTHCA_MPT_STATUS_SW 0xF0
+#define MTHCA_MPT_STATUS_HW 0x00
+
 /*
  * Buddy allocator for MTT segments (currently not very efficient
  * since it doesn't keep a free list and just searches linearly
@@ -442,6 +445,20 @@
 	return err;
 }
 
+/* Free mr or fmr */
+static void mthca_free_region(struct mthca_dev *dev, u32 lkey, int order,
+			      u32 first_seg, struct mthca_buddy *buddy)
+{
+	if (order >= 0)
+		mthca_free_mtt(dev, first_seg, order, buddy);
+
+	if (dev->hca_type == ARBEL_NATIVE)
+		mthca_table_put(dev, dev->mr_table.mpt_table,
+				arbel_key_to_hw_index(lkey));
+
+	mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, lkey));
+}
+
 void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr)
 {
 	int err;
@@ -459,18 +476,288 @@
 		mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n",
 			   status);
 
-	if (mr->order >= 0)
-		mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy);
+	mthca_free_region(dev, mr->ibmr.lkey, mr->order, mr->first_seg,
+			  &dev->mr_table.mtt_buddy);
+}
+
+int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd,
+		    u32 access, struct mthca_fmr *mr)
+{
+	struct mthca_mpt_entry *mpt_entry;
+	void *mailbox;
+	u64 mtt_seg;
+	u32 key, idx;
+	u8 status;
+	int list_len = mr->attr.max_pages;
+	int err = -ENOMEM;
+	int i;
+
+	might_sleep();
+
+	if (mr->attr.page_size < 12 || mr->attr.page_size >= 32)
+		return -EINVAL;
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	if (dev->hca_type == ARBEL_NATIVE &&
+	    mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE)
+		return -EINVAL;
+
+	mr->maps = 0;
+
+	key = mthca_alloc(&dev->mr_table.mpt_alloc);
+	if (key == -1)
+		return -ENOMEM;
+
+	idx = key & (dev->limits.num_mpts - 1);
+	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
+
+	if (dev->hca_type == ARBEL_NATIVE) {
+		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
+		if (err)
+			goto err_out_mpt_free;
+
+		mr->mem.arbel.mpt = mthca_table_find(dev->mr_table.mpt_table, key);
+		BUG_ON(!mr->mem.arbel.mpt);
+	} else
+		mr->mem.tavor.mpt = dev->mr_table.tavor_fmr.mpt_base +
+		       	sizeof *(mr->mem.tavor.mpt) * idx;
+
+	for (i = MTHCA_MTT_SEG_SIZE / 8, mr->order = 0;
+	     i < list_len;
+	     i <<= 1, ++mr->order)
+		; /* nothing */
+
+	mr->first_seg = mthca_alloc_mtt(dev, mr->order,
+				       	dev->mr_table.fmr_mtt_buddy);
+	if (mr->first_seg == -1)
+		goto err_out_table;
+
+	mtt_seg = mr->first_seg * MTHCA_MTT_SEG_SIZE;
+
+	if (dev->hca_type == ARBEL_NATIVE) {
+		mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table,
+						      mr->first_seg);
+		BUG_ON(!mr->mem.arbel.mtts);
+	} else
+		mr->mem.tavor.mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg;
+
+	mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA,
+			  GFP_KERNEL);
+	if (!mailbox)
+		goto err_out_free_mtt;
+
+	mpt_entry = MAILBOX_ALIGN(mailbox);
+
+	mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS     |
+				       MTHCA_MPT_FLAG_MIO         |
+				       MTHCA_MPT_FLAG_REGION      |
+				       access);
+
+	mpt_entry->page_size = cpu_to_be32(mr->attr.page_size - 12);
+	mpt_entry->key       = cpu_to_be32(key);
+	mpt_entry->pd        = cpu_to_be32(pd);
+	memset(&mpt_entry->start, 0,
+	       sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, start));
+	mpt_entry->mtt_seg   = cpu_to_be64(dev->mr_table.mtt_base + mtt_seg);
+
+	if (0) {
+		mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey);
+		for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) {
+			if (i % 4 == 0)
+				printk("[%02x] ", i * 4);
+			printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i]));
+			if ((i + 1) % 4 == 0)
+				printk("\n");
+		}
+	}
+
+	err = mthca_SW2HW_MPT(dev, mpt_entry,
+			      key & (dev->limits.num_mpts - 1),
+			      &status);
+	if (err) {
+		mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err);
+		goto err_out_mailbox_free;
+	}
+	if (status) {
+		mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n",
+			   status);
+		err = -EINVAL;
+		goto err_out_mailbox_free;
+	}
+
+	kfree(mailbox);
+	return 0;
+
+err_out_mailbox_free:
+	kfree(mailbox);
+
+err_out_free_mtt:
+	mthca_free_mtt(dev, mr->first_seg, mr->order,
+		       dev->mr_table.fmr_mtt_buddy);
 
+err_out_table:
 	if (dev->hca_type == ARBEL_NATIVE)
-		mthca_table_put(dev, dev->mr_table.mpt_table,
-				key_to_hw_index(dev, mr->ibmr.lkey));
-	mthca_free(&dev->mr_table.mpt_alloc, key_to_hw_index(dev, mr->ibmr.lkey));
+		mthca_table_put(dev, dev->mr_table.mpt_table, key);
+
+err_out_mpt_free:
+	mthca_free(&dev->mr_table.mpt_alloc, key);
+	return err;
+}
+
+int mthca_free_fmr(struct mthca_dev *dev, struct mthca_fmr *fmr)
+{
+	if (fmr->maps)
+		return -EBUSY;
+
+	mthca_free_region(dev, fmr->ibmr.lkey, fmr->order, fmr->first_seg,
+			  dev->mr_table.fmr_mtt_buddy);
+	return 0;
+}
+
+static inline int mthca_check_fmr(struct mthca_fmr *fmr, u64 *page_list,
+				  int list_len, u64 iova)
+{
+	int i, page_mask;
+
+	if (list_len > fmr->attr.max_pages)
+		return -EINVAL;
+
+	page_mask = (1 << fmr->attr.page_size) - 1;
+
+	/* We are getting page lists, so va must be page aligned. */
+	if (iova & page_mask)
+		return -EINVAL;
+
+	/* Trust the user not to pass misaligned data in page_list */
+	if (0)
+		for (i = 0; i < list_len; ++i) {
+			if (page_list[i] & ~page_mask)
+				return -EINVAL;
+		}
+
+	if (fmr->maps >= fmr->attr.max_maps)
+		return -EINVAL;
+
+	return 0;
+}
+
+
+int mthca_tavor_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
+			     int list_len, u64 iova)
+{
+	struct mthca_fmr *fmr = to_mfmr(ibfmr);
+	struct mthca_dev *dev = to_mdev(ibfmr->device);
+	struct mthca_mpt_entry mpt_entry;
+	u32 key;
+	int i, err;
+
+	err = mthca_check_fmr(fmr, page_list, list_len, iova);
+	if (err)
+		return err;
+
+	++fmr->maps;
+
+	key = tavor_key_to_hw_index(fmr->ibmr.lkey);
+	key += dev->limits.num_mpts;
+	fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key);
+
+	writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt);
+
+	for (i = 0; i < list_len; ++i) {
+		__be64 mtt_entry = cpu_to_be64(page_list[i] |
+					       MTHCA_MTT_FLAG_PRESENT);
+		mthca_write64_raw(mtt_entry, fmr->mem.tavor.mtts + i);
+	}
+
+	mpt_entry.lkey   = cpu_to_be32(key);
+	mpt_entry.length = cpu_to_be64(list_len * (1ull << fmr->attr.page_size));
+	mpt_entry.start  = cpu_to_be64(iova);
+
+	writel(mpt_entry.lkey, &fmr->mem.tavor.mpt->key);
+	memcpy_toio(&fmr->mem.tavor.mpt->start, &mpt_entry.start, 
+		    offsetof(struct mthca_mpt_entry, window_count) -
+		    offsetof(struct mthca_mpt_entry, start));
+
+	writeb(MTHCA_MPT_STATUS_HW, fmr->mem.tavor.mpt);
+
+	return 0;
+}
+
+int mthca_arbel_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
+			     int list_len, u64 iova)
+{
+	struct mthca_fmr *fmr = to_mfmr(ibfmr);
+	struct mthca_dev *dev = to_mdev(ibfmr->device);
+	u32 key;
+	int i, err;
+
+	err = mthca_check_fmr(fmr, page_list, list_len, iova);
+	if (err)
+		return err;
+
+	++fmr->maps;
+
+	key = arbel_key_to_hw_index(fmr->ibmr.lkey);
+	key += dev->limits.num_mpts;
+	fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key);
+
+	*(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_SW;
+
+	wmb();
+
+	for (i = 0; i < list_len; ++i)
+		fmr->mem.arbel.mtts[i] = cpu_to_be64(page_list[i] |
+						     MTHCA_MTT_FLAG_PRESENT);
+
+	fmr->mem.arbel.mpt->key    = cpu_to_be32(key);
+	fmr->mem.arbel.mpt->lkey   = cpu_to_be32(key);
+	fmr->mem.arbel.mpt->length = cpu_to_be64(list_len * (1ull << fmr->attr.page_size));
+	fmr->mem.arbel.mpt->start  = cpu_to_be64(iova);
+
+	wmb();
+
+	*(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_HW;
+
+	wmb();
+
+	return 0;
+}
+
+void mthca_tavor_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr)
+{
+	u32 key;
+
+	if (!fmr->maps)
+		return;
+
+	key = tavor_key_to_hw_index(fmr->ibmr.lkey);
+	key &= dev->limits.num_mpts - 1;
+	fmr->ibmr.lkey = fmr->ibmr.rkey = tavor_hw_index_to_key(key);
+
+	fmr->maps = 0;
+
+	writeb(MTHCA_MPT_STATUS_SW, fmr->mem.tavor.mpt);
+}
+
+void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr)
+{
+	u32 key;
+
+	if (!fmr->maps)
+		return;
+
+	key = arbel_key_to_hw_index(fmr->ibmr.lkey);
+	key &= dev->limits.num_mpts - 1;
+	fmr->ibmr.lkey = fmr->ibmr.rkey = arbel_hw_index_to_key(key);
+
+	fmr->maps = 0;
+
+	*(u8 *) fmr->mem.arbel.mpt = MTHCA_MPT_STATUS_SW;
 }
 
 int __devinit mthca_init_mr_table(struct mthca_dev *dev)
 {
-	int err;
+	int err, i;
 
 	err = mthca_alloc_init(&dev->mr_table.mpt_alloc,
 			       dev->limits.num_mpts,
@@ -478,23 +765,93 @@
 	if (err)
 		return err;
 
+	if (dev->hca_type != ARBEL_NATIVE &&
+	    (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN))
+		dev->limits.fmr_reserved_mtts = 0;
+	else
+		dev->mthca_flags |= MTHCA_FLAG_FMR;
+
 	err = mthca_buddy_init(&dev->mr_table.mtt_buddy,
 			       fls(dev->limits.num_mtt_segs - 1));
+
 	if (err)
 		goto err_mtt_buddy;
 
+	dev->mr_table.tavor_fmr.mpt_base = NULL;
+	dev->mr_table.tavor_fmr.mtt_base = NULL;
+
+	if (dev->limits.fmr_reserved_mtts) {
+		i = fls(dev->limits.fmr_reserved_mtts - 1);
+
+		if (i >= 31) {
+			mthca_warn(dev, "Unable to reserve 2^31 FMR MTTs.\n");
+			err = -EINVAL;
+			goto err_fmr_mpt;
+		}
+
+		dev->mr_table.tavor_fmr.mpt_base =
+		       	ioremap(dev->mr_table.mpt_base,
+				(1 << i) * sizeof (struct mthca_mpt_entry));
+
+		if (!dev->mr_table.tavor_fmr.mpt_base) {
+			mthca_warn(dev, "MPT ioremap for FMR failed.\n");
+			err = -ENOMEM;
+			goto err_fmr_mpt;
+		}
+
+		dev->mr_table.tavor_fmr.mtt_base =
+			ioremap(dev->mr_table.mtt_base,
+				(1 << i) * MTHCA_MTT_SEG_SIZE);
+		if (!dev->mr_table.tavor_fmr.mtt_base) {
+			mthca_warn(dev, "MTT ioremap for FMR failed.\n");
+			err = -ENOMEM;
+			goto err_fmr_mtt;
+		}
+
+		err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i);
+		if (err)
+			goto err_fmr_mtt_buddy;
+
+		/* Prevent regular MRs from using FMR keys */
+		err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i);
+		if (err)
+			goto err_reserve_fmr;
+
+		dev->mr_table.fmr_mtt_buddy =
+		       	&dev->mr_table.tavor_fmr.mtt_buddy;
+	} else
+		dev->mr_table.fmr_mtt_buddy = &dev->mr_table.mtt_buddy;
+
+	/* FMR table is always the first, take reserved MTTs out of there */
 	if (dev->limits.reserved_mtts) {
-		if (mthca_alloc_mtt(dev, fls(dev->limits.reserved_mtts - 1),
-				    &dev->mr_table.mtt_buddy) == -1) {
+		i = fls(dev->limits.reserved_mtts - 1);
+		
+		if (mthca_alloc_mtt(dev, i, dev->mr_table.fmr_mtt_buddy) == -1) {
 			mthca_warn(dev, "MTT table of order %d is too small.\n",
-				  dev->mr_table.mtt_buddy.max_order);
+				  dev->mr_table.fmr_mtt_buddy->max_order);
 			err = -ENOMEM;
-			goto err_mtt_buddy;
+			goto err_reserve_mtts;
 		}
 	}
 
 	return 0;
 
+err_reserve_mtts:
+err_reserve_fmr:
+	if (dev->limits.fmr_reserved_mtts)
+		mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy);
+
+err_fmr_mtt_buddy:
+	if (dev->mr_table.tavor_fmr.mtt_base)
+		iounmap(dev->mr_table.tavor_fmr.mtt_base);
+
+err_fmr_mtt:
+	if (dev->mr_table.tavor_fmr.mpt_base)
+		iounmap(dev->mr_table.tavor_fmr.mpt_base);
+
+err_fmr_mpt:
+	mthca_buddy_cleanup(&dev->mr_table.mtt_buddy);
+
 err_mtt_buddy:
 	mthca_alloc_cleanup(&dev->mr_table.mpt_alloc);
 
@@ -504,6 +861,15 @@
 void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev)
 {
 	/* XXX check if any MRs are still allocated? */
+	if (dev->limits.fmr_reserved_mtts)
+		mthca_buddy_cleanup(&dev->mr_table.tavor_fmr.mtt_buddy);
+
 	mthca_buddy_cleanup(&dev->mr_table.mtt_buddy);
+
+	if (dev->mr_table.tavor_fmr.mtt_base)
+		iounmap(dev->mr_table.tavor_fmr.mtt_base);
+	if (dev->mr_table.tavor_fmr.mpt_base)
+		iounmap(dev->mr_table.tavor_fmr.mpt_base);
+
 	mthca_alloc_cleanup(&dev->mr_table.mpt_alloc);
 }
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:25.570410046 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:29.480561261 -0800
@@ -223,9 +223,10 @@
 			init_hca->mc_hash_sz      = 1 << (profile[i].log_num - 1);
 			break;
 		case MTHCA_RES_MPT:
-			dev->limits.num_mpts = profile[i].num;
-			init_hca->mpt_base   = profile[i].start;
-			init_hca->log_mpt_sz = profile[i].log_num;
+			dev->limits.num_mpts   = profile[i].num;
+			dev->mr_table.mpt_base = profile[i].start;
+			init_hca->mpt_base     = profile[i].start;
+			init_hca->log_mpt_sz   = profile[i].log_num;
 			break;
 		case MTHCA_RES_MTT:
 			dev->limits.num_mtt_segs = profile[i].num;
@@ -259,6 +260,18 @@
 	 */
 	dev->limits.num_pds = MTHCA_NUM_PDS;
 
+	/*
+	 * For Tavor, FMRs use ioremapped PCI memory. For 32 bit
+	 * systems it may use too much vmalloc space to map all MTT
+	 * memory, so we reserve some MTTs for FMR access, taking them
+	 * out of the MR pool. They don't use additional memory, but
+	 * we assign them as part of the HCA profile anyway.
+	 */
+	if (dev->hca_type == ARBEL_NATIVE)
+		dev->limits.fmr_reserved_mtts = 0;
+	else
+		dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;
+
 	kfree(profile);
 	return total_size;
 }
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.h	2005-03-31 19:07:01.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.h	2005-04-01 12:38:29.484560393 -0800
@@ -48,6 +48,7 @@
 	int num_udav;
 	int num_uar;
 	int uarc_size;
+	int fmr_reserved_mtts;
 };
 
 u64 mthca_make_profile(struct mthca_dev *mdev,
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:26.644176961 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:29.471563214 -0800
@@ -574,6 +574,74 @@
 	return 0;
 }
 
+static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags,
+				      struct ib_fmr_attr *fmr_attr)
+{
+	struct mthca_fmr *fmr;
+	int err;
+
+	fmr = kmalloc(sizeof *fmr, GFP_KERNEL);
+	if (!fmr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr);
+	err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num,
+			     convert_access(mr_access_flags), fmr);
+
+	if (err) {
+		kfree(fmr);
+		return ERR_PTR(err);
+	}
+
+	return &fmr->ibmr;
+}
+
+static int mthca_dealloc_fmr(struct ib_fmr *fmr)
+{
+	struct mthca_fmr *mfmr = to_mfmr(fmr);
+	int err;
+
+	err = mthca_free_fmr(to_mdev(fmr->device), mfmr);
+	if (err)
+		return err;
+
+	kfree(mfmr);
+	return 0;
+}
+
+static int mthca_unmap_fmr(struct list_head *fmr_list)
+{
+	struct ib_fmr *fmr;
+	int err;
+	u8 status;
+	struct mthca_dev *mdev = NULL;
+
+	list_for_each_entry(fmr, fmr_list, list) {
+		if (mdev && to_mdev(fmr->device) != mdev)
+			return -EINVAL;
+		mdev = to_mdev(fmr->device);
+	}
+
+	if (!mdev)
+		return 0;
+
+	if (mdev->hca_type == ARBEL_NATIVE) {
+		list_for_each_entry(fmr, fmr_list, list)
+			mthca_arbel_fmr_unmap(mdev, to_mfmr(fmr));
+
+		wmb();
+	} else
+		list_for_each_entry(fmr, fmr_list, list)
+			mthca_tavor_fmr_unmap(mdev, to_mfmr(fmr));
+
+	err = mthca_SYNC_TPT(mdev, &status);
+	if (err)
+		return err;
+	if (status)
+		return -EINVAL;
+	return 0;
+}
+
 static ssize_t show_rev(struct class_device *cdev, char *buf)
 {
 	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
@@ -637,6 +705,17 @@
 	dev->ib_dev.get_dma_mr           = mthca_get_dma_mr;
 	dev->ib_dev.reg_phys_mr          = mthca_reg_phys_mr;
 	dev->ib_dev.dereg_mr             = mthca_dereg_mr;
+
+	if (dev->mthca_flags & MTHCA_FLAG_FMR) {
+		dev->ib_dev.alloc_fmr            = mthca_alloc_fmr;
+		dev->ib_dev.unmap_fmr            = mthca_unmap_fmr;
+		dev->ib_dev.dealloc_fmr          = mthca_dealloc_fmr;
+		if (dev->hca_type == ARBEL_NATIVE)
+			dev->ib_dev.map_phys_fmr = mthca_arbel_map_phys_fmr;
+		else
+			dev->ib_dev.map_phys_fmr = mthca_tavor_map_phys_fmr;
+	}
+
 	dev->ib_dev.attach_mcast         = mthca_multicast_attach;
 	dev->ib_dev.detach_mcast         = mthca_multicast_detach;
 	dev->ib_dev.process_mad          = mthca_process_mad;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h	2005-03-31 19:06:47.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h	2005-04-01 12:38:29.475562346 -0800
@@ -60,6 +60,24 @@
 	u32 first_seg;
 };
 
+struct mthca_fmr {
+	struct ib_fmr ibmr;
+	struct ib_fmr_attr attr;
+	int order;
+	u32 first_seg;
+	int maps;
+	union {
+		struct {
+			struct mthca_mpt_entry __iomem *mpt;
+			u64 __iomem *mtts;
+		} tavor;
+		struct {
+			struct mthca_mpt_entry *mpt;
+			__be64 *mtts;
+		} arbel;
+	} mem;
+};
+
 struct mthca_pd {
 	struct ib_pd    ibpd;
 	u32             pd_num;
@@ -218,6 +236,11 @@
 	dma_addr_t      header_dma;
 };
 
+static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr)
+{
+	return container_of(ibmr, struct mthca_fmr, ibmr);
+}
+
 static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr)
 {
 	return container_of(ibmr, struct mthca_mr, ibmr);


From roland at topspin.com  Fri Apr  1 12:49:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:54 -0800
Subject: [openib-general] [PATCH][24/27] IB/mthca: encapsulate mem-free
	check into mthca_is_memfree()
In-Reply-To: <2005411249.5GDmFAellTSOT0Ai@topspin.com>
Message-ID: <2005411249.qaesrlpuSaCRRPRE@topspin.com>

Clean up mem-free mode support by introducing the mthca_is_memfree()
function, which encapsulates the logic of deciding if a device is
mem-free.
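
The definition of the helper is not part of the hunks shown here.  A
minimal sketch, assuming it is equivalent to the comparison it replaces
at each call site, might look like the following; the model_* enum,
struct and function are hypothetical stand-ins, not the driver's own
declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's HCA type and device struct. */
enum model_hca_type {
	MODEL_TAVOR,
	MODEL_ARBEL_COMPAT,
	MODEL_ARBEL_NATIVE
};

struct model_dev {
	enum model_hca_type hca_type;
};

/*
 * Assumed equivalent of the replaced test: a device counts as
 * mem-free when it runs in Arbel native mode.
 */
static inline int model_is_memfree(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->hca_type == MODEL_ARBEL_NATIVE;
}
```

Keeping the test in one place means a later change to how mem-free
devices are detected would touch only the helper, not every caller.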

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_av.c	2005-04-01 12:38:26.648176093 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_av.c	2005-04-01 12:38:30.803274137 -0800
@@ -62,7 +62,7 @@
 
 	ah->type = MTHCA_AH_PCI_POOL;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		ah->av   = kmalloc(sizeof *ah->av, GFP_ATOMIC);
 		if (!ah->av)
 			return -ENOMEM;
@@ -192,7 +192,7 @@
 {
 	int err;
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		return 0;
 
 	err = mthca_alloc_init(&dev->av_table.alloc,
@@ -231,7 +231,7 @@
 
 void __devexit mthca_cleanup_av_table(struct mthca_dev *dev)
 {
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		return;
 
 	if (dev->av_table.av_map)
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:30.084430178 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-04-01 12:38:30.790276958 -0800
@@ -651,7 +651,7 @@
 	mthca_dbg(dev, "FW version %012llx, max commands %d\n",
 		  (unsigned long long) dev->fw_ver, dev->cmd.max_cmds);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		MTHCA_GET(dev->fw.arbel.fw_pages,       outbox, QUERY_FW_SIZE_OFFSET);
 		MTHCA_GET(dev->fw.arbel.clr_int_base,   outbox, QUERY_FW_CLR_INT_BASE_OFFSET);
 		MTHCA_GET(dev->fw.arbel.eq_arm_base,    outbox, QUERY_FW_EQ_ARM_BASE_OFFSET);
@@ -984,7 +984,7 @@
 
 	mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET);
 		dev_lim->hca.arbel.resize_srq = field & 1;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
@@ -1148,7 +1148,7 @@
 	/* TPT attributes */
 
 	MTHCA_PUT(inbox, param->mpt_base,   INIT_HCA_MPT_BASE_OFFSET);
-	if (dev->hca_type != ARBEL_NATIVE)
+	if (!mthca_is_memfree(dev))
 		MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET);
 	MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET);
 	MTHCA_PUT(inbox, param->mtt_base,   INIT_HCA_MTT_BASE_OFFSET);
@@ -1161,7 +1161,7 @@
 
 	MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		MTHCA_PUT(inbox, param->log_uarc_sz, INIT_HCA_UARC_SZ_OFFSET);
 		MTHCA_PUT(inbox, param->log_uar_sz,  INIT_HCA_LOG_UAR_SZ_OFFSET);
 		MTHCA_PUT(inbox, param->uarc_base,   INIT_HCA_UAR_CTX_BASE_OFFSET);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-01 12:38:26.177278312 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-01 12:38:30.794276090 -0800
@@ -180,7 +180,7 @@
 {
 	u32 doorbell[2];
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		*cq->set_ci_db = cpu_to_be32(cq->cons_index);
 		wmb();
 	} else {
@@ -760,7 +760,7 @@
 	if (cq->cqn == -1)
 		return -ENOMEM;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		cq->arm_sn = 1;
 
 		err = mthca_table_get(dev, dev->cq_table.table, cq->cqn);
@@ -811,7 +811,7 @@
 	cq_context->lkey            = cpu_to_be32(cq->mr.ibmr.lkey);
 	cq_context->cqn             = cpu_to_be32(cq->cqn);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		cq_context->ci_db    = cpu_to_be32(cq->set_ci_db_index);
 		cq_context->state_db = cpu_to_be32(cq->arm_db_index);
 	}
@@ -851,11 +851,11 @@
 err_out_mailbox:
 	kfree(mailbox);
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
 
 err_out_ci:
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
 
 err_out_icm:
@@ -916,7 +916,7 @@
 	mthca_free_mr(dev, &cq->mr);
 	mthca_free_cq_buf(dev, cq);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM,    cq->arm_db_index);
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
 		mthca_table_put(dev, dev->cq_table.table, cq->cqn);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:29.460565601 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:30.772280864 -0800
@@ -470,4 +470,9 @@
 	return container_of(ibdev, struct mthca_dev, ib_dev);
 }
 
+static inline int mthca_is_memfree(struct mthca_dev *dev)
+{
+	return dev->hca_type == ARBEL_NATIVE;
+}
+
 #endif /* MTHCA_DEV_H */
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2005-04-01 12:38:24.575625986 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_eq.c	2005-04-01 12:38:30.799275005 -0800
@@ -198,7 +198,7 @@
 
 static inline void set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 {
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		arbel_set_eq_ci(dev, eq, ci);
 	else
 		tavor_set_eq_ci(dev, eq, ci);
@@ -223,7 +223,7 @@
 
 static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn)
 {
-	if (dev->hca_type != ARBEL_NATIVE) {
+	if (!mthca_is_memfree(dev)) {
 		u32 doorbell[2];
 
 		doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
@@ -535,11 +535,11 @@
 						  MTHCA_EQ_OWNER_HW    |
 						  MTHCA_EQ_STATE_ARMED |
 						  MTHCA_EQ_FLAG_TR);
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		eq_context->flags  |= cpu_to_be32(MTHCA_EQ_STATE_ARBEL);
 
 	eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24);
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		eq_context->arbel_pd = cpu_to_be32(dev->driver_pd.pd_num);
 	} else {
 		eq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index);
@@ -686,7 +686,7 @@
 
 	mthca_base = pci_resource_start(dev->pdev, 0);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		/*
 		 * We assume that the EQ arm and EQ set CI registers
 		 * fall within the first BAR.  We can't trust the
@@ -756,7 +756,7 @@
 
 static void __devexit mthca_unmap_eq_regs(struct mthca_dev *dev)
 {
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		mthca_unmap_reg(dev, (pci_resource_len(dev->pdev, 0) - 1) &
 				dev->fw.arbel.eq_set_ci_base,
 				MTHCA_EQ_SET_CI_SIZE,
@@ -880,7 +880,7 @@
 
 		for (i = 0; i < MTHCA_NUM_EQ; ++i) {
 			err = request_irq(dev->eq_table.eq[i].msi_x_vector,
-					  dev->hca_type == ARBEL_NATIVE ?
+					  mthca_is_memfree(dev) ?
 					  mthca_arbel_msi_x_interrupt :
 					  mthca_tavor_msi_x_interrupt,
 					  0, eq_name[i], dev->eq_table.eq + i);
@@ -890,7 +890,7 @@
 		}
 	} else {
 		err = request_irq(dev->pdev->irq,
-				  dev->hca_type == ARBEL_NATIVE ?
+				  mthca_is_memfree(dev) ?
 				  mthca_arbel_interrupt :
 				  mthca_tavor_interrupt,
 				  SA_SHIRQ, DRV_NAME, dev);
@@ -918,7 +918,7 @@
 			   dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status);
 
 	for (i = 0; i < MTHCA_EQ_CMD; ++i)
-		if (dev->hca_type == ARBEL_NATIVE)
+		if (mthca_is_memfree(dev))
 			arbel_eq_req_not(dev, dev->eq_table.eq[i].eqn_mask);
 		else
 			tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:29.466564299 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:30.776279996 -0800
@@ -601,7 +601,7 @@
 
 static int __devinit mthca_init_hca(struct mthca_dev *mdev)
 {
-	if (mdev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(mdev))
 		return mthca_init_arbel(mdev);
 	else
 		return mthca_init_tavor(mdev);
@@ -835,7 +835,7 @@
 
 	mthca_CLOSE_HCA(mdev, 0, &status);
 
-	if (mdev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(mdev)) {
 		mthca_free_icm_table(mdev, mdev->cq_table.table);
 		mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
 		mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
@@ -939,7 +939,7 @@
 	mdev->pdev     = pdev;
 	mdev->hca_type = id->driver_data;
 
-	if (mdev->hca_type == ARBEL_NATIVE && !mthca_memfree_warned++)
+	if (mthca_is_memfree(mdev) && !mthca_memfree_warned++)
 		mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete.  "
 			   "Your HCA may not work properly.\n");
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:28.285820606 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-01 12:38:30.831268060 -0800
@@ -472,7 +472,7 @@
 {
 	int i;
 
-	if (dev->hca_type != ARBEL_NATIVE)
+	if (!mthca_is_memfree(dev))
 		return 0;
 
 	dev->db_tab = kmalloc(sizeof *dev->db_tab, GFP_KERNEL);
@@ -504,7 +504,7 @@
 	int i;
 	u8 status;
 
-	if (dev->hca_type != ARBEL_NATIVE)
+	if (!mthca_is_memfree(dev))
 		return;
 
 	/*
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:29.493558440 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_mr.c	2005-04-01 12:38:30.822270013 -0800
@@ -181,7 +181,7 @@
 	if (seg == -1)
 		return -1;
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		if (mthca_table_get_range(dev, dev->mr_table.mtt_table, seg,
 					  seg + (1 << order) - 1)) {
 			mthca_buddy_free(buddy, seg, order);
@@ -196,7 +196,7 @@
 {
 	mthca_buddy_free(buddy, seg, order);
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_table_put_range(dev, dev->mr_table.mtt_table, seg,
 				      seg + (1 << order) - 1);
 }
@@ -223,7 +223,7 @@
 
 static inline u32 hw_index_to_key(struct mthca_dev *dev, u32 ind)
 {
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		return arbel_hw_index_to_key(ind);
 	else
 		return tavor_hw_index_to_key(ind);
@@ -231,7 +231,7 @@
 
 static inline u32 key_to_hw_index(struct mthca_dev *dev, u32 key)
 {
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		return arbel_key_to_hw_index(key);
 	else
 		return tavor_key_to_hw_index(key);
@@ -254,7 +254,7 @@
 		return -ENOMEM;
 	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
 		if (err)
 			goto err_out_mpt_free;
@@ -299,7 +299,7 @@
 	return err;
 
 err_out_table:
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_table_put(dev, dev->mr_table.mpt_table, key);
 
 err_out_mpt_free:
@@ -329,7 +329,7 @@
 		return -ENOMEM;
 	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
 		if (err)
 			goto err_out_mpt_free;
@@ -437,7 +437,7 @@
 	mthca_free_mtt(dev, mr->first_seg, mr->order, &dev->mr_table.mtt_buddy);
 
 err_out_table:
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_table_put(dev, dev->mr_table.mpt_table, key);
 
 err_out_mpt_free:
@@ -452,7 +452,7 @@
 	if (order >= 0)
 		mthca_free_mtt(dev, first_seg, order, buddy);
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_table_put(dev, dev->mr_table.mpt_table,
 				arbel_key_to_hw_index(lkey));
 
@@ -498,7 +498,7 @@
 		return -EINVAL;
 
 	/* For Arbel, all MTTs must fit in the same page. */
-	if (dev->hca_type == ARBEL_NATIVE &&
+	if (mthca_is_memfree(dev) &&
 	    mr->attr.max_pages * sizeof *mr->mem.arbel.mtts > PAGE_SIZE)
 		return -EINVAL;
 
@@ -511,7 +511,7 @@
 	idx = key & (dev->limits.num_mpts - 1);
 	mr->ibmr.rkey = mr->ibmr.lkey = hw_index_to_key(dev, key);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		err = mthca_table_get(dev, dev->mr_table.mpt_table, key);
 		if (err)
 			goto err_out_mpt_free;
@@ -534,7 +534,7 @@
 
 	mtt_seg = mr->first_seg * MTHCA_MTT_SEG_SIZE;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table,
 						      mr->first_seg);
 		BUG_ON(!mr->mem.arbel.mtts);
@@ -596,7 +596,7 @@
 		       dev->mr_table.fmr_mtt_buddy);
 
 err_out_table:
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_table_put(dev, dev->mr_table.mpt_table, key);
 
 err_out_mpt_free:
@@ -765,7 +765,7 @@
 	if (err)
 		return err;
 
-	if (dev->hca_type != ARBEL_NATIVE &&
+	if (!mthca_is_memfree(dev) &&
 	    (dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN))
 		dev->limits.fmr_reserved_mtts = 0;
 	else
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:29.480561261 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_profile.c	2005-04-01 12:38:30.785278043 -0800
@@ -116,11 +116,11 @@
 		profile[i].type     = i;
 		profile[i].log_num  = max(ffs(profile[i].num) - 1, 0);
 		profile[i].size    *= profile[i].num;
-		if (dev->hca_type == ARBEL_NATIVE)
+		if (mthca_is_memfree(dev))
 			profile[i].size = max(profile[i].size, (u64) PAGE_SIZE);
 	}
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		mem_base  = 0;
 		mem_avail = dev_lim->hca.arbel.max_icm_sz;
 	} else {
@@ -165,7 +165,7 @@
 				  (unsigned long long) profile[i].size);
 	}
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		mthca_dbg(dev, "HCA context memory: reserving %d KB\n",
 			  (int) (total_size >> 10));
 	else
@@ -267,7 +267,7 @@
 	 * out of the MR pool. They don't use additional memory, but
 	 * we assign them as part of the HCA profile anyway.
 	 */
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		dev->limits.fmr_reserved_mtts = 0;
 	else
 		dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:29.471563214 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:30.780279128 -0800
@@ -625,7 +625,7 @@
 	if (!mdev)
 		return 0;
 
-	if (mdev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(mdev)) {
 		list_for_each_entry(fmr, fmr_list, list)
 			mthca_arbel_fmr_unmap(mdev, to_mfmr(fmr));
 
@@ -710,7 +710,7 @@
 		dev->ib_dev.alloc_fmr            = mthca_alloc_fmr;
 		dev->ib_dev.unmap_fmr            = mthca_unmap_fmr;
 		dev->ib_dev.dealloc_fmr          = mthca_dealloc_fmr;
-		if (dev->hca_type == ARBEL_NATIVE)
+		if (mthca_is_memfree(dev))
 			dev->ib_dev.map_phys_fmr = mthca_arbel_map_phys_fmr;
 		else
 			dev->ib_dev.map_phys_fmr = mthca_tavor_map_phys_fmr;
@@ -720,7 +720,7 @@
 	dev->ib_dev.detach_mcast         = mthca_multicast_detach;
 	dev->ib_dev.process_mad          = mthca_process_mad;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq;
 		dev->ib_dev.post_send     = mthca_arbel_post_send;
 		dev->ib_dev.post_recv     = mthca_arbel_post_receive;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:26.181277444 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:30.827268928 -0800
@@ -639,7 +639,7 @@
 	else if (attr_mask & IB_QP_PATH_MTU)
 		qp_context->mtu_msgmax = (attr->path_mtu << 5) | 31;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		qp_context->rq_size_stride =
 			((ffs(qp->rq.max) - 1) << 3) | (qp->rq.wqe_shift - 4);
 		qp_context->sq_size_stride =
@@ -731,7 +731,7 @@
 		qp_context->next_send_psn = cpu_to_be32(attr->sq_psn);
 	qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn);
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		qp_context->snd_wqe_base_l = cpu_to_be32(qp->send_wqe_offset);
 		qp_context->snd_db_index   = cpu_to_be32(qp->sq.db_index);
 	}
@@ -822,7 +822,7 @@
 
 	qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn);
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		qp_context->rcv_db_index   = cpu_to_be32(qp->rq.db_index);
 
 	if (attr_mask & IB_QP_QKEY) {
@@ -897,7 +897,7 @@
 		size += 2 * sizeof (struct mthca_data_seg);
 		break;
 	case UD:
-		if (dev->hca_type == ARBEL_NATIVE)
+		if (mthca_is_memfree(dev))
 			size += sizeof (struct mthca_arbel_ud_seg);
 		else
 			size += sizeof (struct mthca_tavor_ud_seg);
@@ -1016,7 +1016,7 @@
 {
 	int ret = 0;
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn);
 		if (ret)
 			return ret;
@@ -1057,7 +1057,7 @@
 static void mthca_free_memfree(struct mthca_dev *dev,
 			       struct mthca_qp *qp)
 {
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index);
 		mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
 		mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
@@ -1104,7 +1104,7 @@
 		return ret;
 	}
 
-	if (dev->hca_type == ARBEL_NATIVE) {
+	if (mthca_is_memfree(dev)) {
 		for (i = 0; i < qp->rq.max; ++i) {
 			wqe = get_recv_wqe(qp, i);
 			wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) <<
@@ -1127,7 +1127,7 @@
 {
 	int i;
 
-	if (dev->hca_type != ARBEL_NATIVE)
+	if (!mthca_is_memfree(dev))
 		return;
 
 	for (i = 0; 1 << i < qp->rq.max; ++i)
@@ -2011,7 +2011,7 @@
 	else
 		next = get_recv_wqe(qp, index);
 
-	if (dev->hca_type == ARBEL_NATIVE)
+	if (mthca_is_memfree(dev))
 		*dbd = 1;
 	else
 		*dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD));
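
As a rough illustration of the refactoring in this patch, the following
user-space sketch shows the shape of the helper and of one converted call
site. The demo_* names and struct demo_dev are hypothetical stand-ins, not
the driver's real types; only the body of the predicate is taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the real mthca headers. */
enum { TAVOR, ARBEL_COMPAT, ARBEL_NATIVE };

struct demo_dev {
	int hca_type;
};

/* Same test as the helper the patch adds to mthca_dev.h. */
static inline int demo_is_memfree(const struct demo_dev *dev)
{
	return dev->hca_type == ARBEL_NATIVE;
}

/* A call site now names the property it depends on, not a device model. */
static const char *demo_eq_ci_path(const struct demo_dev *dev)
{
	assert(dev != NULL);
	return demo_is_memfree(dev) ? "arbel" : "tavor";
}
```

Because every caller goes through the predicate, its body can later be
changed (for example to test a flag bit) without touching the call sites.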


From roland at topspin.com  Fri Apr  1 12:49:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:54 -0800
Subject: [openib-general] [PATCH][25/27] IB/mthca: map context for RDMA
	responder in mem-free mode
In-Reply-To: <2005411249.qaesrlpuSaCRRPRE@topspin.com>
Message-ID: <2005411249.Yyk7PJUeNHG0154S@topspin.com>

Fix RDMA in mem-free mode: we need to make sure that the RDMA context
memory is mapped for the HCA.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:30.772280864 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:31.661087929 -0800
@@ -222,6 +222,7 @@
 	struct mthca_array     	qp;
 	struct mthca_icm_table *qp_table;
 	struct mthca_icm_table *eqp_table;
+	struct mthca_icm_table *rdb_table;
 };
 
 struct mthca_av_table {
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:30.776279996 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:31.666086844 -0800
@@ -430,14 +430,25 @@
 		goto err_unmap_qp;
 	}
 
-	mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base,
+	mdev->qp_table.rdb_table = mthca_alloc_icm_table(mdev, init_hca->rdb_base,
+							 MTHCA_RDB_ENTRY_SIZE,
+							 mdev->limits.num_qps <<
+							 mdev->qp_table.rdb_shift,
+							 0, 0);
+	if (!mdev->qp_table.rdb_table) {
+		mthca_err(mdev, "Failed to map RDB context memory, aborting\n");
+		err = -ENOMEM;
+		goto err_unmap_eqp;
+	}
+
+       mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base,
 						     dev_lim->cqc_entry_sz,
 						     mdev->limits.num_cqs,
 						     mdev->limits.reserved_cqs, 0);
 	if (!mdev->cq_table.table) {
 		mthca_err(mdev, "Failed to map CQ context memory, aborting.\n");
 		err = -ENOMEM;
-		goto err_unmap_eqp;
+		goto err_unmap_rdb;
 	}
 
 	/*
@@ -463,6 +474,9 @@
 err_unmap_cq:
 	mthca_free_icm_table(mdev, mdev->cq_table.table);
 
+err_unmap_rdb:
+	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
+
 err_unmap_eqp:
 	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:30.827268928 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:31.673085325 -0800
@@ -1025,11 +1025,16 @@
 		if (ret)
 			goto err_qpc;
 
+		ret = mthca_table_get(dev, dev->qp_table.rdb_table,
+				      qp->qpn << dev->qp_table.rdb_shift);
+		if (ret)
+			goto err_eqpc;
+
 		qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ,
 						 qp->qpn, &qp->rq.db);
 		if (qp->rq.db_index < 0) {
 			ret = -ENOMEM;
-			goto err_eqpc;
+			goto err_rdb;
 		}
 
 		qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ,
@@ -1045,6 +1050,10 @@
 err_rq_db:
 	mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
 
+err_rdb:
+	mthca_table_put(dev, dev->qp_table.rdb_table,
+			qp->qpn << dev->qp_table.rdb_shift);
+
 err_eqpc:
 	mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
 
@@ -1060,6 +1069,8 @@
 	if (mthca_is_memfree(dev)) {
 		mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index);
 		mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
+		mthca_table_put(dev, dev->qp_table.rdb_table,
+				qp->qpn << dev->qp_table.rdb_shift);
 		mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
 		mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn);
 	}
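
The error handling this patch extends follows the usual acquire-in-order,
release-in-reverse pattern. A minimal user-space sketch of that pattern is
below; the counters and the fail_at knob are hypothetical and only stand in
for mthca_table_get()/mthca_table_put() on the QPC, EQPC and RDB tables.

```c
#include <stddef.h>

/* Hypothetical reference counts standing in for the three ICM tables. */
struct demo_tables {
	int qpc, eqpc, rdb;
	int fail_at;	/* 1..3 forces that step to fail; 0 means no failure */
};

/* Take one reference, or fail if this is the step chosen to fail. */
static int demo_get(int *ref, int step, int fail_at)
{
	if (fail_at == step)
		return -1;
	++*ref;
	return 0;
}

static int demo_map_memfree(struct demo_tables *t)
{
	int ret;

	ret = demo_get(&t->qpc, 1, t->fail_at);
	if (ret)
		return ret;

	ret = demo_get(&t->eqpc, 2, t->fail_at);
	if (ret)
		goto err_qpc;

	ret = demo_get(&t->rdb, 3, t->fail_at);
	if (ret)
		goto err_eqpc;

	return 0;

	/* Each label undoes only the steps that had already succeeded. */
err_eqpc:
	--t->eqpc;
err_qpc:
	--t->qpc;
	return ret;
}
```

A failed step jumps to the label that releases the step before it, so a
resource that was never acquired is never released.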


From roland at topspin.com  Fri Apr  1 12:49:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:54 -0800
Subject: [openib-general] [PATCH][26/27] IB/mthca: update receive queue
	initialization for new HCAs
In-Reply-To: <2005411249.Yyk7PJUeNHG0154S@topspin.com>
Message-ID: <2005411249.gE8d9QQAmCCNZRp6@topspin.com>

Update initialization of receive queue to match new documentation.
This change is required to support new MT25204 HCA.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:31.673085325 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-01 12:38:32.124987229 -0800
@@ -181,6 +181,10 @@
 	MTHCA_MLX_SLR        = 1 << 16
 };
 
+enum {
+	MTHCA_INVAL_LKEY = 0x100
+};
+
 struct mthca_next_seg {
 	u32 nda_op;		/* [31:6] next WQE [4:0] next opcode */
 	u32 ee_nds;		/* [31:8] next EE  [7] DBD [6] F [5:0] next WQE size */
@@ -1093,7 +1097,6 @@
 				 enum ib_sig_type send_policy,
 				 struct mthca_qp *qp)
 {
-	struct mthca_next_seg *wqe;
 	int ret;
 	int i;
 
@@ -1116,18 +1119,28 @@
 	}
 
 	if (mthca_is_memfree(dev)) {
+		struct mthca_next_seg *next;
+		struct mthca_data_seg *scatter;
+		int size = (sizeof (struct mthca_next_seg) +
+			    qp->rq.max_gs * sizeof (struct mthca_data_seg)) / 16;
+
 		for (i = 0; i < qp->rq.max; ++i) {
-			wqe = get_recv_wqe(qp, i);
-			wqe->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) <<
-						  qp->rq.wqe_shift);
-			wqe->ee_nds = cpu_to_be32(1 << (qp->rq.wqe_shift - 4));
+			next = get_recv_wqe(qp, i);
+			next->nda_op = cpu_to_be32(((i + 1) & (qp->rq.max - 1)) <<
+						   qp->rq.wqe_shift);
+			next->ee_nds = cpu_to_be32(size);
+
+			for (scatter = (void *) (next + 1);
+			     (void *) scatter < (void *) next + (1 << qp->rq.wqe_shift);
+			     ++scatter)
+				scatter->lkey = cpu_to_be32(MTHCA_INVAL_LKEY);
 		}
 
 		for (i = 0; i < qp->sq.max; ++i) {
-			wqe = get_send_wqe(qp, i);
-			wqe->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) <<
-						   qp->sq.wqe_shift) +
-						  qp->send_wqe_offset);
+			next = get_send_wqe(qp, i);
+			next->nda_op = cpu_to_be32((((i + 1) & (qp->sq.max - 1)) <<
+						    qp->sq.wqe_shift) +
+						   qp->send_wqe_offset);
 		}
 	}
 
@@ -1986,7 +1999,7 @@
 
 		if (i < qp->rq.max_gs) {
 			((struct mthca_data_seg *) wqe)->byte_count = 0;
-			((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(0x100);
+			((struct mthca_data_seg *) wqe)->lkey = cpu_to_be32(MTHCA_INVAL_LKEY);
 			((struct mthca_data_seg *) wqe)->addr = 0;
 		}
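
The receive WQE setup in this patch can be sketched in user space as
follows. The struct layouts are an assumption (a 16-byte next segment and
16-byte data segments; the patch only shows the first two fields of
mthca_next_seg), the demo_* names are hypothetical, and the sketch stores
values in host byte order where the driver uses cpu_to_be32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_INVAL_LKEY = 0x100 };

/* Assumed layouts: both segments are 16 bytes. */
struct demo_next_seg { uint32_t nda_op, ee_nds, flags, imm; };
struct demo_data_seg { uint32_t byte_count, lkey; uint64_t addr; };

/* WQE size in 16-byte units: one next segment plus max_gs data segments. */
static uint32_t demo_rq_wqe_size(int max_gs)
{
	return (sizeof (struct demo_next_seg) +
		max_gs * sizeof (struct demo_data_seg)) / 16;
}

/*
 * Initialize one receive WQE of (1 << wqe_shift) bytes: record its size
 * and mark every scatter slot up to the end of the stride as invalid.
 * Returns the number of scatter slots written.
 */
static int demo_init_recv_wqe(void *wqe, int wqe_shift, int max_gs)
{
	struct demo_next_seg *next = wqe;
	struct demo_data_seg *scatter;
	char *end = (char *) wqe + ((size_t) 1 << wqe_shift);
	int n = 0;

	assert(((size_t) 1 << wqe_shift) >= sizeof *next);

	next->ee_nds = demo_rq_wqe_size(max_gs);

	for (scatter = (struct demo_data_seg *) (next + 1);
	     (char *) scatter < end;
	     ++scatter) {
		scatter->lkey = DEMO_INVAL_LKEY;
		++n;
	}

	return n;
}
```

With a 64-byte stride and max_gs of 2, the size field is 3 and three
scatter slots are marked, since the loop runs to the end of the stride
rather than stopping at max_gs.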
 

From roland at topspin.com  Fri Apr  1 12:49:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 1 Apr 2005 12:49:54 -0800
Subject: [openib-general] [PATCH][27/27] IB/mthca: add support for new
	MT25204 HCA
In-Reply-To: <2005411249.gE8d9QQAmCCNZRp6@topspin.com>
Message-ID: <2005411249.RHQWyM8AFcqb1PM4@topspin.com>

Decouple table of HCA features from exact HCA device type.  Add a
current FW version field so we can warn when someone is using old FW.
Add support for new MT25204 HCA.

Remove the warning about mem-free support, since it should be pretty
solid at this point.

Signed-off-by: Roland Dreier <roland at topspin.com>


--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:31.661087929 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-01 12:38:32.606882623 -0800
@@ -49,20 +49,15 @@
 #define DRV_VERSION	"0.06-pre"
 #define DRV_RELDATE	"November 8, 2004"
 
-/* Types of supported HCA */
-enum {
-	TAVOR,			/* MT23108                        */
-	ARBEL_COMPAT,		/* MT25208 in Tavor compat mode   */
-	ARBEL_NATIVE		/* MT25208 with extended features */
-};
-
 enum {
 	MTHCA_FLAG_DDR_HIDDEN = 1 << 1,
 	MTHCA_FLAG_SRQ        = 1 << 2,
 	MTHCA_FLAG_MSI        = 1 << 3,
 	MTHCA_FLAG_MSI_X      = 1 << 4,
 	MTHCA_FLAG_NO_LAM     = 1 << 5,
-	MTHCA_FLAG_FMR        = 1 << 6
+	MTHCA_FLAG_FMR        = 1 << 6,
+	MTHCA_FLAG_MEMFREE    = 1 << 7,
+	MTHCA_FLAG_PCIE       = 1 << 8
 };
 
 enum {
@@ -473,7 +468,7 @@
 
 static inline int mthca_is_memfree(struct mthca_dev *dev)
 {
-	return dev->hca_type == ARBEL_NATIVE;
+	return dev->mthca_flags & MTHCA_FLAG_MEMFREE;
 }
 
 #endif /* MTHCA_DEV_H */
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:31.666086844 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-01 12:38:32.611881538 -0800
@@ -103,7 +103,7 @@
 				  "aborting.\n");
 			return -ENODEV;
 		}
-	} else if (mdev->hca_type == TAVOR)
+	} else if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE))
 		mthca_info(mdev, "No PCI-X capability, not setting RBC.\n");
 
 	cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP);
@@ -119,8 +119,7 @@
 				  "register, aborting.\n");
 			return -ENODEV;
 		}
-	} else if (mdev->hca_type == ARBEL_NATIVE ||
-		   mdev->hca_type == ARBEL_COMPAT)
+	} else if (mdev->mthca_flags & MTHCA_FLAG_PCIE)
 		mthca_info(mdev, "No PCI Express capability, "
 			   "not setting Max Read Request Size.\n");
 
@@ -438,7 +437,7 @@
 	if (!mdev->qp_table.rdb_table) {
 		mthca_err(mdev, "Failed to map RDB context memory, aborting\n");
 		err = -ENOMEM;
-		goto err_unmap_eqp;
+		goto err_unmap_rdb;
 	}
 
        mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base,
@@ -593,6 +592,7 @@
 
 err_free_icm:
 	mthca_free_icm_table(mdev, mdev->cq_table.table);
+	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
 	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
 	mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
 	mthca_free_icm_table(mdev, mdev->mr_table.mpt_table);
@@ -851,6 +851,7 @@
 
 	if (mthca_is_memfree(mdev)) {
 		mthca_free_icm_table(mdev, mdev->cq_table.table);
+		mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
 		mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
 		mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
 		mthca_free_icm_table(mdev, mdev->mr_table.mpt_table);
@@ -869,11 +870,32 @@
 		mthca_SYS_DIS(mdev, &status);
 }
 
+/* Types of supported HCA */
+enum {
+	TAVOR,			/* MT23108                        */
+	ARBEL_COMPAT,		/* MT25208 in Tavor compat mode   */
+	ARBEL_NATIVE,		/* MT25208 with extended features */
+	SINAI			/* MT25204 */
+};
+
+#define MTHCA_FW_VER(major, minor, subminor) \
+	(((u64) (major) << 32) | ((u64) (minor) << 16) | (u64) (subminor))
+
+static struct {
+	u64 latest_fw;
+	int is_memfree;
+	int is_pcie;
+} mthca_hca_table[] = {
+	[TAVOR]        = { .latest_fw = MTHCA_FW_VER(3, 3, 2), .is_memfree = 0, .is_pcie = 0 },
+	[ARBEL_COMPAT] = { .latest_fw = MTHCA_FW_VER(4, 6, 2), .is_memfree = 0, .is_pcie = 1 },
+	[ARBEL_NATIVE] = { .latest_fw = MTHCA_FW_VER(5, 0, 1), .is_memfree = 1, .is_pcie = 1 },
+	[SINAI]        = { .latest_fw = MTHCA_FW_VER(1, 0, 1), .is_memfree = 1, .is_pcie = 1 }
+};
+
 static int __devinit mthca_init_one(struct pci_dev *pdev,
 				    const struct pci_device_id *id)
 {
 	static int mthca_version_printed = 0;
-	static int mthca_memfree_warned = 0;
 	int ddr_hidden = 0;
 	int err;
 	struct mthca_dev *mdev;
@@ -886,6 +908,12 @@
 	printk(KERN_INFO PFX "Initializing %s (%s)\n",
 	       pci_pretty_name(pdev), pci_name(pdev));
 
+	if (id->driver_data >= ARRAY_SIZE(mthca_hca_table)) {
+		printk(KERN_ERR PFX "%s (%s) has invalid driver data %lx\n",
+		       pci_pretty_name(pdev), pci_name(pdev), id->driver_data);
+		return -ENODEV;
+	}
+
 	err = pci_enable_device(pdev);
 	if (err) {
 		dev_err(&pdev->dev, "Cannot enable PCI device, "
@@ -950,15 +978,14 @@
 		goto err_free_res;
 	}
 
-	mdev->pdev     = pdev;
-	mdev->hca_type = id->driver_data;
-
-	if (mthca_is_memfree(mdev) && !mthca_memfree_warned++)
-		mthca_warn(mdev, "Warning: native MT25208 mode support is incomplete.  "
-			   "Your HCA may not work properly.\n");
+	mdev->pdev = pdev;
 
 	if (ddr_hidden)
 		mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN;
+	if (mthca_hca_table[id->driver_data].is_memfree)
+		mdev->mthca_flags |= MTHCA_FLAG_MEMFREE;
+	if (mthca_hca_table[id->driver_data].is_pcie)
+		mdev->mthca_flags |= MTHCA_FLAG_PCIE;
 
 	/*
 	 * Now reset the HCA before we touch the PCI capabilities or
@@ -997,6 +1024,16 @@
 	if (err)
 		goto err_iounmap;
 
+	if (mdev->fw_ver < mthca_hca_table[id->driver_data].latest_fw) {
+		mthca_warn(mdev, "HCA FW version %x.%x.%x is old (%x.%x.%x is current).\n",
+			   (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff,
+			   (int) (mdev->fw_ver & 0xffff),
+			   (int) (mthca_hca_table[id->driver_data].latest_fw >> 32),
+			   (int) (mthca_hca_table[id->driver_data].latest_fw >> 16) & 0xffff,
+			   (int) (mthca_hca_table[id->driver_data].latest_fw & 0xffff));
+		mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n");
+	}
+
 	err = mthca_setup_hca(mdev);
 	if (err)
 		goto err_close;
@@ -1112,6 +1149,14 @@
 	  .driver_data = ARBEL_NATIVE },
 	{ PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL),
 	  .driver_data = ARBEL_NATIVE },
+	{ PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SINAI),
+	  .driver_data = SINAI },
+	{ PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_SINAI),
+	  .driver_data = SINAI },
+	{ PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SINAI_OLD),
+	  .driver_data = SINAI },
+	{ PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_SINAI_OLD),
+	  .driver_data = SINAI },
 	{ 0, }
 };
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:30.780279128 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-01 12:38:32.615880670 -0800
@@ -659,11 +659,18 @@
 static ssize_t show_hca(struct class_device *cdev, char *buf)
 {
 	struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev);
-	switch (dev->hca_type) {
-	case TAVOR:        return sprintf(buf, "MT23108\n");
-	case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n");
-	case ARBEL_NATIVE: return sprintf(buf, "MT25208\n");
-	default:           return sprintf(buf, "unknown\n");
+	switch (dev->pdev->device) {
+	case PCI_DEVICE_ID_MELLANOX_TAVOR:
+		return sprintf(buf, "MT23108\n");
+	case PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT:
+		return sprintf(buf, "MT25208 (MT23108 compat mode)\n");
+	case PCI_DEVICE_ID_MELLANOX_ARBEL:
+		return sprintf(buf, "MT25208\n");
+	case PCI_DEVICE_ID_MELLANOX_SINAI:
+	case PCI_DEVICE_ID_MELLANOX_SINAI_OLD:
+		return sprintf(buf, "MT25204\n");
+	default:
+		return sprintf(buf, "unknown\n");
 	}
 }
 
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_reset.c	2005-03-31 19:06:41.000000000 -0800
+++ linux-export/drivers/infiniband/hw/mthca/mthca_reset.c	2005-04-01 12:38:32.594885228 -0800
@@ -63,7 +63,7 @@
 	 * header as well.
 	 */
 
-	if (mdev->hca_type == TAVOR) {
+	if (!(mdev->mthca_flags & MTHCA_FLAG_PCIE)) {
 		/* Look for the bridge -- its device ID will be 2 more
 		   than HCA's device ID. */
 		while ((bridge = pci_get_device(mdev->pdev->vendor,
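
The feature table and firmware version check added by this patch can be
sketched in user space roughly as follows. The table values and the
version packing are copied from the patch; the demo_* names, the helper
functions and their return conventions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same packing as MTHCA_FW_VER: major in bits 32 and up, minor in 31:16. */
#define DEMO_FW_VER(major, minor, subminor) \
	(((uint64_t) (major) << 32) | ((uint64_t) (minor) << 16) | (uint64_t) (subminor))

enum { DEMO_FLAG_MEMFREE = 1 << 7, DEMO_FLAG_PCIE = 1 << 8 };
enum { TAVOR, ARBEL_COMPAT, ARBEL_NATIVE, SINAI };

static const struct {
	uint64_t latest_fw;
	int is_memfree;
	int is_pcie;
} demo_hca_table[] = {
	[TAVOR]        = { DEMO_FW_VER(3, 3, 2), 0, 0 },
	[ARBEL_COMPAT] = { DEMO_FW_VER(4, 6, 2), 0, 1 },
	[ARBEL_NATIVE] = { DEMO_FW_VER(5, 0, 1), 1, 1 },
	[SINAI]        = { DEMO_FW_VER(1, 0, 1), 1, 1 }
};

#define DEMO_TABLE_SIZE (sizeof demo_hca_table / sizeof demo_hca_table[0])

/* Translate a table index into feature flags; -1 if out of range. */
static int demo_flags_for(unsigned long driver_data)
{
	int flags = 0;

	if (driver_data >= DEMO_TABLE_SIZE)
		return -1;
	if (demo_hca_table[driver_data].is_memfree)
		flags |= DEMO_FLAG_MEMFREE;
	if (demo_hca_table[driver_data].is_pcie)
		flags |= DEMO_FLAG_PCIE;
	return flags;
}

/* Packed versions order correctly under a plain integer comparison. */
static int demo_fw_is_old(unsigned long driver_data, uint64_t fw_ver)
{
	if (driver_data >= DEMO_TABLE_SIZE)
		return 0;
	return fw_ver < demo_hca_table[driver_data].latest_fw;
}

/* Unpack a version into "major.minor.subminor" (hex, as in the warning). */
static int demo_format_fw(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%x.%x.%x",
			(unsigned) (v >> 32),
			(unsigned) ((v >> 16) & 0xffff),
			(unsigned) (v & 0xffff));
}
```

Keeping the major number in the highest bits is what lets the check be a
single unsigned comparison instead of a field-by-field one.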


From tduffy at sun.com  Fri Apr  1 13:02:18 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 01 Apr 2005 13:02:18 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into
	mtcha code WAS: [openib-commits] r2101 -
	gen2/trunk/src/linux-kernel/patches
In-Reply-To: <20050401184346.GD11094@esmail.cup.hp.com>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
	<20050401184346.GD11094@esmail.cup.hp.com>
Message-ID: <1112389338.14094.7.camel@duffman>

On Fri, 2005-04-01 at 10:43 -0800, Grant Grundler wrote:
> No - I think Roland is doing the right thing with a separate patch.
> I ran into the same issue since I'm still poking at 2.6.11.

I thought the consensus was that openib gen2 trunk would always build
against the latest stable 2.6.x kernel, in this case 2.6.11 which
doesn't have the SINAI defines.

-tduffy

From iod00d at hp.com  Fri Apr  1 13:32:49 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 13:32:49 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha
	code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <1112389338.14094.7.camel@duffman>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
	<20050401184346.GD11094@esmail.cup.hp.com>
	<1112389338.14094.7.camel@duffman>
Message-ID: <20050401213249.GF11094@esmail.cup.hp.com>

On Fri, Apr 01, 2005 at 01:02:18PM -0800, Tom Duffy wrote:
> I thought the consensus was that openib gen2 trunk would always build
> against the latest stable 2.6.x kernel, in this case 2.6.11 which
> doesn't have the SINAI defines.

hrm...you are right. Forgot about that.
Can we add the SINAI #defines to a local "compat.h" file?

grant


From peter at pantasys.com  Fri Apr  1 13:45:14 2005
From: peter at pantasys.com (Peter Buckingham)
Date: Fri, 01 Apr 2005 13:45:14 -0800
Subject: [openib-general] uverbs and OSU MPI/MPI in general?
In-Reply-To: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu>
References: <200504012013.j31KDqus007452@xi.cse.ohio-state.edu>
Message-ID: <424DC0EA.3070701@pantasys.com>

Dhabaleswar Panda wrote:
> Peter, 
> 
> 
>>    Peter> Hi All, How does gen2's uverbs compare to VAPI? Is it meant
>>    Peter> to be the same API? Should OSU's MPI run on top of this or
>>    Peter> is there some other MPI implementation that will be able to
>>    Peter> run 'natively' over IB?
>>
>>The basic functionality is the same but the API is different.  For
>>example completion events are handled in a different way that allows
>>better performance.
>>
>>None of the current MPI implementations that use IB will run
>>unmodified, but everyone (including OSU) is porting to the new API.
> 
> 
> We have already started working on porting OSU MPI to the Gen2 stack.
> 
> We plan to release MVAPICH 0.9.5 (on VAPI stack) during the next 1-2
> weeks.  After that we will make a subsequent release of 0.9.5 on the
> OpenIB Gen2 stack.

excellent! thanks for the info.

peter


From roland at topspin.com  Fri Apr  1 14:01:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 14:01:19 -0800
Subject: [openib-general] [PATCH][MTHCA] add in SINAI defines into mtcha
	code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
In-Reply-To: <20050401213249.GF11094@esmail.cup.hp.com> (Grant Grundler's
	message of "Fri, 1 Apr 2005 13:32:49 -0800")
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
	<20050401184346.GD11094@esmail.cup.hp.com>
	<1112389338.14094.7.camel@duffman>
	<20050401213249.GF11094@esmail.cup.hp.com>
Message-ID: <52ll82i2c0.fsf@topspin.com>

I could go either way on this.  The point of the patches directory is
to make the trunk build against the current 2.6.11 tree.  On the other
hand, Tom's patch doesn't break anything (since the new symbols are
inside an #ifdef) and I know I can trust Tom to remember to take it
out once 2.6.12 comes out.

So I think I'll go ahead and commit this change.

 - R.


From roland at topspin.com  Fri Apr  1 14:06:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 01 Apr 2005 14:06:50 -0800
Subject: [openib-general] [PATCH][26.5/27] Add MT25204 PCI IDs
In-Reply-To: <2005411249.RHQWyM8AFcqb1PM4@topspin.com> (Roland Dreier's
	message of "Fri, 1 Apr 2005 12:49:54 -0800")
References: <2005411249.RHQWyM8AFcqb1PM4@topspin.com>
Message-ID: <52hdiqi22t.fsf@topspin.com>

Ugh, this patch is required to build support for the new Mellanox
HCAs.  Greg K-H applied it to his tree a while ago but it hasn't made
it to Linus yet.

Sorry,
  Roland

Add PCI device IDs for new Mellanox MT25204 "Sinai" InfiniHost III Lx HCA.

Signed-off-by: Roland Dreier <roland at topspin.com>

--- linux-export.orig/include/linux/pci_ids.h	2005-03-31 19:07:14.000000000 -0800
+++ linux-export/include/linux/pci_ids.h	2005-04-01 14:03:16.468519075 -0800
@@ -2122,6 +2122,8 @@
 #define PCI_DEVICE_ID_MELLANOX_TAVOR	0x5a44
 #define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278
 #define PCI_DEVICE_ID_MELLANOX_ARBEL	0x6282
+#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
+#define PCI_DEVICE_ID_MELLANOX_SINAI	0x6274
 
 #define PCI_VENDOR_ID_PDC		0x15e9
 #define PCI_DEVICE_ID_PDC_1841		0x1841
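
For anyone who wants to check the ID-to-name mapping outside the kernel,
here is a minimal userspace sketch. The device IDs and name strings are
copied from the hunks in this thread; ex_hca_name() is a made-up helper
for illustration and is not a function in mthca.

```c
#include <stdint.h>

/* Device IDs as they appear in the pci_ids.h hunks in this thread. */
#define PCI_DEVICE_ID_MELLANOX_TAVOR        0x5a44
#define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278
#define PCI_DEVICE_ID_MELLANOX_ARBEL        0x6282
#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD    0x5e8c
#define PCI_DEVICE_ID_MELLANOX_SINAI        0x6274

/* Hypothetical helper: map a PCI device ID to the HCA name string. */
static const char *ex_hca_name(uint16_t device_id)
{
	switch (device_id) {
	case PCI_DEVICE_ID_MELLANOX_TAVOR:
		return "MT23108";
	case PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT:
		return "MT25208 (MT23108 compat mode)";
	case PCI_DEVICE_ID_MELLANOX_ARBEL:
		return "MT25208";
	case PCI_DEVICE_ID_MELLANOX_SINAI:
	case PCI_DEVICE_ID_MELLANOX_SINAI_OLD:
		/* Both Sinai IDs report the same part name. */
		return "MT25204";
	default:
		return "unknown";
	}
}
```

Keying on the PCI device ID rather than a driver-private hca_type enum
means a new ID only needs a new case label here.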


From mshefty at ichips.intel.com  Fri Apr  1 14:12:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 01 Apr 2005 14:12:40 -0800
Subject: [openib-general] [RMPP] RMPP formatting assumptions
In-Reply-To: <424C92CE.7040709@ichips.intel.com>
References: <42488FDF.2050608@ichips.intel.com>
	<424C92CE.7040709@ichips.intel.com>
Message-ID: <424DC758.4040504@ichips.intel.com>

Sean Hefty wrote:
> The payload field in the RMPP header should be set to the size of the 
> class specific header plus the number of valid bytes of user data in the 
> data buffer.  The RMPP code will adjust the payload value to account for 
> multiple headers.

Doing this brings up the issue with the byte-ordering of the payload 
value set by the user.  On one hand, the value is used to communicate 
with the RMPP code, so could be in host-order.  But on the other hand, 
the value is in a data structure where all of the other fields are in 
network-byte order...

I'm leaning towards network-byte order, which would set the payload to 
the correct value for a single-segment RMPP MAD.

- Sean
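
To make the network-byte-order option concrete, here is a rough sketch
of what the caller would do. The struct is a stand-in holding only the
field under discussion, the 32-bit width is an assumption, and the
ex_* names are made up for illustration; none of this is the actual
RMPP code.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Stand-in for the RMPP header: only the payload length field. */
struct ex_rmpp_hdr {
	uint32_t paylen_be;	/* assumed 32-bit, network byte order */
};

/*
 * Caller side: payload = class specific header size plus the number
 * of valid bytes of user data, stored in network byte order.
 */
static void ex_set_paylen(struct ex_rmpp_hdr *hdr,
			  uint32_t class_hdr_len, uint32_t valid_data_len)
{
	hdr->paylen_be = htonl(class_hdr_len + valid_data_len);
}

/* RMPP side: convert back to host order before doing arithmetic. */
static uint32_t ex_get_paylen(const struct ex_rmpp_hdr *hdr)
{
	return ntohl(hdr->paylen_be);
}
```

With this convention a single-segment MAD already carries the on-wire
value, and only the multi-segment case needs the RMPP code to convert,
adjust, and convert back.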


From libor at topspin.com  Fri Apr  1 14:54:14 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 1 Apr 2005 14:54:14 -0800
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <528y42laxk.fsf@topspin.com>;
	from roland@topspin.com on Fri, Apr 01, 2005 at 08:27:19AM -0800
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com>
	<528y42laxk.fsf@topspin.com>
Message-ID: <20050401145414.B2870@topspin.com>

On Fri, Apr 01, 2005 at 08:27:19AM -0800, Roland Dreier wrote:
>     Fab> If you are blessed with a Tavor PRM, see section 8.2.1.6 (in
>     Fab> PRM 1.0.0).  It states that a length of zero in a data
>     Fab> segment indicates a 2GB transfer (MSb is used as a flag to
>     Fab> indicate normal vs. inline data segments).  A zero-byte
>     Fab> request must not reference any data segments.
> 
> Yup, that must be the problem.  I guess mthca can skip over 0-length
> data segments.  Another option would be to say that such work requests
> aren't allowed.  Not sure which way I think we should go.  I need to
> talk to Libor and find out why SDP is generating such requests.

Roland,

  Can you try this patch? It should close a gap and prevent a
zero-length IOCB from getting into the receive data path.

Thanks.

-Libor

Index: sdp_recv.c
===================================================================
--- sdp_recv.c	(revision 2094)
+++ sdp_recv.c	(working copy)
@@ -297,13 +297,13 @@
 	 * if there is no more advertised space,  queue the
 	 * advertisment for completion
 	 */
-	if (advt->size <= 0)
+	if (!advt->size)
 		sdp_advt_q_put(&conn->src_actv,
 			       sdp_advt_q_get(&conn->src_pend));
 	/*
 	 * if there is no more iocb space queue the it for completion
 	 */
-	if (iocb->len <= 0) {
+	if (!iocb->len) {
 		iocb = sdp_iocb_q_get_head(&conn->r_pend);
 		if (!iocb) {
 			sdp_dbg_warn(conn, "read IOCB disappeared. <%d>",
@@ -1368,26 +1371,11 @@
 		 * RDMA advertisements are checked to determine if remote
 		 * data is pending and accessible.
 		 */
-		if (!(copied < low_water) &&
-		    !conn->src_recv) {
-#if 0 /* performance cheat. LM */
-			if (!(conn->snk_zthresh > size)) {
+		if (copied == size)
+			break;
 
-				conn->nond_recv--;
-
-				result = sdp_send_ctrl_snk_avail(conn,
-								 0, 0, 0);
-				if (result < 0) {
-					/*
-					 * since the message did not go out,
-					 * back out the non_discard counter
-					 */
-					conn->nond_recv++;
-				}
-			}
-#endif
+		if (!(copied < low_water) && !conn->src_recv)
 			break;
-		}
 		/*
 		 * check connection errors, and then wait for more data.
 		 * check status. POSIX 1003.1g order.
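
For reference, the zero-length rule Fab quoted can be illustrated as
below. This is only a sketch of the behaviour as described in this
thread (a length of zero in a normal data segment means 2GB, and the
MSb is a flag rather than part of the length); the struct, the mask,
and the ex_* names are assumptions for illustration and are not taken
from the PRM or from mthca.

```c
#include <stdint.h>

/* Assumed layout: MSb flags an inline segment, low 31 bits are length. */
#define EX_SEG_INLINE_FLAG 0x80000000u

/* Simplified scatter/gather entry for the sketch. */
struct ex_sge {
	uint64_t addr;
	uint32_t length;
};

/* Length a normal data segment would transfer: zero is read as 2GB. */
static uint64_t ex_seg_effective_len(uint32_t byte_count)
{
	uint32_t len = byte_count & ~EX_SEG_INLINE_FLAG;

	return len ? (uint64_t)len : (UINT64_C(1) << 31);
}

/*
 * Count the entries that actually carry data, so a zero-byte request
 * ends up referencing no data segments at all.
 */
static int ex_count_data_segs(const struct ex_sge *sg, int num_sge)
{
	int i, used = 0;

	for (i = 0; i < num_sge; i++)
		if (sg[i].length)
			used++;
	return used;
}
```

Skipping empty entries in the driver and rejecting such work requests
outright are both consistent with this; the sketch only shows the
skipping variant.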


From halr at voltaire.com  Fri Apr  1 14:56:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 17:56:00 -0500
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <20050331134116.B1541@topspin.com>
References: <20050327153112.GA26108@mellanox.co.il>
	<20050328170351.B30499@topspin.com>
	<20050330010814.GB24794@esmail.cup.hp.com>
	<20050329181228.H31683@topspin.com>
	<20050330032904.GA24936@esmail.cup.hp.com>
	<20050330164349.B32764@topspin.com>
	<1112304328.7331.42.camel@localhost.localdomain>
	<20050331134116.B1541@topspin.com>
Message-ID: <1112395932.4476.217.camel@localhost.localdomain>

On Thu, 2005-03-31 at 16:41, Libor Michalek wrote: 
> On Thu, Mar 31, 2005 at 04:25:28PM -0500, Hal Rosenstock wrote:
> > On Wed, 2005-03-30 at 19:43, Libor Michalek wrote:
> > > The program has a decent help for available parameters, but here are
> > > some reasonable defaults:
> > > 
> > >   server:
> > > 
> > >     ./ttcp.aio.x -r -l 65536 -a 20
> > > 
> > >   client:
> > > 
> > >     ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.0.100
> > 
> > Are these the parameters used to achieve the throughput numbers you
> > published ?
> > 
> > Sounds like you tweaked the numbers in sdp_dev.h. Anywhere else ?
> > 
> > Can you provide the tuning numbers used and where they were found so these
> > results can be reproduced ?
> 
>   No tweaking or changes to the SDP code itself. The parameters above 
> should give similar results, but here are the exact parameters I used
> for the two async tests I mentioned in the original results I posted.
> 
> > > For async socket I kept 20 96K buffers in flight. For the FMR pool cache 
> > > hit async results I used only 20 different buffers. 
> 
>      ./ttcp.aio.x -r -l 98304 -a 20 -f M
>      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100
> 
> > > For the FMR pool cache miss async results I used 1000 different
> > > buffers, of which only 20 were in flight at a time.
> 
>      ./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M
>      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100

We are seeing issues with both buffer size and iterations. We get back
-ENOMEM and also see VMA lock errors. Are the 2 related ? Should we turn
on SDP debug to see what specifically can't be allocated ? In that case,
what could be done ?

When using the default parameters, we see the following:

On the server:

[root at openib1 ~]# ./ttcp.aio.x -r -l 65536 -a 20
ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
ttcp-r: socket
ttcp-r: accept from 192.168.1.4
ttcp-r: Event error <-12> <5275648>
ttcp-r: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-r: 2 I/O calls, usec/call = 112.00, calls/sec = 8928.57
ttcp-r: user: 0 sys: 0 total: 0 real: 224 (microseconds)
[root at openib1 ~]#

On the client:

[root at openib2 ~]# ./ttcp.aio.x -t -l 65536 -n 100000 -a 20 192.168.1.3
ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5001
192.168.1.3
ttcp-t: socket
ttcp-t: connect
ttcp-t: Event error <-12> <5275648>
ttcp-t: 0 bytes in 0.00 real seconds = 0.00 Mbit/sec +++
ttcp-t: 2 I/O calls, usec/call = 83.00, calls/sec = 12048.19
ttcp-t: user: 0 sys: 0 total: 0 real: 166 (microseconds)
[root at openib2 ~]#

Here's the output from the dmesg on the server:

 ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
...<repeats>...

Here's the output from the dmesg (client):

 ERR: : VMA lock <580000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <594000:65536> error <-12> <16:0:8>
 ERR: : VMA lock <5a8000:65536> error <-12> <16:0:8>
...<repeats>...

If the value of -l (length of network read/write buffers) is reduced, it
runs (up to a buffer size of 4K). However, there is still dmesg output on
the server side:

 ERR: : VMA lock <550000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <554000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <558000:1024> error <-12> <1:8:8>
WARN: : Cancel read with no IOCB. <2:0:00000005>
WARN: : Cancel read with no IOCB. <2:0:00000005>
 ERR: : VMA lock <528000:1024> error <-12> <1:8:8>
 ERR: : VMA lock <52c000:1024> error <-12> <1:8:8>
...<repeats>...

Is this related to system configuration somehow ? How much system memory
in your machines ? Is this a factor ?

Thanks.

-- Hal


From libor at topspin.com  Fri Apr  1 15:07:50 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 1 Apr 2005 15:07:50 -0800
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <1112395932.4476.217.camel@localhost.localdomain>;
	from halr@voltaire.com on Fri, Apr 01, 2005 at 05:56:00PM -0500
References: <20050327153112.GA26108@mellanox.co.il>
	<20050328170351.B30499@topspin.com>
	<20050330010814.GB24794@esmail.cup.hp.com>
	<20050329181228.H31683@topspin.com>
	<20050330032904.GA24936@esmail.cup.hp.com>
	<20050330164349.B32764@topspin.com>
	<1112304328.7331.42.camel@localhost.localdomain>
	<20050331134116.B1541@topspin.com>
	<1112395932.4476.217.camel@localhost.localdomain>
Message-ID: <20050401150750.C2870@topspin.com>

On Fri, Apr 01, 2005 at 05:56:00PM -0500, Hal Rosenstock wrote:
> On Thu, 2005-03-31 at 16:41, Libor Michalek wrote: 
> > 
> >      ./ttcp.aio.x -r -l 98304 -a 20 -f M
> >      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -f M 192.168.0.100
> > 
> > > > For the FMR pool cache miss async results I used 1000 different
> > > > buffers, of which only 20 were in flight at a time.
> > 
> >      ./ttcp.aio.x -r -l 98304 -a 20 -x 1000 -f M
> >      ./ttcp.aio.x -t -l 98304 -n 200000 -a 20 -x 1000 -f M 192.168.0.100
> 
> We are seeing issues with both buffer size and iterations. We get back
> -ENOMEM and also see VMA lock errors. Are the 2 related ? Should we turn
> on SDP debug to see what specifically can't be allocated ? In that case,
> what could be done ?

Hal,

  You need to increase the amount of memory that the user is allowed
to lock. Run the following command in each shell from which you are
running ttcp:

  limit memorylocked unlimited

  In 2.6, mlock() looks at the rlimits for the process executing the lock.
I decided not to artificially increase the limit while locking, instead
relying on the user/admin to set the appropriate value. The default is
pretty small.

-Libor
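
The "limit" command above is csh/tcsh syntax; in bash the equivalent is
"ulimit -l unlimited". To check from a program whether the current
limit is large enough, something like the following sketch can be
used. It assumes the memory to be locked is roughly buffer length
times buffers in flight (the -l and -a values), which is a
simplification, and the ex_* helpers are made up for illustration.

```c
#include <sys/resource.h>

/* Rough estimate: buffers in flight times buffer length. */
static unsigned long long ex_locked_bytes(unsigned long long buflen,
					  unsigned int in_flight)
{
	return buflen * in_flight;
}

/*
 * Returns 1 if 'need' bytes fit under the soft RLIMIT_MEMLOCK limit,
 * 0 if they do not, and -1 if the limit could not be read.
 */
static int ex_fits_memlock(unsigned long long need)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return need <= (unsigned long long)rl.rlim_cur;
}
```

For "-l 65536 -a 20" the estimate is 1310720 bytes, so a small default
limit would be exceeded and the lock would fail with -ENOMEM (-12), as
in the VMA lock errors reported above.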


From halr at voltaire.com  Fri Apr  1 15:21:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Apr 2005 18:21:24 -0500
Subject: [openib-general] [PATCH] FMR support in mthca
In-Reply-To: <20050401150750.C2870@topspin.com>
References: <20050327153112.GA26108@mellanox.co.il>
	<20050328170351.B30499@topspin.com>
	<20050330010814.GB24794@esmail.cup.hp.com>
	<20050329181228.H31683@topspin.com>
	<20050330032904.GA24936@esmail.cup.hp.com>
	<20050330164349.B32764@topspin.com>
	<1112304328.7331.42.camel@localhost.localdomain>
	<20050331134116.B1541@topspin.com>
	<1112395932.4476.217.camel@localhost.localdomain>
	<20050401150750.C2870@topspin.com>
Message-ID: <1112397606.4476.232.camel@localhost.localdomain>

On Fri, 2005-04-01 at 18:07, Libor Michalek wrote:
>   You need to increase the amount of memory that the user is allowed
> to lock. The following command in each shell from which you are running
> ttcp:
> 
>   limit memorylocked unlimited
> 
>   In 2.6 mlock() looks at the rlimits for the process executing the lock,
> I decided not to artificially increase the limit while locking, instead
> relying on the user/admin to set the appropriate value. The default is
> pretty small.

Thanks.

Should this go into an SDP FAQ ? Perhaps the alternatives for the SDP
protocol family should go there as well. Anything else ?

-- Hal


From iod00d at hp.com  Fri Apr  1 17:43:03 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 17:43:03 -0800
Subject: [openib-general] [PATCH] libsdp debug output
Message-ID: <20050402014303.GK11094@esmail.cup.hp.com>

Hi,
I found this output from libsdp.so less than helpful:
	default libsdp configuration is used

First, I need a clue what the default location is if I just want
to hack the file.  I was expecting it to live in /etc/libsdp.conf 
based on email describing gen1 in 2004:
	http://openib.org/pipermail/openib-general/2004-June/003222.html	

Secondly, the "make install" puts the libsdp.so in /usr/local/etc
by default. That's ok if the lib tells me that.

A future enhancement would be to *always* print the path
if "verbose=1" (or something) exists in the .conf file.
At some point, customers don't want to know.
I don't mind it since it's a good reminder when I'm testing.

thanks,
grant

Signed-off-by: Grant Grundler <iod00d at hp.com>


Index: src/userspace/libsdp/src/port.c
===================================================================
--- src/userspace/libsdp/src/port.c	(revision 2103)
+++ src/userspace/libsdp/src/port.c	(working copy)
@@ -1202,8 +1202,10 @@ void __sdp_init(
   if (config_file) {
     __sdp_read_config(config_file);
   } else {
-      printf("default libsdp configuration is used\n");
-#define LIBSDP_DEFAULT_CONFIG_FILE  "/usr/local/ibgd/etc/libsdp.conf"
+/* #define LIBSDP_DEFAULT_CONFIG_FILE  "/etc/libsdp.conf" */
+#define LIBSDP_DEFAULT_CONFIG_FILE  "/usr/local/etc/libsdp.conf"
+      printf("libsdp.so: $LIBSDP_CONFIG_FILE not set. Using "
+					 LIBSDP_DEFAULT_CONFIG_FILE "\n");
     __sdp_read_config(LIBSDP_DEFAULT_CONFIG_FILE);
   }
 } /* __sdp_init */
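
The lookup order this patch reports can be sketched on its own as
below. It assumes the config file name comes from the
LIBSDP_CONFIG_FILE environment variable, as the new message implies;
ex_sdp_config_path() is a made-up helper and is not part of libsdp.

```c
#include <stdlib.h>

#define LIBSDP_DEFAULT_CONFIG_FILE "/usr/local/etc/libsdp.conf"

/*
 * Return the config file libsdp would read: the environment override
 * if it is set and non-empty, otherwise the compiled-in default.
 */
static const char *ex_sdp_config_path(void)
{
	const char *path = getenv("LIBSDP_CONFIG_FILE");

	return (path && *path) ? path : LIBSDP_DEFAULT_CONFIG_FILE;
}
```

Printing the returned path unconditionally at init time would also
cover the case where the variable is set but points at the wrong file.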


From iod00d at hp.com  Fri Apr  1 17:51:29 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 17:51:29 -0800
Subject: [openib-general] [PATCH] improve tvflash output (verbose and error
	cases)
Message-ID: <20050402015129.GM11094@esmail.cup.hp.com>

Roland,
This patch adds more output when "-v" is specified and adds details
to output from some of the error cases.

BTW, this is the version of code I used to flash all my
cards to v3.3.2 on rx2600 and rx4640 (HP IA64/ZX1 boxen).

thanks,
grant

Signed-off-by: Grant Grundler <iod00d at hp.com>

Index: src/userspace/tvflash/src/tvflash.c
===================================================================
--- src/userspace/tvflash/src/tvflash.c	(revision 2103)
+++ src/userspace/tvflash/src/tvflash.c	(working copy)
@@ -460,6 +460,9 @@ static int open_hca(struct tvdevice *tvd
 {
 	cur_hca = tvdev->pdev;
 
+	if (verbose)
+		fprintf(stderr, "open_hca(%d)\n", tvdev->num);
+
 	if (!config && tvdev->can_map) {
 		int fd = open("/dev/mem", O_RDWR, O_SYNC);
 		if (fd < 0) {
@@ -485,6 +488,9 @@ static int open_hca(struct tvdevice *tvd
 
 static void close_hca(void)
 {
+	if (verbose)
+		fprintf(stderr, "close_hca()\n");
+
 	if (bar0)
 		munmap(bar0, 1 << 20);
 }
@@ -563,6 +569,9 @@ static void flash_write_cmd(unsigned int
 
 static void flash_chip_reset(void)
 {
+	if (verbose)
+		fprintf(stderr, "flash_chip_reset()\n");
+
 	/* Issue Flash Reset Command*/
 	flash_write_cmd(0x0, 0xf0);
 }
@@ -784,6 +793,9 @@ static int flash_image_read_from_file(ch
 	char *buf;
 	unsigned int sector_sz;
 
+	if (verbose)
+		fprintf(stderr, "flash_image_read_from_file(%s)\n", fname);
+
 	/* Open and read image files */
 	fimg = fopen(fname, "r");
 	if (fimg == NULL) {
@@ -872,6 +884,9 @@ static int flash_check_failsafe(void)
 	char *psbuf;
 	int i;
 
+	if (verbose)
+		fprintf(stderr, "flash_check_failsafe()\n");
+
 	/* Grab the sector size first */
 	sector_sz_ptr = (flash_byte_read(0x16) << 8) | flash_byte_read(0x17);
 	sector_sz     = (flash_byte_read(0x32 + sector_sz_ptr) << 8) |
@@ -882,6 +897,8 @@ static int flash_check_failsafe(void)
 	 * than 1MB is suspicious and thrown out
 	 */
 	if (sector_sz < 12 || sector_sz > 20) {
+		fprintf(stderr, "flash_check_failsafe(): sector_sz (%d) not"
+				" valid. Set to zero.\n", sector_sz);
 		failsafe.sector_sz = TV_FLASH_DEFAULT_SECTOR_SIZE;
 		failsafe.valid     = 0;
 		return 0;
@@ -1192,6 +1209,8 @@ static int identify_hca(int num, struct 
 		case PCI_DEVICE_MELLANOX_MT25208_COMPAT:
 			printf("HCA #%d: Found MT25208 (MT23108 mode)", num);
 			break;
+		default:
+			printf("HCA #%d: WTF? 0x%x", num, tvdev->pdev->device_id);
 		}
 
 		switch (identify_board(tvdev)) {
@@ -1236,7 +1255,10 @@ static int identify_hca(int num, struct 
 						       ver_str,
 						       failsafe.images[0].vsd.data.vendor.topspin.hw_label);
 				} else
-					printf("  Primary image is valid, unknown source\n");
+					printf("  Primary image is valid, "
+						"unknown source (sig 0x%x/0x%x)\n",
+						failsafe.images[0].vsd.data.signature,
+						failsafe.images[0].vsd.data.vendor.topspin.signature2);
 			} else
 				printf("  Primary image is NOT valid\n");
 
@@ -1257,7 +1279,10 @@ static int identify_hca(int num, struct 
 						       ver_str,
 						       failsafe.images[1].vsd.data.vendor.topspin.hw_label);
 				} else
-					printf("  Secondary image is valid, unknown source\n");
+					printf("  Secondary image is valid,"
+						" unknown source (sig 0x%x/0x%x)\n",
+						failsafe.images[1].vsd.data.signature,
+						failsafe.images[1].vsd.data.vendor.topspin.signature2);
 			} else
 				printf("  Secondary image is NOT valid\n");
 		} else
@@ -1429,6 +1454,9 @@ static int flash_image_write_to_file(cha
 	int i, fd;
 	unsigned int offset;
 
+	if (verbose)
+		fprintf(stderr, "flash_image_write_to_file(%s)\n", fname);
+
 	buffer = malloc(failsafe.sector_sz);
 	if (!buffer) {
 		fprintf(stderr, "couldn't allocated %d bytes of memory for buffer\n",
@@ -1460,12 +1488,10 @@ static int flash_image_write_to_file(cha
 		}
 
 		write(fd, buffer, failsafe.sector_sz);
-
 		offset += failsafe.sector_sz;
 	}
 
 	close(fd);
-
 	return 0;
 }
 
@@ -1474,6 +1500,9 @@ static int download_firmware(int hca, ch
 	struct tvdevice *tvdev;
 	int ret;
 
+	if (verbose)
+		fprintf(stderr, "download_firmware(%d,%s)\n", hca, ofname);
+
 	tvdev = find_device(hca);
 	if (!tvdev) {
 		fprintf(stderr, "couldn't find HCA #%d on the PCI bus\n", hca);
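
A note on the sector size check touched by this patch: the 12..20
range only makes sense if sector_sz at that point is a log2 value
(4KB up to 1MB), which is how I read the surrounding comment. Under
that assumption, the byte assembly and range check look like the
sketch below; the ex_* helpers are made up and are not tvflash
functions.

```c
#include <stdint.h>

/* Combine two bytes read from flash, high byte first. */
static unsigned int ex_be16(uint8_t hi, uint8_t lo)
{
	return ((unsigned int)hi << 8) | lo;
}

/* Accept log2 sector sizes from 4KB (2^12) up to 1MB (2^20). */
static int ex_sector_sz_valid(unsigned int log2_sz)
{
	return log2_sz >= 12 && log2_sz <= 20;
}

/* Convert a validated log2 sector size to bytes. */
static unsigned long ex_sector_bytes(unsigned int log2_sz)
{
	return 1UL << log2_sz;
}
```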


From iod00d at hp.com  Fri Apr  1 18:40:48 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 1 Apr 2005 18:40:48 -0800
Subject: [openib-general] ia64 perf and FMR
Message-ID: <20050402024048.GN11094@esmail.cup.hp.com>

Hi,
Just wanted to share initial perf results (and surprise)
that I'm getting on the HP ZX1/IA64 boxes.

Before FMR support was committed, netperf was reporting around
1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned
to the CPU that wasn't taking interrupts. After FMR was committed,
netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was
saturated on the send side in all cases.

I've a vague idea what "Fast Memory Registration" is but not a good
understanding.  Can someone point me at a decent explanation of FMR?

I'd like to understand the 2X in performance.
Maybe we are doing 1/2 as much DMA mapping in one of
the bug fixes?

And I'm suspicious of the IPoIB numbers since SDP is also seeing
a bit over 3500 Mb/s and sending CPU is also saturated. I was hoping
SDP would be 40-60% faster than TCP (ipoib). Maybe I'm just not
configuring libsdp.conf correctly for netperf and maybe the IPoIB
numbers are correct.  I've "rmmod ib_sdp" on both boxes, unloaded
and reloaded all the other IB drivers, and "unset LD_PRELOAD".
Is unloading ib_sdp sufficient to be sure SDP isn't used?

(I do get "module in use" when netserver is running with LD_PRELOAD
pointing at libsdp.so)


I also reviewed all the "__attribute__ ((packed))" uses in
include/ib_mad.h and include/ib_smi.h. It looks safe to me
to remove them since every field is "naturally" aligned from
the start of its respective structure. I also checked
nested cases. However, while it worked fine, removing all use
from the two files didn't matter for netperf TCP_STREAM.

I didn't realize other files also use "packed" and will
have to revisit the issue. I'm mostly worried some
new use will not be well aligned and cause the compiler
to insert padding. That will be a PITA to debug.
What we need is a compiler warning to tell us when/where
padding is inserted in a structure with a similar __attribute__.

Reminder: not pinning the netserver thread to the other CPU
costs around 25% performance. I think that's true for any single
threaded networking perf test that saturates the CPU.

thanks,
grant
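
On the padding worry: GCC has a -Wpadded option that warns when padding
is added to a structure, and a compile-time size check catches the
same thing per struct without the extra noise. A minimal sketch,
assuming GCC or Clang and a common ABI where uint32_t is 4-byte
aligned; the ex_* structs are invented for illustration and are not
the ones in ib_mad.h or ib_smi.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Every field is naturally aligned, so "packed" changes nothing here. */
struct ex_aligned {
	uint8_t  a;
	uint8_t  b;
	uint16_t c;
	uint32_t d;
};

/* Badly ordered fields: the compiler pads after 'tag'. */
struct ex_padded {
	uint8_t  tag;
	uint32_t value;
};

/* Same fields with packed: no padding, but 'value' is misaligned. */
struct ex_packed {
	uint8_t  tag;
	uint32_t value;
} __attribute__((packed));

/* Fail the build if a wire structure ever grows hidden padding. */
_Static_assert(sizeof(struct ex_aligned) == 8,
	       "ex_aligned must have no padding");
_Static_assert(sizeof(struct ex_packed) == 5,
	       "ex_packed must match the wire layout");
```

A check like the first _Static_assert next to each structure would let
"packed" be removed safely: if a later change misaligns a field, the
build breaks instead of the wire format.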


From mst at mellanox.co.il  Sat Apr  2 12:29:44 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 2 Apr 2005 23:29:44 +0300
Subject: [openib-general] Re: [PATCH][MTHCA] add in SINAI defines into mtcha
	code WAS: [openib-commits] r2101 -
	gen2/trunk/src/linux-kernel/patches
In-Reply-To: <1112379853.18939.11.camel@duffman>
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
Message-ID: <20050402202944.GB29843@mellanox.co.il>

Quoting r. Tom Duffy <tduffy at sun.com>:
> Subject: [PATCH][MTHCA] add in SINAI defines into mtcha code WAS: [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
> 
> On Thu, 2005-03-31 at 12:43 -0800, roland at openib.org wrote:
> > Author: roland
> > Date: 2005-03-31 12:43:29 -0800 (Thu, 31 Mar 2005)
> > New Revision: 2101
> > 
> > Added:
> >    gen2/trunk/src/linux-kernel/patches/linux-2.6.11-sinai.diff
> > Log:
> > Add patch adding Sinai device IDs for 2.6.11 kernel.
> 
> Roland, please consider applying this for svn ease of use:
> 

Just adding defines won't make Sinai work for you.
RQ formatting needs to be fixed. I posted patches that
make Sinai work earlier:

http://www.mail-archive.com/openib-general at openib.org/msg03891.html

and

http://www.mail-archive.com/openib-general at openib.org/msg03892.html

I can repost if needed.


-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Sun Apr  3 10:35:48 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 3 Apr 2005 20:35:48 +0300
Subject: [openib-general] Re: ia64 perf and FMR
In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com>
References: <20050402024048.GN11094@esmail.cup.hp.com>
Message-ID: <20050403173548.GA14915@mellanox.co.il>

Quoting r. Grant Grundler <iod00d at hp.com>:
> Subject: ia64 perf and FMR
> 
> Hi,
> Just wanted to share initial perf results (and surprise)
> that I'm getting on the HP ZX1/IA64 boxes.
> 
> Before FMR support was committed, netperf was reporting around
> 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned
> to the CPU that wasn't taking interrupts. After FMR was committed,
> netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was
> saturated on the send side in all cases.
> 
> I've a vague idea what "Fast Memory Registration" is but not a good
> understanding.  Can someone point me at a decent explanation of FMR?
> 
> I'd like to understand the 2X in performance.
> Maybe we are doing 1/2 as much DMA mapping in one of
> the bug fixes?
> 
> And I'm suspicious of the IPoIB numbers since SDP is also seeing
> a bit over 3500 Mb/s and sending CPU is also saturated. I was hoping
> SDP would be 40-60% faster than TCP (ipoib). Maybe I'm just not
> configuring libsdp.conf correctly for netperf and maybe the IPoIB
> numbers are correct.  I've "rmmod ib_sdp" on both boxes, unloaded
> and reloaded all the other IB drivers, and "unset LD_PRELOAD".
> Is unloading ib_sdp sufficient to be sure SDP isn't used?
> 
> (I do get "module in use" when netserver is running with LD_PRELOAD
> pointing at libsdp.so)
> 
> 
> I also reviewed all the "__attribute__ ((packed))" uses in
> include/ib_mad.h and include/ib_smi.h. It looks safe to me
> to remove them since every field is "naturally" aligned from
> the start of its respective structure. I also checked
> nested cases. However, while it worked fine, removing all use
> from the two files didn't matter for netperf TCP_STREAM.
> 
> I didn't realize other files also use "packed" and will
> have to revisit the issue. I'm mostly worried some
> new use will not be well aligned and cause the compiler
> to insert padding. That will be a PITA to debug.
> What we need is a compiler warning to tell us when/where
> padding is inserted in a structure with a similar __attribute__.
> 
> Reminder: not pinning the netserver thread to the other CPU
> costs around 25% performance. I think that's true for any single
> threaded networking perf test that saturates the CPU.
> 
> thanks,
> grant

Can you try with DDR hidden? This will disable FMRs for Tavor.


-- 
MST - Michael S. Tsirkin


From iod00d at hp.com  Sun Apr  3 14:13:48 2005
From: iod00d at hp.com (Grant Grundler)
Date: Sun, 3 Apr 2005 14:13:48 -0700
Subject: [openib-general] Re: ia64 perf and FMR
In-Reply-To: <20050403173548.GA14915@mellanox.co.il>
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<20050403173548.GA14915@mellanox.co.il>
Message-ID: <20050403211348.GA18395@esmail.cup.hp.com>

On Sun, Apr 03, 2005 at 08:35:48PM +0300, Michael S. Tsirkin wrote:
> Can you try with DDR hidden? This will disable FMRs for Tavor.

I could if someone provided the v3.3.2 "failsafe" FW image with DDR hidden.
I'm not equipped with a windows machine nor infiniburn to create
my own. I'll need a "failsafe" image for both cougar and cougarcub.

Once I'm done with this round of testing, I'd be happy to try
a newer version of firmware w/ and w/o DDR hidden.

thanks,
grant


From iod00d at hp.com  Sun Apr  3 22:51:31 2005
From: iod00d at hp.com (Grant Grundler)
Date: Sun, 3 Apr 2005 22:51:31 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com>
References: <20050402024048.GN11094@esmail.cup.hp.com>
Message-ID: <20050404055131.GA19409@esmail.cup.hp.com>

On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote:
> Hi,
> Just wanted to share initial perf results (and surprise)
> that I'm getting on the HP ZX1/IA64 boxes.
> 
> Before FMR support was committed, netperf was reporting around
> 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned
> to the CPU that wasn't taking interrupts. After FMR was committed,
> netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was
> saturated on the send side in all cases.

FMR is a red herring.  I tried SVN r2080 and it has roughly the same
performance as r2082 (when FMR was committed) and later r210x.
"packed" attribute is a red herring too.

Performance stunk with r2050 and I will do a binary search this
week until I sort out which changes doubled the perf. ISTR
there was one change related  to a "double mapping" issue and I
will be tracking that down in a few days.

> I've a vague idea what "Fast Memory Registration" is but not a good
> understanding.  Can someone point me at a decent explanation of FMR?

I'm still fishing for this.
Even tips on which docs I might scrounge through are welcome.

...
> Maybe I'm just not configuring libsdp.conf correctly for netperf
> and maybe the IPoIB numbers are correct.

This was in fact the case.
The explanations aren't very good in the default .conf file.
Is there other documentation to describe libsdp.conf file?

"match program *" worked. Variations of "match destination"
and "match listen *:12866" did not. Well, it might have worked
for one side or the other, but not both.

I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf.
(256KB socket size....probably too small).
So HP ZX1 chipset is doing quite well for a 3yr old PCI-X chipset.

thanks,
grant


From ftillier at infiniconsys.com  Sun Apr  3 23:16:35 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Sun, 3 Apr 2005 23:16:35 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com>
Message-ID: <001901c538dd$db3e0590$1802a8c0@infiniconsys.com>

> From: Grant Grundler [mailto:iod00d at hp.com]
> Sent: Sunday, April 03, 2005 10:52 PM
> 
> On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote:
> > I've a vague idea what "Fast Memory Registration" is but not a good
> > understanding.  Can someone point me at a decent explanation of FMR?
> 
> I'm still fishing for this.
> Even tips on which docs I might scrounge through are welcome.
> 

If you have access to a Tavor PRM, you can see what they are and how they
work.  The Mellanox implementation of FMR is not the same as FMR defined in
the 1.2 IB spec.

Basically, FMR lets you register memory without using the command interface,
using memory mapped HCA resource to access the translation tables directly.
There are pitfalls with them related to the HCA caching translation entries
and cache coherency between the HCA and what the app wants it to do.

That's my current understanding, and will gladly stand corrected.  I hope
that helps some.

- Fab


From iod00d at hp.com  Sun Apr  3 23:28:29 2005
From: iod00d at hp.com (Grant Grundler)
Date: Sun, 3 Apr 2005 23:28:29 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <001901c538dd$db3e0590$1802a8c0@infiniconsys.com>
References: <20050404055131.GA19409@esmail.cup.hp.com>
	<001901c538dd$db3e0590$1802a8c0@infiniconsys.com>
Message-ID: <20050404062829.GC19481@esmail.cup.hp.com>

On Sun, Apr 03, 2005 at 11:16:35PM -0700, Fab Tillier wrote:
> If you have access to a Tavor PRM, you can see what they are and how they
> work.  The Mellanox implementation of FMR is not the same as FMR defined in
> the 1.2 IB spec.

I know who in HP does.  But I hate contaminating myself with docs
that are only available under NDA. That's why I haven't looked at
them yet.  Is "competitive advantage" still a reason for Mellanox
to NOT publish the PRM for older PCI-X chips?
(e.g. Tavor)

> Basically, FMR lets you register memory without using the command interface,
> using memory mapped HCA resource to access the translation tables directly.
> There are pitfalls with them related to the HCA caching translation entries
> and cache coherency between the HCA and what the app wants it to do.

ok - I can see how FMR helps with latency and PCI bus utilization.
But not a 2x increase in throughput.

> That's my current understanding, and will gladly stand corrected.  I hope
> that helps some.

It does.

thanks,
grant


From roland at topspin.com  Mon Apr  4 07:06:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 07:06:53 -0700
Subject: [openib-general] ia64 perf and FMR
References: <20050402024048.GN11094@esmail.cup.hp.com>
Message-ID: <52sm26eiv6.fsf@topspin.com>

    Grant> Before FMR support was committed, netperf was reporting
    Grant> around 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and
    Grant> netserver pinned to the CPU that wasn't taking
    Grant> interrupts. After FMR was committed, netperf is reporting
    Grant> about 3500 Mb/s (437 MB/s) for IPoIB. CPU was saturated on
    Grant> the send side in all cases.

    Grant> I've a vague idea what "Fast Memory Registration" is but
    Grant> not a good understanding.  Can someone point me at a decent
    Grant> explanation of FMR?

A memory region (MR) is a memory translation mapping in the HCA's
context.  Usually, we create MRs via a firmware command, which is
prohibitively expensive to do in the data path.  However, it is
possible for the driver to write directly into the HCA's context,
bypassing the firmware.  This is very cheap, just some posted writes,
and so we can do it in the data path.  For example, for AIO, SDP uses
this to map a bunch of random userspace pages into something virtually
contiguous in the HCA's memory map, so that it can be used for RDMA.

However this shouldn't affect IPoIB in the least since a) it doesn't
do any dynamic memory registration and b) it doesn't call any FMR
functions anyway.
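
To make the "random pages into something virtually contiguous" idea
concrete, here is a toy model of the translation step.  It is only a
sketch: the toy_* names, the 4 KB page size and the fixed-size table
are all made up for illustration, and none of this is the mthca or
verbs FMR interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  ((uint64_t) 1 << TOY_PAGE_SHIFT)
#define TOY_MAX_PAGES  8

/* Hypothetical stand-in for one region's translation entries. */
struct toy_fmr {
	uint64_t iova;                /* start of the contiguous I/O virtual range */
	size_t   npages;              /* number of valid entries in page[] */
	uint64_t page[TOY_MAX_PAGES]; /* page-aligned backing addresses */
};

/* "Map": store the page list, analogous to a few posted writes. */
static int toy_fmr_map(struct toy_fmr *fmr, uint64_t iova,
		       const uint64_t *pages, size_t npages)
{
	size_t i;

	assert(fmr != NULL && pages != NULL);
	if (npages > TOY_MAX_PAGES)
		return -1;
	for (i = 0; i < npages; ++i) {
		if (pages[i] & (TOY_PAGE_SIZE - 1))
			return -1;	/* entries must be page aligned */
		fmr->page[i] = pages[i];
	}
	fmr->iova   = iova;
	fmr->npages = npages;
	return 0;
}

/* Translate an address inside the contiguous range to its backing page. */
static int toy_fmr_translate(const struct toy_fmr *fmr, uint64_t addr,
			     uint64_t *out)
{
	uint64_t off;

	assert(fmr != NULL && out != NULL);
	if (addr < fmr->iova)
		return -1;
	off = addr - fmr->iova;
	if ((off >> TOY_PAGE_SHIFT) >= fmr->npages)
		return -1;	/* past the end of the mapped range */
	*out = fmr->page[off >> TOY_PAGE_SHIFT] + (off & (TOY_PAGE_SIZE - 1));
	return 0;
}
```

The remote side only ever sees the contiguous range starting at iova;
the scattered backing pages are hidden behind the table lookup.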

    Grant> I'd like to understand the 2X in performance.  Maybe we are
    Grant> doing 1/2 as much DMA mapping in one of the bug fixes?

    Grant> And I'm suspicious of the IPoIB numbers since SDP is also
    Grant> seeing a bit over 3500 Mb/s and sending CPU is also
    Grant> saturated. I was hoping SDP would be 40-60% faster than TCP
    Grant> (ipoib). Maybe I'm just not configuring libsdp.conf
    Grant> correctly for netperf and maybe the IPoIB numbers are
    Grant> correct.  I've "rmmod ib_sdp" on both boxes, unloaded and
    Grant> reloaded all the other IB drivers, and "unset LD_PRELOAD".
    Grant> Is unloading ib_sdp sufficient to be sure SDP isn't used?

This is really odd.  I don't see how FMRs could directly change IPoIB
performance, since IPoIB isn't using FMRs, even indirectly.  If SDP is
not loaded, then I don't see how it could be used, but the fact that
you get the same number for SDP and IPoIB really makes me think that
the IPoIB number is really an SDP number.

    Grant> I also reviewed all the "__attribute__ ((packed))" uses in
    Grant> include/ib_mad.h and include/ib_smi.h. It looks safe to me
    Grant> to remove them since every field is "naturally" aligned
    Grant> from the start of its respective structure. I also checked
    Grant> nested cases. However, while it worked fine, removing all
    Grant> use from the two files didn't matter for netperf
    Grant> TCP_STREAM.

Yeah, none of that code is in the data path, so I wouldn't expect it
to make a difference one way or another.

The one that might make a difference is struct mthca_eqe in
mthca_eq.c.  Unfortunately simply removing the packed attribute will
break things on 64 bit archs unless the structure is written slightly
differently.  It shouldn't be that difficult, so I should have
something for you to test in a day or two.
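
For anyone following along, a small sketch of why "packed" can be
dropped for free in one case and not in the other.  These toy_*
structs are invented for illustration and are not the real mthca_eqe
layout; splitting a 64-bit field into two 32-bit halves is just one
possible way to keep the layout without the attribute.

```c
#include <stddef.h>
#include <stdint.h>

/* Every field naturally aligned: packed does not move any field. */
struct toy_aligned_packed {
	uint32_t a;
	uint32_t b;
	uint64_t c;
} __attribute__((packed));

struct toy_aligned_plain {
	uint32_t a;
	uint32_t b;
	uint64_t c;
};

/* A 64-bit field at byte offset 4: packed is what keeps it there. */
struct toy_misaligned_packed {
	uint32_t hdr;
	uint64_t addr;
} __attribute__((packed));

struct toy_misaligned_plain {
	uint32_t hdr;
	uint64_t addr;	/* padded out to offset 8 where u64 needs 8-byte alignment */
};

/* Same 12-byte layout without packed: carry the u64 as two u32 halves. */
struct toy_misaligned_split {
	uint32_t hdr;
	uint32_t addr_hi;
	uint32_t addr_lo;
};
```

Comparing offsetof() and sizeof() across these shows the difference:
the two "aligned" variants agree field for field, while the plain
"misaligned" variant grows padding on ABIs that align uint64_t to 8
bytes.  The packed variants also tell the compiler the struct itself
may sit at any address, which is presumably where the extra cost on
strict-alignment machines comes from.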

 - R.


From roland at topspin.com  Mon Apr  4 06:51:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 06:51:53 -0700
Subject: [openib-general] Re: [PATCH][MTHCA] add in SINAI defines into mtcha
 code WAS:
 [openib-commits] r2101 - gen2/trunk/src/linux-kernel/patches
References: <20050331204331.4320C2283D9@openib.ca.sandia.gov>
	<1112379853.18939.11.camel@duffman>
	<20050402202944.GB29843@mellanox.co.il>
Message-ID: <52y8byejk6.fsf@topspin.com>

    Michael> Just adding defines won't make sinai work for you.  RQ
    Michael> formatting needs to be fixed. I posted patches that make
    Michael> Sinai work earlier:

Yes, I already committed a similar change based on the latest PRM.  I
finally got a Sinai board and it seems to be working with the current
code.

 - R.


From mst at mellanox.co.il  Mon Apr  4 08:02:35 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Apr 2005 18:02:35 +0300
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
Message-ID: <20050404150235.GZ15034@mellanox.co.il>


Adds support for posting SEND_INLINE work requests in libmthca.
With this patch, I get latency as low as 3.35 usec unidirectional
with Arbel Tavor mode. Passed basic testing on Tavor and Arbel mode.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: src/qp.c
===================================================================
--- src/qp.c	(revision 2104)
+++ src/qp.c	(working copy)
@@ -57,6 +57,10 @@ enum {
 	MTHCA_NEXT_SOLICIT   = 1 << 1,
 };
 
+enum {
+	MTHCA_INLINE_SEG = 1<<31
+};
+
 struct mthca_next_seg {
 	uint32_t	nda_op;	/* [31:6] next WQE [4:0] next opcode */
 	uint32_t	ee_nds;	/* [31:8] next EE  [7] DBD [6] F [5:0] next WQE size */
@@ -107,6 +111,10 @@ struct mthca_data_seg {
 	uint64_t	addr;
 };
 
+struct mthca_inline_seg {
+	uint32_t	byte_count;
+};
+
 static const uint8_t mthca_opcode[] = {
 	[IBV_WR_SEND]                 = MTHCA_OPCODE_SEND,
 	[IBV_WR_SEND_WITH_IMM]        = MTHCA_OPCODE_SEND_IMM,
@@ -255,15 +263,38 @@ int mthca_tavor_post_send(struct ibv_qp 
 			goto out;
 		}
 
-		for (i = 0; i < wr->num_sge; ++i) {
-			((struct mthca_data_seg *) wqe)->byte_count =
-				htonl(wr->sg_list[i].length);
-			((struct mthca_data_seg *) wqe)->lkey =
-				htonl(wr->sg_list[i].lkey);
-			((struct mthca_data_seg *) wqe)->addr =
-				htonll(wr->sg_list[i].addr);
-			wqe += sizeof (struct mthca_data_seg);
-			size += sizeof (struct mthca_data_seg) / 16;
+		if (wr->send_flags & IBV_SEND_INLINE) {
+			struct mthca_inline_seg *seg = wqe;
+			int s = 0;
+			wqe += sizeof *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				struct ibv_sge *sge = &wr->sg_list[i];
+				int l;
+				l = sge->length;
+				s += l;
+
+				if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) {
+					ret = -1;
+					*bad_wr = wr;
+					goto out;
+				}
+
+				memcpy(wqe, (void*)(intptr_t)sge->addr, l);
+				wqe += l;
+			}
+			seg->byte_count = htonl(MTHCA_INLINE_SEG | s);
+
+			size += align(s + sizeof *seg, 16) / 16;
+		} else {
+			struct mthca_data_seg *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				seg = wqe;
+				seg->byte_count = htonl(wr->sg_list[i].length);
+				seg->lkey = htonl(wr->sg_list[i].lkey);
+				seg->addr = htonll(wr->sg_list[i].addr);
+				wqe += sizeof *seg;
+			}
+			size += wr->num_sge * sizeof *seg / 16;
 		}
 
 		qp->wrid[ind + qp->rq.max] = wr->wr_id;
@@ -512,15 +543,37 @@ int mthca_arbel_post_send(struct ibv_qp 
 			goto out;
 		}
 
-		for (i = 0; i < wr->num_sge; ++i) {
-			((struct mthca_data_seg *) wqe)->byte_count =
-				htonl(wr->sg_list[i].length);
-			((struct mthca_data_seg *) wqe)->lkey =
-				htonl(wr->sg_list[i].lkey);
-			((struct mthca_data_seg *) wqe)->addr =
-				htonll(wr->sg_list[i].addr);
-			wqe += sizeof (struct mthca_data_seg);
-			size += sizeof (struct mthca_data_seg) / 16;
+		if (wr->send_flags & IBV_SEND_INLINE) {
+			struct mthca_inline_seg *seg = wqe;
+			int s = 0;
+			wqe += sizeof *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				int l = wr->sg_list[i].length;
+				s += l;
+
+				if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) {
+					ret = -1;
+					*bad_wr = wr;
+					goto out;
+				}
+
+				memcpy(wqe,
+				       (void*)(intptr_t)wr->sg_list[i].addr, l);
+				wqe += l;
+			}
+			seg->byte_count = htonl(MTHCA_INLINE_SEG | s);
+
+			size += align(s + sizeof *seg, 16) / 16;
+		} else {
+			struct mthca_data_seg *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				seg = wqe;
+				seg->byte_count = htonl(wr->sg_list[i].length);
+				seg->lkey = htonl(wr->sg_list[i].lkey);
+				seg->addr = htonll(wr->sg_list[i].addr);
+				wqe += sizeof *seg;
+			}
+			size += wr->num_sge * sizeof *seg / 16;
 		}
 
 		qp->wrid[ind + qp->rq.max] = wr->wr_id;
-- 
MST - Michael S. Tsirkin


From iod00d at hp.com  Mon Apr  4 08:29:05 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 4 Apr 2005 08:29:05 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <52sm26eiv6.fsf@topspin.com>
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<52sm26eiv6.fsf@topspin.com>
Message-ID: <20050404152905.GA20973@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 07:06:53AM -0700, Roland Dreier wrote:
> A memory region (MR) is a memory translation mapping in the HCA's
> context.  Usually, we create MRs via a firmware command, which is
> prohibitively expensive to do in the data path.  However, it is
> possible for the driver to write directly into the HCA's context,
> bypassing the firmware.  This is very cheap, just some posted writes,
> and so we can do it in the data path.  For example, for AIO, SDP uses
> this to map a bunch of random userspace pages into something virtually
> contiguous in the HCA's memory map, so that it can be used for RDMA.

Thanks! I understood about 1/2 of that before. I'd like to read
a bit more detail though...

> However this shouldn't affect IPoIB in the least since a) it doesn't
> do any dynamic memory registration and b) it doesn't call any FMR
> functions anyway.

*nod* - FMR is clearly a red herring in this case.

> This is really odd.  I don't see how FMRs could directly change IPoIB
> performance, since IPoIB isn't using FMRs, even indirectly.

Sorry - I said "FMR" when I should have said r210x release. FMR
was just recently committed and I assumed (there's that word again)
that was related somehow. My bad.

> If SDP is
> not loaded, then I don't see how it could be used, but the fact that
> you get the same number for SDP and IPoIB really makes me think that
> the IPoIB number is really an SDP number.

Nope - those really were IPoIB numbers.

...
> The one that might make a difference is struct mthca_eqe in
> mthca_eq.c.  Unfortunately simply removing the packed attribute will
> break things on 64 bit archs unless the structure is written slightly
> differently.  It shouldn't be that difficult, so I should have
> something for you to test in a day or two.

Ok. I won't be able to test that until next week...I'll note which
rev picks that up and make sure to test it separately.

BTW, SDP uses "packed" for a dozen or so structures. I haven't looked
at any q-syscollect or pfmon data yet to see where SDP is spending time
or if "packed" is part of the code path. But I don't have the impression
SDP is CPU bound like IPoIB is.

thanks,
grant


From libor at topspin.com  Mon Apr  4 08:11:54 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 4 Apr 2005 08:11:54 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050402024048.GN11094@esmail.cup.hp.com>;
	from iod00d@hp.com on Fri, Apr 01, 2005 at 06:40:48PM -0800
References: <20050402024048.GN11094@esmail.cup.hp.com>
Message-ID: <20050404081154.A10315@topspin.com>

On Fri, Apr 01, 2005 at 06:40:48PM -0800, Grant Grundler wrote:
> Hi,
> Just wanted to share initial perf results (and surprise)
> that I'm getting on the HP ZX1/IA64 boxes.
> 
> Before FMR support was committed, netperf was reporting around
> 1720 Mb/s (215 MB/s) for IPoIB with msi_x=1 and netserver pinned
> to the CPU that wasn't taking interrupts. After FMR was committed,
> netperf is reporting about 3500 Mb/s (437 MB/s) for IPoIB. CPU was
> saturated on the send side in all cases.
> 
> I've a vague idea what "Fast Memory Registration" is but not a good
> understanding.  Can someone point me at a decent explanation of FMR?

  A few people responded with good FMR descriptions. However, I'd like
to add that when using an unmodified version of netperf, neither IPoIB
nor SDP uses FMRs. IPoIB never uses FMRs, and currently SDP only
uses FMRs when the application is using Linux AIO to read or write data
on the socket. In that instance, if the buffers are larger than a
threshold value they will be registered using FMRs and the contiguous
address is then shared with the remote half of the connection, which can
then RDMA to/from the buffer. The example code (ttcp.aio.c) I checked
in will use AIO and FMRs if the transfer size (-l) is over the 5K
default threshold.


-Libor


From libor at topspin.com  Mon Apr  4 08:20:02 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 4 Apr 2005 08:20:02 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com>;
	from iod00d@hp.com on Sun, Apr 03, 2005 at 10:51:31PM -0700
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<20050404055131.GA19409@esmail.cup.hp.com>
Message-ID: <20050404082002.B10315@topspin.com>

On Sun, Apr 03, 2005 at 10:51:31PM -0700, Grant Grundler wrote:
> ...
> > Maybe I'm just not configuring libsdp.conf correctly for netperf
> > and maybe the IPoIB numbers are correct.
> 
> This was in fact the case.
> The explanations aren't very good in the default .conf file.
> Is there other documentation to describe libsdp.conf file?
> 
> "match program *" worked. Variations of "match destination"
> and "match listen *:12866" did not. Well, it might have worked
> for one side or the other, but not both.

  Using libsdp with netperf shows some of the limitations of libsdp.
netperf connects to the server on a well known socket, but then the
server creates a second socket which it autobinds, checks to see
which port was assigned, and passes the port number to the client
which connects to the port. This second connection is then used for
the data transfer. 

  Since the connection is not on a well known port the way to match
it is with the 'program' keyword:

  match program netperf
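
  The autobind step described above follows the usual sockets pattern
of binding to port 0 and asking the kernel which port it picked.  A
minimal sketch of that pattern (listen_autobind is a made-up helper
name, and this is not netperf's actual code):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a listening TCP socket on a kernel-chosen port and report
 * the port.  Returns the socket fd, or -1 on error.
 */
static int listen_autobind(unsigned short *port_out)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof addr);
	addr.sin_family      = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port        = 0;	/* 0 = let the kernel pick ("autobind") */

	if (bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
	    listen(fd, 1) < 0 ||
	    getsockname(fd, (struct sockaddr *) &addr, &len) < 0) {
		close(fd);
		return -1;
	}

	/* This is the value the server would pass back to the client. */
	*port_out = ntohs(addr.sin_port);
	return fd;
}
```

  Because the port differs on every run, a libsdp.conf rule keyed on
a fixed port number cannot match the data connection.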

> I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf.
> (256KB socket size....probably too small).

  The socket size socket option is still in the TODO file.


-Libor


From iod00d at hp.com  Mon Apr  4 12:20:03 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 4 Apr 2005 12:20:03 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404082002.B10315@topspin.com>
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<20050404055131.GA19409@esmail.cup.hp.com>
	<20050404082002.B10315@topspin.com>
Message-ID: <20050404192003.GE20973@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 08:20:02AM -0700, Libor Michalek wrote:
>   Using libsdp with netperf shows some of the limitations of libsdp.
> netperf connects to the server on a well known socket, but then the
> server creates a second socket which it autobinds, checks to see
> which port was assigned, and passes the port number to the client
> which connects to the port. This second connection is then used for
> the data transfer. 

Ah ok. That explains why just keying off the port # didn't work.
I wasn't aware of that.

>   Since the connection is not on a well known port the way to match
> it is with the 'program' keyword:
> 
>   match program netperf

Wouldn't we also need a line like this?
	match program netserver

For a sanity check, I sometimes run netperf first in one direction
and then the other. It helps confirm the two boxes are symmetrical
(same HW config, same CPU, same firmware, same drivers, etc).
Having one libsdp.conf would keep things easy.

And in fact, my current configuration is NOT symmetrical.
I have the same HW config but system firmware is not the same.
This results in ~8-10% loss in performance in one direction vs the other.

> > I'm now getting ~5300-5500 Mb/s (~660 MB/s) using SDP with netperf.
> > (256KB socket size....probably too small).
> 
>   The socket size socket option is still in the TODO file.

OH! I was working out how long it takes IB card to transmit or fill a
256KB buffer and it's really not very long. Kind of limits how many
transactions can be coalesced into one interrupt and how long the
interrupt handler can be deferred. But that's probably not a burning
issue (yet).
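
The back-of-the-envelope arithmetic, as a sketch.  It assumes 256 KB
means 262144 bytes and that "Mb/s" is decimal megabits per second;
drain_time_usec is just a name picked for this example.

```c
/*
 * Microseconds needed to move `bytes` at `mbit_per_s`.
 * 1 Mb/s is 1 bit per microsecond, so bits / (Mb/s) is already usec.
 */
static double drain_time_usec(double bytes, double mbit_per_s)
{
	return bytes * 8.0 / mbit_per_s;
}
```

At the ~5400 Mb/s netperf is reporting, drain_time_usec(262144, 5400)
comes out a bit under 390 usec.  At 8000 Mb/s (the nominal data rate
of a 4x SDR link, if that is what these cards are running) it is
about 262 usec.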

thanks,
grant


From roland at topspin.com  Mon Apr  4 13:37:41 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 13:37:41 -0700
Subject: [openib-general] Re: problem with SDP/AIO on mem-free HCA
In-Reply-To: <20050401145414.B2870@topspin.com> (Libor Michalek's message of
	"Fri, 1 Apr 2005 14:54:14 -0800")
References: <001301c5367f$e86a8a50$1802a8c0@infiniconsys.com>
	<528y42laxk.fsf@topspin.com> <20050401145414.B2870@topspin.com>
Message-ID: <52psxaffca.fsf@topspin.com>

This patch seems to fix it for me.  With the patch applied, ttcp.aio
runs through to the end and switches from 4 KB RDMAs to 8 KB RDMAs
after 256 KB has been transferred.  Without the patch, ttcp.aio does a
0-length RDMA after 256 KB and fails.

 - R.


From roland at topspin.com  Mon Apr  4 13:51:01 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 13:51:01 -0700
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
In-Reply-To: <20050404150235.GZ15034@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 4 Apr 2005 18:02:35 +0300")
References: <20050404150235.GZ15034@mellanox.co.il>
Message-ID: <52is32feq2.fsf@topspin.com>

Is the test here correct?

+				if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) {

It seems we need to take into account the size of next segment and any
RDMA segment that we may be posting as well.

Also does it make sense to put the code for gathering inline data
segments and writing gather lists into an inline function that can be
called from both the tavor and arbel post send function?  Will gcc
actually inline this function?
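
Roughly what I have in mind for a shared helper, as a sketch only.
The toy_* names are invented here and this is not libmthca code; the
point is that the caller passes in the space left in the WQE after
every header already written (next segment, any RDMA segment, and the
inline byte-count word) instead of comparing against the full WQE size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical gather entry; stands in for struct ibv_sge. */
struct toy_sge {
	const void *addr;
	uint32_t    length;
};

/*
 * Copy a gather list into `dst`, refusing to write past `space` bytes.
 * Returns the total number of bytes copied, or -1 if the data does not
 * fit.  On failure some entries may already have been copied, so the
 * caller must discard the WQE.
 */
static inline int toy_gather_inline(void *dst, size_t space,
				    const struct toy_sge *sg, int num_sge)
{
	size_t total = 0;
	int i;

	for (i = 0; i < num_sge; ++i) {
		/* total <= space always holds, so this cannot underflow */
		if (sg[i].length > space - total)
			return -1;
		memcpy((char *) dst + total, sg[i].addr, sg[i].length);
		total += sg[i].length;
	}
	return (int) total;
}
```

Both post send functions could then compute
space = (1 << wqe_shift) - (bytes of headers written so far) and call
the same helper.  Whether gcc really inlines it is up to the compiler;
"static inline" is only a hint.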

 - R.


From halr at voltaire.com  Mon Apr  4 15:08:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Apr 2005 18:08:03 -0400
Subject: [openib-general] IPoIB
Message-ID: <1112652482.4490.281.camel@localhost.localdomain>

A while ago, Tom brought up the issue of IPoIB link level broadcasting
from user space (with the arping tool). Is it possible to do this from
kernel space ? For example, how would/could sendto() work when sending
to a IPoIB link layer address ? If all we wanted to support was
broadcast, perhaps there could be a remapping of the ethernet MAC
broadcast address to the all hosts MGID and QPN for that IPoIB
interface. Or perhaps the entire ipoib pseudoheader should be supported
in this mode. This is needed to support RARPing. Some hosts want to RARP
for their IP address and this should be supported over IPoIB.

-- Hal


From roland at topspin.com  Mon Apr  4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace
	verbs
In-Reply-To: <200544159.Ahk9l0puXy39U6u6@topspin.com>
Message-ID: <200544159.Qg0tUfQc4xGRabsc@topspin.com>

Add new structs and struct members required by userspace verbs to IB core.

Signed-off-by: Roland Dreier <roland at topspin.com>

--- linux-export.orig/drivers/infiniband/core/verbs.c	2005-01-11 09:35:27.046388000 -0800
+++ linux-export/drivers/infiniband/core/verbs.c	2005-04-04 14:50:59.579791210 -0700
@@ -47,10 +47,11 @@
 {
 	struct ib_pd *pd;
 
-	pd = device->alloc_pd(device);
+	pd = device->alloc_pd(device, NULL, NULL, 0);
 
 	if (!IS_ERR(pd)) {
-		pd->device = device;
+		pd->device  = device;
+		pd->uobject = NULL;
 		atomic_set(&pd->usecnt, 0);
 	}
 
@@ -76,8 +77,9 @@
 	ah = pd->device->create_ah(pd, ah_attr);
 
 	if (!IS_ERR(ah)) {
-		ah->device = pd->device;
-		ah->pd     = pd;
+		ah->device  = pd->device;
+		ah->pd      = pd;
+		ah->uobject = NULL;
 		atomic_inc(&pd->usecnt);
 	}
 
@@ -122,7 +124,7 @@
 {
 	struct ib_qp *qp;
 
-	qp = pd->device->create_qp(pd, qp_init_attr);
+	qp = pd->device->create_qp(pd, qp_init_attr, NULL, 0);
 
 	if (!IS_ERR(qp)) {
 		qp->device     	  = pd->device;
@@ -130,6 +132,7 @@
 		qp->send_cq    	  = qp_init_attr->send_cq;
 		qp->recv_cq    	  = qp_init_attr->recv_cq;
 		qp->srq	       	  = qp_init_attr->srq;
+		qp->uobject       = NULL;
 		qp->event_handler = qp_init_attr->event_handler;
 		qp->qp_context    = qp_init_attr->qp_context;
 		qp->qp_type	  = qp_init_attr->qp_type;
@@ -197,10 +200,11 @@
 {
 	struct ib_cq *cq;
 
-	cq = device->create_cq(device, cqe);
+	cq = device->create_cq(device, cqe, NULL, NULL, 0);
 
 	if (!IS_ERR(cq)) {
 		cq->device        = device;
+		cq->uobject       = NULL;
 		cq->comp_handler  = comp_handler;
 		cq->event_handler = event_handler;
 		cq->cq_context    = cq_context;
@@ -245,8 +249,9 @@
 	mr = pd->device->get_dma_mr(pd, mr_access_flags);
 
 	if (!IS_ERR(mr)) {
-		mr->device = pd->device;
-		mr->pd     = pd;
+		mr->device  = pd->device;
+		mr->pd      = pd;
+		mr->uobject = NULL;
 		atomic_inc(&pd->usecnt);
 		atomic_set(&mr->usecnt, 0);
 	}
@@ -267,8 +272,9 @@
 				     mr_access_flags, iova_start);
 
 	if (!IS_ERR(mr)) {
-		mr->device = pd->device;
-		mr->pd     = pd;
+		mr->device  = pd->device;
+		mr->pd      = pd;
+		mr->uobject = NULL;
 		atomic_inc(&pd->usecnt);
 		atomic_set(&mr->usecnt, 0);
 	}
@@ -344,8 +350,9 @@
 
 	mw = pd->device->alloc_mw(pd);
 	if (!IS_ERR(mw)) {
-		mw->device = pd->device;
-		mw->pd     = pd;
+		mw->device  = pd->device;
+		mw->pd      = pd;
+		mw->uobject = NULL;
 		atomic_inc(&pd->usecnt);
 	}
 
--- linux-export.orig/drivers/infiniband/include/ib_verbs.h	2005-02-22 10:14:06.623746000 -0800
+++ linux-export/drivers/infiniband/include/ib_verbs.h	2005-04-04 14:50:42.054602327 -0700
@@ -41,7 +41,9 @@
 
 #include <linux/types.h>
 #include <linux/device.h>
+
 #include <asm/atomic.h>
+#include <asm/scatterlist.h>
 
 union ib_gid {
 	u8	raw[16];
@@ -618,29 +620,78 @@
 	u8	page_size;
 };
 
+struct ib_ucontext {
+	struct ib_device       *device;
+	struct list_head	pd_list;
+	struct list_head	mr_list;
+	struct list_head	mw_list;
+	struct list_head	cq_list;
+	struct list_head	qp_list;
+	struct list_head	srq_list;
+	struct list_head	ah_list;
+	spinlock_t              lock;
+};
+
+struct ib_uobject {
+	u64			user_handle;	/* handle given to us by userspace */
+	struct ib_ucontext     *context;	/* associated user context */
+	struct list_head	list;		/* link to context's list */
+	u32			id;		/* index into kernel idr */
+};
+
+struct ib_umem {
+	unsigned long		user_base;
+	unsigned long		virt_base;
+	size_t			length;
+	int			offset;
+	int			page_size;
+	struct list_head	chunk_list;
+};
+
+struct ib_umem_chunk {
+	struct list_head	list;
+	int                     nents;
+	int                     nmap;
+	struct scatterlist      page_list[0];
+};
+
+#define IB_UMEM_MAX_PAGE_CHUNK						\
+	((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /	\
+	 ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -	\
+	  (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
+
+struct ib_umem_object {
+	struct ib_uobject	uobject;
+	struct ib_umem		umem;
+};
+
 struct ib_pd {
-	struct ib_device *device;
-	atomic_t          usecnt; /* count all resources */
+	struct ib_device       *device;
+	struct ib_uobject      *uobject;
+	atomic_t          	usecnt; /* count all resources */
 };
 
 struct ib_ah {
 	struct ib_device	*device;
 	struct ib_pd		*pd;
+	struct ib_uobject      *uobject;
 };
 
 typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context);
 
 struct ib_cq {
-	struct ib_device *device;
-	ib_comp_handler   comp_handler;
-	void             (*event_handler)(struct ib_event *, void *);
-	void *            cq_context;
-	int               cqe;
-	atomic_t          usecnt; /* count number of work queues */
+	struct ib_device       *device;
+	struct ib_uobject      *uobject;
+	ib_comp_handler   	comp_handler;
+	void                  (*event_handler)(struct ib_event *, void *);
+	void *            	cq_context;
+	int               	cqe;
+	atomic_t          	usecnt; /* count number of work queues */
 };
 
 struct ib_srq {
 	struct ib_device	*device;
+	struct ib_uobject	*uobject;
 	struct ib_pd		*pd;
 	void			*srq_context;
 	atomic_t		usecnt;
@@ -652,6 +703,7 @@
 	struct ib_cq	       *send_cq;
 	struct ib_cq	       *recv_cq;
 	struct ib_srq	       *srq;
+	struct ib_uobject      *uobject;
 	void                  (*event_handler)(struct ib_event *, void *);
 	void		       *qp_context;
 	u32			qp_num;
@@ -659,16 +711,18 @@
 };
 
 struct ib_mr {
-	struct ib_device *device;
-	struct ib_pd     *pd;
-	u32		  lkey;
-	u32		  rkey;
-	atomic_t          usecnt; /* count number of MWs */
+	struct ib_device  *device;
+	struct ib_pd	  *pd;
+	struct ib_uobject *uobject;
+	u32		   lkey;
+	u32		   rkey;
+	atomic_t	   usecnt; /* count number of MWs */
 };
 
 struct ib_mw {
 	struct ib_device	*device;
 	struct ib_pd		*pd;
+	struct ib_uobject	*uobject;
 	u32			rkey;
 };
 
@@ -737,7 +791,12 @@
 	int		           (*modify_port)(struct ib_device *device,
 						  u8 port_num, int port_modify_mask,
 						  struct ib_port_modify *port_modify);
-	struct ib_pd *             (*alloc_pd)(struct ib_device *device);
+	struct ib_ucontext *       (*alloc_ucontext)(struct ib_device *device,
+						     const void __user *udata, int udatalen);
+	int                        (*dealloc_ucontext)(struct ib_ucontext *context);
+	struct ib_pd *             (*alloc_pd)(struct ib_device *device,
+					       struct ib_ucontext *context,
+					       const void __user *udata, int udatalen);
 	int                        (*dealloc_pd)(struct ib_pd *pd);
 	struct ib_ah *             (*create_ah)(struct ib_pd *pd,
 						struct ib_ah_attr *ah_attr);
@@ -747,7 +806,8 @@
 					       struct ib_ah_attr *ah_attr);
 	int                        (*destroy_ah)(struct ib_ah *ah);
 	struct ib_qp *             (*create_qp)(struct ib_pd *pd,
-						struct ib_qp_init_attr *qp_init_attr);
+						struct ib_qp_init_attr *qp_init_attr,
+						const void __user *udata, int udatalen);
 	int                        (*modify_qp)(struct ib_qp *qp,
 						struct ib_qp_attr *qp_attr,
 						int qp_attr_mask);
@@ -762,8 +822,9 @@
 	int                        (*post_recv)(struct ib_qp *qp,
 						struct ib_recv_wr *recv_wr,
 						struct ib_recv_wr **bad_recv_wr);
-	struct ib_cq *             (*create_cq)(struct ib_device *device,
-						int cqe);
+	struct ib_cq *             (*create_cq)(struct ib_device *device, int cqe,
+						struct ib_ucontext *context,
+						const void __user *udata, int udatalen);
 	int                        (*destroy_cq)(struct ib_cq *cq);
 	int                        (*resize_cq)(struct ib_cq *cq, int *cqe);
 	int                        (*poll_cq)(struct ib_cq *cq, int num_entries,
@@ -780,6 +841,11 @@
 						  int num_phys_buf,
 						  int mr_access_flags,
 						  u64 *iova_start);
+	struct ib_mr *             (*reg_user_mr)(struct ib_pd *pd,
+						  struct ib_umem *region,
+						  int mr_access_flags,
+						  const void __user *udata,
+						  int udatalen);
 	int                        (*query_mr)(struct ib_mr *mr,
 					       struct ib_mr_attr *mr_attr);
 	int                        (*dereg_mr)(struct ib_mr *mr);
@@ -816,7 +882,10 @@
 						  struct ib_grh *in_grh,
 						  struct ib_mad *in_mad,
 						  struct ib_mad *out_mad);
+	int                        (*mmap)(struct ib_ucontext *context,
+					   struct vm_area_struct *vma);
 
+	struct module               *owner;
 	struct class_device          class_dev;
 	struct kobject               ports_parent;
 	struct list_head             port_list;


From roland at topspin.com  Mon Apr  4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
Message-ID: <200544159.Ahk9l0puXy39U6u6@topspin.com>

Here is an initial implementation of InfiniBand userspace verbs.  I
plan to commit this code to the OpenIB repository shortly, and submit
it for inclusion during the 2.6.13 cycle, so I am posting it early for
comments.

This code, in conjunction with the libibverbs and libmthca userspace
libraries available from the subversion trees at

    https://openib.org/svn/gen2/branches/roland-uverbs/src/userspace/libibverbs
    https://openib.org/svn/gen2/branches/roland-uverbs/src/userspace/libmthca

enables userspace processes to access InfiniBand HCAs directly.

For those not familiar with the InfiniBand architecture, this
so-called "userspace verbs" support allows userspace to post data path
commands directly to the HCA.  Resource allocation and other control
path operations still go through the kernel driver.

Please take a look at this code if you have a chance.  I would
appreciate high-level criticism of the design and implementation as
well as nitpicky complaints about coding style and typos.

In particular, the memory pinning code in uverbs_mem.c could stand
a looking over.  In addition, a sanity check of the write()-based
scheme for passing commands into the kernel in uverbs_main.c and
uverbs_cmd.c is probably worthwhile.

Thanks,
  Roland


From roland at topspin.com  Mon Apr  4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][2/4] IB: userspace verbs main module
In-Reply-To: <200544159.Qg0tUfQc4xGRabsc@topspin.com>
Message-ID: <200544159.3X7p8nZ87qWqA7cv@topspin.com>

Add device-independent userspace verbs support (ib_uverbs module).

Signed-off-by: Roland Dreier <roland at topspin.com>

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/core/uverbs.h	2005-04-04 14:55:10.496227053 -0700
@@ -0,0 +1,124 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id: uverbs.h 2001 2005-03-16 04:15:41Z roland $
+ */
+
+#ifndef UVERBS_H
+#define UVERBS_H
+
+/* Include device.h and fs.h until cdev.h is self-sufficient */
+#include <linux/fs.h>
+#include <linux/device.h>
+#include <linux/cdev.h>
+#include <linux/kref.h>
+#include <linux/idr.h>
+
+#include <ib_verbs.h>
+#include <ib_user_verbs.h>
+
+struct ib_uverbs_device {
+	int                 devnum;
+	struct cdev         dev;
+	struct class_device class_dev;
+	struct ib_device   *ib_dev;
+	int                 num_comp;
+};
+
+struct ib_uverbs_event_file {
+	struct ib_uverbs_file *uverbs_file;
+	spinlock_t             lock;
+	int                    fd;
+	int                    is_async;
+	wait_queue_head_t      poll_wait;
+	struct list_head       event_list;
+};
+
+struct ib_uverbs_file {
+	struct kref                 ref;
+	struct ib_uverbs_device    *device;
+	struct ib_ucontext         *ucontext;
+	struct ib_event_handler     event_handler;
+	struct ib_uverbs_event_file async_file;
+	struct ib_uverbs_event_file comp_file[1];
+};
+
+struct ib_uverbs_async_event {
+	struct ib_uverbs_async_event_desc desc;
+	struct list_head                  list;
+};
+
+struct ib_uverbs_comp_event {
+	struct ib_uverbs_comp_event_desc desc;
+	struct list_head                 list;
+};
+
+struct ib_uobject_mr {
+	struct ib_uobject   uobj;
+	struct page        *page_list;
+	struct scatterlist *sg_list;
+};
+
+extern struct semaphore ib_uverbs_idr_mutex;
+extern struct idr ib_uverbs_pd_idr;
+extern struct idr ib_uverbs_mr_idr;
+extern struct idr ib_uverbs_mw_idr;
+extern struct idr ib_uverbs_ah_idr;
+extern struct idr ib_uverbs_cq_idr;
+extern struct idr ib_uverbs_qp_idr;
+
+void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context);
+void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr);
+void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr);
+
+int ib_umem_get(struct ib_device *dev, struct ib_umem *mem,
+		void *addr, size_t size);
+void ib_umem_release(struct ib_device *dev, struct ib_umem *umem);
+
+#define IB_UVERBS_DECLARE_CMD(name)					\
+	ssize_t ib_uverbs_##name(struct ib_uverbs_file *file,		\
+				 const char __user *buf, int in_len,	\
+				 int out_len)
+
+IB_UVERBS_DECLARE_CMD(query_params);
+IB_UVERBS_DECLARE_CMD(get_context);
+IB_UVERBS_DECLARE_CMD(query_port);
+IB_UVERBS_DECLARE_CMD(alloc_pd);
+IB_UVERBS_DECLARE_CMD(dealloc_pd);
+IB_UVERBS_DECLARE_CMD(reg_mr);
+IB_UVERBS_DECLARE_CMD(dereg_mr);
+IB_UVERBS_DECLARE_CMD(create_cq);
+IB_UVERBS_DECLARE_CMD(destroy_cq);
+IB_UVERBS_DECLARE_CMD(create_qp);
+IB_UVERBS_DECLARE_CMD(modify_qp);
+IB_UVERBS_DECLARE_CMD(destroy_qp);
+
+#endif /* UVERBS_H */
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/core/uverbs_cmd.c	2005-04-04 14:53:12.136965074 -0700
@@ -0,0 +1,790 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id: uverbs_cmd.c 1995 2005-03-15 19:25:10Z roland $
+ */
+
+#include <asm/uaccess.h>
+
+#include "uverbs.h"
+
+ssize_t ib_uverbs_query_params(struct ib_uverbs_file *file,
+			       const char __user *buf,
+			       int in_len, int out_len)
+{
+	struct ib_uverbs_query_params      cmd;
+	struct ib_uverbs_query_params_resp resp;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	resp.num_cq_events = file->device->num_comp;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp))
+		return -EFAULT;
+
+	return in_len;
+}
+
+ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
+			      const char __user *buf,
+			      int in_len, int out_len)
+{
+	struct ib_uverbs_get_context       cmd;
+	struct ib_uverbs_get_context_resp *resp;
+	struct ib_device                  *ibdev = file->device->ib_dev;
+	int outsz;
+	int i;
+	int ret = in_len;
+
+	outsz = sizeof *resp + (file->device->num_comp - 1) * sizeof (__u32);
+
+	if (out_len < outsz)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	resp = kmalloc(outsz, GFP_KERNEL);
+	if (!resp)
+		return -ENOMEM;
+
+	file->ucontext = ibdev->alloc_ucontext(ibdev, buf + sizeof cmd,
+					       in_len - sizeof cmd -
+					       sizeof (struct ib_uverbs_cmd_hdr));
+	if (IS_ERR(file->ucontext)) {
+		ret = PTR_ERR(file->ucontext);
+		file->ucontext = NULL;
+		kfree(resp);
+		return ret;
+	}
+
+	file->ucontext->device = ibdev;
+	INIT_LIST_HEAD(&file->ucontext->pd_list);
+	INIT_LIST_HEAD(&file->ucontext->mr_list);
+	INIT_LIST_HEAD(&file->ucontext->mw_list);
+	INIT_LIST_HEAD(&file->ucontext->cq_list);
+	INIT_LIST_HEAD(&file->ucontext->qp_list);
+	INIT_LIST_HEAD(&file->ucontext->srq_list);
+	INIT_LIST_HEAD(&file->ucontext->ah_list);
+	spin_lock_init(&file->ucontext->lock);
+
+	resp->async_fd  = file->async_file.fd;
+	for (i = 0; i < file->device->num_comp; ++i)
+		resp->cq_fd[i] = file->comp_file[i].fd;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, outsz)) {
+		ibdev->dealloc_ucontext(file->ucontext);
+		file->ucontext = NULL;
+		ret = -EFAULT;
+	}
+
+	kfree(resp);
+	return ret;
+}
+
+ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
+			     const char __user *buf,
+			     int in_len, int out_len)
+{
+	struct ib_uverbs_query_port      cmd;
+	struct ib_uverbs_query_port_resp resp;
+	struct ib_port_attr              attr;
+	int                              ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	ret = ib_query_port(file->device->ib_dev, cmd.port_num, &attr);
+	if (ret)
+		return ret;
+
+	resp.state 	     = attr.state;
+	resp.max_mtu 	     = attr.max_mtu;
+	resp.active_mtu      = attr.active_mtu;
+	resp.gid_tbl_len     = attr.gid_tbl_len;
+	resp.port_cap_flags  = attr.port_cap_flags;
+	resp.max_msg_sz      = attr.max_msg_sz;
+	resp.bad_pkey_cntr   = attr.bad_pkey_cntr;
+	resp.qkey_viol_cntr  = attr.qkey_viol_cntr;
+	resp.pkey_tbl_len    = attr.pkey_tbl_len;
+	resp.lid 	     = attr.lid;
+	resp.sm_lid 	     = attr.sm_lid;
+	resp.lmc 	     = attr.lmc;
+	resp.max_vl_num      = attr.max_vl_num;
+	resp.sm_sl 	     = attr.sm_sl;
+	resp.subnet_timeout  = attr.subnet_timeout;
+	resp.init_type_reply = attr.init_type_reply;
+	resp.active_width    = attr.active_width;
+	resp.active_speed    = attr.active_speed;
+	resp.phys_state      = attr.phys_state;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp))
+		return -EFAULT;
+
+	return in_len;
+}
+
+ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
+			   const char __user *buf,
+			   int in_len, int out_len)
+{
+	struct ib_uverbs_alloc_pd      cmd;
+	struct ib_uverbs_alloc_pd_resp resp;
+	struct ib_uobject             *uobj;
+	struct ib_pd                  *pd;
+	int                            ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+	if (!uobj)
+		return -ENOMEM;
+
+	uobj->context = file->ucontext;
+
+	pd = file->device->ib_dev->alloc_pd(file->device->ib_dev,
+					    file->ucontext, buf + sizeof cmd,
+					    in_len - sizeof cmd -
+					    sizeof (struct ib_uverbs_cmd_hdr));
+	if (IS_ERR(pd)) {
+		ret = PTR_ERR(pd);
+		goto err;
+	}
+
+	pd->device  = file->device->ib_dev;
+	pd->uobject = uobj;
+	atomic_set(&pd->usecnt, 0);
+
+retry:
+	if (!idr_pre_get(&ib_uverbs_pd_idr, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_pd;
+	}
+
+	down(&ib_uverbs_idr_mutex);
+	ret = idr_get_new(&ib_uverbs_pd_idr, pd, &uobj->id);
+	up(&ib_uverbs_idr_mutex);
+
+	if (ret == -EAGAIN)
+		goto retry;
+	if (ret)
+		goto err_pd;
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_add_tail(&uobj->list, &file->ucontext->pd_list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	resp.pd_handle = uobj->id;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp)) {
+		ret = -EFAULT;
+		goto err_list;
+	}
+
+	return in_len;
+
+err_list:
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&uobj->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	down(&ib_uverbs_idr_mutex);
+	idr_remove(&ib_uverbs_pd_idr, uobj->id);
+	up(&ib_uverbs_idr_mutex);
+
+err_pd:
+	ib_dealloc_pd(pd);
+
+err:
+	kfree(uobj);
+	return ret;
+}
+
+ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
+			     const char __user *buf,
+			     int in_len, int out_len)
+{
+	struct ib_uverbs_dealloc_pd cmd;
+	struct ib_pd               *pd;
+	int                         ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+
+	pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle);
+	if (!pd || pd->uobject->context != file->ucontext)
+		goto out;
+
+	ret = ib_dealloc_pd(pd);
+	if (ret)
+		goto out;
+
+	idr_remove(&ib_uverbs_pd_idr, cmd.pd_handle);
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&pd->uobject->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	kfree(pd->uobject);
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	return ret ? ret : in_len;
+}
+
+ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
+			 const char __user *buf, int in_len,
+			 int out_len)
+{
+	struct ib_uverbs_reg_mr      cmd;
+	struct ib_uverbs_reg_mr_resp resp;
+	struct ib_umem_object       *obj;
+	struct ib_pd                *pd;
+	struct ib_mr                *mr;
+	int                          ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
+		return -EINVAL;
+
+	obj = kmalloc(sizeof *obj, GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+
+	obj->uobject.context = file->ucontext;
+
+	ret = ib_umem_get(file->device->ib_dev, &obj->umem,
+			  (void *) (unsigned long) cmd.start,
+			  cmd.length);
+	if (ret)
+		goto err_free;
+
+	obj->umem.virt_base = cmd.hca_va;
+
+	down(&ib_uverbs_idr_mutex);
+
+	pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle);
+	if (!pd || pd->uobject->context != file->ucontext) {
+		ret = -EINVAL;
+		goto err_up;
+	}
+
+	if (!pd->device->reg_user_mr) {
+		ret = -ENOSYS;
+		goto err_up;
+	}
+
+	mr = pd->device->reg_user_mr(pd, &obj->umem,
+				     cmd.access_flags,
+				     buf + sizeof cmd,
+				     in_len - sizeof cmd -
+				     sizeof (struct ib_uverbs_cmd_hdr));
+	if (IS_ERR(mr)) {
+		ret = PTR_ERR(mr);
+		goto err_up;
+	}
+
+	mr->device  = pd->device;
+	mr->pd      = pd;
+	mr->uobject = &obj->uobject;
+	atomic_inc(&pd->usecnt);
+	atomic_set(&mr->usecnt, 0);
+
+	resp.lkey = mr->lkey;
+	resp.rkey = mr->rkey;
+
+retry:
+	if (!idr_pre_get(&ib_uverbs_mr_idr, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_unreg;
+	}
+
+	ret = idr_get_new(&ib_uverbs_mr_idr, mr, &obj->uobject.id);
+
+	if (ret == -EAGAIN)
+		goto retry;
+	if (ret)
+		goto err_unreg;
+
+	resp.mr_handle = obj->uobject.id;
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_add_tail(&obj->uobject.list, &file->ucontext->mr_list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp)) {
+		ret = -EFAULT;
+		goto err_list;
+	}
+
+	up(&ib_uverbs_idr_mutex);
+
+	return in_len;
+
+err_list:
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&obj->uobject.list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+err_unreg:
+	ib_dereg_mr(mr);
+
+err_up:
+	up(&ib_uverbs_idr_mutex);
+
+	ib_umem_release(file->device->ib_dev, &obj->umem);
+
+err_free:
+	kfree(obj);
+	return ret;
+}
+
+ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
+			   const char __user *buf, int in_len,
+			   int out_len)
+{
+	struct ib_uverbs_dereg_mr cmd;
+	struct ib_mr             *mr;
+	struct ib_umem_object    *memobj;
+	int                       ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+
+	mr = idr_find(&ib_uverbs_mr_idr, cmd.mr_handle);
+	if (!mr || mr->uobject->context != file->ucontext)
+		goto out;
+
+	ret = ib_dereg_mr(mr);
+	if (ret)
+		goto out;
+
+	idr_remove(&ib_uverbs_mr_idr, cmd.mr_handle);
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&mr->uobject->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	memobj = container_of(mr->uobject, struct ib_umem_object, uobject);
+	ib_umem_release(file->device->ib_dev, &memobj->umem);
+	kfree(memobj);
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	return ret ? ret : in_len;
+}
+
+ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_create_cq      cmd;
+	struct ib_uverbs_create_cq_resp resp;
+	struct ib_uobject              *uobj;
+	struct ib_cq                   *cq;
+	int                             ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+	if (!uobj)
+		return -ENOMEM;
+
+	uobj->user_handle = cmd.user_handle;
+	uobj->context     = file->ucontext;
+
+	cq = file->device->ib_dev->create_cq(file->device->ib_dev, cmd.cqe,
+					     file->ucontext, buf + sizeof cmd,
+					     in_len - sizeof cmd -
+					     sizeof (struct ib_uverbs_cmd_hdr));
+	if (IS_ERR(cq)) {
+		ret = PTR_ERR(cq);
+		goto err;
+	}
+
+	cq->device        = file->device->ib_dev;
+	cq->uobject       = uobj;
+	cq->comp_handler  = ib_uverbs_comp_handler;
+	cq->event_handler = ib_uverbs_cq_event_handler;
+	cq->cq_context    = file;
+	atomic_set(&cq->usecnt, 0);
+
+retry:
+	if (!idr_pre_get(&ib_uverbs_cq_idr, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_cq;
+	}
+
+	down(&ib_uverbs_idr_mutex);
+	ret = idr_get_new(&ib_uverbs_cq_idr, cq, &uobj->id);
+	up(&ib_uverbs_idr_mutex);
+
+	if (ret == -EAGAIN)
+		goto retry;
+	if (ret)
+		goto err_cq;
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_add_tail(&uobj->list, &file->ucontext->cq_list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	resp.cq_handle = uobj->id;
+	resp.cqe       = cq->cqe;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp)) {
+		ret = -EFAULT;
+		goto err_list;
+	}
+
+	return in_len;
+
+err_list:
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&uobj->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	down(&ib_uverbs_idr_mutex);
+	idr_remove(&ib_uverbs_cq_idr, uobj->id);
+	up(&ib_uverbs_idr_mutex);
+
+err_cq:
+	ib_destroy_cq(cq);
+
+err:
+	kfree(uobj);
+	return ret;
+}
+
+ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
+			     const char __user *buf, int in_len,
+			     int out_len)
+{
+	struct ib_uverbs_destroy_cq cmd;
+	struct ib_cq               *cq;
+	int                         ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+
+	cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle);
+	if (!cq || cq->uobject->context != file->ucontext)
+		goto out;
+
+	ret = ib_destroy_cq(cq);
+	if (ret)
+		goto out;
+
+	idr_remove(&ib_uverbs_cq_idr, cmd.cq_handle);
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&cq->uobject->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	kfree(cq->uobject);
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	return ret ? ret : in_len;
+}
+
+ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_create_qp      cmd;
+	struct ib_uverbs_create_qp_resp resp;
+	struct ib_uobject              *uobj;
+	struct ib_pd                   *pd;
+	struct ib_cq                   *scq, *rcq;
+	struct ib_qp                   *qp;
+	struct ib_qp_init_attr          attr;
+	int ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+	if (!uobj)
+		return -ENOMEM;
+
+	down(&ib_uverbs_idr_mutex);
+
+	pd  = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle);
+	scq = idr_find(&ib_uverbs_cq_idr, cmd.send_cq_handle);
+	rcq = idr_find(&ib_uverbs_cq_idr, cmd.recv_cq_handle);
+
+	if (!pd  || pd->uobject->context  != file->ucontext ||
+	    !scq || scq->uobject->context != file->ucontext ||
+	    !rcq || rcq->uobject->context != file->ucontext) {
+		ret = -EINVAL;
+		goto err_up;
+	}
+
+	attr.event_handler = ib_uverbs_qp_event_handler;
+	attr.qp_context    = file;
+	attr.send_cq       = scq;
+	attr.recv_cq       = rcq;
+	attr.srq           = NULL;
+	attr.sq_sig_type   = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
+	attr.qp_type       = cmd.qp_type;
+
+	attr.cap.max_send_wr     = cmd.max_send_wr;
+	attr.cap.max_recv_wr     = cmd.max_recv_wr;
+	attr.cap.max_send_sge    = cmd.max_send_sge;
+	attr.cap.max_recv_sge    = cmd.max_recv_sge;
+	attr.cap.max_inline_data = cmd.max_inline_data;
+
+	uobj->user_handle = cmd.user_handle;
+	uobj->context     = file->ucontext;
+
+	qp = pd->device->create_qp(pd, &attr, buf + sizeof cmd,
+				   in_len - sizeof cmd -
+				   sizeof (struct ib_uverbs_cmd_hdr));
+	if (IS_ERR(qp)) {
+		ret = PTR_ERR(qp);
+		goto err_up;
+	}
+
+	qp->device     	  = pd->device;
+	qp->pd         	  = pd;
+	qp->send_cq    	  = attr.send_cq;
+	qp->recv_cq    	  = attr.recv_cq;
+	qp->srq	       	  = attr.srq;
+	qp->uobject       = uobj;
+	qp->event_handler = attr.event_handler;
+	qp->qp_context    = attr.qp_context;
+	qp->qp_type	  = attr.qp_type;
+	atomic_inc(&pd->usecnt);
+	atomic_inc(&attr.send_cq->usecnt);
+	atomic_inc(&attr.recv_cq->usecnt);
+	if (attr.srq)
+		atomic_inc(&attr.srq->usecnt);
+
+	resp.qpn = qp->qp_num;
+
+retry:
+	if (!idr_pre_get(&ib_uverbs_qp_idr, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_destroy;
+	}
+
+	ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->id);
+
+	if (ret == -EAGAIN)
+		goto retry;
+	if (ret)
+		goto err_destroy;
+
+	resp.qp_handle = uobj->id;
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_add_tail(&uobj->list, &file->ucontext->qp_list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp)) {
+		ret = -EFAULT;
+		goto err_list;
+	}
+
+	up(&ib_uverbs_idr_mutex);
+
+	return in_len;
+
+err_list:
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&uobj->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+err_destroy:
+	ib_destroy_qp(qp);
+
+err_up:
+	up(&ib_uverbs_idr_mutex);
+
+	kfree(uobj);
+	return ret;
+}
+
+ssize_t ib_uverbs_modify_qp(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_modify_qp cmd;
+	struct ib_qp              *qp;
+	struct ib_qp_attr         *attr;
+	int                        ret;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	attr = kmalloc(sizeof *attr, GFP_KERNEL);
+	if (!attr)
+		return -ENOMEM;
+
+	down(&ib_uverbs_idr_mutex);
+
+	qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle);
+	if (!qp || qp->uobject->context != file->ucontext) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	attr->qp_state 		  = cmd.qp_state;
+	attr->cur_qp_state 	  = cmd.cur_qp_state;
+	attr->path_mtu 		  = cmd.path_mtu;
+	attr->path_mig_state 	  = cmd.path_mig_state;
+	attr->qkey 		  = cmd.qkey;
+	attr->rq_psn 		  = cmd.rq_psn;
+	attr->sq_psn 		  = cmd.sq_psn;
+	attr->dest_qp_num 	  = cmd.dest_qp_num;
+	attr->qp_access_flags 	  = cmd.qp_access_flags;
+	attr->pkey_index 	  = cmd.pkey_index;
+	attr->alt_pkey_index 	  = cmd.alt_pkey_index;
+	attr->en_sqd_async_notify = cmd.en_sqd_async_notify;
+	attr->max_rd_atomic 	  = cmd.max_rd_atomic;
+	attr->max_dest_rd_atomic  = cmd.max_dest_rd_atomic;
+	attr->min_rnr_timer 	  = cmd.min_rnr_timer;
+	attr->port_num 		  = cmd.port_num;
+	attr->timeout 		  = cmd.timeout;
+	attr->retry_cnt 	  = cmd.retry_cnt;
+	attr->rnr_retry 	  = cmd.rnr_retry;
+	attr->alt_port_num 	  = cmd.alt_port_num;
+	attr->alt_timeout 	  = cmd.alt_timeout;
+
+	memcpy(attr->ah_attr.grh.dgid.raw, cmd.dest.dgid, 16);
+	attr->ah_attr.grh.flow_label        = cmd.dest.flow_label;
+	attr->ah_attr.grh.sgid_index        = cmd.dest.sgid_index;
+	attr->ah_attr.grh.hop_limit         = cmd.dest.hop_limit;
+	attr->ah_attr.grh.traffic_class     = cmd.dest.traffic_class;
+	attr->ah_attr.dlid 	    	    = cmd.dest.dlid;
+	attr->ah_attr.sl   	    	    = cmd.dest.sl;
+	attr->ah_attr.src_path_bits 	    = cmd.dest.src_path_bits;
+	attr->ah_attr.static_rate   	    = cmd.dest.static_rate;
+	attr->ah_attr.ah_flags 	    	    = cmd.dest.is_global ? IB_AH_GRH : 0;
+	attr->ah_attr.port_num 	    	    = cmd.dest.port_num;
+
+	memcpy(attr->alt_ah_attr.grh.dgid.raw, cmd.alt_dest.dgid, 16);
+	attr->alt_ah_attr.grh.flow_label    = cmd.alt_dest.flow_label;
+	attr->alt_ah_attr.grh.sgid_index    = cmd.alt_dest.sgid_index;
+	attr->alt_ah_attr.grh.hop_limit     = cmd.alt_dest.hop_limit;
+	attr->alt_ah_attr.grh.traffic_class = cmd.alt_dest.traffic_class;
+	attr->alt_ah_attr.dlid 	    	    = cmd.alt_dest.dlid;
+	attr->alt_ah_attr.sl   	    	    = cmd.alt_dest.sl;
+	attr->alt_ah_attr.src_path_bits     = cmd.alt_dest.src_path_bits;
+	attr->alt_ah_attr.static_rate       = cmd.alt_dest.static_rate;
+	attr->alt_ah_attr.ah_flags 	    = cmd.alt_dest.is_global ? IB_AH_GRH : 0;
+	attr->alt_ah_attr.port_num 	    = cmd.alt_dest.port_num;
+
+	ret = ib_modify_qp(qp, attr, cmd.attr_mask);
+	if (ret)
+		goto out;
+
+	ret = in_len;
+
+out:
+	up(&ib_uverbs_idr_mutex);
+	kfree(attr);
+
+	return ret;
+}
+
+ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
+			     const char __user *buf, int in_len,
+			     int out_len)
+{
+	struct ib_uverbs_destroy_qp cmd;
+	struct ib_qp               *qp;
+	int                         ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+
+	qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle);
+	if (!qp || qp->uobject->context != file->ucontext)
+		goto out;
+
+	ret = ib_destroy_qp(qp);
+	if (ret)
+		goto out;
+
+	idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle);
+
+	spin_lock_irq(&file->ucontext->lock);
+	list_del(&qp->uobject->list);
+	spin_unlock_irq(&file->ucontext->lock);
+
+	kfree(qp->uobject);
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	return ret ? ret : in_len;
+}
+
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/core/uverbs_main.c	2005-04-04 14:53:17.824728218 -0700
@@ -0,0 +1,688 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id: uverbs_main.c 2109 2005-04-04 21:10:34Z roland $
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+
+#include <asm/uaccess.h>
+
+#include "uverbs.h"
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("InfiniBand userspace verbs access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define INFINIBANDEVENTFS_MAGIC	0x49426576	/* "IBev" */
+
+enum {
+	IB_UVERBS_MAJOR       = 231,
+	IB_UVERBS_BASE_MINOR  = 128,
+	IB_UVERBS_MAX_DEVICES = 32
+};
+
+#define IB_UVERBS_BASE_DEV	MKDEV(IB_UVERBS_MAJOR, IB_UVERBS_BASE_MINOR)
+
+DECLARE_MUTEX(ib_uverbs_idr_mutex);
+DEFINE_IDR(ib_uverbs_pd_idr);
+DEFINE_IDR(ib_uverbs_mr_idr);
+DEFINE_IDR(ib_uverbs_mw_idr);
+DEFINE_IDR(ib_uverbs_ah_idr);
+DEFINE_IDR(ib_uverbs_cq_idr);
+DEFINE_IDR(ib_uverbs_qp_idr);
+
+static spinlock_t map_lock;
+static DECLARE_BITMAP(dev_map, IB_UVERBS_MAX_DEVICES);
+
+static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
+				     const char __user *buf, int in_len,
+				     int out_len) = {
+	[IB_USER_VERBS_CMD_QUERY_PARAMS]  = ib_uverbs_query_params,
+	[IB_USER_VERBS_CMD_GET_CONTEXT]   = ib_uverbs_get_context,
+	[IB_USER_VERBS_CMD_QUERY_PORT]    = ib_uverbs_query_port,
+	[IB_USER_VERBS_CMD_ALLOC_PD]      = ib_uverbs_alloc_pd,
+	[IB_USER_VERBS_CMD_DEALLOC_PD]    = ib_uverbs_dealloc_pd,
+	[IB_USER_VERBS_CMD_REG_MR]        = ib_uverbs_reg_mr,
+	[IB_USER_VERBS_CMD_DEREG_MR]      = ib_uverbs_dereg_mr,
+	[IB_USER_VERBS_CMD_CREATE_CQ]     = ib_uverbs_create_cq,
+	[IB_USER_VERBS_CMD_DESTROY_CQ]    = ib_uverbs_destroy_cq,
+	[IB_USER_VERBS_CMD_CREATE_QP]     = ib_uverbs_create_qp,
+	[IB_USER_VERBS_CMD_MODIFY_QP]     = ib_uverbs_modify_qp,
+	[IB_USER_VERBS_CMD_DESTROY_QP]    = ib_uverbs_destroy_qp,
+};
+
+static struct vfsmount *uverbs_event_mnt;
+
+static void ib_uverbs_add_one(struct ib_device *device);
+static void ib_uverbs_remove_one(struct ib_device *device);
+
+static int ib_dealloc_ucontext(struct ib_ucontext *context)
+{
+	struct ib_uobject *uobj, *tmp;
+
+	if (!context)
+		return 0;
+
+	/* XXX Free AHs */
+
+	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
+		struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id);
+		idr_remove(&ib_uverbs_qp_idr, uobj->id);
+		ib_destroy_qp(qp);
+		list_del(&uobj->list);
+		kfree(uobj);
+	}
+
+	list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) {
+		struct ib_cq *cq = idr_find(&ib_uverbs_cq_idr, uobj->id);
+		idr_remove(&ib_uverbs_cq_idr, uobj->id);
+		ib_destroy_cq(cq);
+		list_del(&uobj->list);
+		kfree(uobj);
+	}
+
+	/* XXX Free SRQs */
+	/* XXX Free MWs */
+
+	list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) {
+		struct ib_mr *mr = idr_find(&ib_uverbs_mr_idr, uobj->id);
+		struct ib_umem_object *memobj;
+
+		memobj = container_of(uobj, struct ib_umem_object, uobject);
+
+		idr_remove(&ib_uverbs_mr_idr, uobj->id);
+		ib_dereg_mr(mr);
+		ib_umem_release(context->device, &memobj->umem);
+		list_del(&uobj->list);
+		kfree(memobj);
+	}
+
+	list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) {
+		struct ib_pd *pd = idr_find(&ib_uverbs_pd_idr, uobj->id);
+		idr_remove(&ib_uverbs_pd_idr, uobj->id);
+		ib_dealloc_pd(pd);
+		list_del(&uobj->list);
+		kfree(uobj);
+	}
+
+	return context->device->dealloc_ucontext(context);
+}
+
+static void ib_uverbs_release_file(struct kref *ref)
+{
+	struct ib_uverbs_file *file =
+		container_of(ref, struct ib_uverbs_file, ref);
+
+	module_put(file->device->ib_dev->owner);
+	kfree(file);
+}
+
+static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf,
+				    size_t count, loff_t *pos)
+{
+	struct ib_uverbs_event_file *file = filp->private_data;
+	void *event;
+	int eventsz;
+	int ret = 0;
+
+	spin_lock_irq(&file->lock);
+
+	while (list_empty(&file->event_list) && file->fd >= 0) {
+		spin_unlock_irq(&file->lock);
+
+		if (filp->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+
+		if (wait_event_interruptible(file->poll_wait,
+					     !list_empty(&file->event_list) ||
+					     file->fd < 0))
+			return -ERESTARTSYS;
+
+		spin_lock_irq(&file->lock);
+	}
+
+	if (file->fd < 0) {
+		spin_unlock_irq(&file->lock);
+		return -ENODEV;
+	}
+
+	if (file->is_async) {
+		event   = list_entry(file->event_list.next,
+				     struct ib_uverbs_async_event, list);
+		eventsz = sizeof (struct ib_uverbs_async_event_desc);
+	} else {
+		event   = list_entry(file->event_list.next,
+				     struct ib_uverbs_comp_event, list);
+		eventsz = sizeof (struct ib_uverbs_comp_event_desc);
+	}
+
+	if (eventsz > count) {
+		ret   = -EINVAL;
+		event = NULL;
+	} else
+		list_del(file->event_list.next);
+
+	spin_unlock_irq(&file->lock);
+
+	if (event) {
+		if (copy_to_user(buf, event, eventsz))
+			ret = -EFAULT;
+		else
+			ret = eventsz;
+	}
+
+	kfree(event);
+
+	return ret;
+}
+
+static unsigned int ib_uverbs_event_poll(struct file *filp,
+					 struct poll_table_struct *wait)
+{
+	unsigned int pollflags = 0;
+	struct ib_uverbs_event_file *file = filp->private_data;
+
+	poll_wait(filp, &file->poll_wait, wait);
+
+	spin_lock_irq(&file->lock);
+	if (file->fd < 0)
+		pollflags = POLLERR;
+	else if (!list_empty(&file->event_list))
+		pollflags = POLLIN | POLLRDNORM;
+	spin_unlock_irq(&file->lock);
+
+	return pollflags;
+}
+
+static void ib_uverbs_event_release(struct ib_uverbs_event_file *file)
+{
+	struct list_head *entry, *tmp;
+	int put = 0;
+
+	spin_lock_irq(&file->lock);
+	if (file->fd != -1) {
+		put      = 1;
+		file->fd = -1;
+		list_for_each_safe(entry, tmp, &file->event_list)
+			if (file->is_async)
+				kfree(list_entry(entry, struct ib_uverbs_async_event, list));
+			else
+				kfree(list_entry(entry, struct ib_uverbs_comp_event, list));
+	}
+	spin_unlock_irq(&file->lock);
+
+	if (put)
+		kref_put(&file->uverbs_file->ref, ib_uverbs_release_file);
+
+}
+
+static int ib_uverbs_event_close(struct inode *inode, struct file *filp)
+{
+	struct ib_uverbs_event_file *file = filp->private_data;
+
+	ib_uverbs_event_release(file);
+
+	return 0;
+}
+
+static struct file_operations uverbs_event_fops = {
+	/*
+	 * No .owner field since we artificially create event files,
+	 * so there is no increment to the module reference count in
+	 * the open path.  All event files come from a uverbs command
+	 * file, which already takes a module reference, so this is OK.
+	 */
+	.read 	 = ib_uverbs_event_read,
+	.poll    = ib_uverbs_event_poll,
+	.release = ib_uverbs_event_close
+};
+
+void ib_uverbs_comp_handler(struct ib_cq *cq, void *cq_context)
+{
+	struct ib_uverbs_file       *file = cq_context;
+	struct ib_uverbs_comp_event *entry;
+	unsigned long                flags;
+
+	entry = kmalloc(sizeof *entry, GFP_ATOMIC);
+	if (!entry)
+		return;
+
+	entry->desc.cq_handle = cq->uobject->user_handle;
+
+	spin_lock_irqsave(&file->comp_file[0].lock, flags);
+	list_add_tail(&entry->list, &file->comp_file[0].event_list);
+	spin_unlock_irqrestore(&file->comp_file[0].lock, flags);
+
+	wake_up_interruptible(&file->comp_file[0].poll_wait);
+}
+
+void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr)
+{
+
+}
+
+void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr)
+{
+
+}
+
+static void ib_uverbs_event_handler(struct ib_event_handler *handler,
+				    struct ib_event *event)
+{
+	struct ib_uverbs_file *file =
+		container_of(handler, struct ib_uverbs_file, event_handler);
+	struct ib_uverbs_async_event *entry;
+	unsigned long flags;
+
+	entry = kmalloc(sizeof *entry, GFP_ATOMIC);
+	if (!entry)
+		return;
+
+	entry->desc.event_type = event->event;
+	entry->desc.element    = event->element.port_num;
+
+	spin_lock_irqsave(&file->async_file.lock, flags);
+	list_add_tail(&entry->list, &file->async_file.event_list);
+	spin_unlock_irqrestore(&file->async_file.lock, flags);
+
+	wake_up_interruptible(&file->async_file.poll_wait);
+}
+
+static int ib_uverbs_event_init(struct ib_uverbs_event_file *file,
+				struct ib_uverbs_file *uverbs_file)
+{
+	struct file *filp;
+
+	spin_lock_init(&file->lock);
+	INIT_LIST_HEAD(&file->event_list);
+	init_waitqueue_head(&file->poll_wait);
+	file->uverbs_file = uverbs_file;
+
+	file->fd = get_unused_fd();
+	if (file->fd < 0)
+		return file->fd;
+
+	filp = get_empty_filp();
+	if (!filp) {
+		put_unused_fd(file->fd);
+		return -ENFILE;
+	}
+
+	filp->f_op 	   = &uverbs_event_fops;
+	filp->f_vfsmnt 	   = mntget(uverbs_event_mnt);
+	filp->f_dentry 	   = dget(uverbs_event_mnt->mnt_root);
+	filp->f_mapping    = filp->f_dentry->d_inode->i_mapping;
+	filp->f_flags      = O_RDONLY;
+	filp->f_mode       = FMODE_READ;
+	filp->private_data = file;
+
+	fd_install(file->fd, filp);
+
+	return 0;
+}
+
+static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
+			     size_t count, loff_t *pos)
+{
+	struct ib_uverbs_file *file = filp->private_data;
+	struct ib_uverbs_cmd_hdr hdr;
+
+	if (count < sizeof hdr)
+		return -EINVAL;
+
+	if (copy_from_user(&hdr, buf, sizeof hdr))
+		return -EFAULT;
+
+	if (hdr.in_words * 4 != count)
+		return -EINVAL;
+
+	if (hdr.command < 0 || hdr.command >= ARRAY_SIZE(uverbs_cmd_table))
+		return -EINVAL;
+
+	if (!file->ucontext                               &&
+	    hdr.command != IB_USER_VERBS_CMD_QUERY_PARAMS &&
+	    hdr.command != IB_USER_VERBS_CMD_GET_CONTEXT)
+		return -EINVAL;
+
+	return uverbs_cmd_table[hdr.command](file, buf + sizeof hdr,
+					     hdr.in_words * 4, hdr.out_words * 4);
+}
+
+static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ib_uverbs_file *file = filp->private_data;
+
+	return file->device->ib_dev->mmap(file->ucontext, vma);
+}
+
+static int ib_uverbs_open(struct inode *inode, struct file *filp)
+{
+	struct ib_uverbs_device *dev =
+		container_of(inode->i_cdev, struct ib_uverbs_device, dev);
+	struct ib_uverbs_file *file;
+	int i = 0;
+	int ret;
+
+	if (!try_module_get(dev->ib_dev->owner))
+		return -ENODEV;
+
+	file = kmalloc(sizeof *file +
+		       (dev->num_comp - 1) * sizeof (struct ib_uverbs_event_file),
+		       GFP_KERNEL);
+	if (!file)
+		return -ENOMEM;
+
+	file->device = dev;
+	kref_init(&file->ref);
+
+	file->ucontext = NULL;
+
+	ret = ib_uverbs_event_init(&file->async_file, file);
+	if (ret)
+		goto err;
+
+	file->async_file.is_async = 1;
+
+	kref_get(&file->ref);
+
+	for (i = 0; i < dev->num_comp; ++i) {
+		ret = ib_uverbs_event_init(&file->comp_file[i], file);
+		if (ret)
+			goto err_async;
+		kref_get(&file->ref);
+		file->comp_file[i].is_async = 0;
+	}
+
+	filp->private_data = file;
+
+	INIT_IB_EVENT_HANDLER(&file->event_handler, dev->ib_dev,
+			      ib_uverbs_event_handler);
+	ret = ib_register_event_handler(&file->event_handler);
+	if (ret)
+		goto err_async;
+
+	return 0;
+
+err_async:
+	while (i--)
+		ib_uverbs_event_release(&file->comp_file[i]);
+
+	ib_uverbs_event_release(&file->async_file);
+
+err:
+	kref_put(&file->ref, ib_uverbs_release_file);
+
+	return ret;
+}
+
+static int ib_uverbs_close(struct inode *inode, struct file *filp)
+{
+	struct ib_uverbs_file *file = filp->private_data;
+	int i;
+
+	ib_unregister_event_handler(&file->event_handler);
+	ib_uverbs_event_release(&file->async_file);
+	ib_dealloc_ucontext(file->ucontext);
+
+	for (i = 0; i < file->device->num_comp; ++i)
+		ib_uverbs_event_release(&file->comp_file[i]);
+
+	kref_put(&file->ref, ib_uverbs_release_file);
+
+	return 0;
+}
+
+static struct file_operations uverbs_fops = {
+	.owner 	 = THIS_MODULE,
+	.write 	 = ib_uverbs_write,
+	.open 	 = ib_uverbs_open,
+	.release = ib_uverbs_close
+};
+
+static struct file_operations uverbs_mmap_fops = {
+	.owner 	 = THIS_MODULE,
+	.write 	 = ib_uverbs_write,
+	.mmap    = ib_uverbs_mmap,
+	.open 	 = ib_uverbs_open,
+	.release = ib_uverbs_close
+};
+
+static struct ib_client uverbs_client = {
+	.name   = "uverbs",
+	.add    = ib_uverbs_add_one,
+	.remove = ib_uverbs_remove_one
+};
+
+static ssize_t show_dev(struct class_device *class_dev, char *buf)
+{
+	struct ib_uverbs_device *dev =
+		container_of(class_dev, struct ib_uverbs_device, class_dev);
+
+	return print_dev_t(buf, dev->dev.dev);
+}
+static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL);
+
+static ssize_t show_ibdev(struct class_device *class_dev, char *buf)
+{
+	struct ib_uverbs_device *dev =
+		container_of(class_dev, struct ib_uverbs_device, class_dev);
+
+	return sprintf(buf, "%s\n", dev->ib_dev->name);
+}
+static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL);
+
+static void ib_uverbs_release_class_dev(struct class_device *class_dev)
+{
+	struct ib_uverbs_device *dev =
+		container_of(class_dev, struct ib_uverbs_device, class_dev);
+
+	cdev_del(&dev->dev);
+	clear_bit(dev->devnum, dev_map);
+	kfree(dev);
+}
+
+static struct class uverbs_class = {
+	.name    = "infiniband_verbs",
+	.release = ib_uverbs_release_class_dev
+};
+
+static ssize_t show_abi_version(struct class *class, char *buf)
+{
+	return sprintf(buf, "%d\n", IB_USER_VERBS_ABI_VERSION);
+}
+static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
+
+static void ib_uverbs_add_one(struct ib_device *device)
+{
+	struct ib_uverbs_device *uverbs_dev;
+
+	if (!device->alloc_ucontext)
+		return;
+
+	uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL);
+	if (!uverbs_dev)
+		return;
+
+	memset(uverbs_dev, 0, sizeof *uverbs_dev);
+
+	spin_lock(&map_lock);
+	uverbs_dev->devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES);
+	if (uverbs_dev->devnum >= IB_UVERBS_MAX_DEVICES) {
+		spin_unlock(&map_lock);
+		goto err;
+	}
+	set_bit(uverbs_dev->devnum, dev_map);
+	spin_unlock(&map_lock);
+
+	uverbs_dev->ib_dev   = device;
+	uverbs_dev->num_comp = 1;
+
+	if (device->mmap)
+		cdev_init(&uverbs_dev->dev, &uverbs_mmap_fops);
+	else
+		cdev_init(&uverbs_dev->dev, &uverbs_fops);
+	uverbs_dev->dev.owner = THIS_MODULE;
+	kobject_set_name(&uverbs_dev->dev.kobj, "uverbs%d", uverbs_dev->devnum);
+	if (cdev_add(&uverbs_dev->dev, IB_UVERBS_BASE_DEV + uverbs_dev->devnum, 1))
+		goto err;
+
+	uverbs_dev->class_dev.class = &uverbs_class;
+	uverbs_dev->class_dev.dev   = device->dma_device;
+	snprintf(uverbs_dev->class_dev.class_id, BUS_ID_SIZE, "uverbs%d", uverbs_dev->devnum);
+	if (class_device_register(&uverbs_dev->class_dev))
+		goto err_cdev;
+
+	if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_dev))
+		goto err_class;
+	if (class_device_create_file(&uverbs_dev->class_dev, &class_device_attr_ibdev))
+		goto err_class;
+
+	ib_set_client_data(device, &uverbs_client, uverbs_dev);
+
+	return;
+
+err_class:
+	class_device_unregister(&uverbs_dev->class_dev);
+
+err_cdev:
+	cdev_del(&uverbs_dev->dev);
+	clear_bit(uverbs_dev->devnum, dev_map);
+
+err:
+	kfree(uverbs_dev);
+	return;
+}
+
+static void ib_uverbs_remove_one(struct ib_device *device)
+{
+	struct ib_uverbs_device *uverbs_dev = ib_get_client_data(device, &uverbs_client);
+
+	if (!uverbs_dev)
+		return;
+
+	class_device_unregister(&uverbs_dev->class_dev);
+}
+
+static struct super_block *uverbs_event_get_sb(struct file_system_type *fs_type, int flags,
+					       const char *dev_name, void *data)
+{
+	return get_sb_pseudo(fs_type, "infinibandevent:", NULL,
+			     INFINIBANDEVENTFS_MAGIC);
+}
+
+static struct file_system_type uverbs_event_fs = {
+	/* No owner field so module can be unloaded */
+	.name    = "infinibandeventfs",
+	.get_sb  = uverbs_event_get_sb,
+	.kill_sb = kill_litter_super
+};
+
+static int __init ib_uverbs_init(void)
+{
+	int ret;
+
+	spin_lock_init(&map_lock);
+
+	ret = register_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES,
+				     "infiniband_verbs");
+	if (ret) {
+		printk(KERN_ERR "user_verbs: couldn't register device number\n");
+		goto out;
+	}
+
+	ret = class_register(&uverbs_class);
+	if (ret) {
+		printk(KERN_ERR "user_verbs: couldn't create class infiniband_verbs\n");
+		goto out_chrdev;
+	}
+
+	ret = class_create_file(&uverbs_class, &class_attr_abi_version);
+	if (ret) {
+		printk(KERN_ERR "user_verbs: couldn't create abi_version attribute\n");
+		goto out_class;
+	}
+
+	ret = register_filesystem(&uverbs_event_fs);
+	if (ret) {
+		printk(KERN_ERR "user_verbs: couldn't register infinibandeventfs\n");
+		goto out_class;
+	}
+
+	uverbs_event_mnt = kern_mount(&uverbs_event_fs);
+	if (IS_ERR(uverbs_event_mnt)) {
+		ret = PTR_ERR(uverbs_event_mnt);
+		printk(KERN_ERR "user_verbs: couldn't mount infinibandeventfs\n");
+		goto out_fs;
+	}
+
+	ret = ib_register_client(&uverbs_client);
+	if (ret) {
+		printk(KERN_ERR "user_verbs: couldn't register client\n");
+		goto out_mnt;
+	}
+
+	return 0;
+
+out_mnt:
+	mntput(uverbs_event_mnt);
+
+out_fs:
+	unregister_filesystem(&uverbs_event_fs);
+
+out_class:
+	class_unregister(&uverbs_class);
+
+out_chrdev:
+	unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES);
+
+out:
+	return ret;
+}
+
+static void __exit ib_uverbs_cleanup(void)
+{
+	ib_unregister_client(&uverbs_client);
+	unregister_filesystem(&uverbs_event_fs);
+	mntput(uverbs_event_mnt);
+	class_unregister(&uverbs_class);
+	unregister_chrdev_region(IB_UVERBS_BASE_DEV, IB_UVERBS_MAX_DEVICES);
+}
+
+module_init(ib_uverbs_init);
+module_exit(ib_uverbs_cleanup);
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/core/uverbs_mem.c	2005-04-04 14:53:17.825728001 -0700
@@ -0,0 +1,202 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id: uverbs_mem.c 1979 2005-03-11 21:17:00Z roland $
+ */
+
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+
+#include "uverbs.h"
+
+static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem)
+{
+	struct ib_umem_chunk *chunk, *tmp;
+	int i;
+
+	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
+		dma_unmap_sg(dev->dma_device, chunk->page_list,
+			     chunk->nents, DMA_BIDIRECTIONAL);
+		for (i = 0; i < chunk->nents; ++i) {
+			set_page_dirty_lock(chunk->page_list[i].page);
+			put_page(chunk->page_list[i].page);
+		}
+
+		kfree(chunk);
+	}
+}
+
+static void __ib_umem_unmark(struct ib_umem *umem, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	unsigned long cur_base;
+
+	vma = find_vma(mm, umem->user_base);
+
+	for (cur_base = umem->user_base;
+	     cur_base < umem->user_base + umem->length;
+	     vma = vma->vm_next) {
+		if (!vma || vma->vm_start > umem->user_base + umem->length)
+			break;
+
+		if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+			vma->vm_flags &= ~VM_DONTCOPY;
+
+		cur_base = vma->vm_end;
+	}
+}
+
+int ib_umem_get(struct ib_device *dev, struct ib_umem *mem,
+		void *addr, size_t size)
+{
+	struct page **page_list;
+	struct vm_area_struct *vma;
+	struct ib_umem_chunk *chunk;
+	unsigned long cur_base;
+	int npages;
+	int ret = 0;
+	int off;
+	int i;
+
+	page_list = (struct page **) __get_free_page(GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	mem->user_base = (unsigned long) addr;
+	mem->length    = size;
+	mem->offset    = (unsigned long) addr & ~PAGE_MASK;
+	mem->page_size = PAGE_SIZE;
+
+	INIT_LIST_HEAD(&mem->chunk_list);
+
+	npages   = PAGE_ALIGN(size + mem->offset) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+
+	vma = find_vma(current->mm, mem->user_base);
+
+	for (cur_base = mem->user_base;
+	     cur_base < mem->user_base + size;
+	     vma = vma->vm_next) {
+		if (!vma || vma->vm_start > cur_base) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+			vma->vm_flags |= VM_DONTCOPY;
+
+		cur_base = vma->vm_end;
+	}
+
+	cur_base = (unsigned long) addr & PAGE_MASK;
+
+	while (npages) {
+		ret = get_user_pages(current, current->mm, cur_base,
+				     min_t(int, npages,
+					   PAGE_SIZE / sizeof (struct page *)),
+				     1, 0, page_list, NULL);
+
+		if (ret < 0)
+			goto out;
+
+		cur_base += ret * PAGE_SIZE;
+		npages   -= ret;
+
+		off = 0;
+
+		while (ret) {
+			chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) *
+					min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
+					GFP_KERNEL);
+			if (!chunk) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
+			for (i = 0; i < chunk->nents; ++i) {
+				chunk->page_list[i].page   = page_list[i + off];
+				chunk->page_list[i].offset = 0;
+				chunk->page_list[i].length = PAGE_SIZE;
+			}
+
+			chunk->nmap = dma_map_sg(dev->dma_device,
+						 &chunk->page_list[0],
+						 chunk->nents,
+						 DMA_BIDIRECTIONAL);
+			if (chunk->nmap <= 0) {
+				for (i = 0; i < chunk->nents; ++i)
+					put_page(chunk->page_list[i].page);
+				kfree(chunk);
+
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			ret -= chunk->nents;
+			off += chunk->nents;
+			list_add_tail(&chunk->list, &mem->chunk_list);
+		}
+
+		ret = 0;
+	}
+
+out:
+	if (ret < 0) {
+		__ib_umem_unmark(mem, current->mm);
+		__ib_umem_release(dev, mem);
+	}
+
+	up_write(&current->mm->mmap_sem);
+	free_page((unsigned long) page_list);
+
+	return ret;
+}
+
+void ib_umem_release(struct ib_device *dev, struct ib_umem *umem)
+{
+	struct mm_struct *mm;
+
+	mm = get_task_mm(current);
+
+	if (mm) {
+		down_write(&mm->mmap_sem);
+		__ib_umem_unmark(umem, mm);
+	}
+
+	__ib_umem_release(dev, umem);
+
+	if (mm) {
+		up_write(&mm->mmap_sem);
+		mmput(mm);
+	}
+}
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/include/ib_user_verbs.h	2005-04-04 14:55:47.946083444 -0700
@@ -0,0 +1,275 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id: ib_user_verbs.h 2001 2005-03-16 04:15:41Z roland $
+ */
+
+#ifndef IB_USER_VERBS_H
+#define IB_USER_VERBS_H
+
+#include <linux/types.h>
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+#define IB_USER_VERBS_ABI_VERSION	1
+
+enum {
+	IB_USER_VERBS_CMD_QUERY_PARAMS,
+	IB_USER_VERBS_CMD_GET_CONTEXT,
+	IB_USER_VERBS_CMD_QUERY_PORT,
+	IB_USER_VERBS_CMD_ALLOC_PD,
+	IB_USER_VERBS_CMD_DEALLOC_PD,
+	IB_USER_VERBS_CMD_REG_MR,
+	IB_USER_VERBS_CMD_DEREG_MR,
+	IB_USER_VERBS_CMD_CREATE_CQ,
+	IB_USER_VERBS_CMD_DESTROY_CQ,
+	IB_USER_VERBS_CMD_CREATE_QP,
+	IB_USER_VERBS_CMD_MODIFY_QP,
+	IB_USER_VERBS_CMD_DESTROY_QP,
+};
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct ib_uverbs_async_event_desc {
+	__u64 element;
+	__u32 event_type;	/* enum ib_event_type */
+	__u32 reserved;
+};
+
+struct ib_uverbs_comp_event_desc {
+	__u64 cq_handle;
+};
+
+/*
+ * All commands from userspace should start with a __u32 command field
+ * followed by __u16 in_words and out_words fields (which give the
+ * length of the command block and response buffer if any in 32-bit
+ * words).  The kernel driver will read these fields first and read
+ * the rest of the command struct based on these values.
+ */
+
+struct ib_uverbs_cmd_hdr {
+	__u32 command;
+	__u16 in_words;
+	__u16 out_words;
+};
+
+/*
+ * No driver_data for "query params" command, since this is intended
+ * to be a core function with no possible device dependence.
+ */
+struct ib_uverbs_query_params {
+	__u64 response;
+};
+
+struct ib_uverbs_query_params_resp {
+	__u32 num_cq_events;
+};
+
+struct ib_uverbs_get_context {
+	__u64 response;
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_get_context_resp {
+	__u32 async_fd;
+	__u32 cq_fd[1];
+};
+
+struct ib_uverbs_query_port {
+	__u64 response;
+	__u8  port_num;
+	__u8  reserved[7];
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_query_port_resp {
+	__u32 port_cap_flags;
+	__u32 max_msg_sz;
+	__u32 bad_pkey_cntr;
+	__u32 qkey_viol_cntr;
+	__u32 gid_tbl_len;
+	__u16 pkey_tbl_len;
+	__u16 lid;
+	__u16 sm_lid;
+	__u8  state;
+	__u8  max_mtu;
+	__u8  active_mtu;
+	__u8  lmc;
+	__u8  max_vl_num;
+	__u8  sm_sl;
+	__u8  subnet_timeout;
+	__u8  init_type_reply;
+	__u8  active_width;
+	__u8  active_speed;
+	__u8  phys_state;
+	__u8  reserved[3];
+};
+
+struct ib_uverbs_alloc_pd {
+	__u64 response;
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_alloc_pd_resp {
+	__u32 pd_handle;
+};
+
+struct ib_uverbs_dealloc_pd {
+	__u32 pd_handle;
+};
+
+struct ib_uverbs_reg_mr {
+	__u64 response;
+	__u64 start;
+	__u64 length;
+	__u64 hca_va;
+	__u32 pd_handle;
+	__u32 access_flags;
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_reg_mr_resp {
+	__u32 mr_handle;
+	__u32 lkey;
+	__u32 rkey;
+};
+
+struct ib_uverbs_dereg_mr {
+	__u32 mr_handle;
+};
+
+struct ib_uverbs_create_cq {
+	__u64 response;
+	__u64 user_handle;
+	__u32 cqe;
+	__u32 reserved;
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_create_cq_resp {
+	__u32 cq_handle;
+	__u32 cqe;
+};
+
+struct ib_uverbs_destroy_cq {
+	__u32 cq_handle;
+};
+
+struct ib_uverbs_create_qp {
+	__u64 response;
+	__u64 user_handle;
+	__u32 pd_handle;
+	__u32 send_cq_handle;
+	__u32 recv_cq_handle;
+	__u32 srq_handle;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+	__u8  sq_sig_all;
+	__u8  qp_type;
+	__u8  is_srq;
+	__u8  reserved;
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_create_qp_resp {
+	__u32 qp_handle;
+	__u32 qpn;
+};
+
+/*
+ * This struct needs to remain a multiple of 8 bytes to keep the
+ * alignment of the modify QP parameters.
+ */
+struct ib_uverbs_qp_dest {
+	__u8  dgid[16];
+	__u32 flow_label;
+	__u16 dlid;
+	__u16 reserved;
+	__u8  sgid_index;
+	__u8  hop_limit;
+	__u8  traffic_class;
+	__u8  sl;
+	__u8  src_path_bits;
+	__u8  static_rate;
+	__u8  is_global;
+	__u8  port_num;
+};
+
+struct ib_uverbs_modify_qp {
+	struct ib_uverbs_qp_dest dest;
+	struct ib_uverbs_qp_dest alt_dest;
+	__u32 qp_handle;
+	__u32 attr_mask;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8  qp_state;
+	__u8  cur_qp_state;
+	__u8  path_mtu;
+	__u8  path_mig_state;
+	__u8  en_sqd_async_notify;
+	__u8  max_rd_atomic;
+	__u8  max_dest_rd_atomic;
+	__u8  min_rnr_timer;
+	__u8  port_num;
+	__u8  timeout;
+	__u8  retry_cnt;
+	__u8  rnr_retry;
+	__u8  alt_port_num;
+	__u8  alt_timeout;
+	__u8  reserved[2];
+	__u64 driver_data[0];
+};
+
+struct ib_uverbs_modify_qp_resp {
+};
+
+struct ib_uverbs_destroy_qp {
+	__u32 qp_handle;
+};
+
+#endif /* IB_USER_VERBS_H */
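
A userspace caller might fill in the command header described in
ib_user_verbs.h roughly as follows.  This is only a sketch under stated
assumptions: struct uverbs_hdr is a local mirror of struct
ib_uverbs_cmd_hdr using <stdint.h> types, and uverbs_fill_hdr() is a
hypothetical helper that is not part of the patch.  It refuses lengths
that ib_uverbs_write() would reject (the write length must equal
in_words * 4) or that do not fit in the 16-bit word counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct ib_uverbs_cmd_hdr from the patch (assumed layout). */
struct uverbs_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

/*
 * Hypothetical helper: fill in the header for a command block of
 * in_bytes (header included) and a response buffer of out_bytes.
 * Returns 0 on success, -1 if a length is not a multiple of 4, is
 * shorter than the header, or does not fit in a 16-bit word count.
 */
static int uverbs_fill_hdr(struct uverbs_hdr *hdr, uint32_t command,
			   size_t in_bytes, size_t out_bytes)
{
	if (in_bytes < sizeof *hdr || in_bytes % 4 || out_bytes % 4)
		return -1;
	if (in_bytes / 4 > UINT16_MAX || out_bytes / 4 > UINT16_MAX)
		return -1;

	hdr->command   = command;
	hdr->in_words  = (uint16_t) (in_bytes / 4);
	hdr->out_words = (uint16_t) (out_bytes / 4);
	return 0;
}
```

For example, an "alloc PD" command with no driver data is 16 bytes on
the wire (8-byte header plus the 8-byte response pointer), so in_words
would be 4, and a 4-byte response buffer gives out_words of 1.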


From roland at topspin.com  Mon Apr  4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes
In-Reply-To: <200544159.3X7p8nZ87qWqA7cv@topspin.com>
Message-ID: <200544159.AzH1nqpM3uTQZaKG@topspin.com>

Add Mellanox HCA-specific userspace verbs support to mthca.

Signed-off-by: Roland Dreier <roland at topspin.com>
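
For reviewers, the per-page bookkeeping that mthca_map_user_db() and
mthca_unmap_user_db() add in mthca_memfree.c below can be modelled in
plain C roughly like this.  It is a simplified userspace sketch, not
the driver code: page pinning, DMA mapping and the MAP_ICM command are
left out, DB_REC_PER_PAGE is assumed to be 4096 / 8, and the names
db_page, db_map and db_unmap are made up for illustration.

```c
#include <stdint.h>

#define DB_REC_PER_PAGE (4096 / 8)	/* assumed value of MTHCA_DB_REC_PER_PAGE */

/* Simplified stand-in for one entry of mthca_user_db_table.page[]. */
struct db_page {
	uint64_t uvirt;		/* user address backing this page, 0 if never mapped */
	int      refcount;	/* doorbell records currently in use on this page */
};

/* Take a reference on the page holding doorbell record 'index'. */
static int db_map(struct db_page *pages, int npages, int index, uint64_t uaddr)
{
	int i = index / DB_REC_PER_PAGE;

	if (index < 0 || i >= npages)
		return -1;
	if (pages[i].refcount >= DB_REC_PER_PAGE ||
	    (pages[i].uvirt && pages[i].uvirt != uaddr) ||
	    (uaddr & 4095))
		return -1;

	if (pages[i].refcount) {
		++pages[i].refcount;	/* page already in use: just count it */
		return 0;
	}

	/* First user: the driver pins and maps the page at this point. */
	pages[i].uvirt    = uaddr;
	pages[i].refcount = 1;
	return 0;
}

/* Drop a reference; the page stays mapped until the table is torn down. */
static void db_unmap(struct db_page *pages, int index)
{
	--pages[index / DB_REC_PER_PAGE].refcount;
}
```

In this model two doorbell records on the same page share one mapping
as long as both calls pass the same page-aligned user address, and
dropping a reference never clears uvirt.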

--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-04 14:57:12.228756073 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_cq.c	2005-04-04 14:58:12.364679525 -0700
@@ -743,6 +743,7 @@
 }
 
 int mthca_init_cq(struct mthca_dev *dev, int nent,
+		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq)
 {
 	int size = nent * MTHCA_CQ_ENTRY_SIZE;
@@ -754,30 +755,33 @@
 
 	might_sleep();
 
-	cq->ibcq.cqe = nent - 1;
+	cq->ibcq.cqe  = nent - 1;
+	cq->is_kernel = !ctx;
 
 	cq->cqn = mthca_alloc(&dev->cq_table.alloc);
 	if (cq->cqn == -1)
 		return -ENOMEM;
 
 	if (mthca_is_memfree(dev)) {
-		cq->arm_sn = 1;
-
 		err = mthca_table_get(dev, dev->cq_table.table, cq->cqn);
 		if (err)
 			goto err_out;
 
-		err = -ENOMEM;
+		if (cq->is_kernel) {
+			cq->arm_sn = 1;
+
+			err = -ENOMEM;
 
-		cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI,
-						     cq->cqn, &cq->set_ci_db);
-		if (cq->set_ci_db_index < 0)
-			goto err_out_icm;
-
-		cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM,
-						  cq->cqn, &cq->arm_db);
-		if (cq->arm_db_index < 0)
-			goto err_out_ci;
+			cq->set_ci_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_SET_CI,
+							     cq->cqn, &cq->set_ci_db);
+			if (cq->set_ci_db_index < 0)
+				goto err_out_icm;
+
+			cq->arm_db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_CQ_ARM,
+							  cq->cqn, &cq->arm_db);
+			if (cq->arm_db_index < 0)
+				goto err_out_ci;
+		}
 	}
 
 	mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA,
@@ -787,12 +791,14 @@
 
 	cq_context = MAILBOX_ALIGN(mailbox);
 
-	err = mthca_alloc_cq_buf(dev, size, cq);
-	if (err)
-		goto err_out_mailbox;
+	if (cq->is_kernel) {
+		err = mthca_alloc_cq_buf(dev, size, cq);
+		if (err)
+			goto err_out_mailbox;
 
-	for (i = 0; i < nent; ++i)
-		set_cqe_hw(get_cqe(cq, i));
+		for (i = 0; i < nent; ++i)
+			set_cqe_hw(get_cqe(cq, i));
+	}
 
 	spin_lock_init(&cq->lock);
 	atomic_set(&cq->refcount, 1);
@@ -803,11 +809,14 @@
 						  MTHCA_CQ_STATE_DISARMED |
 						  MTHCA_CQ_FLAG_TR);
 	cq_context->start           = cpu_to_be64(0);
-	cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 |
-						  dev->driver_uar.index);
+	cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24);
+	if (ctx)
+		cq_context->logsize_usrpage |= cpu_to_be32(ctx->uar.index);
+	else
+		cq_context->logsize_usrpage |= cpu_to_be32(dev->driver_uar.index);
 	cq_context->error_eqn       = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn);
 	cq_context->comp_eqn        = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn);
-	cq_context->pd              = cpu_to_be32(dev->driver_pd.pd_num);
+	cq_context->pd              = cpu_to_be32(pdn);
 	cq_context->lkey            = cpu_to_be32(cq->mr.ibmr.lkey);
 	cq_context->cqn             = cpu_to_be32(cq->cqn);
 
@@ -845,17 +854,19 @@
 	return 0;
 
 err_out_free_mr:
-	mthca_free_mr(dev, &cq->mr);
-	mthca_free_cq_buf(dev, cq);
+	if (cq->is_kernel) {
+		mthca_free_mr(dev, &cq->mr);
+		mthca_free_cq_buf(dev, cq);
+	}
 
 err_out_mailbox:
 	kfree(mailbox);
 
-	if (mthca_is_memfree(dev))
+	if (cq->is_kernel && mthca_is_memfree(dev))
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM, cq->arm_db_index);
 
 err_out_ci:
-	if (mthca_is_memfree(dev))
+	if (cq->is_kernel && mthca_is_memfree(dev))
 		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
 
 err_out_icm:
@@ -895,7 +906,8 @@
 		int j;
 
 		printk(KERN_ERR "context for CQN %x (cons index %x, next sw %d)\n",
-		       cq->cqn, cq->cons_index, !!next_cqe_sw(cq));
+		       cq->cqn, cq->cons_index,
+		       cq->is_kernel ? !!next_cqe_sw(cq) : 0);
 		for (j = 0; j < 16; ++j)
 			printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j]));
 	}
@@ -913,15 +925,17 @@
 	atomic_dec(&cq->refcount);
 	wait_event(cq->wait, !atomic_read(&cq->refcount));
 
-	mthca_free_mr(dev, &cq->mr);
-	mthca_free_cq_buf(dev, cq);
-
-	if (mthca_is_memfree(dev)) {
-		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM,    cq->arm_db_index);
-		mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
-		mthca_table_put(dev, dev->cq_table.table, cq->cqn);
+	if (cq->is_kernel) {
+		mthca_free_mr(dev, &cq->mr);
+		mthca_free_cq_buf(dev, cq);
+		if (mthca_is_memfree(dev)) {
+			mthca_free_db(dev, MTHCA_DB_TYPE_CQ_ARM,    cq->arm_db_index);
+			mthca_free_db(dev, MTHCA_DB_TYPE_CQ_SET_CI, cq->set_ci_db_index);
+		}
 	}
 
+	if (mthca_is_memfree(dev))
+		mthca_table_put(dev, dev->cq_table.table, cq->cqn);
 	mthca_free(&dev->cq_table.alloc, cq->cqn);
 	kfree(mailbox);
 }
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-04 14:57:12.254750421 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-04 14:58:12.411669307 -0700
@@ -49,14 +49,6 @@
 #define DRV_VERSION	"0.06-pre"
 #define DRV_RELDATE	"November 8, 2004"
 
-/* XXX remove once SINAI defines make it into kernel.org */
-#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD
-#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
-#endif
-#ifndef PCI_DEVICE_ID_MELLANOX_SINAI
-#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
-#endif
-
 enum {
 	MTHCA_FLAG_DDR_HIDDEN = 1 << 1,
 	MTHCA_FLAG_SRQ        = 1 << 2,
@@ -413,6 +405,7 @@
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
+		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
 void mthca_free_cq(struct mthca_dev *dev,
 		   struct mthca_cq *cq);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-04 14:57:12.256749986 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.c	2005-04-04 14:58:12.412669090 -0700
@@ -45,6 +45,15 @@
 	MTHCA_TABLE_CHUNK_SIZE = 1 << 18
 };
 
+struct mthca_user_db_table {
+	struct semaphore mutex;
+	struct {
+		u64                uvirt;
+		struct scatterlist mem;
+		int                refcount;
+	}                page[0];
+};
+
 void mthca_free_icm(struct mthca_dev *dev, struct mthca_icm *icm)
 {
 	struct mthca_icm_chunk *chunk, *tmp;
@@ -334,13 +343,132 @@
 	kfree(table);
 }
 
-static u64 mthca_uarc_virt(struct mthca_dev *dev, int page)
+static u64 mthca_uarc_virt(struct mthca_dev *dev, struct mthca_uar *uar, int page)
 {
 	return dev->uar_table.uarc_base +
-		dev->driver_uar.index * dev->uar_table.uarc_size +
+		uar->index * dev->uar_table.uarc_size +
 		page * 4096;
 }
 
+int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
+		      struct mthca_user_db_table *db_tab, int index, u64 uaddr)
+{
+	int ret = 0;
+	u8 status;
+	int i;
+
+	if (!mthca_is_memfree(dev))
+		return 0;
+
+	if (index < 0 || index >= dev->uar_table.uarc_size / 8)
+		return -EINVAL;
+
+	down(&db_tab->mutex);
+
+	i = index / MTHCA_DB_REC_PER_PAGE;
+
+	if ((db_tab->page[i].refcount >= MTHCA_DB_REC_PER_PAGE)       ||
+	    (db_tab->page[i].uvirt && db_tab->page[i].uvirt != uaddr) ||
+	    (uaddr & 4095)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (db_tab->page[i].refcount) {
+		++db_tab->page[i].refcount;
+		goto out;
+	}
+
+	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
+			     &db_tab->page[i].mem.page, NULL);
+	if (ret < 0)
+		goto out;
+
+	db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK;
+
+	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
+	if (ret < 0) {
+		put_page(db_tab->page[i].mem.page);
+		goto out;
+	}
+
+	ret = mthca_MAP_ICM_page(dev, sg_dma_address(&db_tab->page[i].mem),
+				 mthca_uarc_virt(dev, uar, i), &status);
+	if (!ret && status)
+		ret = -EINVAL;
+	if (ret) {
+		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
+		put_page(db_tab->page[i].mem.page);
+		goto out;
+	}
+
+	db_tab->page[i].uvirt    = uaddr;
+	db_tab->page[i].refcount = 1;
+
+out:
+	up(&db_tab->mutex);
+	return ret;
+}
+
+void mthca_unmap_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
+			 struct mthca_user_db_table *db_tab, int index)
+{
+	if (!mthca_is_memfree(dev))
+		return;
+
+	/*
+	 * To make our bookkeeping simpler, we don't unmap DB
+	 * pages until we clean up the whole db table.
+	 */
+
+	down(&db_tab->mutex);
+
+	--db_tab->page[index / MTHCA_DB_REC_PER_PAGE].refcount;
+
+	up(&db_tab->mutex);
+}
+
+struct mthca_user_db_table *mthca_init_user_db_tab(struct mthca_dev *dev)
+{
+	struct mthca_user_db_table *db_tab;
+	int npages;
+	int i;
+
+	if (!mthca_is_memfree(dev))
+		return NULL;
+
+	npages = dev->uar_table.uarc_size / 4096;
+	db_tab = kmalloc(sizeof *db_tab + npages * sizeof *db_tab->page, GFP_KERNEL);
+	if (!db_tab)
+		return ERR_PTR(-ENOMEM);
+
+	init_MUTEX(&db_tab->mutex);
+	for (i = 0; i < npages; ++i) {
+		db_tab->page[i].refcount = 0;
+		db_tab->page[i].uvirt    = 0;
+	}
+
+	return db_tab;
+}
+
+void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
+			       struct mthca_user_db_table *db_tab)
+{
+	int i;
+	u8 status;
+
+	if (!mthca_is_memfree(dev))
+		return;
+
+	for (i = 0; i < dev->uar_table.uarc_size / 4096; ++i) {
+		if (db_tab->page[i].uvirt) {
+			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1, &status);
+			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
+			put_page(db_tab->page[i].mem.page);
+		}
+	}
+}
+
 int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db)
 {
 	int group;
@@ -397,7 +525,8 @@
 	}
 	memset(page->db_rec, 0, 4096);
 
-	ret = mthca_MAP_ICM_page(dev, page->mapping, mthca_uarc_virt(dev, i), &status);
+	ret = mthca_MAP_ICM_page(dev, page->mapping,
+				 mthca_uarc_virt(dev, &dev->driver_uar, i), &status);
 	if (!ret && status)
 		ret = -EINVAL;
 	if (ret) {
@@ -451,7 +580,7 @@
 
 	if (bitmap_empty(page->used, MTHCA_DB_REC_PER_PAGE) &&
 	    i >= dev->db_tab->max_group1 - 1) {
-		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status);
+		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status);
 		
 		dma_free_coherent(&dev->pdev->dev, 4096,
 				  page->db_rec, page->mapping);
@@ -520,7 +649,7 @@
 		if (!bitmap_empty(dev->db_tab->page[i].used, MTHCA_DB_REC_PER_PAGE))
 			mthca_warn(dev, "Kernel UARC page %d not empty\n", i);
 
-		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, i), 1, &status);
+		mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, &dev->driver_uar, i), 1, &status);
 		
 		dma_free_coherent(&dev->pdev->dev, 4096,
 				  dev->db_tab->page[i].db_rec,
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-04-04 14:57:12.256749986 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_memfree.h	2005-04-04 14:58:12.413668872 -0700
@@ -148,7 +148,7 @@
 	struct semaphore      mutex;
 };
 
-enum {
+enum mthca_db_type {
 	MTHCA_DB_TYPE_INVALID   = 0x0,
 	MTHCA_DB_TYPE_CQ_SET_CI = 0x1,
 	MTHCA_DB_TYPE_CQ_ARM    = 0x2,
@@ -158,6 +158,17 @@
 	MTHCA_DB_TYPE_GROUP_SEP = 0x7
 };
 
+struct mthca_user_db_table;
+struct mthca_uar;
+
+int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
+		      struct mthca_user_db_table *db_tab, int index, u64 uaddr);
+void mthca_unmap_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
+			 struct mthca_user_db_table *db_tab, int index);
+struct mthca_user_db_table *mthca_init_user_db_tab(struct mthca_dev *dev);
+void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
+			       struct mthca_user_db_table *db_tab);
+
 int mthca_init_db_tab(struct mthca_dev *dev);
 void mthca_cleanup_db_tab(struct mthca_dev *dev);
 int mthca_alloc_db(struct mthca_dev *dev, int type, u32 qn, u32 **db);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-04 14:57:12.286743464 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.c	2005-04-04 14:58:12.444662133 -0700
@@ -29,13 +29,17 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: mthca_provider.c 2100 2005-03-31 20:43:01Z roland $
+ * $Id: mthca_provider.c 2109 2005-04-04 21:10:34Z roland $
  */
 
+#include <asm/uaccess.h>
+
 #include <ib_smi.h>
 
 #include "mthca_dev.h"
 #include "mthca_cmd.h"
+#include "mthca_user.h"
+#include "mthca_memfree.h"
 
 static int mthca_query_device(struct ib_device *ibdev,
 			      struct ib_device_attr *props)
@@ -283,11 +287,78 @@
 	return err;
 }
 
-static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev)
+static struct ib_ucontext *mthca_alloc_ucontext(struct ib_device *ibdev,
+						const void __user *udata, int udatalen)
+{
+	struct mthca_alloc_ucontext      ucmd;
+	struct mthca_alloc_ucontext_resp uresp;
+	struct mthca_ucontext           *context;
+	int                              err;
+
+	if (copy_from_user(&ucmd, udata, sizeof ucmd))
+		return ERR_PTR(-EFAULT);
+
+	uresp.qp_tab_size = to_mdev(ibdev)->limits.num_qps;
+	if (mthca_is_memfree(to_mdev(ibdev)))
+		uresp.uarc_size = to_mdev(ibdev)->uar_table.uarc_size;
+	else
+		uresp.uarc_size = 0;
+
+	context = kmalloc(sizeof *context, GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+
+	err = mthca_uar_alloc(to_mdev(ibdev), &context->uar);
+	if (err) {
+		kfree(context);
+		return ERR_PTR(err);
+	}
+
+	context->db_tab = mthca_init_user_db_tab(to_mdev(ibdev));
+	if (IS_ERR(context->db_tab)) {
+		err = PTR_ERR(context->db_tab);
+		mthca_uar_free(to_mdev(ibdev), &context->uar);
+		kfree(context);
+		return ERR_PTR(err);
+	}
+
+	if (copy_to_user((void __user *) (unsigned long) ucmd.respbuf,
+			 &uresp, sizeof uresp)) {
+		mthca_cleanup_user_db_tab(to_mdev(ibdev), &context->uar, context->db_tab);
+		mthca_uar_free(to_mdev(ibdev), &context->uar);
+		kfree(context);
+		return ERR_PTR(-EFAULT);
+	}
+
+	return &context->ibucontext;
+}
+
+static int mthca_dealloc_ucontext(struct ib_ucontext *context)
 {
+	mthca_cleanup_user_db_tab(to_mdev(context->device), &to_mucontext(context)->uar,
+				  to_mucontext(context)->db_tab);
+	mthca_uar_free(to_mdev(context->device), &to_mucontext(context)->uar);
+	kfree(to_mucontext(context));
+
+	return 0;
+}
+
+static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev,
+				    struct ib_ucontext *context,
+				    const void __user *udata, int udatalen)
+{
+	struct mthca_alloc_pd ucmd;
 	struct mthca_pd *pd;
 	int err;
 
+	if (context) {
+		if (udatalen != sizeof ucmd)
+			return ERR_PTR(-EINVAL);
+
+		if (copy_from_user(&ucmd, udata, sizeof ucmd))
+			return ERR_PTR(-EFAULT);
+	}
+
 	pd = kmalloc(sizeof *pd, GFP_KERNEL);
 	if (!pd)
 		return ERR_PTR(-ENOMEM);
@@ -298,6 +369,14 @@
 		return ERR_PTR(err);
 	}
 
+	if (context) {
+		if (put_user(pd->pd_num, (u32 __user *) (unsigned long) ucmd.pdnbuf)) {
+			mthca_pd_free(to_mdev(ibdev), pd);
+			kfree(pd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
 	return &pd->ibpd;
 }
 
@@ -337,8 +416,10 @@
 }
 
 static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
-				     struct ib_qp_init_attr *init_attr)
+				     struct ib_qp_init_attr *init_attr,
+				     const void __user *udata, int udatalen)
 {
+	struct mthca_create_qp ucmd;
 	struct mthca_qp *qp;
 	int err;
 
@@ -347,10 +428,48 @@
 	case IB_QPT_UC:
 	case IB_QPT_UD:
 	{
+		struct mthca_ucontext *context;
+
 		qp = kmalloc(sizeof *qp, GFP_KERNEL);
 		if (!qp)
 			return ERR_PTR(-ENOMEM);
 
+		if (pd->uobject) {
+			context = to_mucontext(pd->uobject->context);
+
+			if (udatalen != sizeof ucmd)
+				return ERR_PTR(-EINVAL);
+
+			if (copy_from_user(&ucmd, udata, sizeof ucmd))
+				return ERR_PTR(-EFAULT);
+
+			err = mthca_map_user_db(to_mdev(pd->device), &context->uar,
+						context->db_tab,
+						ucmd.sq_db_index, ucmd.sq_db_page);
+			if (err) {
+				kfree(qp);
+				return ERR_PTR(err);
+			}
+
+			err = mthca_map_user_db(to_mdev(pd->device), &context->uar,
+						context->db_tab,
+						ucmd.rq_db_index, ucmd.rq_db_page);
+			if (err) {
+				mthca_unmap_user_db(to_mdev(pd->device),
+						    &context->uar,
+						    context->db_tab,
+						    ucmd.sq_db_index);
+				kfree(qp);
+				return ERR_PTR(err);
+			}
+		}
+
+		if (pd->uobject) {
+			qp->mr.ibmr.lkey = ucmd.lkey;
+			qp->sq.db_index  = ucmd.sq_db_index;
+			qp->rq.db_index  = ucmd.rq_db_index;
+		}
+
 		qp->sq.max    = init_attr->cap.max_send_wr;
 		qp->rq.max    = init_attr->cap.max_recv_wr;
 		qp->sq.max_gs = init_attr->cap.max_send_sge;
@@ -361,12 +480,30 @@
 				     to_mcq(init_attr->recv_cq),
 				     init_attr->qp_type, init_attr->sq_sig_type,
 				     qp);
+
+		if (err && pd->uobject) {
+			context = to_mucontext(pd->uobject->context);
+
+			mthca_unmap_user_db(to_mdev(pd->device),
+					    &context->uar,
+					    context->db_tab,
+					    ucmd.sq_db_index);
+			mthca_unmap_user_db(to_mdev(pd->device),
+					    &context->uar,
+					    context->db_tab,
+					    ucmd.rq_db_index);
+		}
+
 		qp->ibqp.qp_num = qp->qpn;
 		break;
 	}
 	case IB_QPT_SMI:
 	case IB_QPT_GSI:
 	{
+		/* Don't allow userspace to create special QPs */
+		if (pd->uobject)
+			return ERR_PTR(-EINVAL);
+
 		qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL);
 		if (!qp)
 			return ERR_PTR(-ENOMEM);
@@ -396,42 +533,116 @@
 		return ERR_PTR(err);
 	}
 
-        init_attr->cap.max_inline_data = 0;
+	init_attr->cap.max_inline_data = 0;
+	init_attr->cap.max_send_wr     = qp->sq.max;
+	init_attr->cap.max_recv_wr     = qp->rq.max;
 
 	return &qp->ibqp;
 }
 
 static int mthca_destroy_qp(struct ib_qp *qp)
 {
+	if (qp->uobject) {
+		mthca_unmap_user_db(to_mdev(qp->device),
+				    &to_mucontext(qp->uobject->context)->uar,
+				    to_mucontext(qp->uobject->context)->db_tab,
+				    to_mqp(qp)->sq.db_index);
+		mthca_unmap_user_db(to_mdev(qp->device),
+				    &to_mucontext(qp->uobject->context)->uar,
+				    to_mucontext(qp->uobject->context)->db_tab,
+				    to_mqp(qp)->rq.db_index);
+	}
 	mthca_free_qp(to_mdev(qp->device), to_mqp(qp));
 	kfree(qp);
 	return 0;
 }
 
-static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries)
+static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries,
+				     struct ib_ucontext *context,
+				     const void __user *udata, int udatalen)
 {
+	struct mthca_create_cq ucmd;
 	struct mthca_cq *cq;
 	int nent;
 	int err;
 
+	if (context) {
+		if (udatalen != sizeof ucmd)
+			return ERR_PTR(-EINVAL);
+
+		if (copy_from_user(&ucmd, udata, sizeof ucmd))
+			return ERR_PTR(-EFAULT);
+
+		err = mthca_map_user_db(to_mdev(ibdev), &to_mucontext(context)->uar,
+					to_mucontext(context)->db_tab,
+					ucmd.set_db_index, ucmd.set_db_page);
+		if (err)
+			return ERR_PTR(err);
+
+		err = mthca_map_user_db(to_mdev(ibdev), &to_mucontext(context)->uar,
+					to_mucontext(context)->db_tab,
+					ucmd.arm_db_index, ucmd.arm_db_page);
+		if (err)
+			goto err_unmap_set;
+	}
+
 	cq = kmalloc(sizeof *cq, GFP_KERNEL);
-	if (!cq)
-		return ERR_PTR(-ENOMEM);
+	if (!cq) {
+		err = -ENOMEM;
+		goto err_unmap_arm;
+	}
+
+	if (context) {
+		cq->mr.ibmr.lkey    = ucmd.lkey;
+		cq->set_ci_db_index = ucmd.set_db_index;
+		cq->arm_db_index    = ucmd.arm_db_index;
+	}
 
 	for (nent = 1; nent <= entries; nent <<= 1)
 		; /* nothing */
 
-	err = mthca_init_cq(to_mdev(ibdev), nent, cq);
-	if (err) {
-		kfree(cq);
-		cq = ERR_PTR(err);
+	err = mthca_init_cq(to_mdev(ibdev), nent, 
+			    context ? to_mucontext(context) : NULL,
+			    context ? ucmd.pdn : to_mdev(ibdev)->driver_pd.pd_num,
+			    cq);
+	if (err)
+		goto err_free;
+
+	if (context && put_user(cq->cqn, (u32 __user *) (unsigned long) ucmd.cqnbuf)) {
+		mthca_free_cq(to_mdev(ibdev), cq);
+		goto err_free;
 	}
 
 	return &cq->ibcq;
+
+err_free:
+	kfree(cq);
+
+err_unmap_arm:
+	if (context)
+		mthca_unmap_user_db(to_mdev(ibdev), &to_mucontext(context)->uar,
+				    to_mucontext(context)->db_tab, ucmd.arm_db_index);
+
+err_unmap_set:
+	if (context)
+		mthca_unmap_user_db(to_mdev(ibdev), &to_mucontext(context)->uar,
+				    to_mucontext(context)->db_tab, ucmd.set_db_index);
+
+	return ERR_PTR(err);
 }
 
 static int mthca_destroy_cq(struct ib_cq *cq)
 {
+	if (cq->uobject) {
+		mthca_unmap_user_db(to_mdev(cq->device),
+				    &to_mucontext(cq->uobject->context)->uar,
+				    to_mucontext(cq->uobject->context)->db_tab,
+				    to_mcq(cq)->arm_db_index);
+		mthca_unmap_user_db(to_mdev(cq->device),
+				    &to_mucontext(cq->uobject->context)->uar,
+				    to_mucontext(cq->uobject->context)->db_tab,
+				    to_mcq(cq)->set_ci_db_index);
+	}
 	mthca_free_cq(to_mdev(cq->device), to_mcq(cq));
 	kfree(cq);
 
@@ -558,6 +769,57 @@
 				  convert_access(acc), mr);
 
 	if (err) {
+		kfree(page_list);
+		kfree(mr);
+		return ERR_PTR(err);
+	}
+
+	kfree(page_list);
+	return &mr->ibmr;
+}
+
+static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				       int acc, const void __user *udata, int udatalen)
+{
+	struct ib_umem_chunk *chunk;
+	int npages = 0;
+	u64 *page_list;
+	struct mthca_mr *mr;
+	int shift;
+	int i, j, k;
+	int err;
+
+	shift = ffs(region->page_size) - 1;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+	
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		npages += chunk->nents;
+
+	page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL);
+	if (!page_list) {
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	i = 0;
+
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		for (j = 0; j < chunk->nmap; ++j)
+			for (k = 0; k < sg_dma_len(&chunk->page_list[j]) >> shift; ++k)
+				page_list[i++] = sg_dma_address(&chunk->page_list[j]) +
+					region->page_size * k;
+
+	err = mthca_mr_alloc_phys(to_mdev(pd->device),
+				  to_mpd(pd)->pd_num,
+				  page_list, shift, npages,
+				  region->virt_base, region->length,
+				  convert_access(acc), mr);
+
+	if (err) {
+		kfree(page_list);
 		kfree(mr);
 		return ERR_PTR(err);
 	}
@@ -574,6 +836,22 @@
 	return 0;
 }
 
+static int mthca_mmap_uar(struct ib_ucontext *context,
+			  struct vm_area_struct *vma)
+{
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	if (remap_pfn_range(vma, vma->vm_start,
+			    to_mucontext(context)->uar.pfn,
+			    PAGE_SIZE, vma->vm_page_prot))
+		return -EAGAIN;
+
+	return 0;
+}
+
 static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags,
 				      struct ib_fmr_attr *fmr_attr)
 {
@@ -690,6 +968,8 @@
 	int i;
 
 	strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX);
+	dev->ib_dev.owner                = THIS_MODULE;
+
 	dev->ib_dev.node_type            = IB_NODE_CA;
 	dev->ib_dev.phys_port_cnt        = dev->limits.num_ports;
 	dev->ib_dev.dma_device           = &dev->pdev->dev;
@@ -699,6 +979,8 @@
 	dev->ib_dev.modify_port          = mthca_modify_port;
 	dev->ib_dev.query_pkey           = mthca_query_pkey;
 	dev->ib_dev.query_gid            = mthca_query_gid;
+	dev->ib_dev.alloc_ucontext       = mthca_alloc_ucontext;
+	dev->ib_dev.dealloc_ucontext     = mthca_dealloc_ucontext;
 	dev->ib_dev.alloc_pd             = mthca_alloc_pd;
 	dev->ib_dev.dealloc_pd           = mthca_dealloc_pd;
 	dev->ib_dev.create_ah            = mthca_ah_create;
@@ -711,6 +993,7 @@
 	dev->ib_dev.poll_cq              = mthca_poll_cq;
 	dev->ib_dev.get_dma_mr           = mthca_get_dma_mr;
 	dev->ib_dev.reg_phys_mr          = mthca_reg_phys_mr;
+	dev->ib_dev.reg_user_mr          = mthca_reg_user_mr;
 	dev->ib_dev.dereg_mr             = mthca_dereg_mr;
 
 	if (dev->mthca_flags & MTHCA_FLAG_FMR) {
@@ -726,6 +1009,7 @@
 	dev->ib_dev.attach_mcast         = mthca_multicast_attach;
 	dev->ib_dev.detach_mcast         = mthca_multicast_detach;
 	dev->ib_dev.process_mad          = mthca_process_mad;
+	dev->ib_dev.mmap                 = mthca_mmap_uar;
 
 	if (mthca_is_memfree(dev)) {
 		dev->ib_dev.req_notify_cq = mthca_arbel_arm_cq;
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_provider.h	2005-04-04 14:57:12.287743246 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_provider.h	2005-04-04 14:58:12.445661916 -0700
@@ -54,6 +54,14 @@
 	int           index;
 };
 
+struct mthca_user_db_table;
+
+struct mthca_ucontext {
+	struct ib_ucontext          ibucontext;
+	struct mthca_uar            uar;
+	struct mthca_user_db_table *db_tab;
+};
+
 struct mthca_mr {
 	struct ib_mr ibmr;
 	int order;
@@ -167,6 +175,7 @@
 	int                    cqn;
 	u32                    cons_index;
 	int                    is_direct;
+	int                    is_kernel;
 
 	/* Next fields are Arbel only */
 	int                    set_ci_db_index;
@@ -236,6 +245,11 @@
 	dma_addr_t      header_dma;
 };
 
+static inline struct mthca_ucontext *to_mucontext(struct ib_ucontext *ibucontext)
+{
+	return container_of(ibucontext, struct mthca_ucontext, ibucontext);
+}
+
 static inline struct mthca_fmr *to_mfmr(struct ib_fmr *ibmr)
 {
 	return container_of(ibmr, struct mthca_fmr, ibmr);
--- linux-export.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-04 14:57:12.320736072 -0700
+++ linux-export/drivers/infiniband/hw/mthca/mthca_qp.c	2005-04-04 14:58:12.491651915 -0700
@@ -652,7 +652,11 @@
 
 	/* leave arbel_sched_queue as 0 */
 
-	qp_context->usr_page   = cpu_to_be32(dev->driver_uar.index);
+	if (qp->ibqp.uobject)
+		qp_context->usr_page =
+			cpu_to_be32(to_mucontext(qp->ibqp.uobject->context)->uar.index);
+	else
+		qp_context->usr_page = cpu_to_be32(dev->driver_uar.index);
 	qp_context->local_qpn  = cpu_to_be32(qp->qpn);
 	if (attr_mask & IB_QP_DEST_QPN) {
 		qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num);
@@ -917,6 +921,15 @@
 
 	qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift,
 				    1 << qp->sq.wqe_shift);
+
+	/*
+	 * If this is a userspace QP, we don't actually have to
+	 * allocate anything.  All we need is to calculate the WQE
+	 * sizes and the send_wqe_offset, so we're done now.
+	 */
+	if (pd->ibpd.uobject)
+		return 0;
+
 	size = PAGE_ALIGN(qp->send_wqe_offset +
 			  (qp->sq.max << qp->sq.wqe_shift));
 
@@ -1015,10 +1028,33 @@
 	return err;
 }
 
-static int mthca_alloc_memfree(struct mthca_dev *dev,
+static void mthca_free_wqe_buf(struct mthca_dev *dev,
 			       struct mthca_qp *qp)
 {
-	int ret = 0;
+	int i;
+	int size = PAGE_ALIGN(qp->send_wqe_offset +
+			      (qp->sq.max << qp->sq.wqe_shift));
+
+	if (qp->is_direct) {
+		pci_free_consistent(dev->pdev, size,
+				    qp->queue.direct.buf,
+				    pci_unmap_addr(&qp->queue.direct, mapping));
+	} else {
+		for (i = 0; i < size / PAGE_SIZE; ++i) {
+			pci_free_consistent(dev->pdev, PAGE_SIZE,
+					    qp->queue.page_list[i].buf,
+					    pci_unmap_addr(&qp->queue.page_list[i],
+							   mapping));
+		}
+	}
+
+	kfree(qp->wrid);
+}
+
+static int mthca_map_memfree(struct mthca_dev *dev,
+			     struct mthca_qp *qp)
+{
+	int ret;
 
 	if (mthca_is_memfree(dev)) {
 		ret = mthca_table_get(dev, dev->qp_table.qp_table, qp->qpn);
@@ -1029,35 +1065,15 @@
 		if (ret)
 			goto err_qpc;
 
-		ret = mthca_table_get(dev, dev->qp_table.rdb_table,
-				      qp->qpn << dev->qp_table.rdb_shift);
-		if (ret)
-			goto err_eqpc;
-
-		qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ,
-						 qp->qpn, &qp->rq.db);
-		if (qp->rq.db_index < 0) {
-			ret = -ENOMEM;
-			goto err_rdb;
-		}
+ 		ret = mthca_table_get(dev, dev->qp_table.rdb_table,
+ 				      qp->qpn << dev->qp_table.rdb_shift);
+ 		if (ret)
+ 			goto err_eqpc;
 
-		qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ,
-						 qp->qpn, &qp->sq.db);
-		if (qp->sq.db_index < 0) {
-			ret = -ENOMEM;
-			goto err_rq_db;
-		}
 	}
 
 	return 0;
 
-err_rq_db:
-	mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
-
-err_rdb:
-	mthca_table_put(dev, dev->qp_table.rdb_table,
-			qp->qpn << dev->qp_table.rdb_shift);
-
 err_eqpc:
 	mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
 
@@ -1067,16 +1083,43 @@
 	return ret;
 }
 
+static void mthca_unmap_memfree(struct mthca_dev *dev,
+				struct mthca_qp *qp)
+{
+	if (mthca_is_memfree(dev)) {
+ 		mthca_table_put(dev, dev->qp_table.rdb_table,
+ 				qp->qpn << dev->qp_table.rdb_shift);
+		mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
+		mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn);
+	}
+}
+
+static int mthca_alloc_memfree(struct mthca_dev *dev,
+			       struct mthca_qp *qp)
+{
+	int ret = 0;
+
+	if (mthca_is_memfree(dev)) {
+		qp->rq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_RQ,
+						 qp->qpn, &qp->rq.db);
+		if (qp->rq.db_index < 0)
+			return -ENOMEM;
+
+		qp->sq.db_index = mthca_alloc_db(dev, MTHCA_DB_TYPE_SQ,
+						 qp->qpn, &qp->sq.db);
+		if (qp->sq.db_index < 0)
+			mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
+	}
+
+	return ret;
+}
+
 static void mthca_free_memfree(struct mthca_dev *dev,
 			       struct mthca_qp *qp)
 {
 	if (mthca_is_memfree(dev)) {
 		mthca_free_db(dev, MTHCA_DB_TYPE_SQ, qp->sq.db_index);
 		mthca_free_db(dev, MTHCA_DB_TYPE_RQ, qp->rq.db_index);
-		mthca_table_put(dev, dev->qp_table.rdb_table,
-				qp->qpn << dev->qp_table.rdb_shift);
-		mthca_table_put(dev, dev->qp_table.eqp_table, qp->qpn);
-		mthca_table_put(dev, dev->qp_table.qp_table, qp->qpn);
 	}
 }
 
@@ -1108,13 +1151,28 @@
 	mthca_wq_init(&qp->sq);
 	mthca_wq_init(&qp->rq);
 
-	ret = mthca_alloc_memfree(dev, qp);
+	ret = mthca_map_memfree(dev, qp);
 	if (ret)
 		return ret;
 
 	ret = mthca_alloc_wqe_buf(dev, pd, qp);
 	if (ret) {
-		mthca_free_memfree(dev, qp);
+		mthca_unmap_memfree(dev, qp);
+		return ret;
+	}
+
+	/*
+	 * If this is a userspace QP, we're done now.  The doorbells
+	 * will be allocated and buffers will be initialized in
+	 * userspace.
+	 */
+	if (pd->ibpd.uobject)
+		return 0;
+
+	ret = mthca_alloc_memfree(dev, qp);
+	if (ret) {
+		mthca_free_wqe_buf(dev, qp);
+		mthca_unmap_memfree(dev, qp);
 		return ret;
 	}
 
@@ -1274,8 +1332,6 @@
 		   struct mthca_qp *qp)
 {
 	u8 status;
-	int size;
-	int i;
 	struct mthca_cq *send_cq;
 	struct mthca_cq *recv_cq;
 
@@ -1305,31 +1361,22 @@
 	if (qp->state != IB_QPS_RESET)
 		mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status);
 
-	mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn);
-	if (qp->ibqp.send_cq != qp->ibqp.recv_cq)
-		mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn);
-
-	mthca_free_mr(dev, &qp->mr);
-
-	size = PAGE_ALIGN(qp->send_wqe_offset +
-			  (qp->sq.max << qp->sq.wqe_shift));
+	/*
+	 * If this is a userspace QP, the buffers, MR, CQs and so on
+	 * will be cleaned up in userspace, so all we have to do is
+	 * unref the mem-free tables and free the QPN in our table.
+	 */
+	if (!qp->ibqp.uobject) {
+		mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn);
+		if (qp->ibqp.send_cq != qp->ibqp.recv_cq)
+			mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn);
 
-	if (qp->is_direct) {
-		pci_free_consistent(dev->pdev, size,
-				    qp->queue.direct.buf,
-				    pci_unmap_addr(&qp->queue.direct, mapping));
-	} else {
-		for (i = 0; i < size / PAGE_SIZE; ++i) {
-			pci_free_consistent(dev->pdev, PAGE_SIZE,
-					    qp->queue.page_list[i].buf,
-					    pci_unmap_addr(&qp->queue.page_list[i],
-							   mapping));
-		}
+		mthca_free_mr(dev, &qp->mr);
+		mthca_free_memfree(dev, qp);
+		mthca_free_wqe_buf(dev, qp);
 	}
 
-	kfree(qp->wrid);
-
-	mthca_free_memfree(dev, qp);
+	mthca_unmap_memfree(dev, qp);
 
 	if (is_sqp(dev, qp)) {
 		atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count));
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-export/drivers/infiniband/hw/mthca/mthca_user.h	2005-04-04 14:58:12.491651915 -0700
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef MTHCA_USER_H
+#define MTHCA_USER_H
+
+#include <linux/types.h>
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct mthca_alloc_ucontext {
+	__u64 respbuf;
+};
+
+struct mthca_alloc_ucontext_resp {
+	__u32 qp_tab_size;
+	__u32 uarc_size;
+};
+
+struct mthca_alloc_pd {
+	__u64 pdnbuf;
+};
+
+struct mthca_alloc_pd_resp {
+	__u32 pdn;
+	__u32 reserved;
+};
+
+struct mthca_create_cq {
+	__u64 cqnbuf;
+	__u32 lkey;
+	__u32 pdn;
+	__u64 arm_db_page;
+	__u64 set_db_page;
+	__u32 arm_db_index;
+	__u32 set_db_index;
+};
+
+struct mthca_create_cq_resp {
+	__u32 cqn;
+	__u32 reserved;
+};
+
+struct mthca_create_qp {
+	__u32 lkey;
+	__u32 reserved;
+	__u64 sq_db_page;
+	__u64 rq_db_page;
+	__u32 sq_db_index;
+	__u32 rq_db_index;
+};
+
+#endif /* MTHCA_USER_H */
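
The layout rule in the mthca_user.h comment above can be checked at
compile time.  A minimal sketch, assuming uint32_t/uint64_t as userspace
stand-ins for __u32/__u64; the struct name and the two pointer helpers
are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace mirror of struct mthca_create_cq from the patch, with
 * uint32_t/uint64_t standing in for __u32/__u64 (an assumption made
 * so the sketch builds outside the kernel tree).
 */
struct create_cq_sketch {
	uint64_t cqnbuf;
	uint32_t lkey;
	uint32_t pdn;
	uint64_t arm_db_page;
	uint64_t set_db_page;
	uint32_t arm_db_index;
	uint32_t set_db_index;
};

/*
 * No implicit padding: the size equals the sum of the member sizes and
 * every 64-bit field lands on an 8-byte offset, so 32-bit and 64-bit
 * builds agree on the layout.
 */
_Static_assert(sizeof(struct create_cq_sketch) == 40, "unexpected padding");
_Static_assert(offsetof(struct create_cq_sketch, arm_db_page) == 16,
	       "64-bit field not on an 8-byte offset");
_Static_assert(offsetof(struct create_cq_sketch, arm_db_index) == 32,
	       "unexpected offset");

/*
 * Hypothetical helpers: carry a pointer through a 64-bit field, in the
 * spirit of the (unsigned long) casts applied to ucmd.cqnbuf above.
 */
uint64_t ptr_to_u64_sketch(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

void *u64_to_ptr_sketch(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

If a 64-bit member were moved so that it followed a single 32-bit
member, the size assertion would stop holding on ABIs that align 64-bit
fields to 8 bytes, which is the 32-bit userspace / 64-bit kernel
mismatch the header comment warns about.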

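The per-page bookkeeping in mthca_map_user_db() and
mthca_unmap_user_db() can be modelled in userspace.  A rough sketch,
assuming 512 doorbell records per 4096-byte page (4096 / 8) and a
made-up four-page table; page pinning, DMA mapping and the MAP_ICM
command are left out, so only the refcount and address checks are
shown.  All names ending in _sketch are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DB_REC_PER_PAGE_SKETCH 512 /* assumed: 4096-byte page / 8-byte record */
#define NPAGES_SKETCH            4 /* hypothetical table size */

struct db_page_sketch {
	int      refcount; /* doorbell records in use on this page */
	uint64_t uvirt;    /* user address the page was first mapped at */
};

struct db_table_sketch {
	struct db_page_sketch page[NPAGES_SKETCH];
};

/* Returns 0 on success, -EINVAL on a bad index or address. */
int map_user_db_sketch(struct db_table_sketch *tab, int index, uint64_t uaddr)
{
	struct db_page_sketch *pg;

	/* Sketch-local bound: the table here is a fixed array. */
	if (index < 0 || index >= NPAGES_SKETCH * DB_REC_PER_PAGE_SKETCH)
		return -EINVAL;

	pg = &tab->page[index / DB_REC_PER_PAGE_SKETCH];

	/* Same three checks as the patch: full page, address mismatch,
	 * address not 4096-byte aligned. */
	if (pg->refcount >= DB_REC_PER_PAGE_SKETCH ||
	    (pg->uvirt && pg->uvirt != uaddr) ||
	    (uaddr & 4095))
		return -EINVAL;

	if (pg->refcount) {
		/* Page already mapped: just take another reference. */
		++pg->refcount;
		return 0;
	}

	/* First user of the page: the driver would pin and map it here. */
	pg->uvirt    = uaddr;
	pg->refcount = 1;
	return 0;
}

/* Drops one reference; like the patch, it does not unmap the page. */
void unmap_user_db_sketch(struct db_table_sketch *tab, int index)
{
	--tab->page[index / DB_REC_PER_PAGE_SKETCH].refcount;
}
```

With a zeroed table, mapping indices 0 and 1 at the same aligned
address leaves page 0 with a refcount of 2, while mapping index 2 at a
different address is rejected because it falls on the same page.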

From roland at topspin.com  Mon Apr  4 15:09:00 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 4 Apr 2005 15:09:00 -0700
Subject: [openib-general] [PATCH][RFC][4/4] IB: userspace verbs
	Kconfig/Makefile changes
In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
Message-ID: <200544159.LHYjypUjDczyHP7A@topspin.com>

Hook userspace verbs up to Kconfig and Makefile.

Signed-off-by: Roland Dreier <roland at topspin.com>

--- linux-export.orig/drivers/infiniband/Kconfig	2005-04-04 14:58:53.397756926 -0700
+++ linux-export/drivers/infiniband/Kconfig	2005-04-04 15:01:08.716332258 -0700
@@ -7,6 +7,14 @@
 	  any protocols you wish to use as well as drivers for your
 	  InfiniBand hardware.
 
+config INFINIBAND_USER_VERBS
+	tristate "InfiniBand userspace verbs support"
+	depends on INFINIBAND
+	---help---
+	  Userspace InfiniBand verbs support.  This is the kernel side
+	  of userspace verbs.  You will also need libibverbs and a
+	  hardware driver library from <http://www.openib.org>.
+
 source "drivers/infiniband/hw/mthca/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
--- linux-export.orig/drivers/infiniband/core/Makefile	2005-04-04 14:58:53.398756709 -0700
+++ linux-export/drivers/infiniband/core/Makefile	2005-04-04 15:00:44.933503748 -0700
@@ -1,7 +1,8 @@
 EXTRA_CFLAGS += -Idrivers/infiniband/include
 
-obj-$(CONFIG_INFINIBAND) +=	ib_core.o ib_mad.o ib_ping.o \
-				ib_cm.o ib_sa.o ib_umad.o ib_ucm.o
+obj-$(CONFIG_INFINIBAND) +=		ib_core.o ib_mad.o ib_ping.o \
+					ib_cm.o ib_sa.o ib_umad.o ib_ucm.o
+obj-$(CONFIG_INFINIBAND_USER_VERBS) +=	ib_uverbs.o
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
 				device.o fmr_pool.o cache.o
@@ -16,4 +17,6 @@
 
 ib_umad-y :=			user_mad.o
 
+ib_uverbs-y :=			uverbs_main.o uverbs_cmd.o uverbs_mem.o
+
 ib_ucm-y :=			ucm.o


From roland at topspin.com  Mon Apr  4 15:15:12 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 15:15:12 -0700
Subject: [openib-general] Re: IPoIB
In-Reply-To: <1112652482.4490.281.camel@localhost.localdomain> (Hal
	Rosenstock's message of "04 Apr 2005 18:08:03 -0400")
References: <1112652482.4490.281.camel@localhost.localdomain>
Message-ID: <524qemfatr.fsf@topspin.com>

    Hal> A while ago, Tom brought up the issue of IPoIB link level
    Hal> broadcasting from user space (with the arping tool). Is it
    Hal> possible to do this from kernel space ? For example, how
    Hal> would/could sendto() work when sending to a IPoIB link layer
    Hal> address ? If all we wanted to support was broadcast, perhaps
    Hal> there could be a remapping of the ethernet MAC broadcast
    Hal> address to the all hosts MGID and QPN for that IPoIB
    Hal> interface. Or perhaps the entire ipoib pseudoheader should be
    Hal> supported in this mode. This is needed to support
    Hal> RARPing. Some hosts want to RARP for their IP address and
    Hal> this should be supported over IPoIB.

I think it should "just work" with the current setup.
ipoib_hard_header() will look at the skb it's passed, and if there's
no neighbour struct, it will just save off the destination link
address.  Then ipoib_start_xmit() will look at the destination address
and handle multicast link addresses correctly.

 - R.
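
A rough picture of the multicast check mentioned above, as a userspace
sketch rather than the actual ipoib code.  It assumes the 20-byte IPoIB
link-layer address layout (4 bytes of flags and QPN followed by a
16-byte GID) and that multicast GIDs begin with 0xff; the names are
made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define IPOIB_ALEN_SKETCH 20 /* 4 bytes flags + QPN, then a 16-byte GID */

/*
 * Returns 1 if the GID part of the link-layer address is a multicast
 * GID (first GID byte is 0xff), 0 otherwise.
 */
int ipoib_addr_is_multicast_sketch(const uint8_t hwaddr[IPOIB_ALEN_SKETCH])
{
	return hwaddr[4] == 0xff;
}
```

An address whose GID part starts with ff12:... is treated as multicast,
while one whose GID starts with fe80:... is treated as unicast.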


From iod00d at hp.com  Mon Apr  4 15:35:16 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 4 Apr 2005 15:35:16 -0700
Subject: [openib-general] IPoIB
In-Reply-To: <1112652482.4490.281.camel@localhost.localdomain>
References: <1112652482.4490.281.camel@localhost.localdomain>
Message-ID: <20050404223516.GC22119@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 06:08:03PM -0400, Hal Rosenstock wrote:
> A while ago, Tom brought up the issue of IPoIB link level broadcasting
> from user space (with the arping tool). Is it possible to do this from
> kernel space?

I would think any driver can call hard_xmit() for any "NIC".
pktgen.c does.

> For example, how would/could sendto() work when sending
> to a IPoIB link layer address?

Would net/core/pktgen.c help?

	 * A tool for loading the network with preconfigurated packets.
	 * The tool is implemented as a linux module.  Parameters are output
	 * device, delay (to hard_xmit), number of packets, and whether
	 * to use multiple SKBs or just the same one.
	 * pktgen uses the installed interface's output routine.

That's one of the tools I use occasionally for performance analysis.
This certainly would be useful to test TCP/IP <-> IB bridge/router
support in the kernel.

> If all we wanted to support was
> broadcast, perhaps there could be a remapping of the ethernet MAC
> broadcast address to the all hosts MGID and QPN for that IPoIB
> interface. Or perhaps the entire ipoib pseudoheader should be supported
> in this mode. This is needed to support RARPing. Some hosts want to RARP
> for their IP address and this should be supported over IPoIB.

sorry - the above is mostly greek to me...

grant



From tduffy at sun.com  Mon Apr  4 15:49:35 2005
From: tduffy at sun.com (Tom Duffy)
Date: Mon, 04 Apr 2005 15:49:35 -0700
Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca
	changes
In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
References: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
Message-ID: <1112654975.22537.12.camel@duffman>

On Mon, 2005-04-04 at 15:09 -0700, Roland Dreier wrote:
> --- linux-export.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-04 14:57:12.254750421 -0700
> +++ linux-export/drivers/infiniband/hw/mthca/mthca_dev.h	2005-04-04 14:58:12.411669307 -0700
> @@ -49,14 +49,6 @@
>  #define DRV_VERSION	"0.06-pre"
>  #define DRV_RELDATE	"November 8, 2004"
>  
> -/* XXX remove once SINAI defines make it into kernel.org */
> -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI_OLD
> -#define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c
> -#endif
> -#ifndef PCI_DEVICE_ID_MELLANOX_SINAI
> -#define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274
> -#endif
> -
>  enum {
>  	MTHCA_FLAG_DDR_HIDDEN = 1 << 1,
>  	MTHCA_FLAG_SRQ        = 1 << 2,

Now you are really gonna hate me for asking you to put this in, since
you probably did not want to include this in the patch to lkml.

So, maybe Grant was right ;-)

-tduffy

From halr at voltaire.com  Mon Apr  4 15:48:19 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Apr 2005 18:48:19 -0400
Subject: [openib-general] IPoIB
In-Reply-To: <20050404223516.GC22119@esmail.cup.hp.com>
References: <1112652482.4490.281.camel@localhost.localdomain>
	<20050404223516.GC22119@esmail.cup.hp.com>
Message-ID: <1112654898.4490.293.camel@localhost.localdomain>

On Mon, 2005-04-04 at 18:35, Grant Grundler wrote:
> On Mon, Apr 04, 2005 at 06:08:03PM -0400, Hal Rosenstock wrote:
> > A while ago, Tom brought up the issue of IPoIB link level broadcasting
> > from user space (with the arping tool). Is it possible to do this from
> > kernel space?
> 
> I would think any driver can call hard_xmit() for any "NIC".
> pktgen.c does.

Yes, but I was looking at a different "use" case. How does
net/packet/af_packet.c work with link layer sends rather than IP
based sends? Can this be made to work for IPoIB, and if so, how?

> > For example, how would/could sendto() work when sending
> > to a IPoIB link layer address?
> 
> Would net/core/pktgen.c help?

Glancing at pktgen.c, there would need to be some mods made for IPoIB
as IPoIB does not deal with MAC addresses (random src/dest MACs).

pktgen.c uses the driver's transmit routine directly so this is a
different case from what I was describing.

> 	 * A tool for loading the network with preconfigurated packets.
> 	 * The tool is implemented as a linux module.  Parameters are output
> 	 * device, delay (to hard_xmit), number of packets, and whether
> 	 * to use multiple SKBs or just the same one.
> 	 * pktgen uses the installed interface's output routine.
> 
> That's one of the tools I use occasionally for performance analysis.
> This certainly would be useful to test TCP/IP <-> IB bridge/router
> support in the kernel.

Do you mean IB or IP bridge/router ? IB bridges are switches. IB routers
forward at the IB network layer and are not completely specified. I
suspect you mean an IP router with one or more IPoIB interfaces.

-- Hal


From halr at voltaire.com  Mon Apr  4 16:16:28 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Apr 2005 19:16:28 -0400
Subject: [openib-general] Re: IPoIB
In-Reply-To: <524qemfatr.fsf@topspin.com>
References: <1112652482.4490.281.camel@localhost.localdomain>
	<524qemfatr.fsf@topspin.com>
Message-ID: <1112656588.4490.300.camel@localhost.localdomain>

On Mon, 2005-04-04 at 18:15, Roland Dreier wrote:
> I think it should "just work" with the current setup.
> ipoib_hard_header() will look at the skb it's passed, and if there's
> no neighbour struct, it will just save off the destination link
> address.  Then ipoib_start_xmit() will look at the destination address
> and handle multicast link addresses correctly.

That's good to hear. There are going to be some other changes for this.
At a quick glance, ipoib_main.c::ipoib_start_xmit drops a unicast link
level response if it is not ARP. RARP is also possible there, right ?
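
To make the question concrete, here is a minimal sketch of a relaxed
check that lets both ARP and RARP through. It is not the driver's code:
the function name is made up for illustration, and it assumes the
protocol value is held in network byte order, as skb->protocol is.
ETH_P_ARP and ETH_P_RARP are the constants from <linux/if_ether.h>.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>

/*
 * Hypothetical helper (not part of ipoib_main.c): returns nonzero when
 * the protocol field, given in network byte order, is ARP or RARP.
 */
int is_address_resolution(uint16_t proto_be)
{
	return proto_be == htons(ETH_P_ARP) ||
	       proto_be == htons(ETH_P_RARP);
}
```

Whether the driver should filter on protocol at all is a separate
question; this only shows what admitting RARP would look like.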

I'm not sure the Linux code above this is set up to support the larger
link level address needed by IPoIB either.

-- Hal


From libor at topspin.com  Mon Apr  4 16:22:59 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 4 Apr 2005 16:22:59 -0700
Subject: [openib-general] What context can CM be called from?
In-Reply-To: <20050331105141.A1541@topspin.com>;
	from libor@topspin.com on Thu, Mar 31, 2005 at 10:51:41AM -0800
References: <506C3D7B14CDD411A52C00025558DED6064BF21D@mtlex01.yok.mtl.com>
	<52u0muqvt7.fsf@topspin.com> <4249907F.8010101@ichips.intel.com>
	<20050329153826.D31683@topspin.com>
	<4249ED30.3060208@ichips.intel.com>
	<1112290139.4490.20.camel@localhost.localdomain>
	<424C3667.7040909@ichips.intel.com>
	<20050331105141.A1541@topspin.com>
Message-ID: <20050404162259.C10315@topspin.com>

On Thu, Mar 31, 2005 at 10:51:41AM -0800, Libor Michalek wrote:
> On Thu, Mar 31, 2005 at 09:41:59AM -0800, Sean Hefty wrote:
> > Hal Rosenstock wrote:
> > > Is this just the kmalloc in cm_alloc_msg or is there more to this ?
> > 
> > I _think_ that the kmalloc in cm_alloc_msg is all that needs to change.
> 
>   Yes, this CM change should be sufficient. I'm testing it now and
> it looks good. I'll run some more tests and then check in the change.

  Sean, this change works correctly in all of my tests, 
so I checked it in.

-Libor


From roland at topspin.com  Mon Apr  4 16:34:18 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 16:34:18 -0700
Subject: [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca
	changes
In-Reply-To: <1112654975.22537.12.camel@duffman> (Tom Duffy's message of
	"Mon, 04 Apr 2005 15:49:35 -0700")
References: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
	<1112654975.22537.12.camel@duffman>
Message-ID: <52vf72dslh.fsf@topspin.com>

    Tom> Now, you are really gonna hate me for asking you to put this
    Tom> in as you probably did not want to include this in the patch
    Tom> to lkml.

    Tom> So, maybe Grant was right ;-)

Oh well, I didn't read the patches over carefully enough.  Fortunately
it was just my "for review" version.

 - R.


From roland at topspin.com  Mon Apr  4 16:43:21 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 16:43:21 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <20050404055131.GA19409@esmail.cup.hp.com> (Grant Grundler's
	message of "Sun, 3 Apr 2005 22:51:31 -0700")
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<20050404055131.GA19409@esmail.cup.hp.com>
Message-ID: <52hdimds6e.fsf@topspin.com>

    Grant> FMR is a red herring.  I tried SVN r2080 and it has roughly
    Grant> the same performance as r2082 (when FMR was committed) and
    Grant> later r210x.  "packed" attribute is a red herring too.

    Grant> Performance stunk with r2050 and I will do a binary search
    Grant> this week until I sort out which changes doubled the
    Grant> perf. ISTR there was one change related to a "double
    Grant> mapping" issue and I will be tracking that down in a few
    Grant> days.

A binary search to find the changeset that makes the difference would
be really useful.  I read through the svn log from r2046 through r2082
and I don't see anything that should make a difference to IPoIB.

The only changes that seem remotely plausible are

    r2059 "Set skb->mac.raw on receive"
    r2068 "Make address handle verbs usable from interrupt context"

but I don't see how either one could really have an effect.

So I wonder what obvious thing I'm missing...
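
For what it's worth, the search itself is only a handful of builds. A
minimal sketch, assuming performance is slow at the low revision, fast
at the high one, and changes exactly once in between. The predicate
example_is_fast() and its threshold of 2060 are arbitrary placeholders
for "build this revision and measure", not a claim about which
changeset is responsible.

```c
/* Callback: nonzero if the given revision shows the good performance. */
typedef int (*rev_pred)(int rev);

/*
 * Returns the first revision in (slow_rev, fast_rev] for which is_fast()
 * holds, assuming slow_rev is slow, fast_rev is fast, and the result
 * flips only once across the range.
 */
int first_fast_revision(int slow_rev, int fast_rev, rev_pred is_fast)
{
	while (fast_rev - slow_rev > 1) {
		int mid = slow_rev + (fast_rev - slow_rev) / 2;

		if (is_fast(mid))
			fast_rev = mid;	/* the change is at or before mid */
		else
			slow_rev = mid;	/* the change is after mid */
	}
	return fast_rev;
}

/* Placeholder measurement: pretends the improvement landed in r2060. */
int example_is_fast(int rev)
{
	return rev >= 2060;
}
```

Between r2050 and r2080 that is at most five builds.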

 - R.


From roland at topspin.com  Mon Apr  4 16:37:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 04 Apr 2005 16:37:10 -0700
Subject: [openib-general] Re: IPoIB
In-Reply-To: <1112656588.4490.300.camel@localhost.localdomain> (Hal
	Rosenstock's message of "04 Apr 2005 19:16:28 -0400")
References: <1112652482.4490.281.camel@localhost.localdomain>
	<524qemfatr.fsf@topspin.com>
	<1112656588.4490.300.camel@localhost.localdomain>
Message-ID: <52ll7ydsgp.fsf@topspin.com>

    Hal> That's good to hear. There are going to be some other changes
    Hal> for this.  At a quick glance, ipoib_main.c::ipoib_start_xmit
    Hal> drops a unicast link level response if it is not ARP. RARP is
    Hal> also possible there, right ?

Yeah, you're right.  That check can probably just be deleted.  The
driver should trust the kernel to pass it packets it means to send.

    Hal> I'm not sure the Linux code above this is set up to support
    Hal> the larger link level address needed by IPoIB either.

Not sure what you mean by this.

 - R.


From iod00d at hp.com  Mon Apr  4 17:26:23 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 4 Apr 2005 17:26:23 -0700
Subject: [openib-general] IPoIB
In-Reply-To: <1112654898.4490.293.camel@localhost.localdomain>
References: <1112652482.4490.281.camel@localhost.localdomain>
	<20050404223516.GC22119@esmail.cup.hp.com>
	<1112654898.4490.293.camel@localhost.localdomain>
Message-ID: <20050405002623.GH22119@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 06:48:19PM -0400, Hal Rosenstock wrote:
> Do you mean IB or IP bridge/router ? IB bridges are switches. IB routers
> forward at the IB network layer and are not completely specified. I
> suspect you mean an IP router with one or more IPoIB interfaces.

Yes, I was thinking IP router.

thanks,
grant


From halr at voltaire.com  Mon Apr  4 17:44:01 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Apr 2005 20:44:01 -0400
Subject: [openib-general] Re: IPoIB
In-Reply-To: <52ll7ydsgp.fsf@topspin.com>
References: <1112652482.4490.281.camel@localhost.localdomain>
	<524qemfatr.fsf@topspin.com>
	<1112656588.4490.300.camel@localhost.localdomain>
	<52ll7ydsgp.fsf@topspin.com>
Message-ID: <1112661841.4490.320.camel@localhost.localdomain>

On Mon, 2005-04-04 at 19:37, Roland Dreier wrote:
>     Hal> That's good to hear. There are going to be some other changes
>     Hal> for this.  At a quick glance, ipoib_main.c::ipoib_start_xmit
>     Hal> drops a unicast link level response if it is not ARP. RARP is
>     Hal> also possible there, right ?
> 
> Yeah, you're right.  That check can probably just be deleted.  The
> driver should trust the kernel to pass it packets it means to send.

OK. It does look like unicast_arp_send would work for this case if the
ARP check wasn't made. I'll play with this and propose a patch.

>     Hal> I'm not sure the Linux code above this is set up to support
>     Hal> the larger link level address needed by IPoIB either.
> 
> Not sure what you mean by this.

There are a couple of things that might be problematic but I'm not sure.
The first has to do with some data structures:
include/linux/socket.h:
struct sockaddr {
        sa_family_t     sa_family;      /* address family, AF_xxx       */
        char            sa_data[14];    /* 14 bytes of protocol address */
};
sa_data is not large enough for the IPoIB hardware address.
Also, similarly for sll_addr in include/linux/if_packet.h:
struct sockaddr_ll
{
...
        unsigned char   sll_halen;
        unsigned char   sll_addr[8];
};

I'm not sure what all the implications of changing these are.

Then, in net/packet/af_packet.c::packet_sendmsg:
        if (saddr == NULL) {
...
        } else {
                err = -EINVAL;
                if (msg->msg_namelen < sizeof(struct sockaddr_ll))
                        goto out;
                ifindex = saddr->sll_ifindex;
                proto   = saddr->sll_protocol;
                addr    = saddr->sll_addr;
        }
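
A quick userspace sketch of the size mismatch, assuming the IPoIB
hardware address is 20 bytes (a 4-byte flags/QPN word followed by the
16-byte port GID). The helper names and the IPOIB_HW_ADDR_LEN_SKETCH
macro are made up for illustration; only the two structures come from
the system headers.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Assumption: IPoIB link-layer address length in bytes. */
#define IPOIB_HW_ADDR_LEN_SKETCH 20

/* Nonzero if an address of halen bytes fits in sockaddr.sa_data. */
int fits_in_sa_data(size_t halen)
{
	return halen <= sizeof(((struct sockaddr *)0)->sa_data);
}

/* Nonzero if an address of halen bytes fits in sockaddr_ll.sll_addr. */
int fits_in_sll_addr(size_t halen)
{
	return halen <= sizeof(((struct sockaddr_ll *)0)->sll_addr);
}
```

A 6-byte Ethernet MAC passes both checks; a 20-byte address passes
neither with the structure definitions shown above.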

-- Hal


From hozer at hozed.org  Mon Apr  4 19:50:37 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 4 Apr 2005 21:50:37 -0500
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <1112310139.4490.24.camel@localhost.localdomain>
References: <1112190300.4495.67.camel@localhost.localdomain>
	<424B4CA7.1050606@sandia.gov>
	<1112310139.4490.24.camel@localhost.localdomain>
Message-ID: <20050405025037.GR26127@kalmia.hozed.org>

On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote:
> On Wed, 2005-03-30 at 20:04, Josh England wrote: 
> > Are there any plans to modify the linux DHCP client so it would be
> > possible to do kernel-level DHCP and NFSroot over IB?
> 
> I took a quick look at this and it looks pretty straightforward. Stay
> tuned...

I'd say don't.

Using initrd/initramfs is a much better solution. At some point the
in-kernel dhcp is going to get so buggy and old it's going to get
removed.

I boot all my cluster systems with NFS root servers, and I'm trying to
get everything moved to using Debian packaged kernels with initrd's.
With an initrd, you at least have a chance to get a shell and figure out
why you couldn't find your nfs server, instead of "kernel panic, I'm
going to die now" you get with in-kernel dhcp/nfs.


From jjengla at sandia.gov  Mon Apr  4 20:17:43 2005
From: jjengla at sandia.gov (Josh England)
Date: Mon, 04 Apr 2005 20:17:43 -0700
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <20050405025037.GR26127@kalmia.hozed.org>
References: <1112190300.4495.67.camel@localhost.localdomain>
	<424B4CA7.1050606@sandia.gov>
	<1112310139.4490.24.camel@localhost.localdomain>
	<20050405025037.GR26127@kalmia.hozed.org>
Message-ID: <42520357.6090607@sandia.gov>

Troy Benjegerdes wrote:
> On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote:
> 
>>On Wed, 2005-03-30 at 20:04, Josh England wrote: 
>>
>>>Are there any plans to modify the linux DHCP client so it would be
>>>possible to do kernel-level DHCP and NFSroot over IB?
>>
>>I took a quick look at this and it looks pretty straightforward. Stay
>>tuned...
> 
> 
> I'd say don't.
> 
> Using initrd/initramfs is a much better solution. At some point the
> in-kernel dhcp is going to get so buggy and old it's going to get
> removed.

I know...it's just crummy to have to ship another 1.3 Megs out to every node.

> I boot all my cluster systems with NFS root servers, and I'm trying to
> get everything moved to using Debian packaged kernels with initrd's.
> With an initrd, you at least have a chance to get a shell and figure out
> why you couldn't find your nfs server, instead of "kernel panic, I'm
> going to die now" you get with in-kernel dhcp/nfs.

Check out oneSIS (http://onesis.org).  It can build initrds for you that
do NFSroot (and drop to a shell when things go sour).  I'd love to hear
some feedback from people familiar with running NFSroot.

-JE


From mst at mellanox.co.il  Tue Apr  5 00:42:13 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 5 Apr 2005 10:42:13 +0300
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
In-Reply-To: <52is32feq2.fsf@topspin.com>
References: <20050404150235.GZ15034@mellanox.co.il>
	<52is32feq2.fsf@topspin.com>
Message-ID: <20050405074213.GC15034@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [openib-general] [PATCH] SEND_INLINE support in libmthca
> 
> Is the test here correct?
> 
> +				if (s + sizeof *seg > (1 << qp->sq.wqe_shift)) {
> It seems we need to take into account the size of next segment and any
> RDMA segment that we may be posting as well.

Right. Fixed (below).

> Also does it make sense to put the code for gathering inline data
> segments and writing gather lists into an inline function that can be
> called from both the tavor and arbel post send function?  Will gcc
> actually inline this function?
> 
>  - R.
> 

It does get inlined, but the function would have to return
both size and status, so however I rearrange it I get either extra loads/stores
or extra branches.
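
For reference, the bound check and size accounting that such a shared
helper would carry can be sketched standalone. This is an illustration
only, not the patch code: the function names are made up, the 4-byte
header stands for the byte_count word of the inline segment, and
hdr_units plays the role of "size" (counted in 16-byte units). Status
is folded into the return value with 0 as the failure sentinel, which
still costs the caller a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
size_t align_up_sketch(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * hdr_units:    segments already written to the WQE, in 16-byte units
 * inline_bytes: total inline payload gathered from the sg list
 * wqe_shift:    log2 of the WQE size in bytes
 *
 * Returns the new WQE size in 16-byte units, or 0 if the inline
 * segment would not fit.
 */
size_t inline_wqe_units(size_t hdr_units, size_t inline_bytes,
			unsigned int wqe_shift)
{
	size_t wqe_size = (size_t)1 << wqe_shift;
	size_t seg_hdr = sizeof(uint32_t);	/* byte_count word */

	if (hdr_units * 16 + seg_hdr + inline_bytes > wqe_size)
		return 0;

	return hdr_units + align_up_sketch(seg_hdr + inline_bytes, 16) / 16;
}
```

With a 64-byte WQE and one 16-byte segment already in place, up to 44
bytes of inline data fit.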

Inline data support for libmthca (important for latency).
Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: src/qp.c
===================================================================
--- src/qp.c	(revision 2096)
+++ src/qp.c	(working copy)
@@ -57,6 +57,10 @@ enum {
 	MTHCA_NEXT_SOLICIT   = 1 << 1,
 };
 
+enum {
+	MTHCA_INLINE_SEG = 1<<31
+};
+
 struct mthca_next_seg {
 	uint32_t	nda_op;	/* [31:6] next WQE [4:0] next opcode */
 	uint32_t	ee_nds;	/* [31:8] next EE  [7] DBD [6] F [5:0] next WQE size */
@@ -107,6 +111,10 @@ struct mthca_data_seg {
 	uint64_t	addr;
 };
 
+struct mthca_inline_seg {
+	uint32_t	byte_count;
+};
+
 static const uint8_t mthca_opcode[] = {
 	[IBV_WR_SEND]                 = MTHCA_OPCODE_SEND,
 	[IBV_WR_SEND_WITH_IMM]        = MTHCA_OPCODE_SEND_IMM,
@@ -255,15 +263,39 @@ int mthca_tavor_post_send(struct ibv_qp 
 			goto out;
 		}
 
-		for (i = 0; i < wr->num_sge; ++i) {
-			((struct mthca_data_seg *) wqe)->byte_count =
-				htonl(wr->sg_list[i].length);
-			((struct mthca_data_seg *) wqe)->lkey =
-				htonl(wr->sg_list[i].lkey);
-			((struct mthca_data_seg *) wqe)->addr =
-				htonll(wr->sg_list[i].addr);
-			wqe += sizeof (struct mthca_data_seg);
-			size += sizeof (struct mthca_data_seg) / 16;
+		if (wr->send_flags & IBV_SEND_INLINE) {
+			struct mthca_inline_seg *seg = wqe;
+			int wqe_size = 1 << qp->sq.wqe_shift;
+			int s = 0;
+			wqe += sizeof *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				struct ibv_sge *sge = &wr->sg_list[i];
+				int l;
+				l = sge->length;
+				s += l;
+
+				if (s + sizeof *seg + size * 16 > wqe_size) {
+					ret = -1;
+					*bad_wr = wr;
+					goto out;
+				}
+
+				memcpy(wqe, (void*)(intptr_t)sge->addr, l);
+				wqe += l;
+			}
+			seg->byte_count = htonl(MTHCA_INLINE_SEG | s);
+
+			size += align(s + sizeof *seg, 16) / 16;
+		} else {
+			struct mthca_data_seg *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				seg = wqe;
+				seg->byte_count = htonl(wr->sg_list[i].length);
+				seg->lkey = htonl(wr->sg_list[i].lkey);
+				seg->addr = htonll(wr->sg_list[i].addr);
+				wqe += sizeof *seg;
+			}
+			size += wr->num_sge * sizeof *seg / 16;
 		}
 
 		qp->wrid[ind + qp->rq.max] = wr->wr_id;
@@ -512,15 +544,39 @@ int mthca_arbel_post_send(struct ibv_qp 
 			goto out;
 		}
 
-		for (i = 0; i < wr->num_sge; ++i) {
-			((struct mthca_data_seg *) wqe)->byte_count =
-				htonl(wr->sg_list[i].length);
-			((struct mthca_data_seg *) wqe)->lkey =
-				htonl(wr->sg_list[i].lkey);
-			((struct mthca_data_seg *) wqe)->addr =
-				htonll(wr->sg_list[i].addr);
-			wqe += sizeof (struct mthca_data_seg);
-			size += sizeof (struct mthca_data_seg) / 16;
+		if (wr->send_flags & IBV_SEND_INLINE) {
+			struct mthca_inline_seg *seg = wqe;
+			int wqe_size = 1 << qp->sq.wqe_shift;
+			int s = 0;
+			wqe += sizeof *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				struct ibv_sge *sge = &wr->sg_list[i];
+				int l;
+				l = sge->length;
+				s += l;
+
+				if (s + sizeof *seg + size * 16 > wqe_size) {
+					ret = -1;
+					*bad_wr = wr;
+					goto out;
+				}
+
+				memcpy(wqe, (void*)(intptr_t)sge->addr, l);
+				wqe += l;
+			}
+			seg->byte_count = htonl(MTHCA_INLINE_SEG | s);
+
+			size += align(s + sizeof *seg, 16) / 16;
+		} else {
+			struct mthca_data_seg *seg;
+			for (i = 0; i < wr->num_sge; ++i) {
+				seg = wqe;
+				seg->byte_count = htonl(wr->sg_list[i].length);
+				seg->lkey = htonl(wr->sg_list[i].lkey);
+				seg->addr = htonll(wr->sg_list[i].addr);
+				wqe += sizeof *seg;
+			}
+			size += wr->num_sge * sizeof *seg / 16;
 		}
 
 		qp->wrid[ind + qp->rq.max] = wr->wr_id;

-- 
MST - Michael S. Tsirkin


From bunk at stusta.de  Tue Apr  5 07:24:49 2005
From: bunk at stusta.de (Adrian Bunk)
Date: Tue, 5 Apr 2005 16:24:49 +0200
Subject: [openib-general] [-mm patch]
	drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label
In-Reply-To: <20050405000524.592fc125.akpm@osdl.org>
References: <20050405000524.592fc125.akpm@osdl.org>
Message-ID: <20050405142449.GF6885@stusta.de>

On Tue, Apr 05, 2005 at 12:05:24AM -0700, Andrew Morton wrote:
>...
> Changes since 2.6.12-rc1-mm4:
>...
> +ib-mthca-add-support-for-new-mt25204-hca.patch
> 
>  Infiniband update
>...


This patch causes the following compile warning:

<--  snip  -->

...
  CC      drivers/infiniband/hw/mthca/mthca_main.o
drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm':
drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label 
`err_unmap_eqp' defined but not used
...

<--  snip  -->


I'm not sure whether this patch to remove this label is correct, but if 
it isn't correct there must be a bug somewhere.


Signed-off-by: Adrian Bunk <bunk at stusta.de>

--- linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c.old	2005-04-05 16:18:09.000000000 +0200
+++ linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-05 16:19:15.000000000 +0200
@@ -475,8 +475,6 @@
 
 err_unmap_rdb:
 	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
-
-err_unmap_eqp:
 	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
 
 err_unmap_qp:


From halr at voltaire.com  Tue Apr  5 07:37:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2005 10:37:25 -0400
Subject: [openib-general] Re: [-mm patch]
	drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label
In-Reply-To: <20050405142449.GF6885@stusta.de>
References: <20050405000524.592fc125.akpm@osdl.org>
	<20050405142449.GF6885@stusta.de>
Message-ID: <1112711845.4490.4.camel@localhost.localdomain>

On Tue, 2005-04-05 at 10:24, Adrian Bunk wrote:
> On Tue, Apr 05, 2005 at 12:05:24AM -0700, Andrew Morton wrote:
> >...
> > Changes since 2.6.12-rc1-mm4:
> >...
> > +ib-mthca-add-support-for-new-mt25204-hca.patch
> > 
> >  Infiniband update
> >...
> 
> 
> This patch causes the following compile warning:
> 
> <--  snip  -->
> 
> ...
>   CC      drivers/infiniband/hw/mthca/mthca_main.o
> drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm':
> drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label 
> `err_unmap_eqp' defined but not used
> ...
> 
> <--  snip  -->
> 
> 
> I'm not sure whether this patch to remove this label is correct, but if 
> it isn't correct there must be a bug somewhere.
> 
> 
> Signed-off-by: Adrian Bunk <bunk at stusta.de>
> 
> --- linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c.old	2005-04-05 16:18:09.000000000 +0200
> +++ linux-2.6.12-rc2-mm1-full/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-05 16:19:15.000000000 +0200
> @@ -475,8 +475,6 @@
>  
>  err_unmap_rdb:
>  	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
> -
> -err_unmap_eqp:
>  	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
>  
>  err_unmap_qp:

Roland caught this recently and there is a patch for this which will
be sent upstream. The proper fix is different from this.

-- Hal


From hozer at hozed.org  Tue Apr  5 08:53:24 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Tue, 5 Apr 2005 10:53:24 -0500
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <42520357.6090607@sandia.gov>
References: <1112190300.4495.67.camel@localhost.localdomain>
	<424B4CA7.1050606@sandia.gov>
	<1112310139.4490.24.camel@localhost.localdomain>
	<20050405025037.GR26127@kalmia.hozed.org>
	<42520357.6090607@sandia.gov>
Message-ID: <20050405155323.GT26127@kalmia.hozed.org>

On Mon, Apr 04, 2005 at 08:17:43PM -0700, Josh England wrote:
> Troy Benjegerdes wrote:
> > On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote:
> > 
> >>On Wed, 2005-03-30 at 20:04, Josh England wrote: 
> >>
> >>>Are there any plans to modify the linux DHCP client so it would be
> >>>possible to do kernel-level DHCP and NFSroot over IB?
> >>
> >>I took a quick look at this and it looks pretty straightforward. Stay
> >>tuned...
> > 
> > 
> > I'd say don't.
> > 
> > Using initrd/initramfs is a much better solution. At some point the
> > in-kernel dhcp is going to get so buggy and old it's going to get
> > removed.
> 
> I know...it's just crummy to have to ship another 1.3 Megs out to every node.
> 
> > I boot all my cluster systems with NFS root servers, and I'm trying to
> > get everything moved to using Debian packaged kernels with initrd's.
> > With an initrd, you at least have a chance to get a shell and figure out
> > why you couldn't find your nfs server, instead of "kernel panic, I'm
> > going to die now" you get with in-kernel dhcp/nfs.
> 
> Check out oneSIS (http://onesis.org).  It can build initrds for you that
> do NFSroot (and drop to a shell when things go sour).  I'd love to hear
> some feedback from people familiar with running NFSroot.

Debian has a package called "lessdisks" that does some similar stuff.
If I do "apt-get install initrd-netboot-tools", and then install a
debian kernel image, it builds an initrd that can netboot. I suppose my
next trick after I get the debian kernel maintainers to make sure
infiniband is enabled is to try booting over IPoIB.

What dhcp client does onesis use? The debian lessdisks stuff uses
'udhcpc', advertised as a "very small DHCP client".


From halr at voltaire.com  Tue Apr  5 08:51:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2005 11:51:31 -0400
Subject: [openib-general] CM misuse of in_atomic/irqs_disabled
Message-ID: <1112716291.4490.8.camel@localhost.localdomain>

Hi Sean,

Should the following in the CM be changed:

int ib_cm_establish(struct ib_cm_id *cm_id)
{
...
        work = kmalloc(sizeof *work, (in_atomic() || irqs_disabled()) ?
                                      GFP_ATOMIC : GFP_KERNEL);
to just
        work = kmalloc(sizeof *work, GFP_ATOMIC);

similar to the other core changes for this issue ?

-- Hal


From mshefty at ichips.intel.com  Tue Apr  5 09:06:35 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Apr 2005 09:06:35 -0700
Subject: [openib-general] Re: CM misuse of in_atomic/irqs_disabled
In-Reply-To: <1112716291.4490.8.camel@localhost.localdomain>
References: <1112716291.4490.8.camel@localhost.localdomain>
Message-ID: <4252B78B.10801@ichips.intel.com>

Hal Rosenstock wrote:
> Hi Sean,
> 
> Should the following in the CM be changed:
> 
> int ib_cm_establish(struct ib_cm_id *cm_id)
> {
> ...
>         work = kmalloc(sizeof *work, (in_atomic() || irqs_disabled()) ?
>                                       GFP_ATOMIC : GFP_KERNEL);
> to just
>         work = kmalloc(sizeof *work, GFP_ATOMIC);
> 
> similar to the other core changes for this issue ?

I think so.  This was structured similar to the MAD code.  I'll commit 
a patch to change this.

- Sean


From roland at topspin.com  Tue Apr  5 09:05:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 05 Apr 2005 09:05:31 -0700
Subject: [openib-general] CM misuse of in_atomic/irqs_disabled
In-Reply-To: <1112716291.4490.8.camel@localhost.localdomain> (Hal
	Rosenstock's message of "05 Apr 2005 11:51:31 -0400")
References: <1112716291.4490.8.camel@localhost.localdomain>
Message-ID: <52y8bxcipg.fsf@topspin.com>

Or we could make ib_cm_establish() take a gfp_mask...
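
Roughly this shape, shown as a userspace mock so it can be compiled
outside the kernel. Everything here is a stand-in: the flag values, the
struct, and establish_sketch() are hypothetical, and malloc() replaces
kmalloc(), so the mask is only recorded rather than honoured.

```c
#include <stdlib.h>

/* Stand-ins for the kernel's allocation flags; values are arbitrary. */
typedef unsigned int gfp_sketch_t;
#define SKETCH_GFP_KERNEL 0x1u	/* caller is allowed to sleep */
#define SKETCH_GFP_ATOMIC 0x2u	/* caller must not sleep */

struct work_sketch {
	gfp_sketch_t allocated_with;	/* mask the caller asked for */
};

/*
 * The caller states its context through gfp_mask, so the callee never
 * has to guess with in_atomic() or irqs_disabled().
 */
struct work_sketch *establish_sketch(gfp_sketch_t gfp_mask)
{
	struct work_sketch *work = malloc(sizeof *work);

	if (!work)
		return NULL;
	work->allocated_with = gfp_mask;
	return work;
}
```

A caller holding a spinlock would pass the atomic flag; one in process
context could pass the sleeping one.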

 - R.


From peter at pantasys.com  Tue Apr  5 09:32:26 2005
From: peter at pantasys.com (Peter Buckingham)
Date: Tue, 05 Apr 2005 09:32:26 -0700
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <42520357.6090607@sandia.gov>
References: <1112190300.4495.67.camel@localhost.localdomain>	<424B4CA7.1050606@sandia.gov>	<1112310139.4490.24.camel@localhost.localdomain>	<20050405025037.GR26127@kalmia.hozed.org>
	<42520357.6090607@sandia.gov>
Message-ID: <4252BD9A.6000800@pantasys.com>

Josh England wrote:
> Check out oneSIS (http://onesis.org).  It can build initrds for you that
> do NFSroot (and drop to a shell when things go sour).  I'd love to hear
> some feedback from people familiar with running NFSroot.

i'll check it out.

we actually build our own initrd's here. part of that is because gen1 IB 
drivers can't be built into the kernel, but it also gives us better 
failover options.

peter


From jjengla at sandia.gov  Tue Apr  5 09:37:42 2005
From: jjengla at sandia.gov (Josh England)
Date: Tue, 05 Apr 2005 09:37:42 -0700
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <20050405155323.GT26127@kalmia.hozed.org>
References: <1112190300.4495.67.camel@localhost.localdomain>
	<424B4CA7.1050606@sandia.gov>
	<1112310139.4490.24.camel@localhost.localdomain>
	<20050405025037.GR26127@kalmia.hozed.org>
	<42520357.6090607@sandia.gov>
	<20050405155323.GT26127@kalmia.hozed.org>
Message-ID: <4252BED6.1000105@sandia.gov>

Troy Benjegerdes wrote:
> On Mon, Apr 04, 2005 at 08:17:43PM -0700, Josh England wrote:
> 
>>Troy Benjegerdes wrote:
>>
>>>On Thu, Mar 31, 2005 at 06:06:30PM -0500, Hal Rosenstock wrote:
>>>
>>>
>>>>On Wed, 2005-03-30 at 20:04, Josh England wrote: 
>>>>
>>>>
>>>>>Are there any plans to modify the linux DHCP client so it would be
>>>>>possible to do kernel-level DHCP and NFSroot over IB?
>>>>
>>>>I took a quick look at this and it looks pretty straightforward. Stay
>>>>tuned...
>>>
>>>
>>>I'd say don't.
>>>
>>>Using initrd/initramfs is a much better solution. At some point the
>>>in-kernel dhcp is going to get so buggy and old it's going to get
>>>removed.
>>
>>I know...it's just crummy to have to ship another 1.3 Megs out to every node.
>>
>>
>>>I boot all my cluster systems with NFS root servers, and I'm trying to
>>>get everything moved to using Debian packaged kernels with initrd's.
>>>With an initrd, you at least have a chance to get a shell and figure out
>>>why you couldn't find your nfs server, instead of "kernel panic, I'm
>>>going to die now" you get with in-kernel dhcp/nfs.
>>
>>Check out oneSIS (http://onesis.org).  It can build initrds for you that
>>do NFSroot (and drop to a shell when things go sour).  I'd love to hear
>>some feedback from people familiar with running NFSroot.
> 
> 
> Debian has a package called "lessdisks" that does some similar stuff.
> If I do "apt-get install initrd-netboot-tools", and then install a
> debian kernel image, it builds an initrd that can netboot. I suppose my
> next trick after I get the debian kernel maintainers to make sure
> infiniband is enabled is to try booting over IPoIB.
> 
> What dhcp client does onesis use? The debian lessdisks stuff uses
> 'udhcpc', advertised as a "very small DHCP client".

It has used udhcpc in the past, but I noticed it going significantly
slower on some machines and never found out why.  Right now it uses
dhclient, but that is easy enough to change.  The initrd itself is a
full mini-linux (busybox) with some helpful utilities (though I still
need to add lspci).  Maybe this discussion could be taken off the openib
list, though.  I'd love to talk more about the other cool stuff that
oneSIS can do.

-JE


From peter at pantasys.com  Tue Apr  5 09:40:48 2005
From: peter at pantasys.com (Peter Buckingham)
Date: Tue, 05 Apr 2005 09:40:48 -0700
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <20050405155323.GT26127@kalmia.hozed.org>
References: <1112190300.4495.67.camel@localhost.localdomain>	<424B4CA7.1050606@sandia.gov>	<1112310139.4490.24.camel@localhost.localdomain>	<20050405025037.GR26127@kalmia.hozed.org>	<42520357.6090607@sandia.gov>
	<20050405155323.GT26127@kalmia.hozed.org>
Message-ID: <4252BF90.1070803@pantasys.com>

Troy Benjegerdes wrote:
> What dhcp client does onesis use? The debian lessdisks stuff uses
> 'udhcpc', advertised as a "very small DHCP client".

I'd really suggest having a look at the klibc project. it provides 
pretty much all you need (apart from good documentation ;-). This was 
done as part of the move to initramfs by the kernel guys (hpa, al viro, 
etc).

sources:

	http://www.kernel.org/pub/linux/libs/klibc/

peter


From roland at topspin.com  Tue Apr  5 09:53:21 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 05 Apr 2005 09:53:21 -0700
Subject: [openib-general] Re: [-mm patch]
 drivers/infiniband/hw/mthca/mthca_main.c: remove an unused label
In-Reply-To: <20050405142449.GF6885@stusta.de> (Adrian Bunk's message of
	"Tue, 5 Apr 2005 16:24:49 +0200")
References: <20050405000524.592fc125.akpm@osdl.org>
	<20050405142449.GF6885@stusta.de>
Message-ID: <52psx9cghq.fsf@topspin.com>

    >   CC      drivers/infiniband/hw/mthca/mthca_main.o
    > drivers/infiniband/hw/mthca/mthca_main.c: In function `mthca_init_icm':
    > drivers/infiniband/hw/mthca/mthca_main.c:479: warning: label 
    > `err_unmap_eqp' defined but not used

Thanks, good catch.  I screwed up the error path in that function a
little while merging patches.  Here's the correct fix.


Correct unwinding in error path of mthca_init_icm().

Signed-off-by: Roland Dreier <roland at topspin.com>

--- linux-2.6.12-rc2-mm1.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-05 09:49:02.944473724 -0700
+++ linux-2.6.12-rc2-mm1/drivers/infiniband/hw/mthca/mthca_main.c	2005-04-05 09:49:15.679708865 -0700
@@ -437,7 +437,7 @@
 	if (!mdev->qp_table.rdb_table) {
 		mthca_err(mdev, "Failed to map RDB context memory, aborting\n");
 		err = -ENOMEM;
-		goto err_unmap_rdb;
+		goto err_unmap_eqp;
 	}
 
        mdev->cq_table.table = mthca_alloc_icm_table(mdev, init_hca->cqc_base,


From jjengla at sandia.gov  Tue Apr  5 10:26:45 2005
From: jjengla at sandia.gov (Josh England)
Date: Tue, 05 Apr 2005 10:26:45 -0700
Subject: [openib-general] Port of ISC DHCP-3.0.2 to OpenIB IPoIB
In-Reply-To: <4252BF90.1070803@pantasys.com>
References: <1112190300.4495.67.camel@localhost.localdomain>	<424B4CA7.1050606@sandia.gov>	<1112310139.4490.24.camel@localhost.localdomain>	<20050405025037.GR26127@kalmia.hozed.org>	<42520357.6090607@sandia.gov>
	<20050405155323.GT26127@kalmia.hozed.org>
	<4252BF90.1070803@pantasys.com>
Message-ID: <4252CA55.3000001@sandia.gov>

Peter Buckingham wrote:
> Troy Benjegerdes wrote:
> 
>> What dhcp client does onesis use? The debian lessdisks stuff uses
>> 'udhcpc', advertised as a "very small DHCP client".
> 
> 
> I'd really suggest having a look at the klibc project. it provides
> pretty much all you need (apart from good documentation ;-). This was
> done as part of the move to initramfs by the kernel guys (hpa, al viro,
> etc).
> 
> sources:
> 
>     http://www.kernel.org/pub/linux/libs/klibc/

Yeah...it definitely looks to be the way to go for 2.6 systems.

-JE


From rf at q-leap.de  Tue Apr  5 10:26:52 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Tue, 5 Apr 2005 19:26:52 +0200
Subject: [openib-general] gen2 opensm
Message-ID: <16978.51804.357810.302532@gargle.gargle.HOWL>

Hi,

I have tried the kernel 2.6.11 drivers on an x86-64 machine with a
MT23108 card. The driver loads ok after
$ modprobe ib_mthca; modprobe ib_umad

Since I use devfs, I have to manually create

$ mknod /dev/infiniband/umad0 c 231 0
$ mknod /dev/infiniband/umad1 c 231 1
$ mknod /dev/infiniband/issm0 c 231 64
$ mknod /dev/infiniband/issm1 c 231 65
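
A minimal Python sketch of the numbering pattern in the commands above,
assuming it simply continues for further ports (umadN on minor N, issmN on
minor 64+N, both on character major 231). The helper name mknod_commands is
made up for illustration; it only builds the command lines and creates no
device nodes.

```python
UMAD_MAJOR = 231       # character major used in the mknod commands above
ISSM_MINOR_BASE = 64   # issm0 is minor 64, issm1 is minor 65


def mknod_commands(num_ports, dev_dir="/dev/infiniband"):
    """Return the mknod command lines for num_ports umad/issm node pairs."""
    cmds = []
    # umad nodes first, one per port, minor number equal to the index
    for port in range(num_ports):
        cmds.append(f"mknod {dev_dir}/umad{port} c {UMAD_MAJOR} {port}")
    # issm nodes next, minor number offset by ISSM_MINOR_BASE
    for port in range(num_ports):
        minor = ISSM_MINOR_BASE + port
        cmds.append(f"mknod {dev_dir}/issm{port} c {UMAD_MAJOR} {minor}")
    return cmds


for cmd in mknod_commands(2):
    print(cmd)
```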

I get 

$ /usr/local/ib/bin/ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.2.0
        Hardware version: a1
        Node GUID: 0x000000008815bcaa
        System image GUID: 0x000000008815bcaa
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00500a68
                Port GUID: 0x0000000000000000
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00500a68
                Port GUID: 0x0000000000000000

which already looks strange (GUID 0 ???). Running opensm then doesn't
activate the ports:

Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA  0x7fffffffd010ort 1 guid (0x65babaa) as the default port.
Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa.
Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x8000000.
Apr 05 19:18:25 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping.

What could have gone wrong?

Roland


From halr at voltaire.com  Tue Apr  5 10:42:16 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2005 13:42:16 -0400
Subject: [openib-general] gen2 opensm
In-Reply-To: <16978.51804.357810.302532@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
Message-ID: <1112722935.4634.12.camel@localhost.localdomain>

On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote:
> Hi,
> 
> I have tried the kernel 2.6.11 drivers on an x86-64 machine with a
> MT23108 card. The driver loads ok after
> $ modprobe ib_mthca; modprobe ib_umad
> 
> Since I use devfs, I have to manually create
> 
> $ mknod /dev/infiniband/umad0 c 231 0
> $ mknod /dev/infiniband/umad1 c 231 1
> $ mknod /dev/infiniband/issm0 c 231 64
> $ mknod /dev/infiniband/issm1 c 231 65

What are the permissions on those ? Are they crw ?

> I get 
> 
> $ /usr/local/ib/bin/ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.2.0
>         Hardware version: a1
>         Node GUID: 0x000000008815bcaa
>         System image GUID: 0x000000008815bcaa
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00500a68
>                 Port GUID: 0x0000000000000000
>         Port 2:
>                 State: Down
>                 Physical state: Polling
>                 Rate: 2
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00500a68
>                 Port GUID: 0x0000000000000000
> 
> which already looks strange (GUID 0 ???). 

It looks like the port GUIDs are not set in NVRAM.

> Running opensm then doesn't activate the ports:
> 
> Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA  0x7fffffffd010ort 1 guid (0x65babaa) as the default port.

I see a bug in this message. I will fix it. Please sync OpenSM to at
least version 2111 and rerun.

> Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa.
> Apr 05 19:18:25 [4000] -> osm_vendor_bind: Binding to port 0x8000000.

Two binds. This looks wrong to me.

> Apr 05 19:18:25 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping.

The vendor layer couldn't find the matching request to a response which
came in. This is pretty fishy but probably related to the port issue.

> What could have gone wrong?

I would start with setting the port GUIDs for this HCA and see if the
problem persists.

-- Hal

> 
> Roland
> 


From halr at voltaire.com  Tue Apr  5 12:30:08 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2005 15:30:08 -0400
Subject: [openib-general] gen2 opensm
In-Reply-To: <16978.51804.357810.302532@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
Message-ID: <1112729408.4490.12.camel@localhost.localdomain>

On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote:
> $ /usr/local/ib/bin/ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.2.0

You might also want to upgrade to 3.3.2. I forget what problems 3.2.0
had and whether they will ultimately get in your way.

-- Hal


From rf at q-leap.de  Tue Apr  5 13:33:34 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Tue, 5 Apr 2005 22:33:34 +0200
Subject: [openib-general] gen2 opensm
In-Reply-To: <1112722935.4634.12.camel@localhost.localdomain>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
Message-ID: <16978.63006.330557.956153@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    Hal> On Tue, 2005-04-05 at 13:26, Roland Fehrenbacher wrote:

    >> I have tried the kernel 2.6.11 drivers on an x86-64 machine
    >> with a MT23108 card. The driver loads ok after
    >> $ modprobe ib_mthca; modprobe ib_umad

    >> Since I use devfs, I have to manually create

    >> $ mknod /dev/infiniband/umad0 c 231 0
    >> $ mknod /dev/infiniband/umad1 c 231 1
    >> $ mknod /dev/infiniband/issm0 c 231 64
    >> $ mknod /dev/infiniband/issm1 c 231 65

    Hal> What are the permissions on those ? Are they crw ?

$ ls -l /dev/infiniband
total 0
crw-r--r--  1 root root 231, 64 Apr  5 18:53 issm0
crw-r--r--  1 root root 231, 65 Apr  5 18:54 issm1
crw-r--r--  1 root root 231,  0 Apr  5 18:52 umad0
crw-r--r--  1 root root 231,  1 Apr  5 18:54 umad1

    >> I get 
    >> 
    >> $ /usr/local/ib/bin/ibstat
    >> CA 'mthca0'
    >>         CA type: MT23108
    >>         Number of ports: 2
    >>         Firmware version: 3.2.0
    >>         Hardware version: a1
    >>         Node GUID: 0x000000008815bcaa
    >>         System image GUID: 0x000000008815bcaa
    >>         Port 1:
    >>                 State: Initializing
    >>                 Physical state: LinkUp
    >>                 Rate: 10
    >>                 Base lid: 0
    >>                 LMC: 0
    >>                 SM lid: 0
    >>                 Capability mask: 0x00500a68
    >>                 Port GUID: 0x0000000000000000
    >>         Port 2:
    >>                 State: Down
    >>                 Physical state: Polling
    >>                 Rate: 2
    >>                 Base lid: 0
    >>                 LMC: 0
    >>                 SM lid: 0
    >>                 Capability mask: 0x00500a68
    >>                 Port GUID: 0x0000000000000000
    >> 
    >> which already looks strange (GUID 0 ???).

    Hal> It looks like the port GUIDs are not set in NVRAM.

They seem to be shown alright with ibstatus (or isn't gid = GUID?):

$ /usr/local/ib/bin/ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0000:771d
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

Infiniband device 'mthca0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0000:771e
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            2.5 Gb/sec (1X)
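
On the gid/GUID question, a small sketch of how the two relate: a port GID
is 128 bits, with the subnet prefix in the upper 64 bits and the port GUID
in the lower 64 bits. The function name split_gid is illustrative only, and
the input format is assumed to be the eight 4-digit hex groups that ibstatus
prints above.

```python
def split_gid(gid_text):
    """Split a GID printed as eight 4-digit hex groups into (prefix, guid)."""
    groups = gid_text.split(":")
    if len(groups) != 8 or any(len(g) != 4 for g in groups):
        raise ValueError("expected eight 4-digit hex groups")
    value = int("".join(groups), 16)
    subnet_prefix = value >> 64            # upper 64 bits
    port_guid = value & ((1 << 64) - 1)    # lower 64 bits
    return subnet_prefix, port_guid


prefix, guid = split_gid("fe80:0000:0000:0000:0002:c902:0000:771d")
print(f"subnet prefix: 0x{prefix:016x}")
print(f"port GUID:     0x{guid:016x}")
```

For the port 1 value above this gives a port GUID of 0x0002c9020000771d,
which is not the all-zero value that ibstat reported.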

> Running opensm then doesn't activate the ports:
> 
> Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 05 19:18:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 05 19:18:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 05 19:18:25 [4000] -> osm_vendor_get_all_port_attr: assign CA  0x7fffffffd010ort 1 guid (0x65babaa) as the default port.

    Hal> I see a bug in this message. I will fix it. Please sync
    Hal> OpenSM to at least version 2111 and rerun.

I will recompile tomorrow, and try a firmware upgrade.

Roland


From halr at voltaire.com  Tue Apr  5 14:02:14 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Apr 2005 17:02:14 -0400
Subject: [openib-general] IB Address Translation service
In-Reply-To: <4229CEA0.7060904@sun.com>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
Message-ID: <1112734934.4490.99.camel@localhost.localdomain>

Reviving an old thread...

On Sat, 2005-03-05 at 10:22, David M. Brean wrote: 
> There is an I-D for DHCP on IB.  IPoIB defines a "broadcast" address and 
> DHCP (and ARP) on IB use it.  Could make RARP work using this mechanism, 
> but as someone else pointed out, the IB hardware address contains a 
> QPN.  The I-D for IPoIB says something like:
> 
>     The link-layer address for IPoIB includes the QPN which might not be
>     constant across reboots or even across network interface resets.
>     Cached QPN entries, such as in static ARP entries or in RARP servers
>     will only work if the implementation(s) using these options ensure
>     that the QPN associated with an interface is invariant across
>     reboots/network resets.
> 
> So, there are requirements on the IPoIB implementation to make RARP 
> work.  Folks in the IPoIB work group decided not to go much further than 
> these statements for RARP support since most folks felt that DHCP is the
> (de facto) replacement.

There are 3 cases I can envision:

1. A single IPoIB interface per HCA port. In this case, the RARP server
can just match on the hardware address (port GID) without the QPN.

2. In the case of VLANs, I think we are likely OK as well. In that case,
there is a separate IP subnet (per PKey) so the port GID is unique per
IP subnet (the port GID is unique on that partition (IP subnet)). I
think there is a different QPN per VLAN.

So I don't think that the above 2 cases require an invariant QPN.

3. The third case is multihomed interfaces on the same IPoIB subnet. I
don't think this is currently supported by IPoIB (but may someday). That
would either not be supported by RARP or some way to have invariant QPNs
would be needed. I'm not sure how important this case is.

Is the above correct ? Are there other cases ? 

-- Hal


From mst at mellanox.co.il  Wed Apr  6 00:19:33 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Apr 2005 10:19:33 +0300
Subject: [openib-general] Re: IB Address Translation service
In-Reply-To: <1112734934.4490.99.camel@localhost.localdomain>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
Message-ID: <20050406071933.GL15034@mellanox.co.il>

Quoting r. Hal Rosenstock <halr at voltaire.com>:
> Subject: Re: IB Address Translation service
> 
> Reviving an old thread...
> 
> On Sat, 2005-03-05 at 10:22, David M. Brean wrote: 
> > There is an I-D for DHCP on IB.  IPoIB defines a "broadcast" address and 
> > DHCP (and ARP) on IB use it.  Could make RARP work using this mechanism, 
> > but as someone else pointed out, the IB hardware address contains a 
> > QPN.  The I-D for IPoIB says something like:
> > 
> >     The link-layer address for IPoIB includes the QPN which might not be
> >     constant across reboots or even across network interface resets.
> >     Cached QPN entries, such as in static ARP entries or in RARP servers
> >     will only work if the implementation(s) using these options ensure
> >     that the QPN associated with an interface is invariant across
> >     reboots/network resets.
> > 
> > So, there are requirements on the IPoIB implementation to make RARP 
> > work.  Folks in the IPoIB work group decided not to go much further than 
> > these statements for RARP support since most folks felt that DHCP is the
> > (de facto) replacement.
> 
> There are 3 cases I can envision:
> 
> 1. A single IPoIB interface per HCA port. In this case, the RARP server
> can just match on the hardware address (port GID) without the QPN.
> 
> 2. In the case of VLANs, I think we are likely OK as well. In that case,
> there is a separate IP subnet (per PKey) so the port GID is unique per
> IP subnet (the port GID is unique on that partition (IP subnet)). I
> think there is a different QPN per VLAN.
> 
> So I don't think that the above 2 cases require an invariant QPN.
> 
> 3. The third case is multihomed interfaces on the same IPoIB subnet. I
> don't think this is currently supported by IPoIB (but may someday). That
> would either not be supported by RARP or some way to have invariant QPNs
> would be needed. I'm not sure how important this case is.
> 
> Is the above correct ? Are there other cases ? 
> 
> -- Hal
> 

Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address.
It seems to me that making this work requires an invariant QP, right?

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Wed Apr  6 04:30:57 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Apr 2005 14:30:57 +0300
Subject: [openib-general] Re: IB Address Translation service
In-Reply-To: <1112785737.4809.28.camel@localhost.localdomain>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
	<20050406071933.GL15034@mellanox.co.il>
	<1112785737.4809.28.camel@localhost.localdomain>
Message-ID: <20050406113057.GU15034@mellanox.co.il>

Quoting r. Hal Rosenstock <halr at voltaire.com>:
> Subject: Re: IB Address Translation service
> 
> On Wed, 2005-04-06 at 03:19, Michael S. Tsirkin wrote:
> > > So I don't think that the above 2 cases require an invariant QPN.
> > > 
> > > 3. The third case is multihomed interfaces on the same IPoIB subnet. I
> > > don't think this is currently supported by IPoIB (but may someday). That
> > > would either not be supported by RARP or some way to have invariant QPNs
> > > would be needed. I'm not sure how important this case is.
> > > 
> > > Is the above correct ? Are there other cases ? 
> > > 
> > > -- Hal
> > > 
> > 
> > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address.
> > It seems to me that making this work requires an invariant QP, right?
> 
> I believe that DHCP servers require work in order to do this for IPoIB
> as they do not understand an IPoIB hardware address.

Are we talking about IPoIB hardware address size issues?

> So if they do
> support client identifier mapping to IP address, then I think the answer
> is maybe (rather than a definitive yes). The reason I say this is that
> the DHCP draft appears to allow any unique client identifier per IP
> subnet to be used. In that case, as long as there is a scheme to make each
> identifier unique per IP subnet, DHCP is fine. 
> 
>     The "client-identifier" option includes a type and identifier pair.
>     The identifier included in the "client-identifier" option may
>     consist of a hardware address or any other unique value such as the
>     DNS name of the client. When a hardware address is used, the type
>     field should be one of the ARP hardware types listed in [ARPPARAM].
> 
> The most common (simple to implement) client identifier from the
> DHCP client perspective is the IPoIB hardware address.
> 
> http://www.ietf.org/internet-drafts/draft-ietf-dhc-3315id-for-v4-04.txt
> states:
> 
> Client identities are ephemeral
> 
>    RFC2132 recommends that client identifiers be generated by using
>    the permanent link-layer address of the network interface that the
>    client is trying to configure. 
> 
> Requirements
> 
>    In order to address the problems stated in section 2, DHCPv4 client
>    identifiers must have the following characteristics:
> 
>    - They must be persistent, in the sense that a particular host's
>      client identifier must not change simply because a piece of
>      network hardware is added or removed.
> 
> ...
> 
>    - DHCPv4 client identifiers used by dual-stack hosts that also use
>      DHCPv6 must use the same host identification string for both
>      DHCPv4 and DHCPv6 - for example, a DHCPv4 server that uses the
>      client's identity to update the DNS on behalf of a DHCPv4 client
>      must register the same client identity in the DNS that would be
>      registered by the DHCPv6 server on behalf of the DHCPv6 client
>      running on that host, and vice versa.
> 
>    In order to satisfy all but the last of these requirements, we need
>    to construct a DHCPv4 client identifier out of two parts.  One part
>    must be unique to the host on which the client is running.  The
>    other must be unique to the network identity being presented.  The
>    DHCP Unique Identifier (DUID) and Identity Association Identifier
>    (IAID) specified in RFC3315 satisfy these requirements. 
> 
> DHCPv4 Client behavior
> 
>    DHCPv4 clients conforming to this specification MUST use stable
>    DHCPv4 node identifiers in the dhcp-client-identifier option.
>    DHCPv4 clients MUST NOT use client identifiers based solely on
>    layer two addresses that are hard-wired to the layer two device
>    (e.g., the ethernet MAC address) as suggested in RFC2131, except as
>    allowed in section 9.2 of RFC3315.  DHCPv4 clients MUST send a
>    'client identifier' option containing an Identity Association
>    Unique Identifier, as defined in section 10 of RFC3315, and a DHCP
>    Unique Identifier, as defined in section 9 of RFC3315.  These
>    together constitute an RFC3315-style binding identifier.
> 
>    The general format of the DHCPv4 'client identifier' option is
>    defined in section 9.14 of RFC2132.
> 
>    To send an RFC3315-style binding identifier in a DHCPv4 'client
>    identifier' option, the type of the 'client identifier' option is
>    set to 255.  The type field is immediately followed by the IAID,
>    which is an opaque 32-bit quantity.  The IAID is immediately
>    followed by the DUID, which consumes the remaining contents of the
>    'client identifier' option.  The format of the 'client identifier'
>    option is as follows:
> 
>       Code  Len  Type  IAID                DUID
>       +----+----+-----+----+----+----+----+----+----+---
>       | 61 | n  | 255 | i1 | i2 | i3 | i4 | d1 | d2 |...
>       +----+----+-----+----+----+----+----+----+----+---
> 
>    Any DHCPv4 or DHCPv6 client that conforms to this specification
>    SHOULD provide a means by which an operator can learn what DUID the
>    client has chosen.  Such clients SHOULD also provide a means by
>    which the operator can configure the DUID.  A device that is
>    normally configured with both a DHCPv4 and DHCPv6 client SHOULD
>    automatically use the same DUID for DHCPv4 and DHCPv6 without any
>    operator intervention.
> 
>    DHCPv4 clients that support more than one network interface SHOULD
>    use the same DUID on every interface.  DHCPv4 clients that support
>    more than one network interface SHOULD use a different IAID on
>    each interface.
> 
> From RFC 3315, there are multiple DUID types: DUID-LLT (link-layer address
> plus time), DUID-EN (assigned by vendor based on enterprise number), and
> DUID-LL (based on link-layer address).
> 
> Identity Association
> 
>    An "identity-association" (IA) is a construct through which a server
>    and a client can identify, group, and manage a set of related IPv6
>    addresses.  Each IA consists of an IAID and associated configuration
>    information.
> 
>    A client must associate at least one distinct IA with each of its
>    network interfaces for which it is to request the assignment of IPv6
>    addresses from a DHCP server.  The client uses the IAs assigned to an
>    interface to obtain configuration information from a server for that
>    interface.  Each IA must be associated with exactly one interface.
> 
>    The IAID uniquely identifies the IA and must be chosen to be unique
>    among the IAIDs on the client.  The IAID is chosen by the client.
>    For any given use of an IA by the client, the IAID for that IA MUST
>    be consistent across restarts of the DHCP client.  The client may
>    maintain consistency either by storing the IAID in non-volatile
>    storage or by using an algorithm that will consistently produce the
>    same IAID as long as the configuration of the client has not changed.
>    There may be no way for a client to maintain consistency of the IAIDs
>    if it does not have non-volatile storage and the client's hardware
>    configuration changes.

So a client could, for example, mask the QP number and use the
remaining non-volatile portion as the identifier?
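
A rough sketch of that masking idea, assuming the 20-octet IPoIB link-layer
address layout from the IPoIB draft (one reserved/flags octet, a 24-bit QPN,
then the 16-octet port GID). The function names and the two QPN values are
made up for illustration.

```python
IPOIB_HWADDR_LEN = 20  # 1 reserved/flags octet + 3 QPN octets + 16 GID octets


def split_ipoib_hwaddr(hwaddr):
    """Return (flags, qpn, gid) from a 20-octet IPoIB link-layer address."""
    if len(hwaddr) != IPOIB_HWADDR_LEN:
        raise ValueError("IPoIB hardware address must be 20 octets")
    flags = hwaddr[0]
    qpn = int.from_bytes(hwaddr[1:4], "big")
    gid = hwaddr[4:]
    return flags, qpn, gid


def stable_identifier(hwaddr):
    """Drop the flags and QPN octets and keep only the 16-octet port GID."""
    return split_ipoib_hwaddr(hwaddr)[2]


gid = bytes.fromhex("fe80000000000000" "0002c9020000771d")
# Same port before and after a reset, with hypothetical QPNs 0x404 and 0x808.
before = bytes([0]) + (0x000404).to_bytes(3, "big") + gid
after = bytes([0]) + (0x000808).to_bytes(3, "big") + gid

print(before != after)                                        # full addresses differ
print(stable_identifier(before) == stable_identifier(after))  # masked ones match
```

Whether a server would accept such a masked value as the identifier is a
separate question; the sketch only shows that the GID part survives a QPN
change.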

> Using a scheme along these lines removes the requirement for an invariant
> QPN for DHCP.
> 
> -- Hal
> 

Do you know whether dhcp clients / servers support this?
dhcpd man page seems to talk only about hardware address.

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Wed Apr  6 04:46:24 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Apr 2005 14:46:24 +0300
Subject: [openib-general] [PATCH] two fixes for INIT_IB
Message-ID: <20050406114624.GV15034@mellanox.co.il>

Two fixes for the INIT_IB command:
1. Allocate 64 bytes for the inbox, so that the address is 16-byte
   aligned as required by the manual.
2. Free the exact number of bytes we allocate.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: mthca_cmd.c
===================================================================
--- mthca_cmd.c	(revision 2115)
+++ mthca_cmd.c	(working copy)
@@ -1183,7 +1183,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	int err;
 	u32 flags;
 
-#define INIT_IB_IN_SIZE          56
+#define INIT_IB_IN_SIZE          0x40
 #define INIT_IB_FLAGS_OFFSET     0x00
 #define INIT_IB_FLAG_SIG         (1 << 18)
 #define INIT_IB_FLAG_NG          (1 << 17)
@@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
 			CMD_TIME_CLASS_A, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
 	return err;
 }
 

-- 
MST - Michael S. Tsirkin


From halr at voltaire.com  Wed Apr  6 04:08:57 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 07:08:57 -0400
Subject: [openib-general] Re: IB Address Translation service
In-Reply-To: <20050406071933.GL15034@mellanox.co.il>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
	<20050406071933.GL15034@mellanox.co.il>
Message-ID: <1112785737.4809.28.camel@localhost.localdomain>

On Wed, 2005-04-06 at 03:19, Michael S. Tsirkin wrote:
> > So I don't think that the above 2 cases require an invariant QPN.
> > 
> > 3. The third case is multihomed interfaces on the same IPoIB subnet. I
> > don't think this is currently supported by IPoIB (but may someday). That
> > would either not be supported by RARP or some way to have invariant QPNs
> > would be needed. I'm not sure how important this case is.
> > 
> > Is the above correct ? Are there other cases ? 
> > 
> > -- Hal
> > 
> 
> Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address.
> It seems to me that making this work requires an invariant QP, right?

I believe that DHCP servers require work in order to do this for IPoIB
as they do not understand an IPoIB hardware address. So if they do
support client identifier mapping to IP address, then I think the answer
is maybe (rather than a definitive yes). The reason I say this is that
the DHCP draft appears to allow any unique client identifier per IP
subnet to be used. In that case, as long as there is a scheme to make each
identifier unique per IP subnet, DHCP is fine. 

    The "client-identifier" option includes a type and identifier pair.
    The identifier included in the "client-identifier" option may
    consist of a hardware address or any other unique value such as the
    DNS name of the client. When a hardware address is used, the type
    field should be one of the ARP hardware types listed in [ARPPARAM].

The most common (simple to implement) client identifier from the
DHCP client perspective is the IPoIB hardware address.

http://www.ietf.org/internet-drafts/draft-ietf-dhc-3315id-for-v4-04.txt
states:

Client identities are ephemeral

   RFC2132 recommends that client identifiers be generated by using
   the permanent link-layer address of the network interface that the
   client is trying to configure. 

Requirements

   In order to address the problems stated in section 2, DHCPv4 client
   identifiers must have the following characteristics:

   - They must be persistent, in the sense that a particular host's
     client identifier must not change simply because a piece of
     network hardware is added or removed.

...

   - DHCPv4 client identifiers used by dual-stack hosts that also use
     DHCPv6 must use the same host identification string for both
     DHCPv4 and DHCPv6 - for example, a DHCPv4 server that uses the
     client's identity to update the DNS on behalf of a DHCPv4 client
     must register the same client identity in the DNS that would be
     registered by the DHCPv6 server on behalf of the DHCPv6 client
     running on that host, and vice versa.

   In order to satisfy all but the last of these requirements, we need
   to construct a DHCPv4 client identifier out of two parts.  One part
   must be unique to the host on which the client is running.  The
   other must be unique to the network identity being presented.  The
   DHCP Unique Identifier (DUID) and Identity Association Identifier
   (IAID) specified in RFC3315 satisfy these requirements. 

DHCPv4 Client behavior

   DHCPv4 clients conforming to this specification MUST use stable
   DHCPv4 node identifiers in the dhcp-client-identifier option.
   DHCPv4 clients MUST NOT use client identifiers based solely on
   layer two addresses that are hard-wired to the layer two device
   (e.g., the ethernet MAC address) as suggested in RFC2131, except as
   allowed in section 9.2 of RFC3315.  DHCPv4 clients MUST send a
   'client identifier' option containing an Identity Association
   Unique Identifier, as defined in section 10 of RFC3315, and a DHCP
   Unique Identifier, as defined in section 9 of RFC3315.  These
   together constitute an RFC3315-style binding identifier.

   The general format of the DHCPv4 'client identifier' option is
   defined in section 9.14 of RFC2132.

   To send an RFC3315-style binding identifier in a DHCPv4 'client
   identifier' option, the type of the 'client identifier' option is
   set to 255.  The type field is immediately followed by the IAID,
   which is an opaque 32-bit quantity.  The IAID is immediately
   followed by the DUID, which consumes the remaining contents of the
   'client identifier' option.  The format of the 'client identifier'
   option is as follows:

      Code  Len  Type  IAID                DUID
      +----+----+-----+----+----+----+----+----+----+---
      | 61 | n  | 255 | i1 | i2 | i3 | i4 | d1 | d2 |...
      +----+----+-----+----+----+----+----+----+----+---

   Any DHCPv4 or DHCPv6 client that conforms to this specification
   SHOULD provide a means by which an operator can learn what DUID the
   client has chosen.  Such clients SHOULD also provide a means by
   which the operator can configure the DUID.  A device that is
   normally configured with both a DHCPv4 and DHCPv6 client SHOULD
   automatically use the same DUID for DHCPv4 and DHCPv6 without any
   operator intervention.

   DHCPv4 clients that support more than one network interface SHOULD
   use the same DUID on every interface.  DHCPv4 clients that support
   more than one network interface SHOULD use a different IAID on
   each interface.

From RFC 3315, there are multiple DUID types: DUID-LLT (link-layer address
plus time), DUID-EN (assigned by vendor based on enterprise number), and
DUID-LL (based on link-layer address).

Identity Association

   An "identity-association" (IA) is a construct through which a server
   and a client can identify, group, and manage a set of related IPv6
   addresses.  Each IA consists of an IAID and associated configuration
   information.

   A client must associate at least one distinct IA with each of its
   network interfaces for which it is to request the assignment of IPv6
   addresses from a DHCP server.  The client uses the IAs assigned to an
   interface to obtain configuration information from a server for that
   interface.  Each IA must be associated with exactly one interface.

   The IAID uniquely identifies the IA and must be chosen to be unique
   among the IAIDs on the client.  The IAID is chosen by the client.
   For any given use of an IA by the client, the IAID for that IA MUST
   be consistent across restarts of the DHCP client.  The client may
   maintain consistency either by storing the IAID in non-volatile
   storage or by using an algorithm that will consistently produce the
   same IAID as long as the configuration of the client has not changed.
   There may be no way for a client to maintain consistency of the IAIDs
   if it does not have non-volatile storage and the client's hardware
   configuration changes.

Using a scheme along these lines removes the requirement for a nonvolatile
QPN for DHCP.
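
A minimal sketch of the option layout quoted above, assuming the IAID is
packed in network byte order (the draft only calls it an opaque 32-bit
quantity) and using a DUID-LL as laid out in RFC 3315. The helper names and
the use of a port GUID as the link-layer address are illustrative
assumptions, not something the draft prescribes:

```python
import struct

DHCP_OPT_CLIENT_ID = 61     # 'client identifier' option code (RFC 2132)
CLIENT_ID_TYPE_3315 = 255   # type value for an RFC 3315-style identifier


def duid_ll(hw_type: int, link_layer_addr: bytes) -> bytes:
    """DUID-LL: 2-byte DUID type (3), 2-byte hardware type, then the address."""
    return struct.pack("!HH", 3, hw_type) + link_layer_addr


def build_client_id_option(iaid: int, duid: bytes) -> bytes:
    """Encode code, len, type 255, 4-byte IAID, then the DUID."""
    if not 0 <= iaid <= 0xFFFFFFFF:
        raise ValueError("IAID must fit in 32 bits")
    # Assumption: IAID packed big-endian; the draft treats it as opaque.
    body = struct.pack("!BI", CLIENT_ID_TYPE_3315, iaid) + duid
    if len(body) > 255:
        raise ValueError("option body exceeds the one-byte length field")
    return struct.pack("!BB", DHCP_OPT_CLIENT_ID, len(body)) + body


# Hypothetical input: hardware type 32 (InfiniBand) and a port GUID.
example_duid = duid_ll(32, bytes.fromhex("0002c9020000771d"))
example_option = build_client_id_option(1, example_duid)
```

Nothing in the resulting option depends on the QPN, which is the point of
the scheme.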

-- Hal


From halr at voltaire.com  Wed Apr  6 05:16:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 08:16:25 -0400
Subject: [openib-general] Re: IB Address Translation service
In-Reply-To: <20050406113057.GU15034@mellanox.co.il>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
	<20050406071933.GL15034@mellanox.co.il>
	<1112785737.4809.28.camel@localhost.localdomain>
	<20050406113057.GU15034@mellanox.co.il>
Message-ID: <1112789785.4809.86.camel@localhost.localdomain>

On Wed, 2005-04-06 at 07:30, Michael S. Tsirkin wrote:
> > > Some DHCP servers (dhcpd) let you configure a fixed IP per hardware address.
> > > It seems to me that making this work requires an invariant QP, right?
> > 
> > I believe that DHCP servers require work in order to do this for IPoIB
> > as they do not understand an IPoIB hardware address.
> 
> Are we talking about IPoIB hardware address size issues?

I was referring to how IPoIB DHCP zeroes the chaddr field, sets hlen to 0,
and indicates an htype of IPoIB. I'm not sure whether this is supported
without any changes to DHCP servers. The other issue is support for the
client identifier field, and whether it works as-is or needs work.

> So a client could, for example, mask the QP number and use the
> remaining non-volatile portion as the identifier?

(This is what I said in a previous email in terms of making this work 
for RARP). That works if this is unique in the IP subnet. That's not
true in all cases. Also, per the emerging DHCP requirement, it does not
follow the format for the client identifier as an IAID is also required.
I suppose if there is only 1 interface then IAID isn't a problem either.
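
A rough sketch of the masking idea, assuming the 20-octet IPoIB link-layer
address layout from the IPoIB I-D (one reserved/flags octet, a 24-bit QPN,
then the 16-octet port GID). The function names are hypothetical:

```python
IPOIB_HWADDR_LEN = 20  # 1 reserved/flags octet + 3-octet QPN + 16-octet GID


def split_ipoib_hwaddr(hwaddr: bytes):
    """Split a 20-octet IPoIB link-layer address into (qpn, gid)."""
    if len(hwaddr) != IPOIB_HWADDR_LEN:
        raise ValueError("expected a 20-octet IPoIB hardware address")
    qpn = int.from_bytes(hwaddr[1:4], "big")
    gid = hwaddr[4:]
    return qpn, gid


def invariant_key(hwaddr: bytes) -> bytes:
    """Return only the port GID, ignoring the QPN that may change on reset."""
    _, gid = split_ipoib_hwaddr(hwaddr)
    return gid
```

Two addresses that differ only in QPN then map to the same key, which only
helps when the port GID is unique within the IP subnet.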

> Do you know whether dhcp clients / servers support this?

Not sure exactly which this you are referring to. The I-D requirement
(RFC3315id) is likely not supported.

> dhcpd man page seems to talk only about hardware address.

I think it may be dependent on which DHCP client/server. For the ISC
one, the changes were minimal (I put them out; there is one update
since) but this doesn't support 3315id.

-- Hal


From Roland.Fehrenbacher at transtec.de  Wed Apr  6 07:44:05 2005
From: Roland.Fehrenbacher at transtec.de (Roland Fehrenbacher)
Date: Wed, 6 Apr 2005 16:44:05 +0200
Subject: [openib-general] gen2 opensm
In-Reply-To: <1112788349.4809.39.camel@localhost.localdomain>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
Message-ID: <16979.62901.482352.325628@gargle.gargle.HOWL>


> $ /usr/local/ib/bin/ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0000:771d
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            10 Gb/sec (4X)
> 
> Infiniband device 'mthca0' port 2 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0000:771e
>         base lid:        0x0
>         sm lid:          0x0
>         state:           1: DOWN
>         phys state:      2: Polling
>         rate:            2.5 Gb/sec (1X)

    Hal> That's strange that you can get the port GIDs via ibstatus
    Hal> but not via ibstat.

    Hal> The one thing different I see is that the NodeGUID is very
    Hal> different from the PortGUIDs. Not sure if this messes things
    Hal> up.

Somehow the tools don't seem to get the correct information, but it's
there:

$ cat /sys/class/infiniband/mthca0/node_guid
0002:c902:0000:771c

$ cat /sys/class/infiniband/mthca0/sys_image_guid
0002:c902:0000:771f

How can this happen?
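
For comparison, a small sketch of how the values above relate: the default
GID that ibstatus prints is the link-local subnet prefix (fe80::/64)
followed by the port GUID. The helper names are made up for illustration,
and the "node GUID + port number" pattern is only what this particular card
shows, not a general rule:

```python
DEFAULT_SUBNET_PREFIX = 0xFE80000000000000  # IB default (link-local) prefix


def parse_guid(text: str) -> int:
    """Parse a GUID such as '0002:c902:0000:771c' or '0002c9020000771c'."""
    digits = text.strip().replace(":", "")
    if len(digits) != 16:
        raise ValueError("a GUID is 64 bits (16 hex digits)")
    return int(digits, 16)


def default_gid(port_guid: int) -> str:
    """Format prefix + port GUID in the colon-separated style ibstatus uses."""
    raw = (DEFAULT_SUBNET_PREFIX << 64) | port_guid
    hex32 = f"{raw:032x}"
    return ":".join(hex32[i:i + 4] for i in range(0, 32, 4))


node_guid = parse_guid("0002:c902:0000:771c")
# On this card the port GUIDs appear to be node GUID + 1 and + 2.
port1_gid = default_gid(node_guid + 1)
port2_gid = default_gid(node_guid + 2)
```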

> > Running opensm then doesn't activate the ports:
> > 
> > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 .......

     Hal> I see a bug in this message. I will fix it. Please sync
     Hal> OpenSM to at least version 2111 and rerun.

> I will recompile tomorrow, and try a firmware upgrade.

The error log with the recompiled opensm is now:

Apr 06 14:39:14 [4000] -> OpenSM Rev:openib-1.0.0
Apr 06 14:39:14 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
Apr 06 14:39:14 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x65babaa) as the default port.
Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa.
Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x8000000.
Apr 06 14:39:14 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping.

I couldn't do a firmware update yet, since I haven't gotten Mellanox
mst to compile with kernel 2.6.11. Do you have another suggestion how
I could do the upgrade?

Thanks,

Roland


From David.Brean at Sun.COM  Wed Apr  6 07:45:27 2005
From: David.Brean at Sun.COM (David M. Brean)
Date: Wed, 06 Apr 2005 10:45:27 -0400
Subject: [openib-general] IB Address Translation service
In-Reply-To: <1112734934.4490.99.camel@localhost.localdomain>
References: <35EA21F54A45CB47B879F21A91F4862F3FAC81@taurus.voltaire.com>
	<1109715208.11800.41.camel@duffman>
	<Pine.LNX.4.61.0503021116550.5704@jlentini-linux.nane.netapp.com>
	<1109966313.20238.11.camel@duffman>
	<20050305020402.GA3297@greglaptop.internal.keyresearch.com>
	<4229CEA0.7060904@sun.com>
	<1112734934.4490.99.camel@localhost.localdomain>
Message-ID: <4253F607.7060300@sun.com>

Your case #3 is an application where the limitations of RARP on IB 
appear.  I can't think of any other interesting configurations beyond 1-3.

-David

Hal Rosenstock wrote:

>Reviving an old thread...
>
>On Sat, 2005-03-05 at 10:22, David M. Brean wrote: 
>  
>
>>There is an I-D for DHCP on IB.  IPoIB defines a "broadcast" address and 
>>DHCP (and ARP) on IB use it.  Could make RARP work using this mechanism, 
>>but as someone else pointed out, the IB hardware address contains a 
>>QPN.  The I-D for IPoIB says something like:
>>
>>    The link-layer address for IPoIB includes the QPN which might not be
>>    constant across reboots or even across network interface resets.
>>    Cached QPN entries, such as in static ARP entries or in RARP servers
>>    will only work if the implementation(s) using these options ensure
>>    that the QPN associated with an interface is invariant across
>>    reboots/network resets.
>>
>>So, there are requirements on the IPoIB implementation to make RARP 
>>work.  Folks in the IPoIB work group decided not to go much further than 
>>these statements for RARP support since most folks felt that DHCP is (de 
>>facto) replacement.
>>    
>>
>
>There are 3 cases I can envision:
>
>1. A single IPoIB interface per HCA port. In this case, the RARP server
>can just match on the hardware address (port GID) without the QPN.
>
>2. In the case of VLANs, I think we are likely OK as well. In that case,
>there is a separate IP subnet (per PKey) so the port GID is unique per
>IP subnet (the port GID is unique on that partition (IP subnet)). I
>think there is a different QPN per VLAN.
>
>So I don't think that the above 2 cases require an invariant QPN.
>
>3. The third case is multihomed interfaces on the same IPoIB subnet. I
>don't think this is currently supported by IPoIB (but may someday). That
>would either not be supported by RARP or some way to have invariant QPNs
>would be needed. I'm not sure how important this case is.
>
>Is the above correct ? Are there other cases ? 
>
>-- Hal
>


From halr at voltaire.com  Wed Apr  6 07:58:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 10:58:03 -0400
Subject: [openib-general] gen2 opensm
In-Reply-To: <16979.62901.482352.325628@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
Message-ID: <1112799483.4906.32.camel@localhost.localdomain>

On Wed, 2005-04-06 at 10:44, Roland Fehrenbacher wrote:
> > $ /usr/local/ib/bin/ibstatus
> > Infiniband device 'mthca0' port 1 status:
> >         default gid:     fe80:0000:0000:0000:0002:c902:0000:771d
> >         base lid:        0x0
> >         sm lid:          0x0
> >         state:           2: INIT
> >         phys state:      5: LinkUp
> >         rate:            10 Gb/sec (4X)
> > 
> > Infiniband device 'mthca0' port 2 status:
> >         default gid:     fe80:0000:0000:0000:0002:c902:0000:771e
> >         base lid:        0x0
> >         sm lid:          0x0
> >         state:           1: DOWN
> >         phys state:      2: Polling
> >         rate:            2.5 Gb/sec (1X)
> 
>     Hal> That's strange that you can get the port GIDs via ibstatus
>     Hal> but not via ibstat.
> 
>     Hal> The one thing different I see is that the NodeGUID is very
>     Hal> different from the PortGUIDs. Not sure if this messes things
>     Hal> up.
> 
> Somehow the tools don't seem to get the correct information, but it's
> there:
> 
> $ cat /sys/class/infiniband/mthca0/node_guid
> 0002:c902:0000:771c
> 
> $ cat /sys/class/infiniband/mthca0/sys_image_guid
> 0002:c902:0000:771f
> 
> How can this happen?
> 
> > > Running opensm then doesn't activate the ports:
> > > 
> > > Apr 05 19:18:25 [4000] -> OpenSM Rev:openib-1.0.0 .......
> 
>      Hal> I see a bug in this message. I will fix it. Please sync
>      Hal> OpenSM to at least version 2111 and rerun.
> 
> > I will recompile tomorrow, and try a firmware upgrade.
> 
> The error log with the recompiled opensm is now:
> 
> Apr 06 14:39:14 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 06 14:39:14 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 06 14:39:14 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0x0000000030f2ffff,0x0000000000000000
> Apr 06 14:39:14 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x65babaa) as the default port.
> Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x225dabaa.
> Apr 06 14:39:14 [4000] -> osm_vendor_bind: Binding to port 0x8000000.
> Apr 06 14:39:14 [2400A] -> umad_receiver: Failed to obtain request madw for received MAD(method=81 attr=11) -- dropping.
> 
> I couldn't do a firmware update yet, since I haven't gotten Mellanox
> mst to compile with kernel 2.6.11. Do you have another suggestion how
> I could do the upgrade?

Mellanox mst ? Are you using the Mellanox drivers and not OpenIB gen2 ?

-- Hal


From mst at mellanox.co.il  Wed Apr  6 08:13:11 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Apr 2005 18:13:11 +0300
Subject: [openib-general] Re: gen2 opensm
In-Reply-To: <16979.62901.482352.325628@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
Message-ID: <20050406151311.GE20567@mellanox.co.il>

Quoting r. Roland Fehrenbacher <Roland.Fehrenbacher at transtec.de>:
> Subject: Re: gen2 opensm
> I couldn't do a firmware update yet, since I haven't gotten Mellanox
> mst to compile with kernel 2.6.11. Do you have another suggestion how
> I could do the upgrade?
> 
> Thanks,
> 
> Roland

Yes, use mstflint from src/userspace/mstflint.
The latest Gold disk in general, and mst in particular, does not support 2.6.11.

-- 
MST - Michael S. Tsirkin


From rf at q-leap.de  Wed Apr  6 08:19:52 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Wed, 6 Apr 2005 17:19:52 +0200
Subject: [openib-general] gen2 opensm
In-Reply-To: <1112799483.4906.32.camel@localhost.localdomain>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<1112799483.4906.32.camel@localhost.localdomain>
Message-ID: <16979.65048.982358.532089@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    >> I couldn't do a firmware update yet, since I haven't gotten
    >> Mellanox mst to compile with kernel 2.6.11. Do you have another
    >> suggestion how I could do the upgrade?

    Hal> Mellanox mst ? Are you using the Mellanox drivers and not
    Hal> OpenIB gen2 ?

No, I just wanted to use them for firmware flashing. But now Michael
told me to flash with mstflint. I'll try.

Thanks Michael.

Roland


From halr at voltaire.com  Wed Apr  6 08:21:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 11:21:05 -0400
Subject: [openib-general] gen2 opensm
In-Reply-To: <1112799483.4906.32.camel@localhost.localdomain>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<1112799483.4906.32.camel@localhost.localdomain>
Message-ID: <1112800865.4906.52.camel@localhost.localdomain>

On Wed, 2005-04-06 at 10:58, Hal Rosenstock wrote:

> Mellanox mst ? Are you using the Mellanox drivers and not OpenIB gen2 ?

If you are, that combination doesn't work. You either need to use OpenIB
gen2 (mthca) or use the OpenSM from Mellanox Gold of whatever variant
you are using.

-- Hal


From halr at voltaire.com  Wed Apr  6 08:27:36 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Apr 2005 11:27:36 -0400
Subject: [openib-general] gen2 opensm
In-Reply-To: <16979.65048.982358.532089@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<1112799483.4906.32.camel@localhost.localdomain>
	<16979.65048.982358.532089@gargle.gargle.HOWL>
Message-ID: <1112800951.4906.54.camel@localhost.localdomain>

On Wed, 2005-04-06 at 11:19, Roland Fehrenbacher wrote:
> >>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:
> 
>     >> I couldn't do a firmware update yet, since I haven't gotten
>     >> Mellanox mst to compile with kernel 2.6.11. Do you have another
>     >> suggestion how I could do the upgrade?
> 
>     Hal> Mellanox mst ? Are you using the Mellanox drivers and not
>     Hal> OpenIB gen2 ?
> 
> No, 

Good.

> I just wanted to use them for firmware flashing. But now Michael
> told me to flash with mstflint. I'll try.

Just out of curiosity, what is the architecture of the machine you are
using ?

-- Hal


From rf at q-leap.de  Wed Apr  6 09:19:54 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Wed, 6 Apr 2005 18:19:54 +0200
Subject: [openib-general] Re: gen2 opensm
In-Reply-To: <20050406151311.GE20567@mellanox.co.il>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<20050406151311.GE20567@mellanox.co.il>
Message-ID: <16980.3114.482668.956134@gargle.gargle.HOWL>

>>>>> "Michael" == Michael S Tsirkin <mst at mellanox.co.il> writes:

    Michael> Quoting r. Roland Fehrenbacher
    Michael> <Roland.Fehrenbacher at transtec.de>:
    >> Subject: Re: gen2 opensm I couldn't do a firmware update yet,
    >> since I haven't gotten Mellanox mst to compile with kernel
    >> 2.6.11. Do you have another suggestion how I could do the
    >> upgrade?
    >> 
    >> Thanks,
    >> 
    >> Roland

    Michael> Yes, use mstflint from src/userspace/mstflint Latest gold
    Michael> disk in general and mst in particular does not support
    Michael> 2.6.11

I get 

$ mstflint -d /proc/bus/pci/03/00.0 q
Image type: FailSafe
Chip rev.:  A1
GUIDs:      0002c9020000771c 0002c9020000771d 0002c9020000771e 0002c9020000771f
Board ID:    (MT_0030000001)

What would be the right way to flash this card using the files from 

fw-23108-rel-3_3_2/
fw-23108-rel-3_3_2/fw-23108-a1-debug.mlx
fw-23108-rel-3_3_2/fw-23108-a1-rel.mlx
fw-23108-rel-3_3_2/MTLP23108_128MB.brd
fw-23108-rel-3_3_2/MTLP23108_256MB.brd
fw-23108-rel-3_3_2/MTLP23108_512MB.brd
fw-23108-rel-3_3_2/MTPB23108_128MB.brd
fw-23108-rel-3_3_2/MTPB23108_256MB.brd
fw-23108-rel-3_3_2/BUILD_ID

Roland


From rf at q-leap.de  Wed Apr  6 09:29:41 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Wed, 6 Apr 2005 18:29:41 +0200
Subject: [openib-general] gen2 opensm
In-Reply-To: <1112800951.4906.54.camel@localhost.localdomain>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<1112799483.4906.32.camel@localhost.localdomain>
	<16979.65048.982358.532089@gargle.gargle.HOWL>
	<1112800951.4906.54.camel@localhost.localdomain>
Message-ID: <16980.3701.448062.413906@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    Hal> Just out of curiosity, what is the architecture of the
    Hal> machine you are using ?

It is a Tyan S2882 with 2 x Opteron 250, 2 GB RAM.

Roland


From mst at mellanox.co.il  Wed Apr  6 09:44:56 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Apr 2005 19:44:56 +0300
Subject: [openib-general] Re: gen2 opensm
In-Reply-To: <16980.3114.482668.956134@gargle.gargle.HOWL>
References: <16978.51804.357810.302532@gargle.gargle.HOWL>
	<1112722935.4634.12.camel@localhost.localdomain>
	<16978.63006.330557.956153@gargle.gargle.HOWL>
	<1112788349.4809.39.camel@localhost.localdomain>
	<16979.62901.482352.325628@gargle.gargle.HOWL>
	<20050406151311.GE20567@mellanox.co.il>
	<16980.3114.482668.956134@gargle.gargle.HOWL>
Message-ID: <20050406164456.GA23565@mellanox.co.il>

Quoting r. Roland Fehrenbacher <rf at q-leap.de>:
> Subject: Re: gen2 opensm
> 
> >>>>> "Michael" == Michael S Tsirkin <mst at mellanox.co.il> writes:
> 
>     Michael> Quoting r. Roland Fehrenbacher
>     Michael> <Roland.Fehrenbacher at transtec.de>:
>     >> Subject: Re: gen2 opensm I couldn't do a firmware update yet,
>     >> since I haven't gotten Mellanox mst to compile with kernel
>     >> 2.6.11. Do you have another suggestion how I could do the
>     >> upgrade?
>     >> 
>     >> Thanks,
>     >> 
>     >> Roland
> 
>     Michael> Yes, use mstflint from src/userspace/mstflint Latest gold
>     Michael> disk in general and mst in particular does not support
>     Michael> 2.6.11
> 
> I get 
> 
> $ mstflint -d /proc/bus/pci/03/00.0 q
> Image type: FailSafe
> Chip rev.:  A1
> GUIDs:      0002c9020000771c 0002c9020000771d 0002c9020000771e 0002c9020000771f
> Board ID:    (MT_0030000001)
> 
> What would be the right way to flash this card using the files from 
> 
> fw-23108-rel-3_3_2/
> fw-23108-rel-3_3_2/fw-23108-a1-debug.mlx
> fw-23108-rel-3_3_2/fw-23108-a1-rel.mlx
> fw-23108-rel-3_3_2/MTLP23108_128MB.brd
> fw-23108-rel-3_3_2/MTLP23108_256MB.brd
> fw-23108-rel-3_3_2/MTLP23108_512MB.brd
> fw-23108-rel-3_3_2/MTPB23108_128MB.brd
> fw-23108-rel-3_3_2/MTPB23108_256MB.brd
> fw-23108-rel-3_3_2/BUILD_ID
> 
> Roland
> 

You want MTLP23108_128MB.brd and fw-23108-a1-rel.mlx then.

Create a binary image with infiniburn (select raw binary format),
and burn the result with mstflint.

If you select the wrong .brd file, mstflint
will warn you that the PSID (Board ID) is being changed.

MST

-- 
MST - Michael S. Tsirkin


From ardavis at ichips.intel.com  Wed Apr  6 14:20:47 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Wed, 06 Apr 2005 14:20:47 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52oeczoghb.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
Message-ID: <425452AF.6010207@ichips.intel.com>

Roland Dreier wrote:

>    ardavis> Has anyone successfully run uverbs examples with events
>    ardavis> using ibv_get_cq_event? It seems to block forever on my
>    ardavis> system with the pingpong test.
>
>Yes, I have.  I'll try again with the latest code to make sure I
>haven't broken anything recently.
>
> - R.
>
>  
>
Roland,

Did you get a chance to retry events? I pulled the latest from your 
branch and my ibv_pingpong -e testing still blocks forever.

-arlin


From roland at topspin.com  Wed Apr  6 15:17:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 06 Apr 2005 15:17:38 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425452AF.6010207@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Wed, 06 Apr 2005 14:20:47 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com>
Message-ID: <527jjf8s8t.fsf@topspin.com>

    ardavis> Did you get a chance to retry events? I pulled the latest
    ardavis> from your branch and my ibv_pingpong -e testing still
    ardavis> blocks forever.

Yes, it works for me.  I just tried it again and it worked fine.

What kind of system/HCA are you using?  Does ibv_pingpong without -e
work for you?

 - R.


From rf at q-leap.de  Thu Apr  7 05:38:17 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Thu, 7 Apr 2005 14:38:17 +0200
Subject: [openib-general] Flashing Mellanox MT23108
Message-ID: <16981.10681.472289.311124@gargle.gargle.HOWL>

Hi,

can anyone tell me how to flash a Mellanox MT23108 card with
mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
Mellanox I get

$ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
Not a valid image

Roland


From Jerome.Pioux at bull.com  Thu Apr  7 09:09:56 2005
From: Jerome.Pioux at bull.com (Jerome Pioux)
Date: Thu, 7 Apr 2005 09:09:56 -0700
Subject: [openib-general] Flashing Mellanox MT23108
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
Message-ID: <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>

Hi Roland,

I don't know about mstflint, but I have used flint from IBGD in the past, so it may be the same.
With flint, you need to use the raw image (bin) and not the Mellanox one (mlx).
This is what I used:

flint -d /dev/mst/mt23108_pci_cr2 -i /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b
Image type: FailSafe
Chip rev.:  A1
GUIDs:      0005ad0000016770 0005ad0000016771 0005ad0000016772 0005ad000100d050 
Board ID:

    Burn image with the following GUIDs:
        Node:      0005ad0000016770
        Port1:     0005ad0000016771
        Port2:     0005ad0000016772
        Sys.Image: 0005ad000100d050
etc...

I think I created the bin image using infiniburn: I read
fw-23108-a1-rel.mlx with infiniburn and wrote out the bin image (raw format).
But if you have infiniburn working, I think you can burn the mlx image directly.

Jerome


----- Original Message ----- 
  From: Roland Fehrenbacher 
  To: openib-general at openib.org 
  Sent: Thursday, April 07, 2005 5:38 AM
  Subject: [openib-general] Flashing Mellanox MT23108


  Hi,

  can anyone tell me how to flash a Mellanox MT23108 card with
  mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
  Mellanox I get

  $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
  Not a valid image

  Roland


From ardavis at ichips.intel.com  Thu Apr  7 09:47:57 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Thu, 07 Apr 2005 09:47:57 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <527jjf8s8t.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>
Message-ID: <4255643D.30002@ichips.intel.com>

Roland Dreier wrote:

>    ardavis> Did you get a chance to retry events? I pulled the latest
>    ardavis> from your branch and my ibv_pingpong -e testing still
>    ardavis> blocks forever.
>
>Yes, it works for me.  I just tried it again and it worked fine.
>
>What kind of system/HCA are you using?  Does ibv_pingpong without -e
>work for you?
>
> - R.
>
>  
>
EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0,  hw_rev A0


From halr at voltaire.com  Thu Apr  7 09:50:15 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 12:50:15 -0400
Subject: [openib-general] Re: uverbs events
In-Reply-To: <4255643D.30002@ichips.intel.com>
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
Message-ID: <1112892615.4877.18.camel@localhost.localdomain>

On Thu, 2005-04-07 at 12:47, ardavis wrote:
> EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0,  hw_rev A0

Didn't 4.6.0 have an issue with CQ handling ? Can you try 4.6.2 ?

-- Hal


From roland at topspin.com  Thu Apr  7 10:09:02 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 07 Apr 2005 10:09:02 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <1112892615.4877.18.camel@localhost.localdomain> (Hal
	Rosenstock's message of "07 Apr 2005 12:50:15 -0400")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
Message-ID: <52zmwa5xap.fsf@topspin.com>

    Hal> Didn't 4.6.0 have an issue with CQ handling ? Can you try 4.6.2 ?

It's possible that firmware bug is the problem, but if it is I would
expect the non-event mode to fail as well.

 - R.


From mst at mellanox.co.il  Thu Apr  7 11:55:32 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Apr 2005 21:55:32 +0300
Subject: [openib-general] Re: Flashing Mellanox MT23108
In-Reply-To: <004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
	<004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
Message-ID: <20050407185532.GC13172@mellanox.co.il>

Quoting r. Jerome Pioux <Jerome.Pioux at bull.com>:
> Subject: Re: Flashing Mellanox MT23108
> 
> Hi Roland,
>  
> I don't know about mstflint but I used flint in the past from IBGD so it may be
> the same?
> With flint, you need to use the raw image (bin) and not the Mellanox one (mlx).
> This is what I used:
>  
> flint -d /dev/mst/mt23108_pci_cr2 -i /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin
> b
> Image type: FailSafe
> Chip rev.:  A1
> GUIDs:      0005ad0000016770 0005ad0000016771 0005ad0000016772 0005ad000100d050
> Board ID:
>  
>     Burn image with the following GUIDs:
>         Node:      0005ad0000016770
>         Port1:     0005ad0000016771
>         Port2:     0005ad0000016772
>         Sys.Image: 0005ad000100d050
> etc...
>  
> I think that I created the bin image using infiniburn.
> I read fw-23108-a1-rel.mlx using infiniburn and write the bin image (raw
> format).
> But if you have infiniburn working, I think that you can burn the mlx image
> directly.
>  
> Jerome
>  
>  
> ----- Original Message -----
> 
>     From: Roland Fehrenbacher
>     To: openib-general at openib.org
>     Sent: Thursday, April 07, 2005 5:38 AM
>     Subject: [openib-general] Flashing Mellanox MT23108
>    
>     Hi,
>    
>     can anyone tell me how to flash a Mellanox MT23108 card with
>     mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
>     Mellanox I get
>    
>     $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
>     Not a valid image
>    
>     Roland
>    

With mstflint you pass in the device location: -d 03:00.0
Otherwise it's the same.
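
As an illustration only, a short sketch that turns the /proc path used
earlier in the thread into the bus:device.function form and assembles the
burn command. The helper names and the image file name are hypothetical;
only the -d, -i and b arguments are taken from the commands shown in this
thread:

```python
import re

# Matches paths like /proc/bus/pci/03/00.0 (bus / device.function).
_PROC_PCI = re.compile(
    r"^/proc/bus/pci/([0-9a-fA-F]{2})/([0-9a-fA-F]{2})\.([0-7])$"
)


def pci_location(proc_path: str) -> str:
    """Convert /proc/bus/pci/BB/DD.F into the BB:DD.F device location."""
    m = _PROC_PCI.match(proc_path)
    if m is None:
        raise ValueError(f"unrecognized PCI path: {proc_path!r}")
    bus, dev, fn = m.groups()
    return f"{bus}:{dev}.{fn}"


def burn_command(proc_path: str, raw_image: str) -> list:
    """Build the argv for burning a raw (.bin) image; nothing is executed."""
    if raw_image.endswith(".mlx"):
        # The thread reports 'Not a valid image' when an .mlx file is passed.
        raise ValueError("convert the .mlx file to a raw binary image first")
    return ["mstflint", "-d", pci_location(proc_path), "-i", raw_image, "b"]


# Hypothetical image name produced by infiniburn in raw binary format.
argv = burn_command("/proc/bus/pci/03/00.0", "fw-23108-a1-rel.bin")
```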

-- 
MST - Michael S. Tsirkin


From halr at voltaire.com  Thu Apr  7 12:32:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 15:32:24 -0400
Subject: [openib-general] SM Bad Port Handling
Message-ID: <1112902344.4490.91.camel@localhost.localdomain>

Hi,

Below is a writeup on bad port handling by the SM. I would appreciate
any comments on this before I move on to the implementation.

Thanks.

-- Hal


Problem Statement:

Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
NodeDescription to any node it finds. It then requests PortInfo for 
each port which is physically up.

There are scenarios where the port is physically up, but there is no
response to the SM get requests. In this case, the OpenSM keeps
retrying, never gives up, and doesn't service anything else in the
subnet (I'm not 100% positive on this last point).

Assumption:

The proposed solution assumes that the ignore GUIDs file option of
OpenSM only impacts the routing algorithm (path counting) and should not
be extended for bad port handling.

Proposed Solution:

The OpenSM will implement a configurable policy (a threshold number of
consecutive unanswered SM requests). Once the timeout/retry strategy is
exhausted, that port will be marked as "bad" by OpenSM.

At this point, should it attempt to revive the port by bringing the
physical link down and back up ? Should it try this several times before
declaring the port as "bad" ? In any case, this is a refinement on the
basic strategy for dealing with this scenario.

There could also be a periodic "ping" at a slower rate to check whether
the "bad" ports revive.

A "bad" port per this scenario still maintains its LID and other state.
OpenSM will flag a detected "bad" port through an internal port physical
state, which it will set to down. The "real" port physical state will
still be reflected accurately inside OpenSM.

Once a "bad" port is detected, it will no longer be polled and the
routing algorithm should be invoked to route around this. 

Is there a need to store these "bad" ports persistently (and ignore them
on startup) ?


From eitan at mellanox.co.il  Thu Apr  7 13:02:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 7 Apr 2005 23:02:39 +0300 
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>

Hi Hal,

Please see my comments below.

Eitan Zahavi

> Problem Statement:
> 
> Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> NodeDescription to any node it finds. It then requests PortInfo for
> each port which is physically up.
> 
> There are scenarios where the port is physically up, but there is no
> response to the SM get requests. In this case, the OpenSM keeps
> retrying, never gives up, and doesn't service anything else in the
> subnet (I'm not 100% positive on this last point).
[EZ] I have never seen this!  Are you sure about it? Are you sure we are
talking about gen1 ported to gen2?

When a port does not respond, OpenSM retries the send (actually the lower
level does it) for the number of retries OpenSM is configured to use
(actually 4 times) and then ignores the port and everything behind it. The
reported topology (on stdout) will have the word UNKNOWN on the remote side
of the link this port connects to.
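
The behavior described above can be sketched roughly as follows. This is a
toy Python model, not the OpenSM discovery code: the function name, the
shape of the `links` map and the `responds` callback are assumptions, and
it treats the retry count as the total number of attempts.

```python
from collections import deque

UNKNOWN = "UNKNOWN"

def discover(links, start, responds, max_attempts=4):
    """Walk the fabric from `start`.

    links:    node -> list of neighbour nodes (hypothetical topology map)
    responds: responds(node) -> bool, called once per query attempt
    Returns a list of (local node, what was found on the remote side).
    """
    seen = {start}
    report = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links.get(node, []):
            if peer in seen:
                continue
            # Query the peer up to max_attempts times before giving up on it.
            answered = any(responds(peer) for _ in range(max_attempts))
            if answered:
                seen.add(peer)
                report.append((node, peer))
                queue.append(peer)  # keep exploring behind a responding port
            else:
                # The silent port is reported as UNKNOWN and nothing behind
                # it is explored; the walk simply moves on.
                report.append((node, UNKNOWN))
    return report
```

In this model a silent port costs a bounded number of attempts and never
blocks the rest of the walk.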

I will be happy to see a log file that shows what you claim happens, or
even an explanation of how and where in the code this is caused.

I have been checking the way OpenSM handles unresponsive ports during the
last two weeks, and did not see such a case.
> 
> Assumption:
> 
> The proposed solution assumes that the ignore GUIDs file option of
> OpenSM only impacts the routing algorithm (path counting) and should not
> be extended for bad port handling.
[EZ] This is correct.
> 
> Proposed Solution:
> 
> The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM.
[EZ] This is already the current behavior. Nothing should be done.
> 
> At this point, should it attempt to revive the port by bringing the
> physical link down and back up ? Should it try this several times before
> declaring the port as "bad" ? In any case, this is a refinement on the
> basic strategy for dealing with this scenario.
> 
> Also, there could also be a periodic "ping" at a slower rate to check if
> the "bad" ports revive.
[EZ] This will be released in gen1 within 2 weeks or so. The enhancement to
light sweep will include the unresponsive ports in the light sweep. Once
they respond, a new heavy sweep will be generated.
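
A minimal Python sketch of that light-sweep decision, under stated
assumptions: the function name and its three parameters are hypothetical
and stand in for whatever OpenSM actually tracks internally.

```python
def light_sweep(switch_changed, pending_ports, probe):
    """Return True if a heavy sweep should be triggered.

    switch_changed: iterable of bools, one per switch (its "change bit")
    pending_ports:  ports that were silent during the last discovery
    probe:          probe(port) -> True if the port now answers
    """
    # Existing behaviour: any switch reporting a change forces a heavy sweep.
    if any(switch_changed):
        return True
    # Enhancement: also re-check the previously silent ports.
    return any(probe(port) for port in pending_ports)
```

For example, `light_sweep([False], ["p1", "p2"], lambda p: p == "p2")`
asks for a heavy sweep because one of the silent ports has started to
answer, even though no switch reported a change.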
> 
> A "bad" port per this scenario still maintains its LID and other state.
> OpenSM will indicate a "bad" port detected via an internal port physical
> state which it will set to down. The "real" port physical state will be
> reflected accurately inside OpenSM.
[EZ] It is better to use the "un-healthy" bit of the physical port - which
OpenSM is already maintaining.
> 
> Once a "bad" port is detected, it will no longer be polled and the
> routing algorithm should be invoked to route around this.
> 
> Is there a need to store these "bad" ports persistently (and ignore them
> on startup) ?
[EZ] No, I do not think so.

From iod00d at hp.com  Thu Apr  7 13:12:41 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 7 Apr 2005 13:12:41 -0700
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <1112902344.4490.91.camel@localhost.localdomain>
References: <1112902344.4490.91.camel@localhost.localdomain>
Message-ID: <20050407201241.GJ32545@esmail.cup.hp.com>

On Thu, Apr 07, 2005 at 03:32:24PM -0400, Hal Rosenstock wrote:
...
> Assumption:
> 
> The proposed solution assumes that the ignore GUIDs file option of
> OpenSM only impacts the routing algorithm (path counting) and should not
> be extended for bad port handling.
> 
> Proposed Solution:
> 
> The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM.

Generally speaking, separating recovery "policy" from "detection"
is a good thing.

...
> Is there a need to store these "bad" ports persistently (and ignore them
> on startup) ?

If opensm can see the physical link is ok, I would think it should save
any state. It's possible a system just hasn't loaded whatever
SW is necessary to talk to the SM and might require operator
intervention to kick that off (e.g. none of my systems auto-reboot
unless I'm testing a specific customer environment).

I expect it's a separate policy on how long to save information
after the physical link has been dropped - similar to DHCP.


grant


From halr at voltaire.com  Thu Apr  7 13:11:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 16:11:02 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
Message-ID: <1112904662.4490.99.camel@localhost.localdomain>

On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> Hi Hal,
> 
> Please see my comments below.
> 
> Eitan Zahavi
> 
> > Problem Statement:
> > 
> > Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> > NodeDescription to any node it finds. It then requests PortInfo for
> > each port which is physically up.
> > 
> > There are scenarios where the port is physically up, but there is no
> > response to the SM get requests. In this case, the OpenSM keeps
> > retrying, never gives up, and doesn't service anything else in the
> > subnet (I'm not 100% positive on this last point).
> [EZ] I have never seen this!  Are you sure about it? Are you sure we
> are talking about gen1 ported to gen2?
> 
> What will happen in a case of non responding port is that OpenSM will
> retry the send (actually the lower level does it) for the number of
> retries OpenSM is configured to use (actually 4 times) and then ignore
> the port and everything behind it. The reported topology (on stdout)
> will have the word UNKNOWN on the remote side of the link this port
> connects to.
> 
> I will be happy to see a log file that shows what you claim happens.
> Or even if you can explain to me how and where in the code causes
> that. 

This was reported by Ron a while ago on this list. He sent log extracts
of what was going on. It was around when I asked about the Anafa
firmware issue with LFTTop.

> I have been checking the way OpenSM handles irresponsive ports during
> the the last two weeks, and did not see such case.

Is this in both Gold 1.6.1 (OpenSM 1.7/1.7.1 ?) and Gold 1.7 (OpenSM
1.8) ? 

> > Assumption:
> > 
> > The proposed solution assumes that the ignore GUIDs file option of
> > OpenSM only impacts the routing algorithm (path counting) and should
> not
> > be extended for bad port handling.
> [EZ] This is correct.
> > 
> > Proposed Solution:
> > 
> > The OpenSM will implement a configurable policy (some number of
> > consecutive lack of responses to SM requests). At the point of
> > exhaustion of the timeout/retry strategy, that port will be marked
> as
> > "bad" by OpenSM.
> [EZ] This is already the current behavior. Nothing should be done.
> > 
> > At this point, should it attempt to revive the port by bringing the
> > physical link down and back up ? Should it try this several times
> before
> > declaring the port as "bad" ? In any case, this is a refinement on
> the
> > basic strategy for dealing with this scenario.
> > 
> > Also, there could also be a periodic "ping" at a slower rate to
> check if
> > the "bad" ports revive.
> [EZ] This will be released in gen1 within 2 weeks or so.

What OpenSM release will this be ?

>  The enhancement to light sweep will include the irresponsive ports in
> the light sweep. Once they respond a new heavy sweep will be
> generated.
> 
> > 
> > A "bad" port per this scenario still maintains its LID and other
> state.
> > OpenSM will indicate a "bad" port detected via an internal port
> physical
> > state which it will set to down. The "real" port physical state will
> be
> > reflected accurately inside OpenSM.
> [EZ] It is better to use the "un-healthy" bit of the physical port -
> which OpenSM is already maintaining.
> > 
> > Once a "bad" port is detected, it will no longer be polled and the
> > routing algorithm should be invoked to route around this.
> > 
> > Is there a need to store these "bad" ports persistently (and ignore
> them
> > on startup) ?
> [EZ] No I do not think so.

Thanks.

-- Hal


From mlleinin at hpcn.ca.sandia.gov  Thu Apr  7 13:14:46 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Thu, 07 Apr 2005 13:14:46 -0700
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
Message-ID: <1112904886.15180.202.camel@localhost>

On Thu, 2005-04-07 at 23:02 +0300, Eitan Zahavi wrote:
> >  
> > At this point, should it attempt to revive the port by bringing the 
> > physical link down and back up ? Should it try this several times
> before 
> > declaring the port as "bad" ? In any case, this is a refinement on
> the 
> > basic strategy for dealing with this scenario. 
> >  
> > Also, there could also be a periodic "ping" at a slower rate to
> check if 
> > the "bad" ports revive. 
> [EZ] This will be released in gen1 within 2 weeks or so. The
> enhancement to light sweep will include the irresponsive ports in the
> light sweep. Once they respond a new heavy sweep will be generated.
> 
 Are you submitting these changes to gen2?  If not, why not?  


   - Matt


From halr at voltaire.com  Thu Apr  7 13:27:41 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Apr 2005 16:27:41 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <20050407201241.GJ32545@esmail.cup.hp.com>
References: <1112902344.4490.91.camel@localhost.localdomain>
	<20050407201241.GJ32545@esmail.cup.hp.com>
Message-ID: <1112905460.4490.109.camel@localhost.localdomain>

On Thu, 2005-04-07 at 16:12, Grant Grundler wrote:

> > Is there a need to store these "bad" ports persistently (and ignore them
> > on startup) ?
> 
> If opensm can see the physical link is ok, I would think it save
> any state. It's possible a system just hasn't loaded whatever
> SW is necessary to talk to the SM and might require operator
> intervention to kick that off (e.g. none of my systems auto-reboot
> unless I'm testing a specific customer environment).

Yes, I think there is also a partial boot up case where the physical link
can be up but the node won't respond to SM MADs.

Still, I'm not sure why this would need to be saved persistently by the
SM. It seems like a transient state that would be detected, and if it
goes away that should be detected too. The only issue is that
detecting the recovery might take longer.

-- Hal


From roland at topspin.com  Thu Apr  7 14:42:49 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 07 Apr 2005 14:42:49 -0700
Subject: [openib-general] [ANNOUNCE] Userspace verbs/roland-uverbs branch
	merged to trunk
Message-ID: <52k6ne5kme.fsf@topspin.com>

I've just finished merging the userspace verbs support from the
roland-uverbs branch to the main trunk (https://openib.org/svn/gen2/trunk/).

For now, all userspace verbs development will be on the main trunk, so
I would suggest that everyone switch from using the roland-uverbs
branch to the trunk as soon as convenient.

Thanks,
  Roland


From greg at kroah.com  Thu Apr  7 17:10:12 2005
From: greg at kroah.com (Greg KH)
Date: Thu, 7 Apr 2005 17:10:12 -0700
Subject: [openib-general] Re: [PATCH][26.5/27] Add MT25204 PCI IDs
In-Reply-To: <52hdiqi22t.fsf@topspin.com>
References: <2005411249.RHQWyM8AFcqb1PM4@topspin.com>
	<52hdiqi22t.fsf@topspin.com>
Message-ID: <20050408001011.GB7010@kroah.com>

On Fri, Apr 01, 2005 at 02:06:50PM -0800, Roland Dreier wrote:
> Ugh, this patch is required to build support for the new Mellanox
> HCAs.  Greg K-H applied it to his tree a while ago but it hasn't made
> it to Linus yet.
> 
> Sorry,
>   Roland
> 
> Add PCI device IDs for new Mellanox MT25204 "Sinai" InfiniHost III Lx HCA.
> 
> Signed-off-by: Roland Dreier <roland at topspin.com>

Already in 2.6.12-rc2.

thanks,

greg k-h


From rf at q-leap.de  Thu Apr  7 23:25:46 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Fri, 8 Apr 2005 08:25:46 +0200
Subject: [openib-general] Re: Flashing Mellanox MT23108
In-Reply-To: <20050407185532.GC13172@mellanox.co.il>
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
	<004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
	<20050407185532.GC13172@mellanox.co.il>
Message-ID: <16982.9194.487907.278262@gargle.gargle.HOWL>

>>>>> "Michael" == Michael S Tsirkin <mst at mellanox.co.il> writes:

    Michael> Quoting r. Jerome Pioux <Jerome.Pioux at bull.com>:

    >> I don't know about mstflint but I used flint in the past from
    >> IBGD so it may be the same?  With flint, you need to use the
    >> raw image (bin) and not the Mellanox one (mlx).  This is what I
    >> used:

Hi Jerome,

thanks for your help. I know 'Mellanox flint'. Its disadvantage
is that it needs all the drivers loaded, while mstflint goes directly to
the PCI device.

    >> flint -d /dev/mst/mt23108_pci_cr2 -i
    >> /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b Image type:
    >> FailSafe Chip rev.: A1 GUIDs: 0005ad0000016770 0005ad0000016771
    >> 0005ad0000016772 0005ad000100d050 Board ID: ­
    >> 
    >> Burn image with the following GUIDs: Node: 0005ad0000016770
    >> Port1: 0005ad0000016771 Port2: 0005ad0000016772 Sys.Image:
    >> 0005ad000100d050 etc...
    >> 
    >> I think that I created the bin image using infiniburn.  I read
    >> fw-23108-a1-rel.mlx using infiniburn and write the bin image
    >> (raw format).  But if you have infiniburn working, I think that
    >> you can burn the mlx image directly.

    >> ----- Original Message -----
    >> 
    >> From: Roland Fehrenbacher To: openib-general at openib.org Sent:
    >> Thursday, April 07, 2005 5:38 AM Subject: [openib-general]
    >> Flashing Mellanox MT23108
    >> 
    >> Hi,
    >> 
    >> can anyone tell me how to flash a Mellanox MT23108 card with
    >> mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
    >> Mellanox I get
    >> 
    >> $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
    >> Not a valid image

    Michael> With mstflint you pass in the device location: -d 03:00.0
    Michael> Otherwise its the same.

Unfortunately, now even with the raw image prepared by using infiniburn
from fw-23108-a1-rel.mlx and the correct .brd, I get

mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b
  /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of image limits)
Not a valid image

Roland


From mst at mellanox.co.il  Fri Apr  8 02:32:07 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 8 Apr 2005 12:32:07 +0300
Subject: [openib-general] Re: Flashing Mellanox MT23108
In-Reply-To: <16982.9194.487907.278262@gargle.gargle.HOWL>
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
	<004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
	<20050407185532.GC13172@mellanox.co.il>
	<16982.9194.487907.278262@gargle.gargle.HOWL>
Message-ID: <20050408093207.GA21709@mellanox.co.il>

Quoting r. Roland Fehrenbacher <rf at q-leap.de>:
> Subject: Re: Flashing Mellanox MT23108
> 
> >>>>> "Michael" == Michael S Tsirkin <mst at mellanox.co.il> writes:
> 
>     Michael> Quoting r. Jerome Pioux <Jerome.Pioux at bull.com>:
> 
>     >> I don't know about mstflint but I used flint in the past from
>     >> IBGD so it may be the same?  With flint, you need to use the
>     >> raw image (bin) and not the Mellanox one (mlx).  This is what I
>     >> used:
> 
> Hi Jerome,
> 
> thanks for your help. I knew 'Mellanox flint'. The disadvantage of it
> is that it needs all the drivers loaded, while flint goes directly on
> the PCI device.
> 
>     >> flint -d /dev/mst/mt23108_pci_cr2 -i
>     >> /etc/ibfw/fw/fw-cougar-a1-3.3.2-jerome.bin b Image type:
>     >> FailSafe Chip rev.: A1 GUIDs: 0005ad0000016770 0005ad0000016771
>     >> 0005ad0000016772 0005ad000100d050 Board ID: ­
>     >> 
>     >> Burn image with the following GUIDs: Node: 0005ad0000016770
>     >> Port1: 0005ad0000016771 Port2: 0005ad0000016772 Sys.Image:
>     >> 0005ad000100d050 etc...
>     >> 
>     >> I think that I created the bin image using infiniburn.  I read
>     >> fw-23108-a1-rel.mlx using infiniburn and write the bin image
>     >> (raw format).  But if you have infiniburn working, I think that
>     >> you can burn the mlx image directly.
> 
>     >> ----- Original Message -----
>     >> 
>     >> From: Roland Fehrenbacher To: openib-general at openib.org Sent:
>     >> Thursday, April 07, 2005 5:38 AM Subject: [openib-general]
>     >> Flashing Mellanox MT23108
>     >> 
>     >> Hi,
>     >> 
>     >> can anyone tell me how to flash a Mellanox MT23108 card with
>     >> mstflint. When I try the firmware file fw-23108-a1-rel.mlx from
>     >> Mellanox I get
>     >> 
>     >> $ mstflint -d /proc/bus/pci/03/00.0 -i fw-23108-a1-rel.mlx b
>     >> Not a valid image
> 
>     Michael> With mstflint you pass in the device location: -d 03:00.0
>     Michael> Otherwise its the same.
> 
> Unfortunately, now even with the raw image prepared by using infiniburn
> from fw-23108-a1-rel.mlx and the correct .brd, I get
> 
> mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b
>   /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of image limits
> )
> Not a valid image
> 
> Roland
> 

I'll check this on Sunday.
What does mstflint -d 03:00.0 v show?

-- 
MST - Michael S. Tsirkin


From rf at q-leap.de  Fri Apr  8 02:36:39 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Fri, 8 Apr 2005 11:36:39 +0200
Subject: [openib-general] Re: Flashing Mellanox MT23108
In-Reply-To: <20050408093207.GA21709@mellanox.co.il>
References: <16981.10681.472289.311124@gargle.gargle.HOWL>
	<004a01c53b8c$45a4dde0$0211708d@gpv.az05.bull.com>
	<20050407185532.GC13172@mellanox.co.il>
	<16982.9194.487907.278262@gargle.gargle.HOWL>
	<20050408093207.GA21709@mellanox.co.il>
Message-ID: <16982.20647.399413.653949@gargle.gargle.HOWL>

>>>>> "Michael" == Michael S Tsirkin <mst at mellanox.co.il> writes:

    Michael> With mstflint you pass in the device location: -d 03:00.0
    Michael> Otherwise its the same.

    >>  Unfortunately, now even with the raw image prepared by using
    >> infiniburn from fw-23108-a1-rel.mlx and the correct .brd, I get
    >> 
    >> mstflint -d /proc/bus/pci/03/00.0 -i fw-23108.bin b
    >> /0x00030028/ (BOOT2) - read error (Address (0x3002c) is out of
    >> image limits ) Not a valid image

    Michael> I'll check this on Sunday.  What does mstflint -d 03:00.0
    Michael> v show?

I have

mstflint -d 03:00.0 v

Failsafe image:

Invariant       /0x00000028-0x000006f7 (0x0006d0)/ (BOOT2) - OK

Primary   Image /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK
                /0x00030028-0x00030b3b (0x000b14)/ (BOOT2) - OK
                /0x00030b3c-0x00034aa7 (0x003f6c)/ (BOOT2) - OK
                /0x00034aa8-0x000375f3 (0x002b4c)/ (Configuration) - OK
                /0x000375f4-0x00037627 (0x000034)/ (GUID) - OK
                /0x00037628-0x00044c83 (0x00d65c)/ (DDR) - OK
                /0x00044c84-0x0004d30f (0x00868c)/ (DDR) - OK
                /0x0004d310-0x00061c03 (0x0148f4)/ (DDR) - OK
                /0x00061c04-0x0006d80b (0x00bc08)/ (DDR) - OK
                /0x0006d80c-0x0007099f (0x003194)/ (DDR) - OK
                /0x000709a0-0x0007a5af (0x009c10)/ (DDR) - OK
                /0x0007a5b0-0x0007a707 (0x000158)/ (Configuration) - OK
                /0x0007a708-0x0007a74b (0x000044)/ (Jump addresses) - OK
                /0x0007a74c-0x000841cb (0x009a80)/ (EMT Service) - OK

Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
                /0x00090028-0x00090b3b (0x000b14)/ (BOOT2) - OK
                /0x00090b3c-0x00094aa7 (0x003f6c)/ (BOOT2) - OK
                /0x00094aa8-0x000975f3 (0x002b4c)/ (Configuration) - OK
                /0x000975f4-0x00097627 (0x000034)/ (GUID) - OK
                /0x00097628-0x000a4c83 (0x00d65c)/ (DDR) - OK
                /0x000a4c84-0x000ad30f (0x00868c)/ (DDR) - OK
                /0x000ad310-0x000c1c03 (0x0148f4)/ (DDR) - OK
                /0x000c1c04-0x000cd80b (0x00bc08)/ (DDR) - OK
                /0x000cd80c-0x000d099f (0x003194)/ (DDR) - OK
                /0x000d09a0-0x000da5af (0x009c10)/ (DDR) - OK
                /0x000da5b0-0x000da707 (0x000158)/ (Configuration) - OK
                /0x000da708-0x000da74b (0x000044)/ (Jump addresses) - OK
                /0x000da74c-0x000e41cb (0x009a80)/ (EMT Service) - OK
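
To read the listing above mechanically, here is a small Python sketch that
parses one section line and checks that the size in parentheses equals
end - start + 1. The regular expression is derived only from the lines
shown in this mail, so the format it expects is an assumption and not a
documented mstflint output format.

```python
import re

# Matches e.g. "/0x00030028-0x00030b3b (0x000b14)/ (BOOT2) - OK"
SECTION_RE = re.compile(
    r"/0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+) \(0x([0-9a-fA-F]+)\)/ \((.+?)\)\s*- (\w+)"
)

def parse_section(line):
    """Return a dict describing one section line, or None if it does not match."""
    m = SECTION_RE.search(line)
    if m is None:
        return None
    start, end, size = (int(g, 16) for g in m.groups()[:3])
    return {
        "start": start,
        "end": end,
        "size": size,
        "name": m.group(4),
        "status": m.group(5),
        # The address range is inclusive, hence the +1.
        "size_consistent": end - start + 1 == size,
    }
```

For the first primary-image BOOT2 line this gives start 0x30028 and end
0x30b3b, so the address 0x3002c from the earlier read error lies inside
that range on the flash.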

Roland


From halr at voltaire.com  Fri Apr  8 07:47:49 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 10:47:49 -0400
Subject: [openib-general] A Couple More CM Queries
Message-ID: <1112971669.4522.147.camel@localhost.localdomain>

Hi Sean,

I have a couple more questions about the CM:

1. cm_alloc_id does an idr_get_new_above starting at 1. Might it be
better to save the highest value and start there, so that connection IDs
are less likely to repeat so soon ?

2. Should ib_create_cm_id check for and return an error if
cm_handler == NULL, just to make sure ?
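
To make question 1 concrete, here is a small Python model of the two
allocation strategies. It does not use the kernel idr API; both classes
and the max_id default are hypothetical and only show how soon a freed ID
comes back in each scheme.

```python
class LowestFreeIds:
    """Always hands out the smallest unused ID >= 1."""

    def __init__(self):
        self.in_use = set()

    def alloc(self):
        new_id = 1
        while new_id in self.in_use:
            new_id += 1
        self.in_use.add(new_id)
        return new_id

    def free(self, old_id):
        self.in_use.discard(old_id)


class CyclicIds(LowestFreeIds):
    """Starts each search just above the last ID handed out, wrapping at max_id."""

    def __init__(self, max_id=0x7FFFFFFF):
        super().__init__()
        self.max_id = max_id
        self.last = 0

    def alloc(self):
        candidate = self.last
        for _ in range(self.max_id):
            candidate = candidate % self.max_id + 1  # next ID in 1..max_id, wrapping
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                self.last = candidate
                return candidate
        raise RuntimeError("no free IDs")
```

With LowestFreeIds, allocating, freeing and allocating again returns the
same ID immediately; with CyclicIds the freed ID is not reused until the
search wraps around.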

Thanks.

-- Hal


From eitan at mellanox.co.il  Fri Apr  8 08:11:08 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 18:11:08 +0300 
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BB@mtlex01.yok.mtl.com>

Hi Matt,

>  Are you submitting these changes to gen2?  If not, why not?
> 
[EZ] Mellanox is focused on improving the gen1 stack while contributing
everything it builds to the community. OpenSM from gen1 is ported (merged)
into gen2 tree by Hal and Shahar from Voltaire. I will be publishing the
latest changes to OpenSM once they pass minimal QA. They will be posted in:
https://openib.org/svn/gen1/trunk/src/userspace/osm

I will publish the list of changes from the previous version in a separate mail.

From halr at voltaire.com  Fri Apr  8 08:23:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 11:23:44 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0B8@mtlex01.yok.mtl.com>
Message-ID: <1112973823.4522.150.camel@localhost.localdomain>

Hi Eitan,

On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> [EZ] It is better to use the "un-healthy" bit of the physical port -
> which OpenSM is already maintaining.

What is the name of this bit and in what structure does it appear ?

Thanks.

-- Hal


From eitan at mellanox.co.il  Fri Apr  8 08:55:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 18:55:39 +0300 
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>

Hi Hal,

This is a physical port attribute so the file is osm_port.h and the
structure is osm_physp_t.
From the doc on the structure:
*
*  healthy
*     Tracks the health of the port. Normally should be TRUE but 
*     might change as a result of incoming traps indicating the port
*     healthy is questionable.
*

I have been trying my best to find how a port that does not respond could
cause OpenSM to poll it continuously. This cannot happen, so unless you
can explain how it happens please do not contaminate the code
with un-needed code.

The only thing that comes to mind is the case of failure to "Set" some
attributes of devices in the fabric. This happens only after discovery is
completed and only after the validity of the database is verified (i.e.
each node has ports, each port has a node ...).
In that case of failure to set some attributes (e.g. LFT, PortInfo etc.)
OpenSM will output a clear error message: "Errors in initialization" and will
restart a full sweep of the fabric.
This is the only way one can get infinite polling of the entire subnet.

In general, it might make sense to improve how OpenSM qualifies each
fabric port by tracking the number of packet drops versus good
packets that passed through it. Note this is complex because a port might
affect any packet that goes through it, and there is no way to know on which
hop of the path a packet was dropped.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Friday, April 08, 2005 6:24 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> Hi Eitan,
> 
> On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> > [EZ] It is better to use the "un-healthy" bit of the physical port -
> > which OpenSM is already maintaining.
> 
> What is the name of this bit and in what structure does it appear ?
> 
> Thanks.
> 
> -- Hal

From mshefty at ichips.intel.com  Fri Apr  8 09:34:24 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 08 Apr 2005 09:34:24 -0700
Subject: [openib-general] Re: A Couple More CM Queries
In-Reply-To: <1112971669.4522.147.camel@localhost.localdomain>
References: <1112971669.4522.147.camel@localhost.localdomain>
Message-ID: <4256B290.7020704@ichips.intel.com>

Hal Rosenstock wrote:
> 1. cm_alloc_id does an idr_get_new_above starting at 1. Might this be
> better saving the highest value and starting there so connection IDs are
> less likely to repeat as soon ?

I _think_ this would result in the IDR tables growing to their maximum 
size, which seems worse than repeating the IDs immediately after their 
timewait expires.

> 2. Should ib_create_cm_id check return an error if cm_handler == NULL
> just to make sure ?

Personally, I don't think it's worth this check for kernel clients, 
unless we want to start checking for NULL parameters everywhere.

While on the CM, I did look at the issue of calling the API out of 
order that you had pointed out before (which could result in accessing 
a NULL port pointer).  I'm not convinced that a simple check for a NULL 
port pointer covers all potential problems.  For example, I'm not sure 
how well the codebase will handle the dynamic removal of a device while 
users are attempting to access the device.

- Sean


From eitan at mellanox.co.il  Fri Apr  8 09:40:09 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 19:40:09 +0300 
Subject: [openib-general] OpenSM work
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BE@mtlex01.yok.mtl.com>

Hi All,

FYI: Mellanox is focusing on the following items on OpenSM development for
the last few weeks:

1.	Stability testing over the IB management simulator:
a.	Randomly pick bad links with high packet drop statistics – success
is SUBNET UP
b.	Route using up/down algorithm – success is no credit loops

2.	Semi-static LID assignment:
a.	Developed an interface for persistent storage of arbitrary data. The
goal is to enable further development of an LDAP (a la Troy's request) or SQL
module. Please see osm_db.h attached.
b.	Developed a file-based implementation of osm_db.h.
c.	Modified osm_lid_mgr (the LID assignment algorithm) to use the LIDs stored
in the persistent storage. It handles all cases of a bad file and of new LIDs
on the fabric. The -r flag now lets OpenSM overwrite the known data. Persistent
GUID-to-LID data is kept even if the GUID disappears for a while. The code
also handles LID assignment for LMC > 0 better than the previous
algorithm, which used to assign 2^LMC LIDs to every port, even to switch
port 0. Now it will only reserve 1 LID for switch port 0.

3.	Unresponsive port:
a.	The phenomenon: a port does not respond to the SM during the
discovery stage. OpenSM cannot obtain enough data about the port and thus
it does not appear in the final database. Since OpenSM uses light sweeps
when there is no "change detected", it will not query the port until a
switch either sets its "change bit" or sends a trap. So the unresponsive
port will never be polled again unless there is a heavy sweep.
b.	The solution:
i.	During discovery, track ports (physical ports) that have their
logical link state != DOWN but whose peer port on the other side of the link
is not known to the SM.
ii.	During light sweep: not only scan the switches' "change bit" but
also test whether the port on the other side of each of these ports (from i)
is responding. If it is, issue a heavy sweep.

4.	Head of Queue Life:
a.	Problem: in cases of PCI hardware failure, HCAs cannot complete RDMA
requests and lose all credits on their input ports (in other words, their
input buffers fill up). They thus create back pressure on the fabric.
b.	Solution: use a short head of queue time limit on every switch port
that drives an HCA.

5.	SA queries stress testing:
a.	We are exploring max performance of the SA and ways to improve it.
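
As a worked example of the LID counting in item 2c, here is a small Python
sketch. The function names are made up, and it assumes every port other
than switch port 0 is an HCA port that gets the full 2^LMC LIDs.

```python
def lids_reserved(lmc, is_switch_port0, old_behaviour=False):
    """Number of LIDs set aside for one port."""
    if is_switch_port0 and not old_behaviour:
        return 1        # new behaviour: a single LID for switch port 0
    return 2 ** lmc     # 2^LMC LIDs for any other port (and for port 0 before)


def total_lids(num_hca_ports, num_switches, lmc, old_behaviour=False):
    """Total LIDs consumed by a fabric with the given port counts."""
    per_hca = lids_reserved(lmc, False, old_behaviour)
    per_switch = lids_reserved(lmc, True, old_behaviour)
    return num_hca_ports * per_hca + num_switches * per_switch
```

With 8 HCA ports, 2 switches and LMC = 2, the previous scheme consumes
8*4 + 2*4 = 40 LIDs, while the new one consumes 8*4 + 2*1 = 34.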

Eitan


Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


-------------- next part --------------
A non-text attachment was scrubbed...
Name: osm_db.h
Type: application/octet-stream
Size: 11514 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050408/91c96aae/attachment.obj>

From halr at voltaire.com  Fri Apr  8 12:06:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 15:06:44 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
Message-ID: <1112987204.4903.19.camel@localhost.localdomain>

On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote: 
> Hi Hal,
> 
> This is a physical port attribute so the file is osm_port.h and the
> structure is osm_physp_t.
> From the doc on the structure:
> *
> *  healthy
> *     Tracks the health of the port. Normally should be TRUE but 
> *     might change as a result of incoming traps indicating the port
> *     healthy is questionable.
> *

Yup. It's definitely there in the gen2 code base.

> I have been trying my best to find how it can happen that a port that
> does not respond will cause OpenSM to continuously poll it. This can
> not happen so unless you can explain how it happens please do not
> contaminate the code with un-needed code.

This part of the code has not been touched. I've put all meaningful
patches and ideas on how things might change out on this list.

> The only thing that comes to mind it the case of failure to "Set" some
> attributes of devices in the fabric. 

I dug out Ron's emails on this. I don't think that was what was going
on. It was a SubnGet(NodeInfo) which failed. See
http://openib.org/pipermail/openib-general/2005-February/009125.html
for more details.

-- Hal

> This happens only after discovery is completed and only after the
> validity of the data base is verified (i.e. each node has ports, each
> port has a node ...)
> 
> In that case of failure to set some attributes (aka LFT, PortInfo etc)
> OpenSM will output a clear error message: "Errors in intialization"
> and will restart a full sweep of the fabric. 
> 
> This is the only way one can get infinite polling of the entire subnet.

> In general, it might make sense to try and improve how OpenSM
> qualifies each fabric port for the statistics of the number of packet
> drops versus good packets it passed through. Note this is complex due
> to the fact a port might affect packets that go through it. And
> there is no way to know on which hop on the path the packet was
> dropped. 


From halr at voltaire.com  Fri Apr  8 12:37:32 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 15:37:32 -0400
Subject: [openib-general] SDP Performance
Message-ID: <1112988342.4546.12.camel@localhost.localdomain>

Hi Libor,

A couple of questions about SDP performance:

1. When running the AIO version of TTCP, there appears to be a bandwidth
degradation when using buffer sizes from about 5K to 13K. Do you see
this too ? If so, is there an explanation for this ?

2. Also, is there a program you use to measure SDP latency ?

Thanks.

-- Hal


From eitan at mellanox.co.il  Fri Apr  8 13:05:57 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 8 Apr 2005 23:05:57 +0300 
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com>

Hi Hal,

I have looked up the mail thread. 
I could not find a log file indicating there was a repetitive query of the
bad port.

I know the code and I cannot find a reason for it to query a port repeatedly.
 
Can you explain how this can happen? 


Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL



From halr at voltaire.com  Fri Apr  8 13:16:34 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Apr 2005 16:16:34 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0BF@mtlex01.yok.mtl.com>
Message-ID: <1112991393.4490.3.camel@localhost.localdomain>

On Fri, 2005-04-08 at 16:05, Eitan Zahavi wrote:
> Hi Hal,
> 
> I have looked up the mail thread. 
> I could not find a log file indicating there was a repetitive query of
> the bad port.

That was what started this thread: I believe that was what Ron reported
at the time which was back at the end of February. Perhaps there is
insufficient log to back that up. 

> I know the code and I cannot find a reason for it to query a port
> repeatedly.
>  
> Can you explain how this can happen? 

I haven't looked at the code. I can't explain it. 

I don't even know for sure whether that was what was going on. I will go
back through the thread again.

-- Hal


From iod00d at hp.com  Fri Apr  8 17:34:48 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 8 Apr 2005 17:34:48 -0700
Subject: [openib-general] ia64 perf and FMR
In-Reply-To: <52hdimds6e.fsf@topspin.com>
References: <20050402024048.GN11094@esmail.cup.hp.com>
	<20050404055131.GA19409@esmail.cup.hp.com>
	<52hdimds6e.fsf@topspin.com>
Message-ID: <20050409003448.GI3844@esmail.cup.hp.com>

On Mon, Apr 04, 2005 at 04:43:21PM -0700, Roland Dreier wrote:
> A binary search to find the changeset that makes the difference would
> be really useful.  I read through the svn log from r2046 through r2082
> and I don't see anything that should make a difference to IPoIB.

I've worked backwards and didn't see any changes with netperf TCP_STREAM.
I can't establish why the performance is substantially
different with r2050 compared to before:

	SVN Rev		Best	Worst
	r2104-MSIX	3609	2623
	r2081-MSIX	3580	2639
	r2062-MSIX	3609	2618
	r2054-MSIX	3602	2635
	r2050-MSIX	3598	2636

	r2050-IRQ	3594	2433
	r2050-orig	1738(*)

Numbers are in Mbits/s. 3600 is ~450 MB/s. 2600 is ~325 MB/s.

Differences between Best/Worst are caused by binding netperf and
netserver to the same (or different) CPU as the one handling
interrupts. See "-T" in netperf 2.4.0-rc1 "experimental"
release.

(*) I didn't use -T with netperf to explore IRQ assignment.

I think differences in "Best" column are not significant.
(Except for the r2050-orig number of course...)
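
The unit conversion and the roughly 2x gap quoted above can be checked
with a few lines of C. This is only an illustrative sketch: the function
names are made up for the example, and the inputs are the Mbit/s figures
from the table.

```c
#include <assert.h>

/* Convert a netperf-style rate in Mbit/s to MB/s (8 bits per byte). */
static double mbits_to_mbytes(double mbits_per_s)
{
	assert(mbits_per_s >= 0.0);
	return mbits_per_s / 8.0;
}

/* Ratio between two rates, e.g. r2050-MSIX "Best" vs. r2050-orig. */
static double speedup(double new_rate, double old_rate)
{
	assert(old_rate > 0.0);
	return new_rate / old_rate;
}
```

With the table's numbers, mbits_to_mbytes(3600.0) gives 450 MB/s, and
speedup(3598.0, 1738.0) comes out a little above 2.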


...
> So I wonder what obvious thing I'm missing...

I'm using gcc-3.3.5 (Debian 1:3.3.5-8) now and may have used gcc 3.4
or a slightly older gcc-3.3.  I might be confusing gcc version
with other work I've done too.  I'm of course kicking myself for being
sloppy and not tracking that precisely.

I'm also using a different version of netperf (2.4.0-rc1).
But I don't expect substantial changes in TCP_STREAM implementation
that might cause 2x difference in performance.
TCP_STREAM test is "mature" code.

My best/worst theory right now is the TS90 switch was in a semi-comatose
state when I collected the perf numbers in January. When I tried to collect
perf data in March, the TS90 switch was non-responsive at the serial
console and a power cycle took care of that. I hadn't cycled power
on the switch since installing it in June 2004 or so.

thanks,
grant


From roland at topspin.com  Sat Apr  9 13:24:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Sat, 09 Apr 2005 13:24:20 -0700
Subject: [openib-general] [PATCH] SEND_INLINE support in libmthca
In-Reply-To: <20050405074213.GC15034@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 5 Apr 2005 10:42:13 +0300")
References: <20050404150235.GZ15034@mellanox.co.il>
	<52is32feq2.fsf@topspin.com> <20050405074213.GC15034@mellanox.co.il>
Message-ID: <528y3r3dhn.fsf@topspin.com>

Thanks, applied.


From mst at mellanox.co.il  Sun Apr 10 01:47:25 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Apr 2005 11:47:25 +0300
Subject: [openib-general] [PATCH] uverbs with static libraries
Message-ID: <20050410084724.GZ20567@mellanox.co.il>

Hello, Roland!
I'd like to get userspace verbs working with static libraries.
My motivation is currently enabling our code coverage tools which only
work well with static libraries, but I expect there to be
other uses.

The following patch makes it possible to link libmthca directly
into the main executable.
Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: libibverbs/src/init.c
===================================================================
--- libibverbs/src/init.c	(revision 2104)
+++ libibverbs/src/init.c	(working copy)
@@ -105,6 +105,8 @@
 		return;
 	}
 
+	load_driver(NULL);
+
 	for (i = 0; i < so_glob.gl_pathc; ++i)
 		load_driver(so_glob.gl_pathv[i]);
 }


-- 
MST - Michael S. Tsirkin


From hozer at hozed.org  Mon Apr 11 07:22:13 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 11 Apr 2005 09:22:13 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <200544159.Ahk9l0puXy39U6u6@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
Message-ID: <20050411142213.GC26127@kalmia.hozed.org>

> In particular, the memory pinning code in uverbs_mem.c could stand
> a looking over.  In addition, a sanity check of the write()-based
> scheme for passing commands into the kernel in uverbs_main.c and
> uverbs_cmd.c is probably worthwhile.

How is memory pinning handled? (I haven't had time to read all the code,
so please excuse my ignorance of something obvious).

The old mellanox drivers used to have a hack to call 'sys_mlock', and
promiscuously lock memory any old userspace application asked for. What
is the API for the new uverbs memory registration, and how will things
like memory hotplug and NUMA page migration be able to unpin pages
locked by a user program?

I have applications that would benefit from being able to register 15GB
of memory on a machine with 16GB. Right now, MPI and other possible
users of infiniband in userspace have to play caching games and limit
what they can register. But locking all that memory without providing
the kernel a way to unlock it under memory pressure or for page
migration seems like a bad idea.


From roland at topspin.com  Mon Apr 11 08:34:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 08:34:19 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411142213.GC26127@kalmia.hozed.org> (Troy Benjegerdes's
	message of "Mon, 11 Apr 2005 09:22:13 -0500")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
Message-ID: <52mzs51g5g.fsf@topspin.com>

    Troy> How is memory pinning handled? (I haven't had time to read
    Troy> all the code, so please excuse my ignorance of something
    Troy> obvious).

The userspace library calls mlock() and then the kernel does
get_user_pages().

    Troy> The old mellanox drivers used to have a hack to call
    Troy> 'sys_mlock', and promiscuously lock memory any old userspace
    Troy> application asked for. What is the API for the new uverbs
    Troy> memory registration, and how will things like memory hotplug
    Troy> and NUMA page migration be able to unpin pages locked by a
    Troy> user program?

The API for uverbs memory registration is ibv_reg_mr(), and right now
the memory is pinned and that's it.

 - R.


From roland at topspin.com  Mon Apr 11 08:36:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 08:36:37 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <20050410084724.GZ20567@mellanox.co.il> (Michael S. Tsirkin's
	message of "Sun, 10 Apr 2005 11:47:25 +0300")
References: <20050410084724.GZ20567@mellanox.co.il>
Message-ID: <52is2t1g1m.fsf@topspin.com>

    Michael> I'd like to get userspace verbs working with static
    Michael> libraries.  My motivation is currently enabling our code
    Michael> coverage tools which only work well with static
    Michael> libraries, but I expect there to be other uses.

Looks reasonable.  With this, do you then do --enable-static when
configuring libmthca or is there anything else required?

 - R.


From mst at mellanox.co.il  Mon Apr 11 09:20:42 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Apr 2005 19:20:42 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52is2t1g1m.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com>
Message-ID: <20050411162042.GQ2477@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [PATCH] uverbs with static libraries
> 
>     Michael> I'd like to get userspace verbs working with static
>     Michael> libraries.  My motivation is currently enabling our code
>     Michael> coverage tools which only work well with static
>     Michael> libraries, but I expect there to be other uses.
> 
> Looks reasonable.  With this, do you then do --enable-static when
> configuring libmthca or is there anything else required?
> 
>  - R.
> 

A small patch in makefile seems to be required, I'll send that separately
after I clean it up.

-- 
MST - Michael S. Tsirkin


From rf at q-leap.de  Mon Apr 11 09:28:20 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Mon, 11 Apr 2005 18:28:20 +0200
Subject: [openib-general] OpenSM (again)
Message-ID: <16986.42404.540439.952094@gargle.gargle.HOWL>

Hi,

I got gen2 opensm running fine now (there was a problem with a wrong
include file), and managed to get IP running on a network of
currently 40 machines (final size will be 144). Performance is pretty
impressive (initial tests with a simple netpipe): I got a latency of
18 microseconds, and a maximum throughput of approx. 400MB/sec at packet
size approx. 1MB which then levels off at about 340MB/s for larger
packets.

One problem and two questions:

Problem: When I reboot all the 40 nodes (apart from the one opensm is
running on), the network is non-functional (no pings go through, even
though ports show status "Active") for quite a while (more than 10
minutes) after all the nodes have come up. It then recovers without
intervention. Is this normal? Single node reboots don't affect the
network operation. osm Log file is appended.

Question 1: Can I run opensm in a master slave configuration? I noticed
that there is a priority commandline option, but am not sure how to
apply this.

Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
compute nodes (need fast MPI), and gen2 on the control/storage nodes
(need only IP) with gen2 opensm running on the control nodes. Is there
any reason why this should not work reliably?

Roland

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osm-port1.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050411/4d15ba5b/attachment.ksh>

From hozer at hozed.org  Mon Apr 11 09:33:42 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 11 Apr 2005 11:33:42 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52mzs51g5g.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
Message-ID: <20050411163342.GE26127@kalmia.hozed.org>

On Mon, Apr 11, 2005 at 08:34:19AM -0700, Roland Dreier wrote:
>     Troy> How is memory pinning handled? (I haven't had time to read
>     Troy> all the code, so please excuse my ignorance of something
>     Troy> obvious).
> 
> The userspace library calls mlock() and then the kernel does
> get_user_pages().

Is there a check in the kernel that the memory is actually mlock()ed?

What if a malicious (or broken) application does ibv_reg_mr() but
doesn't lock the memory? Does the IB card get a physical address for a
page that might get swapped out?

>     Troy> The old mellanox drivers used to have a hack to call
>     Troy> 'sys_mlock', and promiscuously lock memory any old userspace
>     Troy> application asked for. What is the API for the new uverbs
>     Troy> memory registration, and how will things like memory hotplug
>     Troy> and NUMA page migration be able to unpin pages locked by a
>     Troy> user program?
> 
> The API for uverbs memory registration is ibv_reg_mr(), and right now
> the memory is pinned and that's it.
> 
>  - R.


From roland at topspin.com  Mon Apr 11 09:56:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 09:56:53 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411163342.GE26127@kalmia.hozed.org> (Troy Benjegerdes's
	message of "Mon, 11 Apr 2005 11:33:42 -0500")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
Message-ID: <5264yt1cbu.fsf@topspin.com>

    Troy> Is there a check in the kernel that the memory is actually
    Troy> mlock()ed?

No.

    Troy> What if a malicious (or broken) application does
    Troy> ibv_reg_mr() but doesn't lock the memory? Does the IB card
    Troy> get a physical address for a page that might get swapped
    Troy> out?

No, the kernel does get_user_pages().  So the pages that the HCA gets
will not be swapped or used for anything else.  The only thing a
malicious userspace app can do is screw itself up.

 - R.


From halr at voltaire.com  Mon Apr 11 10:50:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 13:50:24 -0400
Subject: [openib-general] [PATCH] mthca: Don't call CQ completion handler if
	it doesn't exist
Message-ID: <1113241284.4490.19.camel@localhost.localdomain>

mthca: Don't call CQ completion handler if it doesn't exist

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

Index: mthca_cq.c
===================================================================
--- mthca_cq.c	(revision 2154)
+++ mthca_cq.c	(working copy)
@@ -206,7 +206,8 @@
 
 	++cq->arm_sn;
 
-	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+	if (cq->ibcq.comp_handler)
+		cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
 }
 
 void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn)


From roland at topspin.com  Mon Apr 11 10:58:53 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 10:58:53 -0700
Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion
 handler if it doesn't exist
In-Reply-To: <1113241284.4490.19.camel@localhost.localdomain> (Hal
	Rosenstock's message of "11 Apr 2005 13:50:24 -0400")
References: <1113241284.4490.19.camel@localhost.localdomain>
Message-ID: <52sm1xyz36.fsf@topspin.com>

    Hal> mthca: Don't call CQ completion handler if it doesn't exist

Why do we want to add this test?  This is adding a conditional branch
in what I think is a fast path, and I would consider it a bug in the
consumer if it creates a CQ with an invalid completion handler and
then requests a completion event for that CQ.  Am I missing something?

 - R.


From hozer at hozed.org  Mon Apr 11 11:01:08 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 11 Apr 2005 13:01:08 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <5264yt1cbu.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
Message-ID: <20050411180107.GF26127@kalmia.hozed.org>

On Mon, Apr 11, 2005 at 09:56:53AM -0700, Roland Dreier wrote:
>     Troy> Is there a check in the kernel that the memory is actually
>     Troy> mlock()ed?
> 
> No.
> 
>     Troy> What if a malicious (or broken) application does
>     Troy> ibv_reg_mr() but doesn't lock the memory? Does the IB card
>     Troy> get a physical address for a page that might get swapped
>     Troy> out?
> 
> No, the kernel does get_user_pages().  So the pages that the HCA gets
> will not be swapped or used for anything else.  The only thing a
> malicious userspace app can do is screw itself up.
> 
>  - R.

Do we even need the mlock in userspace then?


From halr at voltaire.com  Mon Apr 11 11:01:32 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 14:01:32 -0400
Subject: [openib-general] Re: A Couple More CM Queries
In-Reply-To: <4256B290.7020704@ichips.intel.com>
References: <1112971669.4522.147.camel@localhost.localdomain>
	<4256B290.7020704@ichips.intel.com>
Message-ID: <1113242491.4490.4.camel@localhost.localdomain>

On Fri, 2005-04-08 at 12:34, Sean Hefty wrote: 
> Hal Rosenstock wrote:
> > 1. cm_alloc_id does an idr_get_new_above starting at 1. Might this be
> > better saving the highest value and starting there so connection IDs are
> > less likely to repeat as soon ?
> 
> I _think_ this would result in the IDR tables growing to their maximum 
> size, which seems worse than repeating the IDs immediately after their 
> timewait expires.
> 
> > 2. Should ib_create_cm_id return an error if cm_handler == NULL
> > just to make sure ?
> 
> Personally, I don't think it's worth this check for kernel clients, 
> unless we want to start checking for NULL parameters everywhere.

Incoming REQs currently use this capability anyhow.

> While on the CM, I did look at the issue of calling the API out of 
> order that you had pointed out before (which could result in accessing 
> a NULL port pointer).  I'm not convinced that a simple check for a NULL 
> port pointer covers all potential problems.  For example, I'm not sure 
> how well the codebase will handle the dynamic removal of a device while 
> users are attempting to access the device.

We may need to handle this at some point. Guess the changes may be larger
if/when we get there.

A couple more questions:

It looks like sending private data in REQ/REP/RTU is supported, but incoming
private data isn't handled on the receiving side.

Also, in cm_process_send_error(), where the handler is called

cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event);

might that callback request the CM ID destruction ? If so, some
code is missing to handle this.
  
Thanks.

-- Hal


From roland at topspin.com  Mon Apr 11 11:03:08 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 11:03:08 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411180107.GF26127@kalmia.hozed.org> (Troy Benjegerdes's
	message of "Mon, 11 Apr 2005 13:01:08 -0500")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
Message-ID: <52oeclyyw3.fsf@topspin.com>

    Troy> Do we even need the mlock in userspace then?

Yes, because the kernel may go through and unmap pages from userspace
while trying to swap.  Since we have the page locked in the kernel,
the physical page won't go anywhere, but userspace might end up with a
different page mapped at the same virtual address.

 - R.


From ardavis at ichips.intel.com  Mon Apr 11 11:31:51 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Mon, 11 Apr 2005 11:31:51 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <1112892615.4877.18.camel@localhost.localdomain>
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>	
	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	 <4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
Message-ID: <425AC297.9090706@ichips.intel.com>

Hal Rosenstock wrote:

>On Thu, 2005-04-07 at 12:47, ardavis wrote:
>  
>
>>EM64T server with MT25208 (MT23108 compat mode), fw_ver 4.6.0,  hw_rev A0
>>    
>>
>
>Didn't 4.6.0 have an issue with CQ handling ? Can you try 4.6.2 ?
>
>-- Hal
>
>  
>
4.6.2 did not help.  I don't see any indication of mthca_cq_event 
firing. Could it be an issue with the user mode mthca arm_cq mappings?


From mshefty at ichips.intel.com  Mon Apr 11 11:47:32 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 11 Apr 2005 11:47:32 -0700
Subject: [openib-general] Re: A Couple More CM Queries
In-Reply-To: <1113242491.4490.4.camel@localhost.localdomain>
References: <1112971669.4522.147.camel@localhost.localdomain>	
	<4256B290.7020704@ichips.intel.com>
	<1113242491.4490.4.camel@localhost.localdomain>
Message-ID: <425AC644.1060405@ichips.intel.com>

Hal Rosenstock wrote:
>>>2. Should ib_create_cm_id return an error if cm_handler == NULL
>>>just to make sure ?
>>
>>Personally, I don't think it's worth this check for kernel clients, 
>>unless we want to start checking for NULL parameters everywhere.
> 
> Incoming REQs currently use this capability anyhow.

Incoming REQs use the cm_handler associated with the listen request.

>>While on the CM, I did look at the issue of calling the API out of 
>>order that you had pointed out before (which could result in accessing 
>>a NULL port pointer).  I'm not convinced that a simple check for a NULL 
>>port pointer covers all potential problems.  For example, I'm not sure 
>>how well the codebase will handle the dynamic removal of a device while 
>>users are attempting to access the device.
> 
> We may need to handle this at some point. Guess the changes may be larger
> if/when we get there.

One of the side effects of changing the CM from using a pointer to a QP 
to just the QPN is that the CM can no longer rely on the device being 
around.  And I agree, this will need to be handled at some point, but 
may not be a huge issue as long as the client is reasonable and 
disconnects before destroying their QP.

> It looks like sending private data in REQ/REP/RTU is supported, but incoming
> private data isn't handled on the receiving side.

The private_data is given to the user in the cm_event structure.  Look 
for work->cm_event.private_data = in cm_format_req_event, 
cm_format_rep_event, and cm_rtu_handler.  Note that the private_data is 
only available while in the CM event callback.

> Also, in cm_process_send_error(), where the handler is called
> 
> cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event);
> 
> might that callback request the CM ID destruction ? If so, some
> code is missing to handle this.

Yep - this is a bug.  Send errors should probably be handled using the 
same cm_process_work routine that the receive handling goes through. 
I'll generate a patch for this, but it'll take me a few days, unless 
this is urgent.

- Sean
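
One way the destruction case could be handled is sketched below. It
assumes, purely for illustration, that the callback asks for destruction
by returning nonzero; the thread does not say how the request is
signalled, and all toy_* names are invented rather than taken from the CM
code. The point is only that the caller tears the ID down after the
callback returns and does not touch it again.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_cm_id;
typedef int (*toy_cm_handler)(struct toy_cm_id *id, int event);

/* Toy stand-in for a CM ID; not the real ib_cm_id layout. */
struct toy_cm_id {
	toy_cm_handler cm_handler;
	int *destroyed;	/* set to 1 when the ID is freed, for demonstration */
};

static struct toy_cm_id *toy_create_id(toy_cm_handler handler, int *destroyed)
{
	struct toy_cm_id *id = malloc(sizeof(*id));

	if (id == NULL)
		return NULL;
	id->cm_handler = handler;
	id->destroyed = destroyed;
	*destroyed = 0;
	return id;
}

static void toy_destroy_id(struct toy_cm_id *id)
{
	*id->destroyed = 1;
	free(id);
}

/*
 * Deliver one event. Returns 1 if the handler asked for destruction and
 * the ID was freed, 0 otherwise. After a return of 1 the caller must not
 * dereference id.
 */
static int toy_process_event(struct toy_cm_id *id, int event)
{
	assert(id != NULL && id->cm_handler != NULL);
	if (id->cm_handler(id, event) != 0) {
		toy_destroy_id(id);
		return 1;
	}
	return 0;
}

/* Example handler: request destruction on event 1 (say, a send error). */
static int toy_handler(struct toy_cm_id *id, int event)
{
	(void)id;
	return event == 1;
}
```

Routing both send errors and receive events through one such function
keeps the "was the ID destroyed?" check in a single place.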


From halr at voltaire.com  Mon Apr 11 11:57:47 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 14:57:47 -0400
Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion
	handler if it doesn't exist
In-Reply-To: <52sm1xyz36.fsf@topspin.com>
References: <1113241284.4490.19.camel@localhost.localdomain>
	<52sm1xyz36.fsf@topspin.com>
Message-ID: <1113245695.4616.8.camel@localhost.localdomain>

On Mon, 2005-04-11 at 13:58, Roland Dreier wrote: 
>     Hal> mthca: Don't call CQ completion handler if it doesn't exist
> 
> Why do we want to add this test?  This is adding a conditional branch
> in what I think is a fast path, and I would consider it a bug in the
> consumer if it creates a CQ with an invalid completion handler and
> then requests a completion event for that CQ.  Am I missing something?

Then shouldn't this be indicated as an error at create_cq time ?

-- Hal
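
For comparison, a minimal sketch of the alternative raised here: check the
handler once at creation time so the event path can call it
unconditionally. The toy_* types and functions are hypothetical stand-ins,
not the mthca or ib_verbs definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for a CQ and its completion callback. */
struct toy_cq;
typedef void (*toy_comp_handler)(struct toy_cq *cq, void *context);

struct toy_cq {
	toy_comp_handler comp_handler;
	void *cq_context;
	unsigned int arm_sn;
};

/* Slow path: reject a missing handler once, when the CQ is set up. */
static int toy_cq_init(struct toy_cq *cq, toy_comp_handler handler,
		       void *context)
{
	if (handler == NULL)
		return -EINVAL;
	cq->comp_handler = handler;
	cq->cq_context = context;
	cq->arm_sn = 0;
	return 0;
}

/* Fast path: no run-time NULL branch; toy_cq_init() guarantees a handler. */
static void toy_cq_event(struct toy_cq *cq)
{
	assert(cq->comp_handler != NULL);	/* compiled out with NDEBUG */
	++cq->arm_sn;
	cq->comp_handler(cq, cq->cq_context);
}

/* Example handler: counts completions in the int that context points to. */
static void toy_count_events(struct toy_cq *cq, void *context)
{
	(void)cq;
	++*(int *)context;
}
```

This moves the cost of the check out of the per-event path, at the price
of refusing CQs that are only ever polled and never armed.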


From halr at voltaire.com  Mon Apr 11 12:01:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 15:01:00 -0400
Subject: [openib-general] Re: A Couple More CM Queries
In-Reply-To: <425AC644.1060405@ichips.intel.com>
References: <1112971669.4522.147.camel@localhost.localdomain>
	<4256B290.7020704@ichips.intel.com>
	<1113242491.4490.4.camel@localhost.localdomain>
	<425AC644.1060405@ichips.intel.com>
Message-ID: <1113245847.4616.12.camel@localhost.localdomain>

On Mon, 2005-04-11 at 14:47, Sean Hefty wrote:
> Hal Rosenstock wrote:
> >>>2. Should ib_create_cm_id return an error if cm_handler == NULL
> >>>just to make sure ?
> >>
> >>Personally, I don't think it's worth this check for kernel clients, 
> >>unless we want to start checking for NULL parameters everywhere.
> > 
> > Incoming REQs currently use this capability anyhow.
> 
> Incoming REQs use the cm_handler associated with the listen request.

Right, but the CM ID is initially created with the NULL handler. That's
all I was saying...

> > It looks like sending private data in REQ/REP/RTU is supported, but incoming
> > private data isn't handled on the receiving side.
> 
> The private_data is given to the user in the cm_event structure.  Look 
> for work->cm_event.private_data = in cm_format_req_event, 
> cm_format_rep_event, and cm_rtu_handler.  Note that the private_data is 
> only available while in the CM event callback.

Got it. Thanks.

> > Also, in cm_process_send_error(), where the handler is called
> > 
> > cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event);
> > 
> > might that callback request the CM ID destruction ? If so, some
> > code is missing to handle this.
> 
> Yep - this is a bug.  Send errors should probably be handled using the 
> same cm_process_work routine that the receive handling goes through. 
> I'll generate a patch for this, but it'll take me a few days, unless 
> this is urgent.

Nope; not urgent. Just stumbled across it while looking through things.

-- Hal


From mshefty at ichips.intel.com  Mon Apr 11 12:14:02 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 11 Apr 2005 12:14:02 -0700
Subject: [openib-general] Re: A Couple More CM Queries
In-Reply-To: <1113245847.4616.12.camel@localhost.localdomain>
References: <1112971669.4522.147.camel@localhost.localdomain>	
	<4256B290.7020704@ichips.intel.com>	
	<1113242491.4490.4.camel@localhost.localdomain>	
	<425AC644.1060405@ichips.intel.com>
	<1113245847.4616.12.camel@localhost.localdomain>
Message-ID: <425ACC7A.8090904@ichips.intel.com>

Hal Rosenstock wrote:
>>>Also, in cm_process_send_error(), where the handler is called
>>>
>>>cm_id_priv->id.cm_handler(&cm_id_priv->id, &cm_event);
>>>
>>>might that callback request the CM ID destruction ? If so, some
>>>code is missing to handle this.
>>
>>Yep - this is a bug.  Send errors should probably be handled using the 
>>same cm_process_work routine that the receive handling goes through. 
>>I'll generate a patch for this, but it'll take me a few days, unless 
>>this is urgent.
> 
> Nope; not urgent. Just stumbled across it while looking through things.

Okay - I will try to get to this after finishing RMPP debug.

Thinking about this more, send errors were not handled in the same way 
as receive handling, because I wanted to ensure that send errors were 
always reported to the user.  I.e. I didn't want to deal with a failed 
memory allocation.  I'll try to get a fix in next week.

- Sean


From libor at topspin.com  Mon Apr 11 12:00:14 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 11 Apr 2005 12:00:14 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <1112988342.4546.12.camel@localhost.localdomain>;
	from halr@voltaire.com on Fri, Apr 08, 2005 at 03:37:32PM -0400
References: <1112988342.4546.12.camel@localhost.localdomain>
Message-ID: <20050411120014.A6958@topspin.com>

On Fri, Apr 08, 2005 at 03:37:32PM -0400, Hal Rosenstock wrote:
> Hi Libor,
> 
> A couple of questions about SDP performance:
> 
> 1. When running the AIO version of TTCP, there appears to be a bandwidth
> degradation when using buffer sizes from about 5K to 13K. Do you see
> this too ? If so, is there an explanation for this ?

  This would be the result of transitioning from buffered to zcopy mode
at 5K, which is the zcopy threshold. You can change the threshold with
a socket option, which is exposed in ttcp.aio.c using the -z option. I
was not planning on spending time to determine the correct value of the
default threshold until the code stabilized a bit. At this point I'm
investigating what appears to be an RDMA going into an incorrect location.

> 2. Also, is there a program you use to measure SDP latency ?

  I've used netperf in the past which has a roundtrip test to
measure latency using the regular sockets API. I don't have an
AIO latency test handy at the moment, but I could fix one up
and place it in the examples directory...

-Libor


From roland at topspin.com  Mon Apr 11 11:55:07 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 11:55:07 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425AC297.9090706@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Mon, 11 Apr 2005 11:31:51 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com>
Message-ID: <52fyxxywhg.fsf@topspin.com>

    ardavis> 4.6.2 did not help.

Not surprising.

    ardavis> I don't see any indication of mthca_cq_event
    ardavis> firing. Could it be an issue with the user mode mthca
    ardavis> arm_cq mappings?

It's possible, I guess.  You never said before -- does ibv_pingpong
without the "-e" work?  If so then the "update consumer index"
doorbell is working.  So it's kind of a mystery to me why the "arm CQ"
doorbell would not work.

Are other CQs in the kernel generating events?  For example does IPoIB
work for you?

 - R.


From roland at topspin.com  Mon Apr 11 12:07:35 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 12:07:35 -0700
Subject: [openib-general] Re: [PATCH] mthca: Don't call CQ completion
 handler if it doesn't exist
In-Reply-To: <1113245695.4616.8.camel@localhost.localdomain> (Hal
	Rosenstock's message of "11 Apr 2005 14:57:47 -0400")
References: <1113241284.4490.19.camel@localhost.localdomain>
	<52sm1xyz36.fsf@topspin.com>
	<1113245695.4616.8.camel@localhost.localdomain>
Message-ID: <527jj9yvwo.fsf@topspin.com>

    Roland> Why do we want to add this test?  This is adding a
    Roland> conditional branch in what I think is a fast path, and I
    Roland> would consider it a bug in the consumer if it creates a CQ
    Roland> with an invalid completion handler and then requests a
    Roland> completion event for that CQ.  Am I missing something?

    Hal> Then shouldn't this be indicated as an error at create_cq time ?

How can it be?  We don't know if the consumer is going to call
ib_req_notify_cq() or not.

 - R.


From halr at voltaire.com  Mon Apr 11 12:26:32 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 15:26:32 -0400
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <20050411120014.A6958@topspin.com>
References: <1112988342.4546.12.camel@localhost.localdomain>
	<20050411120014.A6958@topspin.com>
Message-ID: <1113247451.4616.33.camel@localhost.localdomain>

On Mon, 2005-04-11 at 15:00, Libor Michalek wrote:
> On Fri, Apr 08, 2005 at 03:37:32PM -0400, Hal Rosenstock wrote:
> > 2. Also, is there a program you use to measure SDP latency ?
> 
>   I've used netperf in the past which has a roundtrip test to
> measure latency using the regular sockets API. I don't have an
> AIO latency test handy at the moment, but I could fix one up
> and place it in the examples directory...

That would be handy when you get a chance. Thanks.

-- Hal


From ardavis at ichips.intel.com  Mon Apr 11 12:32:53 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Mon, 11 Apr 2005 12:32:53 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52fyxxywhg.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	<4255643D.30002@ichips.intel.com>	<1112892615.4877.18.camel@localhost.localdomain>	<425AC297.9090706@ichips.intel.com>
	<52fyxxywhg.fsf@topspin.com>
Message-ID: <425AD0E5.3040805@ichips.intel.com>

Roland Dreier wrote:

>    ardavis> 4.6.2 did not help.
>
>Not surprising.
>
>    ardavis> I don't see any indication of mthca_cq_event
>    ardavis> firing. Could it be an issue with the user mode mthca
>    ardavis> arm_cq mappings?
>
>It's possible, I guess.  You never said before -- does ibv_pingpong
>without the "-e" work?  If so then the "update consumer index"
>doorbell is working.  So it's kind of a mystery to me why the "arm CQ"
>doorbell would not work.
>
>Are other CQs in the kernel generating events?  For example does IPoIB
>work for you?
>
> - R.
>
>  
>
Yes, ibv_pingpong works without -e and IPoIB is generating events and 
working fine.


From halr at voltaire.com  Mon Apr 11 13:23:17 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 16:23:17 -0400
Subject: [openib-general] OpenSM (again)
In-Reply-To: <16986.42404.540439.952094@gargle.gargle.HOWL>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
Message-ID: <1113250997.4616.53.camel@localhost.localdomain>

On Mon, 2005-04-11 at 12:28, Roland Fehrenbacher wrote: 
> Hi,
> 
> I got gen2 opensm running fine now (there was a problem with a wrong
> include file), and managed to get IP running on a network of
> currently 40 machines (final size will be 144). Performance is pretty
> impressive (initial tests with a simple netpipe): I got a latency of
> 18microsec, and a maximum throughput of approx. 400MB/sec at packet
> size approx. 1MB which then levels off at about 340MB/s for larger
> packets.

That's all good to hear :-)

> One problem and two questions:
> 
> Problem: When I reboot all the 40 nodes (apart from the one the opensm
> is running), the network is non-functional (no pings go through, even
> though ports show status "Active") for quite a while (more than 10
> minutes) after all the nodes have come up. It then recovers without
> intervention. Is this normal? Single node reboots don't affect the
> network operation. osm Log file is appended.

Can you describe your topology ? Is it the following: the SM is
connected to a switch/or switches with the 40 nodes connected off these
switches ?

I'll respond to the log (and these questions) in a separate email
response.

> Question 1: Can I run opensm in a master slave configuration?

Yes. Others are doing this.

>  I noticed
> that there is a priority commandline option, but am not sure how to
> apply this.

SM election picks the highest priority, with ties broken by the lowest
GUID. So if you don't care which SM is the master then you don't need to
do anything. If you want a specific order (and it is not in GUID order)
then you need to specify priority.
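
As a rough illustration of that ordering rule (highest priority first,
lowest GUID as the tie-breaker), here is a minimal sketch. The struct and
function names are hypothetical and are not taken from OpenSM; only the
comparison rule comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record for illustration only; not an OpenSM structure. */
struct sm_candidate {
	uint8_t  priority;   /* higher value is preferred */
	uint64_t port_guid;  /* lower value breaks a priority tie */
};

/* Returns nonzero if candidate a should be master over candidate b. */
static inline int sm_beats(const struct sm_candidate *a,
			   const struct sm_candidate *b)
{
	if (a->priority != b->priority)
		return a->priority > b->priority;
	return a->port_guid < b->port_guid;
}
```

With equal priorities the SM on the lower GUID wins, so priority only
needs to be set when the desired master is not the lowest-GUID port.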

> Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
> compute nodes (need fast MPI), and gen2 on the control/storage nodes
> (need only IP) with gen2 opensm running on the control nodes. Is there
> any reason why this should not work reliably?

So basically this appears to be an interop question:
1. Will gen2 OpenSM support IBGD nodes ?
2. Will gen2 IPoIB interoperate with IBGD IPoIB ?
I haven't done this but know of no reasons this should not work. Perhaps
others can add to this.

-- Hal 
________________________________________________________________________


From iod00d at hp.com  Mon Apr 11 13:46:56 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 11 Apr 2005 13:46:56 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <20050411120014.A6958@topspin.com>
References: <1112988342.4546.12.camel@localhost.localdomain>
	<20050411120014.A6958@topspin.com>
Message-ID: <20050411204656.GC13577@esmail.cup.hp.com>

On Mon, Apr 11, 2005 at 12:00:14PM -0700, Libor Michalek wrote:
> > 2. Also, is there a program you use to measure SDP latency ?
> 
>   I've used netperf in the past which has a roundtrip test to
> measure latency using the regular sockets API. I don't have an
> AIO latency test handy at the moment, but I could fix one up
> and place it in the examples directory...

netperf -t TCP_RR is really easy to run. I suggest the *experimental*
2.4.0-rc1 version from www.netperf.org.
2.4.0-rc1 is much more linux friendly than previous versions.
Just do "make config && make install"

You *must* use "-T" option (bind apps to the CPU handling interrupts)
for performance characterization.

I've run this a lot for gige but have not run a full set for ia64/IB.
Output sample for IPoIB on HP/ZX1 ia64 (rx2600) below. I expect SDP
to be a bit better and would like to generate full sets for both
IPoIB and SDP this week.

ISTR netpipe also has latency tests but I've not played with netpipe yet.

grant


# /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   16313.15  5.85   5.98   7.174   7.326 
16384  87380 


From halr at voltaire.com  Mon Apr 11 13:52:26 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Apr 2005 16:52:26 -0400
Subject: [openib-general] OpenSM (again)
In-Reply-To: <1113250997.4616.53.camel@localhost.localdomain>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
	<1113250997.4616.53.camel@localhost.localdomain>
Message-ID: <1113252518.4476.3.camel@localhost.localdomain>

On Mon, 2005-04-11 at 16:23, Hal Rosenstock wrote:
> > Problem: When I reboot all the 40 nodes (apart from the one the opensm
> > is running), the network is non-functional (no pings go through, even
> > though ports show status "Active") for quite a while (more than 10
> > minutes) after all the nodes have come up. It then recovers without
> > intervention. Is this normal? Single node reboots don't affect the
> > network operation. osm Log file is appended.
> 
> Can you describe your topology ? Is it the following: the SM is
> connected to a switch/or switches with the 40 nodes connected off these
> switches ?

What is the mix of those 40 nodes in terms of OpenIB (gen2) and gen1 ?
Is there no difference in the behavior of gen2 and gen1 in terms of the
above symptoms ?

-- Hal


From roland at topspin.com  Mon Apr 11 13:58:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 13:58:03 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425AD0E5.3040805@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Mon, 11 Apr 2005 12:32:53 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com>
Message-ID: <52wtr9xc84.fsf@topspin.com>

Hmm...

Has anyone else tried userspace verbs on a PCI Express HCA running
4.6.x FW?  If so does "ibv_pingpong -e" work for you?

All of the PCI Express HCAs I have handy are mem-free, but CQ events
work for me with both mem-free HCAs and PCI-X HCAs.

 - R.


From robert.j.woodruff at intel.com  Mon Apr 11 15:09:50 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 11 Apr 2005 15:09:50 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <20050411204656.GC13577@esmail.cup.hp.com>
Message-ID: <ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>


Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0
>TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.0.0.30 (10.0.0.30) port 0 AF_INET
>Local /Remote
>Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

>16384  87380  1       1      60.00   16313.15  5.85   5.98   7.174   7.326 
>16384  87380 

Hi Libor,

What type of platform did you run this on ? CPU speed, type of HCA, etc.

Also, have you run netpipe on SDP, it shows BW and latency for various
sizes. 

woody


From robert.j.woodruff at intel.com  Mon Apr 11 15:13:06 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 11 Apr 2005 15:13:06 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>
Message-ID: <ORSMSX408O1LHK3jhs100000004@orsmsx408.amr.corp.intel.com>


woody wrote >What type of platform did you run this on ? CPU speed, type of
HCA, etc.

Sorry I just read the email and saw that it was an IPF box.


woody


From libor at topspin.com  Mon Apr 11 15:12:59 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 11 Apr 2005 15:12:59 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>;
	from robert.j.woodruff@intel.com on Mon, Apr 11, 2005 at
	03:09:50PM -0700
References: <20050411204656.GC13577@esmail.cup.hp.com>
	<ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>
Message-ID: <20050411151259.B6958@topspin.com>

On Mon, Apr 11, 2005 at 03:09:50PM -0700, Bob Woodruff wrote:
> 
> Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0
> >TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.0.0.30 (10.0.0.30) port 0 AF_INET
> >Local /Remote
> >Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> >Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> >bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> 
> >16384  87380  1       1      60.00   16313.15  5.85   5.98   7.174   7.326 
> >16384  87380 
> 
> Hi Libor,
> 
> What type of platform did you run this on ? CPU speed, type of HCA, etc.

  I didn't run this, it was Grant, and those results were for IPoIB if I
remember his email correctly.

> Also, have you run netpipe on SDP, it shows BW and latency for various
> sizes. 

  No, I usually use ttcp or netperf which do pretty much the same thing,
I would be surprised if netpipe showed radically different numbers.

-Libor


From robert.j.woodruff at intel.com  Mon Apr 11 15:34:21 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 11 Apr 2005 15:34:21 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <20050411151259.B6958@topspin.com>
Message-ID: <ORSMSX408FRaqbC8wSA00000005@orsmsx408.amr.corp.intel.com>

 
Libor >  I didn't run this, it was Grant, and those results were for IPoIB
if I
>remember his email correctly.

> Also, have you run netpipe on SDP, it shows BW and latency for various
> sizes. 

>  No, I usually use ttcp or netperf which do pretty much the same thing,
>I would be surprised if netpipe showed radically different numbers.

>-Libor

I know now, after I hit return I actually read the email. Sorry for the
confusion.

If I get a chance, I will try to get SDP running and run netpipe. 
I agree the numbers won't be much different, but it reports data for various
sizes from 1 byte up to a couple of megabytes, so one can see the curve.


woody


From ardavis at ichips.intel.com  Mon Apr 11 16:17:09 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Mon, 11 Apr 2005 16:17:09 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52wtr9xc84.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	<4255643D.30002@ichips.intel.com>	<1112892615.4877.18.camel@localhost.localdomain>	<425AC297.9090706@ichips.intel.com>
	<52fyxxywhg.fsf@topspin.com>	<425AD0E5.3040805@ichips.intel.com>
	<52wtr9xc84.fsf@topspin.com>
Message-ID: <425B0575.5090702@ichips.intel.com>

Roland Dreier wrote:

>Hmm...
>
>Has anyone else tried userspace verbs on a PCI Express HCA running
>4.6.x FW?  If so does "ibv_pingpong -e" work for you?
>
>All of the PCI Express HCAs I have handy are mem-free, but CQ events
>work for me with both mem-free HCAs and PCI-X HCAs.
>
> - R.
>
>  
>
Roland,

I was debugging this problem and when I added some debug prints in 
mthca_tavor_arm_cq (cq.c), just before the mthca_write64() call it 
started working.  I will take a closer look....

-arlin


From roland at topspin.com  Mon Apr 11 16:20:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 16:20:54 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425B0575.5090702@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Mon, 11 Apr 2005 16:17:09 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com>
Message-ID: <52k6n8yk6h.fsf@topspin.com>

    ardavis> I was debugging this problem and when I added some debug
    ardavis> prints in mthca_tavor_arm_cq (cq.c), just before the
    ardavis> mthca_write64() call it started working.  I will take a
    ardavis> closer look....

Ugh, smells like a compiler optimization problem or a timing problem.
I'm still not seeing what could be going wrong, though.

 - R.


From roland at topspin.com  Mon Apr 11 16:37:49 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 16:37:49 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52k6n8yk6h.fsf@topspin.com> (Roland Dreier's message of "Mon,
	11 Apr 2005 16:20:54 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com>
Message-ID: <52fyxwyjea.fsf@topspin.com>

What distribution and compiler version are you running?  I assume
you're running 64-bit userspace on a 64-bit kernel, right?  What
optimization level is libmthca being built with?

 - R.


From ardavis at ichips.intel.com  Mon Apr 11 16:52:22 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Mon, 11 Apr 2005 16:52:22 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52fyxwyjea.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	<4255643D.30002@ichips.intel.com>	<1112892615.4877.18.camel@localhost.localdomain>	<425AC297.9090706@ichips.intel.com>
	<52fyxxywhg.fsf@topspin.com>	<425AD0E5.3040805@ichips.intel.com>
	<52wtr9xc84.fsf@topspin.com>	<425B0575.5090702@ichips.intel.com>
	<52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com>
Message-ID: <425B0DB6.9090002@ichips.intel.com>

Roland Dreier wrote:

>What distribution and compiler version are you running?  I assume
>you're running 64-bit userspace on a 64-bit kernel, right?  What
>optimization level is libmthca being built with?
>
> - R.
>
>  
>
Redhat EL 4.0, 64-bit
gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)
libmthca built with default settings (-O2)


From akpm at osdl.org  Mon Apr 11 17:13:47 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 11 Apr 2005 17:13:47 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52oeclyyw3.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
Message-ID: <20050411171347.7e05859f.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Troy> Do we even need the mlock in userspace then?
> 
> Yes, because the kernel may go through and unmap pages from userspace
> while trying to swap.  Since we have the page locked in the kernel,
> the physical page won't go anywhere, but userspace might end up with a
> different page mapped at the same virtual address.

That shouldn't happen.  If get_user_pages() has elevated the refcount on a
page then the following can happen:

- The VM may decide to add the page to swapcache (if it's not mmapped
  from a file).

- Once the page is backed by either swapcache or a (mmapped) file, the VM
  may decide to unmap the application's pte's.  A later minor fault by the
  app will cause the same physical page to be remapped.

- The VM may decide to try to write the page to its backing file or swap.
   If it does, the page is still in core, but is now clean.

- Once all pte's are unmapped and the page is clean, the VM may decide to
  try to reclaim the page.  The VM will then see the elevated refcount and
  will bail out, leaving the page in core.

- If your code was doing a read-from-disk (modifying memory), then your
  code should run set_page_dirty() or set_page_dirty_lock() against the
  page before dropping the refcount which get_user_pages() added.  Once the
  page is dirty, the VM can't reclaim it until it has been written to
  swap or mmapped backing file.

IOW: while the page has an elevated refcount from get_user_pages(), that
physical page is 100% pinned.  Once you've done the
set_page_dirty+put_page(), the page is again under control of the VM.

There should be no need to run mlock() from userspace.


From roland at topspin.com  Mon Apr 11 17:21:04 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 17:21:04 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 11 Apr 2005 17:13:47 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
Message-ID: <521x9gyhe7.fsf@topspin.com>

    Roland> Yes, because the kernel may go through and unmap pages
    Roland> from userspace while trying to swap.  Since we have the
    Roland> page locked in the kernel, the physical page won't go
    Roland> anywhere, but userspace might end up with a different page
    Roland> mapped at the same virtual address.

    Andrew> That shouldn't happen.  If get_user_pages() has elevated
    Andrew> the refcount on a page then the following can happen:

    ...

    Andrew> IOW: while the page has an elevated refcount from
    Andrew> get_user_pages(), that physical page is 100% pinned.
    Andrew> Once you've done the set_page_dirty+put_page(), the page
    Andrew> is again under control of the VM.

Hmm... I've never tested it first hand but Libor assures me there is
something like what I said.  Libor, did I get the explanation right?

 - R.


From roland at topspin.com  Mon Apr 11 17:10:51 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 17:10:51 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425B0DB6.9090002@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Mon, 11 Apr 2005 16:52:22 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com>
	<52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com>
Message-ID: <527jj8yhv8.fsf@topspin.com>

    ardavis> Redhat EL 4.0, 64-bit

OK, I found a system with that distro installed, although I can't test
the results of the build.  However, I built libmthca with the same
CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe".  I found
that mthca_tavor_arm_cq() compiles to the following tiny fragment:

0000000000001d10 <mthca_tavor_arm_cq>:
    1d10:       48 8b 07                mov    (%rdi),%rax
    1d13:       48 8b 90 a8 ef ff ff    mov    0xffffffffffffefa8(%rax),%rdx
    1d1a:       48 8b 44 24 f8          mov    0xfffffffffffffff8(%rsp),%rax
    1d1f:       48 89 42 20             mov    %rax,0x20(%rdx)
    1d23:       31 c0                   xor    %eax,%eax
    1d25:       c3                      retq

in other words, the compiler seems to be discarding all the
assignments to doorbell[0] and doorbell[1].  I'm not sure if this is a
compiler bug or what -- I need to investigate further.  In any case
can you try the following patch to libmthca and see if it fixes
things:

Index: src/cq.c
===================================================================
--- src/cq.c	(revision 2156)
+++ src/cq.c	(working copy)
@@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq
 			    to_mcq(cq)->cqn);
 	doorbell[1] = 0xffffffff;
 
+	mb();
+
 	mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL);
 
 	return 0;


From ardavis at ichips.intel.com  Mon Apr 11 18:08:10 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Mon, 11 Apr 2005 18:08:10 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <527jj8yhv8.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	<4255643D.30002@ichips.intel.com>	<1112892615.4877.18.camel@localhost.localdomain>	<425AC297.9090706@ichips.intel.com>
	<52fyxxywhg.fsf@topspin.com>	<425AD0E5.3040805@ichips.intel.com>
	<52wtr9xc84.fsf@topspin.com>	<425B0575.5090702@ichips.intel.com>
	<52k6n8yk6h.fsf@topspin.com>	<52fyxwyjea.fsf@topspin.com>
	<425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com>
Message-ID: <425B1F7A.9070100@ichips.intel.com>

Roland Dreier wrote:

>    ardavis> Redhat EL 4.0, 64-bit
>
>OK, I found a system with that distro installed, although I can't test
>the results of the build.  However, I built libmthca with the same
>CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe".  I found
>that mthca_tavor_arm_cq() compiles to the following tiny fragment:
>
>0000000000001d10 <mthca_tavor_arm_cq>:
>    1d10:       48 8b 07                mov    (%rdi),%rax
>    1d13:       48 8b 90 a8 ef ff ff    mov    0xffffffffffffefa8(%rax),%rdx
>    1d1a:       48 8b 44 24 f8          mov    0xfffffffffffffff8(%rsp),%rax
>    1d1f:       48 89 42 20             mov    %rax,0x20(%rdx)
>    1d23:       31 c0                   xor    %eax,%eax
>    1d25:       c3                      retq
>
>in other words, the compiler seems to be discarding all the
>assignments to doorbell[0] and doorbell[1].  I'm not sure if this is a
>compiler bug or what -- I need to investigate further.  In any case
>can you try the following patch to libmthca and see if it fixes
>things:
>
>Index: src/cq.c
>===================================================================
>--- src/cq.c	(revision 2156)
>+++ src/cq.c	(working copy)
>@@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq
> 			    to_mcq(cq)->cqn);
> 	doorbell[1] = 0xffffffff;
> 
>+	mb();
>+
> 	mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL);
> 
> 	return 0;
>
>  
>
Yes, this fixes my problem. Thanks!


From roland at topspin.com  Mon Apr 11 19:51:24 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 19:51:24 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <425B0DB6.9090002@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Mon, 11 Apr 2005 16:52:22 -0700")
References: <424C3722.9070402@ichips.intel.com> <52oeczoghb.fsf@topspin.com>
	<425452AF.6010207@ichips.intel.com> <527jjf8s8t.fsf@topspin.com>
	<4255643D.30002@ichips.intel.com>
	<1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com>
	<52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com>
Message-ID: <52fyxwwvv7.fsf@topspin.com>

OK, I think I understand the problem.  The old code violates the
assumptions that gcc makes with -fstrict-aliasing (which is one of the
optimizations turned on by -O2).  Can you back out the patch to cq.c I
sent and try this patch instead?

Thanks,
  Roland

Index: src/doorbell.h
===================================================================
--- src/doorbell.h	(revision 2156)
+++ src/doorbell.h	(working copy)
@@ -69,14 +69,22 @@ static inline void mthca_write_db_rec(ui
 
 #elif SIZEOF_LONG == 8
 
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
+#elif __BYTE_ORDER == __BIG_ENDIAN
+#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
+#else
+#  error __BYTE_ORDER not defined
+#endif
+
 static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
 {
-	*(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val;
+	*(volatile uint64_t *) (ctx->uar + offset) = MTHCA_PAIR_TO_64(val);
 }
 
 static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
 {
-	*(volatile uint64_t *) db = *(uint64_t *) val;
+	*(volatile uint64_t *) db = MTHCA_PAIR_TO_64(val);
 }
 
 #else
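
The shift-and-OR idea in the patch above can be sketched in isolation as
follows. This is only an illustration: the helper names are hypothetical
(the patch itself uses the MTHCA_PAIR_TO_64 macro selected by
__BYTE_ORDER), and the memcpy variant is a commonly used alternative that
is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build the 64-bit value from two 32-bit words with integer arithmetic,
 * so no uint32_t object is ever read through a uint64_t pointer.
 * Little-endian layout: val[0] is the low half.
 */
static inline uint64_t pair_to_64_le(const uint32_t val[2])
{
	return (uint64_t) val[1] << 32 | val[0];
}

/* Big-endian layout: val[0] is the high half. */
static inline uint64_t pair_to_64_be(const uint32_t val[2])
{
	return (uint64_t) val[0] << 32 | val[1];
}

/*
 * Alternative not used in the patch: copying the bytes with memcpy is
 * also well-defined and yields whichever layout the host uses.
 */
static inline uint64_t pair_to_64_memcpy(const uint32_t val[2])
{
	uint64_t out;

	memcpy(&out, val, sizeof out);
	return out;
}
```

The old "*(uint64_t *) val" form reads uint32_t storage through an
incompatible pointer type, which is what -fstrict-aliasing lets gcc
assume never happens.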


From iod00d at hp.com  Mon Apr 11 19:58:35 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 11 Apr 2005 19:58:35 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <527jj8yhv8.fsf@topspin.com>
References: <1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com>
	<52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com>
	<425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com>
Message-ID: <20050412025835.GE13577@esmail.cup.hp.com>

On Mon, Apr 11, 2005 at 05:10:51PM -0700, Roland Dreier wrote:
>     ardavis> Redhat EL 4.0, 64-bit
> 
> OK, I found a system with that distro installed, although I can't test
> the results of the build.  However, I built libmthca with the same
> CFLAGS that rpm seems to use, namely "-g -O2 -m64 -pipe".  I found
> that mthca_tavor_arm_cq() compiles to the following tiny fragment:
> 
> 0000000000001d10 <mthca_tavor_arm_cq>:
>     1d10:       48 8b 07                mov    (%rdi),%rax
>     1d13:       48 8b 90 a8 ef ff ff    mov    0xffffffffffffefa8(%rax),%rdx
>     1d1a:       48 8b 44 24 f8          mov    0xfffffffffffffff8(%rsp),%rax
>     1d1f:       48 89 42 20             mov    %rax,0x20(%rdx)
>     1d23:       31 c0                   xor    %eax,%eax
>     1d25:       c3                      retq
> 
> in other words, the compiler seems to be discarding all the
> assignments to doorbell[0] and doorbell[1].

doorbell[] is a local variable and mthca_write64() is static inline.
I don't see a problem with the assignments to doorbell getting
optimized out since the scope of that variable is completely
visible to gcc. A smart compiler would just use registers and
reduce the 32-bit stores.

I see a problem with "(notify == IB_CQ_SOLICITED ? ....)" code getting
optimized away. "notifier" is a passed-in parameter (not a constant) and
the function is only invoked as an indirect function call. I don't see
how gcc could know what value notifier will have and optimize the test away.

Hrm...maybe the bug is "notifier" is somehow overloaded to a constant.
You'd have to look at the intermediate "-E" (preprocessed) output.


> I'm not sure if this is a compiler bug or what -- I need to
> investigate further.  In any case
> can you try the following patch to libmthca and see if it fixes
> things:
> 
> Index: src/cq.c
> ===================================================================
> --- src/cq.c	(revision 2156)
> +++ src/cq.c	(working copy)
> @@ -441,6 +441,8 @@ int mthca_tavor_arm_cq(struct ibv_cq *cq
>  			    to_mcq(cq)->cqn);
>  	doorbell[1] = 0xffffffff;
>  
> +	mb();
> +
>  	mthca_write64(doorbell, to_mctx(cq->context), MTHCA_CQ_DOORBELL);

I don't get how this fixes the problem.
mthca_write64() uses a spinlock and I thought that has to enforce
some sort of memory/instruction ordering already. I'm sketchy on
details and can't look it up right now.

hth,
grant


From iod00d at hp.com  Mon Apr 11 20:08:01 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 11 Apr 2005 20:08:01 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>
References: <20050411204656.GC13577@esmail.cup.hp.com>
	<ORSMSX4085nVEyi1wnf00000003@orsmsx408.amr.corp.intel.com>
Message-ID: <20050412030801.GG13577@esmail.cup.hp.com>

On Mon, Apr 11, 2005 at 03:09:50PM -0700, Bob Woodruff wrote:
> 
> Libor ># /usr/local/bin/netperf -c -C -l 60 -H 10.0.0.30 -t TCP_RR -T 0,0
> >TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.0.0.30 (10.0.0.30) port 0 AF_INET
> >Local /Remote
> >Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> >Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> >bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> 
> >16384  87380  1       1      60.00   16313.15  5.85   5.98   7.174   7.326 
> >16384  87380 
> 
> Hi Libor,

s/Libor/Grant/

> What type of platform did you run this on ? CPU speed, type of HCA, etc.
>
> Also, have you run netpipe on SDP, it shows BW and latency for various
> sizes. 

All of that (except CPU speed: 1.5GHz Madison) was answered in the
original email.  In case you didn't save it, I bounced you another copy.

grant


From roland at topspin.com  Mon Apr 11 20:14:12 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 11 Apr 2005 20:14:12 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <20050412025835.GE13577@esmail.cup.hp.com> (Grant Grundler's
	message of "Mon, 11 Apr 2005 19:58:35 -0700")
References: <1112892615.4877.18.camel@localhost.localdomain>
	<425AC297.9090706@ichips.intel.com> <52fyxxywhg.fsf@topspin.com>
	<425AD0E5.3040805@ichips.intel.com> <52wtr9xc84.fsf@topspin.com>
	<425B0575.5090702@ichips.intel.com> <52k6n8yk6h.fsf@topspin.com>
	<52fyxwyjea.fsf@topspin.com> <425B0DB6.9090002@ichips.intel.com>
	<527jj8yhv8.fsf@topspin.com>
	<20050412025835.GE13577@esmail.cup.hp.com>
Message-ID: <527jj8wut7.fsf@topspin.com>

    Grant> doorbell[] is a local variable and mthca_write64() is
    Grant> static inline.  I don't see a problem with the assignments
    Grant> to doorbell getting optimized out since the scope of that
    Grant> variable is completely visible to gcc. A smart compiler
    Grant> would just use registers and reduce the 32-bit stores.

Actually, what happens is that the compiler sees that we write to doorbell[]
as a uint32_t but then read from it by dereferencing a uint64_t*.  -O2
turns on -fstrict-aliasing, which allows the compiler to assume that
pointers of different types never alias each other.  So gcc says, hey,
all you do is write to that local doorbell[] variable and never do
anything with the values you write, so I'll just throw away that dead
code.  So gcc ends up only generating code for the store in
mthca_write64() without any code to initialize doorbell[].
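
To make the aliasing point concrete, here is a minimal sketch (not the
libmthca code itself) of two ways to build the 64-bit value without
reading uint32_t storage through a uint64_t pointer.  The function names
are made up for illustration.  Note that the shift version always puts
val[0] in the upper half, which matches the in-memory layout only on
big-endian hosts, so real code would have to pick the word order per
byte order.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: combine the two words arithmetically.  Only
 * uint32_t values are read, so strict aliasing is not an issue and the
 * compiler cannot treat the stores to val[] as dead.
 */
static inline uint64_t pair_to_64_shift(const uint32_t val[2])
{
	return ((uint64_t) val[0] << 32) | val[1];
}

/*
 * Hypothetical helper: copy the bytes instead.  memcpy() is allowed to
 * access any object, so this keeps the exact memory layout of val[]
 * whatever the host byte order is.
 */
static inline uint64_t pair_to_64_copy(const uint32_t val[2])
{
	uint64_t out;

	memcpy(&out, val, sizeof out);
	return out;
}
```

Building with -fno-strict-aliasing would also hide the problem, but
fixing the access itself keeps the code correct whatever CFLAGS a
distro's rpm macros pass in.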

 - R.


From eitan at mellanox.co.il  Mon Apr 11 22:07:11 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 12 Apr 2005 08:07:11 +0300
Subject: [openib-general] OpenSM (again)
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0DE@mtlex01.yok.mtl.com>

Hi Roland,

If the case is reproducible, please run "opensm -V" and send us the osm.log

Thanks

Eitan Zahavi

> -----Original Message-----
> From: Roland Fehrenbacher [mailto:rf at q-leap.de]
> Sent: Monday, April 11, 2005 7:28 PM
> To: openib-general at openib.org
> Subject: [openib-general] OpenSM (again)
> 
> Hi,
> 
> I got gen2 opensm running fine now (there was a problem with a wrong
> include file), and managed to get IP running on a network of
> currently 40 machines (final size will be 144). Performance is pretty
> impressive (initial tests with a simple netpipe): I got a latency of
> 18 microsec, and a maximum throughput of approx. 400MB/sec at packet
> size approx. 1MB which then levels off at about 340MB/s for larger
> packets.
> 
> One problem and two questions:
> 
> Problem: When I reboot all the 40 nodes (apart from the one the opensm
> is running on), the network is non-functional (no pings go through, even
> though ports show status "Active") for quite a while (more than 10
> minutes) after all the nodes have come up. It then recovers without
> intervention. Is this normal? Single node reboots don't affect the
> network operation. osm Log file is appended.
> 
> Question 1: Can I run opensm in a master slave configuration? I noticed
> that there is a priority commandline option, but am not sure how to
> apply this.
> 
> Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
> compute nodes (need fast MPI), and gen2 on the control/storage nodes
> (need only IP) with gen2 opensm running on the control nodes. Is there
> any reason why this should not work reliably?
> 
> Roland


From iod00d at hp.com  Mon Apr 11 22:34:12 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 11 Apr 2005 22:34:12 -0700
Subject: [openib-general] Re: SDP Performance
In-Reply-To: <ORSMSX408FRaqbC8wSA00000005@orsmsx408.amr.corp.intel.com>
References: <20050411151259.B6958@topspin.com>
	<ORSMSX408FRaqbC8wSA00000005@orsmsx408.amr.corp.intel.com>
Message-ID: <20050412053412.GH13577@esmail.cup.hp.com>

On Mon, Apr 11, 2005 at 03:34:21PM -0700, Bob Woodruff wrote:
> If I get a chance, I will try to get SDP running and run netpipe. 
> I agree the numbers won't be much different, but it reports data for various
> sizes from 1 byte up to a couple of megabytes, so one can see the curve.

netperf includes a shell script to do the same thing: tcp_rr_script

grant


From tziporet at mellanox.co.il  Mon Apr 11 22:36:52 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 12 Apr 2005 08:36:52 +0300
Subject: [openib-general] OpenSM (again)
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF347@mtlex01.yok.mtl.com>

Regarding question 2:
> Question 2: I plan to run the gen1/Mellanox IBGD drivers on the
> compute nodes (need fast MPI), and gen2 on the control/storage nodes
> (need only IP) with gen2 opensm running on the control nodes. Is there
> any reason why this should not work reliably?

We tried it at Mellanox once and it did work properly (we used OpenSM from
gen1 and IPoIB from gen1 & gen2 on 2 different machines). So although it's
not QAed, I see no reason why it would not work for you.

Tziporet


From tziporet at mellanox.co.il  Mon Apr 11 23:28:14 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 12 Apr 2005 09:28:14 +0300
Subject: [openib-general] Re: uverbs events
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com>

Very important - there is a bug in gcc version 3.4.2 that was fixed in
gcc 3.4.3.
This bug (#17581) hurt us in VAPI when full optimization is enabled in
bits or on 64-bit systems.

So I suggest that you replace gcc with gcc 3.4.3.

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:roland at topspin.com]
Sent: Tuesday, April 12, 2005 6:14 AM
To: Grant Grundler
Cc: openib-general at openib.org
Subject: Re: [openib-general] Re: uverbs events


    Grant> doorbell[] is a local variable and mthca_write64() is
    Grant> static inline.  I don't see a problem with the assignments
    Grant> to doorbell getting optimized out since the scope of that
    Grant> variable is completely visible to gcc. A smart compiler
    Grant> would just use registers and reduce the 32-bit stores.

Actually, what happens is that the compiler sees that we write to doorbell[]
as a uint32_t but then read from it by dereferencing a uint64_t*.  -O2
turns on -fstrict-aliasing, which allows the compiler to assume that
pointers of different types never alias each other.  So gcc says, hey,
all you do is write to that local doorbell[] variable and never do
anything with the values you write, so I'll just throw away that dead
code.  So gcc ends up only generating code for the store in
mthca_write64() without any code to initialize doorbell[].

 - R.

From halr at voltaire.com  Tue Apr 12 02:56:29 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2005 05:56:29 -0400
Subject: [openib-general] OpenSM (again)
In-Reply-To: <16986.42404.540439.952094@gargle.gargle.HOWL>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
Message-ID: <1113299742.4476.40.camel@localhost.localdomain>

On Mon, 2005-04-11 at 12:28, Roland Fehrenbacher wrote: 
> Problem: When I reboot all the 40 nodes (apart from the one the opensm
> is running on), the network is non-functional (no pings go through, even
> though ports show status "Active") for quite a while (more than 10
> minutes) after all the nodes have come up. It then recovers without
> intervention. Is this normal? Single node reboots don't affect the
> network operation. osm Log file is appended.
> 
> ______________________________________________________________________
> Apr 10 15:05:55 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 10 15:05:55 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:05:55 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:05:55 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
> Apr 10 15:05:55 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
> Apr 10 15:05:55 [4000] -> osm_vendor_bind: Unable to register class 129 version 1.
> Apr 10 15:05:55 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
> Apr 10 15:05:55 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
> Apr 10 15:06:58 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 10 15:06:58 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:06:58 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:06:58 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
> Apr 10 15:06:58 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
> Apr 10 15:06:58 [4000] -> osm_vendor_bind: Unable to register class 129 version 1.
> Apr 10 15:06:58 [4000] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
> Apr 10 15:06:58 [4000] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
> Apr 10 15:07:44 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 10 15:07:44 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:07:44 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 10 15:07:44 [4000] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004013c1) as the default port.
> Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
> Apr 10 15:07:44 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c1.
> Apr 10 15:07:44 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0011 TID:0x000000000000000a
> Apr 10 15:07:44 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.

This is a SubnGet of NodeInfo which is timing out.

> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
> 
This is a SubnGet of PkeyTable which is timing out.
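
For anyone else reading these umad_receiver lines, the decoding can be
sketched as a pair of lookup functions.  This assumes the log prints
method and attribute ID in hex without a 0x prefix (which is what the
readings above imply); the function names are hypothetical and only a
few values from the IB spec's subnet management class are listed.

```c
#include <stdint.h>

/* Hypothetical helper: name a subnet management method code. */
static const char *smp_method_name(uint8_t method)
{
	switch (method) {
	case 0x01: return "SubnGet";
	case 0x02: return "SubnSet";
	case 0x81: return "SubnGetResp";
	default:   return "unknown";
	}
}

/* Hypothetical helper: name a subnet management attribute ID. */
static const char *smp_attr_name(uint16_t attr_id)
{
	switch (attr_id) {
	case 0x0011: return "NodeInfo";
	case 0x0015: return "PortInfo";
	case 0x0016: return "P_KeyTable";
	default:     return "unknown";
	}
}
```

So "method=1 attr=11" is smp_method_name(0x01) with smp_attr_name(0x11),
and "method=1 attr=16" is the same method with smp_attr_name(0x16).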

> Apr 10 15:07:45 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.

> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.
> Apr 10 15:07:46 [2400A] -> __osm_sa_mad_ctrl_rcv_callback: Received an SA mad while SM in first sweep. Mad ignored.

These are SA MADs being received when the SM is not yet ready to handle
them. They could be SA sets of MCMemberRecord (from IPoIB). SA clients
in end nodes should retry them (assuming they have not exhausted their
timeout/retry strategy).

For debug purposes, it might be nice to display the method and attribute
of the SA MAD.
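
A rough sketch of what such a message could carry, assuming a small
formatting helper whose result is handed to the existing log call.  The
names below are hypothetical, not existing OpenSM functions, and only a
handful of SA method and attribute codes are filled in.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: name an SA method code. */
static const char *sa_method_name(uint8_t method)
{
	switch (method) {
	case 0x01: return "Get";
	case 0x02: return "Set";
	case 0x12: return "GetTable";
	case 0x15: return "Delete";
	default:   return "?";
	}
}

/* Hypothetical helper: name an SA attribute ID. */
static const char *sa_attr_name(uint16_t attr_id)
{
	switch (attr_id) {
	case 0x0035: return "PathRecord";
	case 0x0038: return "MCMemberRecord";
	default:     return "?";
	}
}

/* Format method and attribute into buf; returns what snprintf() returns. */
static int format_sa_mad_id(char *buf, size_t len,
			    uint8_t method, uint16_t attr_id)
{
	return snprintf(buf, len, "method=0x%02x (%s) attr=0x%04x (%s)",
			(unsigned) method, sa_method_name(method),
			(unsigned) attr_id, sa_attr_name(attr_id));
}
```

With something like this, an IPoIB multicast join dropped during the
first sweep would show up as a Set of MCMemberRecord rather than an
anonymous "SA mad".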

> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:46 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=11) -- dropping.
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
> Apr 10 15:07:46 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
> Apr 10 15:07:47 [2400A] -> umad_receiver: send completed with error(method=1 attr=16) -- dropping.
> Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:07:47 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET.
> Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:08:16 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 10 15:24:26 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.

In the most recent OpenSM (gen1), this has been changed from error to warning. (That doesn't explain the delay in connectivity).

> Apr 11 08:32:17 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004c
> Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900
> Apr 11 08:32:17 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1
> Apr 11 08:32:17 [18007] -> Discovered new port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies 
> Apr 11 08:32:17 [18007] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_IDLE_TIME_PROCESS_REQUEST(9) in state OSM_SM_STATE_PROCESS_REQUEST_WAIT.
> Apr 11 08:35:27 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0028 TID:0x000000000000004d
> Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0028 GID:0xfe80000000000000,0x0002c9010befe900
> Apr 11 08:35:27 [18007] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0011 GID:0xfe80000000000000,0x0002c902004013c1
> Apr 11 08:35:27 [18007] -> Removed port with GUID:0x0002c902004012e9 LID range [0x3D,0x3D] of node:MT23108 InfiniHost Mellanox Technologies 
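
For reading the log above: the "num:" field of each Generic Notice is the trap number. Below is a minimal C sketch of a lookup for the trap numbers that show up in this thread. The descriptions are my paraphrase of the InfiniBand specification's trap list, and the function name trap_name is hypothetical (it is not an OpenSM function).

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenSM): map the trap number shown
 * in the "num:" field of a Generic Notice log line to a short
 * description. Only the numbers seen in this thread, plus 67, are
 * covered; anything else returns NULL. */
const char *trap_name(unsigned int num)
{
    switch (num) {
    case 64:  return "GID in service";
    case 65:  return "GID out of service";
    case 66:  return "multicast group created";
    case 67:  return "multicast group deleted";
    case 128: return "switch port link state change";
    case 144: return "capability mask or other local change";
    default:  return NULL; /* not covered by this sketch */
    }
}
```

In the log, num:64 arrives together with "Discovered new port" and num:65 together with "Removed port", which fits this reading.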

At what point did it start working again? Was it at 15:24? (That
appears to be a 16-17 minute delay in connectivity.)

-- Hal


From SWOODING at qinetiq.com  Tue Apr 12 04:18:44 2005
From: SWOODING at qinetiq.com (Wooding Steve)
Date: Tue, 12 Apr 2005 12:18:44 +0100
Subject: [openib-general] AIO SDP and ttcp.aio: Event errors
Message-ID: <C5A085BEA630D811A6730008C7841FE60B92A7@pdn-mail-1.dera.gov.uk>

Hi,

I have been putting ttcp.aio through its paces and have a few questions.

1. When -l is larger than 131072 I get an Event error <-22> on the transmit
side and no data is transferred. Changing the values of -n and -a does not
make any difference.

2. When using a value of 1 for -a (so I suppose this is non-aio), I get an
Event error of <-32> on the transmit side and an <-104> on the receiver end.
Only some of the data is transferred.

3. For future reference, where can I find out what these Event error codes
mean, to give me a clue about what's going wrong?

4. I sometimes see significant differences in the transfer speed reported on
the transmit and receiver ends. Is one more right than the other?
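
On question 3, one likely reading (an assumption, not something confirmed in this thread) is that the Event error values are negated Linux errno codes. On Linux, 22 is EINVAL, 32 is EPIPE and 104 is ECONNRESET. Here is a small C sketch that turns such a value into text with the standard strerror(). The helper name event_error_text is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper: treat a negative "Event error" value as a
 * negated errno code (an assumption) and return the C library's
 * description of it. Non-negative values are reported as "no error". */
const char *event_error_text(int event_error)
{
    if (event_error >= 0)
        return "no error";
    if (event_error == INT_MIN)   /* -INT_MIN would overflow */
        return "out of range";
    return strerror(-event_error);
}
```

With glibc, event_error_text(-22) gives "Invalid argument", event_error_text(-32) gives "Broken pipe", and event_error_text(-104) gives "Connection reset by peer".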

My system details are:
Two nodes with Dual Xeon 64-bit processors
HCA: MT25208 (in MT23108 compat mode) with 128MB of ram
OS: RHEL 4 (64-bit)
Gen2 stack version: trunk of 2113 (subversion revision number)

Thanks,

Steve.


From roland at topspin.com  Tue Apr 12 08:38:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 08:38:36 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com>
	(Tziporet Koren's message of "Tue, 12 Apr 2005 09:28:14 +0300")
References: <506C3D7B14CDD411A52C00025558DED6064BF34C@mtlex01.yok.mtl.com>
Message-ID: <52r7hguhs3.fsf@topspin.com>

    Tziporet> Very important - there is a bug in gcc version 3.4.2
    Tziporet> that had been fixed in gcc 3.4.3.  This bug (# 17581)
    Tziporet> hurt us in VAPI when full optimizations is working in
    Tziporet> bits or on 64 bits systems.

Thanks, but if the bug you're talking about is

    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17581

then I don't think that's going to affect us -- we don't seem to do
any 64-bit arithmetic inside a switch statement.

 - R.


From ardavis at ichips.intel.com  Tue Apr 12 09:07:30 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Tue, 12 Apr 2005 09:07:30 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <52fyxwwvv7.fsf@topspin.com>
References: <424C3722.9070402@ichips.intel.com>
	<52oeczoghb.fsf@topspin.com>	<425452AF.6010207@ichips.intel.com>
	<527jjf8s8t.fsf@topspin.com>	<4255643D.30002@ichips.intel.com>	<1112892615.4877.18.camel@localhost.localdomain>	<425AC297.9090706@ichips.intel.com>
	<52fyxxywhg.fsf@topspin.com>	<425AD0E5.3040805@ichips.intel.com>
	<52wtr9xc84.fsf@topspin.com>	<425B0575.5090702@ichips.intel.com>
	<52k6n8yk6h.fsf@topspin.com>	<52fyxwyjea.fsf@topspin.com>
	<425B0DB6.9090002@ichips.intel.com> <52fyxwwvv7.fsf@topspin.com>
Message-ID: <425BF242.5070003@ichips.intel.com>

Roland Dreier wrote:

>OK, I think I understand the problem.  The old code violates the
>assumptions that gcc makes with -fstrict-aliasing (which is one of the
>optimizations turned on by -O2).  Can you back out the patch to cq.c I
>sent and try this patch instead?
>
>Thanks,
>  Roland
>
>Index: src/doorbell.h
>===================================================================
>--- src/doorbell.h	(revision 2156)
>+++ src/doorbell.h	(working copy)
>@@ -69,14 +69,22 @@ static inline void mthca_write_db_rec(ui
> 
> #elif SIZEOF_LONG == 8
> 
>+#if __BYTE_ORDER == __LITTLE_ENDIAN
>+#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
>+#elif __BYTE_ORDER == __BIG_ENDIAN
>+#  define MTHCA_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
>+#else
>+#  error __BYTE_ORDER not defined
>+#endif
>+
> static inline void mthca_write64(uint32_t val[2], struct mthca_context *ctx, int offset)
> {
>-	*(volatile uint64_t *) (ctx->uar + offset) = *(uint64_t *) val;
>+	*(volatile uint64_t *) (ctx->uar + offset) = MTHCA_PAIR_TO_64(val);
> }
> 
> static inline void mthca_write_db_rec(uint32_t val[2], uint32_t *db)
> {
>-	*(volatile uint64_t *) db = *(uint64_t *) val;
>+	*(volatile uint64_t *) db = MTHCA_PAIR_TO_64(val);
> }
> 
> #else
>
>  
>
Done. Works fine.


From rf at q-leap.de  Tue Apr 12 09:46:59 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Tue, 12 Apr 2005 18:46:59 +0200
Subject: [openib-general] OpenSM (again)
In-Reply-To: <1113250997.4616.53.camel@localhost.localdomain>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
	<1113250997.4616.53.camel@localhost.localdomain>
Message-ID: <16987.64387.656148.774577@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    >> Problem: When I reboot all the 40 nodes (apart from the one the
    >> opensm is running), the network is non-functional (no pings go
    >> through, even though ports show status "Active") for quite a
    >> while (more than 10 minutes) after all the nodes have come
    >> up. It then recovers without intervention. Is this normal?
    >> Single node reboots don't affect the network operation. osm Log
    >> file is appended.

    Hal> Can you describe your topology ? Is it the following: the SM
    Hal> is connected to a switch/or switches with the 40 nodes
    Hal> connected off these switches ?

Yes, the 40 nodes are connected to a single 144 port switch.

    Hal> I'll respond to the log (and these questions) in a separate
    Hal> email response.

    >> Question 1: Can I run opensm in a master slave configuration?

    Hal> Yes. Others are doing this.

    >> I noticed that there is a priority commandline option, but am
    >> not sure how to apply this.

    Hal> SM election occurs per high priority low GUID. So if you
    Hal> don't care which SM is the master then you don't need to do
    Hal> anything. If you want a specific order (and it is not in GUID
    Hal> order) then you need to specify priority.
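
The "high priority, low GUID" rule Hal describes can be sketched as a two-way comparison. This is only an illustration under stated assumptions: the struct and function names are hypothetical, not OpenSM code, and the 0..15 range reflects the 4-bit SMInfo priority field.

```c
#include <stdint.h>

/* Hypothetical type for this sketch; not an OpenSM structure. */
struct sm_candidate {
    uint8_t  priority;  /* 0..15, higher value wins */
    uint64_t guid;      /* port GUID, lower value wins on a tie */
};

/* Return the candidate that should become master: the one with the
 * higher priority, or, when priorities are equal, the lower GUID. */
const struct sm_candidate *sm_elect(const struct sm_candidate *a,
                                    const struct sm_candidate *b)
{
    if (a->priority != b->priority)
        return a->priority > b->priority ? a : b;
    return a->guid <= b->guid ? a : b;
}
```

With one SM at priority 0 and another at priority 15, the priority 15 one is chosen regardless of GUID. The GUID only matters when both run with the same priority.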

Ok. I tried this, specifying priority 0 on one server, and priority 15
on another one. I assume priority 15 will be the master.
If I first start the priority 0 opensm, and then the priority 15 one,
things look normal: Log excerpts

priority 0 server

Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a

priority 15 server

Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.

When I kill the priority 15 server, however, the priority 0 server runs
amok with continuous log messages like:

Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.

I assume that the handover to the priority 0 opensm hasn't worked
then. For additional information: This test was done on a
point-to-point connection between 2 adapters.

Roland


From rf at q-leap.de  Tue Apr 12 09:47:51 2005
From: rf at q-leap.de (Roland Fehrenbacher)
Date: Tue, 12 Apr 2005 18:47:51 +0200
Subject: [openib-general] OpenSM (again)
In-Reply-To: <1113252518.4476.3.camel@localhost.localdomain>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
	<1113250997.4616.53.camel@localhost.localdomain>
	<1113252518.4476.3.camel@localhost.localdomain>
Message-ID: <16987.64439.352837.517744@gargle.gargle.HOWL>

>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    Hal> On Mon, 2005-04-11 at 16:23, Hal Rosenstock wrote:
    >> > Problem: When I reboot all the 40 nodes (apart from the one
    >> > the opensm is running), the network is non-functional (no
    >> > pings go through, even though ports show status "Active") for
    >> > quite a while (more than 10 minutes) after all the nodes have
    >> > come up. It then recovers without intervention. Is this
    >> > normal? Single node reboots don't affect the network
    >> > operation. osm Log file is appended.
    >> 
    >> Can you describe your topology ? Is it the following: the SM is
    >> connected to a switch/or switches with the 40 nodes connected
    >> off these switches ?

    Hal> What is the mix of those 40 nodes in terms of OpenIB (gen2)
    Hal> and gen1 ?  Is there no difference in the behavior of gen2
    Hal> and gen1 in terms of the above symptoms ?

So far all nodes are gen2.

Roland


From halr at voltaire.com  Tue Apr 12 10:00:17 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2005 13:00:17 -0400
Subject: [openib-general] OpenSM (again)
In-Reply-To: <16987.64387.656148.774577@gargle.gargle.HOWL>
References: <16986.42404.540439.952094@gargle.gargle.HOWL>
	<1113250997.4616.53.camel@localhost.localdomain>
	<16987.64387.656148.774577@gargle.gargle.HOWL>
Message-ID: <1113325216.4523.8.camel@localhost.localdomain>

On Tue, 2005-04-12 at 12:46, Roland Fehrenbacher wrote: 
>     Hal> SM election occurs per high priority low GUID. So if you
>     Hal> don't care which SM is the master then you don't need to do
>     Hal> anything. If you want a specific order (and it is not in GUID
>     Hal> order) then you need to specify priority.
> 
> Ok. I tried this, specifying priority 0 on one server, and priority 15
> on another one. I assume priority 15 will be the master.
> If I first start the priority 0 opensm, and then the priority 15 one,
> things look normal: Log excerpts
> 
> priority 0 server
> 
> Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
> Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
> Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
> Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
> Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
> Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
> Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
> Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
> 
> priority 15 server
> 
> Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
> Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
> Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
> Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> 
> When I kill the priority 15 server, however, the priority 0 server runs
> amok with continuous log messages like:
> 
> Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
> Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.

Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the
priority 0 server failing (no matching SubnGetResp received) which is
"normal" if you killed the priority 15 server.

Do the messages ever subside ?
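
For decoding the "method=1 attr=20" part of the umad_receiver message, here is a minimal C sketch. It assumes, as in the reading above, that both values are printed in hex and that the MAD is a subnet management packet. The function names are hypothetical and only a few values are covered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: name a few subnet management methods. */
const char *smp_method_name(uint8_t method)
{
    switch (method) {
    case 0x01: return "SubnGet";
    case 0x02: return "SubnSet";
    case 0x81: return "SubnGetResp";
    default:   return NULL; /* not covered by this sketch */
    }
}

/* Hypothetical helper: name a few subnet management attribute IDs. */
const char *smp_attr_name(uint16_t attr_id)
{
    switch (attr_id) {
    case 0x0011: return "NodeInfo";
    case 0x0015: return "PortInfo";
    case 0x0020: return "SMInfo";
    default:     return NULL; /* not covered by this sketch */
    }
}
```

Applied to the log line, method 0x01 and attribute 0x20 come out as SubnGet and SMInfo, i.e. SubnGet(SMInfo).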

> I assume that the handover to the priority 0 opensm hasn't worked
> then.

This isn't really handover but that is another matter.
You should be able to use the sminfo diag to see whether this SM has
assumed the MASTER role.

-- Hal


From iod00d at hp.com  Tue Apr 12 10:34:37 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 12 Apr 2005 10:34:37 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <527jj8wut7.fsf@topspin.com>
References: <52fyxxywhg.fsf@topspin.com> <425AD0E5.3040805@ichips.intel.com>
	<52wtr9xc84.fsf@topspin.com> <425B0575.5090702@ichips.intel.com>
	<52k6n8yk6h.fsf@topspin.com> <52fyxwyjea.fsf@topspin.com>
	<425B0DB6.9090002@ichips.intel.com> <527jj8yhv8.fsf@topspin.com>
	<20050412025835.GE13577@esmail.cup.hp.com>
	<527jj8wut7.fsf@topspin.com>
Message-ID: <20050412173437.GB17646@esmail.cup.hp.com>

On Mon, Apr 11, 2005 at 08:14:12PM -0700, Roland Dreier wrote:
>     Grant> doorbell[] is a local variable and mthca_write64() is
>     Grant> static inline.  I don't see a problem with the assignments
>     Grant> to doorbell getting optimized out since the scope of that
>     Grant> variable is completely visible to gcc. A smart compiler
>     Grant> would just use registers and reduce the 32-bit stores.
> 
> Actually, what happens is that the compiler sees that we write to doorbell[]
> as a uint32_t but then read from it by dereferencing a uint64_t*.  -O2
> turns on -fstrict-aliasing, which allows the compiler to assume that
> pointers of different types never alias each other.  So gcc says, hey,
> all you do is write to that local doorbell[] variable and never do
> anything with the values you write, so I'll just throw away that dead
> code.  So gcc ends up only generating code for the store in
> mthca_write64() without any code to initialize doorbell[].

Yup - I saw your followup right after I posted.
But this is a better explanation...thanks!
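
The aliasing problem described above can be shown outside the driver. The sketch below builds the 64-bit value from the two 32-bit words with shifts, which is the idea behind MTHCA_PAIR_TO_64 in the patch, so no uint32_t object is ever read through a uint64_t pointer. It assumes the GCC/Clang predefined macro __BYTE_ORDER__ (the patch itself uses __BYTE_ORDER) and treats anything that is not little-endian as big-endian. The memcpy variant is a common alternative and not what the patch does.

```c
#include <stdint.h>
#include <string.h>

/* Combine two 32-bit words into the 64-bit value that a 64-bit load
 * from the same memory would produce, using only arithmetic on the
 * uint32_t values. This stays valid under -fstrict-aliasing. */
uint64_t pair_to_64(const uint32_t val[2])
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return (uint64_t) val[1] << 32 | val[0];
#else
    return (uint64_t) val[0] << 32 | val[1];
#endif
}

/* Alternative that is also safe under -fstrict-aliasing: copy the
 * bytes with memcpy instead of casting the pointer. */
uint64_t pair_to_64_memcpy(const uint32_t val[2])
{
    uint64_t out;
    memcpy(&out, val, sizeof out);
    return out;
}
```

Both functions return the same value on a given machine. The original form, *(uint64_t *) val, is the one gcc is allowed to assume never reads the uint32_t stores.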

grant


From mst at mellanox.co.il  Tue Apr 12 11:23:57 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Apr 2005 21:23:57 +0300
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <521x9gyhe7.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<521x9gyhe7.fsf@topspin.com>
Message-ID: <20050412182357.GA24047@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
> 
>     Roland> Yes, because the kernel may go through and unmap pages
>     Roland> from userspace while trying to swap.  Since we have the
>     Roland> page locked in the kernel, the physical page won't go
>     Roland> anywhere, but userspace might end up with a different page
>     Roland> mapped at the same virtual address.
> 
>     Andrew> That shouldn't happen.  If get_user_pages() has elevated
>     Andrew> the refcount on a page then the following can happen:
> 
>     ...
> 
>     Andrew> IOW: while the page has an elevated refcount from
>     Andrew> get_user_pages(), that physical page is 100% pinned.
>     Andrew> Once you've done the set_page_dirty+put_page(), the page
>     Andrew> is again under control of the VM.
> 
> Hmm... I've never tested it first hand but Libor assures me there is
> something like what I said.  Libor, did I get the explanation right?
> 
>  - R.

Roland, is it possible that what you describe is the behaviour of older kernels?

Digging around in rmap.c, I see the following code in try_to_unmap_one:

        /*
         * Don't pull an anonymous page out from under get_user_pages.
         * GUP carefully breaks COW and raises page count (while holding
         * page_table_lock, as we have here) to make sure that the page
         * cannot be freed.  If we unmap that page here, a user write
         * access to the virtual address will bring back the page, but
         * its raised count will (ironically) be taken to mean it's not
         * an exclusive swap page, do_wp_page will replace it by a copy
         * page, and the user never get to see the data GUP was holding
         * the original page for.
         *
         * This test is also useful for when swapoff (unuse_process) has
         * to drop page lock: its reference to the page stops existing
         * ptes from being unmapped, so swapoff can make progress.
         */
        if (PageSwapCache(page) &&
            page_count(page) != page_mapcount(page) + 2) {
                ret = SWAP_FAIL;
                goto out_unmap;
        }

This was added in http://linus.bkbits.net:8080/linux-2.5/patch@1.1722.120.6
on 2004-06-05, i.e. as far as I can see around 2.6.7, and the comment says:

>>>>>>>>>>>>>>>>>>>>>>
> [PATCH] mm: get_user_pages vs. try_to_unmap
> 
> Andrea Arcangeli's fix to an ironic weakness with get_user_pages. 
> 
> try_to_unmap_one must check page_count against page->mapcount before unmapping
> a swapcache page: because the raised pagecount by which get_user_pages ensures
> the page cannot be freed, will cause any write fault to see that page as not
> exclusively owned, and therefore a copy page will be substituted for it - the
> reverse of what's intended.
> 
> rmap.c was entirely free of such page_count heuristics before, I tried hard to
> avoid putting this in.  But Andrea's fix rarely gives a false positive; and
> although it might be nicer to change exclusive_swap_page etc.  to rely on
> page->mapcount instead, it seems likely that we'll want to get rid of
> page->mapcount later, so better not to entrench its use.
> 
> Signed-off-by: Hugh Dickins <hugh at veritas.com>
> Signed-off-by: Andrew Morton <akpm at osdl.org>
> Signed-off-by: Linus Torvalds <torvalds at osdl.org>
>>>>>>>>>>>>>>>>>>>>>>

Seems quite like the situation that you described. Does my analysis make sense?

Since this case seems to be explicitly handled,
it is probably safe to rely on this behaviour of try_to_unmap,
avoiding the need for mlock, is it not?
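
To make the quoted test from try_to_unmap_one concrete, here is a toy user-space model of it. The struct is hypothetical and only stands in for the three values the kernel code reads. The constant 2 is taken as-is from the quoted code, and the sketch does not try to explain which references make up that baseline.

```c
#include <stdbool.h>

/* Toy model of the values the quoted check reads; this is not the
 * kernel's struct page. */
struct toy_page {
    bool in_swap_cache; /* stands in for PageSwapCache(page) */
    int  count;         /* stands in for page_count(page) */
    int  mapcount;      /* stands in for page_mapcount(page) */
};

/* True when the quoted test would make try_to_unmap_one give up with
 * SWAP_FAIL: a swap-cache page whose reference count differs from its
 * map count plus the 2 used in the quoted code. */
bool unmap_must_fail(const struct toy_page *p)
{
    return p->in_swap_cache && p->count != p->mapcount + 2;
}
```

A swap-cache page mapped once with count 3 passes the test and may be unmapped. One extra reference, such as the one get_user_pages() takes, raises the count to 4 and the unmap is refused.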

-- 
MST - Michael S. Tsirkin


From iod00d at hp.com  Tue Apr 12 11:47:30 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 12 Apr 2005 11:47:30 -0700
Subject: [openib-general] failed to allocate buffer page
Message-ID: <20050412184730.GE17646@esmail.cup.hp.com>

Hi,
Haven't checked yet what the "side effect" of this error was,
but here is the output so people are aware of it.

This is the first time I've seen this. I've been doing
(unload, reload, test) loops a lot last week. Just scripted
the set of commands and got the error on the first try.

Not reproducible on this or other boxes.

grant


gsyprf3:/usr/src/linux-2.6# reload_ib 
+ IPoIB=51
+ ifconfig ib0 down
+ rmmod ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
ERROR: Module ib_sdp does not exist in /proc/modules
ERROR: Module ib_cm does not exist in /proc/modules
ACPI: PCI interrupt for device 0000:81:00.0 disabled
GSI 60 (level, low) -> vector 69 unregisterd.
+ modprobe ib_mthca msi_x=1
ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0)
GSI 60 (level, low) -> CPU 1 (0x0100) vector 69
ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69
+ modprobe ib_ipoib
+ modprobe ib_sdp
modprobe: page allocation failure. order:0, mode:0x20

Call Trace:
 [<a00000010000f3a0>] show_stack+0x80/0xa0
                                sp=e00000002920fc50 bsp=e0000000292090c0
 [<a00000010000f3f0>] dump_stack+0x30/0x60
                                sp=e00000002920fe20 bsp=e0000000292090a8
 [<a0000001000dd5d0>] __alloc_pages+0x5d0/0x8a0
                                sp=e00000002920fe20 bsp=e000000029209028
 [<a0000001000dd900>] __get_free_pages+0x60/0x120
                                sp=e00000002920fe30 bsp=e000000029209000
 [<a00000020025f930>] sdp_buff_pool_alloc+0xf0/0x3e0 [ib_sdp]
                                sp=e00000002920fe30 bsp=e000000029208f70
 [<a0000002002600a0>] sdp_buff_pool_init+0x480/0x620 [ib_sdp]
                                sp=e00000002920fe30 bsp=e000000029208f28
 [<a000000200204260>] sdp_init+0xe0/0x4e0 [ib_sdp]
                                sp=e00000002920fe30 bsp=e000000029208ef8
 [<a0000001000c56f0>] sys_init_module+0x470/0x640
                                sp=e00000002920fe30 bsp=e000000029208e80
 [<a00000010000a600>] ia64_ret_from_syscall+0x0/0x20
                                sp=e00000002920fe30 bsp=e000000029208e80
WARN: : Failed to allocate buffer page. <1024:747>
NET: Registered protocol family 27
+ ifconfig ib0 10.0.0.51 netmask 255.255.255.0 broadcast 10.0.0.255
+ ifconfig ib1 10.0.1.51 netmask 255.255.255.0 broadcast 10.0.1.255
gsyprf3:/usr/src/linux-2.6#


From iod00d at hp.com  Tue Apr 12 12:25:01 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 12 Apr 2005 12:25:01 -0700
Subject: [openib-general] NULL ptr derefence
Message-ID: <20050412192501.GA18034@esmail.cup.hp.com>

System panic'd when I ran the "reload_ib" script with NULL ptr.
Odd that I didn't see any problems with switching around module versions
by hand before. Scripting it seems to have exposed more race conditions
or something.

Sorry, I'm not sure which rev of openib code was running on this machine.
Is there some way I can tell the SVN version from the binaries in the
/lib/modules/`uname -r` directory?

It's possible this was already fixed...

thanks,
grant


ionize:/usr/src/linux-2.6# reload_ib 
+ IPoIB=113
+ ifconfig ib0 down
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
ib_mad1[1882]: Oops 8813272891392 [1]
Modules linked in: ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core tg3 dm_mod e1000 e100

Pid: 1882, CPU 1, comm:              ib_mad1
psr : 0000101008026018 ifs : 800000000000038b ip  : [<a0000002001214d0>]    Not tainted
ip is at ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa]
unat: 0000000000000000 pfs : 000000000000048d rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 000000000000a941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000200121a30 b6  : a000000100002d70 b7  : a000000200121440
f6  : 1003e8080808080808081 f7  : 1003e0000000000001400
f8  : 1003e0000000000001400 f9  : 1003e00000000000027d8
f10 : 1003e000000000ff00000 f11 : 1003e000000003b5f2d38
r1  : a000000200320000 r2  : a000000200123270 r3  : e0000001014a7d98
r8  : a000000200121440 r9  : 0000000000000006 r10 : 0000000000000003
r11 : 0000000000000001 r12 : e0000001014a7d20 r13 : e0000001014a0000
r14 : 0000000000000000 r15 : e0000002ead26588 r16 : a0000002001252d8
r17 : 0000000000000000 r18 : 0000000000000001 r19 : 0000000000000000
r20 : e00000000f05cf60 r21 : 0000000000000000 r22 : e00000000f05cf60
r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000200200
r26 : e000000100d22e70 r27 : 0000001008026018 r28 : e0000002e907c418
r29 : 0000000000100100 r30 : 0000000000000000 r31 : a000000200125da0

Call Trace:
 [<a00000010000f3a0>] show_stack+0x80/0xa0
                                sp=e0000001014a78e0 bsp=e0000001014a1190
 [<a00000010000fc00>] show_regs+0x7e0/0x800
                                sp=e0000001014a7ab0 bsp=e0000001014a1130
 [<a000000100033730>] die+0x150/0x1c0
                                sp=e0000001014a7ac0 bsp=e0000001014a10f0
 [<a000000100053b70>] ia64_do_page_fault+0x370/0x980
                                sp=e0000001014a7ac0 bsp=e0000001014a1088
 [<a00000010000a780>] ia64_leave_kernel+0x0/0x260
                                sp=e0000001014a7b50 bsp=e0000001014a1088
 [<a0000002001214d0>] ib_sa_mcmember_rec_callback+0x90/0xe0 [ib_sa]
                                sp=e0000001014a7d20 bsp=e0000001014a1030
 [<a000000200121a30>] send_handler+0x110/0x280 [ib_sa]
                                sp=e0000001014a7d70 bsp=e0000001014a0fe0
 [<a0000002000d26f0>] ib_mad_complete_send_wr+0x330/0x380 [ib_mad]
                                sp=e0000001014a7d70 bsp=e0000001014a0f90
 [<a0000002000d2920>] ib_mad_send_done_handler+0x1e0/0x2e0 [ib_mad]
                                sp=e0000001014a7d70 bsp=e0000001014a0f20
 [<a0000002000d2f00>] ib_mad_completion_handler+0x180/0x200 [ib_mad]
                                sp=e0000001014a7d80 bsp=e0000001014a0ed0
 [<a0000001000b1490>] worker_thread+0x3d0/0x520
                                sp=e0000001014a7db0 bsp=e0000001014a0e48
 [<a0000001000bb9e0>] kthread+0x160/0x180
                                sp=e0000001014a7e20 bsp=e0000001014a0e10
 [<a000000100011410>] kernel_thread_helper+0xd0/0x100
                                sp=e0000001014a7e30 bsp=e0000001014a0de0
 [<a0000001000090e0>] start_kernel_thread+0x20/0x40
                                sp=e0000001014a7e30 bsp=e0000001014a0de0
 + rmmod ib_ipoib ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
<6>NET: Unregistered protocol family 27


From roland at topspin.com  Tue Apr 12 12:29:59 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 12:29:59 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <20050412192501.GA18034@esmail.cup.hp.com> (Grant Grundler's
	message of "Tue, 12 Apr 2005 12:25:01 -0700")
References: <20050412192501.GA18034@esmail.cup.hp.com>
Message-ID: <52br8ju72g.fsf@topspin.com>

    Grant> System panic'd when I ran the "reload_ib" script with NULL
    Grant> ptr.  Odd that I didn't see any problems with switching
    Grant> around module versions by hand before. Scripting it seems
    Grant> to have exposed more race conditions or something.

I think we've seen this before but I never tracked it down.  I'll take
another look.

    Grant> Sorry, I'm not sure which rev of openib code was running on
    Grant> this machine.  Is there some way I can tell what SVN
    Grant> version from the binaries in /lib/modules/'uname -r'
    Grant> directory?

Not really that I know of, unfortunately.

 - R.


From roland at topspin.com  Tue Apr 12 12:28:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 12:28:36 -0700
Subject: [openib-general] failed to allocate buffer page
In-Reply-To: <20050412184730.GE17646@esmail.cup.hp.com> (Grant Grundler's
	message of "Tue, 12 Apr 2005 11:47:30 -0700")
References: <20050412184730.GE17646@esmail.cup.hp.com>
Message-ID: <52fyxvu74r.fsf@topspin.com>

This looks like a GFP_ATOMIC allocation failing.  Not sure where in
the SDP code it's being triggered.

 - R.


From steve at wooding.uklinux.net  Tue Apr 12 13:08:15 2005
From: steve at wooding.uklinux.net (Steven Wooding)
Date: Tue, 12 Apr 2005 21:08:15 +0100
Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors
Message-ID: <425C2AAF.2050700@wooding.uklinux.net>

Hi,

I have been putting ttcp.aio through its paces and have a few questions.

1. When -l is larger than 131072 I get an Event error <-22> on the transmit
side and no data is transferred. Changing the values of -n and -a does not make
any difference.

2. When using a value of 1 for -a (so I suppose this is non-aio), I get an
Event error of <-32> on the transmit side and an <-104> on the receiver end.
Only some of the data is transferred.

3. For future reference, where can I find out what these Event error codes
mean, to give me a clue of what's going wrong?

4. I sometimes see significant differences in the transfer speed reported on
the transmit and receiver ends. Is one more right than the other?

My system details are:
Two nodes with Dual Xeon 64-bit processors
HCA: MT25208 (in MT23108 compat mode) with 128MB of ram
OS: RHEL 4 (64-bit)
Gen2 stack version: trunk of 2113 (subversion revision number)

Thanks,

Steve.


From libor at topspin.com  Tue Apr 12 13:03:05 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 12 Apr 2005 13:03:05 -0700
Subject: [openib-general] failed to allocate buffer page
In-Reply-To: <20050412184730.GE17646@esmail.cup.hp.com>;
	from iod00d@hp.com on Tue, Apr 12, 2005 at 11:47:30AM -0700
References: <20050412184730.GE17646@esmail.cup.hp.com>
Message-ID: <20050412130305.C6958@topspin.com>

On Tue, Apr 12, 2005 at 11:47:30AM -0700, Grant Grundler wrote:
> Hi,
> Haven't checked yet what the "side effect" of this error was,
> but here is the output so people are aware of it.
> 
> This is the first time I've seen this. I've been doing
> (unload, reload, test) loops a lot last week. Just scripted
> the set of commands and got the error on the first try.
> 
> Not reproducible on this or other boxes.
>
>  [<a00000020025f930>] sdp_buff_pool_alloc+0xf0/0x3e0 [ib_sdp]
>                                 sp=e00000002920fe30 bsp=e000000029208f70

  This is the SDP buffer allocator. At init time it pre-allocates
some buffers that are used for transfers. The allocator uses GFP_ATOMIC
since it can also be called at run time. I'll add a function parameter
to determine how the allocator should be called.

-Libor


From libor at topspin.com  Tue Apr 12 13:46:13 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 12 Apr 2005 13:46:13 -0700
Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors
In-Reply-To: <425C2AAF.2050700@wooding.uklinux.net>;
	from steve@wooding.uklinux.net on Tue, Apr 12, 2005 at 09:08:15PM
	+0100
References: <425C2AAF.2050700@wooding.uklinux.net>
Message-ID: <20050412134613.D6958@topspin.com>

On Tue, Apr 12, 2005 at 09:08:15PM +0100, Steven Wooding wrote:
> Hi,
> 
> I have been putting ttcp.aio through its paces and have a few questions.
> 
> 1. When -l is larger than 131072 I get an Event error <-22> on the transmit
> side and no data is transferred. Changing the values of -n and -a does not make
> any difference.

  The FMRs need to be sized at initialization time. The code currently
picks 128K as the size for the FMRs, and does not support an AIO operation
that would span multiple FMRs. If you want to try larger AIO operations
with the current code you will need to recompile SDP with a larger FMR
size, which is determined by the constant SDP_IOCB_SIZE_MAX in sdp_iocb.h.
It's been a while since I last tried this; if you try it and have
problems, let me know.
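
A minimal sketch of the limit described above, assuming the size check is what produces the -22 (EINVAL) event error. This is not the SDP driver code; the function name `check_aio_length` and the constant `FMR_SIZE` are hypothetical and only illustrate the 128K boundary.

```python
import errno

# 128K, the FMR size mentioned above (131072 bytes).
FMR_SIZE = 128 * 1024


def check_aio_length(length, fmr_size=FMR_SIZE):
    """Return 0 if one AIO buffer fits in a single FMR, else -EINVAL.

    Hypothetical helper: it only models the rule that an AIO operation
    may not span multiple FMRs.
    """
    if length > fmr_size:
        return -errno.EINVAL
    return 0


for length in (65536, 131072, 131073):
    print(length, check_aio_length(length))
```

A length of exactly 131072 passes, which matches the report that only values larger than 131072 fail.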

> 2. When using a value of 1 for -a (so I suppose this is non-aio), I get an
> Event error of <-32> on the transmit side and an <-104> on the receiver end.
> Only some of the data is transferred.

  I'll look into this; I'm seeing a problem on longer runs myself. With a
value of 1 for -a it still uses aio; the value only sets how many aio
operations can be outstanding at a given time. A value of 1 just means that a
single buffer will be submitted for read/write and a new one will not
be submitted until that buffer's IO completes.

> 3. For future reference, where can I find out what these Event error codes
> mean, to give me a clue of what's going wrong?

  The errors are errno values. I'll make a note to write up which errors
are possible and what they are likely to mean.
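
Since the event errors are negated errno values, they can be looked up directly. The sketch below uses Python's standard `errno` and `os` modules; the helper name `describe_event_error` is made up for illustration. On Linux, 22 is EINVAL, 32 is EPIPE and 104 is ECONNRESET, though the numbering can differ on other platforms.

```python
import errno
import os


def describe_event_error(code):
    """Map a negative event error code to (errno name, message)."""
    number = abs(code)
    # errorcode maps errno numbers to symbolic names on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


for code in (-22, -32, -104):
    name, message = describe_event_error(code)
    print(f"{code}: {name} ({message})")
```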

> 4. I sometimes see significant differences in the transfer speed reported on
> the transmit and receiver ends. Is one more right than the other?

  Are the wall clock times for the data transfers small, on the order
of a few seconds? How big of a wall clock time difference are you seeing?

-Libor


From steve at wooding.uklinux.net  Tue Apr 12 14:28:20 2005
From: steve at wooding.uklinux.net (Steven Wooding)
Date: Tue, 12 Apr 2005 22:28:20 +0100
Subject: [openib-general] Repost: AIO SDP and ttcp.aio: Event errors
In-Reply-To: <20050412134613.D6958@topspin.com>
References: <425C2AAF.2050700@wooding.uklinux.net>
	<20050412134613.D6958@topspin.com>
Message-ID: <425C3D74.8090705@wooding.uklinux.net>

1. OK. That's fair enough. I'll give that a go.

2. Yeah, it also occurs for large values of -n, say 10000.

3. Great.

4. Yeah, the times are small as I'm only doing short runs (-n 1000) to 
avoid the -32/-104 errors. I'll try pushing -n up a bit.

Thanks Libor,

Steve.




From halr at voltaire.com  Tue Apr 12 14:58:58 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Apr 2005 17:58:58 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0BC@mtlex01.yok.mtl.com>
Message-ID: <1113343036.4476.0.camel@localhost.localdomain>

Hi Eitan,

On Fri, 2005-04-08 at 11:55, Eitan Zahavi wrote: 
> Hi Hal,
> 
> This is a physical port attribute so the file is osm_port.h and the
> structure is osm_physp_t.
> From the doc on the structure:
> *
> *  healthy
> *     Tracks the health of the port. Normally should be TRUE but 
> *     might change as a result of incoming traps indicating the port
> *     healthy is questionable.
> *
> 
> I have been trying my best to find how it can happen that a port that
> does not respond will cause OpenSM to continuously poll it. This can
> not happen so unless you can explain how it happens please do not
> contaminate the code with un-needed code.

In looking at the unhealthy code, it appears to me that the unhealthy
bit is only set if the SM receives traps 129-131 and not if the SMA does
not respond to SM MADs so these ports will not be detected and hence not
bypassed.

-- Hal


From surs at cse.ohio-state.edu  Tue Apr 12 16:28:05 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Tue, 12 Apr 2005 19:28:05 -0400
Subject: [openib-general] segmentation fault with ibv_pingpong
Message-ID: <20050412232804.GA30632@cse.ohio-state.edu>

Hi,

I am facing a segmentation fault problem with the OpenIB Gen2 drivers
while executing `ibv_pingpong' test. The description of the problem is
given below. Can someone point out what may be going wrong here? I have
included as much information as I thought would be required, but if
more specific information is needed, I can provide it.

Thanks,
Sayantan.

Hardware:
---------

Two Dual Intel Xeon EM64T 3.4 GHz nodes
PCI-Express I/O bus
MT25208 Mellanox HCAs (rev a0)

Software:
---------
RedHat AS 4
2.6.11.6/2.6.11.7 kernel with Gen2 InfiniBand drivers
Firmware version 5.0.1
OpenIB Gen2 drivers (user verbs from main branch)
OpenSM (both the OpenIB version and IBGD 1.7.0 give the same result)


Both the machines display their ports as ACTIVE.

[surs at x1:~] cat /sys/class/infiniband/mthca0/ports/1/state
4: ACTIVE

[surs at x5:bin] lsmod | grep ib
ib_uverbs              28056  0 
ib_umad                17696  0 
ib_mthca              113952  0 
ib_mad                 38576  2 ib_umad,ib_mthca
ib_core                52352  4 ib_uverbs,ib_umad,ib_mthca,ib_mad
libata                 53000  1 ata_piix
scsi_mod              151888  3 libata,aic79xx,sd_mod

Now, if I try to run ibv_pingpong, I get this error:

--->

[surs at x1:~] ibv_pingpong
Segmentation fault
[surs at x1:~]
Message from syslogd at x1 at Mon Apr 11 18:37:18 2005 ...
x1 kernel: invalid operand: 0000 [1] SMP

<---

The relevant part from the kernel log:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at pci_gart:537
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: ib_uverbs ib_umad ib_mthca ib_mad ib_core parport_pc
lp parport autofs4 nfs lockd sunrpc dm_mod video button battery ac md5
ipv6 uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core e1000 floppy ext3 jbd
ata_piix libata aic79xx sd_mod scsi_mod
Pid: 4034, comm: ibv_pingpong Not tainted 2.6.11.6
RIP: 0010:[<ffffffff8011da86>] <ffffffff8011da86>{dma_map_sg+223}
RSP: 0018:ffff81001d4dfd58  EFLAGS: 00010246
RAX: 000000001b92b000 RBX: ffff81001764fbf8 RCX: 000000001b92b000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000ff0 R11: 0000000000000246 R12: ffff81001764f000
R13: ffff81001764fbf8 R14: 0000000000000001 R15: ffff81001f92c070
FS:  00002aaaaacca000(0000) GS:ffffffff804c6380(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000515000 CR3: 0000000013145000 CR4: 00000000000006e0
Process ibv_pingpong (pid: 4034, threadinfo ffff81001d4de000, task ffff81001dc641f0)
Stack: ffff81001dc641f0 0000000000000000 0000000100000000 ffff81001764fbd0
       ffff8100176f3000 ffff81001764f000 0000000000513000 0000000000000001
       ffff81001facfa40 ffffffff8824b621
Call Trace:<ffffffff8824b621>{:ib_mthca:mthca_map_user_db+366}
       <ffffffff88249f54>{:ib_mthca:mthca_create_cq+115} <ffffffff880f38f6>{:ib_uverbs:ib_uverbs_create_cq+165}
       <ffffffff880f2608>{:ib_uverbs:ib_uverbs_write+139}
       <ffffffff8017443b>{vfs_write+207} <ffffffff8017454a>{sys_write+69}
       <ffffffff8010e29e>{system_call+126}

Code: 0f 0b 0f 91 32 80 ff ff ff ff 19 02 89 f8 49 8b 97 08 01 00
RIP <ffffffff8011da86>{dma_map_sg+223} RSP <ffff81001d4dfd58>

----------------------------------------------------------------

Now, if I try to run ibv_pingpong under gdb (sender side), I get it
to progress a little bit more (but not to completion). The receiver
prints this now:

<---

[surs at x5:examples] ibv_pingpong 192.168.107.2
local address:  LID 0x0002, QPN 0x000404, PSN 0x104788
remote address: LID 0x0001, QPN 0x000404, PSN 0x08b81e
[ 0] 00000404
[ 4] b3000000
[ 8] fd000003
[ c] 110000c0
[10] 15810000
[14] 00000010
[18] 00008002
[1c] ff100000
Failed status 12 for wr_id 2

--->


-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From roland at topspin.com  Tue Apr 12 16:38:56 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 16:38:56 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <20050412232804.GA30632@cse.ohio-state.edu> (Sayantan Sur's
	message of "Tue, 12 Apr 2005 19:28:05 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu>
Message-ID: <52u0mbsgz3.fsf@topspin.com>

Thanks for the report.  I think I have all the information I need and
I'll try to figure out what's happening.

 - R.


From roland at topspin.com  Tue Apr 12 16:50:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 16:50:11 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <20050412232804.GA30632@cse.ohio-state.edu> (Sayantan Sur's
	message of "Tue, 12 Apr 2005 19:28:05 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu>
Message-ID: <52ll7nsggc.fsf@topspin.com>

OK, I think I see the problem.  Can you please try this patch and let
me know if it helps?

Thanks,
  Roland

--- infiniband/hw/mthca/mthca_memfree.c	(revision 2156)
+++ infiniband/hw/mthca/mthca_memfree.c	(working copy)
@@ -384,6 +384,7 @@ int mthca_map_user_db(struct mthca_dev *
 	if (ret < 0)
 		goto out;
 
+	db_tab->page[i].mem.length = 4096;
 	db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK;
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
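
A small sketch of the arithmetic in the offset line of the patch above, assuming the usual kernel convention PAGE_MASK = ~(PAGE_SIZE - 1) with 4096-byte pages. The address `0x513040` is a made-up example value, not one taken from the report.

```python
PAGE_SIZE = 4096
# Kernel convention: PAGE_MASK keeps the page-number bits of an address.
PAGE_MASK = ~(PAGE_SIZE - 1)


def page_offset(uaddr):
    """Offset of uaddr inside its page, i.e. uaddr & ~PAGE_MASK."""
    return uaddr & ~PAGE_MASK


uaddr = 0x513040  # hypothetical user address
print(hex(uaddr & PAGE_MASK))   # start of the page containing uaddr
print(hex(page_offset(uaddr)))  # offset within that page
```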


From surs at cse.ohio-state.edu  Tue Apr 12 17:15:19 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Tue, 12 Apr 2005 20:15:19 -0400
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <52ll7nsggc.fsf@topspin.com>
References: <20050412232804.GA30632@cse.ohio-state.edu>
	<52ll7nsggc.fsf@topspin.com>
Message-ID: <425C6497.7060703@cse.ohio-state.edu>

Roland Dreier wrote:

>OK, I think I see the problem.  Can you please try this patch and let
>me know if it helps?
>
>Thanks,
>  Roland
>
>--- infiniband/hw/mthca/mthca_memfree.c	(revision 2156)
>+++ infiniband/hw/mthca/mthca_memfree.c	(working copy)
>@@ -384,6 +384,7 @@ int mthca_map_user_db(struct mthca_dev *
> 	if (ret < 0)
> 		goto out;
> 
>+	db_tab->page[i].mem.length = 4096;
> 	db_tab->page[i].mem.offset = uaddr & ~PAGE_MASK;
> 
> 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
>  
>

Great! This helps. I can execute ibv_pingpong without any errors.

[surs at x5:~] ibv_pingpong
  local address:  LID 0x0001, QPN 0x000404, PSN 0x297822
  remote address: LID 0x0002, QPN 0x000404, PSN 0x1c5b52
8192000 bytes in 0.02 seconds = 3403.05 Mbit/sec
1000 iters in 0.02 seconds = 19.26 usec/iter
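
The two summary lines above can be cross-checked against each other. The sketch below takes the printed byte count and per-iteration latency and recomputes the throughput; the small difference from the printed 3403.05 Mbit/sec comes from the latency being rounded to two decimals.

```python
total_bytes = 8192000   # from the first summary line
iterations = 1000       # from the second summary line
usec_per_iter = 19.26   # from the second summary line (rounded)

elapsed_s = iterations * usec_per_iter / 1e6
mbit_per_s = total_bytes * 8 / elapsed_s / 1e6
bytes_per_iter = total_bytes // iterations

print(f"elapsed: {elapsed_s:.5f} s")
print(f"throughput: {mbit_per_s:.1f} Mbit/sec")
print(f"bytes per iteration: {bytes_per_iter}")
```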

Thanks,
Sayantan.



From roland at topspin.com  Tue Apr 12 17:27:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 12 Apr 2005 17:27:26 -0700
Subject: [openib-general] segmentation fault with ibv_pingpong
In-Reply-To: <425C6497.7060703@cse.ohio-state.edu> (Sayantan Sur's message
	of "Tue, 12 Apr 2005 20:15:19 -0400")
References: <20050412232804.GA30632@cse.ohio-state.edu>
	<52ll7nsggc.fsf@topspin.com> <425C6497.7060703@cse.ohio-state.edu>
Message-ID: <52hdibseq9.fsf@topspin.com>

    Sayantan> Great! This helps. I can execute ibv_pingpong without
    Sayantan> any errors.

Thanks for testing.  I committed this change to the subversion tree.

 - R.


From libor at topspin.com  Tue Apr 12 18:04:47 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 12 Apr 2005 18:04:47 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org>;
	from akpm@osdl.org on Mon, Apr 11, 2005 at 05:13:47PM -0700
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
Message-ID: <20050412180447.E6958@topspin.com>

On Mon, Apr 11, 2005 at 05:13:47PM -0700, Andrew Morton wrote:
> Roland Dreier <roland at topspin.com> wrote:
> >
> >     Troy> Do we even need the mlock in userspace then?
> > 
> > Yes, because the kernel may go through and unmap pages from userspace
> > while trying to swap.  Since we have the page locked in the kernel,
> > the physical page won't go anywhere, but userspace might end up with a
> > different page mapped at the same virtual address.

With the last few kernels I haven't had a chance to retest the problem
that pushed us in the direction of using mlock. I will go back and do
so with the latest kernel. Below I've given a quick description of the
issue.

> That shouldn't happen.  If get_user_pages() has elevated the refcount on a
> page then the following can happen:
> 
> - The VM may decide to add the page to swapcache (if it's not mmapped
>   from a file).
> 
> - Once the page is backed by either swapcache or a (mmapped) file, the VM
>   may decide to unmap the application's pte's.  A later minor fault by the
>   app will cause the same physical page to be remapped.

The driver did use get_user_pages() to elevate the refcount on all the
pages it was going to use for IO, as well as call set_page_dirty() since
the pages were going to have data written to them from the device.

The problem we were seeing is that the minor fault by the app resulted
in a new physical page getting mapped for the application. The page that
had the elevated refcount was still waiting for data to be written
to it by the driver at the time that the app accessed the page, causing the
minor fault. Since the app had a new mapping, the data written
by the driver was lost.

It looks like code was added to try_to_unmap_one() to address this, so
hopefully it's no longer an issue...


-Libor


From eitan at mellanox.co.il  Tue Apr 12 22:28:28 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 08:28:28 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>

> -- Hal 
> In looking at the unhealthy code, it appears to me that the unhealthy
> bit is only set if the SM receives traps 129-131 and not if the SMA does
> not respond to SM MADs so these ports will not be detected and hence not
> bypassed.
> 
[EZ] This is true. Currently there is only one cause for the un-healthy bit
to be set, which is exactly as you point out: these traps. The point I was
trying to make was that this bit is the mechanism for flagging that a port's
status is bad.
What I did recommend was to write a "statistical" analysis of Directed Route
packet drops, such that we can find the ports with a high drop rate and mark
them as un-healthy. If you mark every port that does not respond to a MAD as
un-healthy, you can be misled by flaky links somewhere on the route to that
port. Only analysis of the number of good packets vs. dropped packets can
lead you to the right bad port.
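
One way the drop-rate analysis could look is sketched below. This is an illustration only, not OpenSM code: the function name, the port names and the sample data are all invented. Each sample is a directed route (the list of ports a MAD traverses) plus whether a response came back, and every port on the route is charged with the outcome.

```python
from collections import defaultdict


def rank_ports_by_drop_rate(samples):
    """Rank ports by the fraction of MADs dropped on routes through them.

    samples: iterable of (route, responded), where route is a list of
    port names and responded is True if the MAD got a response.
    """
    sent = defaultdict(int)
    dropped = defaultdict(int)
    for route, responded in samples:
        for port in route:
            sent[port] += 1
            if not responded:
                dropped[port] += 1
    rates = {port: dropped[port] / sent[port] for port in sent}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical fabric: sw2/p3 is flaky, the end ports behind it are fine.
samples = [
    (["sw1/p1", "sw2/p3", "hca5/p1"], False),
    (["sw1/p1", "sw2/p3", "hca5/p1"], True),
    (["sw1/p1", "sw2/p3", "hca7/p1"], False),
    (["sw1/p1", "sw2/p3", "hca7/p1"], True),
    (["sw1/p1", "sw2/p3", "hca8/p1"], False),
    (["sw1/p1", "sw3/p2", "hca6/p1"], True),
    (["sw1/p1", "sw3/p2", "hca8/p1"], True),
]

ranked = rank_ports_by_drop_rate(samples)
for port, rate in ranked:
    print(f"{port}: {rate:.2f}")
```

In this made-up data the intermediate port sw2/p3 ranks highest, while hca8/p1, which failed once behind it, still answers over the other path.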


From eitan at mellanox.co.il  Tue Apr 12 22:50:25 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 08:50:25 +0300
Subject: [openib-general] OpenSM (again)
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F2@mtlex01.yok.mtl.com>

FYI: OpenSM implements master handover in a "lazy" or "less intrusive"
manner:

OpenSM will only hand off a subnet to the new master during a heavy sweep
sequence.
So if you start an SM and then start one with higher priority, the handoff
will not happen unless there is some change in the subnet (a trap or a switch
"change bit").

The main reason for this behavior is the concept of the "light sweep", which
limits discovery to checking the "change bits" and, now, also
"unresponsive ports". So the new SM is not even discovered by the running SM.

The benefit is that as long as there is no change in the subnet, the active
SM does not transfer ownership to the new one. Such a transfer has an overhead
on the entire subnet (client re-registration or even LID changes).

This behavior is compliant, as the spec says:
C14-60.2.1: If a Master SM finds another Master SM with lower priority (or
same priority and higher GUID) it shall ensure that it is the highest priority
(or same priority and lower GUID) on the subnet, and if so it shall wait for
the other Master (or Masters) to relinquish control of its portion of the
subnet.
C14-61.2.2: If a Master SM determines that a lower priority Master SM
has not performed a handover within a vendor-specific time period, then
it shall not change the state of the subnet.
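
The ordering rule used in the quoted compliance statements (highest priority wins, ties go to the lower GUID) can be sketched as follows. This shows only the comparison, not OpenSM's handover logic; the function name is hypothetical, and the two GUIDs are the port GUIDs that appear in the quoted logs below.

```python
def elect_master(sms):
    """Pick the master from (priority, guid) pairs.

    Highest priority wins; on equal priority the lowest GUID wins.
    """
    return min(sms, key=lambda sm: (-sm[0], sm[1]))


candidates = [
    (0, 0x2C902004013C2),   # priority 0 server
    (15, 0x2C9020040133A),  # priority 15 server
]

priority, guid = elect_master(candidates)
print(f"master: priority {priority}, GUID 0x{guid:016x}")
```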
 
Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, April 12, 2005 8:00 PM
> To: rf at q-leap.de
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] OpenSM (again)
> 
> On Tue, 2005-04-12 at 12:46, Roland Fehrenbacher wrote:
> >     Hal> SM election occurs per high priority low GUID. So if you
> >     Hal> don't care which SM is the master than you don't need to do
> >     Hal> anything. If you want a specific order (and it is not in GUID
> >     Hal> order) then you need to specify priority.
> >
> > Ok. I tried this, specifying priority 0 on one server, and priority 15
> > on another one. I assume priority 15, will be the master.
> > If I first start the priority 0 opensm, and then the priority 15 one,
> > things look normal: Log excerpts
> >
> > priority 0 server
> >
> > Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
> > Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> > Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
> > Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
> > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
> > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
> > Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
> > Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
> > Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
> > Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
> >
> > priority 15 server
> >
> > Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
> > Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
> > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> > Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
> > Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
> > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> > Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
> >
> > When I kill the priority 15 server however, the priority 0 server runs
> > amok with continous log messages like:
> >
> > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
> > Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
> 
> Attribute 0x20 is SMInfo. This is just the SubnGet(SMInfo) from the
> priority 0 server failing (no matching SubnGetResp received) which is
> "normal" if you killed the priority 15 server.
> 
> Do the messages ever subside ?
> 
> > I assume that the handover to the priority 0 opensm hasn't worked
> > then.
> 
> This isn't really handover but that is another matter.
> You should be able to use the sminfo diag to see whether this SM has
> assumed the MASTER role.
> 
> -- Hal
> 
> 

From tziporet at mellanox.co.il  Wed Apr 13 01:19:44 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 13 Apr 2005 11:19:44 +0300
Subject: [openib-general] Re: uverbs events
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com>

Well, this is the description of this bug, but we hit this 64-bit
arithmetic problem on a simple if, not a switch.
For this reason I suggested giving it a try.

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:roland at topspin.com]
Sent: Tuesday, April 12, 2005 6:39 PM
To: Tziporet Koren
Cc: Grant Grundler; openib-general at openib.org
Subject: Re: [openib-general] Re: uverbs events


    Tziporet> Very important - there is a bug in gcc version 3.4.2
    Tziporet> that had been fixed in gcc 3.4.3.  This bug (#17581)
    Tziporet> hurt us in VAPI when full optimizations is working in
    Tziporet> bits or on 64 bits systems.

Thanks, but if the bug you're talking about is

    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17581

then I don't think that's going to affect us -- we don't seem to do
any 64-bit arithmetic inside a switch statement.

 - R.

From halr at voltaire.com  Wed Apr 13 02:00:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Apr 2005 05:00:05 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF0F0@mtlex01.yok.mtl.com>
Message-ID: <1113380170.4479.18.camel@localhost.localdomain>

On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> [EZ] This is true. Currently there is only one cause for the
> un-healthy bits to be set - which are exactly as you point - these
> traps. The point I was trying to make was that this bit is the
> mechanism for flagging a port status is bad. 
> 
> What I did recommend was to write a "statistical" analysis of Directed
> Route packet drop - such that we can find the ports with a high drop
> rate and mark them as un-healthy. If you mark every port that does not
> respond to a MAD as un-healthy you can suffer from flaky links
> somewhere on the route to that port. Only analysis of the number of
> good packets vs. dropped packets can lead you to the right bad port.

The original proposal on this said the following:

"The OpenSM will implement a configurable policy (some number of
consecutive lack of responses to SM requests). At the point of
exhaustion of the timeout/retry strategy, that port will be marked as
"bad" by OpenSM."

Any idea on what might make a good default threshold (for consecutive
retries) ? Do you think there is no sufficient default ?

If a link is flaky and MADs can't get through, should it be used for non
MAD traffic ?

Also note that the proposal also said:

"Also, there could also be a periodic "ping" at a slower rate to check
if the "bad" ports revive."

In terms of analysis of good v. errored and dropped packets (along the
path to that node), there are OpenIB diagnostic tools to help with this.
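
As a rough illustration of the policy quoted above (a sketch only: the
struct, the function names and the threshold passed in are hypothetical
and are not taken from the OpenSM sources), a per-port counter of
consecutive missed responses might look like this:

```c
#include <stdbool.h>

/* Hypothetical per-port bookkeeping; not an OpenSM structure. */
struct port_health {
    unsigned int consecutive_timeouts; /* requests in a row with no response */
    bool bad;                          /* set once the threshold is reached */
};

/* Call when a request to the port has exhausted its timeout/retry strategy. */
static void port_note_timeout(struct port_health *p, unsigned int threshold)
{
    p->consecutive_timeouts++;
    if (p->consecutive_timeouts >= threshold)
        p->bad = true;
}

/* Call on any response, including one to the slower periodic "ping". */
static void port_note_response(struct port_health *p)
{
    p->consecutive_timeouts = 0;
    p->bad = false; /* the port has revived */
}
```

Only consecutive failures count here, so a single response resets the
counter; what a sensible default threshold would be is exactly the open
question in this thread.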

-- Hal


From eitan at mellanox.co.il  Wed Apr 13 02:20:52 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Apr 2005 12:20:52 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF0F9@mtlex01.yok.mtl.com>

I probably did not make my point very clear:

It is bad (not to say wrong) to disqualify a port and mark it as a bad port
just because it did not respond to queries.
The cause of the issue might be a flaky link on the directed route to the
port.
If the SM were able to find that flaky link port, it would avoid marking
the wrong ports. Moreover, the port that would have been marked as bad by the
simplistic algorithm you propose will be discovered and operational, as there
are many other paths to reach it, walking around the real bad port!

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL



From shaharf at voltaire.com  Wed Apr 13 07:03:21 2005
From: shaharf at voltaire.com (shaharf)
Date: Wed, 13 Apr 2005 17:03:21 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <D4F8F0B3820E754C887699BEF26A8940545B7E@taurus.voltaire.com>

Eitan, 

	Your analysis is not completely accurate. The SM configures the
subnet using direct MADs only, and it builds a spanning tree of direct
routes. What I want to say is that it doesn't matter why exactly a
port is unreachable. Once a port cannot be reached, you can
retry the entire heavy sweep process, but if the problem repeats itself
(X times) on the same port, you have no alternative other than to disable
it. If the SM had an alternative method of building direct paths,
then such an alternative path could be attempted. Currently it is not
relevant. Speaking of "statistical analysis", what are the odds that a
port will behave well when it is queried directly, but start to lose
packets when a direct route is routed through it, and behave
consistently during all retries? Again, even if this is the case (and, as an
understatement, I am not sure how frequent it is), the port behind it is
unreachable and therefore "bad".

The current unhealthy port mechanism is not redundant to this "bad" port
mechanism because it does not handle the same case. Both mechanisms are
required. Whether they can share the same status bit is really an
implementation issue.

Relying on traps is very problematic in some cases, particularly in the
initial bring-up sweep when the SM LID is not even configured (remember
VTEC?).

Shahar   
	



From roland at topspin.com  Wed Apr 13 09:22:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 13 Apr 2005 09:22:37 -0700
Subject: [openib-general] Re: uverbs events
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com>
	(Tziporet Koren's message of "Wed, 13 Apr 2005 11:19:44 +0300")
References: <506C3D7B14CDD411A52C00025558DED6064BF35D@mtlex01.yok.mtl.com>
Message-ID: <52d5syr6ia.fsf@topspin.com>

    Tziporet> Well, this is the description of this bug, but we hit
    Tziporet> this problem with 64-bit arithmetic in a simple if,
    Tziporet> not a switch.

OK, thanks.  Do you know of any distributions that are shipping gcc 3.4.2?

 - R.


From roland at topspin.com  Wed Apr 13 11:28:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 13 Apr 2005 11:28:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050412182357.GA24047@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 12 Apr 2005 21:23:57 +0300")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <521x9gyhe7.fsf@topspin.com>
	<20050412182357.GA24047@mellanox.co.il>
Message-ID: <52sm1upm4s.fsf@topspin.com>

OK, I'm by no means an expert on this, but Libor and I looked at
rmap.c a little more, and there is code:

	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
			ptep_clear_flush_young(vma, address, pte)) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

before the check

	if (PageSwapCache(page) &&
	    page_count(page) != page_mapcount(page) + 2) {
		ret = SWAP_FAIL;
		goto out_unmap;
	}

If userspace allocates some memory but doesn't touch it aside from
passing the address in to the kernel, which does get_user_pages(), the
PTE will be young in that first test, right?  Does that mean that
the userspace mapping will be cleared and userspace will get a
different physical page if it faults that address back in?

 - R.


From akpm at osdl.org  Wed Apr 13 12:32:30 2005
From: akpm at osdl.org (Andrew Morton)
Date: Wed, 13 Apr 2005 12:32:30 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52sm1upm4s.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<521x9gyhe7.fsf@topspin.com>
	<20050412182357.GA24047@mellanox.co.il>
	<52sm1upm4s.fsf@topspin.com>
Message-ID: <20050413123230.7a18dff5.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
> OK, I'm by no means an expert on this, but Libor and I looked at
> rmap.c a little more, and there is code:
> 
> 	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
> 			ptep_clear_flush_young(vma, address, pte)) {
> 		ret = SWAP_FAIL;
> 		goto out_unmap;
> 	}
> 
> before the check
> 
> 	if (PageSwapCache(page) &&
> 	    page_count(page) != page_mapcount(page) + 2) {
> 		ret = SWAP_FAIL;
> 		goto out_unmap;
> 	}
> 
> If userspace allocates some memory but doesn't touch it aside from
> passing the address in to the kernel, which does get_user_pages(), the
> PTE will be young in that first test, right?

If get_user_pages() was called with write=1, get_user_pages() will fault in
a real page and yes, I guess it'll be pte_young.

If get_user_pages() was called with write=0, get_user_pages() will fault
in a mapping of the zero page and we'd never get this far.

> Does that mean that
> the userspace mapping will be cleared and userspace will get a
> different physical page if it faults that address back in? 
>

We won't try to unmap a page's ptes until that page has file-or-swapcache
backing.

If the pte is then cleared, a subsequent minor fault will reestablish the
mapping to the same physical page.  A major fault cannot happen because the
page was pinned by get_user_pages().


From halr at voltaire.com  Wed Apr 13 13:33:59 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Apr 2005 16:33:59 -0400
Subject: [openib-general] Re: [PATCH] teach ifconfig about ib [WAS: Latest
	IPoIB Bringup Questions]
In-Reply-To: <1107811374.6917.6.camel@duffman>
References: <1098985903.17991.74.camel@hpc-1> <1107811374.6917.6.camel@duffman>
Message-ID: <1113424438.4479.140.camel@localhost.localdomain>

Hi Tom,

On Mon, 2005-02-07 at 16:22, Tom Duffy wrote:
> [Responding to an old message]
> 
> On Thu, 2004-10-28 at 13:51 -0400, Hal Rosenstock wrote:
> > Should we teach ifconfig to display Link Encap: INFINIBAND ?
> 
> Still has the problem of truncating the address to the first 14 bytes.

I finally did this.  The HWaddr does not look right to me. Although the
formatting is correct now, it doesn't contain the port GUID as I think
it should. Does it appear for you ? Thanks.

-- Hal

./ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 00:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00

/usr/local/ib/bin/ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0008:f104:0396:0559

/usr/local/ib/bin/ibstat
CA 'mthca0'
...
        Node GUID: 0x0008f10403960558
...
        Port 1:
 ...
                Port GUID: 0x0008f10403960559

        Port 2:
...
                Port GUID: 0x0008f1040396055a
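
For comparison, here is a sketch of what an untruncated address for
port 1 would contain. It assumes the usual IPoIB link-layer layout of
20 bytes (1 flags byte, a 3-byte QPN, then the 16-byte GID whose low
8 bytes are the port GUID); IB_ADDR_LEN and both helper names are made
up for this example (the patch below uses INFINIBAND_ALEN). Note that
sa_data in struct sockaddr holds only 14 bytes, which fits the
truncation described above.

```c
#include <stdio.h>

#define IB_ADDR_LEN 20 /* assumed: 4 bytes flags/QPN + 16 bytes GID */

/* Format all 20 bytes as colon-separated hex; returns buf, or NULL if
 * buf cannot hold 20 * 2 digits + 19 colons + the terminating NUL. */
static char *format_ib_addr(const unsigned char *addr, char *buf, size_t buflen)
{
    size_t i;
    char *pos = buf;

    if (buflen < IB_ADDR_LEN * 3)
        return NULL;
    for (i = 0; i < IB_ADDR_LEN; i++)
        pos += sprintf(pos, "%s%02X", i ? ":" : "", addr[i]);
    return buf;
}

/* The last 8 bytes of the address (low half of the GID), big-endian. */
static unsigned long long ib_addr_port_guid(const unsigned char *addr)
{
    unsigned long long guid = 0;
    size_t i;

    for (i = IB_ADDR_LEN - 8; i < IB_ADDR_LEN; i++)
        guid = (guid << 8) | addr[i];
    return guid;
}
```

Fed with the QPN bytes shown by ifconfig followed by the default gid
reported by ibstatus, ib_addr_port_guid() returns 0x0008f10403960559,
the Port GUID that ibstat lists for port 1.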

> diff -Nur net-tools-1.60/config.in net-tools-1.60-ib/config.in
> --- net-tools-1.60/config.in	2000-05-21 07:32:12.000000000 -0700
> +++ net-tools-1.60-ib/config.in	2005-02-07 10:45:14.108286619 -0800
> @@ -82,6 +82,7 @@
>  bool '(Cisco)-HDLC/LAPB support' HAVE_HWHDLCLAPB n
>  bool 'IrDA support' HAVE_HWIRDA y
>  bool 'Econet hardware support' HAVE_HWEC n
> +bool 'InfiniBand hardware support' HAVE_HWIB y
>  *
>  *
>  *           Other Features.
> diff -Nur net-tools-1.60/config.make net-tools-1.60-ib/config.make
> --- net-tools-1.60/config.make	2005-02-07 11:58:18.536146922 -0800
> +++ net-tools-1.60-ib/config.make	2005-02-07 12:04:03.596462891 -0800
> @@ -30,6 +30,7 @@
>  HAVE_HWHDLCLAPB=1
>  HAVE_HWIRDA=1
>  HAVE_HWEC=1
> +HAVE_HWIB=1
>  HAVE_FW_MASQUERADE=1
>  HAVE_IP_TOOLS=1
>  HAVE_MII=1
> Binary files net-tools-1.60/ipmaddr and net-tools-1.60-ib/ipmaddr differ
> Binary files net-tools-1.60/iptunnel and net-tools-1.60-ib/iptunnel differ
> diff -Nur net-tools-1.60/lib/hw.c net-tools-1.60-ib/lib/hw.c
> --- net-tools-1.60/lib/hw.c	2000-05-20 11:27:25.000000000 -0700
> +++ net-tools-1.60-ib/lib/hw.c	2005-02-07 09:56:22.315428035 -0800
> @@ -73,6 +73,8 @@
>  
>  extern struct hwtype ec_hwtype;
>  
> +extern struct hwtype ib_hwtype;
> +
>  static struct hwtype *hwtypes[] =
>  {
>  
> @@ -144,6 +146,9 @@
>  #if HAVE_HWX25
>      &x25_hwtype,
>  #endif
> +#if HAVE_HWIB
> +    &ib_hwtype,
> +#endif
>      &unspec_hwtype,
>      NULL
>  };
> @@ -217,6 +222,9 @@
>  #if HAVE_HWEC
>      ec_hwtype.title = _("Econet");
>  #endif
> +#if HAVE_HWIB
> +    ib_hwtype.title = _("InfiniBand");
> +#endif
>      sVhwinit = 1;
>  }
>  
> diff -Nur net-tools-1.60/lib/ib.c net-tools-1.60-ib/lib/ib.c
> --- net-tools-1.60/lib/ib.c	1969-12-31 16:00:00.000000000 -0800
> +++ net-tools-1.60-ib/lib/ib.c	2005-02-07 12:55:04.635559244 -0800
> @@ -0,0 +1,147 @@
> +/*
> + * lib/ib.c        This file contains an implementation of the "Infiniband"
> + *              support functions.
> + *
> + * Version:     $Id: ib.c,v 1.1 2005/02/06 11:00:47 tduffy Exp $
> + *
> + * Author:      Fred N. van Kempen, <waltje at uwalt.nl.mugnet.org>
> + *              Copyright 1993 MicroWalt Corporation
> + *		Tom Duffy <tduffy at sun.com>
> + *
> + *              This program is free software; you can redistribute it
> + *              and/or  modify it under  the terms of  the GNU General
> + *              Public  License as  published  by  the  Free  Software
> + *              Foundation;  either  version 2 of the License, or  (at
> + *              your option) any later version.
> + */
> +#include "config.h"
> +
> +#if HAVE_HWIB
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <net/if_arp.h>
> +#include <linux/if_infiniband.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <errno.h>
> +#include <ctype.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include "net-support.h"
> +#include "pathnames.h"
> +#include "intl.h"
> +#include "util.h"
> +
> +extern struct hwtype ib_hwtype;
> +
> +
> +/* Display an InfiniBand address in readable format. */
> +static char *pr_ib(unsigned char *ptr)
> +{
> +    static char buff[128];
> +    char *pos;
> +    unsigned int i;
> +
> +    pos = buff;
> +    for (i = 0; i < INFINIBAND_ALEN; i++) {
> +	pos += sprintf(pos, "%02X:", (*ptr++ & 0377));
> +    }
> +    buff[strlen(buff) - 1] = '\0';
> +
> +    /* snprintf(buff, sizeof(buff), "%02X:%02X:%02X:%02X:%02X:%02X",
> +	     (ptr[0] & 0377), (ptr[1] & 0377), (ptr[2] & 0377),
> +	     (ptr[3] & 0377), (ptr[4] & 0377), (ptr[5] & 0377)
> +	);
> +    */
> +    return (buff);
> +}
> +
> +
> +/* Input an Infiniband address and convert to binary. */
> +static int in_ib(char *bufp, struct sockaddr *sap)
> +{
> +    unsigned char *ptr;
> +    char c, *orig;
> +    int i;
> +    unsigned val;
> +
> +    sap->sa_family = ib_hwtype.type;
> +    ptr = sap->sa_data;
> +
> +    i = 0;
> +    orig = bufp;
> +    while ((*bufp != '\0') && (i < INFINIBAND_ALEN)) {
> +	val = 0;
> +	c = *bufp++;
> +	if (isdigit(c))
> +	    val = c - '0';
> +	else if (c >= 'a' && c <= 'f')
> +	    val = c - 'a' + 10;
> +	else if (c >= 'A' && c <= 'F')
> +	    val = c - 'A' + 10;
> +	else {
> +#ifdef DEBUG
> +	    fprintf(stderr, _("in_ib(%s): invalid infiniband address!\n"), orig);
> +#endif
> +	    errno = EINVAL;
> +	    return (-1);
> +	}
> +	val <<= 4;
> +	c = *bufp;
> +	if (isdigit(c))
> +	    val |= c - '0';
> +	else if (c >= 'a' && c <= 'f')
> +	    val |= c - 'a' + 10;
> +	else if (c >= 'A' && c <= 'F')
> +	    val |= c - 'A' + 10;
> +	else if (c == ':' || c == 0)
> +	    val >>= 4;
> +	else {
> +#ifdef DEBUG
> +	    fprintf(stderr, _("in_ib(%s): invalid infiniband address!\n"), orig);
> +#endif
> +	    errno = EINVAL;
> +	    return (-1);
> +	}
> +	if (c != 0)
> +	    bufp++;
> +	*ptr++ = (unsigned char) (val & 0377);
> +	i++;
> +
> +	/* We might get a colon here - not required. */
> +	if (*bufp == ':') {
> +	    if (i == INFINIBAND_ALEN) {
> +#ifdef DEBUG
> +		fprintf(stderr, _("in_ib(%s): trailing : ignored!\n"),
> +			orig)
> +#endif
> +		    ;		/* nothing */
> +	    }
> +	    bufp++;
> +	}
> +    }
> +
> +    /* That's it.  Any trailing junk? */
> +    if ((i == INFINIBAND_ALEN) && (*bufp != '\0')) {
> +#ifdef DEBUG
> +	fprintf(stderr, _("in_ib(%s): trailing junk!\n"), orig);
> +	errno = EINVAL;
> +	return (-1);
> +#endif
> +    }
> +#ifdef DEBUG
> +    fprintf(stderr, "in_ib(%s): %s\n", orig, pr_ib(sap->sa_data));
> +#endif
> +
> +    return (0);
> +}
> +
> +
> +struct hwtype ib_hwtype =
> +{
> +    "infiniband", NULL, ARPHRD_INFINIBAND, INFINIBAND_ALEN,
> +    pr_ib, in_ib, NULL
> +};
> +
> +
> +#endif				/* HAVE_HWIB */
> diff -Nur net-tools-1.60/lib/Makefile net-tools-1.60-ib/lib/Makefile
> --- net-tools-1.60/lib/Makefile	2000-10-28 03:59:42.000000000 -0700
> +++ net-tools-1.60-ib/lib/Makefile	2005-02-07 10:02:14.662640164 -0800
> @@ -16,7 +16,7 @@
>  #
>  
> 
> -HWOBJS	 = hw.o loopback.o slip.o ether.o ax25.o ppp.o arcnet.o tr.o tunnel.o frame.o sit.o rose.o ash.o fddi.o hippi.o hdlclapb.o strip.o irda.o ec_hw.o x25.o
> +HWOBJS	 = hw.o loopback.o slip.o ether.o ax25.o ppp.o arcnet.o tr.o tunnel.o frame.o sit.o rose.o ash.o fddi.o hippi.o hdlclapb.o strip.o irda.o ec_hw.o x25.o ib.o
>  AFOBJS	 = unix.o inet.o inet6.o ax25.o ipx.o ddp.o ipx.o netrom.o af.o rose.o econet.o x25.o
>  AFGROBJS = inet_gr.o inet6_gr.o ipx_gr.o ddp_gr.o netrom_gr.o ax25_gr.o rose_gr.o getroute.o x25_gr.o
>  AFSROBJS = inet_sr.o inet6_sr.o netrom_sr.o ipx_sr.o setroute.o x25_sr.o
> Binary files net-tools-1.60/mii-tool and net-tools-1.60-ib/mii-tool differ
> 


From jlentini at netapp.com  Wed Apr 13 13:49:15 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 13 Apr 2005 16:49:15 -0400 (EDT)
Subject: [openib-general] target InfiniBand release
Message-ID: <Pine.LNX.4.61.0504131644140.1689@jlentini-linux.nane.netapp.com>


Which version of the InfiniBand specification does the gen2 stack 
target? Release 1.0, 1.1, or 1.2?

james


From halr at voltaire.com  Wed Apr 13 13:51:43 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Apr 2005 16:51:43 -0400
Subject: [openib-general] target InfiniBand release
In-Reply-To: <Pine.LNX.4.61.0504131644140.1689@jlentini-linux.nane.netapp.com>
References: <Pine.LNX.4.61.0504131644140.1689@jlentini-linux.nane.netapp.com>
Message-ID: <1113425503.4479.155.camel@localhost.localdomain>

On Wed, 2005-04-13 at 16:49, James Lentini wrote:
> Which version of the InfiniBand specification does the gen2 stack 
> target? Release 1.0, 1.1, or 1.2?

Mostly 1.1 but a little 1.2

-- Hal


From mshefty at ichips.intel.com  Wed Apr 13 13:57:48 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 13 Apr 2005 13:57:48 -0700
Subject: [openib-general] target InfiniBand release
In-Reply-To: <Pine.LNX.4.61.0504131644140.1689@jlentini-linux.nane.netapp.com>
References: <Pine.LNX.4.61.0504131644140.1689@jlentini-linux.nane.netapp.com>
Message-ID: <425D87CC.9080109@ichips.intel.com>

James Lentini wrote:
> 
> Which version of the InfiniBand specification does the gen2 stack 
> target? Release 1.0, 1.1, or 1.2?

I reference the 1.2 version of the spec when coding.

- Sean


From tduffy at sun.com  Wed Apr 13 14:27:54 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 13 Apr 2005 14:27:54 -0700
Subject: [openib-general] Re: [PATCH] teach ifconfig about ib [WAS: Latest
	IPoIB Bringup Questions]
In-Reply-To: <1113424438.4479.140.camel@localhost.localdomain>
References: <1098985903.17991.74.camel@hpc-1> <1107811374.6917.6.camel@duffman>
	<1113424438.4479.140.camel@localhost.localdomain>
Message-ID: <1113427674.26977.0.camel@duffman>

On Wed, 2005-04-13 at 16:33 -0400, Hal Rosenstock wrote:
> Hi Tom,
> 
> On Mon, 2005-02-07 at 16:22, Tom Duffy wrote:
> > [Responding to an old message]
> > 
> > On Thu, 2004-10-28 at 13:51 -0400, Hal Rosenstock wrote:
> > > Should we teach ifconfig to display Link Encap: INFINIBAND ?
> > 
> > Still has the problem of truncating the address to the first 14 bytes.
> 
> I finally did this.  The HWaddr does not look right to me. Although the
> formatting is correct now, it doesn't contain the port GUID as I think
> it should. Does it appear for you ? Thanks.

No, the address is always truncated.

-tduffy

From tziporet at mellanox.co.il  Wed Apr 13 23:08:28 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 14 Apr 2005 09:08:28 +0300
Subject: [openib-general] Re: uverbs events
Message-ID: <506C3D7B14CDD411A52C00025558DED6064BF376@mtlex01.yok.mtl.com>

Fedora Core 3 ships gcc 3.4.2.
RH AS 4 already uses gcc 3.4.3.

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:roland at topspin.com]
Sent: Wednesday, April 13, 2005 7:23 PM
To: Tziporet Koren
Cc: Grant Grundler; openib-general at openib.org
Subject: Re: [openib-general] Re: uverbs events


    Tziporet> Well, this is the description of this bug, but we hit
    Tziporet> this problem with 64-bit arithmetic in a simple if,
    Tziporet> not a switch.

OK, thanks.  Do you know of any distributions that are shipping gcc 3.4.2?

 - R.

From eitan at mellanox.co.il  Wed Apr 13 23:29:08 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Apr 2005 09:29:08 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>

Hi Shahar,

> 	Your analysis is not completely accurate. The SM configures the
> subnet using direct MADs only, and it builds a spanning tree of direct
> routes. What I want to say is that it doesn't matter why exactly a
> port is unreachable. Once a port cannot be reached, you can
> retry the entire heavy sweep process, but if the problem repeats itself
> (X times) on the same port, you have no alternative other than to disable
> it.
The point is that the real "bad" ports are not the ones that are killing
100% of packets
(since they will simply have a "DOWN" state and vanish).

The real bad ports are the ones that pass < 25% (as we use a retry of 4) of
the packets that go through them.

When such a port happens to be on a switch, it will normally cause other ports
to appear to be "bad" - NOT ITSELF!
The reason is that the number of packets sent through a switch port
(not a leaf switch port) is much larger than the number of packets that
deal with the discovery of the port itself. All the packets to ports "behind"
the switch port will go through that port. And there is a much higher chance
for ALL the packets that go to an end-port to be dropped than for ALL
the packets that go through the switch port to be dropped.

So if you implement the feature the way it was proposed, what you will end up
with is disconnecting end-ports and not the real bad port.

Why is it bad? It is bad since in a tree topology the end-ports always have an
alternate path to the SM. If you could find the real flaky bad port, you
could still communicate with all the end-ports.

So how do we find the bad port/cable that causes other ports to appear bad?
We have had many long internal discussions on this topic. The algorithm is
not fully developed yet. But several things are clear:
1. One needs to track the number of successful and bad packets flowing
through each port, such that a failure rate can be obtained for each port.
2. Topology-based analysis should be used to find the common point that is
first to have a high drop rate on the directed route tree.
3. Alternate directed routes might be used to invalidate "suspicious" ports.
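
A minimal sketch of points 1 and 2, under loud assumptions: ports are
identified by small integer indices, a directed route is an array of
those indices ordered from the SM outward, and the threshold is
arbitrary. None of these names or structures come from OpenSM, and
point 3 is not covered.

```c
#include <stddef.h>

/* Hypothetical per-port counters, indexed by port id. */
struct port_stats {
    unsigned long ok;      /* packets through this port that got a response */
    unsigned long dropped; /* packets through this port that timed out */
};

/* Point 1: charge one directed-route packet to every port on its path. */
static void record_packet(struct port_stats *stats, const int *path,
                          size_t hops, int delivered)
{
    size_t i;

    for (i = 0; i < hops; i++) {
        if (delivered)
            stats[path[i]].ok++;
        else
            stats[path[i]].dropped++;
    }
}

/* Fraction of packets through this port that were dropped (0 if none seen). */
static double drop_rate(const struct port_stats *s)
{
    unsigned long total = s->ok + s->dropped;

    return total ? (double)s->dropped / (double)total : 0.0;
}

/* Point 2, simplified: walk a route outward from the SM and return the
 * first port whose drop rate exceeds the threshold, or -1 if none does. */
static int first_suspect(const struct port_stats *stats, const int *path,
                         size_t hops, double threshold)
{
    size_t i;

    for (i = 0; i < hops; i++) {
        if (drop_rate(&stats[path[i]]) > threshold)
            return path[i];
    }
    return -1;
}
```

With two end-ports behind one flaky switch port, every drop is charged
to the shared switch port as well as to the end-port, so walking either
route from the SM side reaches the switch port first and the end-ports
behind it are not the ones blamed.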

In any case, I was not proposing relying on traps. I was suggesting using
the "healthy" bit on physical ports as the way to carry the information about
"bad" ports (once we correctly find them) into the rest of the algorithms
used by the SM.

Regarding the need to "disconnect" a bad HCA "end-port" - I still have not
seen any log showing OpenSM going through infinite "polling" of bad ports.
As I know the code, I cannot believe this is possible - so unless you have
a log that shows this phenomenon (and not another one), please do not chase
this path.

One last word: I would highly recommend using the management simulator to
set arbitrary (random) packet drops and test any algorithm you might
think of.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: shaharf [mailto:shaharf at voltaire.com]
> Sent: Wednesday, April 13, 2005 5:03 PM
> To: Eitan Zahavi; Hal Rosenstock
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> Eitan,
> 
> 	Your analysis is not completely accurate. The SM configure the
> subnet using direct mads only, and it builds a spanning tree of direct
> routes. What I want to say, is that that it doesn't matter why exactly a
> port is unreachable. Once a port can not be reached, you can either
> retry the entire heavy sweep process, but if the problem repeats itself
> (X times) on the same port, you have no alternative other then disable
> it. If the SM will have an alternative method of building direct paths,
> then such alternative path could be attempted. Currently it is not
> relevant. Speaking of "statistical analysis", what are the odds that a
> port will behave well when it is queried directly, but starts to loose
> packets when a direct route is routed through it, and behave
> consistently during all retries? Again, even if this is the case (and in
> understatement, I am not sure how frequent it is), the port behind it is
> unreachable and therefore "bad".
> 
> The current unhealthy port mechanism is not redundant to this "bad" port
> mechanism because it does not handle the same case. Both mechanisms are
> required. The issue if they can share the same status bit is really an
> implementation issue.
> 
> Relying of traps is very problematic in some cases, particularly in
> initial bring up sweep when the SM lid is not even configured (remember
> VTEC?).
> 
> Shahar
> 
> 
> ________________________________________
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
> Sent: Wednesday, April 13, 2005 11:21 AM
> To: Hal Rosenstock; Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> I probably did not make point very clear:
> It is bad (not to say wrong) to disqualify a port and mark it as bad
> port if it did not respond to queries.
> The cause of the issue might be a flaky link on the directed route to
> the port.
> If the SM would be able to find that flaky link port it would avoid
> marking the wrong ports. More over, the port that was almost marked as
> bad by the simplistic algorithm you propose will be discovered and
> operational as there many other paths to reach it - walking around the
> real bad port !
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, April 13, 2005 12:00 PM
> > To: Eitan Zahavi
> > Cc: openib-general at openib.org
> > Subject: RE: [openib-general] SM Bad Port Handling
> >
> > On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> > > [EZ] This is true. Currently there is only one cause for the
> > > un-healthy bits to be set - which are exactly as you point - these
> > > traps. The point I was trying to make was that this bit is the
> > > mechanism for flagging a port status is bad.
> > >
> > > What I did recommend was to write a "statistical" analysis of
> Directed
> > > Route packet drop - such that we can find the ports with a high drop
> 
> > > rate and mark them as un-healthy. If you mark every port that does
> not
> > > respond to a MAD as un-healthy you can suffer from flaky links
> > > somewhere on the route to that port. Only analysis of the number of
> > > good packets vs. dropped packets can lead you to the right bad port.
> 
> >
> > The original proposal on this said the following:
> >
> > "The OpenSM will implement a configurable policy (some number of
> > consecutive lack of responses to SM requests). At the point of
> > exhaustion of the timeout/retry strategy, that port will be marked as
> > "bad" by OpenSM."
> >
> > Any idea on what might make a good default threshold (for consecutive
> > retries) ? Do you think there is no sufficient default ?
> >
> > If a link is flaky and MADs can't get through, should it be used for
> > non MAD traffic ?
> >
> > Also note that the proposal also said:
> >
> > "Also, there could also be a periodic "ping" at a slower rate to check
> > if the "bad" ports revive."
> >
> > In terms of analysis of good v. errored and dropped packets (along the
> > path to that node), there are OpenIB diagnostic tools to help with
> > this.
> >
> > -- Hal

From mst at mellanox.co.il  Thu Apr 14 01:06:48 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Apr 2005 11:06:48 +0300
Subject: [openib-general] Re: patches
In-Reply-To: <52hdif3ggn.fsf@topspin.com>
References: <20050408093558.GB21709@mellanox.co.il>
	<52psx545sy.fsf@topspin.com>
	<20050409172150.GA31200@mellanox.co.il>
	<52hdif3ggn.fsf@topspin.com>
Message-ID: <20050414080648.GE32526@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: patches
> 
>     Michael> If I remember
>     Michael> correctly alloc_consistent and free_consistent in init_ib
>     Michael> currently get different sizes, isn't that wrong?
> 
> Yes, that needs to be fixed.
> 
>  - R.
> 

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c	(revision 2169)
+++ src/linux-kernel/infiniband/hw/mthca/mthca_cmd.c	(working copy)
@@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
 			CMD_TIME_CLASS_A, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
 	return err;
 }
 

-- 
MST - Michael S. Tsirkin


From halr at voltaire.com  Thu Apr 14 03:32:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Apr 2005 06:32:02 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF101@mtlex01.yok.mtl.com>
Message-ID: <1113474722.4479.299.camel@localhost.localdomain>

On Thu, 2005-04-14 at 02:29, Eitan Zahavi wrote:
> The point is that the real "bad" ports are not the ones that are
> killing 100% of packets
> (since they will simply have a "DOWN" state and vanish).
> 
> The real bad ports are the ones that pass < 25% (as we use retry of 4)
> of packets that go through them.

When the SM sends a direct route MAD it saves the port guid (and port
num) in the madw context, so that when there is a reply or timeout you
can easily find the port. That means you don't have to walk the entire DR
path to find the unhealthy port. That means that the peer port (from
which we arrived at the bad port) is unhealthy. Does this address your
concern?

-- Hal


From eitan at mellanox.co.il  Thu Apr 14 04:13:18 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Apr 2005 14:13:18 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com>

> 
> When the SM sends a direct route MAD it saves the port guid (and port
> num) in the madw context, so that when there is a reply or timeout you
> can easily find the port. That means you dont have to walk the entire DR
> path to find the unhealthy port. That means that the peer port (from
> which we arrived to the bad port) is unhealthy. Does this address your
> concern ?
> 
[EZ] Not at all. Although the target port is known, the flaky link that
fails the MAD might be anywhere along the path to the port. So, if you mark
the target port as bad you might be marking the wrong port!

 [EZ] Let me clarify with an example:
SM=HCA1/P1 -> SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1
 
\..SW4/P4->SW3/P4..SW3/P5->SW3/P2../
                           
If the flaky link is between SW2/P2 and SW3/P1 then the packet sent to HCA2
using DR : [0][1][1][2][3] might fail. If you mark HCA2/P1 as bad then you
will actually lose that HCA for no good reason, since another path from SM
to HCA2 exists.

EZ
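
A minimal sketch of the good-versus-dropped bookkeeping argued for above,
not OpenSM code. It assumes each directed-route MAD is charged to every
port on its path, and that the port with the highest drop ratio is the
suspect. The names port_stats, record_mad and worst_port, and the small
integer port indices, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-port counters; not taken from the OpenSM sources. */
struct port_stats {
	unsigned int sent;    /* DR MADs whose path crossed this port */
	unsigned int dropped; /* how many of those never got a response */
};

/* Charge one MAD outcome to every port on its directed-route path. */
static void record_mad(struct port_stats *ports, const int *path,
		       size_t hops, int timed_out)
{
	size_t i;

	assert(ports != NULL && path != NULL);
	for (i = 0; i < hops; i++) {
		ports[path[i]].sent++;
		if (timed_out)
			ports[path[i]].dropped++;
	}
}

/*
 * Return the index of the port with the highest dropped/sent ratio among
 * ports with at least min_samples MADs, or -1 if no such port saw a drop.
 * Ratios are compared by cross-multiplication to avoid floating point.
 */
static int worst_port(const struct port_stats *ports, size_t nports,
		      unsigned int min_samples)
{
	int worst = -1;
	size_t i;

	for (i = 0; i < nports; i++) {
		if (ports[i].sent < min_samples || ports[i].dropped == 0)
			continue;
		if (worst < 0 ||
		    (unsigned long long)ports[i].dropped * ports[worst].sent >
		    (unsigned long long)ports[worst].dropped * ports[i].sent)
			worst = (int)i;
	}
	return worst;
}
```

With four timed-out MADs over a path {0, 1, 2} and four answered MADs
over an alternate path {0, 3, 2}, port 1 ends up with a 4/4 drop ratio
while the target port 2 sits at 4/8, so the target is not the one flagged.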

From roland at topspin.com  Thu Apr 14 11:39:17 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 11:39:17 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <20050412192501.GA18034@esmail.cup.hp.com> (Grant Grundler's
	message of "Tue, 12 Apr 2005 12:25:01 -0700")
References: <20050412192501.GA18034@esmail.cup.hp.com>
Message-ID: <52zmw1mcdm.fsf@topspin.com>

I think I have this figured out: if you unload ib_ipoib and
ib_sa_query in quick succession, ib_ipoib sends MCMember requests to
the SA to leave its multicast groups.  Normally, because IPoIB sets a
timeout of 0, no callback is generated and so it's fine that IPoIB
passes a NULL callback.  However, if ib_sa_query is unloaded right
afterwards, the send of the request doesn't get a chance to complete
and so a cancel callback is generated.

If this crash is at all reproducible for you, can you try this patch
and see if it helps?

Thanks,
  Roland

--- infiniband/core/sa_query.c	(revision 1781)
+++ infiniband/core/sa_query.c	(working copy)
@@ -587,7 +587,7 @@
 
 	init_mad(query->sa_query.mad, agent);
 
-	query->sa_query.callback              = ib_sa_path_rec_callback;
+	query->sa_query.callback              = callback ? ib_sa_path_rec_callback : NULL;
 	query->sa_query.release               = ib_sa_path_rec_release;
 	query->sa_query.port                  = port;
 	query->sa_query.mad->mad_hdr.method   = IB_MGMT_METHOD_GET;
@@ -663,7 +663,7 @@
 
 	init_mad(query->sa_query.mad, agent);
 
-	query->sa_query.callback              = ib_sa_mcmember_rec_callback;
+	query->sa_query.callback              = callback ? ib_sa_mcmember_rec_callback : NULL;
 	query->sa_query.release               = ib_sa_mcmember_rec_release;
 	query->sa_query.port                  = port;
 	query->sa_query.mad->mad_hdr.method   = method;
@@ -698,20 +698,21 @@
 	if (!query)
 		return;
 
-	switch (mad_send_wc->status) {
-	case IB_WC_SUCCESS:
-		/* No callback -- already got recv */
-		break;
-	case IB_WC_RESP_TIMEOUT_ERR:
-		query->callback(query, -ETIMEDOUT, NULL);
-		break;
-	case IB_WC_WR_FLUSH_ERR:
-		query->callback(query, -EINTR, NULL);
-		break;
-	default:
-		query->callback(query, -EIO, NULL);
-		break;
-	}
+	if (query->callback)
+		switch (mad_send_wc->status) {
+		case IB_WC_SUCCESS:
+			/* No callback -- already got recv */
+			break;
+		case IB_WC_RESP_TIMEOUT_ERR:
+			query->callback(query, -ETIMEDOUT, NULL);
+			break;
+		case IB_WC_WR_FLUSH_ERR:
+			query->callback(query, -EINTR, NULL);
+			break;
+		default:
+			query->callback(query, -EIO, NULL);
+			break;
+		}
 
 	dma_unmap_single(agent->device->dma_device,
 			 pci_unmap_addr(query, mapping),
@@ -736,7 +737,7 @@
 	query = idr_find(&query_idr, mad_recv_wc->wc->wr_id);
 	spin_unlock_irqrestore(&idr_lock, flags);
 
-	if (query) {
+	if (query && query->callback) {
 		if (mad_recv_wc->wc->status == IB_WC_SUCCESS)
 			query->callback(query,
 					mad_recv_wc->recv_buf.mad->mad_hdr.status ?
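
The guard this patch adds can be shown in isolation. The sketch below is
plain userspace C, not the sa_query.c code: struct query, enum send_status
and complete_send() are hypothetical stand-ins for the kernel types, and
the status-to-errno mapping simply mirrors the switch in the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the IB_WC_* send completion statuses. */
enum send_status {
	SEND_SUCCESS,
	SEND_RESP_TIMEOUT,
	SEND_FLUSHED,
	SEND_OTHER_ERR
};

struct query {
	void (*callback)(struct query *query, int status); /* may be NULL */
	int last_status; /* written by record_status() below */
};

/* Example callback: remember the status that was reported. */
static void record_status(struct query *query, int status)
{
	query->last_status = status;
}

/* Returns 1 if the callback was invoked, 0 otherwise. */
static int complete_send(struct query *query, enum send_status status)
{
	/* The guard: a caller that passed no callback is never called back. */
	if (!query || !query->callback)
		return 0;

	switch (status) {
	case SEND_SUCCESS:
		return 0; /* no callback here; the receive path reports it */
	case SEND_RESP_TIMEOUT:
		query->callback(query, -ETIMEDOUT);
		break;
	case SEND_FLUSHED:
		query->callback(query, -EINTR);
		break;
	default:
		query->callback(query, -EIO);
		break;
	}
	return 1;
}
```

A query set up with a NULL callback, as IPoIB does for its leave
requests, passes through complete_send() untouched even when the send is
flushed; a query with a callback gets -EINTR in the same situation.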


From roland at topspin.com  Thu Apr 14 12:35:48 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 12:35:48 -0700
Subject: [openib-general] Topspin, Cisco and OpenIB
Message-ID: <52r7hdm9rf.fsf@topspin.com>

By now I'm sure most of you have heard the news that Cisco is
acquiring Topspin.  As you can see from the headline of the release
(http://newsroom.cisco.com/dlls/2005/corp_041405.html?CMP=ILC-001):

  Cisco Systems to Acquire Topspin Communications

  Broadens Data Center Portfolio with Server Fabric Switches,
  InfiniBand Technology, and Server Virtualization Software

Cisco is putting InfiniBand front and center, and everything I've
heard from Cisco during the process confirms that they are excited
about IB and want to continue to expand the IB market in high
performance computing and beyond.

Open source InfiniBand software is a key part of this plan, and Libor
and I will continue our efforts in OpenIB.  If anything, the Cisco
acquisition will allow us to focus even more resources on OpenIB, and
I think the deal will be a huge win for OpenIB and the InfiniBand world.

Thanks,
  Roland


From iod00d at hp.com  Thu Apr 14 14:20:34 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 14 Apr 2005 14:20:34 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <52zmw1mcdm.fsf@topspin.com>
References: <20050412192501.GA18034@esmail.cup.hp.com>
	<52zmw1mcdm.fsf@topspin.com>
Message-ID: <20050414212034.GH25145@esmail.cup.hp.com>

On Thu, Apr 14, 2005 at 11:39:17AM -0700, Roland Dreier wrote:
> I think I have this figured out: if you unload ib_ipoib and
> ib_sa_query in quick succession, ib_ipoib sends MCMember requests to
> the SA to leave its multicast groups.  Normally, because IPoIB sets a
> timeout of 0, no callback is generated and so it's fine that IPoIB
> passes a NULL callback.  However, if ib_sa_query is unloaded right
> afterwards, the send of the request doesn't get a chance to complete
> and so a cancel callback is generated.
> 
> If this crash is at all reproducible for you, can you try this patch
> and see if it helps?

I haven't reproduced it yet...but I'm going to put a machine
in an infinite loop running the unload/load script.

Once I know how long it takes to reproduce, I can comfortably
tell you if it's fixed or not.

thanks,
grant


From iod00d at hp.com  Thu Apr 14 16:58:13 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 14 Apr 2005 16:58:13 -0700
Subject: [openib-general] NULL ptr derefence
In-Reply-To: <52zmw1mcdm.fsf@topspin.com>
References: <20050412192501.GA18034@esmail.cup.hp.com>
	<52zmw1mcdm.fsf@topspin.com>
Message-ID: <20050414235813.GA26772@esmail.cup.hp.com>

On Thu, Apr 14, 2005 at 11:39:17AM -0700, Roland Dreier wrote:
...
> If this crash is at all reproducible for you, 
...

I tried to reproduce the crash with SVN r2168.
But I was only able to produce a "hang".
The one liner was:
	while :
	do
		date
		reload_ib
	done

"reload_ib" just unloads all the modules, loads ib_mthca, ib_ipoib
and ib_sdp modules, and lastly ifconfig's up the ib0/1 interfaces.

I had a "sleep 3" after 'date' and that ran for 10 minutes or
so with no problems. Without the sleep, it ran for 5 minutes
with no problem. I then ran "ping -f 10.0.0.113" from another
host just to get the target a bit busy and that hung the target
machine after a few minutes.

I've parked the reload_ib script, System.map, and "errdump init"
(ib_hang-2.6.11-pa1.txt) output on
	ftp://gsyprf10.external.hp.com/pub/openib/

I've got a couple other fires and administrivia to deal with
and won't be able to mess with this again today.

grant


From kjreilly at us.ibm.com  Thu Apr 14 19:32:37 2005
From: kjreilly at us.ibm.com (Kevin Reilly)
Date: Thu, 14 Apr 2005 22:32:37 -0400
Subject: [openib-general] openIB gen2 user space verbs API
Message-ID: <OF457D6B23.566DB008-ON85256FE4.000D9B2E-85256FE4.000DF84A@us.ibm.com>


I was wondering where I could find information on the openIB gen2 user
space verbs API.  One of my key questions is how much it differs from
VAPI.  Is it a superset of VAPI, and if so, what functions were added?

Kevin J. Reilly
STSM, HPC Architecture
-Federation/HPS  Chief Engineer
-HPC interconnect architect
(office) 845-433-7976  (tieline) 8-293-7976


From roland at topspin.com  Thu Apr 14 19:55:40 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 14 Apr 2005 19:55:40 -0700
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <OF457D6B23.566DB008-ON85256FE4.000D9B2E-85256FE4.000DF84A@us.ibm.com>
	(Kevin Reilly's message of "Thu, 14 Apr 2005 22:32:37 -0400")
References: <OF457D6B23.566DB008-ON85256FE4.000D9B2E-85256FE4.000DF84A@us.ibm.com>
Message-ID: <527jj4n3yr.fsf@topspin.com>

    Kevin> I was wonder where i could find information on the openIB
    Kevin> gen2 user space verbs API?  One of my key questions is how
    Kevin> much different then VAPI.  Is it a superset of VAPI and if
    Kevin> so what function was added?

Right now the best way to find out about the userspace verbs API is to
look at the libibverbs source.  The include file infiniband/verbs.h
and the code in the examples directory are probably the best places to
get started.

The current code implements all the main verbs likely to be used by
userspace applications.  There are no functions that I can think of
added beyond what's in VAPI.

 - R.


From RAISCH at de.ibm.com  Fri Apr 15 06:15:10 2005
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Fri, 15 Apr 2005 15:15:10 +0200
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <527jj4n3yr.fsf@topspin.com>
Message-ID: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>

Roland,
reading the userspace infiniband/verbs.h file,
where did query QP go?

Gruss / Regards . . . Christoph Raisch


openib-general-bounces at openib.org wrote on 15.04.2005 04:55:40:

>     Kevin> I was wonder where i could find information on the openIB
>     Kevin> gen2 user space verbs API?  One of my key questions is how
>     Kevin> much different then VAPI.  Is it a superset of VAPI and if
>     Kevin> so what function was added?
> 
> Right now the best way to find out about the userspace verbs API is to
> look at the libibverbs source.  The include file infiniband/verbs.h
> and the code in the examples directory are probably the best places to
> get started.
> 
> The current code implements all the main verbs likely to be used by
> userspace applications.  There are no functions that I can think of
> added beyond what's in VAPI.
> 
>  - R.

From halr at voltaire.com  Fri Apr 15 06:35:10 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Apr 2005 09:35:10 -0400
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
Message-ID: <1113572109.4479.68.camel@localhost.localdomain>

On Fri, 2005-04-15 at 09:15, Christoph Raisch wrote:
> Roland,
> reading the userspace infiniband/verbs.h file,
> where did query QP go?

That's one that is missing in user verbs (and mthca) currently. It is in
kernel verbs but returns -ENOSYS currently.

-- Hal

> 
> Gruss / Regards . . . Christoph Raisch
> 
> 
> openib-general-bounces at openib.org wrote on 15.04.2005 04:55:40:
> 
> >     Kevin> I was wonder where i could find information on the openIB
> >     Kevin> gen2 user space verbs API?  One of my key questions is how
> >     Kevin> much different then VAPI.  Is it a superset of VAPI and if
> >     Kevin> so what function was added?
> > 
> > Right now the best way to find out about the userspace verbs API is to
> > look at the libibverbs source.  The include file infiniband/verbs.h
> > and the code in the examples directory are probably the best places to
> > get started.
> > 
> > The current code implements all the main verbs likely to be used by
> > userspace applications.  There are no functions that I can think of
> > added beyond what's in VAPI.
> > 
> >  - R.


From roland at topspin.com  Fri Apr 15 08:49:20 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 08:49:20 -0700
Subject: [openib-general] openIB gen2 user space verbs API
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
Message-ID: <527jj4kpkv.fsf@topspin.com>

    Christoph> Roland, reading the userspace infiniband/verbs.h file,
    Christoph> where did query QP go?

It's not implemented yet.  Is there an application that needs it?

 - R.


From ardavis at ichips.intel.com  Fri Apr 15 13:27:41 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Fri, 15 Apr 2005 13:27:41 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
Message-ID: <426023BD.8080504@ichips.intel.com>

Hello Roland,

I have openib uDAPL up and running with most of our internal MPI test 
suites (Intel-MPI). Pretty impressive with such an early code drop of 
user verbs. Nice job!

With a little stress, I see the following oops (running latest from the 
trunk). Let me know if you need any more information.

Apr 15 13:03:27 iclust-19 kernel:  <1>Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
Apr 15 13:03:27 iclust-19 kernel: <ffffffff803815f0>{ib_umem_get+272}
Apr 15 13:03:27 iclust-19 kernel: PGD 33933067 PUD 32a58067 PMD 0
Apr 15 13:03:27 iclust-19 kernel: Oops: 0000 [2] SMP
Apr 15 13:03:27 iclust-19 kernel: CPU 0
Apr 15 13:03:27 iclust-19 kernel: Modules linked in:
Apr 15 13:03:27 iclust-19 kernel: Pid: 13502, comm: transpose2 Not tainted 2.6.11
Apr 15 13:03:27 iclust-19 kernel: RIP: 0010:[<ffffffff803815f0>] <ffffffff803815f0>{ib_umem_get+272}
Apr 15 13:03:27 iclust-19 kernel: RSP: 0018:ffff81002ed4ddd8  EFLAGS: 00010206
Apr 15 13:03:27 iclust-19 kernel: RAX: 0000800000000000 RBX: 000000000000b000 RCX: 00007fffffff5000
Apr 15 13:03:27 iclust-19 kernel: RDX: 0000000000000000 RSI: 00007fffffff5000 RDI: ffff810027f9e940
Apr 15 13:03:27 iclust-19 kernel: RBP: 00007fffffff5000 R08: 0000000000000000 R09: 0000000000000000
Apr 15 13:03:27 iclust-19 kernel: R10: 0000000000030b24 R11: 0000000000000000 R12: ffff810031815c80
Apr 15 13:03:27 iclust-19 kernel: R13: 0000000000000000 R14: 00007fffffff5000 R15: ffff81002ed15000
Apr 15 13:03:27 iclust-19 kernel: FS:  00002aaaaae55f40(0000) GS:ffffffff805fe400(0000) knlGS:0000000000000000
Apr 15 13:03:27 iclust-19 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 15 13:03:27 iclust-19 kernel: CR2: 0000000000000010 CR3: 0000000034b16000 CR4: 00000000000006e0
Apr 15 13:03:27 iclust-19 kernel: Process transpose2 (pid: 13502, threadinfo ffff81002ed4c000, task ffff81003e3f62f0)
Apr 15 13:03:27 iclust-19 kernel: Stack: ffff810033391ab8 ffffffff80168e62 000000000000000d ffff810031815cc8
Apr 15 13:03:27 iclust-19 kernel:        000000000000000b 0000000000000000 ffff810031815ca8 ffff81000235a000
Apr 15 13:03:27 iclust-19 kernel:        ffffffff804ca110 0000000000000030
Apr 15 13:03:27 iclust-19 kernel: Call Trace:<ffffffff80168e62>{handle_mm_fault+418} <ffffffff80380424>{ib_uverbs_reg_mr+212}
Apr 15 13:03:27 iclust-19 kernel:        <ffffffff8037f486>{ib_uverbs_write+150} <ffffffff8017ad14>{vfs_write+196}
Apr 15 13:03:27 iclust-19 kernel:        <ffffffff8017ae73>{sys_write+83} <ffffffff8010e30a>{system_call+126}
Apr 15 13:03:27 iclust-19 kernel:
Apr 15 13:03:27 iclust-19 kernel:
Apr 15 13:03:27 iclust-19 kernel: Code: 4c 8b 72 10 eb ba 49 89 ee 49 81 e6 00 f0 ff ff 8b 4c 24 20
Apr 15 13:03:27 iclust-19 kernel: RIP <ffffffff803815f0>{ib_umem_get+272} RSP <ffff81002ed4ddd8>
Apr 15 13:03:27 iclust-19 kernel: CR2: 0000000000000010

Thanks,

-arlin


From roland at topspin.com  Fri Apr 15 13:19:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 13:19:27 -0700
Subject: [openib-general] Re: Gen2 User verbs usage
In-Reply-To: <20050415182702.GA5572@cse.ohio-state.edu> (Sayantan Sur's
	message of "Fri, 15 Apr 2005 14:27:02 -0400")
References: <20050415182702.GA5572@cse.ohio-state.edu>
Message-ID: <52hdi7kd2o.fsf@topspin.com>

>>>>> "Sayantan" == Sayantan Sur <surs at cse.ohio-state.edu> writes:

    Sayantan> Hi Roland, I have some questions regarding the usage of
    Sayantan> the new Gen2 verbs.

    Sayantan> 1. Polling CQ : I notice that this verb is a little
    Sayantan> different from VAPI_poll_cq, in the sense that it
    Sayantan> accepts a parameter to poll for multiple completion
    Sayantan> entries. So, if I have a statement like:

    Sayantan>     ne = ibv_poll_cq(hca.cq, 1, &wc);

    Sayantan> I want to poll for one completion. Does `ne' hold the
    Sayantan> return status or number of elements actually pulled out
    Sayantan> of CQ?

Sorry, this should be documented better in the userspace library.  The
semantics are identical to the ib_poll_cq() function in the kernel:

 * Poll a CQ for (possibly multiple) completions.  If the return value
 * is < 0, an error occurred.  If the return value is >= 0, it is the
 * number of completions returned.  If the return value is
 * non-negative and < num_entries, then the CQ was emptied.
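
The three cases in that comment can be written out as a small classifier.
This is a sketch only: it does not call ibv_poll_cq(), and classify_poll()
and enum poll_outcome are hypothetical names; ne stands for the return
value and num_entries for the count that was requested.

```c
#include <assert.h>

enum poll_outcome {
	POLL_ERROR,         /* return value < 0 */
	POLL_DRAINED,       /* 0 <= ne < num_entries: the CQ was emptied */
	POLL_MORE_POSSIBLE  /* ne == num_entries: the CQ may hold more */
};

/* Interpret the return value of a poll that asked for num_entries. */
static enum poll_outcome classify_poll(int ne, int num_entries)
{
	assert(num_entries > 0);

	if (ne < 0)
		return POLL_ERROR;
	if (ne < num_entries)
		return POLL_DRAINED;
	return POLL_MORE_POSSIBLE;
}
```

For a call that asks for one entry, a return of 0 means the CQ was empty,
and a return of 1 means one completion was pulled out and more may remain.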

    Sayantan> 2. Posting RDMA write : Do these statements for
    Sayantan> preparing a RDMA write IB descriptor make sense?

    Sayantan>     sr_desc.send_flags = IBV_SEND_SIGNALED;
    Sayantan>     sr_desc.opcode = IBV_WR_RDMA_WRITE;
    Sayantan>     sr_desc.wr_id = 0;
    Sayantan>     sr_desc.num_sge = 1;

    Sayantan>     sr_desc.sg_list = &(sg_entry);

    Sayantan>     sr_desc.wr.rdma.remote_addr = (uintptr_t) (rbuf.buf);
    Sayantan>     sr_desc.wr.rdma.rkey = rbuf.rkey;

    Sayantan>     sg_entry.addr = (uintptr_t) (lbuf.buf);
    Sayantan>     sg_entry.length = len;
    Sayantan>     sg_entry.lkey = lbuf.mr->lkey;

    Sayantan> Essentially, I don't understand what the `send_flags'
    Sayantan> field means.

Yes, this all makes sense.  The send_flags field can hold any
combination (|'ed together) of the flags IBV_SEND_FENCE,
IBV_SEND_SIGNALED, IBV_SEND_SOLICITED and IBV_SEND_INLINE.  FENCE
means that strict ordering will be enforced, as described in section
10.8 of the IB spec.  IBV_SEND_SIGNALED means that a CQ entry will be
generated when the send is completed (this flag is ignored if the QP
is created with sq_sig_all != 0, since all sends will generate CQ
entries anyway).  SOLICITED means that the solicited bit will be set
in the message so that the remote side will receive a solicited
completion event.  INLINE means the verbs driver should try to copy
the data directly into the send work request to reduce latency.
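
A small sketch of combining the flags and of the sq_sig_all rule described
above. The SKETCH_SEND_* constants and generates_cq_entry() are
hypothetical stand-ins defined here so the example compiles on its own;
real code would use the IBV_SEND_* values from infiniband/verbs.h.

```c
#include <stdbool.h>

/* Stand-in flag bits for this sketch; the bit positions are arbitrary. */
enum sketch_send_flags {
	SKETCH_SEND_FENCE     = 1 << 0,
	SKETCH_SEND_SIGNALED  = 1 << 1,
	SKETCH_SEND_SOLICITED = 1 << 2,
	SKETCH_SEND_INLINE    = 1 << 3
};

/*
 * Will a completed send produce a CQ entry?  If the QP was created with
 * sq_sig_all != 0 every send does, and the SIGNALED flag is ignored.
 */
static bool generates_cq_entry(int send_flags, int sq_sig_all)
{
	if (sq_sig_all != 0)
		return true;
	return (send_flags & SKETCH_SEND_SIGNALED) != 0;
}
```

Flags are OR'ed together, so a signaled inline send would carry
SKETCH_SEND_SIGNALED | SKETCH_SEND_INLINE in this sketch.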

    Sayantan> On a side note, if you think that this sort of
    Sayantan> discussion is useful in openib-general, please feel free
    Sayantan> to Cc to that list.

Yes, I definitely think these questions should go through the mailing
list so that all the subscribers (plus any future archive searchers!)
can learn from the answers.

Thanks,
  Roland


From roland at topspin.com  Fri Apr 15 13:30:24 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 13:30:24 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
In-Reply-To: <426023BD.8080504@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Fri, 15 Apr 2005 13:27:41 -0700")
References: <426023BD.8080504@ichips.intel.com>
Message-ID: <52r7hbixzz.fsf@topspin.com>

    ardavis> Hello Roland, I have openib uDAPL up and running with
    ardavis> most of our internal MPI test suites (Intel-MPI). Pretty
    ardavis> impressive with such an early code drop of user
    ardavis> verbs. Nice job!

Cool!

    ardavis> With a little stress, I see the following oops (running
    ardavis> latest from the trunk). Let me know if you need any more
    ardavis> information.

Thanks, I'll try to take a look at that code and see if I can figure
it out.

 - R.


From surs at cse.ohio-state.edu  Fri Apr 15 14:58:38 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Fri, 15 Apr 2005 17:58:38 -0400
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <527jj4kpkv.fsf@topspin.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
Message-ID: <20050415215836.GA6479@cse.ohio-state.edu>

Hi,

* On Apr,5 Roland Dreier<roland at topspin.com> wrote :
>     Christoph> Roland, reading the userspace infiniband/verbs.h file,
>     Christoph> where did query QP go?
> 
> It's not implemented yet.  Is there an application that needs it?

In VAPI, the query QP is used to find the inline size.
e.g.

ret = VAPI_query_qp(viadev.nic, viadev.qp_hndl[i],
                    &qp_query_attr, &qp_query_attr_mask,
                    &qp_query_init_attr);
[...]
inline_size = qp_query_attr.cap.max_inline_data_sq;


Is there another way to find out the inline size in Gen2 verbs?

Thanks,
Sayantan.

> 
>  - R.

-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From surs at cse.ohio-state.edu  Fri Apr 15 15:10:09 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Fri, 15 Apr 2005 18:10:09 -0400
Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message <
	4KB
Message-ID: <20050415221008.GB6479@cse.ohio-state.edu>

Hi,

If I run `ibv_pingpong' with a msg size argument < 4096, I encounter
an IBV_WC_LOC_LEN_ERR. I have pasted the output from my run. Is this
usage correct? Is anybody else getting the same error too?

If I may ask, what do the messages like:

  [ 0] 00860404

mean? They are not coming from `ibv_pingpong' but from somewhere else.
Are they supposed to be debug messages? If yes, how can I interpret
them?

Thanks,
Sayantan.

==============
[surs at x5:latency] ibv_pingpong --size=4095
  local address:  LID 0x0001, QPN 0x860404, PSN 0x51bebd
  remote address: LID 0x0002, QPN 0x860404, PSN 0x4be4ee
  [ 0] 00860404
  [ 4] 00000000
  [ 8] 0020669c
  [ c] 00000000
  [10] 01d70000
  [14] 00207388
  [18] 00000004
  [1c] fe100000
  [ 0] 00860404
  [ 4] 0001000a
  [ 8] 1f20669c
  [ c] 00000300
  [10] 05f90000
  [14] 000001f3
  [18] 00000044
  [1c] fe100000
Failed status 1 for wr_id 1
==============
[surs at x5:latency] ibv_pingpong --size=4096
  local address:  LID 0x0001, QPN 0x850404, PSN 0x79dca6
  remote address: LID 0x0002, QPN 0x850404, PSN 0xff01c2
8192000 bytes in 0.02 seconds = 3411.38 Mbit/sec
1000 iters in 0.02 seconds = 19.21 usec/iter
==============

Platform description -

Hardware:
---------

Two Dual Intel Xeon EM64T 3.4 GHz nodes
PCI-Express I/O bus
MT25208 Mellanox HCAs (rev a0)

Software:
---------
RedHat AS 4
2.6.11.6/2.6.11.7 kernel with Gen2 InfiniBand drivers
Firmware version 5.0.1
OpenIB Gen2 drivers (user verbs from main branch)
OpenSM (OpenIB version and IBGD 1.7.0; both give the same result)


-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From roland at topspin.com  Fri Apr 15 15:07:35 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 15:07:35 -0700
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <20050415215836.GA6479@cse.ohio-state.edu> (Sayantan Sur's
	message of "Fri, 15 Apr 2005 17:58:38 -0400")
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
	<20050415215836.GA6479@cse.ohio-state.edu>
Message-ID: <52r7hbhexk.fsf@topspin.com>

    Sayantan> In VAPI, the query QP is used to find the inline size.

    Sayantan> Is there another way to find out the inline size in Gen2
    Sayantan> verbs?

It's not implemented yet, but I would have ibv_create_qp():

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
			     struct ibv_qp_init_attr *qp_init_attr);

to pass the max inline size back in the qp_init_attr->qp_cap.max_inline_data
member.

I'll code this up on Monday, it's pretty trivial.

 - R.


From roland at topspin.com  Fri Apr 15 15:15:18 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 15:15:18 -0700
Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message
	< 4KB
In-Reply-To: <20050415221008.GB6479@cse.ohio-state.edu> (Sayantan Sur's
	message of "Fri, 15 Apr 2005 18:10:09 -0400")
References: <20050415221008.GB6479@cse.ohio-state.edu>
Message-ID: <52mzrzhekp.fsf@topspin.com>

    Sayantan> Hi, If I run `ibv_pingpong' with msg size argument <
    Sayantan> 4096, I am encountering a IBV_WC_LOC_LEN_ERR. I have
    Sayantan> pasted the output from my run. Is this usage correct? Is
    Sayantan> anybody else getting the same error too?

Are you passing the --size argument to both the client and server
ibv_pingpong programs?  If not, then one side will post a receive
buffer too small to receive the message that the other side sends, and
you will get a local length error.

    Sayantan> If I may ask, what do the messages like:

    Sayantan>   [ 0] 00860404

    Sayantan> mean? They are not coming from `ibv_pingpong' but from
    Sayantan> somewhere else.  Are they supposed to be debug messages? 
    Sayantan> If yes, how can I interpret them?

That's temporary debugging code from libmthca, specifically the CQ
polling code.  It's dumping out the 32 byte CQ entry written by the
HCA hardware.  You will need Mellanox documentation to interpret it
(or you can just read the code in libmthca/src/cq.c).
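
The dump format can be reproduced with a few lines of C. This sketch is
not the libmthca code; it assumes, from the output quoted earlier in the
thread, that each line shows a byte offset in hex followed by one 32-bit
word of the CQ entry read as big-endian. format_cqe_line(), be32_at() and
dump_cqe() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CQE_SIZE 32 /* bytes per CQ entry, as stated above */

/* Read four bytes at p as one big-endian 32-bit word. */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Format word number `word` (0..7) of the entry as "  [off] value". */
static int format_cqe_line(const unsigned char *cqe, unsigned int word,
			   char *out, size_t outlen)
{
	assert(word < CQE_SIZE / 4);
	return snprintf(out, outlen, "  [%2x] %08x", word * 4,
			(unsigned int)be32_at(cqe + word * 4));
}

/* Print all eight words of one entry, one per line. */
static void dump_cqe(const unsigned char *cqe)
{
	char line[32];
	unsigned int word;

	for (word = 0; word < CQE_SIZE / 4; word++) {
		format_cqe_line(cqe, word, line, sizeof(line));
		puts(line);
	}
}
```

In the output Sayantan posted, the first word of each entry (00860404)
matches the QPN 0x860404 printed by ibv_pingpong for that run, which
suggests that word carries the QP number.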

 - R.


From roland at topspin.com  Fri Apr 15 15:25:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 15 Apr 2005 15:25:10 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
In-Reply-To: <426023BD.8080504@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Fri, 15 Apr 2005 13:27:41 -0700")
References: <426023BD.8080504@ichips.intel.com>
Message-ID: <52ekdbhe49.fsf@topspin.com>

    ardavis> With a little stress, I see the following oops (running
    ardavis> latest from the trunk). Let me know if you need any more
    ardavis> information.

Can you try this patch and let me know if it helps at all?

Thanks,
  Roland

--- infiniband/core/uverbs_mem.c	(revision 2156)
+++ infiniband/core/uverbs_mem.c	(working copy)
@@ -60,16 +60,16 @@ static void __ib_umem_unmark(struct ib_u
 	unsigned long cur_base;
 
 	vma = find_vma(mm, umem->user_base);
+	cur_base = umem->user_base;
 
-	for (cur_base = umem->user_base;
-	     cur_base < umem->user_base + umem->length;
-	     cur_base = vma->vm_end) {
-		if (!vma || vma->vm_start > umem->user_base + umem->length)
+	while (cur_base < umem->user_base + umem->length && vma &&
+	       vma->vm_start < umem->user_base + umem->length) {
 			break;
 
 		if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 			vma->vm_flags &= ~VM_DONTCOPY;
 
+		cur_base = vma->vm_end;
 		vma = vma->vm_next;
 	}
 }
@@ -102,10 +102,9 @@ int ib_umem_get(struct ib_device *dev, s
 	down_write(&current->mm->mmap_sem);
 
 	vma = find_vma(current->mm, mem->user_base);
+	cur_base = mem->user_base;
 
-	for (cur_base = mem->user_base;
-	     cur_base < mem->user_base + size;
-	     cur_base = vma->vm_end) {
+	while (cur_base < mem->user_base + size) {
 		if (!vma || vma->vm_start > cur_base) {
 			ret = -ENOMEM;
 			goto out;
@@ -114,6 +113,7 @@ int ib_umem_get(struct ib_device *dev, s
 		if (!(vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 			vma->vm_flags |= VM_DONTCOPY;
 
+		cur_base = vma->vm_end;
 		vma = vma->vm_next;
 	}
 
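
As I read the ib_umem_get() hunk above, the old for loop advanced
cur_base from vma->vm_end only after the body had already stepped vma
to vma->vm_next, so the pointer could be NULL at that point. The
sketch below models the reordered walk in plain userspace C. The
struct and function names are stand-ins invented for illustration
(they are not the kernel's vm_area_struct or find_vma()), and the
region list is assumed to be sorted and non-overlapping.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for a VMA: covers [start, end), linked in address order. */
struct region {
	unsigned long start;
	unsigned long end;
	struct region *next;
};

/* Like find_vma(): first region whose end lies above addr, or NULL. */
struct region *find_region(struct region *head, unsigned long addr)
{
	while (head && head->end <= addr)
		head = head->next;
	return head;
}

/*
 * Walk the regions covering [base, base + size).  The region pointer
 * is checked before every dereference, and cur is advanced from the
 * current region before stepping to the next one.
 */
int check_range(struct region *head, unsigned long base, unsigned long size)
{
	struct region *r = find_region(head, base);
	unsigned long cur = base;

	while (cur < base + size) {
		if (!r || r->start > cur)
			return -ENOMEM;	/* hole, or ran off the list */
		cur = r->end;
		r = r->next;
	}
	return 0;
}
```

A range that runs past the last region now returns -ENOMEM instead
of dereferencing a NULL pointer.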

From robert.j.woodruff at intel.com  Fri Apr 15 15:37:40 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 15 Apr 2005 15:37:40 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
Message-ID: <ORSMSX408A1XvpFVjCR00000008@orsmsx408.amr.corp.intel.com>


People have been asking for a backport patch 
of the openib.org code for the 2.6.9 kernel, so I backported
the latest code that was released to kernel.org. 

Attached is a patch that can be applied to a 2.6.9 kernel
that contains the same code that is in 2.6.12-rc2-mm3. 
Roland tells me that there probably will not be any more 
changes, so this should match what is released in 2.6.12. 

I have done limited testing with IPoIB on 2.6.9 from kernel.org and 
the RedHat version of 2.6.9 that is in EL4.0 and it seems to 
work fine.  It was tested on an Itanium tiger2 platform, but the changes
were small and it should work for all other platforms as well. 

Matt, should we post this to the downloads web page for
people that want to download it?

woody


-------------- next part --------------
A non-text attachment was scrubbed...
Name: linux-2.6.9-ib.patch.bz2
Type: application/octet-stream
Size: 120133 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050415/53e15c57/attachment.obj>

From iod00d at hp.com  Fri Apr 15 15:52:49 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 15 Apr 2005 15:52:49 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <ORSMSX408A1XvpFVjCR00000008@orsmsx408.amr.corp.intel.com>
References: <ORSMSX408A1XvpFVjCR00000008@orsmsx408.amr.corp.intel.com>
Message-ID: <20050415225249.GC30386@esmail.cup.hp.com>

On Fri, Apr 15, 2005 at 03:37:40PM -0700, Bob Woodruff wrote:
> 
> People have been asking for a backport patch 
> of the openib.org code for the 2.6.9 kernel, so I backported
> the latest code that was released to kernel.org. 
...
> Matt, should we post this to the downloads web page for
> people that want to download it?

openib.org has a "gen2/src/linux-kernel/patches/" directory
for this sort of backport patch.

Maybe add a link to an SVN web interface for patches?

grant


From robert.j.woodruff at intel.com  Fri Apr 15 15:57:47 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 15 Apr 2005 15:57:47 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
Message-ID: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408>


>openib.org has a "gen2/src/linux-kernel/patches/" directory
>for this sort of backport patch.

>Maybe add a link to an SVN web interface for patches?

>grant

Makes sense. 
Perhaps we should make a subdirectory for 
gen2/src/linux-kernel/patches/2.6.9,
since I suspect that we will have additional backport patches
for 2.6.9 going forward, i.e., when the user-mode and SDP support is
complete and released to kernel.org someone may want a backport to 2.6.9.

woody


From iod00d at hp.com  Fri Apr 15 16:11:08 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 15 Apr 2005 16:11:08 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F00041FAA70@orsmsx408>
Message-ID: <20050415231108.GD30386@esmail.cup.hp.com>

On Fri, Apr 15, 2005 at 03:57:47PM -0700, Woodruff, Robert J wrote:
> Perhaps we should make a subdirectory for 
> gen2/src/linux-kernel/patches/2.6.9,

I don't think that's needed since the file name should make that obvious
and we won't have that many separate 2.6.9 patches.

> since I suspect that we will have additional backport patches
> for 2.6.9 going forward, i.e., when the user-mode and SDP support is
> complete and released to kernel.org someone may want a backport to 2.6.9.

Just name new chunks appropriately or update the existing ones.
e.g. linux-2.6.9-uverbs.diff could be the uverbs patch for 2.6.9 support.
Ditto for SDP.
Maybe add an enumerator so that one can apply patches that would collide
in some of the same files without having to physically glob the patches
together.

BTW, patches can contain the same text that we post to the mailing
list to document the contents. The existing linux-2.6.11-sinai.diff
is a good example.

grant


From robert.j.woodruff at intel.com  Fri Apr 15 16:21:52 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 15 Apr 2005 16:21:52 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <20050415231108.GD30386@esmail.cup.hp.com>
Message-ID: <ORSMSX408FRaqbC8wSA00000009@orsmsx408.amr.corp.intel.com>

 
>Just name new chunks appropriately or update the existing ones.
>e.g. linux-2.6.9-uverbs.diff could be the uverbs patch for 2.6.9 support.
>Ditto for SDP.
>Maybe add an enumerator so that one can apply patches that would collide
>in some of the same files without having to physically glob the patches
>together.

>grant

My current thinking was to provide one patch that contained all of
the InfiniBand code that was released in a specific kernel.org release
that also contained the changes for the backport.
That way, the user would only have to apply one patch to do the backport.

Alternatively, we could take the approach of just putting a patch for 
each component that is a diff of what was released to kernel.org. In this 
case the user would first apply the kernel.org patches and then multiple
backport patches. Not sure which way will be harder to maintain going
forward.

woody


From iod00d at hp.com  Fri Apr 15 16:26:58 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 15 Apr 2005 16:26:58 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <ORSMSX408A1XvpFVjCR00000008@orsmsx408.amr.corp.intel.com>
References: <ORSMSX408A1XvpFVjCR00000008@orsmsx408.amr.corp.intel.com>
Message-ID: <20050415232658.GE30386@esmail.cup.hp.com>

On Fri, Apr 15, 2005 at 03:37:40PM -0700, Bob Woodruff wrote:
...
> Attached is a patch that can be applied to a 2.6.9 kernel
> that contains the same code that is in 2.6.12-rc2-mm3. 

Normally patches shouldn't be compressed - search engines
can't/don't dig through those. And my selfish reason for
complaining is I have to save the patch and manually
uncompress it to look at it.


diff -Naurp linux-2.6.9/Documentation/ioctl-number.txt.rej linux-2.6.9-ib/Documentation/ioctl-number.txt.rej
--- linux-2.6.9/Documentation/ioctl-number.txt.rej      1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.9-ib/Documentation/ioctl-number.txt.rej   2005-04-13 12:18:20.000000000 -0700

Is this really a file in the tree?
Or just an artifact that should be removed?

...
> I have done limited testing with IPoIB on 2.6.9 from kernel.org and 
> the RedHat version of 2.6.9 that is in EL4.0 and it seems to 
> work fine.  It was tested on an Itanium tiger2 platform, but the changes
> were small and it should work for all other platforms as well. 

"the changes were small" is not how I would describe an 800k patch.
This "patch" isn't a patch in the traditional sense since most
of the 800k is new (drivers/infiniband). I'm pretty sure you
were referring to changes you made, not counting new code.
But I can't see that since it's all globbed together.

This would be better split up into three patches:
1) add drivers/infiniband (and document which SVN rev you started with)
2) changes to drivers/infiniband to use 2.6.9 services
3) changes to the kernel to support drivers/infiniband

That way we can just update each piece as necessary.

Possible names to call each piece (just ideas, you probably
have better names):
	diff-2.6.9-01-openib_drivers
	diff-2.6.9-02-openib_fixup
	diff-2.6.9-03-ib_kernel_changes

$0.02,
grant


From iod00d at hp.com  Fri Apr 15 16:33:25 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 15 Apr 2005 16:33:25 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <ORSMSX408FRaqbC8wSA00000009@orsmsx408.amr.corp.intel.com>
References: <20050415231108.GD30386@esmail.cup.hp.com>
	<ORSMSX408FRaqbC8wSA00000009@orsmsx408.amr.corp.intel.com>
Message-ID: <20050415233325.GF30386@esmail.cup.hp.com>

On Fri, Apr 15, 2005 at 04:21:52PM -0700, Bob Woodruff wrote:
> My current thinking was to provide one patch that contained all of
> the InfiniBand code that was released in a specific kernel.org release
> that also contained the changes for the backport.
> That way, the user would only have to apply one patch to do the backport.

I agree that's more convenient for users, but harder to maintain
and review. I don't expect that many people to apply this kind
of patch to a RH release which they bought support for. It will
basically void the support contract.

Wouldn't someone rolling their own kernels be more likely to be
running something newer than 2.6.9? (to follow this example)


> Alternatively, we could take the approach of just putting a patch for 
> each component that is a diff of what was released to kernel.org. In this 
> case the user would first apply the kernel.org patches and then multiple
> backport patches. Not sure which way will be harder to maintain going
> forward.

*nod*. Exactly what I was thinking. As a developer, I'm inclined
to keep my life easier (smaller patches) and I think the distros
would prefer smaller patches. Is anyone from the SuSE/RH distros
on this list who would care to comment?

thanks,
grant


From robert.j.woodruff at intel.com  Fri Apr 15 17:23:25 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 15 Apr 2005 17:23:25 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <20050415232658.GE30386@esmail.cup.hp.com>
Message-ID: <ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>

Grant wrote, 
>Possible names to call each piece (just ideas, you probably
>have better names):
>	diff-2.6.9-01-openib_drivers
>	diff-2.6.9-02-openib_fixup
>	diff-2.6.9-03-ib_kernel_changes

>$0.02,
>grant

Sure,  if that would make more sense and would be 
easier to review and maintain, we can
certainly do it that way. 

Might want to have something like, 

diff-2.6.9-01-openib_drivers-SVNxxx
diff-2.6.9-02-openib_fixup-SVNxxx
diff-2.6.9-03-ib_kernel_changes

where xxx is the SVN version

so that we can have fixups that match a specific SVN version.
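
A small sketch of the naming scheme discussed above, in case it helps
to see it spelled out: kernel version, a two-digit sequence number
that fixes the apply order, a short description, and an optional SVN
revision tag. The function name and its parameters are invented for
illustration, and 2156 below is only an example revision number.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build a patch file name of the form
 *   diff-<kernel>-<NN>-<description>[-SVN<rev>]
 * A revision of 0 or less means "not tied to an SVN revision".
 */
int backport_patch_name(char *buf, size_t len, const char *kver,
			int seq, const char *desc, int svn_rev)
{
	if (svn_rev > 0)
		return snprintf(buf, len, "diff-%s-%02d-%s-SVN%d",
				kver, seq, desc, svn_rev);

	return snprintf(buf, len, "diff-%s-%02d-%s", kver, seq, desc);
}
```

Because the sequence number is zero-padded, a plain sorted directory
listing gives the order in which the pieces should be applied.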


From robert.j.woodruff at intel.com  Fri Apr 15 17:26:13 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 15 Apr 2005 17:26:13 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <20050415232658.GE30386@esmail.cup.hp.com>
Message-ID: <ORSMSX408FRaqbC8wSA0000000b@orsmsx408.amr.corp.intel.com>

Grant wrote,
>diff -Naurp linux-2.6.9/Documentation/ioctl-number.txt.rej linux-2.6.9-ib/Documentation/ioctl-number.txt.rej
>--- linux-2.6.9/Documentation/ioctl-number.txt.rej      1969-12-31 16:00:00.000000000 -0800
>+++ linux-2.6.9-ib/Documentation/ioctl-number.txt.rej   2005-04-13 12:18:20.000000000 -0700

>Is this really a file in the tree?
>Or just an artifact that should be removed?

I'll check. Looks like something that can be removed.

woody


From iod00d at hp.com  Fri Apr 15 19:07:15 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 15 Apr 2005 19:07:15 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>
References: <20050415232658.GE30386@esmail.cup.hp.com>
	<ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>
Message-ID: <20050416020715.GG30386@esmail.cup.hp.com>

On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote:
> Might want to have something like, 
> 
> diff-2.6.9-01-openib_drivers-SVNxxx
> diff-2.6.9-02-openib_fixup-SVNxxx
> diff-2.6.9-03-ib_kernel_changes
> 
> where xxx is the SVN version
> 
> so that we can have fixups that match a specific SVN version.

Sure, that's a good idea.

And it reminds me that we still have no clue which version
of SVN someone's IB kernel driver is based on. Supporting
this is going to be painful unless this is dealt with.
Anyone have a clue how to get SVN or "make" to embed
the version number in any resulting .o ?

It's bad enough when distros patch drivers without updating
the rev number. But not having a reliable rev number to start
with is even worse.

thanks,
grant


From surs at cse.ohio-state.edu  Fri Apr 15 22:26:37 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sat, 16 Apr 2005 01:26:37 -0400
Subject: [openib-general] IBV_WC_LOC_LEN_ERR in ibv_pingpong for message
	< 4KB
In-Reply-To: <52mzrzhekp.fsf@topspin.com>
References: <20050415221008.GB6479@cse.ohio-state.edu>
	<52mzrzhekp.fsf@topspin.com>
Message-ID: <4260A20D.1090009@cse.ohio-state.edu>

Roland Dreier wrote:
>     Sayantan> Hi, If I run `ibv_pingpong' with msg size argument <
>     Sayantan> 4096, I am encountering a IBV_WC_LOC_LEN_ERR. I have
>     Sayantan> pasted the output from my run. Is this usage correct? Is
>     Sayantan> anybody else getting the same error too?
> 
> Are you passing the --size argument to both the client and server
> ibv_pingpong programs?  If not, then one side will post a receive
> buffer too small to receive the message that the other side sends, and
> you will get a local length error.

Thanks for your reply. Yes, I was assuming a `perf_main' style of usage, 
where it isn't necessary to pass the size information to the receiver 
command line.

> 
>     Sayantan> If I may ask, what do the messages like:
> 
>     Sayantan>   [ 0] 00860404
> 
>     Sayantan> mean? They are not coming from `ibv_pingpong' but from
>     Sayantan> somewhere else.  Are they supposed to be debug messages? 
>     Sayantan> If yes, how can I interpret them?
> 
> That's temporary debugging code from libmthca, specifically the CQ
> polling code.  It's dumping out the 32 byte CQ entry written by the
> HCA hardware.  You will need Mellanox documentation to interpret it
> (or you can just read the code in libmthca/src/cq.c).

Okay. I'll look at that file.

Thanks,
Sayantan.

> 
>  - R.


-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From kjreilly at us.ibm.com  Sat Apr 16 07:49:56 2005
From: kjreilly at us.ibm.com (Kevin Reilly)
Date: Sat, 16 Apr 2005 10:49:56 -0400
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <527jj4n3yr.fsf@topspin.com>
Message-ID: <OF45EEBBB4.5B4DD430-ON85256FE5.00512CB8-85256FE5.005179B2@us.ibm.com>


Thanks Roland.  I guess I can assume that any prototype work
layered over VAPI can be ported to the new gen2 interface
fairly easily?

Kevin J. Reilly
STSM, HPC Architecture
-Federation/HPS  Chief Engineer
-HPC interconnect architect
(office) 845-433-7976  (tieline) 8-293-7976


Roland Dreier <roland at topspin.com>
04/14/2005 10:55 PM
To: Kevin Reilly/Poughkeepsie/IBM at IBMUS
cc: openib-general at openib.org
Subject: Re: [openib-general] openIB gen2 user space verbs API

    Kevin> I was wondering where I could find information on the openIB
    Kevin> gen2 user space verbs API?  One of my key questions is how
    Kevin> much it differs from VAPI.  Is it a superset of VAPI and if
    Kevin> so what function was added?

Right now the best way to find out about the userspace verbs API is to
look at the libibverbs source.  The include file infiniband/verbs.h
and the code in the examples directory are probably the best places to
get started.

The current code implements all the main verbs likely to be used by
userspace applications.  There are no functions that I can think of
added beyond what's in VAPI.

  - R.


From roland at topspin.com  Sat Apr 16 08:50:01 2005
From: roland at topspin.com (Roland Dreier)
Date: Sat, 16 Apr 2005 08:50:01 -0700
Subject: [openib-general] openIB gen2 user space verbs API
In-Reply-To: <OF45EEBBB4.5B4DD430-ON85256FE5.00512CB8-85256FE5.005179B2@us.ibm.com>
	(Kevin Reilly's message of "Sat, 16 Apr 2005 10:49:56 -0400")
References: <OF45EEBBB4.5B4DD430-ON85256FE5.00512CB8-85256FE5.005179B2@us.ibm.com>
Message-ID: <52y8big1qt.fsf@topspin.com>

    Kevin> Thanks Roland, I guess i can make the assumption that any
    Kevin> prototype work layered over VAPI can be fairly easily
    Kevin> ported over the new gen2 interface?

Yes, I think so.

 - R.


From mst at mellanox.co.il  Sat Apr 16 09:58:04 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Apr 2005 19:58:04 +0300
Subject: [openib-general] Re: Gen2 User verbs usage
In-Reply-To: <52hdi7kd2o.fsf@topspin.com>
References: <20050415182702.GA5572@cse.ohio-state.edu>
	<52hdi7kd2o.fsf@topspin.com>
Message-ID: <20050416165804.GA854@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> INLINE means the verbs driver should try to copy
> the data directly into the send work request to reduce latency.

There currently doesn't seem to be a way for userspace to know
when setting the INLINE flag is possible.

It would seem what we need is another attribute passed to
create_qp that would specify the max inline buffer size, and probably 
an hca attribute to give the maximum legal value for this attribute.

Does this make sense, and would you accept a patch like this?

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Sat Apr 16 10:00:13 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Apr 2005 20:00:13 +0300
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9
	kernel
In-Reply-To: <20050416020715.GG30386@esmail.cup.hp.com>
References: <20050415232658.GE30386@esmail.cup.hp.com>
	<ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>
	<20050416020715.GG30386@esmail.cup.hp.com>
Message-ID: <20050416170013.GB854@mellanox.co.il>

Quoting r. Grant Grundler <iod00d at hp.com>:
> Subject: Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
> 
> On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote:
> > Might want to have something like, 
> > 
> > diff-2.6.9-01-openib_drivers-SVNxxx
> > diff-2.6.9-02-openib_fixup-SVNxxx
> > diff-2.6.9-03-ib_kernel_changes
> > 
> > where xxx is the SVN version
> > 
> > so that we can have fixups that match a specific SVN version.
> 
> Sure, that's a good idea.
> 
> And it reminds me that we still have no clue which version
> of SVN someone's IB kernel driver is based on. Supporting
> this is going to be painful unless this is dealt with.
> Anyone have a clue how to get SVN or "make" to embed
> the version number in any resulting .o ?
> 
> It's bad enough when distros patch drivers without updating
> the rev number. But not having a reliable rev number to start
> with is even worse.
> 
> thanks,
> grant

You have to run the svnversion utility to get the revision.

-- 
MST - Michael S. Tsirkin
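
One common way to get the svnversion output into a compiled object is
to pass it on the compiler command line as a string macro. A minimal
sketch, assuming a hypothetical SVN_REVISION macro that the build
would define, for example with something like
-DSVN_REVISION=\"`svnversion .`\" in the compiler flags. Neither the
macro name nor that build rule comes from the OpenIB tree.

```c
/* Fall back to a fixed string when the build did not supply one. */
#ifndef SVN_REVISION
#define SVN_REVISION "unknown"
#endif

/*
 * Keeping the string in a named object with external linkage means it
 * ends up in the .o file, where strings(1) can find it by its prefix.
 */
const char build_revision[] = "svn-rev: " SVN_REVISION;

const char *get_build_revision(void)
{
	return build_revision;
}
```

In a kernel module the same string could presumably be handed to
MODULE_VERSION() so that it shows up in modinfo output.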


From mst at mellanox.co.il  Sat Apr 16 10:23:03 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Apr 2005 20:23:03 +0300
Subject: [openib-general] Re: openIB gen2 user space verbs API
In-Reply-To: <52r7hbhexk.fsf@topspin.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
	<20050415215836.GA6479@cse.ohio-state.edu>
	<52r7hbhexk.fsf@topspin.com>
Message-ID: <20050416172303.GC854@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: openIB gen2 user space verbs API
> 
>     Sayantan> In VAPI, the query QP is used to find the inline size.
> 
>     Sayantan> Is there another way to find out the inline size in Gen2
>     Sayantan> verbs?
> 
> It's not implemented yet, but I would have ibv_create_qp():
> 
> struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
> 			     struct ibv_qp_init_attr *qp_init_attr);
> 
> to pass the max inline size back in the qp_init_attr->qp_cap.max_inline_data
> member.
> 
> I'll code this up on Monday, it's pretty trivial.
> 
>  - R.

An application would need to know which values are legal to pass to
create_qp.
Maybe it makes sense to implement something like query_hca, and let
it return the maximum legal value?

-- 
MST - Michael S. Tsirkin
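
To make the two proposals in this thread concrete, here is a toy
model of the "request a size, read back what was granted" pattern.
Everything in it is a stand-in invented for illustration: the demo_*
types and functions are not libibverbs structures or calls, and the
clamping behaviour is an assumption about how a provider might
respond, not something the thread settles.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types for illustration only; not the libibverbs structs. */
struct demo_qp_cap {
	uint32_t max_inline_data;	/* in: requested, out: granted */
};

struct demo_device {
	uint32_t max_inline_limit;	/* largest value the device allows */
};

/*
 * Model of QP creation: clamp the requested inline size to the device
 * limit and write the granted value back for the caller to inspect.
 */
void demo_create_qp(const struct demo_device *dev, struct demo_qp_cap *cap)
{
	if (cap->max_inline_data > dev->max_inline_limit)
		cap->max_inline_data = dev->max_inline_limit;
}

/* Caller side: only ask for an inline send when the payload fits. */
int demo_use_inline(const struct demo_qp_cap *cap, size_t len)
{
	return len <= cap->max_inline_data;
}
```

With a device-level query as suggested above, the caller could pick a
legal request up front instead of discovering the limit only after
the QP exists.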


From gdror at mellanox.co.il  Sat Apr 16 14:09:00 2005
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Sun, 17 Apr 2005 00:09:00 +0300
Subject: [openib-general] Static Rate Questions
Message-ID: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com>

Hi,

I have a couple of questions regarding static rate implementation.

- In struct ib_ah_attr, static_rate is defined as u8. What are the 
  expected values that static_rate is supposed to take ? Is it 
  absolute Gb/s ? Gb/s in 2.5Gb/s units ? or relative rate to
  port speed ?

  Looking at ipoib_main I understand that the static_rate is supposed
  to be the relative rate to port speed. In other words, a divider for the 
  current port speed.

- For some reason I don't see static_rate initialization for SDP. This should
  either be in SDP code or in the CM (cm_init_qp_rtr_attr()).

- In mthca, there are two places setting up the static_rate. One
  for AH which looks fine. The other one for QP which I believe has
  a bug.

  mthca_qp.c:
	qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3;
  Should be  
	qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate);

  Because max_stat_rate is bits 10:3 at offset 8h of Address Path.

	
- A question for next generation HW. Would you find it more useful that the
  HCA supports static rate as an absolute speed (Gb/s) or as a relative
  ratio to the current port speed ?

Thanks
Dror
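
For reference, a tiny sketch of what the two expressions from the
mthca_qp.c question evaluate to. It only shows the arithmetic: "!!"
collapses any non-zero rate to 1, and the shift then moves that bit
to position 3. Which placement matches the hardware layout is exactly
the open question and is not decided here. The function names are
invented for illustration.

```c
#include <stdint.h>

/* Expression as quoted from mthca_qp.c: result is 0 or 8. */
uint8_t static_rate_shifted(uint8_t static_rate)
{
	return (uint8_t)((!!static_rate) << 3);
}

/* Expression as proposed in the message: result is 0 or 1. */
uint8_t static_rate_unshifted(uint8_t static_rate)
{
	return (uint8_t)(!!static_rate);
}
```

Note that in both forms the original static_rate value is reduced to
a single on/off bit, so distinct non-zero rates are not told apart.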

From ftillier at infiniconsys.com  Sat Apr 16 15:26:48 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Sat, 16 Apr 2005 15:26:48 -0700
Subject: [openib-general] Static Rate Questions
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com>
Message-ID: <000001c542d3$62c382c0$6501a8c0@infiniconsys.com>

> From: Dror Goldenberg [mailto:gdror at mellanox.co.il]
> Sent: Saturday, April 16, 2005 2:09 PM
> 
> - A question for next generation HW. Would you find it more useful that
> the HCA supports static rate as an absolute speed (Gb/s) or as a relative
> ratio to the current port speed?

Personally I'd like to see everything use absolute not relative rates.  It
seems that the static rate is really used as IPD, which I find a bit
counter-intuitive.

- Fab


From mst at mellanox.co.il  Sun Apr 17 02:32:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 12:32:45 +0300
Subject: [openib-general] [PATCH] fix management/README
Message-ID: <20050417093245.GA16996@mellanox.co.il>

Fix build instructions to refer to directories that actually exist.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: src/userspace/management/README
===================================================================
--- src/userspace/management/README	(revision 2171)
+++ src/userspace/management/README	(working copy)
@@ -61,7 +61,7 @@
 ./autogen.sh && ./configure && make && make install
 2. In osm/complib and osm/libvendor, run:
 ./autogen.sh && ./configure && make && make install
-3. In all util, diag, and osm/opensm subdirectories, run:
+3. In all util/mad_test, diags, and osm/opensm subdirectories, run:
 ./autogen.sh && ./configure
 4. At top level of management, run:
 make && make install 

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Sun Apr 17 04:36:47 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 14:36:47 +0300
Subject: [openib-general] performance tests uploaded to contrib
Message-ID: <20050417113647.GF16996@mellanox.co.il>

Hello!
I created https://openib.org/svn/trunk/contrib/mellanox/perftest

This directory includes gen2 uverbs microbenchmarks -
useful as usage examples and for performance tuning.

Testing methodology:
- CPU clock instruction is used to get CPU clock without context switch.
- Median (as opposed to average) result is reported. The median is
  less sensitive to extreme scores. An option to report the full result
  distribution for alternative statistical analysis is provided.
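
To illustrate why the median is reported rather than the average,
here is a minimal sketch of a median over a set of cycle-count
samples. It is not the code from rdma_lat.c; the function names are
invented for illustration, and the sample buffer is sorted in place.

```c
#include <stdlib.h>

/* qsort() comparator for unsigned long long values. */
static int cmp_ull(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the samples in place and return the median.  For an even
 * count the two middle values are averaged without risking overflow.
 * Returns 0 when there are no samples.
 */
unsigned long long median_cycles(unsigned long long *samples, size_t n)
{
	if (n == 0)
		return 0;

	qsort(samples, n, sizeof samples[0], cmp_ull);

	if (n % 2)
		return samples[n / 2];

	return samples[n / 2 - 1] +
	       (samples[n / 2] - samples[n / 2 - 1]) / 2;
}
```

With samples of 10, 12, 11, 1000 and 9 cycles the median is 11, while
the average is pulled up to 208 by the single outlier.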

Architectures supported:
- i686, x86_64

Tests in this directory: there is currently one test:

rdma_lat.c - latency test with RDMA write transactions

Code is originally based on the pingpong test.
I intentionally did not rename functions from pingpong_ to rdma_
to make it easier to share some code with libibverbs/examples later.

Current results:

I currently observe latency below 3.5 usec.

Drop me a note if you find this useful.

Thanks,

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Sun Apr 17 06:56:58 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 16:56:58 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52is2t1g1m.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com>
Message-ID: <20050417135658.GK16996@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [PATCH] uverbs with static libraries
> 
>     Michael> I'd like to get userspace verbs working with static
>     Michael> libraries.  My motivation is currently enabling our code
>     Michael> coverage tools which only work well with static
>     Michael> libraries, but I expect there to be other uses.
> 
> Looks reasonable.  With this, do you then do --enable-static when
> configuring libmthca or is there anything else required?
> 
>  - R.
> 

You also need the patch below to enable static libraries.
Then when you link you must pass

 -u openib_driver_init -rdynamic

to gcc, to pull in the driver library.
Roland, please let me know whether you plan to apply this and the previous
patch.


Enable static version of libmthca.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: libmthca/configure.in
===================================================================
--- libmthca/configure.in	(revision 2171)
+++ libmthca/configure.in	(working copy)
@@ -6,7 +6,6 @@ AC_CONFIG_SRCDIR([src/mthca.h])
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)
 AM_INIT_AUTOMAKE(libmthca, 0.9.0)
-AC_DISABLE_STATIC
 AM_PROG_LIBTOOL
 
 dnl Checks for programs
-- 
MST - Michael S. Tsirkin


From steve at wooding.uklinux.net  Sun Apr 17 08:06:02 2005
From: steve at wooding.uklinux.net (Steven Wooding)
Date: Sun, 17 Apr 2005 16:06:02 +0100
Subject: [openib-general] Advice about adapting ibv_pingpong to use UC
Message-ID: <42627B5A.7010709@wooding.uklinux.net>

Hi,

I wonder if someone working on the gen2 uverbs would be so kind as to 
give me some advice on adapting the ibv_pingpong program to use a UC QP 
type rather than RC. I was previously able to do this with the Mellanox 
stack by changing the qp_type attribute and then not setting variables 
that are only needed for RC (timeout and retry periods etc).

However, when I perform the same trick with ibv_pingpong it errors on 
the function call that should put the QP into the INIT state. I can't 
see what to change in that function call to get it to the next state change.

I realise that the general demand for the UC type connection is low, but 
my application is a real-time interface where retries are not an option 
I'm afraid.

Thank you in advance for any help the busy gen2 developers are able to 
offer.

Regards,

Steve.

x86_64, RHEL 4, gen2 2169.


From tduffy at sun.com  Sun Apr 17 12:40:37 2005
From: tduffy at sun.com (Tom Duffy)
Date: Sun, 17 Apr 2005 12:40:37 -0700
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for
	2.6.9	kernel
In-Reply-To: <20050416170013.GB854@mellanox.co.il>
References: <20050415232658.GE30386@esmail.cup.hp.com>
	<ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>
	<20050416020715.GG30386@esmail.cup.hp.com>
	<20050416170013.GB854@mellanox.co.il>
Message-ID: <1113766837.9390.2.camel@duffman>

On Sat, 2005-04-16 at 20:00 +0300, Michael S. Tsirkin wrote:
> You have to run the svnversion utility to get the revision.

You could embed the $Revision$ in the MODULE_VERSION.

-tduffy

From mst at mellanox.co.il  Sun Apr 17 12:47:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 22:47:45 +0300
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6
	.9	kernel
In-Reply-To: <1113766837.9390.2.camel@duffman>
References: <1113766837.9390.2.camel@duffman>
Message-ID: <20050417194745.GA9442@mellanox.co.il>

Quoting r. Tom Duffy <tduffy at sun.com>:
> Subject: Re: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
> 
> On Sat, 2005-04-16 at 20:00 +0300, Michael S. Tsirkin wrote:
> > You have to run the svnversion utility to get the revision.
> 
> You could embed the $Revision$ in the MODULE_VERSION.
> 
> -tduffy
> 

AFAIK that's the last revision in which the specific file changed, which is
typically not what you want.


-- 
MST - Michael S. Tsirkin


From surs at cse.ohio-state.edu  Sun Apr 17 12:54:33 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sun, 17 Apr 2005 15:54:33 -0400
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <20050417113647.GF16996@mellanox.co.il>
References: <20050417113647.GF16996@mellanox.co.il>
Message-ID: <20050417195432.GA22185@cse.ohio-state.edu>

Michael,

> Current results:
> 
> I currently observe latency below 3.5 usec.
> 
> Drop me a note if you find this useful.

Thanks for putting this up on the contrib tree. I have run this
rdma_latency, and am getting around 3.35 us (without switch).

Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7

Do you have any idea when a port of the popular `perf_main' will be
available? As more people try to use the Gen2 verbs, `perf_main' (or
something similar) can help people evaluate different IB operations and
also to have example code to use different features of IB.

Thanks,
Sayantan.

> 
> Thanks,
> 
> -- 
> MST - Michael S. Tsirkin

-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From mst at mellanox.co.il  Sun Apr 17 13:40:07 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 17 Apr 2005 23:40:07 +0300
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <20050417195432.GA22185@cse.ohio-state.edu>
References: <20050417113647.GF16996@mellanox.co.il>
	<20050417195432.GA22185@cse.ohio-state.edu>
Message-ID: <20050417204007.GB9442@mellanox.co.il>

Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> Subject: Re: [openib-general] performance tests uploaded to contrib
> 
> Michael,
> 
> > Current results:
> > 
> > I currently observe latency below 3.5 usec.
> > 
> > Drop me a note if you find this useful.
> 
> Thanks for putting this up on the contrib tree. I have run this
> rdma_latency, and am getting around 3.35 us (without switch).
> 
> Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7
> 
> Do you have any idea when a port of the popular `perf_main' will be
> available? As more people try to use the Gen2 verbs, `perf_main' (or
> something similar) can help people evaluate different IB operations and
> also to have example code to use different features of IB.
> 
> Thanks,
> Sayantan.
> 

I don't plan to port the monolithic perf_main to gen2.

Instead, I plan to upload a set of microbenchmarks each testing
a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send
bandwidth etc.

I hope that this will help achieve better code clarity than what we 
have in perf_main.

What are the features you are most interested in?

By the way, you can already see some example code in libibverbs/examples,
although that is not necessarily benchmark-oriented.
I used the code in pingpong.c as a starting point for the rdma latency
test, and it can be used with very few changes as an rc send latency
test.

-- 
MST - Michael S. Tsirkin


From abhijitngpune at indiatimes.com  Mon Apr 18 08:20:55 2005
From: abhijitngpune at indiatimes.com (abhijitngpune)
Date: Mon, 18 Apr 2005 11:20:55 -0400
Subject: [openib-general] Re: Re: Re: Subment Management
Message-ID: <200504180458.KAA17566@WS0005.indiatimes.com>

hi,


>Are you using Voltaire SM ? It has richer capabilities here than OpenSM.
>Someone in support should be able to explain how to use the various
>routing/pathing algorithms supported by VSM. Who is your support person
>?


   Could you please suggest someone from the support team who can help me with this problem?


Abhijeet 

"Hal Rosenstock" wrote:


Hi Abhi,

On Thu, 2005-04-14 at 07:26, abhijitngpune wrote:
> Hi,
> 
> Thanks,
> 
> Regd. Fabric manager : It was just a theoritical concern. I
> was doubtful regd voltaire SM.
> 
> Regd. Subnet topology:
> 
> Let me explain the scenario. I want to create
> (logical) topology like this :
> 
> N
> 
> / | \
> 
> N ---- | ---- N
> 
> / \ | / \
> 
> N --- N --- N
> 
> N: node (hope u understand the topology)
> 
> How to create such kind of virtual subnet topology?

This is where LMC > 0 comes in. It allows for multiple paths to be
utilized between nodes.

> Do u think it is possible over star connected nodes(Interconnected
> topology is star, becoz we have one voltaire switch to which all 78
> nodes are connected )?

Yes.

> How subnet management will help me to get this topology? What should
> be my basic step in this scenario?.

Are you using Voltaire SM ? It has richer capabilities here than OpenSM.
Someone in support should be able to explain how to use the various
routing/pathing algorithms supported by VSM. Who is your support person
?

What host stack are you running on the end nodes ? (They appear to be HP
nodes). Also, what applications/ULPs/protocols are you intending on
running in this topology ?

Thanks.

-- Hal

> Abhi,
> 
> CML
> 
> "Hal Rosenstock" wrote:
> 
> Hi Abhi,
> 
> On Thu, 2005-04-14 at 04:40, abhijitngpune wrote:
> > Hi, My lab has recently purchased Voltaire 9288 - 288 Port Infiniband
> > Cluster Switch and HP cluster with 78 nodes. I have some doubts
> > related to subnet management. 1. Does the fabric manager work on
> > Graph(non fat tree) topologies?
> 
> As Shahar has explained in previous emails, it does. Is this a
> theoretical concern or are you having a real world problem ?
> 
> > 2. Since the interconnect topology is star how can i get logical
> > topology which is graph (containing cycles)?
> 
> The SM uses one of its routing/pathing policies to determine the
> topology. Which SM are you using ? Are you using Voltaire's SM or
> OpenSM ?
> 
> > How can i create subnet with (logical) topology on the top of the
> > underlying star topology?
> 
> The SM automatically does this based on the routing/pathing policy. Note
> that the routing/pathing policy is beyond the IB spec (and left to the
> SM vendor/implementor).
> 
> > 3. Can i install Mellanox gold s/w for infiniband?
> 
> The short answer is yes. You can run Mellanox Gold on the end nodes,
> OpenIB, or the Voltaire host stack on the end nodes. It depends on what
> applications/ULPs you are trying to run. For things in common, not all
> interop experiments have been run if you are trying for a heterogeneous
> environment (a mix of those end node stacks).
> 
> > 4. Does voltiare has its own subnet simulator? Abhi CML
> 
> Shahar will need to answer this one.
> 
> -- Hal

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050418/2324ce10/attachment.html>

From ogerlitz at voltaire.com  Sun Apr 17 23:26:32 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Apr 2005 09:26:32 +0300
Subject: [openib-general] openIB gen2 user space verbs API
Message-ID: <D4F8F0B3820E754C887699BEF26A89406AF22B@taurus.voltaire.com>

Christoph> Roland, reading the userspace infiniband/verbs.h file,
Christoph> where did query QP go?

Roland> It's not implemented yet.  Is there an application that needs it?

Roland,

Other than getting the max inline size, query QP is used to get the
current QP state. Examples are the application error flow (e.g. when
modify QP failed) and the application APM flow, which needs to sense
some of the state transitions done by the HW. These are only the
examples I quickly thought of; I guess there are more.

I understand and like the attitude of striving for simplicity, but I
guess that for production use mthca and the uverbs library would need
to support all the query functions existing in VAPI (see below, with
the possible exception of query ah / eec / mw).

Indeed, it may be that some of them (e.g. the first four below) can be
implemented in mthca only, with their results cached/mmapped for use by
the uverbs libs.

Or.

[root at zeta ogerlitz]# grep query /usr/mellanox/include/vapi.h | grep Func | awk '{ print $3}'
VAPI_query_hca_cap
VAPI_query_hca_port_prop
VAPI_query_hca_gid_tbl
VAPI_query_hca_pkey_tbl
VAPI_query_addr_hndl
VAPI_query_qp
VAPI_query_qp_ext
VAPI_query_srq
VAPI_query_cq
VAPI_query_eec_attr
VAPI_query_mr
VAPI_query_mw

-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
Sent: Saturday, April 16, 2005 1:08 AM
To: Sayantan Sur
Cc: Christoph Raisch; openib-general at openib.org
Subject: Re: [openib-general] openIB gen2 user space verbs API

    Sayantan> In VAPI, the query QP is used to find the inline size.

    Sayantan> Is there another way to find out the inline size in Gen2
    Sayantan> verbs?

It's not implemented yet, but I would have ibv_create_qp():

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
			     struct ibv_qp_init_attr *qp_init_attr);

to pass the max inline size back in the
qp_init_attr->qp_cap.max_inline_data
member.

I'll code this up on Monday, it's pretty trivial.

 - R.


From surs at cse.ohio-state.edu  Mon Apr 18 06:00:48 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Mon, 18 Apr 2005 09:00:48 -0400
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <20050417204007.GB9442@mellanox.co.il>
References: <20050417113647.GF16996@mellanox.co.il>
	<20050417195432.GA22185@cse.ohio-state.edu>
	<20050417204007.GB9442@mellanox.co.il>
Message-ID: <4263AF80.1000601@cse.ohio-state.edu>

Michael S. Tsirkin wrote:
> Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> 
>>Subject: Re: [openib-general] performance tests uploaded to contrib
>>
>>Michael,
>>
>>
>>>Current results:
>>>
>>>I currently observe latency below 3.5 usec.
>>>
>>>Drop me a note if you find this useful.
>>
>>Thanks for putting this up on the contrib tree. I have run this
>>rdma_latency, and am getting around 3.35 us (without switch).
>>
>>Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7
>>
>>Do you have any idea when a port of the popular `perf_main' will be
>>available? As more people try to use the Gen2 verbs, `perf_main' (or
>>something similar) can help people evaluate different IB operations and
>>also to have example code to use different features of IB.
>>
>>Thanks,
>>Sayantan.
>>
> 
> 
> I dont plan to port the monolithic perf_main to gen2.
> 
> Instead, I plan to upload a set of microbenchmarks each testing
> a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send
> bandwidth etc.
> 
> I hope that this will help achieve better code clarity than what we 
> have in perf_main.

That is fine. As long as there is some ibverbs level benchmark suite, it 
will ease the transition.

> 
> What are the features you are most interested in?

By RDMA latency/bandwidth, do you mean both RDMA write & read? Will 
there be any Atomic latency tests also?

> 
> By the way, you can already see some example code in libibverbs/examples,
> although that is not necessarily benchmark-oriented.
> I used the code in pingpong.c as a starting point for rdma latency
> test, and it can be used with very little changes as rc send latency
> test.

Yes, this was helpful for me too. Apart from providing code examples, I 
was suggesting that a much more comprehensive benchmark suite will help 
people moving from other vendor stacks to Gen2 to have a quick 
comparison of performance offered by both stacks.

Thanks,
Sayantan.


-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From mst at mellanox.co.il  Mon Apr 18 06:08:56 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 16:08:56 +0300
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <4263AF80.1000601@cse.ohio-state.edu>
References: <20050417113647.GF16996@mellanox.co.il>
	<20050417195432.GA22185@cse.ohio-state.edu>
	<20050417204007.GB9442@mellanox.co.il>
	<4263AF80.1000601@cse.ohio-state.edu>
Message-ID: <20050418130856.GF17566@mellanox.co.il>

Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> Subject: Re: [openib-general] performance tests uploaded to contrib
> 
> Michael S. Tsirkin wrote:
> >Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> >
> >>Subject: Re: [openib-general] performance tests uploaded to contrib
> >>
> >>Michael,
> >>
> >>
> >>>Current results:
> >>>
> >>>I currently observe latency below 3.5 usec.
> >>>
> >>>Drop me a note if you find this useful.
> >>
> >>Thanks for putting this up on the contrib tree. I have run this
> >>rdma_latency, and am getting around 3.35 us (without switch).
> >>
> >>Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7
> >>
> >>Do you have any idea when a port of the popular `perf_main' will be
> >>available? As more people try to use the Gen2 verbs, `perf_main' (or
> >>something similar) can help people evaluate different IB operations and
> >>also to have example code to use different features of IB.
> >>
> >>Thanks,
> >>Sayantan.
> >>
> >
> >
> >I dont plan to port the monolithic perf_main to gen2.
> >
> >Instead, I plan to upload a set of microbenchmarks each testing
> >a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send
> >bandwidth etc.
> >
> >I hope that this will help achieve better code clarity than what we 
> >have in perf_main.
> 
> That is fine. As long as there is some ibverbs level benchmark suite, it 
> will ease the transition.
> 
> >
> >What are the features you are most interested in?
> 
> By RDMA latency/bandwidth, do you mean both RDMA write & read?

RDMA write test is out there, I hope to upload the read test RSN.

> Will 
> there be any Atomic latency tests also?

Sure, why not. Is it a priority for you?

> >
> >By the way, you can already see some example code in libibverbs/examples,
> >although that is not necessarily benchmark-oriented.
> >I used the code in pingpong.c as a starting point for rdma latency
> >test, and it can be used with very little changes as rc send latency
> >test.
> 
> Yes, this was helpful for me too. Apart from providing code examples, I 
> was suggesting that a much more comprehensive benchmark suite will help 
> people moving from other vendor stacks to Gen2 to have a quick 
> comparison of performance offered by both stacks.
> 
> Thanks,
> Sayantan.
> 

I agree.

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Mon Apr 18 07:50:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 17:50:46 +0300
Subject: [openib-general] ttcp.aio - kernel NULL pointer dereference
Message-ID: <20050418145046.GG17566@mellanox.co.il>

Hello, Libor!
Every once in a while, when I run ttcp, I get a kernel NULL pointer
dereference from SDP.

I compiled the ttcp.aio test with

gcc -I../../../linux-kernel/infiniband/ulp/sdp ttcp.aio.c -O2 -o ttcp.aio.x -laio

I run ttcp on the server as

./ttcp.aio.x -r -l 100 -a 10

and the client as

./ttcp.aio.x -t -l 100 -n 100 -a 10 11.4.8.155

I repeated this test several times, sometimes getting 
ttcp-t: Event error <-32> <5275648>
messages and sometimes not.
It was the server that finally crashed.

My kernel is 2.6.11 + latest openib svn (rev 2171).

The log file leading to the crash is below:

Apr 18 17:34:11 swlab155 kernel:  ERR: : IOCB <0> cancel <0> flag <0040> size <1:0:1>
Apr 18 17:34:22 swlab155 kernel:  ERR: : IOCB <0> cancel <0> flag <0040> size <100:0:100>
Apr 18 17:34:41 swlab155 kernel:  ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:41 swlab155 kernel:  ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:49 swlab155 kernel:  ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:49 swlab155 kernel:  ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel:  ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel:  ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:34:59 swlab155 kernel: WARN: : Unexpected conn state. conn <9> state <ff01:fd01>
Apr 18 17:35:22 swlab155 kernel:  ERR: : IOCB <0> cancel <0> flag <0040> size <100:0:100>
Apr 18 17:35:34 swlab155 last message repeated 5 times
Apr 18 17:35:44 swlab155 kernel:  ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:35:44 swlab155 kernel:  ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel:  ERR: : VMA lock <528000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel:  ERR: : VMA lock <52c000:100> error <-12> <1:8:8>
Apr 18 17:35:52 swlab155 kernel: WARN: : Cancel read with no IOCB. <2:0:00000005>
Apr 18 17:35:52 swlab155 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000038 RIP: 
Apr 18 17:35:52 swlab155 kernel: <ffffffff80389f5e>{_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: PGD 15cb56067 PUD 15cbb4067 PMD 0 
Apr 18 17:35:52 swlab155 kernel: Oops: 0002 [1] SMP 
Apr 18 17:35:52 swlab155 kernel: CPU 0 
Apr 18 17:35:52 swlab155 kernel: Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
Apr 18 17:35:52 swlab155 kernel: Pid: 6, comm: events/0 Not tainted 2.6.11-openib
Apr 18 17:35:52 swlab155 kernel: RIP: 0010:[_spin_lock_irqsave+9/27] <ffffffff80389f5e>{_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: RIP: 0010:[<ffffffff80389f5e>] <ffffffff80389f5e>{_spin_lock_irqsave+9}
Apr 18 17:35:52 swlab155 kernel: RSP: 0000:ffff8100dfe9fe08  EFLAGS: 00010092
Apr 18 17:35:52 swlab155 kernel: RAX: 0000000000000064 RBX: 0000000000000000 RCX: ffff81015c596528
Apr 18 17:35:52 swlab155 kernel: RDX: 0000000000000000 RSI: 0000000000000064 RDI: 0000000000000038
Apr 18 17:35:52 swlab155 kernel: RBP: ffff81014dd23080 R08: ffff8100dfe9e000 R09: 0000000000000000
Apr 18 17:35:52 swlab155 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000068
Apr 18 17:35:52 swlab155 kernel: R13: 0000000000000064 R14: 0000000000000038 R15: 0000000000000000
Apr 18 17:35:52 swlab155 kernel: FS:  0000000000000000(0000) GS:ffffffff80522c80(0000) knlGS:0000000000000000
Apr 18 17:35:52 swlab155 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Apr 18 17:35:52 swlab155 kernel: CR2: 0000000000000038 CR3: 000000015c626000 CR4: 00000000000006e0
Apr 18 17:35:52 swlab155 kernel: Process events/0 (pid: 6, threadinfo ffff8100dfe9e000, task ffff8100dff02750)
Apr 18 17:35:52 swlab155 kernel: Stack: 0000000000000292 ffffffff8018663b 0000000000000286 ffff81014dec1680 
Apr 18 17:35:52 swlab155 kernel:        ffff81014dec1718 ffff8100dffa2000 ffff81014dec1680 0000000000000292 
Apr 18 17:35:52 swlab155 kernel:        ffffffff8804a8c2 ffffffff8804a956 
Apr 18 17:35:52 swlab155 kernel: Call Trace:<ffffffff8018663b>{aio_complete+129} <ffffffff8804a8c2>{:ib_sdp:do_iocb_complete+0} 
Apr 18 17:35:52 swlab155 kernel:        <ffffffff8804a956>{:ib_sdp:do_iocb_complete+148} <ffffffff80140b1f>{worker_thread+476} 
Apr 18 17:35:52 swlab155 kernel:        <ffffffff8012d10b>{default_wake_function+0} <ffffffff8012d10b>{default_wake_function+0} 
Apr 18 17:35:52 swlab155 kernel:        <ffffffff80140943>{worker_thread+0} <ffffffff80144a12>{kthread+206} 
Apr 18 17:35:52 swlab155 kernel:        <ffffffff8010dc43>{child_rip+8} <ffffffff80144944>{kthread+0} 
Apr 18 17:35:52 swlab155 kernel:        <ffffffff8010dc3b>{child_rip+0} 
Apr 18 17:35:52 swlab155 kernel: 
Apr 18 17:35:52 swlab155 kernel: Code: f0 fe 0f 0f 88 8b 01 00 00 48 8b 04 24 48 83 c4 08 c3 fa f0 
Apr 18 17:35:52 swlab155 kernel: RIP <ffffffff80389f5e>{_spin_lock_irqsave+9} RSP <ffff8100dfe9fe08>
Apr 18 17:35:52 swlab155 kernel: CR2: 0000000000000038

-- 
MST - Michael S. Tsirkin


From roland at topspin.com  Mon Apr 18 07:42:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 07:42:36 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il>
Message-ID: <52u0m4cfj7.fsf@topspin.com>

    Michael> You also need the patch below to enable static libraries.
    Michael> Then when you link you must pass

    Michael>  -u openib_driver_init -rdynamic

    Michael> to gcc, to pull in the driver library.  Roland, please
    Michael> let me know whether you plan to apply this and the
    Michael> previous patch.

I was waiting to see what changes you required to libmthca.  However I
don't really see the point of deleting AC_DISABLE_STATIC -- if someone
wants a static library then all that's required is passing
"--enable-static" to the configure script.  What am I missing?

I guess I will apply the patch to libibverbs to call openib_driver_init
if it's linked in directly.

 - R.


From roland at topspin.com  Mon Apr 18 07:42:37 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 07:42:37 -0700
Subject: [openib-general] Re: openIB gen2 user space verbs API
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
	<20050415215836.GA6479@cse.ohio-state.edu>
	<52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il>
Message-ID: <52oecccfj6.fsf@topspin.com>

    Michael> An application would need to know what values is it legal
    Michael> to pass to create_qp.  Maybe it makes sence to implement
    Michael> something like query_hca, and let it return the maximum
    Michael> legal value?

The Mellanox VAPI just used the maximum number of sg entries for the
send queue to calculate the inline data value for a QP.  Does it make
sense to change this interface?

 - R.


From surs at cse.ohio-state.edu  Mon Apr 18 07:52:09 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Mon, 18 Apr 2005 10:52:09 -0400
Subject: [openib-general] performance tests uploaded to contrib
In-Reply-To: <20050418130856.GF17566@mellanox.co.il>
References: <20050417113647.GF16996@mellanox.co.il>
	<20050417195432.GA22185@cse.ohio-state.edu>
	<20050417204007.GB9442@mellanox.co.il>
	<4263AF80.1000601@cse.ohio-state.edu>
	<20050418130856.GF17566@mellanox.co.il>
Message-ID: <20050418145208.GA23304@cse.ohio-state.edu>

* On Apr,5 Michael S. Tsirkin<mst at mellanox.co.il> wrote :
> Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> > Subject: Re: [openib-general] performance tests uploaded to contrib
> > 
> > Michael S. Tsirkin wrote:
> > >Quoting r. Sayantan Sur <surs at cse.ohio-state.edu>:
> > >
> > >>Subject: Re: [openib-general] performance tests uploaded to contrib
> > >>
> > >>Michael,
> > >>
> > >>
> > >>>Current results:
> > >>>
> > >>>I currently observe latency below 3.5 usec.
> > >>>
> > >>>Drop me a note if you find this useful.
> > >>
> > >>Thanks for putting this up on the contrib tree. I have run this
> > >>rdma_latency, and am getting around 3.35 us (without switch).
> > >>
> > >>Our platform is Dual Xeon EM64T with RH AS 4. Kernel version 2.6.11.7
> > >>
> > >>Do you have any idea when a port of the popular `perf_main' will be
> > >>available? As more people try to use the Gen2 verbs, `perf_main' (or
> > >>something similar) can help people evaluate different IB operations and
> > >>also to have example code to use different features of IB.
> > >>
> > >>Thanks,
> > >>Sayantan.
> > >>
> > >
> > >
> > >I dont plan to port the monolithic perf_main to gen2.
> > >
> > >Instead, I plan to upload a set of microbenchmarks each testing
> > >a specific feature: rdma latency/rc send latency/rdma bandwidth/rc send
> > >bandwidth etc.
> > >
> > >I hope that this will help achieve better code clarity than what we 
> > >have in perf_main.
> > 
> > That is fine. As long as there is some ibverbs level benchmark suite, it 
> > will ease the transition.
> > 
> > >
> > >What are the features you are most interested in?
> > 
> > By RDMA latency/bandwidth, do you mean both RDMA write & read?
> 
> RDMA write test is out there, I hope to upload the read test RSN.
> 
> > Will 
> > there be any Atomic latency tests also?
> 
> Sure, why not. Is it a priority for you?

Nope, not a priority. RDMA Write/Read will do for some time to come.

Thanks,
Sayantan.

> 
> > >
> > >By the way, you can already see some example code in libibverbs/examples,
> > >although that is not necessarily benchmark-oriented.
> > >I used the code in pingpong.c as a starting point for rdma latency
> > >test, and it can be used with very little changes as rc send latency
> > >test.
> > 
> > Yes, this was helpful for me too. Apart from providing code examples, I 
> > was suggesting that a much more comprehensive benchmark suite will help 
> > people moving from other vendor stacks to Gen2 to have a quick 
> > comparison of performance offered by both stacks.
> > 
> > Thanks,
> > Sayantan.
> > 
> 
> I agree.
> 
> -- 
> MST - Michael S. Tsirkin

-- 
---------------------------------------------------------
Sayantan Sur            Graduate Research Assistant

395 Dreese Labs,        Computer Science and Engineering
Ohio State University,  Office : 774, Dreese Labs
Columbus,               email  : surs at cse.ohio-state.edu
Ohio - 43210.           phone(res) : 614.688.9792
USA.                    phone(off) : 614.292.8501
---------------------------------------------------------


From mst at mellanox.co.il  Mon Apr 18 08:24:52 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 18:24:52 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52u0m4cfj7.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com>
	<20050417135658.GK16996@mellanox.co.il>
	<52u0m4cfj7.fsf@topspin.com>
Message-ID: <20050418152452.GH17566@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [PATCH] uverbs with static libraries
> 
>     Michael> You also need the patch below to enable static libraries.
>     Michael> Then when you link you must pass
> 
>     Michael>  -u openib_driver_init -rdynamic
> 
>     Michael> to gcc, to pull in the driver library.

By the way, any ideas on how to make this step easier for
users of the static library?

>     Michael> Roland, please let me know whether you plan to
>     Michael> apply this and the previous patch.
> 
> I was waiting to see what changes you required to libmthca.  However I
> don't really see the point of deleting AC_DISABLE_STATIC -- if someone
> wants a static library then all that's required is passing
> "--enable-static" to the configure script.  What am I missing?

Does this work for you? Is mthca.a created?

This does not seem to work for me: the static library isn't
created unless I remove the AC_DISABLE_STATIC.

Put another way, what's the harm in always building the
static version as well? Other libraries (e.g. libibverbs)
build both static and shared versions by default.

> I guess I will apply the patch to libibverbs to call openib_driver_init
> if it's linked in directly.
> 
>  - R.
> 

-- 
MST - Michael S. Tsirkin


From roland at topspin.com  Mon Apr 18 07:57:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 07:57:38 -0700
Subject: [openib-general] Re: Gen2 User verbs usage
References: <20050415182702.GA5572@cse.ohio-state.edu>
	<52hdi7kd2o.fsf@topspin.com> <20050416165804.GA854@mellanox.co.il>
Message-ID: <52is2kceu5.fsf@topspin.com>

    Michael> There currently doesnt seem to exist a way for userspace
    Michael> to know when is setting the INLINE flag possible.

I seem to remember some discussion where we said the inline flag was
just a hint for the low-level driver, so it's always possible to set it.

    Michael> It would seem what we need is another attribute passed to
    Michael> create_qp that would specify the max inline buffer size,
    Michael> and probably an hca attribute to give the maximum legal
    Michael> value for this attribute.

Do we care enough about this special feature to do this?  It seems
Mellanox VAPI worked out OK without that.

 - R.


From robert.j.woodruff at intel.com  Mon Apr 18 08:27:10 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 18 Apr 2005 08:27:10 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <20050416020715.GG30386@esmail.cup.hp.com>
Message-ID: <ORSMSX408FRaqbC8wSA0000000d@orsmsx408.amr.corp.intel.com>

 
>On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote:
>> Might want to have something like, 
>> 
>> diff-2.6.9-01-openib_drivers-SVNxxx
>> diff-2.6.9-02-openib_fixup-SVNxxx
>> diff-2.6.9-03-ib_kernel_changes
>> 
>> where xxx is the SVN version
>> 
>> so that we can have fixups that match a specific SVN version.

>Sure, that's a good idea.

>And it reminds me that we still have no clue which version
>of SVN someone's IB kernel driver is based on. Supporting
>this is going to be painful unless this is dealt with.
>Anyone have a clue how to get SVN or "make" to embed
>the version number in any resulting .o ?

>It's bad enough when distro's patch drivers without updating
>the rev number. But not having a reliable rev number to start
>with is even worse.

>thanks,
>grant

Ok, I will look at generating patches for the base drivers, the fixups, and
the kernel patches. I also think that it is a good idea to affix an SVN rev
or some other rev. number to a particular driver. Not sure of the best
way to do this. In the past I have used schemes where the rev number is
generated with a build number and date and put into a string in a version.h
file that is built into the module.
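
A minimal sketch of such a scheme, under stated assumptions: the build is
assumed to pass the working-copy revision (for example the output of
svnversion) on the compiler command line, and the source folds it into one
string at compile time. The macro names OPENIB_SVN_REV and
OPENIB_BUILD_DATE and both functions are hypothetical, not anything that
exists in the tree; in a kernel module the same string would presumably be
handed to MODULE_VERSION instead of being returned from a function.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macros: the build system is assumed to pass them, e.g.
 *   cc -DOPENIB_SVN_REV="\"$(svnversion .)\"" ...
 * The fallbacks keep the file compilable when they are not supplied. */
#ifndef OPENIB_SVN_REV
#define OPENIB_SVN_REV "unknown"
#endif
#ifndef OPENIB_BUILD_DATE
#define OPENIB_BUILD_DATE __DATE__
#endif

/* Adjacent string literals are concatenated at compile time, so the
 * revision ends up embedded in the object file itself. */
static const char openib_version[] =
	"openib SVN r" OPENIB_SVN_REV " built " OPENIB_BUILD_DATE;

/* Hypothetical accessor: returns the embedded version string. */
const char *openib_version_string(void)
{
	/* The literal prefix guarantees a non-empty string. */
	assert(openib_version[0] != '\0');
	return openib_version;
}

/* Returns 1 if the build supplied a revision, 0 if the fallback is in use. */
int openib_version_is_known(void)
{
	return strcmp(OPENIB_SVN_REV, "unknown") != 0;
}
```

Because the revision comes from the build rather than from a per-file
keyword, it describes the whole working copy, which sidesteps the
$Revision$ limitation raised earlier in the thread.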

Not sure what is the best thing for this project. 

woody


From mst at mellanox.co.il  Mon Apr 18 08:33:29 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 18:33:29 +0300
Subject: [openib-general] Re: Gen2 User verbs usage
In-Reply-To: <52is2kceu5.fsf@topspin.com>
References: <20050415182702.GA5572@cse.ohio-state.edu>
	<52hdi7kd2o.fsf@topspin.com> <20050416165804.GA854@mellanox.co.il>
	<52is2kceu5.fsf@topspin.com>
Message-ID: <20050418153329.GI17566@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: Gen2 User verbs usage
> 
>     Michael> There currently doesnt seem to exist a way for userspace
>     Michael> to know when is setting the INLINE flag possible.
> 
> I seem to remember some discussion where we said the inline flag was
> just a hint for the low-level driver, so it's always possible to set it.

I'm not against this approach on principle. What bothers me is that a
latency vs CPU utilization tradeoff is involved, so the low-level driver
may not be the right place to make that decision.

There's also the point that for inline you don't need to pass in
a valid rkey, so the app may be better off knowing about it.

Finally, if it's just a hint, a separate pass over the s/g list
would be needed to calculate the size and check that it fits inline,
which implies a performance penalty.
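
To make that cost concrete, here is a rough sketch of the extra pass. It is
an illustration only: struct sge is a stand-in with the same three fields
as a gather entry, and fits_inline is a hypothetical helper, not a function
in libibverbs or mthca.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a gather entry (address, length, local key). */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Extra pass over the gather list: returns 1 if the total payload fits
 * in max_inline bytes, 0 otherwise. Stops as soon as the limit is
 * exceeded, but the common (fitting) case still touches every entry.
 */
int fits_inline(const struct sge *list, int num_sge, uint32_t max_inline)
{
	uint64_t total = 0;
	int i;

	assert(num_sge == 0 || list != NULL);
	for (i = 0; i < num_sge; i++) {
		total += list[i].length;
		if (total > max_inline)
			return 0;
	}
	return 1;
}
```

If the application knows the limit up front, it can make this decision once
per message size instead of the driver repeating the loop on every post.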

What is your opinion?

>     Michael> It would seem what we need is another attribute passed to
>     Michael> create_qp that would specify the max inline buffer size,
>     Michael> and probably an hca attribute to give the maximum legal
>     Michael> value for this attribute.
> 
> Do we care enough about this special feature to do this?  It seems
> Mellanox VAPI worked out OK without that.
> 
>  - R.
> 

It may be that users are only recently waking up to this feature.
Latency benefits for small to medium sized messages are significant.

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Mon Apr 18 08:41:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 18:41:46 +0300
Subject: [openib-general] Re: openIB gen2 user space verbs API
In-Reply-To: <52oecccfj6.fsf@topspin.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
	<20050415215836.GA6479@cse.ohio-state.edu>
	<52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il>
	<52oecccfj6.fsf@topspin.com>
Message-ID: <20050418154146.GJ17566@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: openIB gen2 user space verbs API
> 
>     Michael> An application would need to know what values it is legal
>     Michael> to pass to create_qp.  Maybe it makes sense to implement
>     Michael> something like query_hca, and let it return the maximum
>     Michael> legal value?
> 
> The Mellanox VAPI just used the maximum number of sg entries for the
> send queue to calculate the inline data value for a QP.  Does it make
> sense to change this interface?
> 
>  - R.
> 

It was actually you who first proposed the change :) But I'd like
to defend that decision.

I think applications have different latency/CPU utilization tradeoffs.
A microbenchmark may want to push as much data as possible inline,
another application may not.

So I think what you end up with under the VAPI API is applications starting with
the inline size they actually want and hardcoding the Tavor ratio of inline size
to s/g entries to get the right values in query qp.

If they do it right they will even work with other HCAs, just more
slowly (say the app checks qp properties and sees it didn't get the inline size
that it wanted, so it doesn't use inline), but isn't what I proposed cleaner?

Since it's an ABI change anyway, let's do it right?

It's easy enough to implement - just say the word ...

-- 
MST - Michael S. Tsirkin


From robert.j.woodruff at intel.com  Mon Apr 18 08:41:08 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 18 Apr 2005 08:41:08 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
Message-ID: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408>

 
>>On Fri, Apr 15, 2005 at 05:23:25PM -0700, Bob Woodruff wrote:
>>> Might want to have something like, 
>>> 
>>> diff-2.6.9-01-openib_drivers-SVNxxx
>>> diff-2.6.9-02-openib_fixup-SVNxxx
>>> diff-2.6.9-03-ib_kernel_changes
>>> 
>>> where xxx is the SVN version
>>> 
>>> so that we can have fixups that match a specific SVN version.

>>Sure, that's a good idea.

>Ok, I will look at generating patches for the base drivers, the fixups,
and
>the kernel patches. 

Roland, do you know what the SVN rev was for the latest code that was
submitted to 2.6.12-rc2-mm3? That is the version that we discussed
starting with for an initial 2.6.9 backport, but as suggested, I want to
embed the SVN rev. into the file name of the patches for clarity.

woody


From mst at mellanox.co.il  Mon Apr 18 08:48:46 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 18:48:46 +0300
Subject: [openib-general] Re: openIB gen2 user space verbs API
In-Reply-To: <52oecccfj6.fsf@topspin.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>
	<527jj4kpkv.fsf@topspin.com>
	<20050415215836.GA6479@cse.ohio-state.edu>
	<52r7hbhexk.fsf@topspin.com> <20050416172303.GC854@mellanox.co.il>
	<52oecccfj6.fsf@topspin.com>
Message-ID: <20050418154846.GK17566@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: openIB gen2 user space verbs API
> 
>     Michael> An application would need to know what values it is legal
>     Michael> to pass to create_qp.  Maybe it makes sense to implement
>     Michael> something like query_hca, and let it return the maximum
>     Michael> legal value?
> 
> The Mellanox VAPI just used the maximum number of sg entries for the
> send queue to calculate the inline data value for a QP.  Does it make
> sense to change this interface?
> 
>  - R.
> 

Let's look at an example like pingpong.c.

I think it's cleanest for this test to have an --inline flag which would
make all work requests inline.
The test will then create the qp, setting the right inline size.

If the user passed a size too big to be inline, we want the test to
be able to detect this and print a clear error message, not fail in
create_qp.
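
As a rough sketch of the check I have in mind (the function name is
hypothetical, and max_inline stands for whatever limit the device would
report through a query call that does not exist yet):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper for a pingpong-style test.
 * Returns 0 if msg_size can be sent inline, or -1 after printing a
 * diagnostic, so the test can exit cleanly before calling create_qp.
 */
int check_inline_size(uint32_t msg_size, uint32_t max_inline)
{
	if (msg_size > max_inline) {
		fprintf(stderr,
			"message size %u exceeds max inline size %u\n",
			(unsigned) msg_size, (unsigned) max_inline);
		return -1;
	}
	return 0;
}
```

Without a way to learn max_inline up front, the test has nothing to compare
against and can only find out by watching create_qp fail.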

Right?

-- 
MST - Michael S. Tsirkin


From roland at topspin.com  Mon Apr 18 08:42:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:42:38 -0700
Subject: [openib-general] Advice about adapting ibv_pingpong to use UC
References: <42627B5A.7010709@wooding.uklinux.net>
Message-ID: <52d5ssccr5.fsf@topspin.com>

    Steven> Hi, I wonder if someone working on the gen2 uverbs would
    Steven> be so kind as to give me some advice on adapting the
    Steven> ibv_pingpong program to use a UC QP type rather than RC. I
    Steven> was previously able to do this with the Mellanox stack by
    Steven> changing the qp_type attribute and then not setting
    Steven> variables that are only needed for RC (timeout and retry
    Steven> periods etc).

Unfortunately UC support has not been implemented in the gen2 stack.
It probably wouldn't be that hard but I can't say when I'll get a
chance to look at it.

 - R.


From roland at topspin.com  Mon Apr 18 08:42:41 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:42:41 -0700
Subject: [openib-general] Static Rate Questions
References: <506C3D7B14CDD411A52C00025558DED6078F2B30@mtlex01.yok.mtl.com>
Message-ID: <527jj0ccr2.fsf@topspin.com>

    Dror> - In struct ib_ah_attr, static_rate is defined as u8. What
    Dror> are the expected values that static_rate is supposed to take
    Dror> ? Is it absolute Gb/s ? Gb/s in 2.5Gb/s units ? or relative
    Dror> rate to port speed ?

My understanding is that it is the inter-packet delay.  However I
don't have any strong objection to changing this interface.

    Dror> - In mthca, there are two places setting up the
    Dror> static_rate. One for AH which looks fine. The other one for
    Dror> QP which I believe has a bug.

Yes, you're right.  I'm not sure where the " << 3" came from; I've
deleted it.
	
    Dror> - A question for next generation HW. Would you find it more
    Dror> useful that the HCA supports static rate as an absolute
    Dror> speed (Gb/s) or as a relative ratio to the current port
    Dror> speed ?

I'm not sure it makes much difference one way or the other.  We're not
talking about a lot of complex code to deal with static rates,
whatever the hardware interface is.

 - R.


From roland at topspin.com  Mon Apr 18 08:46:43 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:46:43 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408> (Robert J.
	Woodruff's message of "Mon, 18 Apr 2005 08:41:08 -0700")
References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408>
Message-ID: <523btocckc.fsf@topspin.com>

    Robert> Roland, do you know what the SVN rev was for that latest
    Robert> code that was submitted to 2.6.12-rc2-mm3. That is the
    Robert> version that we discussed starting with for an initial
    Robert> 2.6.9 backport, but as suggested, I want to embed the SVN
    Robert> rev. into the file name of the patches for clarity.

It doesn't really make sense to talk about the svn rev for that tree,
since I went through and picked some patches but not others to merge
upstream.

 - R.


From timur.tabi at ammasso.com  Mon Apr 18 09:09:35 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 11:09:35 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52mzs51g5g.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
Message-ID: <4263DBBF.9040801@ammasso.com>

Roland Dreier wrote:
>     Troy> How is memory pinning handled? (I haven't had time to read
>     Troy> all the code, so please excuse my ignorance of something
>     Troy> obvious).
> 
> The userspace library calls mlock() and then the kernel does
> get_user_pages().

Why do you call mlock() and get_user_pages()?  In our code, we only call mlock(), and the 
memory is pinned.  We have a test case that fails if only get_user_pages() is called, but 
it passes if only mlock() is called.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From arjan at infradead.org  Mon Apr 18 09:16:12 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Mon, 18 Apr 2005 18:16:12 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263DBBF.9040801@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<4263DBBF.9040801@ammasso.com>
Message-ID: <1113840973.6274.84.camel@laptopd505.fenrus.org>

On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote:
> Roland Dreier wrote:
> >     Troy> How is memory pinning handled? (I haven't had time to read
> >     Troy> all the code, so please excuse my ignorance of something
> >     Troy> obvious).
> > 
> > The userspace library calls mlock() and then the kernel does
> > get_user_pages().
> 
> Why do you call mlock() and get_user_pages()?  In our code, we only call mlock(), and the 
> memory is pinned. 

this is a myth; linux is free to move the page about in physical memory
even if it's mlock()ed!!

And even then, the user can munlock the memory from another thread etc
etc. Not a good idea.

get_user_pages() is used from AIO and other parts of the kernel for
similar purposes and in fact is designed for it, so it better work. If
it has bugs those should be fixed, not worked around!


From timur.tabi at ammasso.com  Mon Apr 18 09:22:29 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 11:22:29 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050411171347.7e05859f.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
Message-ID: <4263DEC5.5080909@ammasso.com>

Andrew Morton wrote:
> Roland Dreier <roland at topspin.com> wrote:
> 
>>    Troy> Do we even need the mlock in userspace then?
>>
>>Yes, because the kernel may go through and unmap pages from userspace
>>while trying to swap.  Since we have the page locked in the kernel,
>>the physical page won't go anywhere, but userspace might end up with a
>>different page mapped at the same virtual address.
> 
> 
> That shouldn't happen.  If get_user_pages() has elevated the refcount on a
> page then the following can happen:
> 
> - The VM may decide to add the page to swapcache (if it's not mmapped
>   from a file).
> 
> - Once the page is backed by either swapcache of a (mmapped) file, the VM
>   may decide the unmap the application's pte's.  A later minor fault by the
>   app will cause the same physical page to be remapped.

That's not what we're seeing.  We have hardware that does DMA over the network (much like 
the Infiniband stuff), and we have a testcase that fails if get_user_pages() is used, but 
not if mlock() is used.  Consider two computers on a network, X and Y.  Both have our 
hardware, which can transfer a page of memory from a given physical address on X to a 
physical address on Y.

1) Application on X allocates a block of memory, and passes the virtual address to the driver.
2) Driver on X calls get_user_pages() and then obtains a physical address for the memory.
3) Application and driver on Y do the same thing.
4) App X fills memory with some data D.
5) App X then allocates as much memory as it possibly can.  It touches every page in this 
memory, and then frees the memory.  This will force other pages to be swapped out, 
including the supposedly pinned memory.
6) App X then tells Driver X to transfer data D to computer Y.
7) App Y compares data D and finds that it doesn't match what it's supposed to be.

Conclusion: during step 5, the data in pinned memory is swapped out or something.  I'm not 
sure where it goes.
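
For reference, step 5 can be approximated by something like the sketch below.
This is not our actual test code; the function name and the byte cap are
assumptions added so that the loop stays bounded instead of running the
machine out of memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate up to max_bytes in chunk-sized pieces, write one byte to every
 * page so that each page is really instantiated, then free everything.
 * Returns the number of bytes that were allocated and touched.
 */
size_t apply_memory_pressure(size_t max_bytes, size_t chunk)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t nchunks, n = 0, total = 0, off;
	char **chunks;

	assert(page > 0 && chunk >= (size_t) page);

	nchunks = max_bytes / chunk;
	chunks = calloc(nchunks ? nchunks : 1, sizeof *chunks);
	if (!chunks)
		return 0;

	while (n < nchunks) {
		char *p = malloc(chunk);

		if (!p)
			break;		/* out of memory: stop early */
		for (off = 0; off < chunk; off += (size_t) page)
			((volatile char *) p)[off] = 1;	/* fault the page in */
		chunks[n++] = p;
		total += chunk;
	}

	while (n > 0)
		free(chunks[--n]);
	free(chunks);
	return total;
}
```

To actually push other pages out, max_bytes has to exceed free RAM, and on a
system with memory overcommit the process may be killed before malloc()
returns NULL, so the cap matters.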

We can only demonstrate this problem using our hardware, because you need the ability to 
transfer memory without using the CPU.  We were going to prepare a test case and ship some 
hardware to a few kernel developers to prove our point, but now that we're able to call 
mlock() in non-user processes, we decided it wasn't worth our time.  Actually, I 
discovered that I can call cap_raise() and set the ulimit structure, which gives me the 
ability to call mlock() on any amount of memory from any process in 2.4 and 2.6 kernels, 
which we need to support.  If I had thought of that earlier, I wouldn't have needed all 
those hacks to call sys_mlock() from the driver.


-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From timur.tabi at ammasso.com  Mon Apr 18 09:25:20 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 11:25:20 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113840973.6274.84.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>	 <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
Message-ID: <4263DF70.2060702@ammasso.com>

Arjan van de Ven wrote:

> this is a myth; linux is free to move the page about in physical memory
> even if it's mlock()ed!!

Then Linux has a very odd definition of the word "locked".

> And even then, the user can munlock the memory from another thread etc
> etc. Not a good idea.

Well, that's okay, because then the app is doing something stupid, so we don't worry about 
that.

> get_user_pages() is used from AIO and other parts of the kernel for
> similar purposes and in fact is designed for it, so it better work. If
> it has bugs those should be fixed, not worked around!

I've been complaining about get_user_pages() not working for a long time now, but I can 
only demonstrate the problem with our hardware.  See my other post in this thread for details.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From iod00d at hp.com  Mon Apr 18 09:31:39 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 18 Apr 2005 09:31:39 -0700
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9
	kernel
In-Reply-To: <20050416170013.GB854@mellanox.co.il>
References: <20050415232658.GE30386@esmail.cup.hp.com>
	<ORSMSX4081XvpFVjCRG0000000a@orsmsx408.amr.corp.intel.com>
	<20050416020715.GG30386@esmail.cup.hp.com>
	<20050416170013.GB854@mellanox.co.il>
Message-ID: <20050418163139.GB6931@esmail.cup.hp.com>

On Sat, Apr 16, 2005 at 08:00:13PM +0300, Michael S. Tsirkin wrote:
> You have to run the svnversion utility to get the revision.

that's fine if I have the source tree.
(I usually just look in .svn/entries)

But I need to track version number in existing binaries
so I know which source tree they come from.
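
One way to get there is to compile the revision into the binary as a string.
This is only a sketch: the SVN_REV macro and the symbol names are
hypothetical, and it assumes the build passes the svnversion output on the
compiler command line.

```c
#include <assert.h>
#include <string.h>

#ifndef SVN_REV
#define SVN_REV "unknown"	/* e.g. built with -DSVN_REV="\"2171\"" */
#endif

/* A fixed prefix makes the string easy to find in a built binary. */
const char openib_build_rev[] = "openib-svn-rev: " SVN_REV;

/* Return just the revision part of the embedded string. */
const char *build_revision(void)
{
	return openib_build_rev + strlen("openib-svn-rev: ");
}
```

Then something like "strings <binary> | grep openib-svn-rev" would show which
tree a binary came from.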

grant


From roland at topspin.com  Mon Apr 18 08:57:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:57:38 -0700
Subject: [openib-general] Re: patches
References: <20050408093558.GB21709@mellanox.co.il>
	<52psx545sy.fsf@topspin.com> <20050409172150.GA31200@mellanox.co.il>
	<52hdif3ggn.fsf@topspin.com> <20050414080648.GE32526@mellanox.co.il>
Message-ID: <52wtr0axhp.fsf@topspin.com>

Thanks, it turns out I made the same mistake several times:

--- infiniband/hw/mthca/mthca_cmd.c	(revision 2156)
+++ infiniband/hw/mthca/mthca_cmd.c	(working copy)
@@ -1053,7 +1053,7 @@ int mthca_QUERY_ADAPTER(struct mthca_dev
 	MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET);
 
 out:
-	pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma);
+	pci_free_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, outbox, outdma);
 	return err;
 }
 
@@ -1224,7 +1224,7 @@ int mthca_INIT_IB(struct mthca_dev *dev,
 	err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB,
 			CMD_TIME_CLASS_A, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_IB_IN_SIZE, inbox, indma);
 	return err;
 }
 
@@ -1269,7 +1269,7 @@ int mthca_SET_IB(struct mthca_dev *dev, 
 	err = mthca_cmd(dev, indma, port, 0, CMD_SET_IB,
 			CMD_TIME_CLASS_B, status);
 
-	pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma);
+	pci_free_consistent(dev->pdev, INIT_SET_IB_IN_SIZE, inbox, indma);
 	return err;
 }
 

From roland at topspin.com  Mon Apr 18 08:57:39 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 08:57:39 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
References: <20050410084724.GZ20567@mellanox.co.il>
Message-ID: <52r7h8axho.fsf@topspin.com>

It made more sense to me to load a static driver once before calling
find_drivers() for each entry in our driver path.  Is there anything
wrong with this?

 - R.

--- libibverbs/src/init.c	(revision 2156)
+++ libibverbs/src/init.c	(working copy)
@@ -198,6 +198,11 @@ static void INIT ibverbs_init(void)
 	if (ibv_init_mem_map())
 		return;
 
+	/*
+	 * Check if a driver is statically linked, and if so load it first.
+	 */
+	load_driver(NULL);
+
 	user_path = getenv(OPENIB_DRIVER_PATH_ENV);
 	if (user_path) {
 		wr_path = strdupa(user_path);


From roland at topspin.com  Mon Apr 18 09:11:56 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 09:11:56 -0700
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <20050418152452.GH17566@mellanox.co.il> (Michael S. Tsirkin's
	message of "Mon, 18 Apr 2005 18:24:52 +0300")
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com> <20050417135658.GK16996@mellanox.co.il>
	<52u0m4cfj7.fsf@topspin.com> <20050418152452.GH17566@mellanox.co.il>
Message-ID: <52mzrwawtv.fsf@topspin.com>

    Michael> Does this work for you? Is mthca.a created?

I just tried this:

    $ ../libmthca/configure  --enable-static CPPFLAGS=-I$(pwd)/../libibverbs/include --prefix=$HOME/junk
    $ make
    $ make install

and I get this:

    $ tree ~/junk
    /data/home/roland/junk
    `-- lib
        `-- infiniband
            |-- mthca.a
            |-- mthca.la
            `-- mthca.so

so yes, it looks like it works.

    Michael> Put another way - whats the harm in always building the
    Michael> static version as well? Other libraries (e.g. libibverbs)
    Michael> build both static and shared versions by default.

I don't think of libmthca as a library really.  It's a plug in loaded
by libibverbs.  In some specialized circumstances it may be useful to
build it statically but in general it's just unneeded confusion.

 - R.


From roland at topspin.com  Mon Apr 18 09:12:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 09:12:45 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263DBBF.9040801@ammasso.com> (Timur Tabi's message of "Mon,
	18 Apr 2005 11:09:35 -0500")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<4263DBBF.9040801@ammasso.com>
Message-ID: <52is2kawsi.fsf@topspin.com>

    Timur> Why do you call mlock() and get_user_pages()?  In our code,
    Timur> we only call mlock(), and the memory is pinned.  We have a
    Timur> test case that fails if only get_user_pages() is called,
    Timur> but it passes if only mlock() is called.

What if a buggy/malicious userspace program doesn't call mlock()?

 - R.


From roland at topspin.com  Mon Apr 18 09:27:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 09:27:38 -0700
Subject: [openib-general] openIB gen2 user space verbs API
References: <D4F8F0B3820E754C887699BEF26A89406AF22B@taurus.voltaire.com>
Message-ID: <52d5ssaw3p.fsf@topspin.com>

    Or> Other than getting the max inline size, query qp is used to
    Or> get the current QP state, examples are app error flow (eg when
    Or> modify qp failed) and app APM flow to sense some of the state
    Or> transitions done by the HW. These are only examples I quickly
    Or> thought of, I guess there are more.

Which applications are using these operations as you describe?

 - R.


From iod00d at hp.com  Mon Apr 18 09:34:50 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 18 Apr 2005 09:34:50 -0700
Subject: [openib-general] [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
In-Reply-To: <523btocckc.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408>
	<523btocckc.fsf@topspin.com>
Message-ID: <20050418163450.GC6931@esmail.cup.hp.com>

On Mon, Apr 18, 2005 at 08:46:43AM -0700, Roland Dreier wrote:
>     Robert> Roland, do you know what the SVN rev was for that latest
>     Robert> code that was submitted to 2.6.12-rc2-mm3. That is the
>     Robert> version that we discussed starting with for an initial
>     Robert> 2.6.9 backport, but as suggested, I want to embed the SVN
>     Robert> rev. into the file name of the patches for clarity.
> 
> It doesn't really make sense to talk about the svn rev for that tree,
> since I went through and picked some patches but not others to merge
> upstream.

I agree.
What gets delivered by kernel.org has its own version control.
I'm mostly concerned about when people pull from SVN on openib.org.

grant


From mshefty at ichips.intel.com  Mon Apr 18 09:38:17 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 18 Apr 2005 09:38:17 -0700
Subject: [openib-general] Re: openIB gen2 user space verbs API
In-Reply-To: <52oecccfj6.fsf@topspin.com>
References: <OFBA2DAD75.1AFBF257-ONC1256FE4.004885E4-C1256FE4.0048CCEE@de.ibm.com>	<527jj4kpkv.fsf@topspin.com>	<20050415215836.GA6479@cse.ohio-state.edu>	<52r7hbhexk.fsf@topspin.com>
	<20050416172303.GC854@mellanox.co.il> <52oecccfj6.fsf@topspin.com>
Message-ID: <4263E279.1010401@ichips.intel.com>

Roland Dreier wrote:
>     Michael> An application would need to know what values it is legal
>     Michael> to pass to create_qp.  Maybe it makes sense to implement
>     Michael> something like query_hca, and let it return the maximum
>     Michael> legal value?
> 
> The Mellanox VAPI just used the maximum number of sg entries for the
> send queue to calculate the inline data value for a QP.  Does it make
> sense to change this interface?

IMO, from a pure API perspective, max inline size and SG entries should 
be separate, even if a current implementation ties them together.
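
A minimal sketch of what I mean, with the two limits carried as independent
fields. The struct and function names here are hypothetical and only
illustrate the shape of the interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability block; names are assumptions for illustration. */
struct qp_cap_sketch {
	uint32_t max_send_sge;		/* s/g entries per send work request */
	uint32_t max_inline_data;	/* bytes that may be copied inline */
};

/*
 * Clamp a requested capability set against device limits field by field,
 * so the inline size is never derived from the s/g count.
 */
struct qp_cap_sketch clamp_qp_cap(struct qp_cap_sketch want,
				  struct qp_cap_sketch dev_max)
{
	struct qp_cap_sketch got = want;

	if (got.max_send_sge > dev_max.max_send_sge)
		got.max_send_sge = dev_max.max_send_sge;
	if (got.max_inline_data > dev_max.max_inline_data)
		got.max_inline_data = dev_max.max_inline_data;
	return got;
}

/* The application falls back to a normal send if inline was not granted. */
int use_inline(struct qp_cap_sketch got, uint32_t msg_size)
{
	return got.max_inline_data != 0 && msg_size <= got.max_inline_data;
}
```

An implementation that ties the two together internally can still fill in
both fields; the caller just never has to know the ratio.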

- Sean


From shaharf at voltaire.com  Mon Apr 18 09:39:25 2005
From: shaharf at voltaire.com (shaharf)
Date: Mon, 18 Apr 2005 19:39:25 +0300
Subject: [openib-general] mthca_cmd is broken?
Message-ID: <D4F8F0B3820E754C887699BEF26A8940545B84@taurus.voltaire.com>

It seems that there is a naming problem with INIT_SET_IB_IN_SIZE. I
guess it should be SET_IB_IN_SIZE.

 
Shahar


From hch at infradead.org  Mon Apr 18 09:43:16 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Mon, 18 Apr 2005 17:43:16 +0100
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263DEC5.5080909@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
Message-ID: <20050418164316.GA27697@infradead.org>

On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote:
> That's not what we're seeing.  We have hardware that does DMA over the 
> network (much like the Infiniband stuff), and we have a testcase that fails 
> if get_user_pages() is used, but not if mlock() is used.

If you don't share your testcase it's unlikely to be fixed.


From timur.tabi at ammasso.com  Mon Apr 18 09:45:57 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 11:45:57 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050418164316.GA27697@infradead.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
Message-ID: <4263E445.8000605@ammasso.com>

Christoph Hellwig wrote:
> On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote:
> 
>>That's not what we're seeing.  We have hardware that does DMA over the 
>>network (much like the Infiniband stuff), and we have a testcase that fails 
>>if get_user_pages() is used, but not if mlock() is used.
> 
> 
> If you don't share your testcase it's unlikely to be fixed.

As I said, the testcase only works with our hardware, and it's also very large.  It's one 
small test that's part of a huge test suite.  It takes a couple hours just to install the 
damn thing.

We want to produce a simpler test case that demonstrates the problem in an 
easy-to-understand manner, but we don't have time to do that now.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From timur.tabi at ammasso.com  Mon Apr 18 09:50:06 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 11:50:06 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52is2kawsi.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>	<4263DBBF.9040801@ammasso.com>
	<52is2kawsi.fsf@topspin.com>
Message-ID: <4263E53E.3090107@ammasso.com>

Roland Dreier wrote:
>     Timur> Why do you call mlock() and get_user_pages()?  In our code,
>     Timur> we only call mlock(), and the memory is pinned.  We have a
>     Timur> test case that fails if only get_user_pages() is called,
>     Timur> but it passes if only mlock() is called.
> 
> What if a buggy/malicious userspace program doesn't call mlock()?

Our library calls mlock() when the app requests memory to be "registered".  We then call 
munlock() when the app requests the memory to be unregistered.  All apps talk to our 
library for all services.  No apps talk to the driver directly.
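
In outline, the register and unregister paths look something like the sketch
below. This is not our library's code; the names are made up, and only the
page rounding and the mlock()/munlock() calls are the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct page_range {
	uintptr_t start;	/* rounded down to a page boundary */
	size_t    len;		/* multiple of the page size */
};

/* Expand [addr, addr + len) to whole pages; page must be a power of two. */
struct page_range page_align(const void *addr, size_t len, size_t page)
{
	struct page_range r;
	uintptr_t begin = (uintptr_t) addr;
	uintptr_t end;

	assert(page != 0 && (page & (page - 1)) == 0);

	end     = (begin + len + page - 1) & ~(uintptr_t) (page - 1);
	r.start = begin & ~(uintptr_t) (page - 1);
	r.len   = end - r.start;
	return r;
}

/* Hypothetical "register": lock every page the buffer touches. */
int register_memory_sketch(const void *addr, size_t len)
{
	struct page_range r = page_align(addr, len,
					 (size_t) sysconf(_SC_PAGESIZE));

	return mlock((void *) r.start, r.len);
}

/* Hypothetical "unregister": unlock the same pages. */
int unregister_memory_sketch(const void *addr, size_t len)
{
	struct page_range r = page_align(addr, len,
					 (size_t) sysconf(_SC_PAGESIZE));

	return munlock((void *) r.start, r.len);
}
```

One caveat with this shape: memory locks do not stack, so if two registered
buffers share a page, unregistering the first one unlocks that page for the
second as well unless the library keeps its own per-page count.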

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From roland at topspin.com  Mon Apr 18 09:45:18 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 18 Apr 2005 09:45:18 -0700
Subject: [openib-general] mthca_cmd is broken?
In-Reply-To: <D4F8F0B3820E754C887699BEF26A8940545B84@taurus.voltaire.com>
	(shaharf@voltaire.com's
	message of "Mon, 18 Apr 2005 19:39:25 +0300")
References: <D4F8F0B3820E754C887699BEF26A8940545B84@taurus.voltaire.com>
Message-ID: <528y3gava9.fsf@topspin.com>

Yes, you're right.  Fixed.


From timur.tabi at ammasso.com  Mon Apr 18 10:15:02 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 12:15:02 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050412180447.E6958@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<20050412180447.E6958@topspin.com>
Message-ID: <4263EB16.1090904@ammasso.com>

Libor Michalek wrote:

> The problem we were seeing is that the minor fault by the app resulted
> in a new physical page getting mapped for the application. The page that
> had the elevated refcount was still waiting for the data to be written
> to by the driver at the time that the app accessed the page causing the
> minor fault. Obviously since the app had a new mapping the data written
> by the driver was lost.

Thanks Libor, this is a much better explanation of the problem than what I posted.

> It looks like code was added to try_to_unmap_one() to address this, so
> hopefully it's no longer an issue...

I doubt it.  I tried this with an earlier 2.6 kernel, and get_user_pages() was still not 
enough to really pin the memory down.  Maybe it works in 2.6.12, but that doesn't help me 
any, because our driver needs to support all 2.4 and 2.6 kernels.  Currently, mlock() 
alone seems to be good enough, but I'm going to add calls to get_user_pages() just to be sure.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From mst at mellanox.co.il  Mon Apr 18 10:39:55 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 20:39:55 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52r7h8axho.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il>
	<52r7h8axho.fsf@topspin.com>
Message-ID: <20050418173955.GB19702@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [PATCH] uverbs with static libraries
> 
> It made more sense to me to load a static driver once before calling
> find_drivers() for each entry in our driver path.  Is there anything
> wrong with this?
> 
>  - R.
> 
> --- libibverbs/src/init.c	(revision 2156)
> +++ libibverbs/src/init.c	(working copy)
> @@ -198,6 +198,11 @@ static void INIT ibverbs_init(void)
>  	if (ibv_init_mem_map())
>  		return;
>  
> +	/*
> +	 * Check if a driver is statically linked, and if so load it first.
> +	 */
> +	load_driver(NULL);
> +
>  	user_path = getenv(OPENIB_DRIVER_PATH_ENV);
>  	if (user_path) {
>  		wr_path = strdupa(user_path);
> 

ok

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Mon Apr 18 11:15:19 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 21:15:19 +0300
Subject: [openib-general] Re: [PATCH] uverbs with static libraries
In-Reply-To: <52mzrwawtv.fsf@topspin.com>
References: <20050410084724.GZ20567@mellanox.co.il>
	<52is2t1g1m.fsf@topspin.com>
	<20050417135658.GK16996@mellanox.co.il>
	<52u0m4cfj7.fsf@topspin.com>
	<20050418152452.GH17566@mellanox.co.il>
	<52mzrwawtv.fsf@topspin.com>
Message-ID: <20050418181519.GA19943@mellanox.co.il>

Hi, Roland!

Quoting r. Roland Dreier <roland at topspin.com>:
>     Michael> Put another way - what's the harm in always building the
>     Michael> static version as well? Other libraries (e.g. libibverbs)
>     Michael> build both static and shared versions by default.
> 
> I don't think of libmthca as a library really.  It's a plug in loaded
> by libibverbs.

What's the point of a static libibverbs then?
Some people may want to build an executable without external
dependencies; they clearly need both libraries static.
Others may not care; they may be better off with shared.

> In some specialized circumstances it may be useful to
> build it statically

Hopefully it will be there for the developer, who will have no need to
build it. If the default is not to include the static version, distributions
won't package it, so it won't be there for developers to use.

If the default does build both static and shared versions,
distributions will put the static version in a separate -devel rpm
together with header files, so people who don't build apps won't need it.

> but in general it's just unneeded confusion.
> 
>  - R.

What kind of confusion?

-- 
MST - Michael S. Tsirkin


From libor at topspin.com  Mon Apr 18 11:15:26 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 18 Apr 2005 11:15:26 -0700
Subject: [openib-general] Re: ttcp.aio - kernel NULL pointer dereference
In-Reply-To: <20050418145046.GG17566@mellanox.co.il>;
	from mst@mellanox.co.il on Mon, Apr 18, 2005 at 05:50:46PM +0300
References: <20050418145046.GG17566@mellanox.co.il>
Message-ID: <20050418111526.A7553@topspin.com>

On Mon, Apr 18, 2005 at 05:50:46PM +0300, Michael S. Tsirkin wrote:
> Hello, Libor!
> Every once in a while, when I run ttcp 
> 
> I get a kernel NULL pointer dereference from SDP
> 
> My kernel is 2.6.11 + latest openib svn (rev 2171).

  Michael, on what type of system are you seeing this?

-Libor


From mst at mellanox.co.il  Mon Apr 18 11:29:47 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 21:29:47 +0300
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9
	kernel
In-Reply-To: <523btocckc.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408>
	<523btocckc.fsf@topspin.com>
Message-ID: <20050418182947.GC19943@mellanox.co.il>

Quoting r. Roland Dreier <roland at topspin.com>:
> Subject: Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9 kernel
> 
>     Robert> Roland, do you know what the SVN rev was for that latest
>     Robert> code that was submitted to 2.6.12-rc2-mm3. That is the
>     Robert> version that we discussed starting with for an initial
>     Robert> 2.6.9 backport, but as suggested, I want to embed the SVN
>     Robert> rev. into the file name of the patches for clarity.
> 
> It doesn't really make sense to talk about the svn rev for that tree,
> since I went through and picked some patches but not others to merge
> upstream.
> 
>  - R.
> 

Roland, maybe, when you do this, you can put a copy of the source as you
submit it under gen2/branches? I think that would solve Grant's problem
but may be too hard to do in practice.
Would that fit easily with your workflow - you are using quilt, right?

-- 
MST - Michael S. Tsirkin


From mst at mellanox.co.il  Mon Apr 18 11:31:51 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Apr 2005 21:31:51 +0300
Subject: [openib-general] Re: ttcp.aio - kernel NULL pointer dereference
In-Reply-To: <20050418111526.A7553@topspin.com>
References: <20050418145046.GG17566@mellanox.co.il>
	<20050418111526.A7553@topspin.com>
Message-ID: <20050418183151.GD19943@mellanox.co.il>

Quoting r. Libor Michalek <libor at topspin.com>:
> Subject: Re: ttcp.aio - kernel NULL pointer dereference
> 
> On Mon, Apr 18, 2005 at 05:50:46PM +0300, Michael S. Tsirkin wrote:
> > Hello, Libor!
> > Every once in a while, when I run ttcp 
> > 
> > I get a kernel NULL pointer dereference from SDP
> > 
> > My kernel is 2.6.11 + latest openib svn (rev 2171).
> 
>   Michael, on what type of system are you seeing this?
> 
> -Libor
> 

Intel nocona with Arbel native.

-- 
MST - Michael S. Tsirkin


From arjan at infradead.org  Mon Apr 18 12:40:40 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Mon, 18 Apr 2005 21:40:40 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263DF70.2060702@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<4263DF70.2060702@ammasso.com>
Message-ID: <1113853240.6274.99.camel@laptopd505.fenrus.org>

On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote:
> Arjan van de Ven wrote:
> 
> > this is a myth; linux is free to move the page about in physical memory
> > even if it's mlock()ed!!
> 
> Then Linux has a very odd definition of the word "locked".
> 
> > And even then, the user can munlock the memory from another thread etc
> > etc. Not a good idea.
> 
> Well, that's okay, because then the app is doing something stupid, so we don't worry about 
> that.

you should since that physical page can be reused, say by a root
process, and you'd be majorly screwed


From timur.tabi at ammasso.com  Mon Apr 18 13:00:02 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 15:00:02 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113853240.6274.99.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>	 <4263DBBF.9040801@ammasso.com>	
	<1113840973.6274.84.camel@laptopd505.fenrus.org>	
	<4263DF70.2060702@ammasso.com>
	<1113853240.6274.99.camel@laptopd505.fenrus.org>
Message-ID: <426411C2.5040703@ammasso.com>

Arjan van de Ven wrote:

> you should since that physical page can be reused, say by a root
> process, and you'd be majorly screwed

I don't understand what you mean by "reused".  The whole point behind pinning the memory 
is that it stays where it is.  It doesn't get moved around and it doesn't get swapped out.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From iod00d at hp.com  Mon Apr 18 13:02:49 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 18 Apr 2005 13:02:49 -0700
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for 2.6.9
	kernel
In-Reply-To: <20050418182947.GC19943@mellanox.co.il>
References: <1AC79F16F5C5284499BB9591B33D6F000424683E@orsmsx408>
	<523btocckc.fsf@topspin.com>
	<20050418182947.GC19943@mellanox.co.il>
Message-ID: <20050418200249.GG6931@esmail.cup.hp.com>

On Mon, Apr 18, 2005 at 09:29:47PM +0300, Michael S. Tsirkin wrote:
> Quoting r. Roland Dreier <roland at topspin.com>:
> > It doesn't really make sense to talk about the svn rev for that tree,
> > since I went through and picked some patches but not others to merge
> > upstream.
> 
> Roland, maybe, when you do this, you can put a copy of the source as you
> submit it under gen2/branches? I think that would solve Grant's problem
> but may be too hard to do in practice.

Not really since I can see the kernel rev and then pull the matching
kernel source. Well unless people are using -mm or linus' bk tree
directly. But not that many people do that.

grant


From arjan at infradead.org  Mon Apr 18 13:05:42 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Mon, 18 Apr 2005 22:05:42 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426411C2.5040703@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<4263DF70.2060702@ammasso.com>
	<1113853240.6274.99.camel@laptopd505.fenrus.org>
	<426411C2.5040703@ammasso.com>
Message-ID: <1113854742.6274.101.camel@laptopd505.fenrus.org>

On Mon, 2005-04-18 at 15:00 -0500, Timur Tabi wrote:
> Arjan van de Ven wrote:
> 
> > you should since that physical page can be reused, say by a root
> > process, and you'd be majorly screwed
> 
> I don't understand what you mean by "reused".  The whole point behind pinning the memory 
> is that it stays where it is.  It doesn't get moved around and it doesn't get swapped out.
> 
you just said that you didn't care that it got munlock'd. So you don't
care that it gets freed either. And then reused.


From blist at aon.at  Mon Apr 18 13:07:12 2005
From: blist at aon.at (Bernhard Fischer)
Date: Mon, 18 Apr 2005 22:07:12 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113853240.6274.99.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<4263DF70.2060702@ammasso.com>
	<1113853240.6274.99.camel@laptopd505.fenrus.org>
Message-ID: <20050418200711.GI15688@aon.at>

On Mon, Apr 18, 2005 at 09:40:40PM +0200, Arjan van de Ven wrote:
>On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote:
>> Arjan van de Ven wrote:
>> 
>> > this is a myth; linux is free to move the page about in physical memory
>> > even if it's mlock()ed!!
darn, yes, this is true.
I know people who introduced
#define VM_RESERVED     0x00080000      /* Don't unmap it from swap_out */
to vm_flags just because of this. I'll just hold my breath and won't
delve further.
>> 
>> Then Linux has a very odd definition of the word "locked".
>> 
>> > And even then, the user can munlock the memory from another thread etc
>> > etc. Not a good idea.
>> 
>> Well, that's okay, because then the app is doing something stupid, so we don't worry about 
>> that.
>
>you should since that physical page can be reused, say by a root
>process, and you'd be majorly screwed


From timur.tabi at ammasso.com  Mon Apr 18 13:19:33 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 18 Apr 2005 15:19:33 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113854742.6274.101.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>	 <4263DBBF.9040801@ammasso.com>	
	<1113840973.6274.84.camel@laptopd505.fenrus.org>	
	<4263DF70.2060702@ammasso.com>	
	<1113853240.6274.99.camel@laptopd505.fenrus.org>	
	<426411C2.5040703@ammasso.com>
	<1113854742.6274.101.camel@laptopd505.fenrus.org>
Message-ID: <42641655.1080403@ammasso.com>

Arjan van de Ven wrote:

> you just said that you didn't care that it got munlock'd. So you don't
> care that it gets freed either. And then reused.

Well, I can live with the app being able to call munlock(), because the apps that our 
customers use don't call munlock().  What I can't live with is a bug in the kernel that 
causes pinned pages to be swapped or moved.

Obviously, I would rather call get_user_pages() instead of mlock(), but I can't, because 
get_user_pages() alone doesn't work: the page doesn't stay pinned at its physical address. 
It does stay pinned if I call both mlock() and get_user_pages().

Actually, in our tests, calling mlock() appears to be good enough, but I'll update our 
code to call get_user_pages() as well.
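
For readers following the thread, the userspace half of this approach can be sketched as below. This is only an illustration, not Ammasso's code: lock_buffer() and unlock_buffer() are hypothetical helper names, and the kernel-side get_user_pages() call is not shown. As Arjan points out above, mlock() keeps the pages resident but does not by itself promise a fixed physical address, and nothing stops another thread from calling munlock().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to keep [buf, buf + len) resident.
 * Returns 0 on success or a negative errno value (for example -ENOMEM
 * when RLIMIT_MEMLOCK would be exceeded).
 */
static int lock_buffer(void *buf, size_t len)
{
	assert(buf != NULL);
	if (mlock(buf, len) != 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: undo lock_buffer(); returns 0 or a negative errno. */
static int unlock_buffer(void *buf, size_t len)
{
	assert(buf != NULL);
	if (munlock(buf, len) != 0)
		return -errno;
	return 0;
}
```

A caller would typically allocate a page-aligned buffer (for example with posix_memalign()), call lock_buffer() before handing the address to the driver, and call unlock_buffer() only after the driver has released the region.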

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


From hermannv at supermicro.com  Mon Apr 18 18:37:37 2005
From: hermannv at supermicro.com (Hermann von Drateln)
Date: Mon, 18 Apr 2005 18:37:37 -0700
Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD
Message-ID: <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAABgAAAAAAAAAE4LjNMw16EaNXEizozGxbcKAAAAQAAAA9g3WQ/f3hkCyA1px0KEs5wEAAAAA@supermicro.com>

To all Open IB Members!

 
This week we are commencing the verification and validation of our Dual Xeon
800 MHz FSB MBD.

 
The board is very similar to our 

 
http://www.supermicro.com/products/motherboard/Xeon800/E7320/X6DVA-EG.cfm

 
And it is made to be fitted in our CSE-513

 
http://www.supermicro.com/products/chassis/1U/?chs=513

 
Anyone who would like more detailed information on this new board and our
new product road map can send me an email, and I will provide a PowerPoint
presentation with the information.

 
Best regards,

 
Hermann von Drateln

Director Business Development

  
  USA TEL  1 408  503 8110

  USA CEL  1 408 306 8110

 
   Eu Tel  + 49 173 286 6883

   Eu Fax + 49 69 255 77303

 

From yulia.plavunova at t-platforms.ru  Tue Apr 19 03:01:34 2005
From: yulia.plavunova at t-platforms.ru (Yulia Plavunova)
Date: Tue, 19 Apr 2005 14:01:34 +0400
Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD
Message-ID: <3DD22B58943FBB47A82692E3D4671BF5142928@srv04.merle.ru>

Dear Hermann, 

 
I represent T-Platforms, a Russian HPC integrator (www.t-platforms.ru).
We are very interested in getting more info on your Dual Xeon 800 MHz
FSB MBD. I'd appreciate it if you could send the presentation and the
roadmap. 

 
Looking forward to your reply, 

 
Best regards, 

 
Yulia Plavunova

 
Manager of Vendor&International Relations, T-Platforms

 
Tlf.: (+7-095) 9565414

________________________________

From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Hermann von
Drateln
Sent: Tuesday, April 19, 2005 5:38 AM
To: openib-general at openib.org
Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD
Importance: High

 
To all Open IB Members!

 
This week we are commencing the verification and validation of our Dual
Xeon 800 MHz FSB MBD.

 
The board is very similar to our 

 
http://www.supermicro.com/products/motherboard/Xeon800/E7320/X6DVA-EG.cfm

 
And it made to be fitted on our  CSE-513

 
http://www.supermicro.com/products/chassis/1U/?chs=513

 
For any one that would like to obtain more detail information on this
new board and our new product road map send me an email to provide you
with Power point presentation if information.

 
Best regards,

 
Hermann von Drateln

Director Business Development

  
  USA TEL  1 408  503 8110

  USA CEL  1 408 306 8110

 
   Eu Tel  + 49 173 286 6883

   Eu Fax + 49 69 255 77303

 

From ogerlitz at voltaire.com  Tue Apr 19 05:26:58 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Apr 2005 15:26:58 +0300
Subject: [openib-general] openIB gen2 user space verbs API
Message-ID: <D4F8F0B3820E754C887699BEF26A894070B9C9@taurus.voltaire.com>

Roland>Which applications are using these operations as you describe?

Ignoring APM, one can claim that using query qp to get the qp state is a
means of debugging and is not needed for an app's regular flow. Correct; we
are using it in our dapl code when modify qp fails, and I know of an HPC app
that uses it in its startup code in the same manner.

Taking APM into account, as the QP REARMED to ARMED state transition is
done by the HCA HW when data is delivered over the RC connection, there are
apps that do a qp query to sense this transition and modify the qp mig state
to MIGRATED.

Other than debugging and APM, one can implement resource tracking code
that queries a specific qp per request, or a qp caching scheme that keeps
created/init-ed or even connected QPs and, before/after handing them to
consumers, queries the QP to verify its state. Some of the gen1 stacks have
resource tracking as I describe here; there are also apps doing the caching
I mentioned.

For getting the inline size (whose max size you indeed propose to return
from the qp create func), every app which wishes to use inline (e.g. MVAPICH)
would query the qp state.

To me, query qp and hca capabilities seem a must; the other queries
(CQ, MR, etc.) are less important.

Or.
-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
Sent: Monday, April 18, 2005 7:28 PM
To: Or Gerlitz
Cc: openib-general at openib.org
Subject: Re: [openib-general] openIB gen2 user space verbs API

    Or> Other than getting the max inline size, query qp is used to
    Or> get the current QP state, examples are app error flow (eg when
    Or> modify qp failed) and app APM flow to sense some of the state
    Or> transitions done by the HW. These are only examples I quickly
    Or> thought of, I guess there are more.

Which applications are using these operations as you describe?

 - R.


From halr at voltaire.com  Tue Apr 19 06:56:16 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Apr 2005 09:56:16 -0400
Subject: [openib-general] [PATCH] fix management/README
In-Reply-To: <20050417093245.GA16996@mellanox.co.il>
References: <20050417093245.GA16996@mellanox.co.il>
Message-ID: <1113918975.4880.14.camel@localhost.localdomain>

On Sun, 2005-04-17 at 05:32, Michael S. Tsirkin wrote: 
> Fix build instructions to refer to directories that actually exist.

Thanks. Applied.

-- Hal


From halr at voltaire.com  Tue Apr 19 07:00:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Apr 2005 10:00:05 -0400
Subject: [openib-general] SM Bad Port Handling
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF10B@mtlex01.yok.mtl.com>
Message-ID: <1113919205.4880.24.camel@localhost.localdomain>

On Thu, 2005-04-14 at 07:13, Eitan Zahavi wrote: 
> [EZ] Not at all. Although the target port is known. The flaky link
> that fails the mad might be anywhere along the path to the port. So,
> if you mark the target port as bad you might be marking the wrong
> port!

OpenSM does look for traps 128-131 which includes Local link integrity
(129) which is likely from these noisy ports, right ?

>  [EZ] Let me clarify with an example:
> SM=HCA1/P1 ->
> SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1
>                                            
> \..SW4/P4->SW3/P4..SW3/P5->SW3/P2../
>                            
> If the flaky link is between SW2/P2 and SW3/P1 then the packet sent to
> HCA2 using DR : [0][1][1][2][3] might fail . If you mark HCA2/P1 as
> bad then you actually will loose that HCA for no good reason since
> another path from SM to HCA2 exists.

OK. To be really sure about the failed port, one could then walk the
entire DR path from the SM to the perceived non responding port and if
the same port along the path doesn't respond some number of times (say
4) in a row, that port's peer port can be marked as unhealthy. Is this
algorithm acceptable ?

-- Hal


From eitan at mellanox.co.il  Tue Apr 19 08:05:32 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Apr 2005 18:05:32 +0300
Subject: [openib-general] SM Bad Port Handling
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF134@mtlex01.yok.mtl.com>

Yes, the algorithm looks reasonable. 
I would make the number of packets required to qualify the ports on the way
a parameter with default value of 10 or 20 (surely not 4).
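
A minimal sketch of the walk Hal describes, with the threshold as a parameter as suggested here. This is not OpenSM code: struct hop_stats, probe_fn, walk_dr_path() and fail_at_hop() are names made up for the illustration, and the probe callback stands in for whatever directed-route MAD the SM would actually send to each hop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default; the thread suggests 10 or 20 rather than 4. */
#define DEFAULT_FAIL_THRESHOLD 10

/* Per-hop bookkeeping along one directed-route path (hypothetical). */
struct hop_stats {
	unsigned consecutive_failures;
	int unhealthy;
};

/* Returns nonzero if the port at position 'hop' on the path answered. */
typedef int (*probe_fn)(size_t hop, void *ctx);

/*
 * Walk the path once from the SM outwards. The first hop that does not
 * answer has its failure counter bumped; once the counter reaches
 * 'threshold' the hop is flagged and its index is returned so that the
 * caller can mark the peer port as unhealthy. Returns -1 otherwise.
 * Hops beyond a silent one are not probed on this pass.
 */
static int walk_dr_path(struct hop_stats *hops, size_t nhops,
			probe_fn probe, void *ctx, unsigned threshold)
{
	size_t i;

	for (i = 0; i < nhops; i++) {
		if (probe(i, ctx)) {
			/* A response breaks the "in a row" streak. */
			hops[i].consecutive_failures = 0;
			continue;
		}
		hops[i].consecutive_failures++;
		if (hops[i].consecutive_failures >= threshold) {
			hops[i].unhealthy = 1;
			return (int)i;
		}
		return -1;
	}
	return -1;
}

/* Example probe: every hop answers except the one whose index is in ctx. */
static int fail_at_hop(size_t hop, void *ctx)
{
	return hop != *(size_t *)ctx;
}
```

Counting per hop rather than per target is what avoids blaming HCA2/P1 in the example above: only the first silent hop on the path accumulates failures.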

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, April 19, 2005 5:00 PM
> To: Eitan Zahavi
> Cc: 'shaharf'; openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> On Thu, 2005-04-14 at 07:13, Eitan Zahavi wrote:
> > [EZ] Not at all. Although the target port is known. The flaky link
> > that fails the mad might be anywhere along the path to the port. So,
> > if you mark the target port as bad you might be marking the wrong
> > port!
> 
> OpenSM does look for traps 128-131 which includes Local link integrity
> (129) which is likely from these noisy ports, right ?
> 
> >  [EZ] Let me clarify with an example:
> > SM=HCA1/P1 ->
> > SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1
> >
> > \..SW4/P4->SW3/P4..SW3/P5->SW3/P2../
> >
> > If the flaky link is between SW2/P2 and SW3/P1 then the packet sent to
> > HCA2 using DR : [0][1][1][2][3] might fail . If you mark HCA2/P1 as
> > bad then you actually will loose that HCA for no good reason since
> > another path from SM to HCA2 exists.
> 
> OK. To be really sure about the failed port, one could then walk the
> entire DR path from the SM to the perceived non responding port and if
> the same port along the path doesn't respond some number of times (say
> 4) in a row, that port's peer port can be marked as unhealthy. Is this
> algorithm acceptable ?
> 
> -- Hal

From tduffy at sun.com  Tue Apr 19 10:28:07 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 19 Apr 2005 10:28:07 -0700
Subject: [openib-general] Infiniband embedded dual Xeon 800 MHz MBD
In-Reply-To: <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAABgAAAAAAAAAE4LjNMw16EaNXEizozGxbcKAAAAQAAAA9g3WQ/f3hkCyA1px0KEs5wEAAAAA@supermicro.com>
References: <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAABgAAAAAAAAAE4LjNMw16EaNXEizozGxbcKAAAAQAAAA9g3WQ/f3hkCyA1px0KEs5wEAAAAA@supermicro.com>
Message-ID: <1113931687.13847.1.camel@duffman>

On Mon, 2005-04-18 at 18:37 -0700, Hermann von Drateln wrote: 
> This week we are commencing the verification and validation of our
> Dual Xeon 800 MHz FSB MBD.

So, does this board have built-in Infiniband?  If not, how is it
relevant to this list?

Or can I conclude that this was simply SPAM?

-tduffy 


From ardavis at ichips.intel.com  Tue Apr 19 11:42:22 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Tue, 19 Apr 2005 11:42:22 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
In-Reply-To: <52ekdbhe49.fsf@topspin.com>
References: <426023BD.8080504@ichips.intel.com> <52ekdbhe49.fsf@topspin.com>
Message-ID: <4265510E.2080502@ichips.intel.com>

Roland Dreier wrote:

>    ardavis> With a little stress, I see the following oops (running
>    ardavis> latest from the trunk). Let me know if you need any more
>    ardavis> information.
>
>Can you try this patch and let me know if it helps at all?
>
>Thanks,
>  Roland
>
>  
>
Yes, works great. Thanks!

-arlin


From roland at topspin.com  Tue Apr 19 11:49:01 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 11:49:01 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
In-Reply-To: <4265510E.2080502@ichips.intel.com> (ardavis@ichips.intel.com's
	message of "Tue, 19 Apr 2005 11:42:22 -0700")
References: <426023BD.8080504@ichips.intel.com> <52ekdbhe49.fsf@topspin.com>
	<4265510E.2080502@ichips.intel.com>
Message-ID: <527jiy7gbm.fsf@topspin.com>

    ardavis> Yes, works great. Thanks!

Cool, I've committed this.

 - R.


From ardavis at ichips.intel.com  Tue Apr 19 14:49:56 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 19 Apr 2005 14:49:56 -0700
Subject: [openib-general] [PATCH] udapl provider
Message-ID: <20050419144956.359e2d6c.ardavis@ichips.intel.com>

Fixes for socket CM to prevent blocking and allow more uDAPL applications to run successfully.

Signed-off-by: Arlin Davis <ardavis at ichips.intel.com>

Index: udapl/Makefile
===================================================================
--- udapl/Makefile	(revision 2190)
+++ udapl/Makefile	(working copy)
@@ -122,7 +122,6 @@ endif
 #
 ifeq ($(VERBS),openib)
 PROVIDER = $(TOPDIR)/../openib
-DAPL_IBLIB_DIR = /usr/local/lib
 CFLAGS   += -DSOCKET_CM -DOPENIB -DCQ_WAIT_OBJECT
 CFLAGS   += -I/usr/local/include/infiniband
 endif
@@ -139,7 +138,7 @@ endif
 
 CFLAGS   += -I. 
 CFLAGS   += -I.. 
-CFLAGS   += -I../../dat/include 
+CFLAGS   += -I../dat/include 
 CFLAGS   += -I../include 
 
 CFLAGS   += -I$(PROVIDER)
@@ -234,8 +233,9 @@ PROVIDER_SRCS += dapl_openib_util.c dapl
 endif
 
 ifeq ($(VERBS),openib)
-LDFLAGS += -libverbs 
-LDFLAGS += -L /usr/local/lib/ 
+LDFLAGS += -libverbs /usr/local/lib/infiniband/mthca.so
+LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib
+LDFLAGS += -rpath /usr/local/lib/infiniband 
 PROVIDER_SRCS  = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c 
 PROVIDER_SRCS += dapl_ib_cm.c dapl_ib_mem.c
 endif
Index: openib/TODO
===================================================================
--- openib/TODO	(revision 2190)
+++ openib/TODO	(working copy)
@@ -2,16 +2,17 @@
 IB Verbs:
 - CQ resize?
 - query call to get current qp state 
-- ibv_get_cq_event() blocks until event arrives. need timed event and wakeup
+- ibv_get_cq_event() needs timed event call and wakeup
 - query call to get device attributes
-- poll_cq return codes not exported
+- current implementation only supports one event per device
+- memory window support
 
 DAPL:
 - Build udapl issues with mthca having reverse dependencies to ibverbs
-- When CM arrives: change modify_qp_state RTS RTR calls
+- When real CM arrives: change modify_qp_state RTS RTR calls
 - reinit EP needs a QP timewait completion notification
-- disconnect clean 
-- add cq_object wakeup, time based cq_object wait
+- code disconnect clean 
+- add cq_object wakeup, time based cq_object wait when verbs support arrives
 - update uDAPL code with real CM and ATS support
 - etc, etc.
 
Index: openib/dapl_ib_util.c
===================================================================
--- openib/dapl_ib_util.c	(revision 2190)
+++ openib/dapl_ib_util.c	(working copy)
@@ -53,18 +53,12 @@ static const char rcsid[] = "$Id:  $";
 #include "dapl_adapter_util.h"
 #include "dapl_ib_util.h"
 
-#include <dlfcn.h>
 #include <stdlib.h>
 #include <netinet/tcp.h>
 #include <sys/utsname.h>
 #include <unistd.h>	
 #include <fcntl.h>
 
-/* set default path */
-#define OPENIB_VERBS_PATH_DEFAULT	"/usr/local/lib/libibverbs.so"
-static char *				ibv_path;
-static void *				ibv_handle = NULL;
-
 int g_dapl_loopback_connection = 0;
 
 #ifdef SOCKET_CM
@@ -110,32 +104,11 @@ DAT_RETURN getipaddr( char *addr, int ad
  */
 int32_t dapls_ib_init (void)
 {	
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, "dapls_ib_init() called\n");
-
-	ibv_path = getenv("OPENIB_VERBS_PATH");
-	
-	if (ibv_path == NULL)
-		ibv_path = OPENIB_VERBS_PATH_DEFAULT;
-	
-	dapl_dbg_log(DAPL_DBG_TYPE_UTIL," loading verbs library %s\n",ibv_path);
-
-	ibv_handle = dlopen(ibv_path, RTLD_NOW | RTLD_GLOBAL);
-	if (ibv_handle == NULL ) {
-		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-			     " library load failure %s\n", dlerror());
-		return -1;
-	}
-
 	return 0;
 }
 
 int32_t dapls_ib_release (void)
 {
-	dapl_dbg_log (DAPL_DBG_TYPE_UTIL, "dapls_ib_release() called\n");
-	
-	if (ibv_handle)
-		dlclose(ibv_handle);
-
 	return 0;
 }
 
@@ -166,13 +139,6 @@ DAT_RETURN dapls_ib_open_hca (
 	dapl_dbg_log (DAPL_DBG_TYPE_UTIL, 
 		      " open_hca: %s - %p\n", hca_name, hca_ptr );
 
-	if (ibv_handle == NULL) {
-		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-			     " Failure loading IB verbs library %s \n", 
-			     ibv_path);
-		return DAT_PROVIDER_NOT_FOUND;
-	}
-
 	/* Get list of all IB devices, find match, open */
 	dev_list = ibv_get_devices();
 	dlist_start(dev_list);
@@ -201,6 +167,29 @@ DAT_RETURN dapls_ib_open_hca (
 	}
   
 #ifdef SOCKET_CM
+	/* initialize cr_list lock */
+	dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock);
+	if (dat_status != DAT_SUCCESS)
+	{
+		dapl_dbg_log (DAPL_DBG_TYPE_ERR, 
+			" open_hca: failed to init lock\n");
+		return dat_status;
+	}
+
+	/* initialize CM list for listens on this HCA */
+	dapl_llist_init_head(&hca_ptr->ib_trans.list);
+
+	/* create thread to process inbound connect request */
+	dat_status = dapl_os_thread_create(cr_thread, 
+					   (void*)hca_ptr, 
+					   &hca_ptr->ib_trans.thread );
+	if (dat_status != DAT_SUCCESS)
+	{
+		dapl_dbg_log (DAPL_DBG_TYPE_ERR, 
+		" open_hca: failed to create thread\n");
+		return dat_status;
+	}
+
 	/* get the IP address of the device */
 	dat_status = getipaddr((char*)&hca_ptr->hca_address, 
 				sizeof(DAT_SOCK_ADDR6) );
@@ -243,6 +232,20 @@ DAT_RETURN dapls_ib_close_hca (	IN   DAP
 			return(dapl_convert_errno(errno,"ib_close_device"));
 		hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
 	}
+
+#if SOCKET_CM
+	/* destroy cr_thread and lock */
+	hca_ptr->ib_trans.destroy = 1;
+	while (hca_ptr->ib_trans.destroy) {
+		struct timespec	sleep, remain;
+		sleep.tv_sec = 0;
+		sleep.tv_nsec = 10000000; /* 10 ms */
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL, 
+			     " close_hca: waiting for cr_thread\n");
+		nanosleep (&sleep, &remain);
+	}
+	dapl_os_lock_destroy(&hca_ptr->ib_trans.lock);
+#endif
 	return (DAT_SUCCESS);
 }
   
@@ -297,18 +300,18 @@ DAT_RETURN dapls_ib_query_hca (
 			((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff );
 
 	/* TODO: need verbs query call */
-	ia_attr->max_eps                  = 1000;
-        ia_attr->max_dto_per_ep           = 1000;
-        ia_attr->max_rdma_read_per_ep     = 4;
-        ia_attr->max_evds                 = 1000;
-        ia_attr->max_evd_qlen             = 1000;
-	ia_attr->max_iov_segments_per_dto = 10;
-        ia_attr->max_lmrs                 = 1000;
+	ia_attr->max_eps                  = 64000;
+        ia_attr->max_dto_per_ep           = 64000;
+        ia_attr->max_rdma_read_per_ep     = 8;
+        ia_attr->max_evds                 = 64000;
+        ia_attr->max_evd_qlen             = 64000;
+	ia_attr->max_iov_segments_per_dto = 32;
+        ia_attr->max_lmrs                 = 64000;
         ia_attr->max_lmr_block_size       = 0x80000000;
-        ia_attr->max_rmrs                 = 1000;
+        ia_attr->max_rmrs                 = 64000;
         ia_attr->max_lmr_virtual_address  = 0x80000000;
         ia_attr->max_rmr_target_address   = 0x80000000;
-        ia_attr->max_pzs                  = 1000;
+        ia_attr->max_pzs                  = 64000;
         ia_attr->max_mtu_size             = 0x80000000;
         ia_attr->max_rdma_size            = 0x80000000;
         ia_attr->num_transport_attr       = 0;
@@ -333,12 +336,12 @@ DAT_RETURN dapls_ib_query_hca (
 	if (ep_attr != NULL) {
 		ep_attr->max_mtu_size     = 0x80000000;
 		ep_attr->max_rdma_size    = 0x80000000;
-		ep_attr->max_recv_dtos    = 1000;
-		ep_attr->max_request_dtos = 1000;
-		ep_attr->max_recv_iov     = 10;
-		ep_attr->max_request_iov  = 10;
-		ep_attr->max_rdma_read_in = 4;
-		ep_attr->max_rdma_read_out= 4;
+		ep_attr->max_recv_dtos    = 64000;
+		ep_attr->max_request_dtos = 64000;
+		ep_attr->max_recv_iov     = 32;
+		ep_attr->max_request_iov  = 32;
+		ep_attr->max_rdma_read_in = 8;
+		ep_attr->max_rdma_read_out= 8;
 		dapl_dbg_log (DAPL_DBG_TYPE_UTIL, 
 			" query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", 
 			ep_attr->max_mtu_size,
@@ -394,7 +397,7 @@ DAT_RETURN dapls_ib_setup_async_callback
 	    hca_ptr->async_cq_error = callback;
 	    break;
 	case DAPL_ASYNC_CQ_COMPLETION:
-	   hca_ptr->async_cq_completion = callback;
+	   hca_ptr->async_cq = callback;
 	   break;
 	case DAPL_ASYNC_QP_ERROR:
 	    hca_ptr->async_qp_error = callback;
Index: openib/dapl_ib_mem.c
===================================================================
--- openib/dapl_ib_mem.c	(revision 2190)
+++ openib/dapl_ib_mem.c	(working copy)
@@ -1,26 +1,25 @@
 /*
- * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved.
- *
- * This Software is licensed under either one of the following two licenses:
+ * This Software is licensed under one of the following licenses:
  *
  * 1) under the terms of the "Common Public License 1.0" a copy of which is
- *    in the file LICENSE.txt in the root directory. The license is also
  *    available from the Open Source Initiative, see
  *    http://www.opensource.org/licenses/cpl.php.
- * OR
  *
- * 2) under the terms of the "The BSD License" a copy of which is in the file
- *    LICENSE2.txt in the root directory. The license is also available from
- *    the Open Source Initiative, see
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
  *    http://www.opensource.org/licenses/bsd-license.php.
  *
- * Licensee has the right to choose either one of the above two licenses.
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
  *
- * Redistributions of source code must retain both the above copyright
- * notice and either one of the license notices.
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
  *
  * Redistributions in binary form must reproduce both the above copyright
- * notice, either one of the license notices in the documentation
+ * notice, one of the license notices in the documentation
  * and/or other materials provided with the distribution.
  */
 
@@ -31,7 +30,7 @@
  * PURPOSE: Intel DET APIs: Memory windows, registration,
  *           and protection domain 
  *
- * $Id:$
+ * $Id: $
  *
  **********************************************************************/
 
@@ -182,8 +181,11 @@ dapls_ib_mr_register (
 			ia_ptr, lmr, virt_addr, length, privileges );
 
 	/* TODO: shared memory */
-	if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL)        
+	if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) {
+		dapl_dbg_log( DAPL_DBG_TYPE_ERR,
+		     " mr_register_shared: NOT IMPLEMENTED\n");    
 		return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
+	}
 
 	/* local read is default on IB */ 
 	lmr->mr_handle = 
@@ -266,6 +268,7 @@ dapls_ib_mr_register_shared (
 	IN  DAPL_LMR		    *lmr,
 	IN  DAT_MEM_PRIV_FLAGS	privileges )
 {
+    dapl_dbg_log(DAPL_DBG_TYPE_ERR," mr_register_shared: NOT IMPLEMENTED\n");
     return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
 }
 
@@ -289,6 +292,8 @@ DAT_RETURN
 dapls_ib_mw_alloc (
 	IN  DAPL_RMR	*rmr )
 {
+
+	dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_alloc: NOT IMPLEMENTED\n");
    	return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
 }
 
@@ -312,6 +317,7 @@ DAT_RETURN
 dapls_ib_mw_free (
 	IN  DAPL_RMR 	*rmr )
 {	
+	dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_free: NOT IMPLEMENTED\n");
 	return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
 }
 
@@ -343,6 +349,7 @@ dapls_ib_mw_bind (
 	IN  DAT_MEM_PRIV_FLAGS		mem_priv,
 	IN  DAT_BOOLEAN			is_signaled)
 {
+	dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_bind: NOT IMPLEMENTED\n");
 	return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
 }
 
@@ -371,6 +378,7 @@ dapls_ib_mw_unbind (
 	IN  DAPL_COOKIE	*cookie,
 	IN  DAT_BOOLEAN	is_signaled )
 {
+	dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_unbind: NOT IMPLEMENTED\n");
 	return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE);  
 }
 
Index: openib/dapl_ib_cm.c
===================================================================
--- openib/dapl_ib_cm.c	(revision 2190)
+++ openib/dapl_ib_cm.c	(working copy)
@@ -74,10 +74,12 @@ static DAT_RETURN dapli_socket_listen ( 
 					DAT_CONN_QUAL		serviceID,
 					DAPL_SP			*sp_ptr );
 
-static DAT_RETURN dapli_socket_accept(	DAPL_EP			*ep_ptr,
-					DAPL_CR			*cr_ptr,
-					DAT_COUNT		p_size,
-					DAT_PVOID		p_data );
+static DAT_RETURN dapli_socket_accept(	ib_cm_srvc_handle_t cm_ptr );
+
+static DAT_RETURN dapli_socket_accept_final(	DAPL_EP		*ep_ptr,
+						DAPL_CR		*cr_ptr,
+						DAT_COUNT	p_size,
+						DAT_PVOID	p_data );
 
 /* XXX temporary hack to get lid */
 static uint16_t dapli_get_lid(IN struct ibv_device *dev, IN int port)
@@ -114,6 +116,7 @@ dapli_socket_connect (	DAPL_EP			*ep_ptr
 	DAPL_IA		*ia_ptr = ep_ptr->header.owner_ia;
 	int		len, opt = 1;
 	struct iovec    iovec[2];
+	short		rtu_data = htons(0x0E0F);
 	
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d\n", r_qual);
 			
@@ -133,7 +136,7 @@ dapli_socket_connect (	DAPL_EP			*ep_ptr
 		return DAT_INSUFFICIENT_RESOURCES;
 	}
 
-	((struct sockaddr_in*)r_addr)->sin_port = r_qual;
+	((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual);
 
 	if ( connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) < 0 ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
@@ -153,9 +156,11 @@ dapli_socket_connect (	DAPL_EP			*ep_ptr
 	cm_ptr->dst.p_size = p_size;
 	iovec[0].iov_base = &cm_ptr->dst;
 	iovec[0].iov_len  = sizeof(ib_qp_cm_t);
-	iovec[1].iov_base = p_data;
-	iovec[1].iov_len  = p_size;
-	len = writev( cm_ptr->socket, iovec, 2 );
+	if ( p_size ) {
+		iovec[1].iov_base = p_data;
+		iovec[1].iov_len  = p_size;
+	}
+	len = writev( cm_ptr->socket, iovec, (p_size ? 2:1) );
     	if ( len != (p_size + sizeof(ib_qp_cm_t)) ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
 			     " connect write: ERR %s, wcnt=%d\n",
@@ -190,8 +195,8 @@ dapli_socket_connect (	DAPL_EP			*ep_ptr
 
 	/* read private data into cm_handle if any present */
 	if ( cm_ptr->dst.p_size ) {
-		iovec[1].iov_base = cm_ptr->p_data;
-		iovec[1].iov_len  = cm_ptr->dst.p_size;
+		iovec[0].iov_base = cm_ptr->p_data;
+		iovec[0].iov_len  = cm_ptr->dst.p_size;
 		len = readv( cm_ptr->socket, iovec, 1 );
 		if ( len != cm_ptr->dst.p_size ) {
 			dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
@@ -213,10 +218,11 @@ dapli_socket_connect (	DAPL_EP			*ep_ptr
 	ep_ptr->qp_state = IB_QP_STATE_RTS;
 
 	/* complete handshake after final QP state change */
-	write(cm_ptr->socket, "QP_RTR_RTS", sizeof "QP_RTR_RTS");
+	write(cm_ptr->socket, &rtu_data, sizeof(rtu_data) );
 
 	/* init cm_handle and post the event with private data */
 	ep_ptr->cm_handle = cm_ptr;
+	dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); 
 	dapl_evd_connection_callback(   ep_ptr->cm_handle, 
 					IB_CME_CONNECTED, 
 					cm_ptr->p_data, 
@@ -248,9 +254,7 @@ dapli_socket_listen (	DAPL_IA		*ia_ptr,
 {
 	struct sockaddr_in	addr;
 	ib_cm_srvc_handle_t	cm_ptr = NULL;
-	void			*p_data = NULL;
-	int			l_sock = -1;
-	int			len, opt = 1;
+	int			opt = 1;
 	DAT_RETURN		dat_status = DAT_SUCCESS;
 
 	dapl_dbg_log (	DAPL_DBG_TYPE_EP,
@@ -263,26 +267,30 @@ dapli_socket_listen (	DAPL_IA		*ia_ptr,
 
 	(void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) );
 	
-	cm_ptr->socket = -1;
+	cm_ptr->socket = cm_ptr->l_socket = -1;
 	cm_ptr->sp = sp_ptr;
 	cm_ptr->hca_ptr = ia_ptr->hca_ptr;
 	
 	/* bind, listen, set sockopt, accept, exchange data */
-	if ((l_sock = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
+	if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
 		dapl_dbg_log (DAPL_DBG_TYPE_ERR, 
 				"socket for listen returned %d\n", errno);
 		dat_status = DAT_INSUFFICIENT_RESOURCES;
 		goto bail;
 	}
 
-	setsockopt(l_sock,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt));
-	addr.sin_port        = serviceID;
+	setsockopt(cm_ptr->l_socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt));
+	addr.sin_port        = htons(serviceID);
 	addr.sin_family      = AF_INET;
 	addr.sin_addr.s_addr = INADDR_ANY;
 
-	if (( bind( l_sock,(struct sockaddr*)&addr, sizeof(addr) ) < 0) ||
-		   (listen( l_sock, 1 ) < 0) ) {
+	if (( bind( cm_ptr->l_socket,(struct sockaddr*)&addr, sizeof(addr) ) < 0) ||
+		   (listen( cm_ptr->l_socket, 128 ) < 0) ) {
 	
+		dapl_dbg_log( DAPL_DBG_TYPE_ERR,
+				" listen: ERROR %s on conn_qual 0x%x\n",
+				strerror(errno),serviceID); 
+
 		if ( errno == EADDRINUSE )
 			dat_status = DAT_CONN_QUAL_IN_USE;
 		else
@@ -290,109 +298,144 @@ dapli_socket_listen (	DAPL_IA		*ia_ptr,
 
 		goto bail;
 	}
+	
+	/* set cm_handle for this service point, save listen socket */
+	sp_ptr->cm_srvc_handle = cm_ptr;
 
-	/* block on the accept */
-        len = sizeof(cm_ptr->dst.ia_address);
-        cm_ptr->socket = accept(l_sock, 
-				(struct sockaddr*)&cm_ptr->dst.ia_address, 
+	/* add to SP->CR thread list */
+	dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry);
+	dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock );
+	dapl_llist_add_tail(&cm_ptr->hca_ptr->ib_trans.list, 
+			    (DAPL_LLIST_ENTRY*)&cm_ptr->entry, cm_ptr);
+	dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock);
+
+	dapl_dbg_log( DAPL_DBG_TYPE_CM,
+			" listen: qual 0x%x cr %p s_fd %d\n",
+			ntohs(serviceID), cm_ptr, cm_ptr->l_socket ); 
+	
+	return dat_status;
+bail:
+	dapl_dbg_log( DAPL_DBG_TYPE_ERR,
+			" listen: ERROR on conn_qual 0x%x\n",serviceID); 
+	if ( cm_ptr->l_socket >= 0 )
+		close( cm_ptr->l_socket );
+	dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
+	return dat_status;
+}
+
+
+/*
+ * PASSIVE: send local QP information, private data, and wait for 
+ *	    active side to respond with QP RTS/RTR status 
+ */
+static DAT_RETURN 
+dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr)
+{
+	ib_cm_handle_t	acm_ptr;
+	void		*p_data = NULL;
+	int		len;
+	DAT_RETURN	dat_status = DAT_SUCCESS;
+		
+	/* Allocate accept CM and initialize */
+	if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) 
+		return DAT_INSUFFICIENT_RESOURCES;
+
+	(void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) );
+	
+	acm_ptr->socket = -1;
+	acm_ptr->sp = cm_ptr->sp;
+	acm_ptr->hca_ptr = cm_ptr->hca_ptr;
+
+	len = sizeof(acm_ptr->dst.ia_address);
+	acm_ptr->socket = accept(cm_ptr->l_socket, 
+				(struct sockaddr*)&acm_ptr->dst.ia_address, 
 				&len );
 
-        if ( cm_ptr->socket < 0 ) {
+	if ( acm_ptr->socket < 0 ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     " listen accept: ERR %s\n",strerror(errno)); 
+			" accept: ERR %s on FD %d l_cr %p\n",
+			strerror(errno),cm_ptr->l_socket,cm_ptr); 
 		dat_status = DAT_INTERNAL_ERROR;
 		goto bail;
    	}
 
 	/* read in DST QP info, IA address. check for private data */
-	len = read( cm_ptr->socket, &cm_ptr->dst, sizeof(ib_qp_cm_t) );
+	len = read( acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t) );
 	if ( len != sizeof(ib_qp_cm_t) ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     " listen read: ERR %s, rcnt=%d\n",
-			     strerror(errno), len); 
+			" accept read: ERR %s, rcnt=%d\n",
+			strerror(errno), len); 
 		dat_status = DAT_INTERNAL_ERROR;
 		goto bail;
 
 	}
-
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, 
-		     " listen: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n",
-		     cm_ptr->dst.port, cm_ptr->dst.lid, 
-		     cm_ptr->dst.qpn, cm_ptr->dst.p_size ); 
+		" accept: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n",
+		acm_ptr->dst.port, acm_ptr->dst.lid, 
+		acm_ptr->dst.qpn, acm_ptr->dst.p_size ); 
 
 	/* validate private data size before reading */
-	if ( cm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) {
+	if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     " listen read: psize (%d) wrong\n",
-			     cm_ptr->dst.p_size ); 
+			" accept read: psize (%d) wrong\n",
+			acm_ptr->dst.p_size ); 
 		dat_status = DAT_INTERNAL_ERROR;
 		goto bail;
 	}
 
 	/* read private data into cm_handle if any present */
-	if ( cm_ptr->dst.p_size ) {
-		len = read( cm_ptr->socket, 
-			    cm_ptr->p_data, 
-			    cm_ptr->dst.p_size );
-		if ( len != cm_ptr->dst.p_size ) {
+	if ( acm_ptr->dst.p_size ) {
+		len = read( acm_ptr->socket, 
+			    acm_ptr->p_data, acm_ptr->dst.p_size );
+		if ( len != acm_ptr->dst.p_size ) {
 			dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-				" listen read pdata: ERR %s, rcnt=%d\n",
+				" accept read pdata: ERR %s, rcnt=%d\n",
 				strerror(errno), len ); 
 			dat_status = DAT_INTERNAL_ERROR;
 			goto bail;
 		}
-		p_data = cm_ptr->p_data;
+		dapl_dbg_log(DAPL_DBG_TYPE_EP, 
+				" accept: psize=%d read\n",
+				acm_ptr->dst.p_size); 
+		p_data = acm_ptr->p_data;
 	}
-		
-	/* set cm_handle for this service point */
-	sp_ptr->cm_srvc_handle = cm_ptr;
 	
-	/* 
-	 * dapls_ib_accept_connection send QP information
-	 * and complete CM handshake
-	 */
-
 	/* trigger CR event and return SUCCESS */
-	dapls_cr_callback(  cm_ptr,
+	dapls_cr_callback(  acm_ptr,
 			    IB_CME_CONNECTION_REQUEST_PENDING,
 		            p_data,
-			    sp_ptr );
+			    acm_ptr->sp );
 
-	return dat_status;
+	return DAT_SUCCESS;
 
 bail:
-	if ( l_sock >= 0 )
-		close( l_sock );
-	if ( cm_ptr->socket >= 0 )
-		close( cm_ptr->socket );
-	if ( cm_ptr )
-    		dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
-
-	return dat_status;
+	if ( acm_ptr->socket >=0 )
+		close( acm_ptr->socket );
+	dapl_os_free( acm_ptr, sizeof( *acm_ptr ) );
+	return DAT_INTERNAL_ERROR;
 }
 
 
-
-/*
- * PASSIVE: send local QP information, private data, and wait for 
- *	    active side to respond with QP RTS/RTR status 
- */
 static DAT_RETURN 
-dapli_socket_accept( DAPL_EP		*ep_ptr,
-		     DAPL_CR		*cr_ptr,
-		     DAT_COUNT		p_size,
-		     DAT_PVOID		p_data )
+dapli_socket_accept_final( DAPL_EP		*ep_ptr,
+			   DAPL_CR		*cr_ptr,
+			   DAT_COUNT		p_size,
+		           DAT_PVOID		p_data )
 {
-	ib_cm_handle_t	cm_ptr = cr_ptr->ib_cm_handle;
 	DAPL_IA		*ia_ptr = ep_ptr->header.owner_ia;
+	ib_cm_handle_t	cm_ptr = cr_ptr->ib_cm_handle;
 	ib_qp_cm_t	qp_cm;
 	struct iovec    iovec[2];
 	int		len;
-	char		r_buf[10] = "XX_XXX_XXX";
+	short		rtu_data = 0;
 
 	if (p_size >  IB_MAX_REP_PDATA_SIZE) 
-		return (DAT_LENGTH_ERROR);
+		return DAT_LENGTH_ERROR;
 
+	/* must have an accepted socket */
+	if ( cm_ptr->socket < 0 )
+		return DAT_INTERNAL_ERROR;
+	
 	/* modify QP to RTR and then to RTS with remote info already read */
 	if ( dapls_modify_qp_state( ep_ptr->qp_handle, 
 				    IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS )
@@ -413,42 +456,42 @@ dapli_socket_accept( DAPL_EP		*ep_ptr,
 	qp_cm.p_size = p_size;
 	iovec[0].iov_base = &qp_cm;
 	iovec[0].iov_len  = sizeof(ib_qp_cm_t);
-	iovec[1].iov_base = p_data;
-	iovec[1].iov_len  = p_size;
-	len = writev( cm_ptr->socket, iovec, 2 );
+	if (p_size) {
+		iovec[1].iov_base = p_data;
+		iovec[1].iov_len  = p_size;
+	}
+	len = writev( cm_ptr->socket, iovec, (p_size ? 2:1) );
     	if (len != (p_size + sizeof(ib_qp_cm_t))) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     " connect write: ERR %s, wcnt=%d\n",
+			     " accept_final: ERR %s, wcnt=%d\n",
 			     strerror(errno), len); 
 		goto bail;
 	}
 	dapl_dbg_log(DAPL_DBG_TYPE_EP, 
-		     " accept: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n",
+		     " accept_final: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n",
 		     qp_cm.port, qp_cm.lid, qp_cm.qpn, qp_cm.p_size ); 
-			 
+	
 	/* complete handshake after final QP state change */
-	len = read(cm_ptr->socket, r_buf, sizeof(r_buf) );
-	if ( len != sizeof(r_buf) ) {
+	len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data) );
+	if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) {
 		dapl_dbg_log(DAPL_DBG_TYPE_ERR, 
-			     " accept: ERR %s, rcnt=%d\n",
-			     strerror(errno), len); 
+			     " accept_final: ERR %s, rcnt=%d rdata=%x\n",
+			     strerror(errno), len, ntohs(rtu_data) ); 
 		goto bail;
 	}
 
 	/* final data exchange if remote QP state is good to go */
-	dapl_dbg_log( DAPL_DBG_TYPE_EP," accept: %s \n", r_buf); 
-
-	dapls_cr_callback ( cm_ptr,
-    			    IB_CME_CONNECTED,
-    			    NULL,
-			    cm_ptr->sp );
-
+	dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); 
+	dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp );
 	return DAT_SUCCESS;
 
 bail:
-	dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept: ERR !QP_RTR_RTS \n"); 
-	close( cm_ptr->socket );
+	dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR !QP_RTR_RTS \n"); 
+	if ( cm_ptr->socket >= 0 )
+		close( cm_ptr->socket );
+	dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
 	dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */
+
 	return DAT_INTERNAL_ERROR;
 }
 
@@ -482,7 +525,7 @@ dapls_ib_connect (
 	IN  DAT_IA_ADDRESS_PTR		remote_ia_address,
 	IN  DAT_CONN_QUAL		remote_conn_qual,
 	IN  DAT_COUNT			private_data_size,
-	IN  DAT_PVOID			private_data )
+	IN  void			*private_data )
 {
 	DAPL_EP		*ep_ptr;
 	ib_qp_handle_t	qp_ptr;
@@ -545,18 +588,19 @@ dapls_ib_disconnect (
 	dapls_ib_reinit_ep(ep_ptr);
 
 #endif
-	
-	if ( ep_ptr->cr_ptr )	
+	if ( ep_ptr->cr_ptr ) {
 		dapls_cr_callback ( ep_ptr->cm_handle,
 				    IB_CME_DISCONNECTED,
 				    NULL,
 				    ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr );
-	else
+	} else {
 		dapl_evd_connection_callback ( ep_ptr->cm_handle,
 						IB_CME_DISCONNECTED,
 						NULL,
 						ep_ptr );
-
+		ep_ptr->cm_handle = NULL;
+		dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
+	}	
 	return DAT_SUCCESS;
 }
 
@@ -584,7 +628,6 @@ dapls_ib_disconnect_clean (
 	IN  DAT_BOOLEAN			active,
 	IN  const ib_cm_events_t	ib_cm_event )
 {
-    
     return;
 }
 
@@ -644,25 +687,22 @@ dapls_ib_remove_conn_listener (
 	IN  DAPL_IA		*ia_ptr,
 	IN  DAPL_SP		*sp_ptr )
 {
-
 	ib_cm_srvc_handle_t	cm_ptr = sp_ptr->cm_srvc_handle;
 
 	dapl_dbg_log (DAPL_DBG_TYPE_EP,
 			"dapls_ib_remove_conn_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n",
 			ia_ptr, sp_ptr, cm_ptr );
 #ifdef SOCKET_CM
-
 	/* close accepted socket, free cm_srvc_handle and return */
 	if ( cm_ptr != NULL ) {
-		if ( cm_ptr->socket > 0 ) {
-			close( cm_ptr->socket );
-			cm_ptr->socket = 0;
+		if ( cm_ptr->l_socket >= 0 ) {
+			close( cm_ptr->l_socket );
+			cm_ptr->l_socket = -1;
 		}
-	    	dapl_os_free( cm_ptr, sizeof( *cm_ptr ) );
+	    	/* cr_thread will free */
 		sp_ptr->cm_srvc_handle = NULL;
 	}
 	return DAT_SUCCESS;
-
 #else
 	return DAT_NOT_IMPLEMENTED;   
 
@@ -717,7 +757,7 @@ dapls_ib_accept_connection (
 	}
     
 #ifdef SOCKET_CM
-	return ( dapli_socket_accept(ep_ptr, cr_ptr, p_size, p_data) );
+	return ( dapli_socket_accept_final(ep_ptr, cr_ptr, p_size, p_data) );
 #else
 	return DAT_NOT_IMPLEMENTED;   
 #endif
@@ -756,13 +796,13 @@ dapls_ib_reject_connection (
 	/* just close the socket and return */
 	if ( cm_ptr->socket > 0 ) {
 		close( cm_ptr->socket );
-		cm_ptr->socket = 0;
+		cm_ptr->socket = -1;
 	}
-	
 	return DAT_SUCCESS;
-
+#else
+	return DAT_NOT_IMPLEMENTED;   
 #endif
-	return DAT_SUCCESS;
+	
 
 }
 
@@ -984,6 +1024,76 @@ dapls_ib_get_cm_event (
     return ib_cm_event;
 }
 
+/* async CR processing thread to avoid blocking applications */
+void cr_thread(void *arg) 
+{
+    struct dapl_hca	*hca_ptr = arg;
+    ib_cm_srvc_handle_t	cr, next_cr;
+    int			max_fd;
+    fd_set		rfd,rfds;
+    struct timeval	to;
+     
+    dapl_os_lock( &hca_ptr->ib_trans.lock );
+    while ( !hca_ptr->ib_trans.destroy ) {
+	
+	FD_ZERO( &rfds ); 
+	max_fd = -1;
+	
+	if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list))
+            next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list);
+	else
+	    next_cr = NULL;
+
+	while (next_cr) {
+	    cr = next_cr;
+	    dapl_dbg_log (DAPL_DBG_TYPE_CM," thread: cm_ptr %p\n", cr );
+	    if (cr->l_socket == -1 || hca_ptr->ib_trans.destroy) {
+
+		dapl_dbg_log(DAPL_DBG_TYPE_CM," thread: Freeing %p\n", cr);
+		next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list,
+						(DAPL_LLIST_ENTRY*)&cr->entry );
+		dapl_llist_remove_entry(&hca_ptr->ib_trans.list, 
+					(DAPL_LLIST_ENTRY*)&cr->entry);
+		dapl_os_free( cr, sizeof(*cr) );
+		continue;
+	    }
+	          
+	    FD_SET( cr->l_socket, &rfds ); /* add to select set */
+	    if ( cr->l_socket > max_fd )
+		max_fd = cr->l_socket;
+
+	    /* individual select poll to check for work */
+	    FD_ZERO(&rfd);
+	    FD_SET(cr->l_socket, &rfd);
+	    dapl_os_unlock(&hca_ptr->ib_trans.lock);	
+	    to.tv_sec  = 0;
+	    to.tv_usec = 0;
+	    if ( select(cr->l_socket + 1,&rfd, NULL, NULL, &to) < 0) {
+		dapl_dbg_log (DAPL_DBG_TYPE_CM,
+			  " thread: ERR %s on cr %p sk %d\n", 
+			  strerror(errno), cr, cr->l_socket);
+		close(cr->l_socket);
+		cr->l_socket = -1;
+	    } else if ( FD_ISSET(cr->l_socket, &rfd) && 
+			dapli_socket_accept(cr)) {
+		close(cr->l_socket);
+		cr->l_socket = -1;
+	    }
+	    dapl_os_lock( &hca_ptr->ib_trans.lock );
+	    next_cr =  dapl_llist_next_entry(&hca_ptr->ib_trans.list,
+					     (DAPL_LLIST_ENTRY*)&cr->entry );
+	} 
+	dapl_os_unlock( &hca_ptr->ib_trans.lock );
+	to.tv_sec  = 0;
+	to.tv_usec = 500000; /* wakeup and check destroy */
+	select(max_fd + 1, &rfds, NULL, NULL, &to);
+	dapl_os_lock( &hca_ptr->ib_trans.lock );
+    } 
+    dapl_os_unlock( &hca_ptr->ib_trans.lock );	
+    hca_ptr->ib_trans.destroy = 0;
+    dapl_dbg_log(DAPL_DBG_TYPE_CM," thread(hca %p) exit\n",hca_ptr);
+}
+
 /* Real IBv CM */
 #else 
 
Index: openib/dapl_ib_qp.c
===================================================================
--- openib/dapl_ib_qp.c	(revision 2190)
+++ openib/dapl_ib_qp.c	(working copy)
@@ -1,26 +1,25 @@
 /*
- * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved.
- *
- * This Software is licensed under either one of the following two licenses:
+ * This Software is licensed under one of the following licenses:
  *
  * 1) under the terms of the "Common Public License 1.0" a copy of which is
- *    in the file LICENSE.txt in the root directory. The license is also
  *    available from the Open Source Initiative, see
  *    http://www.opensource.org/licenses/cpl.php.
- * OR
  *
- * 2) under the terms of the "The BSD License" a copy of which is in the file
- *    LICENSE2.txt in the root directory. The license is also available from
- *    the Open Source Initiative, see
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
  *    http://www.opensource.org/licenses/bsd-license.php.
  *
- * Licensee has the right to choose either one of the above two licenses.
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
  *
- * Redistributions of source code must retain both the above copyright
- * notice and either one of the license notices.
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
  *
  * Redistributions in binary form must reproduce both the above copyright
- * notice, either one of the license notices in the documentation
+ * notice, one of the license notices in the documentation
  * and/or other materials provided with the distribution.
  */
 
@@ -30,7 +29,7 @@
  *
  * PURPOSE: QP routines for access to DET Verbs
  *
- * $Id:$
+ * $Id: $
  **********************************************************************/
 
 #include "dapl.h"
@@ -311,7 +310,7 @@ dapls_modify_qp_state ( IN ib_qp_handle_
 			qp_attr.path_mtu 		= IBV_MTU_1024;
 			qp_attr.dest_qp_num 		= qp_cm->qpn;
 			qp_attr.rq_psn 			= 1;
-			qp_attr.max_dest_rd_atomic	= 1;
+			qp_attr.max_dest_rd_atomic	= 8;
 			qp_attr.min_rnr_timer		= 12;
 			qp_attr.ah_attr.is_global	= 0;
 			qp_attr.ah_attr.dlid		= qp_cm->lid;
@@ -338,7 +337,7 @@ dapls_modify_qp_state ( IN ib_qp_handle_
 			qp_attr.retry_cnt	= 7;
 			qp_attr.rnr_retry	= 7;
 			qp_attr.sq_psn		= 1;
-			qp_attr.max_rd_atomic	= 1;
+			qp_attr.max_rd_atomic	= 8;
 			dapl_dbg_log (DAPL_DBG_TYPE_EP,
 			      " modify_qp_rts: psn %x or %x\n",
 			      qp_attr.sq_psn, qp_attr.max_rd_atomic );
Index: openib/README
===================================================================
--- openib/README	(revision 2190)
+++ openib/README	(working copy)
@@ -41,8 +41,8 @@ A simple dapl test just for initial open
 
 known issues:
 
-	early drop, good luck! Only tested with a simple dtest.
-	see TODO for more details
-	events not working?? 
+	early drop, only tested with simple dtest and dapltest SR.
+	no memory windows support in ibverbs, dat_create_rmr fails.
+	
 
 
Index: openib/dapl_ib_util.h
===================================================================
--- openib/dapl_ib_util.h	(revision 2190)
+++ openib/dapl_ib_util.h	(working copy)
@@ -79,10 +79,23 @@ typedef struct _ib_qp_cm
 
 } ib_qp_cm_t;
 
-/* EP->cm_handle for connect, SP->cm_srvc_handle for listen */
+/* 
+ * dapl_llist_entry is defined in dapl.h, but dapl.h depends on the provider
+ * typedefs in this file first. Move dapl_llist_entry out of dapl.h.
+ */
+struct ib_llist_entry
+{
+    struct dapl_llist_entry	*flink;
+    struct dapl_llist_entry	*blink;
+    void			*data;
+    struct dapl_llist_entry	*list_head;
+};
+
 struct ib_cm_handle
 { 
-	int			socket; 
+	struct ib_llist_entry	entry;
+	int			socket;
+	int			l_socket; 
 	struct dapl_hca		*hca_ptr;
 	DAT_HANDLE		cr;
 	DAT_HANDLE		sp;	
@@ -112,6 +125,9 @@ typedef enum 
 
 } ib_cm_events_t;
 
+/* prototype for cm thread */
+void cr_thread (void *arg);
+
 #else
 
 /* TODO: Waiting for IB CM to define */
@@ -205,11 +221,6 @@ typedef struct dapl_evd		*ib_wait_obj_ha
  * ibv_post_recv - Return 0, -1 & bad_wr 
  */
 
-/* definitions from libmthca/src/cq.c, should be in verbs.h */
-#define IB_CQ_OK	0
-#define IB_CQ_EMPTY	-1
-#define IB_POLL_ERR	-2
-
 /* async handler for CQ, QP, and unafiliated */
 typedef void (*ib_async_handler_t)(
     IN    ib_hca_handle_t    ib_hca_handle,
@@ -221,11 +232,18 @@ typedef struct _ib_hca_transport
 { 
 	struct	ibv_device	*ib_dev;
 	ib_cq_handle_t		ib_cq_empty;
+
+#ifdef SOCKET_CM
+	int			destroy;
+	DAPL_OS_THREAD		thread;
+	DAPL_OS_LOCK		lock;	
+	struct dapl_llist_entry	*list;	
+#endif
 	ib_async_handler_t	async_unafiliated;
 	ib_async_handler_t	async_cq_error;
-	ib_async_handler_t	async_cq_completion;
+	ib_async_handler_t	async_cq;
 	ib_async_handler_t	async_qp_error;
-	
+
 } ib_hca_tranport_t;
 
 /* provider specfic fields for shared memory support */
Index: openib/dapl_ib_cq.c
===================================================================
--- openib/dapl_ib_cq.c	(revision 2190)
+++ openib/dapl_ib_cq.c	(working copy)
@@ -382,10 +382,10 @@ DAT_RETURN dapls_ib_completion_notify (
  * Output:
  * 	none
  *
- * Returns:
+ * Returns: 
  * 	DAT_SUCCESS
  *	DAT_QUEUE_EMPTY
- *	dapl_convert_errno
+ *	
  */
 DAT_RETURN dapls_ib_completion_poll (
 	IN  DAPL_HCA			*hca_ptr,
@@ -393,15 +393,12 @@ DAT_RETURN dapls_ib_completion_poll (
 	IN  ib_work_completion_t	*wc_ptr)
 {
 	int	ret;
-    	
-	ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr);
+
+    	ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr);
 	if (ret == 1) 
 		return	DAT_SUCCESS;
-	else if ((ret == IB_CQ_OK) || (ret == IB_CQ_EMPTY)) 
-		return	DAT_QUEUE_EMPTY;
-	else 
-		return(dapl_convert_errno(EFAULT,"poll_cq"));;
-
+	
+	return	DAT_QUEUE_EMPTY;
 }
 
 #ifdef CQ_WAIT_OBJECT
@@ -447,24 +444,45 @@ dapls_ib_wait_object_wait (
 	IN ib_wait_obj_handle_t	    p_cq_wait_obj_handle,
 	IN u_int32_t 		    timeout)
 {
-	int status;
-	ib_cq_handle_t	cq = p_cq_wait_obj_handle->ib_cq_handle;
-	struct ibv_cq	*ibv_cq;
-	void		*ibv_ctx;
+	DAPL_EVD		*evd_ptr = p_cq_wait_obj_handle;
+	ib_cq_handle_t		cq = evd_ptr->ib_cq_handle;
+	struct ibv_cq		*ibv_cq;
+	void			*ibv_ctx;
+	int			status = EINVAL; /* invalid handle */
 
 	dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, 
 			" cq_object_wait: dev %p evd %p cq %p, time %d\n", 
-			cq->context, p_cq_wait_obj_handle, cq, timeout );
+			cq->context, evd_ptr, cq, timeout );
 
-	/* will block forever, only 1 per device for now?? */
 	/* TODO: add timeout, map each CQ created?? */
-	if (cq) {
-		status = ibv_get_cq_event(cq->context, 0, &ibv_cq, &ibv_ctx);
-		if (!status && (ibv_cq == cq)) 
-			return DAT_SUCCESS;
+	/* Multiple EVD's sharing one event handle for now */
+	while (evd_ptr->ib_cq_handle) {
+		
+		status = ibv_get_cq_event(cq->context, 
+					  0, &ibv_cq, &ibv_ctx);
+		if (status) 
+			break;
+
+		/* EVD mismatch, process DTO callback for this EVD */
+		if (ibv_cq != cq) {
+			ib_hca_tranport_t *hca_ptr = 
+				&evd_ptr->header.owner_ia->hca_ptr->ib_trans;
+							
+			if ( hca_ptr->async_cq ) 
+				hca_ptr->async_cq(cq->context,
+						  (ib_error_record_t*)ibv_cq,
+						  ibv_ctx);
+			
+			continue;
+		} 
+		break;
 	}
 	
-	return(dapl_convert_errno(EFAULT,"cq_wait_object_wait"));
+	dapl_dbg_log (DAPL_DBG_TYPE_UTIL, 
+		      " cq_object_wait: RET cq %p ibv_cq %p ibv_ctx %p %x\n",
+		      cq,ibv_cq,ibv_ctx,status);
+	
+	return(dapl_convert_errno(status,"cq_wait_object_wait"));
 	
 }
 #endif
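
The socket CM changes to dapl_ib_cm.c in the patch above do two things on
the wire: the QP information is written with writev(), adding the private
data iovec only when p_size is non-zero, and the final handshake is a 16-bit
marker (0x0E0F) sent in network byte order instead of the "QP_RTR_RTS"
string. The following is a minimal user-space sketch of that exchange, not
the provider code itself. struct qp_info is a hypothetical stand-in for
ib_qp_cm_t, and the function names are made up for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define RTU_MAGIC 0x0E0F	/* same marker value the patch uses */

/* Hypothetical stand-in for ib_qp_cm_t; only p_size matters here. */
struct qp_info {
	uint32_t qpn;
	uint16_t lid;
	uint16_t port;
	uint32_t p_size;
};

/* Send the header plus optional private data; returns 0 on success. */
int send_qp_info(int fd, struct qp_info *qp, const void *p_data,
		 uint32_t p_size)
{
	struct iovec iov[2];
	ssize_t len;

	assert(qp != NULL);
	qp->p_size = p_size;
	iov[0].iov_base = qp;
	iov[0].iov_len  = sizeof(*qp);
	if (p_size) {		/* second iovec only when there is data */
		iov[1].iov_base = (void *)p_data;
		iov[1].iov_len  = p_size;
	}
	len = writev(fd, iov, p_size ? 2 : 1);
	return (len == (ssize_t)(sizeof(*qp) + p_size)) ? 0 : -1;
}

/* Active side: signal that the local QP reached RTS. */
int send_rtu(int fd)
{
	uint16_t rtu = htons(RTU_MAGIC);

	return (write(fd, &rtu, sizeof(rtu)) == (ssize_t)sizeof(rtu)) ? 0 : -1;
}

/* Passive side: accept only a complete, correct marker. */
int recv_rtu(int fd)
{
	uint16_t rtu = 0;
	ssize_t len = read(fd, &rtu, sizeof(rtu));

	return (len == (ssize_t)sizeof(rtu) && ntohs(rtu) == RTU_MAGIC) ? 0 : -1;
}

/* Run both ends over a socketpair; returns 0 if every step passes. */
int handshake_selftest(void)
{
	int sv[2];
	struct qp_info out = { 0x123, 4, 1, 0 }, in;
	int rc = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (send_qp_info(sv[0], &out, NULL, 0) == 0 &&
	    read(sv[1], &in, sizeof(in)) == (ssize_t)sizeof(in) &&
	    in.qpn == out.qpn && in.p_size == 0 &&
	    send_rtu(sv[0]) == 0 &&
	    recv_rtu(sv[1]) == 0)
		rc = 0;
	close(sv[0]);
	close(sv[1]);
	return rc;
}
```

Checking the marker value as well as the byte count lets the passive side
reject a peer that wrote something else, which the old fixed-size string
read could not do.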


From mshefty at ichips.intel.com  Tue Apr 19 16:01:45 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:01:45 -0700
Subject: [openib-general] [PATCH] [VERBS] new verbs call to allocate AH
	using WC
Message-ID: <20050419160145.6244bce6.mshefty@ichips.intel.com>

This patch will add a new call to ib_verbs.h to allocate an address handle
using a received work completion.  This call will be used by the MAD RMPP
code (to be submitted shortly in a separate patch).

Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Index: include/ib_verbs.h
===================================================================
--- include/ib_verbs.h	(revision 2168)
+++ include/ib_verbs.h	(working copy)
@@ -971,6 +971,21 @@
 struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr);
 
 /**
+ * ib_create_ah_from_wc - Creates an address handle associated with the
+ *   sender of the specified work completion.
+ * @pd: The protection domain associated with the address handle.
+ * @wc: Work completion information associated with a received message.
+ * @grh: References the received global route header.  This parameter is
+ *   ignored unless the work completion indicates that the GRH is valid.
+ * @port_num: The outbound port number to associate with the address.
+ *
+ * The address handle is used to reference a local or global destination
+ * in all UD QP post sends.
+ */
+struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
+				   struct ib_grh *grh, u8 port_num);
+
+/**
  * ib_modify_ah - Modifies the address vector associated with an address
  *   handle.
  * @ah: The address handle to modify.
Index: core/verbs.c
===================================================================
--- core/verbs.c	(revision 2168)
+++ core/verbs.c	(working copy)
@@ -40,6 +40,7 @@
 #include <linux/err.h>
 
 #include <ib_verbs.h>
+#include <ib_cache.h>
 
 /* Protection domains */
 
@@ -87,6 +88,40 @@
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
+				   struct ib_grh *grh, u8 port_num)
+{
+	struct ib_ah_attr ah_attr;
+	u32 flow_class;
+	u16 gid_index;
+	int ret;
+
+	memset(&ah_attr, 0, sizeof ah_attr);
+	ah_attr.dlid = wc->slid;
+	ah_attr.sl = wc->sl;
+	ah_attr.src_path_bits = wc->dlid_path_bits;
+	ah_attr.port_num = port_num;
+	
+	if (wc->wc_flags & IB_WC_GRH) {
+		ah_attr.ah_flags = IB_AH_GRH;
+		ah_attr.grh.dgid = grh->sgid;
+
+		ret = ib_find_cached_gid(pd->device, &grh->dgid, &port_num,
+					 &gid_index);
+		if (ret)
+			return ERR_PTR(ret);
+
+		ah_attr.grh.sgid_index = (u8) gid_index;
+		flow_class = be32_to_cpu(grh->version_tclass_flow);
+		ah_attr.grh.flow_label = flow_class & 0xFFFFF;
+		ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF;
+		ah_attr.grh.hop_limit = grh->hop_limit;
+	}
+
+	return ib_create_ah(pd, &ah_attr);
+}
+EXPORT_SYMBOL(ib_create_ah_from_wc);
+
 int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
 {
 	return ah->device->modify_ah ?
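
For reference, the GRH handling in ib_create_ah_from_wc() above splits the
big-endian version_tclass_flow word into a 20-bit flow label and an 8-bit
traffic class. The sketch below shows the same masks and shifts in user
space. It assumes ntohl() as a stand-in for the kernel's be32_to_cpu(), and
struct vtf_fields and decode_vtf() are hypothetical names that do not exist
in the patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the three fields packed into the word. */
struct vtf_fields {
	uint8_t  version;	/* top 4 bits */
	uint8_t  traffic_class;	/* next 8 bits */
	uint32_t flow_label;	/* low 20 bits */
};

/* Decode a version/tclass/flow word received in network byte order. */
struct vtf_fields decode_vtf(uint32_t be_word)
{
	uint32_t host = ntohl(be_word);	/* convert the value, not its address */
	struct vtf_fields f;

	f.flow_label    = host & 0xFFFFF;
	f.traffic_class = (host >> 20) & 0xFF;
	f.version       = (host >> 28) & 0xF;
	return f;
}
```

For example, a word whose host-order value is 0x6AB12345 decodes to version
6, traffic class 0xAB and flow label 0x12345.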


From robert.j.woodruff at intel.com  Tue Apr 19 16:17:58 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Tue, 19 Apr 2005 16:17:58 -0700
Subject: [openib-general] Re: [ANNOUNCE][PATCH] Backport patch for
	2.6.9kernel
In-Reply-To: <20050418200249.GG6931@esmail.cup.hp.com>
Message-ID: <ORSMSX408FRaqbC8wSA0000000e@orsmsx408.amr.corp.intel.com>

Grant wrote, 
>> 
>> Roland, maybe, when you do this, you can put a copy of the source as you
>> submit it under gen2/branches? I think that would solve Grant's problem
>> but may be too hard to do in practice.

>Not really since I can see the kernel rev and then pull the matching
>kernel source. Well unless people are using -mm or linus' bk tree
>directly. But not that many people do that.

>grant

Ok, I split the large patch into 3: one for kernel diffs, one for the openib
drivers, and one for the openib fixups that are needed to backport.

It might be good to come up with some way to track which SVN rev went into
which kernel.org release, but until that is done, we can just embed the rev
of the kernel.org infiniband source that was backported, as I have done in
the three patches attached.
These allow a backport of what is in 2.6.12 to 2.6.9.
The patches should be applied in order: 01, 02, and 03.

For that matter, we may only want to back port the stable versions of code
that are released with each kernel.org release and thus may not need the 
SVN number in the back port patches. 

Note that since 2.6.12 is not quite released, we should also consider these
release candidates and not put them into SVN until 2.6.12 is
released. I have done limited testing with IPoIB and it appears to work ok.
It would be good if someone else could also try them, perhaps on some other
platform type. I tested them on Itanium. 

woody
-------------- next part --------------
A non-text attachment was scrubbed...
Name: infiniband-2.6.12-to-2.6.9-kernel-fixups-01.diff
Type: application/octet-stream
Size: 12580 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050419/3cdafb5c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: infiniband-2.6.12-to-2.6.9-openib-drivers-02.diff
Type: application/octet-stream
Size: 802662 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050419/3cdafb5c/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: infiniband-2.6.12-to-2.6.9-openib-fixups-03.diff
Type: application/octet-stream
Size: 864 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050419/3cdafb5c/attachment-0002.obj>

From mshefty at ichips.intel.com  Tue Apr 19 16:20:32 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:20:32 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
Message-ID: <20050419162032.46e339ab.mshefty@ichips.intel.com>

This patch moves the definitions of the SA MAD and header from ib_sa.h and
sa_query.c to ib_mad.h.  The definitions are needed by RMPP.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 2168)
+++ include/ib_mad.h	(working copy)
@@ -39,6 +39,8 @@
 #if !defined( IB_MAD_H )
 #define IB_MAD_H
 
+#include <linux/pci.h>
+
 #include <ib_verbs.h>
 
 /* Management base version */
@@ -115,6 +117,12 @@
 	union ib_gid	dgid;
 } __attribute__ ((packed));
 
+/*
+ * These structures must be packed because they have 64-bit fields
+ * that are only 32-bit aligned.  64-bit architectures will lay them
+ * out wrong otherwise.  (And unfortunately they are sent on the wire
+ * so we can't change the layout)
+ */
 struct ib_mad_hdr {
 	u8	base_version;
 	u8	mgmt_class;
@@ -137,6 +145,17 @@
 	u32	paylen_newwin;
 } __attribute__ ((packed));
 
+typedef u64 __bitwise ib_sa_comp_mask;
+
+#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
+
+struct ib_sa_hdr {
+	u64			sm_key;
+	u16			attr_offset;
+	u16			reserved;
+	ib_sa_comp_mask		comp_mask;
+} __attribute__ ((packed));
+
 struct ib_mad {
 	struct ib_mad_hdr	mad_hdr;
 	u8			data[232];
@@ -148,6 +167,13 @@
 	u8			data[220];
 } __attribute__ ((packed));
 
+struct ib_sa_mad {
+	struct ib_mad_hdr	mad_hdr;
+	struct ib_rmpp_hdr	rmpp_hdr;
+	struct ib_sa_hdr	sa_hdr;
+	u8			data[200];
+} __attribute__ ((packed));
+
 struct ib_vendor_mad {
 	struct ib_mad_hdr	mad_hdr;
 	struct ib_rmpp_hdr	rmpp_hdr;
@@ -418,8 +444,7 @@
 			  void *buf);
 
 /**
- * ib_free_recv_mad - Returns data buffers used to receive a MAD to the
- *   access layer.
+ * ib_free_recv_mad - Returns data buffers used to receive a MAD.
  * @mad_recv_wc: Work completion information for a received MAD.
  *
  * Clients receiving MADs through their ib_mad_recv_handler must call this
Index: include/ib_sa.h
===================================================================
--- include/ib_sa.h	(revision 2168)
+++ include/ib_sa.h	(working copy)
@@ -89,10 +89,6 @@
 	}
 }
 
-typedef u64 __bitwise ib_sa_comp_mask;
-
-#define IB_SA_COMP_MASK(n)	((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
-
 /*
  * Structures for SA records are named "struct ib_sa_xxx_rec."  No
  * attempt is made to pack structures to match the physical layout of
Index: core/sa_query.c
===================================================================
--- core/sa_query.c	(revision 2168)
+++ core/sa_query.c	(working copy)
@@ -50,26 +50,6 @@
 MODULE_DESCRIPTION("InfiniBand subnet administration query support");
 MODULE_LICENSE("Dual BSD/GPL");
 
-/*
- * These two structures must be packed because they have 64-bit fields
- * that are only 32-bit aligned.  64-bit architectures will lay them
- * out wrong otherwise.  (And unfortunately they are sent on the wire
- * so we can't change the layout)
- */
-struct ib_sa_hdr {
-	u64			sm_key;
-	u16			attr_offset;
-	u16			reserved;
-	ib_sa_comp_mask		comp_mask;
-} __attribute__ ((packed));
-
-struct ib_sa_mad {
-	struct ib_mad_hdr	mad_hdr;
-	struct ib_rmpp_hdr	rmpp_hdr;
-	struct ib_sa_hdr	sa_hdr;
-	u8			data[200];
-} __attribute__ ((packed));
-
 struct ib_sa_sm_ah {
 	struct ib_ah        *ah;
 	struct kref          ref;


From ardavis at ichips.intel.com  Tue Apr 19 16:27:14 2005
From: ardavis at ichips.intel.com (ardavis)
Date: Tue, 19 Apr 2005 16:27:14 -0700
Subject: [openib-general] uverbs API
Message-ID: <426593D2.4030707@ichips.intel.com>

Hello Roland,

Now that I have the initial drop of uDAPL running I would like to 
discuss some possible modifications/additions.

Here is my TODO list that I need some feedback on....

- resize_cq
- query_device
- ib_query_gid
- ibv_get_cq_event(), need timed event call and wakeup
- current implementation supports one event per device, plans for more?
- memory window support

Thanks,

-arlin


From roland at topspin.com  Tue Apr 19 16:28:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 16:28:50 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <20050419162032.46e339ab.mshefty@ichips.intel.com> (Sean
	Hefty's message of "Tue, 19 Apr 2005 16:20:32 -0700")
References: <20050419162032.46e339ab.mshefty@ichips.intel.com>
Message-ID: <52br8a5ost.fsf@topspin.com>

Why do you need to add this:

    > +#include <linux/pci.h>

to ib_mad.h?  I didn't see anything new that would use it.

 - R.


From roland at topspin.com  Tue Apr 19 16:32:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 16:32:38 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <20050419162032.46e339ab.mshefty@ichips.intel.com> (Sean
	Hefty's message of "Tue, 19 Apr 2005 16:20:32 -0700")
References: <20050419162032.46e339ab.mshefty@ichips.intel.com>
Message-ID: <527jiy5omh.fsf@topspin.com>

By the way:

    > +/*
    > + * These structures must be packed because they have 64-bit fields
    > + * that are only 32-bit aligned.  64-bit architectures will lay them
    > + * out wrong otherwise.  (And unfortunately they are sent on the wire
    > + * so we can't change the layout)
    > + */

I just had a quick look at ib_mad.h and it seems that none of the
packed structures already in that file actually need the
__attribute__((packed)) -- everything is already aligned to its size
as far as I can tell.  It might be worth checking to make sure I'm not
missing anything, and then removing the packed attribute -- if nothing
else, this will shrink the IA64 code a fair bit.

 - R.
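
One way to check this is to compare offsets and sizes with and without the
attribute.  The sketch below is a user-space model, not the kernel header:
the 24-byte MAD header and 12-byte RMPP header are stood in for by plain
byte arrays (sizes implied by data[232] and data[220] in ib_mad.h), and all
model_ names are made up for illustration.  It suggests that in the SA
header from the patch, comp_mask sits at offset 12 and the header itself
starts at offset 36 of an SA MAD, so those two structures are the ones
where dropping the attribute could change the layout on a 64-bit build.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* SA header with natural alignment: the compiler may pad before comp_mask. */
struct model_sa_hdr {
	uint64_t	sm_key;
	uint16_t	attr_offset;
	uint16_t	reserved;
	uint64_t	comp_mask;
};

/* Same fields, packed to match the 20-byte wire format. */
struct model_sa_hdr_packed {
	uint64_t	sm_key;
	uint16_t	attr_offset;
	uint16_t	reserved;
	uint64_t	comp_mask;
} __attribute__ ((packed));

/* Hypothetical model of struct ib_sa_mad; the two headers are opaque bytes. */
struct model_sa_mad_packed {
	uint8_t				mad_hdr[24];
	uint8_t				rmpp_hdr[12];
	struct model_sa_hdr_packed	sa_hdr;
	uint8_t				data[200];
} __attribute__ ((packed));

/* The packed layouts must match the wire sizes on every architecture. */
static_assert(sizeof(struct model_sa_hdr_packed) == 20, "SA header size");
static_assert(sizeof(struct model_sa_mad_packed) == 256, "SA MAD size");

/* Offset of comp_mask without packing: architecture dependent. */
size_t natural_comp_mask_offset(void)
{
	return offsetof(struct model_sa_hdr, comp_mask);
}

/* Offset of comp_mask with packing: 8 + 2 + 2 bytes precede it. */
size_t packed_comp_mask_offset(void)
{
	return offsetof(struct model_sa_hdr_packed, comp_mask);
}

/* Offset of the SA header inside the packed SA MAD: 24 + 12 bytes. */
size_t packed_sa_hdr_offset(void)
{
	return offsetof(struct model_sa_mad_packed, sa_hdr);
}
```

If natural_comp_mask_offset() and packed_comp_mask_offset() agree on a given
build, the attribute makes no difference to that structure there; where they
differ, the unpacked layout no longer matches the wire format.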


From mshefty at ichips.intel.com  Tue Apr 19 16:34:12 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:34:12 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <52br8a5ost.fsf@topspin.com>
References: <20050419162032.46e339ab.mshefty@ichips.intel.com>
	<52br8a5ost.fsf@topspin.com>
Message-ID: <42659574.2010802@ichips.intel.com>

Roland Dreier wrote:
> Why do you need to add this:
> 
>     > +#include <linux/pci.h>
> 
> to ib_mad.h?  I didn't see anything new that would use it.

I don't think it's needed with this patch.  Sorry... I was trying to
break apart my RMPP changes into a few smaller patches to make the
review easier.  I'll remove it from this patch.

- Sean


From mshefty at ichips.intel.com  Tue Apr 19 16:35:25 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:35:25 -0700
Subject: [openib-general] [PATCH] relocate SA MAD definitions to ib_mad.h
In-Reply-To: <527jiy5omh.fsf@topspin.com>
References: <20050419162032.46e339ab.mshefty@ichips.intel.com>
	<527jiy5omh.fsf@topspin.com>
Message-ID: <426595BD.4060500@ichips.intel.com>

Roland Dreier wrote:

> By the way:
> 
>     > +/*
>     > + * These structures must be packed because they have 64-bit fields
>     > + * that are only 32-bit aligned.  64-bit architectures will lay them
>     > + * out wrong otherwise.  (And unfortunately they are sent on the wire
>     > + * so we can't change the layout)
>     > + */
> 
> I just had a quick look at ib_mad.h and it seems that none of the
> packed structures already in that file actually need the
> __attribute__((packed)) -- everything is already aligned to its size
> as far as I can tell.  It might be worth checking to make sure I'm not
> missing anything, and then removing the packed attribute -- if nothing
> else, this will shrink the IA64 code a fair bit.

I'll double check this and remove the attribute packed if so.

- Sean


From mshefty at ichips.intel.com  Tue Apr 19 16:39:28 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:39:28 -0700
Subject: [openib-general] [PATCH] [VERBS] new verbs call to allocate AH
	using WC
In-Reply-To: <20050419160145.6244bce6.mshefty@ichips.intel.com>
References: <20050419160145.6244bce6.mshefty@ichips.intel.com>
Message-ID: <426596B0.8030505@ichips.intel.com>

Sean Hefty wrote:

> Index: include/ib_verbs.h
> ===================================================================
...
> +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
> +				   struct ib_grh *grh, u8 port_num);
> +
...
> +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
> +				   struct ib_grh *grh, u8 port_num)
> +{

It looks like I missed including the change that moved struct ib_grh 
from ib_mad.h to ib_verbs.h.

- Sean


From mshefty at ichips.intel.com  Tue Apr 19 16:54:30 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 19 Apr 2005 16:54:30 -0700
Subject: [openib-general] [RFC] patch for new send MAD allocation routines
Message-ID: <20050419165430.786a1dfc.mshefty@ichips.intel.com>

This pseudo-patch (meaning I haven't tested it separately from the other RMPP
changes) defines a new structure and calls for allocation of a MAD that
can be posted on the send queue.  It tries to combine functionality common
to several MAD agents into a single location.  It is currently only used
by an RMPP test program.  My plan was to update other agents once these
calls were part of the standard build.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 2168)
+++ include/ib_mad.h	(working copy)
@@ -39,6 +39,8 @@
 #if !defined( IB_MAD_H )
 #define IB_MAD_H
 
+#include <linux/pci.h>
+
 #include <ib_verbs.h>
 
 /* Management base version */
@@ -157,6 +159,30 @@
 } __attribute__ ((packed));
 
 /**
+ * ib_mad_send_buf - MAD data buffer and work request for sends.
+ * @mad: References an allocated MAD data buffer.  The size of the data
+ *   buffer is specified in the @send_wr.length field.
+ * @mapping: DMA mapping information.
+ * @mad_agent: MAD agent that allocated the buffer.
+ * @context: User-controlled context fields.
+ * @send_wr: An initialized work request structure used when sending the MAD.
+ *   The wr_id field of the work request is initialized to reference this
+ *   data structure.
+ * @sge: A scatter-gather list referenced by the work request.
+ *
+ * Users are responsible for initializing the MAD buffer itself, with the
+ * exception of specifying the payload length field in any RMPP MAD.
+ */
+struct ib_mad_send_buf {
+	struct ib_mad		*mad;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct ib_mad_agent	*mad_agent;
+	void			*context[2];
+	struct ib_send_wr	send_wr;
+	struct ib_sge		sge;
+};
+
+/**
  * ib_get_rmpp_resptime - Returns the RMPP response time.
  * @rmpp_hdr: An RMPP header.
  */
@@ -478,4 +504,35 @@
 int ib_process_mad_wc(struct ib_mad_agent *mad_agent,
 		      struct ib_wc *wc);
 
+/**
+ * ib_create_send_mad - Allocate and initialize a data buffer and work request
+ *   for sending a MAD.
+ * @mad_agent: Specifies the registered MAD service to associate with the MAD.
+ * @remote_qpn: Specifies the QPN of the receiving node.
+ * @pkey_index: Specifies which PKey the MAD will be sent using.  This field
+ *   is valid only if the remote_qpn is QP 1.
+ * @ah: References the address handle used to transfer to the remote node.
+ * @hdr_len: Indicates the size of the data header of the MAD.  This length
+ *   should include the common MAD header, RMPP header, plus any class
+ *   specific header.
+ * @data_len: Indicates the size of any user-transferred data.  The call will
+ *   automatically adjust the allocated buffer size to account for any
+ *   additional padding that may be necessary.
+ *
+ * This is a helper routine that may be used to allocate a MAD.  Users are
+ * not required to allocate outbound MADs using this call.  The returned
+ * MAD send buffer will reference a data buffer usable for sending a MAD, along
+ * with an initialized work request structure.
+ */
+struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent,
+					    u32 remote_qpn, u16 pkey_index,
+					    struct ib_ah *ah,
+					    int hdr_len, int data_len);
+
+/**
+ * ib_free_send_mad - Returns data buffers used to send a MAD.
+ * @send_buf: Previously allocated send data buffer.
+ */
+void ib_free_send_mad(struct ib_mad_send_buf *send_buf);
+
 #endif /* IB_MAD_H */
Index: core/mad.c
===================================================================
--- core/mad.c	(revision 2168)
+++ core/mad.c	(working copy)
@@ -766,6 +766,89 @@
 	return ret;
 }
 
+static int get_buf_length(int hdr_len, int data_len)
+{
+	int seg_size, pad;
+
+	seg_size = sizeof(struct ib_mad) - hdr_len;
+	if (data_len && seg_size) {
+		pad = seg_size - data_len % seg_size;
+		if (pad == seg_size)
+			pad = 0;
+	} else
+		pad = seg_size;
+	return hdr_len + data_len + pad;
+}
+
+struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent,
+					    u32 remote_qpn, u16 pkey_index,
+					    struct ib_ah *ah,
+					    int hdr_len, int data_len)
+{
+	struct ib_mad_agent_private *mad_agent_priv;
+	struct ib_mad_send_buf *send_buf;
+	int buf_size;
+	void *buf;
+
+	mad_agent_priv = container_of(mad_agent,
+				      struct ib_mad_agent_private, agent);
+	buf_size = get_buf_length(hdr_len, data_len);
+
+	buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL);
+	if (!buf)
+		return ERR_PTR(-ENOMEM);
+
+	send_buf = buf + buf_size;
+	memset(send_buf, 0, sizeof *send_buf);
+	send_buf->mad = buf;
+
+	send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device,
+					    buf, buf_size, DMA_TO_DEVICE);
+	pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr);
+	send_buf->sge.length = buf_size;
+	send_buf->sge.lkey = mad_agent->mr->lkey;
+
+	send_buf->send_wr.wr_id = (unsigned long) send_buf;
+	send_buf->send_wr.sg_list = &send_buf->sge;
+	send_buf->send_wr.num_sge = 1;
+	send_buf->send_wr.opcode = IB_WR_SEND;
+	send_buf->send_wr.send_flags = IB_SEND_SIGNALED;
+	send_buf->send_wr.wr.ud.ah = ah;
+	send_buf->send_wr.wr.ud.mad_hdr = &send_buf->mad->mad_hdr;
+	send_buf->send_wr.wr.ud.remote_qpn = remote_qpn;
+	send_buf->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY;
+	send_buf->send_wr.wr.ud.pkey_index = pkey_index;
+
+	if (mad_agent->rmpp_version) {
+		struct ib_rmpp_mad *rmpp_mad;
+		rmpp_mad = (struct ib_rmpp_mad *)send_buf->mad;
+		rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len -
+			offsetof(struct ib_rmpp_mad, data) + data_len);
+	}
+
+	send_buf->mad_agent = mad_agent;
+	atomic_inc(&mad_agent_priv->refcount);
+	return send_buf;
+}
+EXPORT_SYMBOL(ib_create_send_mad);
+
+void ib_free_send_mad(struct ib_mad_send_buf *send_buf)
+{
+	struct ib_mad_agent_private *mad_agent_priv;
+
+	mad_agent_priv = container_of(send_buf->mad_agent,
+				      struct ib_mad_agent_private, agent);
+
+	dma_unmap_single(send_buf->mad_agent->device->dma_device,
+			 pci_unmap_addr(send_buf, mapping),
+			 send_buf->sge.length, DMA_TO_DEVICE);
+	kfree(send_buf->mad);
+
+	if (atomic_dec_and_test(&mad_agent_priv->refcount))
+		wake_up(&mad_agent_priv->wait);
+}
+EXPORT_SYMBOL(ib_free_send_mad);
+
 static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv,
 		       struct ib_mad_send_wr_private *mad_send_wr)
 {
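
The padding rule in get_buf_length() can be tried out in user space.  This
is a minimal sketch, assuming a 256-byte MAD (24-byte header plus data[232])
and a 36-byte offset of the data field in struct ib_rmpp_mad (implied by
data[220]); the model_ names are made up here and are not part of the patch.

```c
#include <assert.h>

#define MODEL_MAD_SIZE		256	/* sizeof(struct ib_mad) */
#define MODEL_RMPP_DATA_OFFSET	36	/* offsetof(struct ib_rmpp_mad, data) */

/* Same arithmetic as get_buf_length() in the patch. */
int model_buf_length(int hdr_len, int data_len)
{
	int seg_size, pad;

	assert(hdr_len >= 0 && hdr_len <= MODEL_MAD_SIZE && data_len >= 0);

	seg_size = MODEL_MAD_SIZE - hdr_len;	/* data bytes per segment */
	if (data_len && seg_size) {
		/* Round the data up to a whole number of segments. */
		pad = seg_size - data_len % seg_size;
		if (pad == seg_size)
			pad = 0;
	} else
		pad = seg_size;		/* no data: one full, empty segment */
	return hdr_len + data_len + pad;
}

/* Payload length as set in the patch: class header plus user data. */
int model_rmpp_paylen(int hdr_len, int data_len)
{
	return hdr_len - MODEL_RMPP_DATA_OFFSET + data_len;
}
```

With the SA headers (hdr_len = 24 + 12 + 20 = 56) each segment carries 200
data bytes, so model_buf_length(56, 300) gives 456: one set of headers plus
two segments' worth of data.  The matching payload length is 20 + 300 = 320.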


From roland at topspin.com  Tue Apr 19 17:07:40 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 17:07:40 -0700
Subject: [openib-general] [RFC] patch for new send MAD allocation routines
In-Reply-To: <20050419165430.786a1dfc.mshefty@ichips.intel.com> (Sean
	Hefty's message of "Tue, 19 Apr 2005 16:54:30 -0700")
References: <20050419165430.786a1dfc.mshefty@ichips.intel.com>
Message-ID: <52is2i48fn.fsf@topspin.com>

Instead of hard-coding GFP_KERNEL:

    > +	buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL);

would it make sense to put a gfp_mask into the API and avoid some of
the heartburn about interrupt context that we've had lately?

 - R.


From roland at topspin.com  Tue Apr 19 17:24:46 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 19 Apr 2005 17:24:46 -0700
Subject: [openib-general] Re: uverbs API
References: <426593D2.4030707@ichips.intel.com>
Message-ID: <528y3e47n5.fsf@topspin.com>

    > Here is my TODO list that I need some feedback on....

    > - resize_cq

A fair bit of work, probably won't get done too soon.

    > - query_device
    > - ib_query_gid

Both easy, I'll add them shortly.

    > - ibv_get_cq_event(), need timed event call and wakeup

Can you explain what this means a little more?  Is there something you
need that you can't get by using select()/poll() with a timeout on the
CQ event FD?

    > - current implementation supports one event per device, plans for more?

Yes, in the medium term I plan to add support for multiple MSI-X
vectors so that multiple CQ events are possible.

    > - memory window support

Right now I don't have any plans to implement this.  All the feedback
I've seen is that with current hardware, performance is not good
enough to make MWs worth using.

 - R.
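
For reference, the select()/poll() approach mentioned above could look
roughly like this.  It is only a sketch: wait_readable() and
demo_wait_readable() are hypothetical helpers, and a pipe stands in for the
CQ event file descriptor, since how that descriptor is obtained from
libibverbs is not shown here.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error or hang-up.
 * EINTR is reported as an error, so a signal also wakes the caller.
 */
int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Usage sketch: returns 0 if the timeout and wakeup paths both behave. */
int demo_wait_readable(void)
{
	int fds[2], before, after = -1;

	if (pipe(fds) != 0)
		return -1;

	before = wait_readable(fds[0], 10);	/* nothing written yet */
	if (write(fds[1], "x", 1) == 1)
		after = wait_readable(fds[0], 10);	/* one byte pending */

	close(fds[0]);
	close(fds[1]);
	return (before == 0 && after == 1) ? 0 : -1;
}
```

A caller would only read the event once wait_readable() has returned 1,
which gives a timed wait without changing the blocking call itself.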


From krkumar2 at in.ibm.com  Wed Apr 20 00:14:28 2005
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 20 Apr 2005 12:44:28 +0530
Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for userspace
	verbs
Message-ID: <OF90624C76.07994D94-ON65256FE9.0024E41B-65256FE9.0027CD14@in.ibm.com>

Hi Roland,

> In particular, the memory pinning code in uverbs_mem.c could stand a
> looking over.

1. In ib_umem_get(), I see you set ret = 0, which is unnecessary because
    chunk->nents is set based on ret value. Plus you already have a
    "while (ret)" to break out. "ret = 0" can be safely removed.

2. Also, as an optimization, in __ib_umem_release(), you could add another
    argument "page_dirty" which if set will do set_page_dirty_lock() (it
    seems to be a costly routine), and pass that argument as 0 in
    ib_umem_get() and 1 in ib_umem_release().

3. In __ib_umem_unmark() (sorry, I don't fully know this code very well
    and could be wrong), should the for loop have cur_base = vma->vm_start
    (instead of vm_end) since vma is set to the next one before this
    statement is executed?

thanks,

- KK

From roland at topspin.com  Wed Apr 20 06:57:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 06:57:06 -0700
Subject: [openib-general] kernel/user verbs interface changed
Message-ID: <521x954klp.fsf@topspin.com>

I just committed a change to ib_user_verbs.h in the kernel and
kern-abi.h in libibverbs to add command codes for all verbs.  This
means that you must update both the kernel and libibverbs at the same
time; for example, a new kernel will not work with old libibverbs.

 - R.


From ardavis at ichips.intel.com  Wed Apr 20 08:51:27 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 20 Apr 2005 08:51:27 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
In-Reply-To: <52r7hbixzz.fsf@topspin.com>
References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com>
Message-ID: <42667A7F.8090604@ichips.intel.com>

Here is a new oops from my overnight run....

Apr 19 12:14:57 iclust-19 kernel: idr_remove called for id=0 which is not allocated.
Apr 19 12:14:57 iclust-19 kernel:
Apr 19 12:14:57 iclust-19 kernel: Call Trace:<ffffffff80241884>{idr_remove+244} <ffffffff8037f0fe>{ib_uverbs_event_release+126}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8037f886>{ib_uverbs_close+566} <ffffffff8017baa2>{__fput+98}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8016a66d>{remove_vm_struct+125} <ffffffff8016bbd6>{do_munmap+918}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8042d991>{__down_read+49} <ffffffff8016c3fd>{sys_munmap+77}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8010e30a>{system_call+126}
Apr 19 12:14:57 iclust-19 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP:
Apr 19 12:14:57 iclust-19 kernel: <ffffffff8036da30>{ib_dealloc_pd+0}
Apr 19 12:14:57 iclust-19 kernel: PGD 2feee067 PUD 312c7067 PMD 0
Apr 19 12:14:57 iclust-19 kernel: Oops: 0000 [1] SMP
Apr 19 12:14:57 iclust-19 kernel: CPU 0
Apr 19 12:14:57 iclust-19 kernel: Modules linked in:
Apr 19 12:14:57 iclust-19 kernel: Pid: 19391, comm: putfence1 Not tainted 2.6.11
Apr 19 12:14:57 iclust-19 kernel: RIP: 0010:[<ffffffff8036da30>] <ffffffff8036da30>{ib_dealloc_pd+0}
Apr 19 12:14:57 iclust-19 kernel: RSP: 0018:ffff81002f66fe40  EFLAGS: 00010296
Apr 19 12:14:57 iclust-19 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000040000
Apr 19 12:14:57 iclust-19 kernel: RDX: 00000000ffffff01 RSI: ffff8100325bb400 RDI: 0000000000000000
Apr 19 12:14:57 iclust-19 kernel: RBP: ffff8100311fe900 R08: 00000000fffffff8 R09: 0000000000000002
Apr 19 12:14:57 iclust-19 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8100311fe910
Apr 19 12:14:57 iclust-19 kernel: R13: ffff81003a3e7d78 R14: ffff81003a3e7880 R15: ffff81003a3e7d88
Apr 19 12:14:57 iclust-19 kernel: FS:  00002aaaaae55f40(0000) GS:ffffffff805fe400(0000) knlGS:0000000000000000
Apr 19 12:14:57 iclust-19 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 19 12:14:57 iclust-19 kernel: CR2: 0000000000000010 CR3: 0000000031e8f000 CR4: 00000000000006e0
Apr 19 12:14:57 iclust-19 kernel: Process putfence1 (pid: 19391, threadinfo ffff81002f66e000, task ffff81002fbd54a0)
Apr 19 12:14:57 iclust-19 kernel: Stack: ffffffff8037f88e ffff81003a3e7d80 ffff81003227a2c0 ffff810037ff6440
Apr 19 12:14:57 iclust-19 kernel:        ffff81003d864108 ffff81003f289870 00002aaaab4d2000 ffff810032eb6e00
Apr 19 12:14:57 iclust-19 kernel:        ffffffff8017baa2 00002aaaab4d2000
Apr 19 12:14:57 iclust-19 kernel: Call Trace:<ffffffff8037f88e>{ib_uverbs_close+574} <ffffffff8017baa2>{__fput+98}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8016a66d>{remove_vm_struct+125} <ffffffff8016bbd6>{do_munmap+918}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8042d991>{__down_read+49} <ffffffff8016c3fd>{sys_munmap+77}
Apr 19 12:14:57 iclust-19 kernel:        <ffffffff8010e30a>{system_call+126}
Apr 19 12:14:57 iclust-19 kernel:
Apr 19 12:14:57 iclust-19 kernel: Code: 8b 47 10 85 c0 75 0d 48 8b 07 4c 8b 98 18 01 00 00 41 ff e3
Apr 19 12:14:57 iclust-19 kernel: RIP <ffffffff8036da30>{ib_dealloc_pd+0} RSP <ffff81002f66fe40>
Apr 19 12:14:57 iclust-19 kernel: CR2: 0000000000000010


From mshefty at ichips.intel.com  Wed Apr 20 09:02:11 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 20 Apr 2005 09:02:11 -0700
Subject: [openib-general] [RFC] patch for new send MAD allocation routines
In-Reply-To: <52is2i48fn.fsf@topspin.com>
References: <20050419165430.786a1dfc.mshefty@ichips.intel.com>
	<52is2i48fn.fsf@topspin.com>
Message-ID: <42667D03.1030502@ichips.intel.com>

Roland Dreier wrote:
> Instead of hard-coding GFP_KERNEL:
> 
>     > +	buf = kmalloc(sizeof *send_buf + buf_size, GFP_KERNEL);
> 
> would it make sense to put a gfp_mask into the API and avoid some of
> the heartburn about interrupt context that we've had lately?

That's easy enough to do.

- Sean


From IBMEHCAD at de.ibm.com  Wed Apr 20 09:17:43 2005
From: IBMEHCAD at de.ibm.com (IBMEHCA DD)
Date: Wed, 20 Apr 2005 18:17:43 +0200
Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack
Message-ID: <OF2C40F988.88F5D2C7-ONC1256FE9.00588C28-C1256FE9.00597C4A@de.ibm.com>

Hi,
we've just released the first linux device driver for the IBM eServer HCA 
for Power5.
It's gen1 based and runs on SLES9 SP1.
Main testvehicle for this code was IPoIB.

gen2 and full userspace support will be next.

http://sourceforge.net/projects/ibmehcad/


Hardware device driver development for gen2 is so much easier... 

Chirstoph Raisch
HCAD teamlead, ibm boeblingen lab, 

From roland at topspin.com  Wed Apr 20 09:38:13 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:38:13 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
In-Reply-To: <42667A7F.8090604@ichips.intel.com> (Arlin Davis's message of
	"Wed, 20 Apr 2005 08:51:27 -0700")
References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com>
	<42667A7F.8090604@ichips.intel.com>
Message-ID: <52sm1l2ykq.fsf@topspin.com>

Thanks for the report.  Did your overnight run involve multiple
processes using IB verbs running at the same time?

I think I might see a race condition that could possibly cause this
crash...

 - R.


From roland at topspin.com  Wed Apr 20 09:44:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:44:45 -0700
Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack
In-Reply-To: <OF2C40F988.88F5D2C7-ONC1256FE9.00588C28-C1256FE9.00597C4A@de.ibm.com>
	(IBMEHCA DD's message of "Wed, 20 Apr 2005 18:17:43 +0200")
References: <OF2C40F988.88F5D2C7-ONC1256FE9.00588C28-C1256FE9.00597C4A@de.ibm.com>
Message-ID: <52ll7d2y9u.fsf@topspin.com>

    > Hi, we've just released the first linux device driver for
    > the IBM eServer HCA for Power5.  It's gen1 based and runs
    > on SLES9 SP1.  Main testvehicle for this code was IPoIB.

    > gen2 and full userspace support will be next.

Excellent, I'm glad to see this released.  I'm looking forward to
seeing the gen2 support.

If I may make a small suggestion for future releases: please have the
tar file contain a top-level directory like ehca-0021, with everything
contained in that directory.  It's a little annoying to unpack a tar
file and have it spread 5 files in your working directory, especially
when some have generic names like "INSTALL" or "patches."

Thanks,
  Roland


From roland at topspin.com  Wed Apr 20 09:48:05 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:48:05 -0700
Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for
	userspace verbs
In-Reply-To: <OF90624C76.07994D94-ON65256FE9.0024E41B-65256FE9.0027CD14@in.ibm.com>
	(Krishna Kumar2's message of "Wed, 20 Apr 2005 12:44:28 +0530")
References: <OF90624C76.07994D94-ON65256FE9.0024E41B-65256FE9.0027CD14@in.ibm.com>
Message-ID: <52hdi12y4a.fsf@topspin.com>

    Krishna> 1. In ib_umem_get(), I see you set ret = 0, which is
    Krishna> unnecessary because chunk-> nents is set based on ret
    Krishna> value. Plus you already have a "while (ret)" to break
    Krishna> out. "ret = 0" can be safely removed.

Actually, I know why "ret = 0" is there: although it's not strictly
needed, gcc isn't smart enough to see that, and without the
initialization, it warns that "'ret' might be used uninitialized in
this function."

 - R.
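
The general pattern behind that kind of warning can be sketched as follows.
This is not the ib_umem_get() code: pin_some() and pin_all() are
hypothetical stand-ins, and whether a given gcc version actually warns
depends on the version and optimization level.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a routine that handles up to max pages per
 * call; returns the number handled, or a negative value on failure.
 */
static int pin_some(size_t remaining, size_t max)
{
	return (int)(remaining < max ? remaining : max);
}

/*
 * Assume callers always pass npages > 0, so the loop body runs at least
 * once and ret is always assigned before it is read.  The compiler cannot
 * see that guarantee, so without the initializer it may report that ret
 * might be used uninitialized.
 */
int pin_all(size_t npages)
{
	int ret = 0;	/* redundant for correct callers; quiets the warning */

	while (npages) {
		ret = pin_some(npages, 16);
		if (ret < 0)
			break;
		npages -= (size_t)ret;
	}
	return ret < 0 ? ret : 0;
}
```

The initializer costs nothing at run time in practice and also makes the
function well defined if it is ever called with npages == 0.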


From ardavis at ichips.intel.com  Wed Apr 20 09:50:05 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 20 Apr 2005 09:50:05 -0700
Subject: [openib-general] Kernel oops: NULL ptr dereference in ib_umem_get
In-Reply-To: <52sm1l2ykq.fsf@topspin.com>
References: <426023BD.8080504@ichips.intel.com>
	<52r7hbixzz.fsf@topspin.com>	<42667A7F.8090604@ichips.intel.com>
	<52sm1l2ykq.fsf@topspin.com>
Message-ID: <4266883D.1050306@ichips.intel.com>

Roland Dreier wrote:

>Thanks for the report.  Did your overnight run involve multiple
>processes using IB verbs running at the same time?
>
>I think I might see a race condition that could possibly cause this
>crash...
>
> - R.
>
>  
>
Yes, this MPI test is running across 2 nodes, with 2 processes on each node.


From roland at topspin.com  Wed Apr 20 09:50:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 09:50:26 -0700
Subject: [openib-general] [PATCH][RFC][1/4] IB: core changes for
	userspace verbs
In-Reply-To: <OF90624C76.07994D94-ON65256FE9.0024E41B-65256FE9.0027CD14@in.ibm.com>
	(Krishna Kumar2's message of "Wed, 20 Apr 2005 12:44:28 +0530")
References: <OF90624C76.07994D94-ON65256FE9.0024E41B-65256FE9.0027CD14@in.ibm.com>
Message-ID: <52fyxl2y0d.fsf@topspin.com>

    Krishna> 1. In ib_umem_get(), I see you set ret = 0, which is
    Krishna> unnecessary because chunk-> nents is set based on ret
    Krishna> value. Plus you already have a "while (ret)" to break
    Krishna> out. "ret = 0" can be safely removed.

True, done.

    Krishna> 2. Also, as an optimization, in __ib_umem_release(), you
    Krishna> could add another argument "page_dirty" which if set will
    Krishna> do set_page_dirty_lock() (it seems to be a costly
    Krishna> routine), and pass that argument as 0 in ib_umem_get()
    Krishna> and 1 in ib_umem_release().

Seems reasonable, done as well.

    Krishna> 3. In __ib_umem_unmark() (sorry, I don't fully know this
    Krishna> code very well and could be wrong), should the for loop
    Krishna> have cur_base = vma->vm_start (instead of vm_end) since
    Krishna> vma is set to the next one before this statement is
    Krishna> executed ?

Yes, there was a bug there.  I think it's already fixed in the latest
code, though.

Thanks,
  Roland


From roland at topspin.com  Wed Apr 20 11:09:16 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 11:09:16 -0700
Subject: [openib-general] Kernel oops:  NULL ptr dereference in ib_umem_get
In-Reply-To: <4266883D.1050306@ichips.intel.com> (Arlin Davis's message of
	"Wed, 20 Apr 2005 09:50:05 -0700")
References: <426023BD.8080504@ichips.intel.com> <52r7hbixzz.fsf@topspin.com>
	<42667A7F.8090604@ichips.intel.com> <52sm1l2ykq.fsf@topspin.com>
	<4266883D.1050306@ichips.intel.com>
Message-ID: <523btl2ucz.fsf@topspin.com>

OK, I'm not absolutely sure this fixes the cause of the oops you saw,
but I am pretty sure it is a necessary fix.  You can apply the patch
below or just pull the latest subversion.  Remember that the latest
subversion kernel code requires up-to-date libibverbs code as well.

My current theory is that you had two MPI processes exiting
simultaneously, and ib_dealloc_ucontext() ended up accessing the same
struct idr for both processes, which is a no-no.

 - R.

--- infiniband/core/uverbs_main.c	(revision 2193)
+++ infiniband/core/uverbs_main.c	(working copy)
@@ -99,6 +99,8 @@ static int ib_dealloc_ucontext(struct ib
 	if (!context)
 		return 0;
 
+	down(&ib_uverbs_idr_mutex);
+
 	/* XXX Free AHs */
 
 	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
@@ -141,6 +143,8 @@ static int ib_dealloc_ucontext(struct ib
 		kfree(uobj);
 	}
 
+	up(&ib_uverbs_idr_mutex);
+
 	return context->device->dealloc_ucontext(context);
 }
 

From ardavis at ichips.intel.com  Wed Apr 20 11:35:31 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 20 Apr 2005 11:35:31 -0700
Subject: [openib-general] Re: uverbs API
In-Reply-To: <528y3e47n5.fsf@topspin.com>
References: <426593D2.4030707@ichips.intel.com> <528y3e47n5.fsf@topspin.com>
Message-ID: <4266A0F3.4020804@ichips.intel.com>

Roland Dreier wrote:

>    > Here is my TODO list that I need some feedback on....
>
>    > - resize_cq
>
>A fair bit of work, probably won't get done too soon.
>
>    > - query_device
>    > - ib_query_gid
>
>Both easy, I'll add them shortly.
>
>    > - ibv_get_cq_event(), need timed event call and wakeup
>
>Can you explain what this means a little more?  Is there something you
>need that you can't get by using select()/poll() with a timeout on the
>CQ event FD?
>
>  
>
As long as you can tell me that a thread blocked on get_cq_event will
wake up on a device close, signal, or CQ error, then I don't need a wakeup
call from userspace.

and yes, select() was exactly my thinking,  but I was hoping we could 
get it added to the ibv_get_cq_event() code and just include a new 
timeout parameter (in usecs) with the call.

-arlin

>    > - current implementation supports one event per device, plans for more?
>
>Yes, in the medium term I plan to add support for multiple MSI-X
>vectors so that multiple CQ events are possible.
>
>    > - memory window support
>
>Right now I don't have any plans to implement this.  All the feedback
>I've seen is that with current hardware, performance is not good
>enough to make MWs worth using.
>
> - R.
>
>  
>


From tduffy at sun.com  Wed Apr 20 13:12:35 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 20 Apr 2005 13:12:35 -0700
Subject: [openib-general] Re: [openib-commits] r2195 - in
	gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <20050420173553.D22902283D6@openib.ca.sandia.gov>
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>
Message-ID: <1114027955.11751.4.camel@duffman>

On Wed, 2005-04-20 at 10:35 -0700, sean.hefty at openib.org wrote:
> Modified: gen2/trunk/src/linux-kernel/infiniband/core/verbs.c
> ===================================================================
> --- gen2/trunk/src/linux-kernel/infiniband/core/verbs.c	2005-04-20 16:54:38 UTC (rev 2194)
> +++ gen2/trunk/src/linux-kernel/infiniband/core/verbs.c	2005-04-20 17:35:52 UTC (rev 2195)
> @@ -40,6 +40,7 @@
>  #include <linux/err.h>
>  
>  #include <ib_verbs.h>
> +#include <ib_cache.h>
>  
>  /* Protection domains */
>  
> @@ -87,6 +88,40 @@
>  }
>  EXPORT_SYMBOL(ib_create_ah);
>  
> +struct ib_ah *ib_create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
> +				   struct ib_grh *grh, u8 port_num)
> +{
> +	struct ib_ah_attr ah_attr;
> +	u32 flow_class;
> +	u16 gid_index;
> +	int ret;
> +
> +	memset(&ah_attr, 0, sizeof ah_attr);
> +	ah_attr.dlid = wc->slid;
> +	ah_attr.sl = wc->sl;
> +	ah_attr.src_path_bits = wc->dlid_path_bits;
> +	ah_attr.port_num = port_num;
> +	
> +	if (wc->wc_flags & IB_WC_GRH) {
> +		ah_attr.ah_flags = IB_AH_GRH;
> +		ah_attr.grh.dgid = grh->dgid;
> +
> +		ret = ib_find_cached_gid(pd->device, &grh->sgid, &port_num,
> +					 &gid_index);
> +		if (ret)
> +			return ERR_PTR(ret);
> +
> +		ah_attr.grh.sgid_index = (u8) gid_index;
> +		flow_class = be32_to_cpu(&grh->version_tclass_flow);
> +		ah_attr.grh.flow_label = flow_class & 0xFFFFF;
> +		ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF;
> +		ah_attr.grh.hop_limit = grh->hop_limit;
> +	}
> +
> +	return ib_create_ah(pd, &ah_attr);
> +}
> +EXPORT_SYMBOL(ib_create_ah_from_wc);
> +
>  int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
>  {
>  	return ah->device->modify_ah ?

Causes build warning on 64bit:

  CC [M]  drivers/infiniband/core/verbs.o
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c: In function ‘ib_create_ah_from_wc’:
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115: warning: cast from pointer to integer of different size

-tduffy

From roland at topspin.com  Wed Apr 20 13:24:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 13:24:06 -0700
Subject: [openib-general] Re: [openib-commits] r2195 - in
	gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <1114027955.11751.4.camel@duffman> (Tom Duffy's message of
	"Wed, 20 Apr 2005 13:12:35 -0700")
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>
	<1114027955.11751.4.camel@duffman>
Message-ID: <52u0m119jt.fsf@topspin.com>

    Tom> Causes build warning on 64bit:

    Tom>   CC [M] drivers/infiniband/core/verbs.o
    Tom> /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:
    Tom> In function ‘ib_create_ah_from_wc’:
    Tom> /build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/core/verbs.c:115:
    Tom> warning: cast from pointer to integer of different size

Looks like a real bug -- I think

		flow_class = be32_to_cpu(&grh->version_tclass_flow);

should be

		flow_class = be32_to_cpu(grh->version_tclass_flow);

(ie no "&" -- we want to swap the value, not the address!)
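
A minimal userspace sketch of the intended decode, assuming ntohl() as
a stand-in for be32_to_cpu() and using the same masks as the commit.
The struct grh_flow and the function names are invented for this
illustration only.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical holder for the two fields the commit extracts. */
struct grh_flow {
	uint32_t flow_label;	/* low 20 bits of the word */
	uint8_t traffic_class;	/* the 8 bits above the flow label */
};

/* Decode a big-endian version_tclass_flow word. */
static struct grh_flow decode_tclass_flow(uint32_t be_word)
{
	struct grh_flow out;
	uint32_t host = ntohl(be_word);	/* swap the value, not its address */

	out.flow_label = host & 0xFFFFF;
	out.traffic_class = (uint8_t)((host >> 20) & 0xFF);
	return out;
}

/* Usage: build a word on the wire and check both fields. */
static void decode_example(void)
{
	uint32_t wire = htonl((6u << 28) | (0xABu << 20) | 0x12345u);
	struct grh_flow f = decode_tclass_flow(wire);

	assert(f.flow_label == 0x12345);
	assert(f.traffic_class == 0xAB);
}
```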

 - R.


From tduffy at sun.com  Wed Apr 20 13:29:06 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 20 Apr 2005 13:29:06 -0700
Subject: [openib-general] Re: [openib-commits] r2195 - in
	gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <52u0m119jt.fsf@topspin.com>
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>
	<1114027955.11751.4.camel@duffman>  <52u0m119jt.fsf@topspin.com>
Message-ID: <1114028946.11751.7.camel@duffman>

On Wed, 2005-04-20 at 13:24 -0700, Roland Dreier wrote:
> Looks like a real bug -- I think
> 
> 		flow_class = be32_to_cpu(&grh->version_tclass_flow);
> 
> should be
> 
> 		flow_class = be32_to_cpu(grh->version_tclass_flow);
> 
> (ie no "&" -- we want to swap the value, not the address!)

That is what I thought as well, but there are other places in the code
that do the same thing.

agent.c:159
ping.c:137

-tduffy

From libor at topspin.com  Wed Apr 20 13:32:44 2005
From: libor at topspin.com (Libor Michalek)
Date: Wed, 20 Apr 2005 13:32:44 -0700
Subject: [openib-general] Re: uverbs API
In-Reply-To: <4266A0F3.4020804@ichips.intel.com>;
	from ardavis@ichips.intel.com on Wed, Apr 20, 2005 at 11:35:31AM
	-0700
References: <426593D2.4030707@ichips.intel.com> <528y3e47n5.fsf@topspin.com>
	<4266A0F3.4020804@ichips.intel.com>
Message-ID: <20050420133244.A9497@topspin.com>

On Wed, Apr 20, 2005 at 11:35:31AM -0700, Arlin Davis wrote:
> Roland Dreier wrote:
> >
> >Can you explain what this means a little more?  Is there something you
> >need that you can't get by using select()/poll() with a timeout on the
> >CQ event FD?
> >
> >  
> >
> As long as you can tell me that a thread blocked on get_cq_event will
> wake up on a device close, signal, or CQ error, then I don't need a wakeup
> call from userspace.
> 
> and yes, select() was exactly my thinking,  but I was hoping we could 
> get it added to the ibv_get_cq_event() code and just include a new 
> timeout parameter (in usecs) with the call.

  I guess it would be trivial for someone to write an ibv_get_cq_event() 
wrapper which took a timeout, and performed the select before calling
the real ibv_get_cq_event... However, that seems limiting from a real
application point of view where you will almost certainly want to have
more than one file descriptor on which you are waiting.

  If the consumer is responsible for getting the FD and placing
it into a select, in order to support timeouts and event notification,
then it's trivial to add a CM fd and a SA query fd, not to mention 
any other file descriptor in your application. 
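
  A minimal sketch of that approach, assuming poll() rather than
select() and treating every source (CQ event fd, CM fd, SA query fd)
as an ordinary readable descriptor.  wait_any_readable() is a made-up
helper, not a libibverbs call, and the limit of 8 descriptors is
arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_WAIT_FDS 8

/*
 * Wait up to timeout_ms for any of nfds descriptors to become readable.
 * Returns the index of the first readable descriptor, -1 on timeout,
 * or -2 on error (including interruption by a signal).
 */
static int wait_any_readable(const int *fds, int nfds, int timeout_ms)
{
	struct pollfd pfds[MAX_WAIT_FDS];
	int i, rc;

	assert(nfds > 0 && nfds <= MAX_WAIT_FDS);
	for (i = 0; i < nfds; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	rc = poll(pfds, (nfds_t)nfds, timeout_ms);
	if (rc < 0)
		return -2;
	if (rc == 0)
		return -1;

	for (i = 0; i < nfds; i++)
		if (pfds[i].revents & POLLIN)
			return i;
	return -2;	/* only error or hangup conditions were reported */
}
```

  The caller would then invoke the real event-fetching call only for
the descriptor whose index comes back, so the blocking call never has
to grow a timeout parameter of its own.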


-Libor


From roland at topspin.com  Wed Apr 20 13:34:26 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 20 Apr 2005 13:34:26 -0700
Subject: [openib-general] Re: [openib-commits] r2195 - in
	gen2/trunk/src/linux-kernel/infiniband: core include
In-Reply-To: <1114028946.11751.7.camel@duffman> (Tom Duffy's message of
	"Wed, 20 Apr 2005 13:29:06 -0700")
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>
	<1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com>
	<1114028946.11751.7.camel@duffman>
Message-ID: <52pswp192l.fsf@topspin.com>

    Tom> That is what I thought as well, but there are other places in
    Tom> the code that do the same thing.

    Tom> agent.c:159, ping.c:137

Those are doing be32_to_cpup() -- notice the "p" at the end, which
means it dereferences a pointer.  be32_to_cpup(&val) is rather
obfuscated but still technically correct.  It's sort of like writing
"*&val" instead of just "val."

In other words, it's probably worth cleaning up those other places,
but they're not actually bugs.
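
A minimal userspace sketch of the difference, assuming ntohl() in place
of the kernel byte-swap.  The my_ prefixed helpers are stand-ins
written for this illustration; they are not the kernel definitions.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Value form: takes the big-endian value itself. */
static uint32_t my_be32_to_cpu(uint32_t be_val)
{
	return ntohl(be_val);
}

/* Pointer form: takes the address and dereferences it first. */
static uint32_t my_be32_to_cpup(const uint32_t *be_ptr)
{
	return ntohl(*be_ptr);
}

/* Usage: both forms agree when each is given the argument it expects. */
static void byteorder_example(void)
{
	uint32_t val = htonl(0x01020304u);

	assert(my_be32_to_cpu(val) == 0x01020304u);
	assert(my_be32_to_cpup(&val) == my_be32_to_cpu(val));
}
```

Passing &val to the value form is the mistake: the address itself would
be converted to an integer and swapped, which is what the "cast from
pointer to integer" warning was reporting.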

 - R.


From tduffy at sun.com  Wed Apr 20 13:48:01 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 20 Apr 2005 13:48:01 -0700
Subject: [openib-general] [PATCH][CORE] use be32_to_cpu instead of
	be32_to_cpup
In-Reply-To: <52pswp192l.fsf@topspin.com>
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>
	<1114027955.11751.4.camel@duffman> <52u0m119jt.fsf@topspin.com>
	<1114028946.11751.7.camel@duffman>  <52pswp192l.fsf@topspin.com>
Message-ID: <1114030081.11751.22.camel@duffman>

On Wed, 2005-04-20 at 13:34 -0700, Roland Dreier wrote:
> Those are doing be32_to_cpup() -- notice the "p" at the end, which
> means it dereferences a pointer.  be32_to_cpup(&val) is rather
> obfuscated but still technically correct.  It's sort of like writing
> "*&val" instead of just "val."
> 
> In other words, it's probably worth cleaning up those other places,
> but they're not actually bugs.

Ah, didn't see that.  Good eye.  I propose the following patch then.

Pointed-out-by: Roland Dreier <roland at topspin.com>
Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.11-openib/drivers/infiniband/core/agent.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/core/agent.c	(revision 2198)
+++ linux-2.6.11-openib/drivers/infiniband/core/agent.c	(working copy)
@@ -155,10 +155,10 @@ static int agent_mad_send(struct ib_mad_
 			/* Should sgid be looked up ? */
 			ah_attr.grh.sgid_index = 0;
 			ah_attr.grh.hop_limit = grh->hop_limit;
-			ah_attr.grh.flow_label = be32_to_cpup(
-				&grh->version_tclass_flow)  & 0xfffff;
-			ah_attr.grh.traffic_class = (be32_to_cpup(
-				&grh->version_tclass_flow) >> 20) & 0xff;
+			ah_attr.grh.flow_label = be32_to_cpu(
+				grh->version_tclass_flow)  & 0xfffff;
+			ah_attr.grh.traffic_class = (be32_to_cpu(
+				grh->version_tclass_flow) >> 20) & 0xff;
 			memcpy(ah_attr.grh.dgid.raw,
 			       grh->sgid.raw,
 			       sizeof(ah_attr.grh.dgid));
Index: linux-2.6.11-openib/drivers/infiniband/core/verbs.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/core/verbs.c	(revision 2198)
+++ linux-2.6.11-openib/drivers/infiniband/core/verbs.c	(working copy)
@@ -112,7 +112,7 @@ struct ib_ah *ib_create_ah_from_wc(struc
 			return ERR_PTR(ret);
 
 		ah_attr.grh.sgid_index = (u8) gid_index;
-		flow_class = be32_to_cpu(&grh->version_tclass_flow);
+		flow_class = be32_to_cpu(grh->version_tclass_flow);
 		ah_attr.grh.flow_label = flow_class & 0xFFFFF;
 		ah_attr.grh.traffic_class = (flow_class >> 20) & 0xFF;
 		ah_attr.grh.hop_limit = grh->hop_limit;
Index: linux-2.6.11-openib/drivers/infiniband/core/ping.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/core/ping.c	(revision 2198)
+++ linux-2.6.11-openib/drivers/infiniband/core/ping.c	(working copy)
@@ -133,10 +133,10 @@ static int ping_mad_send(struct ib_mad_a
 			/* Should sgid be looked up ? */
 			ah_attr.grh.sgid_index = 0;
 			ah_attr.grh.hop_limit = grh->hop_limit;
-			ah_attr.grh.flow_label = be32_to_cpup(
-				&grh->version_tclass_flow)  & 0xfffff;
-			ah_attr.grh.traffic_class = (be32_to_cpup(
-				&grh->version_tclass_flow) >> 20) & 0xff;
+			ah_attr.grh.flow_label = be32_to_cpu(
+				grh->version_tclass_flow)  & 0xfffff;
+			ah_attr.grh.traffic_class = (be32_to_cpu(
+				grh->version_tclass_flow) >> 20) & 0xff;
 			memcpy(ah_attr.grh.dgid.raw,
 			       grh->sgid.raw,
 			       sizeof(ah_attr.grh.dgid));


From mshefty at ichips.intel.com  Wed Apr 20 14:19:20 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 20 Apr 2005 14:19:20 -0700
Subject: [openib-general] [PATCH][CORE] use be32_to_cpu instead
	of	be32_to_cpup
In-Reply-To: <1114030081.11751.22.camel@duffman>
References: <20050420173553.D22902283D6@openib.ca.sandia.gov>	<1114027955.11751.4.camel@duffman>
	<52u0m119jt.fsf@topspin.com>	<1114028946.11751.7.camel@duffman>
	<52pswp192l.fsf@topspin.com> <1114030081.11751.22.camel@duffman>
Message-ID: <4266C758.1080401@ichips.intel.com>

Tom Duffy wrote:

> Ah, didn't see that.  Good eye.  I propose the following patch then.
> 
> Pointed-out-by: Roland Dreier <roland at topspin.com>
> Signed-off-by: Tom Duffy <tduffy at sun.com>
> 

Thanks for catching this.  I'll go ahead and take the patch and apply it.

My plan is to eventually go through the code and identify areas where 
the newer call can be made in place of the existing code.  I had 
already identified agent.c, but it looks like ping.c is a candidate as 
well.

- Sean


From robert.j.woodruff at intel.com  Wed Apr 20 14:29:22 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 20 Apr 2005 14:29:22 -0700
Subject: [openib-general] User-verbs Broken in SVN 2194 ?
Message-ID: <1AC79F16F5C5284499BB9591B33D6F00042A40A1@orsmsx408>

I tried to build and run the usermode verbs from the SVN 2194
I checked out this morning.
It appears to build ok. 

When I run the example ibv_devices it shows

device              Node GUID
------------------------------------
mthca0              0002c900011da040

But when I run ibv_pingpong I get

Couldn't get context for mthca0

Any ideas  ?

woody


From robert.j.woodruff at intel.com  Wed Apr 20 14:51:51 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 20 Apr 2005 14:51:51 -0700
Subject: [openib-general] User-verbs Broken in SVN 2194 ?
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00042A40A1@orsmsx408>
Message-ID: <ORSMSX408FRaqbC8wSA00000010@orsmsx408.amr.corp.intel.com>


> Couldn't get context for mthca0

> Any ideas  ?

> woody

Never mind. I needed to make the /dev nodes.  


From mshefty at ichips.intel.com  Wed Apr 20 16:27:26 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 20 Apr 2005 16:27:26 -0700
Subject: [openib-general] [PATCH] fix bug matching responses with non-DATA
	RMPP MADs
Message-ID: <20050420162726.10e35922.mshefty@ichips.intel.com>

The following patch fixes an issue where a response MAD could have been
incorrectly matched with an internally generated RMPP ACK.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>

Index: core/mad.c
===================================================================
--- core/mad.c	(revision 2202)
+++ core/mad.c	(working copy)
@@ -1550,6 +1550,18 @@
 		return mad_recv_wc;
 }
 
+static int is_data_mad(struct ib_mad_agent_private *mad_agent_priv,
+		       struct ib_mad_hdr *mad_hdr)
+{
+	struct ib_rmpp_mad *rmpp_mad;
+
+	rmpp_mad = (struct ib_rmpp_mad *)mad_hdr;
+	return !mad_agent_priv->agent.rmpp_version ||
+		!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) &
+				    IB_MGMT_RMPP_FLAG_ACTIVE) ||
+		(rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_DATA);
+}
+
 static struct ib_mad_send_wr_private*
 find_send_req(struct ib_mad_agent_private *mad_agent_priv,
 	      u64 tid)
@@ -1568,7 +1580,9 @@
 	 */
 	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
 			    agent_list) {
-		if (mad_send_wr->tid == tid && mad_send_wr->timeout) {
+		if (is_data_mad(mad_agent_priv,
+				mad_send_wr->send_wr.wr.ud.mad_hdr) &&
+		    mad_send_wr->tid == tid && mad_send_wr->timeout) {
 			/* Verify request has not been canceled */
 			return (mad_send_wr->status == IB_WC_SUCCESS) ?
 				mad_send_wr : NULL;
@@ -2055,7 +2069,9 @@
 
 	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
 			    agent_list) {
-		if (mad_send_wr->wr_id == wr_id)
+		if (is_data_mad(mad_agent_priv,
+				mad_send_wr->send_wr.wr.ud.mad_hdr) &&
+		    mad_send_wr->wr_id == wr_id)
 			return mad_send_wr;
 	}
 	return NULL;
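
A simplified, self-contained sketch of the matching rule in this patch,
assuming a flat array instead of the agent's send list.  The struct
send_entry, the helper names, and the numeric values of the RMPP
constants are illustrative stand-ins, not the definitions from the MAD
headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; the real ones live in the MAD headers. */
#define RMPP_FLAG_ACTIVE 0x01
#define RMPP_TYPE_DATA   1
#define RMPP_TYPE_ACK    2

/* Hypothetical flattened view of one queued send. */
struct send_entry {
	uint64_t tid;
	uint8_t rmpp_flags;
	uint8_t rmpp_type;
	int timeout;		/* nonzero while a response is awaited */
};

/*
 * A send counts as a data MAD if the agent does not use RMPP, if RMPP
 * is not active on this MAD, or if it is an RMPP DATA segment.
 */
static int is_data_entry(uint8_t agent_rmpp_version,
			 const struct send_entry *e)
{
	return !agent_rmpp_version ||
	       !(e->rmpp_flags & RMPP_FLAG_ACTIVE) ||
	       e->rmpp_type == RMPP_TYPE_DATA;
}

/* Match a response TID, skipping internally generated non-DATA MADs. */
static const struct send_entry *
find_by_tid(uint8_t agent_rmpp_version, const struct send_entry *list,
	    size_t n, uint64_t tid)
{
	for (size_t i = 0; i < n; i++) {
		if (is_data_entry(agent_rmpp_version, &list[i]) &&
		    list[i].tid == tid && list[i].timeout)
			return &list[i];
	}
	return NULL;
}
```

With an ACK and a request sharing one TID, the lookup skips the ACK
and returns the request, which is the mismatch the patch describes.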


From tduffy at sun.com  Wed Apr 20 17:37:11 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 20 Apr 2005 17:37:11 -0700
Subject: [PATCH][MTHCA] fix sparc build WAS: Re: [openib-general]
	[PATCH][RFC][3/4] IB: userspace verbs mthca changes
In-Reply-To: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
References: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
Message-ID: <1114043831.18198.17.camel@duffman>

On Mon, 2005-04-04 at 15:09 -0700, Roland Dreier wrote:
> @@ -574,6 +836,22 @@
>         return 0;
>  }
>  
> +static int mthca_mmap_uar(struct ib_ucontext *context,
> +                         struct vm_area_struct *vma)
> +{
> +       if (vma->vm_end - vma->vm_start != PAGE_SIZE)
> +               return -EINVAL;
> +
> +       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +       if (remap_pfn_range(vma, vma->vm_start,
> +                           to_mucontext(context)->uar.pfn,
> +                           PAGE_SIZE, vma->vm_page_prot))
> +               return -EAGAIN;
> +
> +       return 0;
> +}
> +

This breaks building on sparc64:

  CC [M]  drivers/infiniband/hw/mthca/mthca_provider.o
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c: In function `mthca_mmap_uar':
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c:352: warning: implicit declaration of function `pgprot_noncached'
/build1/tduffy/openib-work/linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c:352: error: incompatible types in assignment
make[3]: *** [drivers/infiniband/hw/mthca/mthca_provider.o] Error 1
make[2]: *** [drivers/infiniband/hw/mthca] Error 2
make[1]: *** [_module_drivers/infiniband] Error 2
make: *** [_all] Error 2

This is ugly, but fixes the build.  Perhaps sparc needs
pgprot_noncached() to be a noop?

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c	(revision 2202)
+++ linux-2.6.11-openib/drivers/infiniband/hw/mthca/mthca_provider.c	(working copy)
@@ -349,7 +349,9 @@ static int mthca_mmap_uar(struct ib_ucon
 	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
 		return -EINVAL;
 
+#ifdef pgprot_noncached
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+#endif
 
 	if (remap_pfn_range(vma, vma->vm_start,
 			    to_mucontext(context)->uar.pfn,


From davem at davemloft.net  Wed Apr 20 17:38:20 2005
From: davem at davemloft.net (David S. Miller)
Date: Wed, 20 Apr 2005 17:38:20 -0700
Subject: [PATCH][MTHCA] fix sparc build WAS: Re: [openib-general]
	[PATCH][RFC][3/4] IB: userspace verbs mthca changes
In-Reply-To: <1114043831.18198.17.camel@duffman>
References: <200544159.AzH1nqpM3uTQZaKG@topspin.com>
	<1114043831.18198.17.camel@duffman>
Message-ID: <20050420173820.24c512ae.davem@davemloft.net>

On Wed, 20 Apr 2005 17:37:11 -0700
Tom Duffy <tduffy at sun.com> wrote:

> This breaks building on sparc64:
 ...
> This is ugly, but fixes the build.  Perhaps sparc needs
> pgprot_noncached() to be a noop?

No, it should actually do something, like so:

include/asm-sparc64/pgtable.h: af9bf175a223cf44310293287d50302e0fd3f9e9
--- a/include/asm-sparc64/pgtable.h
+++ b/include/asm-sparc64/pgtable.h
@@ -416,6 +416,11 @@ extern int io_remap_pfn_range(struct vm_
 			       unsigned long pfn,
 			       unsigned long size, pgprot_t prot);
 
+/* Clear virtual and physical cachability, set side-effect bit.  */
+#define pgprot_noncached(prot) \
+	(__pgprot((pgprot_val(prot) & ~(_PAGE_CP | _PAGE_CV)) | \
+	 _PAGE_E))
+
 /*
  * For sparc32&64, the pfn in io_remap_pfn_range() carries <iospace> in
  * its high 4 bits.  These macros/functions put it there or get it from there.
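
A minimal sketch of the same clear-and-set bit operation outside the
kernel, assuming made-up bit positions.  The PROT_BIT_* values and
prot_t below are hypothetical and do not match the real sparc64 page
table bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define PROT_BIT_CP 0x10u	/* cacheable in the physical cache */
#define PROT_BIT_CV 0x08u	/* cacheable in the virtual cache */
#define PROT_BIT_E  0x04u	/* side-effect bit */

typedef struct { uint64_t bits; } prot_t;

/* Clear both cacheability bits and set the side-effect bit. */
static prot_t prot_noncached(prot_t prot)
{
	prot_t out;

	out.bits = (prot.bits & ~(uint64_t)(PROT_BIT_CP | PROT_BIT_CV)) |
		   PROT_BIT_E;
	return out;
}

/* Usage: unrelated bits (0x01 here) pass through untouched. */
static void noncached_example(void)
{
	prot_t in = { PROT_BIT_CP | PROT_BIT_CV | 0x01u };
	prot_t out = prot_noncached(in);

	assert(out.bits == (PROT_BIT_E | 0x01u));
}
```

Defining the operation per architecture, as above, lets the driver call
it unconditionally instead of guarding the call with #ifdef.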


From hozer at hozed.org  Wed Apr 20 19:17:13 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Wed, 20 Apr 2005 21:17:13 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050418200711.GI15688@aon.at>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<4263DF70.2060702@ammasso.com>
	<1113853240.6274.99.camel@laptopd505.fenrus.org>
	<20050418200711.GI15688@aon.at>
Message-ID: <20050421021713.GP999@kalmia.hozed.org>

On Mon, Apr 18, 2005 at 10:07:12PM +0200, Bernhard Fischer wrote:
> On Mon, Apr 18, 2005 at 09:40:40PM +0200, Arjan van de Ven wrote:
> >On Mon, 2005-04-18 at 11:25 -0500, Timur Tabi wrote:
> >> Arjan van de Ven wrote:
> >> 
> >> > this is a myth; linux is free to move the page about in physical memory
> >> > even if it's mlock()ed!!
> darn, yes, this is true.
> I know people who introduced
> #define VM_RESERVED     0x00080000      /* Don't unmap it from swap_out
> */

Someone (aka Topspin, Infinicon, and Ammasso) should probably post a patch
adding '#define VM_REGISTERED 0x01000000', and some extensions to
something like 'madvise' to set pages to be registered.

My preference is that said patch would also allow a way for the kernel to
reclaim registered memory from an application under memory pressure.


From timur.tabi at ammasso.com  Wed Apr 20 20:07:45 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Wed, 20 Apr 2005 22:07:45 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050421021713.GP999@kalmia.hozed.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<4263DF70.2060702@ammasso.com>
	<1113853240.6274.99.camel@laptopd505.fenrus.org>
	<20050418200711.GI15688@aon.at>
	<20050421021713.GP999@kalmia.hozed.org>
Message-ID: <42671901.4000805@ammasso.com>

Troy Benjegerdes wrote:

> Someone (aka Topspin, Infinicon, and Ammasso) should probably post a patch
> adding '#define VM_REGISTERED 0x01000000', and some extensions to
> something like 'madvise' to set pages to be registered.
> 
> My preference is said patch will also allow a way for the kernel to
> reclaim registered memory from an application under memory pressure.

I don't know if VM_REGISTERED is a good idea or not, but it should be absolutely 
impossible for the kernel to reclaim "registered" (aka pinned) memory, no matter what. 
For RDMA services (such as Infiniband, iWARP, etc), it's normal for non-root processes to 
pin hundreds of megabytes of memory, and that memory better be locked to those physical 
pages until the application deregisters them.

If the kernel really thinks it needs to unpin those pages, then at the very least it should 
kill the process, and the syslog better have a very clear message indicating why.  That 
way, the application doesn't continue thinking that everything's still going to work.  If 
those pages become unpinned, the applications are going to experience serious data corruption.


From tduffy at sun.com  Thu Apr 21 07:57:20 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 07:57:20 -0700
Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev, 1.9,
	1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83]
Message-ID: <1114095440.17167.1.camel@duffman>

Looks like Fedora Core 4 will support IB out of the box.  They have the
kernel modules, udev support, glibc-headers.  Is there anything else
that would be nice to get right before 4 ships?

-tduffy
-------------- next part --------------
An embedded message was scrubbed...
From: fedora-cvs-commits at redhat.com
Subject: rpms/udev/devel pam_console.dev, 1.9, 1.10 udev.rules, 1.23, 1.24	udev.spec, 1.82, 1.83
Date: Thu, 21 Apr 2005 09:11:36 -0400
Size: 7538
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050421/5a41cc39/attachment.mht>

From robert.j.woodruff at intel.com  Thu Apr 21 08:32:00 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 21 Apr 2005 08:32:00 -0700
Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev, 1.9,
	1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83]
In-Reply-To: <1114095440.17167.1.camel@duffman>
Message-ID: <ORSMSX408TRqf69aZbL00000011@orsmsx408.amr.corp.intel.com>


>Looks like Fedora Core 4 will support IB out of the box.  They have the
>kernel modules, udev support, glibc-headers.  Is there anything else
>that would be nice to get right before 4 ships?

>-tduffy

Cool.

Do you know what the time frame is ? i.e., how long till it ships ?

It would be nice to get the user-mode verbs support in, but 
I think that it may need some more testing and I don't think
the user-mode kernel module has been submitted upstream yet. 

Roland, what were your thoughts on when we would be ready to submit
the user mode support upstream ?

woody


From mshefty at ichips.intel.com  Thu Apr 21 09:48:17 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 21 Apr 2005 09:48:17 -0700
Subject: [openib-general] MAD/RMPP test program
Message-ID: <4267D951.7030606@ichips.intel.com>

For those interested (likely a few developers only), I've checked in a 
kernel test program that I used to stress the MAD/RMPP code.

gen2/utils/src/linux-kernel/infiniband/util/grmpp

- Sean


From mshefty at ichips.intel.com  Thu Apr 21 10:31:06 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 21 Apr 2005 10:31:06 -0700
Subject: [openib-general] [PATCH] [MAD] fix race completing request MAD with
	timeout/cancel
Message-ID: <20050421103106.30eb3df4.mshefty@ichips.intel.com>

This patch should fix an issue processing a sent MAD after it has timed
out or been canceled.  The race occurs when a response MAD matches with
the sent request.  The request could time out or be canceled after the
response MAD matches with the request, but before the request completion
can be processed.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


Index: core/mad.c
===================================================================
--- core/mad.c	(revision 2203)
+++ core/mad.c	(working copy)
@@ -342,6 +342,7 @@
 	spin_lock_init(&mad_agent_priv->lock);
 	INIT_LIST_HEAD(&mad_agent_priv->send_list);
 	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
+	INIT_LIST_HEAD(&mad_agent_priv->done_list);
 	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
 	INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv);
 	INIT_LIST_HEAD(&mad_agent_priv->local_list);
@@ -1591,6 +1592,16 @@
 	return NULL;
 }
 
+static void ib_mark_req_done(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	mad_send_wr->timeout = 0;
+	if (mad_send_wr->refcount == 1) {
+		list_del(&mad_send_wr->agent_list);
+		list_add_tail(&mad_send_wr->agent_list,
+			      &mad_send_wr->mad_agent_priv->done_list);
+	}
+}
+
 static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 				 struct ib_mad_recv_wc *mad_recv_wc)
 {
@@ -1619,8 +1630,7 @@
 				wake_up(&mad_agent_priv->wait);
 			return;
 		}
-		/* Timeout = 0 means that we won't wait for a response */
-		mad_send_wr->timeout = 0;
+		ib_mark_req_done(mad_send_wr);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 		/* Defined behavior is to complete response before request */
Index: core/mad_priv.h
===================================================================
--- core/mad_priv.h	(revision 2202)
+++ core/mad_priv.h	(working copy)
@@ -92,6 +92,7 @@
 	spinlock_t lock;
 	struct list_head send_list;
 	struct list_head wait_list;
+	struct list_head done_list;
 	struct work_struct timed_work;
 	unsigned long timeout;
 	struct list_head local_list;
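
A simplified sketch of the bookkeeping in ib_mark_req_done(), assuming
an enum tag in place of the kernel's linked lists.  The struct request
and the reading that the timeout and cancel paths only look at the
send and wait lists are assumptions made for this illustration; the
patch above does not show those paths.

```c
#include <assert.h>

/* Which list a request currently sits on (stand-in for list_head). */
enum req_list { ON_SEND_LIST, ON_WAIT_LIST, ON_DONE_LIST };

/* Hypothetical reduced view of a sent request. */
struct request {
	int refcount;
	unsigned long timeout;
	enum req_list where;
};

/*
 * A response matched this request: stop waiting for one, and if only
 * one reference remains, park the request on the done list.
 */
static void mark_req_done(struct request *req)
{
	req->timeout = 0;
	if (req->refcount == 1)
		req->where = ON_DONE_LIST;
}

/* Assumed behaviour: timeout/cancel scans skip the done list. */
static int visible_to_timeout_scan(const struct request *req)
{
	return req->where != ON_DONE_LIST;
}
```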


From adi at hexapodia.org  Thu Apr 21 10:38:21 2005
From: adi at hexapodia.org (Andy Isaacson)
Date: Thu, 21 Apr 2005 10:38:21 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <42671901.4000805@ammasso.com>
Message-ID: <20050421173821.GA13312@hexapodia.org>

On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote:
> Troy Benjegerdes wrote:
> >Someone (aka Topspin, Infinicon, and Ammasso) should probably post a patch
> >adding '#define VM_REGISTERED 0x01000000', and some extensions to
> >something like 'madvise' to set pages to be registered.
> >
> >My preference is said patch will also allow a way for the kernel to
> >reclaim registered memory from an application under memory pressure.
> 
> I don't know if VM_REGISTERED is a good idea or not, but it should be 
> absolutely impossible for the kernel to reclaim "registered" (aka pinned) 
> memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc), 
> it's normal for non-root processes to pin hundreds of megabytes of memory, 
> and that memory better be locked to those physical pages until the 
> application deregisters them.

If you take the hardline position that "the app is the only thing that
matters", your code is unlikely to get merged.  Linux is a
general-purpose OS.

I don't think that Troy was suggesting the kernel should deregister
memory without notifying the application.  Personally, I envision
something like the NetBSD Scheduler Activations (SA) work, where the
kernel can notify the app of changes to its state in a very efficient
manner.  (According to the NetBSD design whitepaper, the kernel does an
upcall whenever the multithreaded app gains or loses a CPU!)

In a Linux context, I doubt that fullblown SA is necessary or
appropriate.  Rather, I'd suggest two new signals, SIGMEMLOW and
SIGMEMCRIT.  The userland comms library registers handlers for both.
When the kernel decides that it needs to reclaim some memory from the
app, it sends SIGMEMLOW.  The comms library then has the responsibility
to un-reserve some memory in an orderly fashion.  If a reasonable [1]
time has expired since SIGMEMLOW and the kernel is still hungry, the
kernel sends SIGMEMCRIT.  At this point, the comms lib *must* unregister
some memory [2] even if it has to drop state to do so; if it returns
from the signal handler without having unregistered the memory, the
kernel will SIGKILL.

[1] Part of the interface spec should cover the expectation as to how
    long the library is allowed to take; I'd guess that 2 timeslices
    should suffice.
[2] Is there a way for the kernel to pass down to userspace how many
    pages it wants, maybe in the sigcontext?
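
A minimal sketch of how a comms library might consume such signals,
assuming SIGUSR1 and SIGUSR2 as stand-ins because SIGMEMLOW and
SIGMEMCRIT do not exist.  The region counter and the policy of
releasing half on the first signal are invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Stand-ins: the proposed signals are hypothetical. */
#define SIGMEMLOW  SIGUSR1
#define SIGMEMCRIT SIGUSR2

static volatile sig_atomic_t mem_low_pending;
static volatile sig_atomic_t mem_crit_pending;

/* Handlers only set flags; that is all that is async-signal-safe. */
static void mem_pressure_handler(int signo)
{
	if (signo == SIGMEMLOW)
		mem_low_pending = 1;
	else if (signo == SIGMEMCRIT)
		mem_crit_pending = 1;
}

static int install_mem_pressure_handlers(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof sa);
	sa.sa_handler = mem_pressure_handler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGMEMLOW, &sa, NULL) < 0)
		return -1;
	return sigaction(SIGMEMCRIT, &sa, NULL);
}

/*
 * Called from the library's main loop.  On LOW, give back half of the
 * registered regions; on CRIT, give back all of them.  Returns the
 * number of regions released.
 */
static int service_mem_pressure(int *registered_regions)
{
	int released = 0;

	assert(*registered_regions >= 0);
	if (mem_crit_pending) {
		mem_crit_pending = 0;
		mem_low_pending = 0;
		released = *registered_regions;
	} else if (mem_low_pending) {
		mem_low_pending = 0;
		released = *registered_regions / 2;
	}
	*registered_regions -= released;
	return released;
}
```

Note one difference from the proposal: this sketch defers the work to
the main loop, whereas the SIGMEMCRIT rule above would require the
memory to be unregistered before the handler returns, so deregistration
itself would have to be safe to call from signal context.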

> If the kernel really thinks it needs to unpin those pages, then at the very 
> least it should kill the process, and the syslog better have a very clear 
> message indicating why.  That way, the application doesn't continue 
> thinking that everything's still going to work.  If those pages become 
> unpinned, the applications are going to experience serious data corruption.

You might want to consider what happens with your communication system
in a machine running power-saving modes (in the limit, suspend-to-disk).
Of course most machines with Infiniband adapters aren't running swsusp,
but it's not inconceivable that blade servers might sleep to lower power
and cooling costs.

-andy


From roland at topspin.com  Thu Apr 21 08:37:33 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 21 Apr 2005 08:37:33 -0700
Subject: [openib-general] [Fwd: rpms/udev/devel pam_console.dev,
	1.9,1.10 udev.rules, 1.23, 1.24 udev.spec, 1.82, 1.83]
In-Reply-To: <ORSMSX408TRqf69aZbL00000011@orsmsx408.amr.corp.intel.com> (Bob
	Woodruff's message of "Thu, 21 Apr 2005 08:32:00 -0700")
References: <ORSMSX408TRqf69aZbL00000011@orsmsx408.amr.corp.intel.com>
Message-ID: <52k6mw16pu.fsf@topspin.com>

    Bob> Do you know what the time frame is ? i.e., how long till it ships ?

http://fedora.redhat.com/participate/schedule/

    Bob> Roland, what were your thoughts on when we would be ready to
    Bob> submit the user mode support upstream ?

I hope to be able to send the patches upstream soon after 2.6.12 is
released.  2.6.12-rc3 just came out so I would hope that the final
release will be in about a month or so.

 - R.


From timur.tabi at ammasso.com  Thu Apr 21 11:39:35 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Thu, 21 Apr 2005 13:39:35 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050421173821.GA13312@hexapodia.org>
References: <20050421173821.GA13312@hexapodia.org>
Message-ID: <4267F367.3090508@ammasso.com>

Andy Isaacson wrote:

> If you take the hardline position that "the app is the only thing that
> matters", your code is unlikely to get merged.  Linux is a
> general-purpose OS.

The problem is that our driver and library implement an API that we don't fully control. 
The API states that the application allocates the memory and tells the library to register 
it.  The app then goes on its merry way until it's done, at which point it tells the 
library to deregister the memory.  Neither the app nor the API has any provision for the 
app to be notified that the memory is no longer pinned and therefore can't be trusted. 
That would be considered a critical failure from the app's perspective, so the kernel 
would be doing it a favor by killing the process.

> You might want to consider what happens with your communication system
> in a machine running power-saving modes (in the limit, suspend-to-disk).
> Of course most machines with Infiniband adapters aren't running swsusp,
> but it's not inconceivable that blade servers might sleep to lower power
> and cooling costs.

Any application that registers memory will, in all likelihood, be running at 100% CPU 
non-stop.  The computer is not going to be doing anything else but whatever that app is 
trying to do.  The application could conceivably register gigabytes of RAM, and if even a 
single page becomes unpinned, the whole thing is worthless.  The application cannot do 
anything meaningful if it gets a message saying that some of the memory has become 
unpinned and should not be used.

So the real question is: how important is it to the kernel developers that Linux support 
these kinds of enterprise-class applications?

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From adi at hexapodia.org  Thu Apr 21 12:56:41 2005
From: adi at hexapodia.org (Andy Isaacson)
Date: Thu, 21 Apr 2005 12:56:41 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4267F367.3090508@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
Message-ID: <20050421195641.GB13312@hexapodia.org>

On Thu, Apr 21, 2005 at 01:39:35PM -0500, Timur Tabi wrote:
> Andy Isaacson wrote:
> >If you take the hardline position that "the app is the only thing that
> >matters", your code is unlikely to get merged.  Linux is a
> >general-purpose OS.
> 
> The problem is that our driver and library implement an API that we don't 
> fully control. The API states that the application allocates the memory and 
> tells the library to register it.  The app then goes on its merry way until 
> it's done, at which point it tells the library to deregister the memory.  
> Neither the app nor the API has any provision for the app to be notified 
> that the memory is no longer pinned and therefore can't be trusted. That 
> would be considered a critical failure from the app's perspective, so the 
> kernel would be doing it a favor by killing the process.

I'm familiar with MPI 1.0 and 2.0, but I haven't been following the
development of modern messaging APIs, so I might not make sense here...

Assuming that the app calls into the library on a fairly regular basis,
you could implement a fast-path/slow-path scheme where the library
normally operates in go-fast mode, but if an "unregister" event has
occurred, the library falls back to a less performant mode.
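
A rough sketch of that fast-path/slow-path idea, assuming the library
is entered on every send and that some notification (not specified in
this thread) flips a per-region flag.  All names here are hypothetical,
and the "zero copy" branch only reports which path was chosen; a real
library would post the buffer to the adapter at that point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum send_path { SEND_ZERO_COPY, SEND_BOUNCE_COPY };

/* Hypothetical per-region state kept by the comms library. */
struct region {
	atomic_bool registered;	/* cleared by the "unregister" event */
	char bounce[256];	/* slow-path staging buffer */
};

void region_init(struct region *r)
{
	atomic_init(&r->registered, true);
	memset(r->bounce, 0, sizeof(r->bounce));
}

/* Would be called when the kernel reports that pinning was lost. */
void region_mark_unregistered(struct region *r)
{
	atomic_store(&r->registered, false);
}

/* Pick the fast path while registered, else copy through the bounce buffer. */
enum send_path region_send(struct region *r, const void *buf, size_t len)
{
	assert(r != NULL && buf != NULL && len <= sizeof(r->bounce));
	if (atomic_load(&r->registered))
		return SEND_ZERO_COPY;
	memcpy(r->bounce, buf, len);
	return SEND_BOUNCE_COPY;
}
```

The flag check costs one atomic load per call, which is why the scheme
only helps if the app really does call into the library regularly.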

But now having written that I'm thinking that it's not worth the bother
- if you've got a 512P MPP job, it's basically equivalent to job death
for one of the nodes to go away in this manner -- even if the process is
still running on the node, the fact that you took a giant performance
hiccup is unacceptable.  Therefore, cluster admins are going to do their
darndest to avoid this behavior, so we might as well just kill the job
and make it explicit.

> >You might want to consider what happens with your communication system
> >in a machine running power-saving modes (in the limit, suspend-to-disk).
> >Of course most machines with Infiniband adapters aren't running swsusp,
> >but it's not inconceivable that blade servers might sleep to lower power
> >and cooling costs.
> 
> Any application that registers memory will, in all likelihood, be running at 
> 100% CPU non-stop.  The computer is not going to be doing anything else but 
> whatever that app is trying to do.  The application could conceivably 
> register gigabytes of RAM, and if even a single page becomes unpinned, the 
> whole thing is worthless.  The application cannot do anything meaningful if 
> it gets a message saying that some of the memory has become unpinned and 
> should not be used.
> 
> So the real question is: how important is it to the kernel developers that 
> Linux support these kinds of enterprise-class applications?

While I understand your arguments, this kind of rhetoric is more likely
to harden ears than to convince people you're right.  I refer you to the
"Live Patching Function" thread.

*You* need to come up with a solution that looks good to *the community*
if you want it merged.  In the long run, this process is likely to
result in *your* systems working better than if you had just gone off
and done your thing.  If you have to do something that "tastes bad" to
the average l-k hacker, *justify* it by addressing the alternatives and
explaining why your solution is the right one.

I'm leaning towards agreeing that mlock()-alicious code is the right way
to solve this problem, and it's not clear to me what the benefit of
adding a new VM_REGISTERED flag would be.

Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their
pages?  Or are you doing something more nasty?

(Oh, I see that Libor has contributed to the other branch of this
thread... off to read...)

-andy


From timur.tabi at ammasso.com  Thu Apr 21 13:07:42 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Thu, 21 Apr 2005 15:07:42 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050421195641.GB13312@hexapodia.org>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
	<20050421195641.GB13312@hexapodia.org>
Message-ID: <4268080E.3000303@ammasso.com>

Andy Isaacson wrote:

> I'm familiar with MPI 1.0 and 2.0, but I haven't been following the
> development of modern messaging APIs, so I might not make sense here...
> 
> Assuming that the app calls into the library on a fairly regular basis,

Not really.  The whole point is to have the adapter DMA the data directly from memory to 
the network.  That's why it's called RDMA - remote DMA.

> Therefore, cluster admins are going to do their
> darndest to avoid this behavior, so we might as well just kill the job
> and make it explicit.

Yes, and if it turns out that the same MPI application dies on Linux but not on Solaris 
because Linux doesn't really care if the memory stays pinned, then we're going to see a 
lot of MPI customers transitioning away from Linux.

> *You* need to come up with a solution that looks good to *the community*
> if you want it merged.  

True, but I'm not going to waste my time adding this support if the consensus I get from 
the kernel developers is that they don't want Linux to behave this way.

> Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their
> pages?  Or are you doing something more nasty?

A little more nasty.  I raise RLIMIT_MEMLOCK in the driver to "unlimited" and also set 
cap_raise(IPC_LOCK).  I do this because I need to support all 2.4 and 2.6 kernel versions 
with the same driver, but only 2.6.10 and later have any support for non-root mlock().

If and when our driver is submitted to the official kernel, that nastiness will be removed 
of course.
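
For contrast, a sketch of the less nasty userspace route Andy asked
about: check RLIMIT_MEMLOCK, raise the soft limit to the hard limit if
needed (which needs no privilege), then mlock() the buffer.  This is
not the Ammasso driver's method, which raises the limit from inside the
kernel; the helper names are hypothetical, and the sketch assumes a
kernel that permits unprivileged mlock() within the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* 1 if a request of len bytes fits under the given soft limit, else 0. */
int fits_memlock_limit(rlim_t soft_limit, size_t len)
{
	return soft_limit == RLIM_INFINITY || len <= soft_limit;
}

/*
 * Lock buf in RAM.  Returns 0 on success, -1 on failure.  The limit
 * applies to the process's total locked memory, so passing this check
 * does not guarantee that mlock() will succeed.
 */
int lock_buffer(void *buf, size_t len)
{
	struct rlimit lim;

	assert(buf != NULL && len > 0);
	if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
		return -1;
	if (!fits_memlock_limit(lim.rlim_cur, len)) {
		/* Raising the soft limit up to the hard limit is always allowed. */
		lim.rlim_cur = lim.rlim_max;
		if (setrlimit(RLIMIT_MEMLOCK, &lim) != 0)
			return -1;
	}
	return mlock(buf, len);
}
```

If the hard limit itself is too small, mlock() fails and the caller
sees -1; raising the hard limit is an administrator decision.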

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From chrisw at osdl.org  Thu Apr 21 13:12:27 2005
From: chrisw at osdl.org (Chris Wright)
Date: Thu, 21 Apr 2005 13:12:27 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4268080E.3000303@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
	<20050421195641.GB13312@hexapodia.org>
	<4268080E.3000303@ammasso.com>
Message-ID: <20050421201227.GI23013@shell0.pdx.osdl.net>

* Timur Tabi (timur.tabi at ammasso.com) wrote:
> Andy Isaacson wrote:
> >Do you guys simply raise RLIMIT_MEMLOCK to allow apps to lock their
> >pages?  Or are you doing something more nasty?
> 
> A little more nasty.  I raise RLIMIT_MEMLOCK in the driver to "unlimited" 
> and also set cap_raise(IPC_LOCK).  I do this because I need to support all 
> 2.4 and 2.6 kernel versions with the same driver, but only 2.6.10 and later 
> have any support for non-root mlock().

FYI, that will not work on all 2.6 kernels.  Specifically anything that's
not using capabilities.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net


From timur.tabi at ammasso.com  Thu Apr 21 13:14:59 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Thu, 21 Apr 2005 15:14:59 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050421201227.GI23013@shell0.pdx.osdl.net>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
	<20050421195641.GB13312@hexapodia.org>
	<4268080E.3000303@ammasso.com>
	<20050421201227.GI23013@shell0.pdx.osdl.net>
Message-ID: <426809C3.7010101@ammasso.com>

Chris Wright wrote:

> FYI, that will not work on all 2.6 kernels.  Specifically anything that's
> not using capabilities.

It works with every kernel I've tried.  I'm sure there are plenty of kernel configuration 
options that will break our driver.  But as long as all the distros our customers use 
work, as well as reasonably-configured custom kernels, we're happy.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From chrisw at osdl.org  Thu Apr 21 13:25:03 2005
From: chrisw at osdl.org (Chris Wright)
Date: Thu, 21 Apr 2005 13:25:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426809C3.7010101@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
	<20050421195641.GB13312@hexapodia.org>
	<4268080E.3000303@ammasso.com>
	<20050421201227.GI23013@shell0.pdx.osdl.net>
	<426809C3.7010101@ammasso.com>
Message-ID: <20050421202503.GO493@shell0.pdx.osdl.net>

* Timur Tabi (timur.tabi at ammasso.com) wrote:
> It works with every kernel I've tried.  I'm sure there are plenty of kernel 
> configuration options that will break our driver.  But as long as all the 
> distros our customers use work, as well as reasonably-configured custom 
> kernels, we're happy.
> 

Hey, if you're happy (and, as you said, you don't intend to merge that
bit), I'm happy ;-)

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net


From arjan at infradead.org  Thu Apr 21 13:30:57 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Thu, 21 Apr 2005 22:30:57 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050421202503.GO493@shell0.pdx.osdl.net>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com> <20050421195641.GB13312@hexapodia.org>
	<4268080E.3000303@ammasso.com>
	<20050421201227.GI23013@shell0.pdx.osdl.net>
	<426809C3.7010101@ammasso.com>
	<20050421202503.GO493@shell0.pdx.osdl.net>
Message-ID: <1114115458.6277.84.camel@laptopd505.fenrus.org>

On Thu, 2005-04-21 at 13:25 -0700, Chris Wright wrote:
> * Timur Tabi (timur.tabi at ammasso.com) wrote:
> > It works with every kernel I've tried.  I'm sure there are plenty of kernel 
> > configuration options that will break our driver.  But as long as all the 
> > distros our customers use work, as well as reasonably-configured custom 
> > kernels, we're happy.
> > 
> 
> Hey, if you're happy (and, as you said, you don't intend to merge that
> bit), I'm happy ;-)


yeah... drivers giving unprivileged processes more privs belong on
bugtraq though, not in the core kernel :)


From tduffy at sun.com  Thu Apr 21 15:31:06 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 15:31:06 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
Message-ID: <1114122666.6858.5.camel@duffman>

The sock structure was changed in 2.6.12-rc? and SDP no longer compiles
against it.  This patch allows SDP to build with either 2.6.11 or
2.6.12-rc3 as we must preserve building on current stable tree.
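
For readers unfamiliar with the version test used in the patch, here is
a small userspace sketch of how the comparison behaves.  It assumes the
usual packing of KERNEL_VERSION(a,b,c) as (a << 16) + (b << 8) + c and
that a 2.6.12-rc tree already reports 2.6.12 as its version code; the
macro and function names are stand-ins so that the block compiles
without kernel headers.

```c
#include <assert.h>

/* Stand-in for the kernel's KERNEL_VERSION() macro (assumed packing). */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Mirrors the patch's "#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11)":
 * returns 1 when the old struct sock fields should be used, 0 when the
 * sock_flag() interface should be used instead.
 */
int uses_pre_2_6_12_sock_layout(int version_code)
{
	assert(version_code > 0);
	return version_code <= SKETCH_KERNEL_VERSION(2, 6, 11);
}
```

With this packing, 2.6.11 takes the old branch and 2.6.12-rc3 (which
reports 2.6.12) takes the new one.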

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime   = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
+/* XXX Remove once 2.6.12 is released */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	sk->sk_debug        = listen_sk->sk_debug;
 	sk->sk_localroute   = listen_sk->sk_localroute;
+	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
+#else
+	if (sock_flag(sk, SOCK_DBG))
+		sock_set_flag(listen_sk, SOCK_DBG);
+	if (sock_flag(sk, SOCK_LOCALROUTE))
+		sock_set_flag(listen_sk, SOCK_LOCALROUTE);
+	if (sock_flag(sk, SOCK_RCVTSTAMP))
+		sock_set_flag(listen_sk, SOCK_RCVTSTAMP);
+#endif
 	sk->sk_sndbuf       = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf       = listen_sk->sk_rcvbuf;
 	sk->sk_no_check     = listen_sk->sk_no_check;
 	sk->sk_priority     = listen_sk->sk_priority;
-	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
 	sk->sk_rcvtimeo     = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo     = listen_sk->sk_sndtimeo;
 	sk->sk_reuse        = listen_sk->sk_reuse;


From roland at topspin.com  Thu Apr 21 15:36:19 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 21 Apr 2005 15:36:19 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114122666.6858.5.camel@duffman> (Tom Duffy's message of "Thu,
	21 Apr 2005 15:31:06 -0700")
References: <1114122666.6858.5.camel@duffman>
Message-ID: <52y8bbzrj0.fsf@topspin.com>

    Tom> The sock structure was changed in 2.6.12-rc? and SDP no
    Tom> longer compiles against it.  This patch allows SDP to build
    Tom> with either 2.6.11 or 2.6.12-rc3 as we must preserve building
    Tom> on current stable tree.

Is this really the only change required?  It seems that the socket
allocation function changed too -- 2.6.11 has

	extern struct sock *		sk_alloc(int family, int priority, int zero_it,
						 kmem_cache_t *slab);

while my up-to-date Linus tree has

	extern struct sock		*sk_alloc(int family, int priority,
						  struct proto *prot, int zero_it);

so I think sdp_conn.c at least needs some fixing up.

(I don't have time to track down what goes in struct proto right now, so
I'm not posting a real patch)

 - R.


From tduffy at sun.com  Thu Apr 21 15:52:43 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 15:52:43 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <52y8bbzrj0.fsf@topspin.com>
References: <1114122666.6858.5.camel@duffman>  <52y8bbzrj0.fsf@topspin.com>
Message-ID: <1114123963.6858.11.camel@duffman>

On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote:
> Is this really the only change required?  It seems that the socket
> allocation function changed too -- 2.6.11 has
> 
> 	extern struct sock *		sk_alloc(int family, int priority, int zero_it,
> 						 kmem_cache_t *slab);
> 
> while my up-to-date Linus tree has
> 
> 	extern struct sock		*sk_alloc(int family, int priority,
> 						  struct proto *prot, int zero_it);
> 
> so I think sdp_conn.c at least needs some fixing up.

Oh, you are right, I missed the compile warning, but I see it now.

Why does the SDP code pass in sizeof(struct inet_sock) for the zero_it
bool?

-tduffy

From libor at topspin.com  Thu Apr 21 16:17:30 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 21 Apr 2005 16:17:30 -0700
Subject: [openib-general] [PATCH][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114123963.6858.11.camel@duffman>;
	from tduffy@sun.com on Thu, Apr 21, 2005 at 03:52:43PM -0700
References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com>
	<1114123963.6858.11.camel@duffman>
Message-ID: <20050421161730.A27238@topspin.com>

On Thu, Apr 21, 2005 at 03:52:43PM -0700, Tom Duffy wrote:
> On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote:
> > Is this really the only change required?  It seems that the socket
> > allocation function changed too -- 2.6.11 has
> > 
> > 	extern struct sock *		sk_alloc(int family, int priority, int zero_it,
> > 						 kmem_cache_t *slab);
> > 
> > while my up-to-date Linus tree has
> > 
> > 	extern struct sock		*sk_alloc(int family, int priority,
> > 						  struct proto *prot, int zero_it);
> > 
> > so I think sdp_conn.c at least needs some fixing up.
> 
> Oh, you are right, I missed the compile warning, but I see it now.
> 
> Why does the SDP code pass in sizeof(struct inet_sock) for the zero_it
> bool?

  Is this a trick question? :) Because it's not a bool but an integer,
which used to be a bool in the 2.4 kernel days. Here's the relevant
code snip from net/core/sock.c:

		if (zero_it) {
			memset(sk, 0,
			       zero_it == 1 ? sizeof(struct sock) : zero_it);
			sk->sk_family = family;
			sock_lock_init(sk);
		}
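
  Put another way, the snippet treats zero_it as "how many bytes to
clear", with 1 as a legacy shorthand for sizeof(struct sock).  A tiny
userspace sketch of just that rule follows; the function name is
hypothetical and base_size stands in for sizeof(struct sock).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes the quoted code would memset for a given zero_it:
 * 0 means "do not clear", 1 means "clear the base struct", and any
 * larger value is taken literally as a byte count.
 */
size_t zeroed_bytes(int zero_it, size_t base_size)
{
	assert(zero_it >= 0);
	if (zero_it == 0)
		return 0;
	return zero_it == 1 ? base_size : (size_t)zero_it;
}
```

  So passing sizeof(struct inet_sock) clears the whole larger object
rather than only the struct sock at its start.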


-Libor


From tduffy at sun.com  Thu Apr 21 16:27:04 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 16:27:04 -0700
Subject: [openib-general] [PATCHv2][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <52y8bbzrj0.fsf@topspin.com>
References: <1114122666.6858.5.camel@duffman>  <52y8bbzrj0.fsf@topspin.com>
Message-ID: <1114126024.6858.21.camel@duffman>

On Thu, 2005-04-21 at 15:36 -0700, Roland Dreier wrote:
> (I don't have time to track down what goes in struct proto right now, so
> I'm not posting a real patch)

Ok, this patch now builds without warning on 2.6.11 and 2.6.12-rc3.

Libor, what do you think?

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(working copy)
@@ -1112,6 +1112,15 @@ error_attr:
 	return result;
 }
 
+/* XXX remove if/else (leave struct) once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) )
+static struct proto sdp_proto = {
+	.name		= "sdp_sock",
+	.owner		= THIS_MODULE,
+	.obj_size	= sizeof(struct inet_sock),
+};
+#endif
+
 /*
  * sdp_conn_alloc - allocate a new socket, and init.
  */
@@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	int result;
 
 	sk = sk_alloc(dev_root_s.proto, priority, 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 		      sizeof(struct inet_sock), dev_root_s.sock_cache);
+#else
+		      &sdp_proto, sizeof(struct inet_sock));
+#endif
 	if (!sk) {
 		sdp_dbg_warn(NULL, "socket alloc error for protocol. <%d:%d>",
 			     dev_root_s.proto, priority);
@@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family
 		goto error_conn;
 	}
 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
         dev_root_s.sock_cache = kmem_cache_create("sdp_sock",
 						  sizeof(struct inet_sock), 
 						  0, SLAB_HWCACHE_ALIGN,
@@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family
 		result = -ENOMEM;
 		goto error_sock;
         }
+#else
+	if (proto_register(&sdp_proto, 1) != 0) {
+		sdp_warn("Failed to register sdp proto.");
+		result = -ENOMEM;
+		goto error_sock;
+	}
+#endif
 
 	/*
 	 * start listening
@@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family
 error_listen:
 	(void)ib_destroy_cm_id(dev_root_s.listen_id);
 error_cm_id:
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 error_sock:
 	kmem_cache_destroy(dev_root_s.conn_cache);
 error_conn:
@@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void)
 	 * delete conn cache
 	 */
 	kmem_cache_destroy(dev_root_s.conn_cache);
+/* Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	/*
 	 * delete sock cache
 	 */
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 	/*
 	 * stop listening
 	 */
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime   = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
+/* XXX Remove once 2.6.12 is released */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	sk->sk_debug        = listen_sk->sk_debug;
 	sk->sk_localroute   = listen_sk->sk_localroute;
+	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
+#else
+	if (sock_flag(sk, SOCK_DBG))
+		sock_set_flag(listen_sk, SOCK_DBG);
+	if (sock_flag(sk, SOCK_LOCALROUTE))
+		sock_set_flag(listen_sk, SOCK_LOCALROUTE);
+	if (sock_flag(sk, SOCK_RCVTSTAMP))
+		sock_set_flag(listen_sk, SOCK_RCVTSTAMP);
+#endif
 	sk->sk_sndbuf       = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf       = listen_sk->sk_rcvbuf;
 	sk->sk_no_check     = listen_sk->sk_no_check;
 	sk->sk_priority     = listen_sk->sk_priority;
-	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
 	sk->sk_rcvtimeo     = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo     = listen_sk->sk_sndtimeo;
 	sk->sk_reuse        = listen_sk->sk_reuse;
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(working copy)
@@ -201,7 +201,10 @@ struct sdev_root {
 	 * cache's
 	 */
 	kmem_cache_t *conn_cache;
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_t *sock_cache;
+#endif
 };
 
 #endif /* _SDP_DEV_H */


From ftillier at infiniconsys.com  Thu Apr 21 16:31:59 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Thu, 21 Apr 2005 16:31:59 -0700
Subject: [openib-general] [PATCHv2][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114126024.6858.21.camel@duffman>
Message-ID: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>

> From: Tom Duffy [mailto:tduffy at sun.com]
> Sent: Thursday, April 21, 2005 4:27 PM
> 
> Ok, this patch now builds without warning on 2.6.11 and 2.6.12-rc3.
> 
> Libor, what do you think?
> 
> Signed-off-by: Tom Duffy <tduffy at sun.com>
> 
> Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
> ===================================================================
> --- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
> +++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
> @@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
>  	 */
>  	sk->sk_lingertime   = listen_sk->sk_lingertime;
>  	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
> +/* XXX Remove once 2.6.12 is released */
> +#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
>  	sk->sk_debug        = listen_sk->sk_debug;
>  	sk->sk_localroute   = listen_sk->sk_localroute;
> +	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
> +#else
> +	if (sock_flag(sk, SOCK_DBG))
> +		sock_set_flag(listen_sk, SOCK_DBG);
> +	if (sock_flag(sk, SOCK_LOCALROUTE))
> +		sock_set_flag(listen_sk, SOCK_LOCALROUTE);
> +	if (sock_flag(sk, SOCK_RCVTSTAMP))
> +		sock_set_flag(listen_sk, SOCK_RCVTSTAMP);
> +#endif

Isn't the above change backwards?  The original code was copying settings
from listen_sk to sk, and the new code seems to be checking flags in sk to
determine whether to set them in listen_sk.
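
To make the direction concrete, here is a userspace sketch of what the
copy presumably ought to look like: test each flag on listen_sk and set
it on sk.  The struct and the flag helpers are stand-ins for the
kernel's struct sock, sock_flag() and sock_set_flag(), not the real
definitions, and the sketch assumes the new socket starts with these
flags clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for SOCK_DBG, SOCK_LOCALROUTE and SOCK_RCVTSTAMP. */
enum flag_standin { FLAG_DBG, FLAG_LOCALROUTE, FLAG_RCVTSTAMP };

/* Stand-in for struct sock: only the flag word matters here. */
struct sock_standin {
	unsigned long flags;
};

static bool flag_test(const struct sock_standin *sk, enum flag_standin f)
{
	return (sk->flags >> f) & 1UL;
}

static void flag_set(struct sock_standin *sk, enum flag_standin f)
{
	sk->flags |= 1UL << f;
}

/* Copy the inheritable flags from the listening socket to the new one. */
void inherit_flags(struct sock_standin *sk, const struct sock_standin *listen_sk)
{
	static const enum flag_standin inherited[] = {
		FLAG_DBG, FLAG_LOCALROUTE, FLAG_RCVTSTAMP,
	};
	size_t i;

	assert(sk != NULL && listen_sk != NULL);
	for (i = 0; i < sizeof(inherited) / sizeof(inherited[0]); i++)
		if (flag_test(listen_sk, inherited[i]))
			flag_set(sk, inherited[i]);
}
```

The listening socket is only read, which matches the plain assignments
the original code used.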

- Fab


From tduffy at sun.com  Thu Apr 21 16:33:14 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 16:33:14 -0700
Subject: [openib-general] [PATCHv3][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <20050421161730.A27238@topspin.com>
References: <1114122666.6858.5.camel@duffman> <52y8bbzrj0.fsf@topspin.com>
	<1114123963.6858.11.camel@duffman> <20050421161730.A27238@topspin.com>
Message-ID: <1114126394.6858.26.camel@duffman>

On Thu, 2005-04-21 at 16:17 -0700, Libor Michalek wrote:
>   Is this a trick question? :) Because it's not a bool but an integer,
> which used to be a bool in the 2.4 kernel days. Here's the relevant
> code snip from net/core/sock.c:
> 
> 		if (zero_it) {
> 			memset(sk, 0,
> 			       zero_it == 1 ? sizeof(struct sock) : zero_it);
> 			sk->sk_family = family;
> 			sock_lock_init(sk);
> 		}

Sorry, I was looking at the new code where it is (used as) a bool again.
In fact, I fucked up: in my v2 patch, the *new* call to sk_alloc
should just pass 1 for zero_it.  Here is v3.

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(working copy)
@@ -1112,6 +1112,15 @@ error_attr:
 	return result;
 }
 
+/* XXX remove if/else (leave struct) once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) )
+static struct proto sdp_proto = {
+	.name		= "sdp_sock",
+	.owner		= THIS_MODULE,
+	.obj_size	= sizeof(struct inet_sock),
+};
+#endif
+
 /*
  * sdp_conn_alloc - allocate a new socket, and init.
  */
@@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	int result;
 
 	sk = sk_alloc(dev_root_s.proto, priority, 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 		      sizeof(struct inet_sock), dev_root_s.sock_cache);
+#else
+		      &sdp_proto, 1);
+#endif
 	if (!sk) {
 		sdp_dbg_warn(NULL, "socket alloc error for protocol. <%d:%d>",
 			     dev_root_s.proto, priority);
@@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family
 		goto error_conn;
 	}
 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
         dev_root_s.sock_cache = kmem_cache_create("sdp_sock",
 						  sizeof(struct inet_sock), 
 						  0, SLAB_HWCACHE_ALIGN,
@@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family
 		result = -ENOMEM;
 		goto error_sock;
         }
+#else
+	if (proto_register(&sdp_proto, 1) != 0) {
+		sdp_warn("Failed to register sdp proto.");
+		result = -ENOMEM;
+		goto error_sock;
+	}
+#endif
 
 	/*
 	 * start listening
@@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family
 error_listen:
 	(void)ib_destroy_cm_id(dev_root_s.listen_id);
 error_cm_id:
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 error_sock:
 	kmem_cache_destroy(dev_root_s.conn_cache);
 error_conn:
@@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void)
 	 * delete conn cache
 	 */
 	kmem_cache_destroy(dev_root_s.conn_cache);
+/* Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	/*
 	 * delete sock cache
 	 */
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 	/*
 	 * stop listening
 	 */
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime   = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
+/* XXX Remove once 2.6.12 is released */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	sk->sk_debug        = listen_sk->sk_debug;
 	sk->sk_localroute   = listen_sk->sk_localroute;
+	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
+#else
+	if (sock_flag(sk, SOCK_DBG))
+		sock_set_flag(listen_sk, SOCK_DBG);
+	if (sock_flag(sk, SOCK_LOCALROUTE))
+		sock_set_flag(listen_sk, SOCK_LOCALROUTE);
+	if (sock_flag(sk, SOCK_RCVTSTAMP))
+		sock_set_flag(listen_sk, SOCK_RCVTSTAMP);
+#endif
 	sk->sk_sndbuf       = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf       = listen_sk->sk_rcvbuf;
 	sk->sk_no_check     = listen_sk->sk_no_check;
 	sk->sk_priority     = listen_sk->sk_priority;
-	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
 	sk->sk_rcvtimeo     = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo     = listen_sk->sk_sndtimeo;
 	sk->sk_reuse        = listen_sk->sk_reuse;
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(working copy)
@@ -201,7 +201,10 @@ struct sdev_root {
 	 * cache's
 	 */
 	kmem_cache_t *conn_cache;
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_t *sock_cache;
+#endif
 };
 
 #endif /* _SDP_DEV_H */


From tduffy at sun.com  Thu Apr 21 16:37:54 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 21 Apr 2005 16:37:54 -0700
Subject: [openib-general] [PATCHv4][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>
References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>
Message-ID: <1114126674.6858.31.camel@duffman>

On Thu, 2005-04-21 at 16:31 -0700, Fab Tillier wrote:
> Isn't the above change backwards?  The original code was copying settings
> from listen_sk to sk, and the new code seems to be checking flags in sk to
> determine whether to set them in listen_sk.

You are so right.  My brain ain't on today or something.

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(working copy)
@@ -1112,6 +1112,15 @@ error_attr:
 	return result;
 }
 
+/* XXX remove if/else (leave struct) once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE > KERNEL_VERSION(2,6,11) )
+static struct proto sdp_proto = {
+	.name		= "sdp_sock",
+	.owner		= THIS_MODULE,
+	.obj_size	= sizeof(struct inet_sock),
+};
+#endif
+
 /*
  * sdp_conn_alloc - allocate a new socket, and init.
  */
@@ -1122,7 +1131,12 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	int result;
 
 	sk = sk_alloc(dev_root_s.proto, priority, 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 		      sizeof(struct inet_sock), dev_root_s.sock_cache);
+#else
+		      &sdp_proto, 1);
+#endif
 	if (!sk) {
 		sdp_dbg_warn(NULL, "socket alloc error for protocol. <%d:%d>",
 			     dev_root_s.proto, priority);
@@ -1966,6 +1980,8 @@ int sdp_conn_table_init(int proto_family
 		goto error_conn;
 	}
 
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
         dev_root_s.sock_cache = kmem_cache_create("sdp_sock",
 						  sizeof(struct inet_sock), 
 						  0, SLAB_HWCACHE_ALIGN,
@@ -1975,6 +1991,13 @@ int sdp_conn_table_init(int proto_family
 		result = -ENOMEM;
 		goto error_sock;
         }
+#else
+	if (proto_register(&sdp_proto, 1) != 0) {
+		sdp_warn("Failed to register sdp proto.");
+		result = -ENOMEM;
+		goto error_sock;
+	}
+#endif
 
 	/*
 	 * start listening
@@ -2002,7 +2025,12 @@ int sdp_conn_table_init(int proto_family
 error_listen:
 	(void)ib_destroy_cm_id(dev_root_s.listen_id);
 error_cm_id:
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 error_sock:
 	kmem_cache_destroy(dev_root_s.conn_cache);
 error_conn:
@@ -2049,10 +2077,15 @@ int sdp_conn_table_clear(void)
 	 * delete conn cache
 	 */
 	kmem_cache_destroy(dev_root_s.conn_cache);
+/* Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	/*
 	 * delete sock cache
 	 */
 	kmem_cache_destroy(dev_root_s.sock_cache);
+#else
+	proto_unregister(&sdp_proto);
+#endif
 	/*
 	 * stop listening
 	 */
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -356,13 +356,23 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime   = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
+/* XXX Remove once 2.6.12 is released */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	sk->sk_debug        = listen_sk->sk_debug;
 	sk->sk_localroute   = listen_sk->sk_localroute;
+	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
+#else
+	if (sock_flag(listen_sk, SOCK_DBG))
+		sock_set_flag(sk, SOCK_DBG);
+	if (sock_flag(listen_sk, SOCK_LOCALROUTE))
+		sock_set_flag(sk, SOCK_LOCALROUTE);
+	if (sock_flag(listen_sk, SOCK_RCVTSTAMP))
+		sock_set_flag(sk, SOCK_RCVTSTAMP);
+#endif
 	sk->sk_sndbuf       = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf       = listen_sk->sk_rcvbuf;
 	sk->sk_no_check     = listen_sk->sk_no_check;
 	sk->sk_priority     = listen_sk->sk_priority;
-	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
 	sk->sk_rcvtimeo     = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo     = listen_sk->sk_sndtimeo;
 	sk->sk_reuse        = listen_sk->sk_reuse;
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(revision 2207)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(working copy)
@@ -201,7 +201,10 @@ struct sdev_root {
 	 * cache's
 	 */
 	kmem_cache_t *conn_cache;
+/* XXX Remove once 2.6.12 is out */
+#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,11) )
 	kmem_cache_t *sock_cache;
+#endif
 };
 
 #endif /* _SDP_DEV_H */


From mshefty at ichips.intel.com  Thu Apr 21 18:34:39 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 21 Apr 2005 18:34:39 -0700
Subject: [openib-general] [PATCH] [MAD RMPP] add RMPP send support to MAD
	layer
Message-ID: <20050421183439.262c8233.mshefty@ichips.intel.com>

The following patch adds RMPP send support to the kernel MAD layer.

- NACKs are not implemented
- Spec compliant double-sided transfers are not implemented.  Request/
  reply matching works, but missing is the ACK to the ACK that
  occurs during the RMPP direction switch.
- Clients are limited to a single sge.
- Timeout values are hard-coded until such time that packet lifetimes
  magically appear.
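
As a rough illustration of the segment arithmetic that ib_send_rmpp_mad()
and send_next_seg() perform below, here is a standalone user-space sketch.
It is not part of the patch.  The byte sizes in the enum, the struct, and
the helper name rmpp_layout() are assumptions made for the example, and it
assumes the caller set the payload length to the class-specific header size
plus the number of valid data bytes.

```c
#include <assert.h>

/* Assumed sizes in bytes, for illustration only; not taken from the patch. */
enum {
	MAD_SIZE        = 256, /* one MAD, i.e. sizeof(struct ib_rmpp_mad) */
	RMPP_DATA_OFF   = 36,  /* MAD header (24) + RMPP header (12) */
	SA_DATA_OFF     = 56,  /* ... plus the SA header (20) */
	VENDOR_DATA_OFF = 40   /* ... plus reserved byte and OUI (4) */
};

/* Hypothetical result type for the sketch. */
struct rmpp_layout {
	int total_seg;   /* number of segments to send */
	int pad;         /* unused bytes at the end of the send buffer */
	int last_paylen; /* payload length carried in the last segment */
};

/*
 * total_len:   length of the caller's single sge
 * data_offset: where the class Data field starts (see the enum above)
 * paylen:      payload length as set by the caller
 */
static struct rmpp_layout rmpp_layout(int total_len, int data_offset,
				      int paylen)
{
	struct rmpp_layout l;
	int seg_data = MAD_SIZE - data_offset;

	/* The data buffer must be a whole number of Data fields. */
	assert(total_len >= data_offset);
	assert((total_len - data_offset) % seg_data == 0);

	l.total_seg = (total_len - data_offset) / seg_data;
	/* Pad and payload are measured from the generic RMPP data offset. */
	l.pad = total_len - RMPP_DATA_OFF - paylen;
	l.last_paylen = MAD_SIZE - RMPP_DATA_OFF - l.pad;
	return l;
}
```

For example, an SA send with three 200-byte data segments has a 656-byte
buffer.  With 450 valid data bytes the caller's payload length is 470, so
rmpp_layout(656, SA_DATA_OFF, 470) gives 3 segments, a pad of 150, and a
payload length of 70 in the last segment (the 20-byte SA header plus the
remaining 50 data bytes).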

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


Index: include/ib_verbs.h
===================================================================
--- include/ib_verbs.h	(revision 2207)
+++ include/ib_verbs.h	(working copy)
@@ -573,6 +573,7 @@
 			u32	remote_qpn;
 			u32	remote_qkey;
 			int	timeout_ms; /* valid for MADs only */
+			int	retries;    /* valid for MADs only */
 			u16	pkey_index; /* valid for GSI only */
 			u8	port_num;   /* valid for DR SMPs on switch only */
 		} ud;
Index: core/mad_rmpp.c
===================================================================
--- core/mad_rmpp.c	(revision 2207)
+++ core/mad_rmpp.c	(working copy)
@@ -76,20 +76,6 @@
 	struct ib_sge sge;
 };
 
-static struct ib_ah * create_ah_from_wc(struct ib_pd *pd, struct ib_wc *wc,
-					u8 port_num)
-{
-	struct ib_ah_attr ah_attr;
-
-	memset(&ah_attr, 0, sizeof ah_attr);
-	ah_attr.dlid = wc->slid;
-	ah_attr.sl = wc->sl;
-	ah_attr.src_path_bits = wc->dlid_path_bits;
-	ah_attr.port_num = port_num;
-
-	return ib_create_ah(pd, &ah_attr);
-}
-
 static void destroy_rmpp_recv(struct mad_rmpp_recv *rmpp_recv)
 {
 	atomic_dec(&rmpp_recv->refcount);
@@ -164,9 +150,10 @@
 	if (!rmpp_recv)
 		return NULL;
 
-	rmpp_recv->ah = create_ah_from_wc(agent->agent.qp->pd,
-					  mad_recv_wc->wc,
-					  agent->agent.port_num);
+	rmpp_recv->ah = ib_create_ah_from_wc(agent->agent.qp->pd,
+					     mad_recv_wc->wc,
+					     mad_recv_wc->recv_buf.grh,
+					     agent->agent.port_num);
 	if (IS_ERR(rmpp_recv->ah))
 		goto error;
 
@@ -291,18 +278,28 @@
 	kfree(msg);
 }
 
+static int data_offset(u8 mgmt_class)
+{
+	if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM)
+		return offsetof(struct ib_sa_mad, data);
+	else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) &&
+		 (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END))
+		return offsetof(struct ib_vendor_mad, data);
+	else
+		return offsetof(struct ib_rmpp_mad, data);
+}
+
 static void format_ack(struct ib_rmpp_mad *ack,
 		       struct ib_rmpp_mad *data,
 		       struct mad_rmpp_recv *rmpp_recv)
 {
 	unsigned long flags;
 
-	ack->mad_hdr = data->mad_hdr;
+	memcpy(&ack->mad_hdr, &data->mad_hdr,
+	       data_offset(data->mad_hdr.mgmt_class));
+
 	ack->mad_hdr.method ^= IB_MGMT_METHOD_RESP;
-	ack->rmpp_hdr.rmpp_version = data->rmpp_hdr.rmpp_version;
 	ack->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_ACK;
-	ib_set_rmpp_resptime(&ack->rmpp_hdr,
-			     ib_get_rmpp_resptime(&data->rmpp_hdr));
 	ib_set_rmpp_flags(&ack->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE);
 
 	spin_lock_irqsave(&rmpp_recv->lock, flags);
@@ -392,12 +389,18 @@
 
 static inline int get_mad_len(struct mad_rmpp_recv *rmpp_recv)
 {
-	int hdr_size;
+	struct ib_rmpp_mad *rmpp_mad;
+	int hdr_size, data_size, pad;
 
-	/* TODO: need to check for SA MADs - requires access to SA header */
-	hdr_size = sizeof(struct ib_mad_hdr) + sizeof(struct ib_rmpp_hdr);
-	return rmpp_recv->seg_num * (sizeof(struct ib_mad) - hdr_size) +
-	       hdr_size;
+	rmpp_mad = (struct ib_rmpp_mad *)rmpp_recv->cur_seg_buf->mad;
+
+	hdr_size = data_offset(rmpp_mad->mad_hdr.mgmt_class);
+	data_size = sizeof(struct ib_rmpp_mad) - hdr_size;
+	pad = be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin);
+	if (pad > data_size)
+		pad = 0;
+
+	return hdr_size + rmpp_recv->seg_num * data_size - pad;
 }
 
 static struct ib_mad_recv_wc * complete_rmpp(struct mad_rmpp_recv *rmpp_recv)
@@ -513,6 +516,121 @@
 	return mad_recv_wc;
 }
 
+static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset +
+	       (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) *
+	       (mad_send_wr->seg_num - 1);
+}
+
+static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_rmpp_mad *rmpp_mad;
+	int timeout;
+
+	rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr;
+	ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE);
+	rmpp_mad->rmpp_hdr.seg_num = cpu_to_be32(mad_send_wr->seg_num);
+
+	if (mad_send_wr->seg_num == 1) {
+		rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_FIRST;
+		rmpp_mad->rmpp_hdr.paylen_newwin =
+			cpu_to_be32(mad_send_wr->total_seg *
+				    (sizeof(struct ib_rmpp_mad) -
+				       offsetof(struct ib_rmpp_mad, data)));
+		mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad);
+	} else {
+		mad_send_wr->send_wr.num_sge = 2;
+		mad_send_wr->sg_list[0].length = mad_send_wr->data_offset;
+		mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr);
+		mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) -
+						 mad_send_wr->data_offset;
+		mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey;
+	}
+
+	if (mad_send_wr->seg_num == mad_send_wr->total_seg) {
+		rmpp_mad->rmpp_hdr.rmpp_rtime_flags |= IB_MGMT_RMPP_FLAG_LAST;
+		rmpp_mad->rmpp_hdr.paylen_newwin =
+			cpu_to_be32(sizeof(struct ib_rmpp_mad) -
+				    offsetof(struct ib_rmpp_mad, data) -
+				    mad_send_wr->pad);
+	}
+
+	/* 5 seconds until we can find the packet lifetime */
+	timeout = mad_send_wr->send_wr.wr.ud.timeout_ms;
+	if (timeout && timeout < 5000)
+		mad_send_wr->timeout = msecs_to_jiffies(timeout);
+	else
+		mad_send_wr->timeout = msecs_to_jiffies(5000);
+	mad_send_wr->seg_num++;
+
+	return ib_send_mad(mad_send_wr);
+}
+
+static void process_rmpp_ack(struct ib_mad_agent_private *agent,
+			     struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_mad_send_wr_private *mad_send_wr;
+	struct ib_rmpp_mad *rmpp_mad;
+	unsigned long flags;
+	int seg_num, newwin, ret;
+
+	rmpp_mad = (struct ib_rmpp_mad *)mad_recv_wc->recv_buf.mad;
+	if (rmpp_mad->rmpp_hdr.rmpp_status)
+		return;
+
+	seg_num = be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num);
+	newwin = be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin);
+
+	spin_lock_irqsave(&agent->lock, flags);
+	mad_send_wr = ib_find_send_mad(agent, rmpp_mad->mad_hdr.tid);
+	if (!mad_send_wr)
+		goto out;	/* Unmatched ACK */
+
+	if ((mad_send_wr->last_ack == mad_send_wr->total_seg) ||
+	    (!mad_send_wr->timeout) || (mad_send_wr->status != IB_WC_SUCCESS))
+		goto out;	/* Send is already done */
+
+	if (seg_num > mad_send_wr->total_seg)
+		goto out;	/* Bad ACK */
+
+	if (newwin < mad_send_wr->newwin || seg_num < mad_send_wr->last_ack)
+		goto out;	/* Old ACK */
+
+	if (seg_num > mad_send_wr->last_ack) {
+		mad_send_wr->last_ack = seg_num;
+		mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries;
+	}
+	mad_send_wr->newwin = newwin;
+	if (mad_send_wr->refcount > 1)
+		goto out;	/* Send is active */
+
+	if (mad_send_wr->last_ack == mad_send_wr->total_seg) {
+		/* If no response is expected, the ACK completes the send */
+		if (!mad_send_wr->send_wr.wr.ud.timeout_ms) {
+			struct ib_mad_send_wc wc;
+
+			ib_mark_mad_done(mad_send_wr);
+			spin_unlock_irqrestore(&agent->lock, flags);
+
+			wc.status = IB_WC_SUCCESS;
+			wc.vendor_err = 0;
+			wc.wr_id = mad_send_wr->wr_id;
+			ib_mad_complete_send_wr(mad_send_wr, &wc);
+			return;
+		}
+		ib_reset_mad_timeout(mad_send_wr,
+				     mad_send_wr->send_wr.wr.ud.timeout_ms);
+	} else if (mad_send_wr->seg_num < mad_send_wr->newwin) {
+		/* Send failure will just result in a timeout/retry */
+		ret = send_next_seg(mad_send_wr);
+		if (!ret)
+			mad_send_wr->refcount++;
+	}
+out:
+	spin_unlock_irqrestore(&agent->lock, flags);
+}
+
 struct ib_mad_recv_wc *
 ib_process_rmpp_recv_wc(struct ib_mad_agent_private *agent,
 			struct ib_mad_recv_wc *mad_recv_wc)
@@ -523,6 +641,9 @@
 	if (!(rmpp_mad->rmpp_hdr.rmpp_rtime_flags & IB_MGMT_RMPP_FLAG_ACTIVE))
 		return mad_recv_wc;
 
+	if (rmpp_mad->rmpp_hdr.rmpp_version != IB_MGMT_RMPP_VERSION)
+		goto out;
+
 	switch (rmpp_mad->rmpp_hdr.rmpp_type) {
 	case IB_MGMT_RMPP_TYPE_DATA:
 		if (rmpp_mad->rmpp_hdr.seg_num == __constant_htonl(1))
@@ -530,38 +651,121 @@
 		else
 			return continue_rmpp(agent, mad_recv_wc);
 	case IB_MGMT_RMPP_TYPE_ACK:
-		/* process_rmpp_ack(agent, mad_recv_wc); */
+		process_rmpp_ack(agent, mad_recv_wc);
 		break;
 	case IB_MGMT_RMPP_TYPE_STOP:
 	case IB_MGMT_RMPP_TYPE_ABORT:
-		/* process_rmpp_nack(agent, mad_recv_wc); */
+		/* TODO: process_rmpp_nack(agent, mad_recv_wc); */
 		break;
 	default:
 		break;
 	}
+out:
 	ib_free_recv_mad(mad_recv_wc);
 	return NULL;
 }
 
+int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_rmpp_mad *rmpp_mad;
+	int i, total_len, ret;
+
+	rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr;
+	if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) &
+	      IB_MGMT_RMPP_FLAG_ACTIVE))
+		return IB_RMPP_RESULT_UNHANDLED;
+
+	if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA)
+		return IB_RMPP_RESULT_INTERNAL;
+
+	if (mad_send_wr->send_wr.num_sge > 1)
+		return -EINVAL;		/* TODO: support num_sge > 1 */
+
+	mad_send_wr->seg_num = 1;
+	mad_send_wr->newwin = 1;
+	mad_send_wr->data_offset = data_offset(rmpp_mad->mad_hdr.mgmt_class);
+
+	total_len = 0;
+	for (i = 0; i < mad_send_wr->send_wr.num_sge; i++)
+		total_len += mad_send_wr->send_wr.sg_list[i].length;
 
-enum ib_mad_result
-ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr,
-			struct ib_mad_send_wc *mad_send_wc)
+        mad_send_wr->total_seg = (total_len - mad_send_wr->data_offset) /
+			(sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset);
+	mad_send_wr->pad = total_len - offsetof(struct ib_rmpp_mad, data) -
+			   be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin);
+	mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries;
+
+	/* We need to wait for the final ACK even if there isn't a response */
+	mad_send_wr->refcount += (mad_send_wr->timeout == 0);
+
+	ret = send_next_seg(mad_send_wr);
+	if (!ret)
+		return IB_RMPP_RESULT_CONSUMED;
+	return ret;
+}
+
+int ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr,
+			    struct ib_mad_send_wc *mad_send_wc)
 {
 	struct ib_rmpp_mad *rmpp_mad;
 	struct rmpp_msg *msg;
+	int ret;
 
 	rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr;
 	if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & 
 	      IB_MGMT_RMPP_FLAG_ACTIVE))
-		return IB_MAD_RESULT_SUCCESS;
+		return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */
 
 	if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) {
 		msg = (struct rmpp_msg *) (unsigned long) mad_send_wc->wr_id;
 		free_rmpp_msg(msg);
-		return IB_MAD_RESULT_CONSUMED;
+		return IB_RMPP_RESULT_INTERNAL;	 /* ACK, STOP, or ABORT */
 	}
 
-	/* TODO: continue send until done - ACKed or we have a response */
-	return IB_MAD_RESULT_SUCCESS;
+	if (mad_send_wc->status != IB_WC_SUCCESS ||
+	    mad_send_wr->status != IB_WC_SUCCESS)
+		return IB_RMPP_RESULT_PROCESSED; /* Canceled or send error */
+
+	if (!mad_send_wr->timeout)
+		return IB_RMPP_RESULT_PROCESSED; /* Response received */
+
+	if (mad_send_wr->last_ack == mad_send_wr->total_seg) {
+		mad_send_wr->timeout =
+			msecs_to_jiffies(mad_send_wr->send_wr.wr.ud.timeout_ms);
+		return IB_RMPP_RESULT_PROCESSED; /* Send done */
+	}
+
+	if (mad_send_wr->seg_num > mad_send_wr->newwin ||
+	    mad_send_wr->seg_num > mad_send_wr->total_seg)
+		return IB_RMPP_RESULT_PROCESSED; /* Wait for ACK */
+
+	ret = send_next_seg(mad_send_wr);
+	if (ret) {
+		mad_send_wc->status = IB_WC_GENERAL_ERR;
+		return IB_RMPP_RESULT_PROCESSED;
+	}
+	return IB_RMPP_RESULT_CONSUMED;
+}
+
+int ib_timeout_rmpp(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_rmpp_mad *rmpp_mad;
+	int ret;
+
+	rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr;
+	if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & 
+	      IB_MGMT_RMPP_FLAG_ACTIVE))
+		return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */
+
+	if (mad_send_wr->last_ack == mad_send_wr->total_seg ||
+	    !mad_send_wr->retries--)
+		return IB_RMPP_RESULT_PROCESSED;
+
+	mad_send_wr->seg_num = mad_send_wr->last_ack + 1;
+	ret = send_next_seg(mad_send_wr);
+	if (ret)
+		return IB_RMPP_RESULT_PROCESSED;
+
+	mad_send_wr->refcount++;
+	return IB_RMPP_RESULT_CONSUMED;
 }
Index: core/mad.c
===================================================================
--- core/mad.c	(revision 2207)
+++ core/mad.c	(working copy)
@@ -63,8 +63,6 @@
 static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info,
 				    struct ib_mad_private *mad);
 static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv);
-static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
-				    struct ib_mad_send_wc *mad_send_wc);
 static void timeout_sends(void *data);
 static void cancel_sends(void *data);
 static void local_completions(void *data);
@@ -851,7 +849,7 @@
 }
 EXPORT_SYMBOL(ib_free_send_mad);
 
-static int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
+int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	struct ib_mad_qp_info *qp_info;
 	struct ib_send_wr *bad_send_wr;
@@ -953,19 +951,18 @@
 			ret = -ENOMEM;
 			goto error2;
 		}
+		memset(mad_send_wr, 0, sizeof *mad_send_wr);
 
 		mad_send_wr->send_wr = *send_wr;
 		mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list;
 		memcpy(mad_send_wr->sg_list, send_wr->sg_list,
 		       sizeof *send_wr->sg_list * send_wr->num_sge);
-		mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id;
-		mad_send_wr->send_wr.next = NULL;
+		mad_send_wr->wr_id = send_wr->wr_id;
 		mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid;
 		mad_send_wr->mad_agent_priv = mad_agent_priv;
 		/* Timeout will be updated after send completes */
 		mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr.
 							ud.timeout_ms);
-		mad_send_wr->retry = 0;
 		/* One reference for each work request to QP + response */
 		mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0);
 		mad_send_wr->status = IB_WC_SUCCESS;
@@ -977,8 +974,13 @@
 			      &mad_agent_priv->send_list);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
-		ret = ib_send_mad(mad_send_wr);
-		if (ret) {
+		if (mad_agent_priv->agent.rmpp_version) {
+			ret = ib_send_rmpp_mad(mad_send_wr);
+			if (ret >= 0 && ret != IB_RMPP_RESULT_CONSUMED)
+				ret = ib_send_mad(mad_send_wr);
+		} else
+			ret = ib_send_mad(mad_send_wr);
+		if (ret < 0) {
 			/* Fail send request */
 			spin_lock_irqsave(&mad_agent_priv->lock, flags);
 			list_del(&mad_send_wr->agent_list);
@@ -1538,19 +1540,6 @@
 	return valid;
 }
 
-static struct ib_mad_recv_wc *
-process_recv(struct ib_mad_agent_private *mad_agent_priv,
-	     struct ib_mad_recv_wc *mad_recv_wc)
-{
-	INIT_LIST_HEAD(&mad_recv_wc->rmpp_list);
-	list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list);
-	
-	if (mad_agent_priv->agent.rmpp_version)
-		return ib_process_rmpp_recv_wc(mad_agent_priv, mad_recv_wc);
-	else
-		return mad_recv_wc;
-}
-
 static int is_data_mad(struct ib_mad_agent_private *mad_agent_priv,
 		       struct ib_mad_hdr *mad_hdr)
 {
@@ -1563,9 +1552,8 @@
 		(rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_DATA);
 }
 
-static struct ib_mad_send_wr_private*
-find_send_req(struct ib_mad_agent_private *mad_agent_priv,
-	      u64 tid)
+struct ib_mad_send_wr_private*
+ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, u64 tid)
 {
 	struct ib_mad_send_wr_private *mad_send_wr;
 
@@ -1592,7 +1580,7 @@
 	return NULL;
 }
 
-static void ib_mark_req_done(struct ib_mad_send_wr_private *mad_send_wr)
+void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	mad_send_wr->timeout = 0;
 	if (mad_send_wr->refcount == 1) {
@@ -1610,19 +1598,24 @@
 	unsigned long flags;
 	u64 tid;
 
-	/* Process the receive before giving it to the user. */
-	mad_recv_wc = process_recv(mad_agent_priv, mad_recv_wc);
-	if (!mad_recv_wc) {
-		if (atomic_dec_and_test(&mad_agent_priv->refcount))
-			wake_up(&mad_agent_priv->wait);
-		return;
+	INIT_LIST_HEAD(&mad_recv_wc->rmpp_list);
+	list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list);
+	
+	if (mad_agent_priv->agent.rmpp_version) {
+		mad_recv_wc = ib_process_rmpp_recv_wc(mad_agent_priv,
+						      mad_recv_wc);
+		if (!mad_recv_wc) {
+			if (atomic_dec_and_test(&mad_agent_priv->refcount))
+				wake_up(&mad_agent_priv->wait);
+			return;
+		}
 	}
 
 	/* Complete corresponding request */
 	if (response_mad(mad_recv_wc->recv_buf.mad)) {
 		tid = mad_recv_wc->recv_buf.mad->mad_hdr.tid;
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
-		mad_send_wr = find_send_req(mad_agent_priv, tid);
+		mad_send_wr = ib_find_send_mad(mad_agent_priv, tid);
 		if (!mad_send_wr) {
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 			ib_free_recv_mad(mad_recv_wc);
@@ -1630,7 +1623,7 @@
 				wake_up(&mad_agent_priv->wait);
 			return;
 		}
-		ib_mark_req_done(mad_send_wr);
+		ib_mark_mad_done(mad_send_wr);
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 		/* Defined behavior is to complete response before request */
@@ -1821,23 +1814,33 @@
 	}
 }
 
+void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr,
+			  int timeout_ms)
+{
+	mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
+	wait_for_response(mad_send_wr);
+	adjust_timeout(mad_send_wr->mad_agent_priv);
+}
+
 /*
  * Process a send work completion
  */
-static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
-				    struct ib_mad_send_wc *mad_send_wc)
+void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
+			     struct ib_mad_send_wc *mad_send_wc)
 {
 	struct ib_mad_agent_private	*mad_agent_priv;
 	unsigned long			flags;
-	enum ib_mad_result		ret;
+	int				ret;
 
 	mad_agent_priv = mad_send_wr->mad_agent_priv;
-	if (mad_agent_priv->agent.rmpp_version)
+	spin_lock_irqsave(&mad_agent_priv->lock, flags);
+	if (mad_agent_priv->agent.rmpp_version) {
 		ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc);
-	else
-		ret = IB_MAD_RESULT_SUCCESS;
+		if (ret == IB_RMPP_RESULT_CONSUMED)
+			goto done;
+	} else
+		ret = IB_RMPP_RESULT_UNHANDLED;
 
-	spin_lock_irqsave(&mad_agent_priv->lock, flags);
 	if (mad_send_wc->status != IB_WC_SUCCESS &&
 	    mad_send_wr->status == IB_WC_SUCCESS) {
 		mad_send_wr->status = mad_send_wc->status;
@@ -1849,8 +1852,7 @@
 		    mad_send_wr->status == IB_WC_SUCCESS) {
 			wait_for_response(mad_send_wr);
 		}
-		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
-		return;
+		goto done;
 	}
 
 	/* Remove send from MAD agent and notify client of completion */
@@ -1860,7 +1862,7 @@
 
 	if (mad_send_wr->status != IB_WC_SUCCESS )
 		mad_send_wc->status = mad_send_wr->status;
-	if (ret == IB_MAD_RESULT_SUCCESS)
+	if (ret != IB_RMPP_RESULT_INTERNAL)
 		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
 						   mad_send_wc);
 
@@ -1869,6 +1871,9 @@
 		wake_up(&mad_agent_priv->wait);
 
 	kfree(mad_send_wr);
+	return;
+done:
+	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 }
 
 static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv,
@@ -2066,8 +2071,7 @@
 }
 
 static struct ib_mad_send_wr_private*
-find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv,
-		   u64 wr_id)
+find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, u64 wr_id)
 {
 	struct ib_mad_send_wr_private *mad_send_wr;
 
@@ -2234,6 +2238,7 @@
 	struct ib_mad_send_wr_private *mad_send_wr;
 	struct ib_mad_send_wc mad_send_wc;
 	unsigned long flags, delay;
+	int ret;
 
 	mad_agent_priv = (struct ib_mad_agent_private *)data;
 
@@ -2257,6 +2262,14 @@
 		}
 
 		list_del(&mad_send_wr->agent_list);
+		if (mad_agent_priv->agent.rmpp_version) {
+			ret = ib_timeout_rmpp(mad_send_wr);
+			if (ret == IB_RMPP_RESULT_CONSUMED) {
+				list_add_tail(&mad_send_wr->agent_list,
+					      &mad_agent_priv->send_list);
+				continue;
+			}
+		}
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 		mad_send_wc.wr_id = mad_send_wr->wr_id;
Index: core/mad_rmpp.h
===================================================================
--- core/mad_rmpp.h	(revision 2207)
+++ core/mad_rmpp.h	(working copy)
@@ -37,14 +37,24 @@
 
 #include "mad_priv.h"
 
+enum {
+	IB_RMPP_RESULT_PROCESSED,
+	IB_RMPP_RESULT_CONSUMED,
+	IB_RMPP_RESULT_INTERNAL,
+	IB_RMPP_RESULT_UNHANDLED
+};
+
+int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr);
+
 struct ib_mad_recv_wc *
-ib_process_rmpp_recv_wc(struct ib_mad_agent_private *mad_agent_priv,
+ib_process_rmpp_recv_wc(struct ib_mad_agent_private *agent,
 			struct ib_mad_recv_wc *mad_recv_wc);
 
-enum ib_mad_result
-ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr,
-			struct ib_mad_send_wc *mad_send_wc);
+int ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr,
+			    struct ib_mad_send_wc *mad_send_wc);
+
+void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent);
 
-void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *mad_agent_priv);
+int ib_timeout_rmpp(struct ib_mad_send_wr_private *mad_send_wr);
 
 #endif	/* __MAD_RMPP_H__ */
Index: core/mad_priv.h
===================================================================
--- core/mad_priv.h	(revision 2207)
+++ core/mad_priv.h	(working copy)
@@ -126,6 +126,15 @@
 	int retry;
 	int refcount;
 	enum ib_wc_status status;
+
+	/* RMPP control */
+	int last_ack;
+	int seg_num;
+	int newwin;
+	int total_seg;
+	int data_offset;
+	int pad;
+	int retries;
 };
 
 struct ib_mad_local_private {
@@ -198,4 +207,17 @@
 
 extern kmem_cache_t *ib_mad_cache;
 
+int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr);
+
+struct ib_mad_send_wr_private *
+ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv, u64 tid);
+
+void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
+			     struct ib_mad_send_wc *mad_send_wc);
+
+void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr);
+
+void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr,
+			  int timeout_ms);
+
 #endif	/* __IB_MAD_PRIV_H__ */


From greg at kroah.com  Thu Apr 21 23:14:43 2005
From: greg at kroah.com (Greg KH)
Date: Thu, 21 Apr 2005 23:14:43 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4268080E.3000303@ammasso.com>
References: <20050421173821.GA13312@hexapodia.org>
	<4267F367.3090508@ammasso.com>
	<20050421195641.GB13312@hexapodia.org>
	<4268080E.3000303@ammasso.com>
Message-ID: <20050422061443.GA10499@kroah.com>

On Thu, Apr 21, 2005 at 03:07:42PM -0500, Timur Tabi wrote:
> >*You* need to come up with a solution that looks good to *the community*
> >if you want it merged.  
> 
> True, but I'm not going to waste my time adding this support if the
> consensus I get from the kernel developers is that they don't want Linux
> to behave this way.

I think we have been giving you that consensus from the very
beginning :)

The very fact that you tried to trot out the "enterprise" card should
have raised a huge flag...

thanks,

greg k-h


From pavel at suse.cz  Thu Apr 21 12:47:06 2005
From: pavel at suse.cz (Pavel Machek)
Date: Thu, 21 Apr 2005 21:47:06 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263E53E.3090107@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com>
	<52is2kawsi.fsf@topspin.com> <4263E53E.3090107@ammasso.com>
Message-ID: <20050421194706.GE475@openzaurus.ucw.cz>

Hi!

> >    Timur> Why do you call mlock() and get_user_pages()?  In our
> >    Timur> code, we only call mlock(), and the memory is pinned.  We
> >    Timur> have a test case that fails if only get_user_pages() is
> >    Timur> called, but it passes if only mlock() is called.
> >
> >What if a buggy/malicious userspace program doesn't call mlock()?
> 
> Our library calls mlock() when the app requests memory to be
> "registered".  We then call munlock() when the app requests the 
> memory to be unregistered.  All apps talk to our library for all 
> services.  No apps talk to the driver directly.

That does not cover the "malicious" part.
				Pavel
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


From 7eggert at gmx.de  Fri Apr 22 06:10:09 2005
From: 7eggert at gmx.de (Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>)
Date: Fri, 22 Apr 2005 15:10:09 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
References: <3VAeQ-1To-7@gated-at.bofh.it> <3VNYt-4M4-15@gated-at.bofh.it>
Message-ID: <E1DOxv9-0000pc-Pe@be1.7eggert.dyndns.org>

Andy Isaacson <adi at hexapodia.org> wrote:
> On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote:

>> I don't know if VM_REGISTERED is a good idea or not, but it should be
>> absolutely impossible for the kernel to reclaim "registered" (aka pinned)
>> memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc),
>> it's normal for non-root processes to pin hundreds of megabytes of memory,
>> and that memory better be locked to those physical pages until the
>> application deregisters them.
> 
> If you take the hardline position that "the app is the only thing that
> matters", your code is unlikely to get merged.  Linux is a
> general-purpose OS.

All userspace hardware drivers with DMA will require pinned pages (and some
of them will require contiguous memory). Since this memory may be scheduled
to be accessed by DMA, reclaiming those pages may (aka. will) result in
"random" memory corruption unless done by the driver itself.

You can't even set a time limit, the driver may have allocated all DMA
memory to queued transfers, and some media needs to get plugged in by
the lazy robot. As soon as the robot arrives - boom. (For the same reason,
this memory MUST NOT be freed if the application terminates abnormally,
e.g. killed by OOM).

In other words, you need to make this memory as inaccessible as the
framebuffer on a graphics card. If that causes a lockup, you had better
have prevented that while allocating.

> In a Linux context, I doubt that fullblown SA is necessary or
> appropriate.  Rather, I'd suggest two new signals, SIGMEMLOW and
> SIGMEMCRIT.  The userland comms library registers handlers for both.
> When the kernel decides that it needs to reclaim some memory from the
> app, it sends SIGMEMLOW.  The comms library then has the responsibility
> to un-reserve some memory in an orderly fashion.  If a reasonable [1]
> time has expired since SIGMEMLOW and the kernel is still hungry, the
> kernel sends SIGMEMCRIT.  At this point, the comms lib *must* unregister
> some memory [2] even if it has to drop state to do so; if it returns
> from the signal handler without having unregistered the memory, the
> kernel will SIGKILL.

Choosing data loss over a system that stalls for a finite time may
sometimes be a bad decision.

If I designed an application that might get a "gimme memory or die",
I'd reserve an extra bunch of memory with the only purpose of being
released in this situation. If the kernel had done that instead, this
part of memory could have been used e.g. as a read-only disk cache in
the meantime (of course, provided somebody cared to implement that).
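
A minimal sketch of the "reserve to give back" idea, under stated
assumptions: SIGMEMLOW exists only as a proposal in this thread, so
SIGUSR1 stands in for it, and the names reserve_init() and
reserve_release_if_asked() are hypothetical. The handler only sets a
flag, because free() is not async-signal-safe; the actual release
happens from the main loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the proposed (non-existent) SIGMEMLOW. */
#define SIG_MEMLOW_STANDIN SIGUSR1

static volatile sig_atomic_t memlow_pending;
static void *reserve;
static size_t reserve_len;

static void on_memlow(int sig)
{
    (void)sig;
    memlow_pending = 1;     /* defer the free() to normal context */
}

/* Allocate the reserve, touch it so it is really backed by pages,
 * and install the handler.  Returns 0 on success, -1 on failure. */
int reserve_init(size_t len)
{
    struct sigaction sa;

    reserve = malloc(len);
    if (!reserve)
        return -1;
    memset(reserve, 0xA5, len);
    reserve_len = len;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_memlow;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIG_MEMLOW_STANDIN, &sa, NULL);
}

/* Call from the main loop.  Returns the number of bytes given back,
 * or 0 if nothing was requested or the reserve is already gone. */
size_t reserve_release_if_asked(void)
{
    size_t freed = 0;

    if (memlow_pending && reserve) {
        free(reserve);
        reserve = NULL;
        freed = reserve_len;
        reserve_len = 0;
    }
    memlow_pending = 0;
    return freed;
}
```

An application would call reserve_init() once at startup and
reserve_release_if_asked() on each pass through its event loop. Whether
the freed memory actually returns to the system depends on the
allocator.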

> [2] Is there a way for the kernel to pass down to userspace how many
>     pages it wants, maybe in the sigcontext?

Then you'd need only one signal.

I think this interface is useful; it would e.g. allow a picture viewer
to cache as many decoded and scaled pictures as the RAM permits, freeing
them if the RAM gets full and the swap would have to be used.

-- 
"When the pin is pulled, Mr. Grenade is not our friend.
-U.S. Marine Corps


From ftillier at infiniconsys.com  Fri Apr 22 10:01:55 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Fri, 22 Apr 2005 10:01:55 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <E1DOxv9-0000pc-Pe@be1.7eggert.dyndns.org>
Message-ID: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com>

> From: Bodo Eggert <harvested.in.lkml at posting.7eggert.dyndns.org>
> Sent: Friday, April 22, 2005 6:10 AM
> 
> All userspace hardware drivers with DMA will require pinned pages (and
> some of them will require contiguous memory). Since this memory may be
> scheduled to be accessed by DMA, reclaiming those pages may (aka. will)
> result in "random" memory corruption unless done by the driver itself.

Any reclaim must involve the driver.  That doesn't mean that it must involve
the application.  That said, this isn't trivial to implement.

> 
> You can't even set a time limit, the driver may have allocated all DMA
> memory to queued transfers, and some media needs to get plugged in by
> the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> this memory MUST NOT be freed if the application terminates abnormally,
> e.g. killed by OOM).

InfiniBand provides support for deregistering memory that might be
referenced at some future time by an RDMA operation.  The only side effect
this has is that the QPs on both sides of the connection transition to an
error state.

Upon abnormal termination, all registrations must be undone and the memory
unpinned.  This must be synchronized with the hardware so that there are no
races.  The IB deregistration semantics provide such synchronization.  I'd
venture that any HW design that does not do this is broken.

Requiring the memory to never be freed upon abnormal termination equates to
a serious memory leak, in that physical memory is leaked, not virtual.

- Fab


From eitan at mellanox.co.il  Fri Apr 22 10:41:58 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 22 Apr 2005 20:41:58 +0300
Subject: [openib-general] MAD/RMPP test program
Message-ID: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com>

Hi Sean,

Were you able to qualify the protocol implementation using an IB analyzer?

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Thursday, April 21, 2005 7:48 PM
> To: openib-general
> Subject: [openib-general] MAD/RMPP test program
> 
> For those interested (likely a few developers only), I've checked in a
> kernel test program that I used to stress the MAD/RMPP code.
> 
> gen2/utils/src/linux-kernel/infiniband/util/grmpp
> 
> - Sean

From timur.tabi at ammasso.com  Fri Apr 22 10:55:22 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Fri, 22 Apr 2005 12:55:22 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113840973.6274.84.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>	 <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
Message-ID: <42693A8A.80105@ammasso.com>

Arjan van de Ven wrote:
> On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote:
> 
>>Roland Dreier wrote:
>>
>>>    Troy> How is memory pinning handled? (I haven't had time to read
>>>    Troy> all the code, so please excuse my ignorance of something
>>>    Troy> obvious).
>>>
>>>The userspace library calls mlock() and then the kernel does
>>>get_user_pages().
>>
>>Why do you call mlock() and get_user_pages()?  In our code, we only call mlock(), and the 
>>memory is pinned. 
> 
> 
> this is a myth; linux is free to move the page about in physical memory
> even if it's mlock()ed!!

Can you tell me when Linux actually does this?  I know in theory it can happen, but I've 
never seen it.  Does the code to implement moving of data from one physical page to 
another even exist in any version of Linux?

Also, what would be the point?  What reason would there be to move some data from one 
physical page to another, while keeping the same virtual address?

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From mshefty at ichips.intel.com  Fri Apr 22 11:00:44 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 22 Apr 2005 11:00:44 -0700
Subject: [openib-general] MAD/RMPP test program
In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com>
References: <506C3D7B14CDD411A52C00025558DED6047EF16B@mtlex01.yok.mtl.com>
Message-ID: <42693BCC.7080106@ichips.intel.com>

Eitan Zahavi wrote:
> Hi Sean,
> 
> Were you able to qualify the protocol implementation using an IB analyzer?

Lacking a usable IB analyzer... no.  I did use the madeye utility to 
examine the headers for the window size, ACK format, timeouts, retries, 
etc.

If someone does run this against an analyzer and notices any issues, 
please let me know of them.

- Sean


From arjan at infradead.org  Fri Apr 22 11:12:58 2005
From: arjan at infradead.org (Arjan van de Ven)
Date: Fri, 22 Apr 2005 20:12:58 +0200
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <42693A8A.80105@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
	<42693A8A.80105@ammasso.com>
Message-ID: <1114193579.10355.38.camel@laptopd505.fenrus.org>

On Fri, 2005-04-22 at 12:55 -0500, Timur Tabi wrote:
> Arjan van de Ven wrote:
> > On Mon, 2005-04-18 at 11:09 -0500, Timur Tabi wrote:
> > 
> >>Roland Dreier wrote:
> >>
> >>>    Troy> How is memory pinning handled? (I haven't had time to read
> >>>    Troy> all the code, so please excuse my ignorance of something
> >>>    Troy> obvious).
> >>>
> >>>The userspace library calls mlock() and then the kernel does
> >>>get_user_pages().
> >>
> >>Why do you call mlock() and get_user_pages()?  In our code, we only call mlock(), and the 
> >>memory is pinned. 
> > 
> > 
> > this is a myth; linux is free to move the page about in physical memory
> > even if it's mlock()ed!!
> 
> Can you tell me when Linux actually does this?  I know in theory it can happen, but I've 
> never seen it.  Does the code to implement moving of data from one physical page to 
> another even exist in any version of Linux?

hot(un)plug memory.

> 
> Also, what would be the point?  What reason would there be to move some data from one 
> physical page to another, while keeping the same virtual address?

so that you can hot unplug the dimm in question.

I guess that's a bit of a high end thing though... so maybe you don't
care about it.


From 7eggert at gmx.de  Fri Apr 22 15:01:23 2005
From: 7eggert at gmx.de (Bodo Eggert)
Date: Sat, 23 Apr 2005 00:01:23 +0200 (CEST)
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com>
References: <000101c5475c$fe3c5fa0$8d5aa8c0@infiniconsys.com>
Message-ID: <Pine.LNX.4.58.0504222331190.5827@be1.lrz>

On Fri, 22 Apr 2005, Fab Tillier wrote:
> > From: Bodo Eggert <harvested.in.lkml at posting.7eggert.dyndns.org>
> > Sent: Friday, April 22, 2005 6:10 AM

> > You can't even set a time limit, the driver may have allocated all DMA
> > memory to queued transfers, and some media needs to get plugged in by
> > the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> > this memory MUST NOT be freed if the application terminates abnormally,
> > e.g. killed by OOM).
> 
> InfiniBand provides support for deregistering memory that might be
> referenced at some future time by an RDMA operation.  The only side effect
> this has is that the QP on both sides of the connection transition to an
> error state.
> 
> Upon abnormal termination, all registrations must be undone and the memory
> unpinned.  This must be synchronized with the hardware so that there are no
> races.

If you know the hardware. If you have userspace drivers, this will be
impossible, and even if you have kernel drivers, you'll need to know 
which of them is responsible for each part of the pinned memory.

This doesn't imply the affected memory to be lost. The same application
that created the pinned memory can reset the hardware (provided nobody
changed the configuration), then reconnect to the shared memory segment
you'll use for that purpose and use or free it.

-- 
To iterate is human; to recurse, divine. 


From tduffy at sun.com  Fri Apr 22 15:57:32 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 22 Apr 2005 15:57:32 -0700
Subject: [openib-general] [PATCHv4][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114126674.6858.31.camel@duffman>
References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>
	<1114126674.6858.31.camel@duffman>
Message-ID: <1114210652.5519.1.camel@duffman>

On Thu, 2005-04-21 at 16:37 -0700, Tom Duffy wrote:
> On Thu, 2005-04-21 at 16:31 -0700, Fab Tillier wrote:
> > Isn't the above change backwards?  The original code was copying settings
> > from listen_sk to sk, and the new code seems to be checking flags in sk to
> > determine whether to set them in listen_sk.
> 
> You are so right.  My brain ain't on today or something.

You know what, cancel this whole patch.  I have it wrong, and I am
reworking a new patch to work with the new sk_alloc().

-tduffy

From libor at topspin.com  Fri Apr 22 17:57:19 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 22 Apr 2005 17:57:19 -0700
Subject: [openib-general] [ANNOUNCE] Userspace Connection Manager
Message-ID: <20050422175719.A1735@topspin.com>


  I've made the initial check-in of the userspace connection manager
library. The kernel module that provides the access from userspace
to the kernel CM was checked in previously, and is already being built
as part of the core IB support. (ib_ucm.ko)

  To use the Userspace CM you'll need to create a single character
device file:

    mknod /dev/infiniband/ucm c 231 255

  There's a dependency on infiniband/verbs.h so you'll need libibverbs
installed on the same system. Check out src/userspace/libibcm and build:

    ./autogen.sh && ./configure && make && sudo make install

  The API is very similar to the kernel CM, as you will be able to tell
by looking at infiniband/cm.h (thanks, Sean), with the one notable
exception being CM event notification. Unlike the kernel, which delivers
events through a callback, the userspace CM does not deliver events; they
must be solicited with ib_cm_event_get(). The file descriptor used by the
CM can be retrieved for use in poll/select, so a user does not need to
block on event solicitation. Ideally an app should be able to use the
CM and verbs without needing to use threads for event handling.
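
  The poll-then-solicit pattern described above could look roughly like
the sketch below. It is an illustration only: a pipe stands in for the
CM file descriptor, the helper names wait_readable() and
demo_pipe_as_cm_fd() are hypothetical, and the actual call that fetches
the event (ib_cm_event_get()) is left out because its signature is not
given here.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd is readable, 0 on timeout, -1 on error.
 * A retry after EINTR restarts the full timeout. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Demo: a pipe plays the role of the CM file descriptor.
 * Returns 1 if "no event" and "event pending" are both seen. */
int demo_pipe_as_cm_fd(void)
{
    int p[2];
    int before, after;

    if (pipe(p) < 0)
        return -1;
    before = wait_readable(p[0], 0);        /* nothing queued yet */
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    after = wait_readable(p[0], 1000);      /* one "event" pending */
    close(p[0]);
    close(p[1]);
    return before == 0 && after == 1;
}
```

  In a real application the descriptor would sit in the same poll set
as the app's other descriptors, and the event would be solicited only
after wait_readable() reports 1.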

  There exists a simple example, which drives the CM through the
standard connection states, but does not actually create any QPs.

  Next step is more testing and to create a real example which actually
uses libibverbs, moves data, and uses the real SA to get path records.


-Libor


From akpm at osdl.org  Sat Apr 23 19:44:21 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 23 Apr 2005 19:44:21 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <4263E445.8000605@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
Message-ID: <20050423194421.4f0d6612.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Christoph Hellwig wrote:
> > On Mon, Apr 18, 2005 at 11:22:29AM -0500, Timur Tabi wrote:
> > 
> >>That's not what we're seeing.  We have hardware that does DMA over the 
> >>network (much like the Infiniband stuff), and we have a testcase that fails 
> >>if get_user_pages() is used, but not if mlock() is used.
> > 
> > 
> > If you don't share your testcase it's unlikely to be fixed.
> 
> As I said, the testcase only works with our hardware, and it's also very large.  It's one 
> small test that's part of a huge test suite.  It takes a couple hours just to install the 
> damn thing.
> 
> We want to produce a simpler test case that demonstrates the problem in an 
> easy-to-understand manner, but we don't have time to do that now.

If your theory is correct then it should be able to demonstrate this
problem without any special hardware at all: pin some user memory, then
generate memory pressure then check the contents of those pinned pages.

But if, for the DMA transfer, you're using the array of page*'s which were
originally obtained from get_user_pages() then it's rather hard to see how
the kernel could alter the page's contents.

Then again, if mlock() fixes it then something's up.  Very odd.
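
A rough userspace sketch of the experiment suggested above, with clear
limits: it only covers the mlock() side (get_user_pages() is a kernel
interface and cannot be called from here), and dirtying a second buffer
is a crude stand-in for real memory pressure. The function name and
both size parameters are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stdlib.h>
#include <sys/mman.h>

/* Returns 1 if the pattern is intact, 0 if it changed, and -1 if the
 * buffer could not be set up (e.g. RLIMIT_MEMLOCK is too small). */
int pinned_pattern_survives(size_t pin_len, size_t pressure_len)
{
    unsigned char *pin;
    volatile unsigned char *pressure;
    size_t i;
    int ok = 1;

    pin = mmap(NULL, pin_len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pin == MAP_FAILED)
        return -1;
    if (mlock(pin, pin_len) != 0) {
        munmap(pin, pin_len);
        return -1;
    }

    /* Fill the locked buffer with a recognisable pattern. */
    for (i = 0; i < pin_len; i++)
        pin[i] = (unsigned char)(i * 31 + 7);

    /* Crude "pressure": allocate another buffer and dirty each page. */
    pressure = malloc(pressure_len);
    if (pressure) {
        for (i = 0; i < pressure_len; i += 4096)
            pressure[i] = 0x5A;
        free((void *)pressure);
    }

    /* Check that the pattern is still there. */
    for (i = 0; i < pin_len; i++)
        if (pin[i] != (unsigned char)(i * 31 + 7))
            ok = 0;

    munlock(pin, pin_len);
    munmap(pin, pin_len);
    return ok;
}
```

To be meaningful, pressure_len would have to be large enough to push
the machine into reclaim, which this sketch does not guarantee.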


From timur.tabi at ammasso.com  Sun Apr 24 07:23:48 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Sun, 24 Apr 2005 09:23:48 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050423194421.4f0d6612.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
Message-ID: <426BABF4.3050205@ammasso.com>

Andrew Morton wrote:

> If your theory is correct then it should be able to demonstrate this
> problem without any special hardware at all: pin some user memory, then
> generate memory pressure then check the contents of those pinned pages.

I tried that, but I couldn't get it to fail.  But that was a while ago, and I've learned a 
few things since then, so I'll try again.

> But if, for the DMA transfer, you're using the array of page*'s which were
> originally obtained from get_user_pages() then it's rather hard to see how
> the kernel could alter the page's contents.
> 
> Then again, if mlock() fixes it then something's up.  Very odd.

With mlock(), we don't need to use get_user_pages() at all.  Arjan tells me the only time 
an mlocked page can move is with hot (un)plug of memory, but that isn't supported on the 
systems that we support.  We actually prefer mlock() over get_user_pages(), because if the 
process dies, the locks automatically go away too.


From greg at kroah.com  Sun Apr 24 13:53:10 2005
From: greg at kroah.com (Greg KH)
Date: Sun, 24 Apr 2005 13:53:10 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426BABF4.3050205@ammasso.com>
References: <20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com>
Message-ID: <20050424205309.GA5386@kroah.com>

On Sun, Apr 24, 2005 at 09:23:48AM -0500, Timur Tabi wrote:
> Andrew Morton wrote:
> 
> >If your theory is correct then it should be able to demonstrate this
> >problem without any special hardware at all: pin some user memory, then
> >generate memory pressure then check the contents of those pinned pages.
> 
> I tried that, but I couldn't get it to fail.  But that was a while ago, and 
> I've learned a few things since then, so I'll try again.
> 
> >But if, for the DMA transfer, you're using the array of page*'s which were
> >originally obtained from get_user_pages() then it's rather hard to see how
> >the kernel could alter the page's contents.
> >
> >Then again, if mlock() fixes it then something's up.  Very odd.
> 
> With mlock(), we don't need to use get_user_pages() at all.  Arjan tells me 
> the only time an mlocked page can move is with hot (un)plug of memory, but 
> that isn't supported on the systems that we support.

You don't "support" i386 or ia64 or x86-64 or ppc64 systems?  What
hardware do you support?  And what about the fact that you are aiming to
get this code into mainline, right?  If not, why are you asking here?
:)

thanks,

greg k-h


From timur.tabi at ammasso.com  Sun Apr 24 14:52:31 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Sun, 24 Apr 2005 16:52:31 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050424205309.GA5386@kroah.com>
References: <20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com>
Message-ID: <426C151F.3000407@ammasso.com>

Greg KH wrote:

> You don't "support" i386 or ia64 or x86-64 or ppc64 systems?  What
> hardware do you support? 

I've never seen or heard of any x86-32 or x86-64 system that supports hot-swap RAM. Our 
hardware does not support PPC, and our software doesn't support ia-64.

> And what about the fact that you are aiming to
> get this code into mainline, right?  If not, why are you asking here?
> :)

Well, our primary concern is getting our stuff to work.  Since get_user_pages() doesn't 
work, but mlock() does, that's what we use.  I don't know how to fix get_user_pages(), and 
I don't have the time right now to figure it out.  I know that technically mlock() is not 
the right way to do it, and so we're not going to be submitting our code for the mainline 
until get_user_pages() works and our code uses it instead of mlock().


From greg at kroah.com  Sun Apr 24 18:03:51 2005
From: greg at kroah.com (Greg KH)
Date: Sun, 24 Apr 2005 18:03:51 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426C151F.3000407@ammasso.com>
References: <20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com>
	<426C151F.3000407@ammasso.com>
Message-ID: <20050425010351.GA21246@kroah.com>

On Sun, Apr 24, 2005 at 04:52:31PM -0500, Timur Tabi wrote:
> Greg KH wrote:
> 
> >You don't "support" i386 or ia64 or x86-64 or ppc64 systems?  What
> >hardware do you support? 
> 
> I've never seen or heard of any x86-32 or x86-64 system that supports 
> hot-swap RAM.

I know of at least 1 x86-32 box from a three-letter-named company with
this feature that has been shipping for a few _years_ now.  That box is
pretty much everywhere now, and I know that other versions of it are
also quite popular (despite the high cost...)

> Our hardware does not support PPC, and our software doesn't support
> ia-64.

Your hardware is just a pci card, right?  Why wouldn't it work on ppc64
and ia64 then?

> > And what about the fact that you are aiming to
> >get this code into mainline, right?  If not, why are you asking here?
> >:)
> 
> Well, our primary concern is getting our stuff to work.  Since 
> get_user_pages() doesn't work, but mlock() does, that's what we use.  I 
> don't know how to fix get_user_pages(), and I don't have the time right now 
> to figure it out.  I know that technically mlock() is not the right way to 
> do it, and so we're not going to be submitting our code for the mainline 
> until get_user_pages() works and our code uses it instead of mlock().

Wait, what _is_ "your stuff"?  The open-ib code?  Or some other, private
fork?  Any pointers to this stuff?

thanks,

greg k-h


From timur.tabi at ammasso.com  Sun Apr 24 21:12:20 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Sun, 24 Apr 2005 23:12:20 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425010351.GA21246@kroah.com>
References: <20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com>
	<426C151F.3000407@ammasso.com> <20050425010351.GA21246@kroah.com>
Message-ID: <426C6E24.5050203@ammasso.com>

Greg KH wrote:

> I know of at least 1 x86-32 box from a three-letter-named company with
> this feature that has been shipping for a few _years_ now.  That box is
> pretty much everywhere now, and I know that other versions of it are
> also quite popular (despite the high cost...)

Hmm... Well, I think we were already planning on telling our customers that we don't 
support hot-swap RAM.  Is there a CONFIG option for that feature?

> Your hardware is just a pci card, right?  Why wouldn't it work on ppc64
> and ia64 then?

It's PCI-X, actually, and I don't think we've ever actually plugged it into a PPC box. 
Isn't Open Firmware support required for all PPC boxes, anyway?  Our PCI card is not OF 
compatible, AFAIK.

As for IA64, well, we could support it, but it's not a high enough priority.  We do have 
some CPU-specific code in our driver that we would need to port to IA-64.

> Wait, what _is_ "your stuff"?  The open-ib code?

No, if anything, it's the competition to IB.  It's called iWARP (RDMA over TCP/IP), and 
it's similar to IB except it uses gigabit ethernet instead of whatever hardware IB uses. 
Because we also support RDMA, we have the same problems as OpenIB.  However, we would 
prefer that the kernel support OpenRDMA instead, since it's more generic.

> Or some other, private
> fork?  Any pointers to this stuff?

http://ammasso.com/support.html

The current version of the code calls sys_mlock() directly from the driver.  We haven't 
yet released the version that calls mlock().


From roland at topspin.com  Mon Apr 25 06:15:10 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 06:15:10 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
Message-ID: <52is2bvvz5.fsf@topspin.com>

    Timur> With mlock(), we don't need to use get_user_pages() at all.
    Timur> Arjan tells me the only time an mlocked page can move is
    Timur> with hot (un)plug of memory, but that isn't supported on
    Timur> the systems that we support.  We actually prefer mlock()
    Timur> over get_user_pages(), because if the process dies, the
    Timur> locks automatically go away too.

There actually is another way pages can move, with both
get_user_pages() and mlock(): copy-on-write after a fork().  If
userspace does a fork(), then all PTEs are marked read-only, and if
the original process touches the page after the fork(), a new page
will be allocated and mapped at the original virtual address.

This is actually a pretty big pain, because the only good solution
seems to be for the kernel to mark these registered regions as
VM_DONTCOPY.  Right now this means that driver code ends up monkeying
with vm_flags for user vmas.

Does it seem reasonable to add a new system call to let userspace mark
memory it doesn't want copied into forked processes?  Something like

	long sys_mark_nocopy(unsigned long addr, size_t len, int mark)

which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
A better name would be gratefully accepted...

Then to register memory for RDMA, userspace would call
sys_mark_nocopy() (with appropriate accounting to handle possibly
overlapping regions) and the kernel would call get_user_pages().  The
get_user_pages() is of course required because the kernel can't trust
userspace to keep the pages locked.  mlock() would no longer be
necessary.  We can trust userspace to call sys_mark_nocopy() as
needed, because a process can only hurt itself and its children by
misusing the sys_mark_nocopy() call.
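
For illustration only: sys_mark_nocopy() is just a proposal in this
message. Later kernels (2.6.16 and newer, per the madvise(2) man page)
provide madvise(MADV_DONTFORK), which sets VM_DONTCOPY on a range. The
sketch below assumes that interface is available and checks that a
marked page is simply absent in the child after fork(). The function
name is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the marked page is not mapped in the child, 0 if it is
 * still mapped there, and -1 on setup failure. */
int dontfork_hides_region_from_child(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char vec;
    int status;
    pid_t pid;
    char *buf;

    if (page <= 0)
        return -1;
    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;                         /* fault the page in */

    if (madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* mincore() fails with ENOMEM on an unmapped range. */
        int hidden = mincore(buf, (size_t)page, &vec) != 0 &&
                     errno == ENOMEM;
        _exit(hidden ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid) {
        munmap(buf, (size_t)page);
        return -1;
    }
    munmap(buf, (size_t)page);
    if (!WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Because the child never maps the page, the parent's mapping is not
write-protected for copy-on-write by the fork(), which is the property
the registration path needs.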

If this seems reasonable then I can code a patch.

 - R.


From hch at infradead.org  Mon Apr 25 06:17:53 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Mon, 25 Apr 2005 14:17:53 +0100
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52is2bvvz5.fsf@topspin.com>
References: <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
Message-ID: <20050425131753.GA8860@infradead.org>

On Mon, Apr 25, 2005 at 06:15:10AM -0700, Roland Dreier wrote:
> Does it seem reasonable to add a new system call to let userspace mark
> memory it doesn't want copied into forked processes?  Something like
> 
> 	long sys_mark_nocopy(unsigned long addr, size_t len, int mark)
> 
> which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
> A better name would be gratefully accepted...

add a new MAP_DONTCOPY flag and accept it in mmap and mprotect?


From haveblue at us.ibm.com  Mon Apr 25 06:30:22 2005
From: haveblue at us.ibm.com (Dave Hansen)
Date: Mon, 25 Apr 2005 06:30:22 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426C6E24.5050203@ammasso.com>
References: <20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com> <20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com> <20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com> <20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <20050424205309.GA5386@kroah.com>
	<426C151F.3000407@ammasso.com> <20050425010351.GA21246@kroah.com>
	<426C6E24.5050203@ammasso.com>
Message-ID: <1114435822.14501.51.camel@localhost>

On Sun, 2005-04-24 at 23:12 -0500, Timur Tabi wrote:
> Greg KH wrote:
> > I know of at least 1 x86-32 box from a three-letter-named company with
> > this feature that has been shipping for a few _years_ now.  That box is
> > pretty much everywhere now, and I know that other versions of it are
> > also quite popular (despite the high cost...)
> 
> Hmm... Well, I think we were already planning on telling our customers that we don't 
> support hot-swap RAM.  Is there a CONFIG option for that feature?

The driver to do the ACPI portion of both add and remove is in the
kernel today, so it's certainly a feature that's coming relatively soon.

There is a large variety of x86_64, ppc64 and ia64 hardware that
will be doing memory hotplug.  I believe that every POWER5 system is
capable of supporting it, at least virtually.

I don't think your concerns end with memory hotplug.  The same
approaches to moving memory around will be used for NUMA memory
balancing and for memory defragmentation.  Can you say that your cards
will never be used on a system which has memory which becomes
fragmented?

-- Dave


From hozer at hozed.org  Mon Apr 25 06:31:31 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 25 Apr 2005 08:31:31 -0500
Subject: [openib-general] IBM eHCA Device Driver for gen1 IB stack
In-Reply-To: <52ll7d2y9u.fsf@topspin.com>
References: <OF2C40F988.88F5D2C7-ONC1256FE9.00588C28-C1256FE9.00597C4A@de.ibm.com>
	<52ll7d2y9u.fsf@topspin.com>
Message-ID: <20050425133131.GT999@kalmia.hozed.org>

On Wed, Apr 20, 2005 at 09:44:45AM -0700, Roland Dreier wrote:
>     > Hi, we've just released the first linux device driver for
>     > the IBM eServer HCA for Power5.  It's gen1 based and runs
>     > on SLES9 SP1.  Main testvehicle for this code was IPoIB.
> 
>     > gen2 and full userspace support will be next.
> 
> Excellent, I'm glad to see this released.  I'm looking forward to
> seeing the gen2 support.
> 
> If I may make a small suggestion for future releases: please have the
> tar file contain a top-level directory like ehca-0021, with everything
> contained in that directory.  It's a little annoying to unpack a tar
> file and have it spread 5 files in your working directory, especially
> when some have generic names like "INSTALL" or "patches."

What will it take to get the Gen2 support into the openib.org subversion
tree?

How much is the low-level driver likely to change once it's written and
working?


From panda at cse.ohio-state.edu  Mon Apr 25 07:00:44 2005
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon, 25 Apr 2005 10:00:44 -0400 (EDT)
Subject: [openib-general] Announcing the release of OSU MVAPICH 0.9.5
Message-ID: <200504251400.j3PE0jkL009406@xi.cse.ohio-state.edu>

The MVAPICH (MPI over InfiniBand) team at the Ohio State University is
pleased to announce the release of MVAPICH 0.9.5 for multiple
platforms (EM64T, G5, IA-32, IA-64, and Opteron) and network
interfaces (PCI-X and PCI-Express, including the new mem-free cards).

MVAPICH 0.9.5 is being distributed as a single integrated package
(with the latest MPICH 1.2.6 and MVICH). It can be downloaded with a
`single click' and installed. It is available under BSD license.

MVAPICH/MVAPICH2 software is being used by more than 200 organizations
world-wide (in 26 countries) to extract the potential of InfiniBand
networking technology for designing high-end computing systems and
servers. It is also being distributed by many IBA vendors in their
software distributions.

The current version (MVAPICH 0.9.5) provides support for the VAPI
layer. As indicated below, an implementation of MVAPICH 0.9.5 on the
OpenIB Gen2 interface will be available soon.

This new release has the following features:

      - multi-rail support (multiple adapters per node and 
                 multiple ports per adapter) 

      - optimized intra-node shared memory support 
                 (both for bus-based and NUMA-based systems)  

      - enhanced MPI broadcast support with IBA hardware-based
                 multicast

      - flexible mechanisms for minimizing memory resource 
                 usage on large scale clusters 

      - support for TotalView debugger 

      - optimized and tuned for the above platforms and different 
                 network interfaces (PCI-X and PCI-Express)

      - single code base for all of the above platforms

Other features of this release include:

- Excellent performance: MVAPICH 0.9.5 with multi-rail (1-NIC, 2-port)
  delivers 4.0 microsec latency, up to 1498 MB/sec unidirectional
  bandwidth, and up to 2704 MB/sec bidirectional bandwidth on EM64T 
  system with PCI-Express. Detailed performance numbers for other 
  platforms are available on the project's web page. 

- An enhanced and detailed `User and Tuning Guide' to assist users: 

       - to install this package on different platforms 
            with different options

       - to vary different parameters of the MPI installation to 
            extract maximum performance and achieve scalability, 
            especially on large-scale systems.

You are welcome to download the MVAPICH 0.9.5 package and access
relevant information from the following URL:

http://nowlab.cis.ohio-state.edu/projects/mpi-iba/

Since the 0.9.4 release, we have introduced a set of patches based on
user feedback. If you plan to continue using 0.9.4 for some more
time, we strongly encourage you to download and apply these patches to
your current installation.

Our upcoming releases include:

    - an OpenIB Gen2 version of MVAPICH 0.9.5

    - MVAPICH2 0.6.5 with uDAPL support to run on different networks
      with uDAPL interface

All feedback, including bug reports and hints for performance tuning,
is welcome. Please send an e-mail to mvapich-help at cse.ohio-state.edu.

Thanks, 

MVAPICH Team at OSU/NBCL 


From roland at topspin.com  Mon Apr 25 07:16:23 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 07:16:23 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425131753.GA8860@infradead.org> (Christoph Hellwig's
	message of "Mon, 25 Apr 2005 14:17:53 +0100")
References: <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425131753.GA8860@infradead.org>
Message-ID: <523btfvt54.fsf@topspin.com>

    Roland> Does it seem reasonable to add a new system call to let
    Roland> userspace mark memory it doesn't want copied into forked
    Roland> processes?  Something like

    Roland> long sys_mark_nocopy(unsigned long addr, size_t len, int
    Roland> mark)

    Roland> which would set VM_DONTCOPY if mark != 0, and clear it if
    Roland> mark == 0.  A better name would be gratefully accepted...

    Christoph> add a new MAP_DONTCOPY flag and accept it in mmap and
    Christoph> mprotect?

That is much better, thanks.  But I think it would need to be
PROT_DONTCOPY to work with mprotect(), right?

 - R.
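
For illustration, the set/clear semantics proposed above can be modelled
in plain userspace C.  This is only a sketch under stated assumptions:
struct region, region_mark_nocopy(), region_copied_on_fork() and the
flag value are hypothetical stand-ins, not kernel code.  Only the
behaviour (set the flag if mark != 0, clear it if mark == 0) follows
the sys_mark_nocopy() description.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's VM_DONTCOPY vma flag;
 * the numeric value here is illustrative only. */
#define SKETCH_VM_DONTCOPY 0x1UL

/* Hypothetical model of one mapped range and its flags. */
struct region {
    unsigned long addr;
    size_t len;
    unsigned long flags;
};

/* Mirrors the proposed call: set the flag if mark != 0,
 * clear it if mark == 0.  Returns 0 on success, -1 on bad input. */
static long region_mark_nocopy(struct region *r, int mark)
{
    if (r == NULL || r->len == 0)
        return -1;
    if (mark)
        r->flags |= SKETCH_VM_DONTCOPY;
    else
        r->flags &= ~SKETCH_VM_DONTCOPY;
    return 0;
}

/* A model of fork() would skip every region for which this returns 0. */
static int region_copied_on_fork(const struct region *r)
{
    return (r->flags & SKETCH_VM_DONTCOPY) == 0;
}
```

Whether the flag is carried by a new system call, by mmap(), or by
mprotect() does not change this model; only the entry point differs.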


From caitlin.bestler at gmail.com  Mon Apr 25 07:43:22 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 25 Apr 2005 07:43:22 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52is2bvvz5.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com>
Message-ID: <469958e00504250743340ff9e9@mail.gmail.com>

On 4/25/05, Roland Dreier <roland at topspin.com> wrote:
>     Timur> With mlock(), we don't need to use get_user_pages() at all.
>     Timur> Arjan tells me the only time an mlocked page can move is
>     Timur> with hot (un)plug of memory, but that isn't supported on
>     Timur> the systems that we support.  We actually prefer mlock()
>     Timur> over get_user_pages(), because if the process dies, the
>     Timur> locks automatically go away too.
> 
> There actually is another way pages can move, with both
> get_user_pages() and mlock(): copy-on-write after a fork().  If
> userspace does a fork(), then all PTEs are marked read-only, and if
> the original process touches the page after the fork(), a new page
> will be allocated and mapped at the original virtual address.
> 
> This is actually a pretty big pain, because the only good solution
> seems to be for the kernel to mark these registered regions as
> VM_DONTCOPY.  Right now this means that driver code ends up monkeying
> with vm_flags for user vmas.
> 
> Does it seem reasonable to add a new system call to let userspace mark
> memory it doesn't want copied into forked processes?  Something like
> 
>         long sys_mark_nocopy(unsigned long addr, size_t len, int mark)
> 
> which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
> A better name would be gratefully accepted...
> 
> Then to register memory for RDMA, userspace would call
> sys_mark_nocopy() (with appropriate accounting to handle possibly
> overlapping regions) and the kernel would call get_user_pages().  The
> get_user_pages() is of course required because the kernel can't trust
> userspace to keep the pages locked.  mlock() would no longer be
> necessary.  We can trust userspace to call sys_mark_nocopy() as
> needed, because a process can only hurt itself and its children by
> misusing the sys_mark_nocopy() call.
> 
> If this seems reasonable then I can code a patch.
> 

Who is responsible for counting within a process, and
then between processes (in case shared memory is
being registered)? The application? Middleware? Driver?

My concern here is that the application layer may not
be fully aware when middleware is registering memory,
and middleware may not be fully aware when the memory
it receives from the application is shared with another
process.


From roland at topspin.com  Mon Apr 25 08:34:06 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 08:34:06 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <469958e00504250743340ff9e9@mail.gmail.com> (Caitlin Bestler's
	message of "Mon, 25 Apr 2005 07:43:22 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com>
	<469958e00504250743340ff9e9@mail.gmail.com>
Message-ID: <52y8b6vpjl.fsf@topspin.com>

    Caitlin> Who is responsible for counting within a process, and
    Caitlin> then between processes (in case shared memory is being
    Caitlin> registered)? The application? Middleware? Driver?

The verbs code doing the registration should do it as part of the
registration.  Shared memory does not cause any additional issues
because it is mapped into the virtual memory map of each process and
must be marked VM_DONTCOPY in each process separately.

 - R.


From caitlin.bestler at gmail.com  Mon Apr 25 08:49:40 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 25 Apr 2005 08:49:40 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52y8b6vpjl.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com>
	<469958e00504250743340ff9e9@mail.gmail.com>
	<52y8b6vpjl.fsf@topspin.com>
Message-ID: <469958e005042508494a6cc9c9@mail.gmail.com>

That leaves a problem when the same memory region is
registered with different vendors. Verbs A marks the area,
Verbs B sees that it is already marked, Verbs A unmarks
the area when it is done not knowing that B is relying
on the memory staying pinned.

I do not believe there is a solution to this problem when
working at arms length from Linux other than documenting
the problem and informing applications of workarounds
required when using multiple vendors concurrently with
the same memory (i.e., destroy the most recently
created memory region first, or pin the memory
yourself before creating the first memory region).

The only other alternative is to make the pinning
some sort of shared service that would apply across
multiple vendors. That is doable, but might not be
worthwhile given that a single process using multiple
vendor devices concurrently is decidedly the exception.
But those users deserve at least a warning.
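
The "shared service" alternative can be sketched as a per-page
reference count that every verbs provider goes through, so that one
provider releasing a region cannot unpin pages another still relies
on.  This is a minimal userspace model, not driver code: struct
pin_table, pin_range(), unpin_range(), page_is_pinned() and the fixed
table size are all hypothetical names and choices made for the
example.

```c
#include <stddef.h>

#define SKETCH_MAX_PAGES 64

/* Hypothetical shared pinning service: one count per page,
 * shared by all verbs providers in the process. */
struct pin_table {
    unsigned int count[SKETCH_MAX_PAGES];
};

/* Range check shared by pin and unpin; avoids overflow in first + npages. */
static int range_ok(size_t first, size_t npages)
{
    return first <= SKETCH_MAX_PAGES && npages <= SKETCH_MAX_PAGES - first;
}

/* Take one reference on each page in [first, first + npages). */
static int pin_range(struct pin_table *t, size_t first, size_t npages)
{
    if (!range_ok(first, npages))
        return -1;
    for (size_t i = first; i < first + npages; i++)
        t->count[i]++;
    return 0;
}

/* Drop one reference on each page; fails without side effects
 * if any page in the range is not currently pinned. */
static int unpin_range(struct pin_table *t, size_t first, size_t npages)
{
    if (!range_ok(first, npages))
        return -1;
    for (size_t i = first; i < first + npages; i++)
        if (t->count[i] == 0)
            return -1;
    for (size_t i = first; i < first + npages; i++)
        t->count[i]--;
    return 0;
}

/* A page stays pinned (and would stay marked) while any count remains. */
static int page_is_pinned(const struct pin_table *t, size_t page)
{
    return page < SKETCH_MAX_PAGES && t->count[page] > 0;
}
```

With this model, Verbs A pinning pages 0-3 and Verbs B pinning pages
2-5 leaves pages 2-5 pinned after A releases its range, which is the
case the per-vendor marking gets wrong.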


On 4/25/05, Roland Dreier <roland at topspin.com> wrote:
>     Caitlin> Who is responsible for counting within a process, and
>     Caitlin> then between processes (in case shared memory is being
>     Caitlin> registered)? The application? Middleware? Driver?
> 
> The verbs code doing the registration should do it as part of the
> registration.  Shared memory does not cause any additional issues
> because it is mapped into the virtual memory map of each process and
> must be marked VM_DONTCOPY in each process separately.
> 
>  - R.
> 


From tduffy at sun.com  Mon Apr 25 10:09:00 2005
From: tduffy at sun.com (Tom Duffy)
Date: Mon, 25 Apr 2005 10:09:00 -0700
Subject: [openib-general] [PATCH][SDP] fix panic when cat'ing
	/proc/net/sdp/conn_main
Message-ID: <1114448940.13354.8.camel@duffman>

If you start up something like ./ttcp.aio.x -r -l 65536 -a 20 with no
SM running on your subnet, and then cat /proc/net/sdp/conn_main, you
will panic:

Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
<ffffffff882af935>{:ib_sdp:sdp_proc_dump_conn_main+469}
PGD 33943067 PUD 338ad067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa md5 ipv6 parport_pc lp parport autofs4 nfs lockd rfcomm l2cap bluetooth pcmcia yenta_socket rsrc_nonstatic pcmcia_core sunrpc ext3 jbd dm_mod video container button battery ac ohci_hcd i2c_amd756 i2c_core ib_mthca ib_mad ib_core tg3 floppy xfs exportfs mptscsih mptbase sd_mod scsi_mod
Pid: 5548, comm: cat Not tainted 2.6.11.7openib
RIP: 0010:[<ffffffff882af935>] <ffffffff882af935>{:ib_sdp:sdp_proc_dump_conn_main+469}
RSP: 0018:ffff8100778cbd78  EFLAGS: 00010056
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff882c24f0 RDI: ffff810033f9418a
RBP: 000000000000018a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff81003a9219c0 R12: 0000000000000000
R13: ffff810033f94000 R14: 0000000000000400 R15: ffff8100778cbe98
FS:  00002aaaaaad4b00(0000) GS:ffffffff8047dc00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 000000003385d000 CR4: 00000000000006e0
Process cat (pid: 5548, threadinfo ffff8100778ca000, task ffff81007f3bcf50)
Stack: ffff8100778cbddc ffff81003b402010 ffff81003bbfb9b0 ffff81003bb6a940
       ffff81003bb6a940 0000000000000292 0000000000000292 ffffffff8016be89
       ffff8100000015a5 0000000000000000
Call Trace:<ffffffff8016be89>{do_no_page+729} <ffffffff882a8b25>{:ib_sdp:sdp_proc_read_parse+37}
       <ffffffff801b7093>{proc_file_read+227} <ffffffff8017d725>{vfs_read+229}
       <ffffffff8017da33>{sys_read+83} <ffffffff8010e3da>{system_call+126}

After this patch:

[root at sins-stinger-10 ~]# cat /proc/net/sdp/conn_main
dst address:port src address:port  ID  comm_id  pid      dst guid         src guid     dlid slid dqpn   sqpn   data sent buff'd data rcvd_buff'd   data written      data read     src_serv snk_serv
---------------- ---------------- ---- -------- ---- ---------------- ---------------- ---- ---- ------ ------ ---------------- ---------------- ---------------- ---------------- -------- --------
00.00.00.00:0000 00.00.00.00:1389 0000 00000000 155a 0000000000000000 0000000000000000 0000 0000 000000 000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 00000000 00000000

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(revision 2207)
+++ linux-2.6.11-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(working copy)
@@ -1384,7 +1384,7 @@ int sdp_proc_dump_conn_main(char *buffer
 				  ((conn->src_addr >> 24) & 0xff),
 				  conn->src_port, 
 				  conn->hashent,
-				  conn->cm_id->local_id,
+				  conn->cm_id ? conn->cm_id->local_id : 0,
 				  conn->pid,
 				  (u32)((d_guid >> 32) & 0xffffffff),
 				  (u32)(d_guid & 0xffffffff),
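
The guard in the hunk above can be shown in isolation.  This is a
sketch only: struct sketch_cm_id, struct sketch_conn and
conn_local_id() are hypothetical, trimmed-down stand-ins for the real
SDP and CM structures, which are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the CM identifier
 * and the SDP connection. */
struct sketch_cm_id {
    unsigned int local_id;
};

struct sketch_conn {
    struct sketch_cm_id *cm_id;   /* may be NULL before a CM id is attached */
};

/* Same pattern as the patch: report 0 instead of dereferencing
 * a NULL cm_id. */
static unsigned int conn_local_id(const struct sketch_conn *conn)
{
    return conn->cm_id ? conn->cm_id->local_id : 0;
}
```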


From adi at hexapodia.org  Mon Apr 25 12:11:11 2005
From: adi at hexapodia.org (Andy Isaacson)
Date: Mon, 25 Apr 2005 12:11:11 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050423194421.4f0d6612.akpm@osdl.org>
References: <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
Message-ID: <20050425191111.GC2511@hexapodia.org>

On Sat, Apr 23, 2005 at 07:44:21PM -0700, Andrew Morton wrote:
> Timur Tabi <timur.tabi at ammasso.com> wrote:
> > As I said, the testcase only works with our hardware, and it's also
> > very large.  It's one small test that's part of a huge test suite.
> > It takes a couple hours just to install the damn thing.
> > 
> > We want to produce a simpler test case that demonstrates the problem in an 
> > easy-to-understand manner, but we don't have time to do that now.
> 
> If your theory is correct then it should be able to demonstrate this
> problem without any special hardware at all: pin some user memory, then
> generate memory pressure then check the contents of those pinned pages.
> 
> But if, for the DMA transfer, you're using the array of page*'s which were
> originally obtained from get_user_pages() then it's rather hard to see how
> the kernel could alter the page's contents.
> 
> Then again, if mlock() fixes it then something's up.  Very odd.

Andrew,

Libor Michalek posted a much more reasonable (to my limited
understanding) bug description in <20050412180447.E6958 at topspin.com>.

(And I'd love to provide a URL, but damned if I can figure out how to
find that message on gmane.  Clue-bat applications gladly accepted.)

Libor Michalek wrote:
# The driver did use get_user_pages() to elevate the refcount on all the
# pages it was going to use for IO, as well as call set_page_dirty() since
# the pages were going to have data written to them from the device.
# 
# The problem we were seeing is that the minor fault by the app resulted
# in a new physical page getting mapped for the application. The page that
# had the elevated refcount was still waiting for the data to be written
# to by the driver at the time that the app accessed the page causing the
# minor fault. Obviously since the app had a new mapping the data written
# by the driver was lost.
# 
# It looks like code was added to try_to_unmap_one() to address this, so   
# hopefully it's no longer an issue...

Which makes me think that Timur's bug is just an
insufficiently-understood version of Libor's.

-andy


From akpm at osdl.org  Mon Apr 25 13:54:01 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 13:54:01 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52is2bvvz5.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
Message-ID: <20050425135401.65376ce0.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Timur> With mlock(), we don't need to use get_user_pages() at all.
>      Timur> Arjan tells me the only time an mlocked page can move is
>      Timur> with hot (un)plug of memory, but that isn't supported on
>      Timur> the systems that we support.  We actually prefer mlock()
>      Timur> over get_user_pages(), because if the process dies, the
>      Timur> locks automatically go away too.
> 
>  There actually is another way pages can move, with both
>  get_user_pages() and mlock(): copy-on-write after a fork().  If
>  userspace does a fork(), then all PTEs are marked read-only, and if
>  the original process touches the page after the fork(), a new page
>  will be allocated and mapped at the original virtual address.

Do we care about that?  A straightforward scenario under which this can
happen is:

a) app starts some read I/O in an asynchronous manner
b) app forks
c) child writes to one of the pages which is still under read I/O
d) the read I/O completes
e) the child is left with the old data plus the child's modification instead
   of the new data

which is a very silly application which is giving itself unpredictable
memory contents anyway.

I assume there's a more sensible scenario?


From roland at topspin.com  Mon Apr 25 14:12:40 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 14:12:40 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425135401.65376ce0.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 13:54:01 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
Message-ID: <521x8yv9vb.fsf@topspin.com>

    Andrew> Do we care about that?  A straightforward scenario under
    Andrew> which this can happen is:

    Andrew> a) app starts some read I/O in an asynchronous manner
    Andrew> b) app forks
    Andrew> c) child writes to one of the pages which is still under read I/O
    Andrew> d) the read I/O completes
    Andrew> e) the child is left with the old data plus the child's modification instead
    Andrew>    of the new data

    Andrew> which is a very silly application which is giving itself
    Andrew> unpredictable memory contents anyway.

    Andrew> I assume there's a more sensible scenario?

You're right, that is a silly scenario ;)  In fact if we mark vmas
with VM_DONTCOPY, then the child just crashes with a seg fault.

The type of thing I'm worried about is something like, for example:

a) app registers memory region with RDMA hardware -- in other words,
   loads the device's translation table for future I/O
b) app forks
c) app writes to the registered memory region, and the kernel breaks
   the COW for the (now read-only) page by mapping a new page
d) app starts an I/O that will do a DMA read from the region
e) device reads using the wrong, old mapping

This can be pretty insidious because for example fork() + immediate
exec() or just using system() still leaves the parent with PTEs marked
read-only.  If an application does overlapping memory registrations so
get_user_pages() is called a lot, then as far as I can see
can_share_swap_page() will always return 0 and the COW will happen
even if the child process has thrown out its original vmas.

Or if the counts are in the correct range, then there's a small window
between fork() and exec() where the parent process can screw itself
up, so most of the time the app works, until it doesn't.

 - R.
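
Steps a) through e) above can be walked through with a toy model.
This is a userspace simulation under stated assumptions, not kernel
code: struct sketch_mm and the sketch_*() helpers are hypothetical,
each "frame" holds a single byte, and one PTE stands in for the whole
registered region.

```c
#include <stddef.h>

#define SKETCH_FRAMES 4

/* Hypothetical model: a few physical frames, the parent's single PTE,
 * and the translation the device captured at registration time. */
struct sketch_mm {
    char frame[SKETCH_FRAMES];  /* contents of each physical frame */
    int next_free;              /* next unused frame */
    int pte_frame;              /* frame the parent's virtual page maps to */
    int pte_readonly;           /* set by fork() to arm copy-on-write */
    int dev_frame;              /* frame cached in the device's table */
};

static void sketch_init(struct sketch_mm *mm)
{
    for (int i = 0; i < SKETCH_FRAMES; i++)
        mm->frame[i] = 0;
    mm->next_free = 1;
    mm->pte_frame = 0;
    mm->pte_readonly = 0;
    mm->dev_frame = -1;         /* nothing registered yet */
}

/* a) registration: the device caches the current translation */
static void sketch_register(struct sketch_mm *mm)
{
    mm->dev_frame = mm->pte_frame;
}

/* b) fork: the parent's PTE becomes read-only */
static void sketch_fork(struct sketch_mm *mm)
{
    mm->pte_readonly = 1;
}

/* c) parent write: if COW is armed, copy to a new frame and remap */
static int sketch_write(struct sketch_mm *mm, char value)
{
    if (mm->pte_readonly) {
        if (mm->next_free >= SKETCH_FRAMES)
            return -1;
        int new_frame = mm->next_free++;
        mm->frame[new_frame] = mm->frame[mm->pte_frame];
        mm->pte_frame = new_frame;
        mm->pte_readonly = 0;
    }
    mm->frame[mm->pte_frame] = value;
    return 0;
}

/* d), e) the device reads through its cached translation;
 * only valid after sketch_register() has been called */
static char sketch_dma_read(const struct sketch_mm *mm)
{
    return mm->frame[mm->dev_frame];
}
```

In this model, writing 'A', registering, forking, and then writing 'B'
leaves the parent seeing 'B' while sketch_dma_read() still returns
'A', because the device keeps pointing at the original frame.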


From caitlin.bestler at gmail.com  Mon Apr 25 14:42:55 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 25 Apr 2005 14:42:55 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <521x8yv9vb.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
Message-ID: <469958e005042514421b57c833@mail.gmail.com>

On 4/25/05, Roland Dreier <roland at topspin.com> wrote:
>     Andrew> Do we care about that?  A straightforward scenario under
>     Andrew> which this can happen is:
> 
>     Andrew> a) app starts some read I/O in an asynchronous manner
>     Andrew> b) app forks
>     Andrew> c) child writes to one of the pages which is still under read I/O
>     Andrew> d) the read I/O completes
>     Andrew> e) the child is left with the old data plus the child's modification instead
>     Andrew>    of the new data
> 
>     Andrew> which is a very silly application which is giving itself
>     Andrew> unpredictable memory contents anyway.
> 
>     Andrew> I assume there's a more sensible scenario?
> 
> You're right, that is a silly scenario ;)  In fact if we mark vmas
> with VM_DONTCOPY, then the child just crashes with a seg fault.
> 
> The type of thing I'm worried about is something like, for example:
> 
> a) app registers memory region with RDMA hardware -- in other words,
>    loads the device's translation table for future I/O
> b) app forks
> c) app writes to the registered memory region, and the kernel breaks
>    the COW for the (now read-only) page by mapping a new page
> d) app starts an I/O that will do a DMA read from the region
> e) device reads using the wrong, old mapping
> 
> This can be pretty insidious because for example fork() + immediate
> exec() or just using system() still leaves the parent with PTEs marked
> read-only.  If an application does overlapping memory registrations so
> get_user_pages() is called a lot, then as far as I can see
> can_share_swap_page() will always return 0 and the COW will happen
> even if the child process has thrown out its original vmas.
> 
> Or if the counts are in the correct range, then there's a small window
> between fork() and exec() where the parent process can screw itself
> up, so most of the time the app works, until it doesn't.
> 

Every RDMA related interface specification that I know of specifically
excludes support of RDMA resources being inherited by child processes,
with the warning that excellent implementations will give the child
process an error for attempting to use the parent's RDMA resources.
More streamlined implementations will simply be unpredictable.

As for forking while the parent has a pending read: since the parent
has not reaped the completion at the time of the fork the buffers
in question are undefined. The child's buffers will be consistent,
that is they are undefined.


From akpm at osdl.org  Mon Apr 25 15:14:59 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 15:14:59 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <521x8yv9vb.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
Message-ID: <20050425151459.1f5fb378.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Andrew> Do we care about that?  A straightforward scenario under
>     Andrew> which this can happen is:
> 
>     Andrew> a) app starts some read I/O in an asynchronous manner
>     Andrew> b) app forks
>     Andrew> c) child writes to one of the pages which is still under read I/O
>     Andrew> d) the read I/O completes
>     Andrew> e) the child is left with the old data plus the child's modification instead
>     Andrew>    of the new data
> 
>     Andrew> which is a very silly application which is giving itself
>     Andrew> unpredictable memory contents anyway.
> 
>     Andrew> I assume there's a more sensible scenario?
> 
> You're right, that is a silly scenario ;)  In fact if we mark vmas
> with VM_DONTCOPY, then the child just crashes with a seg fault.
> 
> The type of thing I'm worried about is something like, for example:
> 
> a) app registers memory region with RDMA hardware -- in other words,
>    loads the device's translation table for future I/O

Whoa, hang on.

The way we expect get_user_pages() to be used is that the kernel will use
get_user_pages() once per application I/O request.

Are you saying that RDMA clients will semi-permanently own pages which were
pinned by get_user_pages()?  That those pages will be used for multiple
separate I/O operations?

If so, then that's a significant design departure and it would be good to
hear why it is necessary.

> b) app forks
> c) app writes to the registered memory region, and the kernel breaks
>    the COW for the (now read-only) page by mapping a new page
> d) app starts an I/O that will do a DMA read from the region
> e) device reads using the wrong, old mapping

Sure.  But such an app could be declared to be buggy...

> This can be pretty insidious because for example fork() + immediate
> exec() or just using system() still leaves the parent with PTEs marked
> read-only.  If an application does overlapping memory registrations so
> get_user_pages() is called a lot, then as far as I can see
> can_share_swap_page() will always return 0 and the COW will happen
> even if the child process has thrown out its original vmas.
> 
> Or if the counts are in the correct range, then there's a small window
> between fork() and exec() where the parent process can screw itself
> up, so most of the time the app works, until it doesn't.
> 
>  - R.


From timur.tabi at ammasso.com  Mon Apr 25 15:21:28 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 17:21:28 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
Message-ID: <426D6D68.6040504@ammasso.com>

Andrew Morton wrote:

> The way we expect get_user_pages() to be used is that the kernel will use
> get_user_pages() once per application I/O request.
> 
> Are you saying that RDMA clients will semi-permanently own pages which were
> pinned by get_user_pages()?  That those pages will be used for multiple
> separate I/O operations?

Yes, absolutely!

The memory buffer is allocated by the process (usually just via malloc) and 
registered/pinned by the driver.  It then stays pinned for the life of the process (typically).

> If so, then that's a significant design departure and it would be good to
> hear why it is necessary.

That's just how RDMA works.  Once the memory is pinned, if the app wants to send data to 
another node, it does two things:

1) Puts the data into its buffer
2) Sends a "work request" to the driver with (among other things) the offset and length of 
the data.

This is a time-critical operation.  It must occur as fast as possible, which means the 
memory must have already been pinned.
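
The two steps above can be sketched as follows.  This is a model under
stated assumptions, not any vendor's interface: struct sketch_mr,
struct sketch_work_request and sketch_wr_valid() are hypothetical, and
the work request carries only the offset and length mentioned above.

```c
#include <stddef.h>

/* Hypothetical registered buffer: pinned once, reused for many sends. */
struct sketch_mr {
    char *base;
    size_t len;
};

/* Hypothetical work request: only the fields mentioned in the text. */
struct sketch_work_request {
    size_t offset;
    size_t length;
};

/* Returns 1 if the request lies entirely inside the registered buffer.
 * Nothing is pinned here; the fast path only checks bounds against
 * memory that was registered earlier. */
static int sketch_wr_valid(const struct sketch_mr *mr,
                           const struct sketch_work_request *wr)
{
    return wr->offset <= mr->len && wr->length <= mr->len - wr->offset;
}
```

The point of the model is that the per-send path contains no pinning
step at all, which is why the registration has to outlive any single
I/O request.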

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From timur.tabi at ammasso.com  Mon Apr 25 15:23:54 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 17:23:54 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
Message-ID: <426D6DFA.4090908@ammasso.com>

Andrew Morton wrote:

> The way we expect get_user_pages() to be used is that the kernel will use
> get_user_pages() once per application I/O request.

Are you saying that the mapping obtained by get_user_pages() is valid only within the 
context of the IOCtl call?  That once the driver returns from the IOCtl, the mapping 
should no longer be used?

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From akpm at osdl.org  Mon Apr 25 15:32:56 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 15:32:56 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426D6D68.6040504@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
Message-ID: <20050425153256.3850ee0a.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Andrew Morton wrote:
> 
> > The way we expect get_user_pages() to be used is that the kernel will use
> > get_user_pages() once per application I/O request.
> > 
> > Are you saying that RDMA clients will semi-permanently own pages which were
> > pinned by get_user_pages()?  That those pages will be used for multiple
> > separate I/O operations?
> 
> Yes, absolutely!
> 
> The memory buffer is allocated by the process (usually just via malloc) and 
> registered/pinned by the driver.  It then stays pinned for the life of the process (typically).

ug.  What stops the memory from leaking if the process exits?

I hope this is a privileged operation?

> > If so, then that's a significant design departure and it would be good to
> > hear why it is necessary.
> 
> That's just how RDMA works.  Once the memory is pinned, if the app wants to send data to 
> another node, it does two things:
> 
> 1) Puts the data into its buffer
> 2) Sends a "work request" to the driver with (among other things) the offset and length of 
> the data.
> 
> This is a time-critical operation.  It must occur as fast as possible, which means the 
> memory must have already been pinned.

It would be better to obtain this memory via a mmap() of some special
device node, so we can perform appropriate permission checking and clean
everything up on unclean application exit.


From akpm at osdl.org  Mon Apr 25 15:35:42 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 15:35:42 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426D6DFA.4090908@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
Message-ID: <20050425153542.70197e6a.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Andrew Morton wrote:
> 
> > The way we expect get_user_pages() to be used is that the kernel will use
> > get_user_pages() once per application I/O request.
> 
> Are you saying that the mapping obtained by get_user_pages() is valid only within the 
> context of the IOCtl call?  That once the driver returns from the IOCtl, the mapping 
> should no longer be used?

Yes, we expect that all the pages which get_user_pages() pinned will become
unpinned within the context of the syscall which pinned the pages.  Or
shortly after, in the case of async I/O.

This is because there is no file descriptor or anything else associated
with the pages which permits the kernel to clean stuff up on unclean
application exit.  Also there are the obvious issues with permitting
pinning of unbounded amounts of memory.


From timur.tabi at ammasso.com  Mon Apr 25 15:42:36 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 17:42:36 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
Message-ID: <426D725C.4070103@ammasso.com>

Andrew Morton wrote:

> This is because there is no file descriptor or anything else associated
> with the pages which permits the kernel to clean stuff up on unclean
> application exit.  Also there are the obvious issues with permitting
> pinning of unbounded amounts of memory.

Then that might explain the "bug" that we're seeing with get_user_pages().  We've been 
assuming that get_user_pages() mappings are permanent.

Well, I was just about to re-implement get_user_pages() support in our driver to 
demonstrate the bug.  I guess I'll hold off on that.

If you look at the Infiniband code that was recently submitted, I think you'll see it does 
exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the 
page mappings for future use.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From robert.j.woodruff at intel.com  Mon Apr 25 15:51:03 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 25 Apr 2005 15:51:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org>
Message-ID: <ORSMSX408FRaqbC8wSA00000014@orsmsx408.amr.corp.intel.com>

 Andrew Morton wrote,
>Yes, we expect that all the pages which get_user_pages() pinned will become
>unpinned within the context of the syscall which pinned the pages.  Or
>shortly after, in the case of async I/O.

>This is because there is no file descriptor or anything else associated
>with the pages which permits the kernel to clean stuff up on unclean
>application exit.  Also there are the obvious issues with permitting
>pinning of unbounded amounts of memory.

There definitely needs to be a mechanism to prevent people from pinning
too much memory. We saw issues in the sourceforge stack and some of the
vendors' stacks where we could lock memory till the system hung. 
In the sourceforge InfiniBand stack, we put in a 
check to make sure that people did not pin too much memory. 
It was sort of a crude/brute-force mechanism, but effective. I think that we
limited people from locking down more than 1/2 of kernel memory or
70% of all memory (it was tunable with a module option) and if they
exceeded the limit, their requests to register memory would begin to fail. 
Arlin can provide details on how we did it or people can look at the 
IBAL code for an example. 
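
For readers who want the general shape of such a check, here is a minimal
sketch.  It is not the IBAL code: struct pin_accounting and try_pin() are
made-up names, and a single tunable percentage of total memory is assumed
(the 70 figure is only the value recalled above).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting state; not taken from IBAL. */
struct pin_accounting {
    uint64_t total_bytes;  /* all memory in the system */
    uint64_t pinned_bytes; /* currently pinned through registrations */
    unsigned limit_pct;    /* tunable cap, 0..100 (e.g. a module option) */
};

/*
 * Account for a new registration of `bytes`.  Returns false, leaving the
 * state unchanged, if the registration would push the pinned total above
 * the cap; the caller would then fail the registration request.
 */
bool try_pin(struct pin_accounting *acct, uint64_t bytes)
{
    /* Divide first so the multiplication cannot overflow for pct <= 100. */
    uint64_t cap = acct->total_bytes / 100 * acct->limit_pct;

    if (bytes > cap || acct->pinned_bytes > cap - bytes)
        return false;
    acct->pinned_bytes += bytes;
    return true;
}
```

A matching "unpin" would subtract from pinned_bytes on deregistration; it is
left out to keep the sketch short.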

woody


From timur.tabi at ammasso.com  Mon Apr 25 16:13:01 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 18:13:01 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <ORSMSX408FRaqbC8wSA00000014@orsmsx408.amr.corp.intel.com>
References: <ORSMSX408FRaqbC8wSA00000014@orsmsx408.amr.corp.intel.com>
Message-ID: <426D797D.3000108@ammasso.com>

Bob Woodruff wrote:

> There definitely needs to be a mechanism to prevent people from pinning
> too much memory. 

Any limit would have to be very high - definitely more than just half.  What if the 
application needs to pin 2GB?  The customer is not going to buy 4+ GB of RAM just because 
Linux doesn't like pinning more than half.  In an x86-32 system, that would require PAE 
support and slow everything down.

Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned.  So 
you have 3GB of RAM, Linux should allow you to pin 2.5GB.
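
Expressed as arithmetic, that is a fixed reserve rather than a percentage.
A tiny sketch, with max_pinnable() being a hypothetical helper and the 512MB
reserve just the off-the-cuff number from above:

```c
#include <stdint.h>

/*
 * Bytes that may be pinned if `reserve` bytes must always stay unpinned.
 * Returns 0 when the machine has no more RAM than the reserve.
 */
uint64_t max_pinnable(uint64_t total, uint64_t reserve)
{
    return total > reserve ? total - reserve : 0;
}
```

With total = 3GB and reserve = 512MB this gives the 2.5GB figure.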

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From akpm at osdl.org  Mon Apr 25 16:13:30 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 16:13:30 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426D725C.4070103@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<426D725C.4070103@ammasso.com>
Message-ID: <20050425161330.32c32b4b.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Andrew Morton wrote:
> 
> > This is because there is no file descriptor or anything else associated
> > with the pages which permits the kernel to clean stuff up on unclean
> > application exit.  Also there are the obvious issues with permitting
> > pinning of unbounded amounts of memory.
> 
> Then that might explain the "bug" that we're seeing with get_user_pages().  We've been 
> assuming that get_user_pages() mappings are permanent.

They are permanent until someone runs put_page() against all the pages. 
What I'm saying is that all current callers of get_user_pages() _do_ run
put_page() within the same syscall or upon I/O termination.
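
To make the difference concrete, here is a toy userspace model of the two
usage patterns.  It is only a sketch: struct toy_page and the toy_* functions
are invented for illustration and merely mimic the reference counting that
get_user_pages() and put_page() do on real pages.

```c
#include <stddef.h>

/* Toy stand-in for a page; only the reference count matters here. */
struct toy_page { int refcount; };

/* Models get_user_pages(): take one reference on each page. */
void toy_pin(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].refcount++;
}

/* Models put_page() on every page: drop the reference from toy_pin(). */
void toy_unpin(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].refcount--;
}

/* Conventional caller: pin, do the I/O, unpin before returning. */
void toy_direct_io(struct toy_page *pages, size_t n)
{
    toy_pin(pages, n);
    /* ... the I/O itself would happen here ... */
    toy_unpin(pages, n);
}

/* RDMA-style caller: registration pins and returns with pages still held. */
void toy_register(struct toy_page *pages, size_t n)
{
    toy_pin(pages, n);
}

/* The references are only dropped when the region is deregistered. */
void toy_deregister(struct toy_page *pages, size_t n)
{
    toy_unpin(pages, n);
}
```

In the first pattern the count is balanced by the time the call returns; in
the second, something has to remember to call the deregister path, including
when the process dies.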

> Well, I was just about to re-implement get_user_pages() support in our driver to 
> demonstrate the bug.  I guess I'll hold off on that.
> 
> If you look at the Infiniband code that was recently submitted, I think you'll see it does 
> exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the 
> page mappings for future use.

Where?

bix:/usr/src/linux-2.6.12-rc3> grep -rl get_user_pages .
./arch/i386/lib/usercopy.c
./arch/sparc64/kernel/ptrace.c
./drivers/video/pvr2fb.c
./drivers/media/video/video-buf.c
./drivers/scsi/sg.c
./drivers/scsi/st.c
./include/asm-ia64/pgtable.h
./include/linux/mm.h
./include/asm-um/archparam-i386.h
./include/asm-i386/fixmap.h
./fs/nfs/direct.c
./fs/aio.c
./fs/binfmt_elf.c
./fs/bio.c
./fs/direct-io.c
./kernel/futex.c
./kernel/ptrace.c
./mm/memory.c
./mm/nommu.c
./mm/rmap.c
./mm/mempolicy.c


From libor at topspin.com  Mon Apr 25 16:17:13 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 25 Apr 2005 16:17:13 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425153542.70197e6a.akpm@osdl.org>;
	from akpm@osdl.org on Mon, Apr 25, 2005 at 03:35:42PM -0700
References: <20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
Message-ID: <20050425161713.A9002@topspin.com>

On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote:
> Timur Tabi <timur.tabi at ammasso.com> wrote:
> >
> > Andrew Morton wrote:
> > 
> > > The way we expect get_user_pages() to be used is that the kernel will use
> > > get_user_pages() once per application I/O request.
> > 
> > Are you saying that the mapping obtained by get_user_pages() is valid only within the 
> > context of the IOCtl call?  That once the driver returns from the IOCtl, the mapping 
> > should no longer be used?
> 
> Yes, we expect that all the pages which get_user_pages() pinned will become
> unpinned within the context of the syscall which pinned the pages.  Or
> shortly after, in the case of async I/O.

  When a network protocol is making use of async I/O the amount of time
between posting the read request and getting the completion for that
request is unbounded since it depends on the other half of the connection
sending some data. In this case the buffer that was pinned during the
io_submit() may be pinned, and holding the pages, for a long time. During
this time the process might fork, at which point any data received will be
placed into the wrong spot. 

> This is because there is no file descriptor or anything else associated
> with the pages which permits the kernel to clean stuff up on unclean
> application exit.  Also there are the obvious issues with permitting
> pinning of unbounded amounts of memory.

  Correct, the driver must be able to determine that the process has died
and clean up after it, so the pinned region in most implementations is
associated with an open file descriptor.

-Libor


From akpm at osdl.org  Mon Apr 25 16:17:47 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 16:17:47 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <426D797D.3000108@ammasso.com>
References: <ORSMSX408FRaqbC8wSA00000014@orsmsx408.amr.corp.intel.com>
	<426D797D.3000108@ammasso.com>
Message-ID: <20050425161747.28b03800.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Bob Woodruff wrote:
> 
> > There definitely needs to be a mechanism to prevent people from pinning
> > too much memory. 
> 
> Any limit would have to be very high - definitely more than just half.  What if the 
> application needs to pin 2GB?  The customer is not going to buy 4+ GB of RAM just because 
> Linux doesn't like pinning more than half.  In an x86-32 system, that would require PAE 
> support and slow everything down.
> 
> Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned.  So 
> you have 3GB of RAM, Linux should allow you to pin 2.5GB.
> 

You can pin the whole darn lot *if you have the correct privileges*.


From timur.tabi at ammasso.com  Mon Apr 25 16:21:51 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 18:21:51 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425161330.32c32b4b.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6DFA.4090908@ammasso.com>	<20050425153542.70197e6a.akpm@osdl.org>	<426D725C.4070103@ammasso.com>
	<20050425161330.32c32b4b.akpm@osdl.org>
Message-ID: <426D7B8F.6000903@ammasso.com>

Andrew Morton wrote:

> They are permanent until someone runs put_page() against all the pages. 
> What I'm saying is that all current callers of get_user_pages() _do_ run
> put_page() within the same syscall or upon I/O termination.

Oh, okay then.  I guess I'll get back to work!

Actually, with RDMA, "I/O termination" technically doesn't happen until the memory is 
deregistered.  When the memory is registered, all that means is that it should be pinned 
and the virtual-to-physical mapping should be stored.  No actual I/O occurs at that point.

> > If you look at the Infiniband code that was recently submitted, I think you'll see it does 
> > exactly that: after calling mlock(), the driver calls get_user_pages(), and it stores the 
> > page mappings for future use.
>
> Where?

I was talking about the code that Roland mentioned in the first message of this thread - 
the user-space verbs support.  He said the code calls mlock() and get_user_pages().

FYI, our driver detects the process termination and cleans up everything itself.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From akpm at osdl.org  Mon Apr 25 16:24:05 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 16:24:05 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050425161713.A9002@topspin.com>
References: <20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<20050425161713.A9002@topspin.com>
Message-ID: <20050425162405.0889093e.akpm@osdl.org>

Libor Michalek <libor at topspin.com> wrote:
>
> On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote:
> > Timur Tabi <timur.tabi at ammasso.com> wrote:
> > >
> > > Andrew Morton wrote:
> > > 
> > > > The way we expect get_user_pages() to be used is that the kernel will use
> > > > get_user_pages() once per application I/O request.
> > > 
> > > Are you saying that the mapping obtained by get_user_pages() is valid only within the 
> > > context of the IOCtl call?  That once the driver returns from the IOCtl, the mapping 
> > > should no longer be used?
> > 
> > Yes, we expect that all the pages which get_user_pages() pinned will become
> > unpinned within the context of the syscall which pinned the pages.  Or
> > shortly after, in the case of async I/O.
> 
>   When a network protocol is making use of async I/O the amount of time
> between posting the read request and getting the completion for that
> request is unbounded since it depends on the other half of the connection
> sending some data. In this case the buffer that was pinned during the
> io_submit() may be pinned, and holding the pages, for a long time.

Sure.

> During
> this time the process might fork, at this point any data received will be
> placed into the wrong spot. 

Well the data is placed in _a_ spot.  That's only the "wrong" spot because
you've defined it to be wrong!

IOW: what behaviour are you actually looking for here, and why, and does it
matter?

> > This is because there is no file descriptor or anything else associated
> > with the pages which permits the kernel to clean stuff up on unclean
> > application exit.  Also there are the obvious issues with permitting
> > pinning of unbounded amounts of memory.
> 
>   Correct, the driver must be able to determine that the process has died
> and clean up after it, so the pinned region in most implementations is
> associated with an open file descriptor.

How is that association created?


From akpm at osdl.org  Mon Apr 25 16:27:40 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 16:27:40 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426D7B8F.6000903@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<426D725C.4070103@ammasso.com>
	<20050425161330.32c32b4b.akpm@osdl.org>
	<426D7B8F.6000903@ammasso.com>
Message-ID: <20050425162740.702a171b.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> FYI, our driver detects the process termination and cleans up everything itself.

How is this implemented?


From robert.j.woodruff at intel.com  Mon Apr 25 16:29:34 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 25 Apr 2005 16:29:34 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbsimplementation
In-Reply-To: <426D797D.3000108@ammasso.com>
Message-ID: <ORSMSX4081XvpFVjCRG00000015@orsmsx408.amr.corp.intel.com>

Timur Tabi wrote,
 
>Any limit would have to be very high - definitely more than just half.  What if the
>application needs to pin 2GB?  The customer is not going to buy 4+ GB of RAM just because
>Linux doesn't like pinning more than half.  In an x86-32 system, that would require PAE
>support and slow everything down.

>Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned.  So
>you have 3GB of RAM, Linux should allow you to pin 2.5GB.

That is why we made it tunable, so that people could decide how much to allow.

There is probably a better way to do it than some hard limit, but 
that would take a little more understanding of the VM system than we had,
and that is why some of the core kernel folks may be able to help us come up
with a better solution.

woody


From caitlin.bestler at gmail.com  Mon Apr 25 16:37:56 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 25 Apr 2005 16:37:56 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425162405.0889093e.akpm@osdl.org>
References: <20050418164316.GA27697@infradead.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org> <521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org> <426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<20050425161713.A9002@topspin.com>
	<20050425162405.0889093e.akpm@osdl.org>
Message-ID: <469958e00504251637350cc8c@mail.gmail.com>

On 4/25/05, Andrew Morton <akpm at osdl.org> wrote:

> 
> > > This is because there is no file descriptor or anything else associated
> > > with the pages which permits the kernel to clean stuff up on unclean
> > > application exit.  Also there are the obvious issues with permitting
> > > pinning of unbounded amounts of memory.
> >
> >   Correct, the driver must be able to determine that the process has died
> > and clean up after it, so the pinned region in most implementations is
> > associated with an open file descriptor.
> 
> How is that association created?


There is not a file descriptor, but there is an rnic handle. Both DAPL
and IT-API require that process death will result in the handle and all
of its dependent objects being released.

The rnic handle can always be declared to be a "file descriptor" if
that makes it follow normal OS conventions more precisely.

There is also a need for some form of resource manager to approve
creation of Memory Regions. Obviously you cannot have multiple
applications claiming half of physical memory.

But if you merely require the user to have root privileges in order
to create a Memory Region, and then take a first-come first-served
attitude, I don't think you end up with something that is truly a
general purpose capability.

A general purpose RDMA capability requires the ability to indefinitely
pin large portions of user memory. It makes sense to integrate that
with OS policy control over resource utilization and to integrate it with
memory suspend/resume capabilities so that hotplug memory can
be supported. What you can't do is downgrade a Memory Region so
that it is no longer a memory region. Doing that means that you are
not truly supporting RDMA.


From roland at topspin.com  Mon Apr 25 16:58:03 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 16:58:03 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425153256.3850ee0a.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 15:32:56 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
Message-ID: <52vf6atnn8.fsf@topspin.com>

    Andrew> ug.  What stops the memory from leaking if the process
    Andrew> exits?

    Andrew> I hope this is a privileged operation?

I don't think it has to be privileged.  In my implementation, the
driver keeps a per-process list of registered memory regions and
unpins/cleans up on process exit.
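
A rough sketch of that bookkeeping, in plain userspace C rather than driver
code.  All names here (struct toy_context, toy_reg_mr(), toy_release()) are
hypothetical; the real implementation is in the posted patch, and actual
unpinning is reduced to a byte counter.

```c
#include <stddef.h>
#include <stdlib.h>

/* One registered memory region, kept on a singly linked list. */
struct toy_region {
    void              *addr;
    size_t             len;
    struct toy_region *next;
};

/* Per-process (per open context) state. */
struct toy_context {
    struct toy_region *regions;      /* every region this process registered */
    size_t             pinned_bytes; /* stand-in for the pages held pinned */
};

/* Record a registration.  Returns 0 on success, -1 if out of memory. */
int toy_reg_mr(struct toy_context *ctx, void *addr, size_t len)
{
    struct toy_region *r = malloc(sizeof *r);

    if (!r)
        return -1;
    r->addr = addr;
    r->len = len;
    r->next = ctx->regions;
    ctx->regions = r;
    ctx->pinned_bytes += len;
    return 0;
}

/*
 * Models the cleanup run when the process exits, cleanly or not: walk the
 * list and release everything.  Returns the number of regions cleaned up.
 */
size_t toy_release(struct toy_context *ctx)
{
    size_t n = 0;

    while (ctx->regions) {
        struct toy_region *r = ctx->regions;

        ctx->regions = r->next;
        ctx->pinned_bytes -= r->len; /* real code would unpin pages here */
        free(r);
        n++;
    }
    return n;
}
```

Because the list hangs off per-process state, nothing depends on the
application deregistering its regions before it exits.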

    Andrew> It would be better to obtain this memory via a mmap() of
    Andrew> some special device node, so we can perform appropriate
    Andrew> permission checking and clean everything up on unclean
    Andrew> application exit.

This seems to interact poorly with how applications want to use RDMA,
ie typically through a library interface such as MPI.  People doing
HPC don't want to recode their apps to use a new allocator, they just
want to link to a new MPI library and have the app go fast.

 - R.


From roland at topspin.com  Mon Apr 25 17:02:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 17:02:36 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425151459.1f5fb378.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 15:14:59 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
Message-ID: <52r7gytnfn.fsf@topspin.com>

    Andrew> Whoa, hang on.

    Andrew> The way we expect get_user_pages() to be used is that the
    Andrew> kernel will use get_user_pages() once per application I/O
    Andrew> request.

    Andrew> Are you saying that RDMA clients will semi-permanently own
    Andrew> pages which were pinned by get_user_pages()?  That those
    Andrew> pages will be used for multiple separate I/O operations?

    Andrew> If so, then that's a significant design departure and it
    Andrew> would be good to hear why it is necessary.

The idea is that applications manage the lifetime of pinned memory
regions.  They can do things like post multiple I/O operations without
any page-walking overhead, or pass a buffer descriptor to a remote
host who will send data at some indeterminate time in the future.  In
addition, InfiniBand has the notion of atomic operations, so a cluster
application may be using some memory region to implement a global lock.

This might not be the most kernel-friendly design but it is pretty
deeply ingrained in the design of RDMA transports like InfiniBand and
iWARP (RDMA over IP).

I'm also not opposed to implementing some other mechanism to make this
work, but the combination of get_user_pages() in the kernel and
extending mprotect() to allow setting VM_DONTCOPY seems to work fine.

 - R.


From roland at topspin.com  Mon Apr 25 17:04:02 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 17:04:02 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <469958e005042514421b57c833@mail.gmail.com> (Caitlin Bestler's
	message of "Mon, 25 Apr 2005 14:42:55 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<469958e005042514421b57c833@mail.gmail.com>
Message-ID: <52mzrmtnd9.fsf@topspin.com>

    Caitlin> Every RDMA related interface specification that I know of
    Caitlin> specifically excludes support of RDMA resources being
    Caitlin> inherited by child processes, with the warning that
    Caitlin> excellent implementations will give the child process an
    Caitlin> error for attempting to use the parent's RDMA resources.
    Caitlin> More streamlined implementations will simply be
    Caitlin> unpredictable.

    Caitlin> As for forking while the parent has a pending read: since
    Caitlin> the parent has not reaped the completion at the time of
    Caitlin> the fork the buffers in question are undefined. The
    Caitlin> child's buffers will be consistent, that is they are
    Caitlin> undefined.

I think you've missed the point: unless a process sets VM_DONTCOPY on
its RDMA memory regions, then incorrect memory mappings may be used if
the app does something as simple as calling system("ls").
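
As a userspace illustration of the idea only: the sketch below assumes a
kernel and libc that expose VM_DONTCOPY through madvise(MADV_DONTFORK).  That
is an assumption about the reader's system, not the interface proposed in
this thread, which extends mprotect() instead.  alloc_dontfork() is a
hypothetical helper name.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate an anonymous buffer and mark it so that fork() does not give
 * the child a copy-on-write view of it.  The parent then keeps the same
 * physical pages it registered.  Returns NULL on failure.
 */
void *alloc_dontfork(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p, len, MADV_DONTFORK) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

The catch is that the child has no mapping at that address at all, so this
only suits buffers the child never needs to touch.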

 - R.


From roland at topspin.com  Mon Apr 25 17:08:57 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 17:08:57 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425161330.32c32b4b.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 16:13:30 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com> <20050425153542.70197e6a.akpm@osdl.org>
	<426D725C.4070103@ammasso.com> <20050425161330.32c32b4b.akpm@osdl.org>
Message-ID: <52is2atn52.fsf@topspin.com>

    Timur> If you look at the Infiniband code that was recently
    Timur> submitted, I think you'll see it does exactly that: after
    Timur> calling mlock(), the driver calls get_user_pages(), and it
    Timur> stores the page mappings for future use.

    Andrew> Where?

The code isn't merged yet.  I sent a version to lkml for review -- in
fact it was this very thread that we're in now.  The code in question
is in http://lkml.org/lkml/2005/4/4/266

This implements a "userspace verbs" character device that memory
registration goes through.  This means the kernel has a device node
that will be closed when a process dies, and so the memory can be
cleaned up.
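
To make the cleanup path concrete, here is a userspace model of it, not
the code from the patch.  All names (model_page, model_region,
model_file_ctx, model_register, model_release) are hypothetical
stand-ins: model_register() plays the role of get_user_pages() taking a
reference on each page, and model_release() plays the role of the
device's ->release() method dropping those references, as put_page()
would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct page: only the reference count matters here. */
struct model_page {
    int refcount;
};

/* One registered memory region and the pages pinned for it. */
struct model_region {
    struct model_page **pages;
    size_t npages;
    struct model_region *next;
};

/* Per-open-file context, the analogue of file->private_data. */
struct model_file_ctx {
    struct model_region *regions;
};

/* Registration: take a reference on every page and remember the region. */
int model_register(struct model_file_ctx *ctx,
                   struct model_page **pages, size_t npages)
{
    struct model_region *r = malloc(sizeof(*r));
    size_t i;

    if (r == NULL)
        return -1;
    r->pages = pages;
    r->npages = npages;
    for (i = 0; i < npages; i++)
        pages[i]->refcount++;
    r->next = ctx->regions;
    ctx->regions = r;
    return 0;
}

/* Release: runs when the last reference to the open file goes away,
 * including on unclean process exit.  Drops every pin still held. */
void model_release(struct model_file_ctx *ctx)
{
    while (ctx->regions != NULL) {
        struct model_region *r = ctx->regions;
        size_t i;

        ctx->regions = r->next;
        for (i = 0; i < r->npages; i++) {
            assert(r->pages[i]->refcount > 0);
            r->pages[i]->refcount--;
        }
        free(r);
    }
}
```

The point of the design is that the process never has to cooperate: the
kernel closes the file descriptor at exit, so the release path always
runs.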

 - R.


From akpm at osdl.org  Mon Apr 25 17:10:50 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 17:10:50 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <469958e00504251637350cc8c@mail.gmail.com>
References: <20050418164316.GA27697@infradead.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<20050425161713.A9002@topspin.com>
	<20050425162405.0889093e.akpm@osdl.org>
	<469958e00504251637350cc8c@mail.gmail.com>
Message-ID: <20050425171050.5ba25918.akpm@osdl.org>

Caitlin Bestler <caitlin.bestler at gmail.com> wrote:
>
> > 
> > > > This is because there is no file descriptor or anything else associated
> > > > with the pages which permits the kernel to clean stuff up on unclean
> > > > application exit.  Also there are the obvious issues with permitting
> > > > pinning of unbounded amounts of memory.
> > >
> > >   Correct, the driver must be able to determine that the process has died
> > > and clean up after it, so the pinned region in most implementations is
> > > associated with an open file descriptor.
> > 
> > How is that association created?
> 
> 
> There is not a file descriptor, but there is an rnic handle. Both DAPL
> and IT-API require that process death will result in the handle and all
> of its dependent objects being released.

What's an "rnic handle", in Linux terms?

> The rnic handle can always be declared to be a "file descriptor" if
> that makes it follow normal OS conventions more precisely.

Does that mean that the code has not yet been implemented?

Yes, a Linux fd is appropriate.  But we don't have any sane way right now
of saying "you need to run put_page() against all these pages in the
->release() handler".  That'll need to be coded by yourselves.

> There is also a need for some form of resource manager to approve
> creation of Memory Regions. Obviously you cannot have multiple
> applications claiming half of physical memory.

The kernel already has considerable resource management capabilities. 
Please consider using/extending/generalising those before inventing
anything new.  RLIMIT_MEMLOCK would be a starting point.
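
As a rough illustration of what an RLIMIT_MEMLOCK check amounts to, here
is a userspace sketch using getrlimit().  The helper names
within_memlock_limit() and may_pin() are hypothetical, and the sketch
deliberately ignores CAP_IPC_LOCK; an in-kernel check would read the
limit from the task rather than call getrlimit().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Pure check: with `already` bytes pinned, would pinning `want` more
 * bytes stay within `limit`?  Returns 1 if allowed, 0 if not.
 */
int within_memlock_limit(rlim_t limit, size_t already, size_t want)
{
    if (limit == RLIM_INFINITY)
        return 1;
    if (already > limit)
        return 0;
    /* Written as a subtraction so the comparison cannot overflow. */
    return want <= limit - already;
}

/*
 * Apply the check against the calling process's soft RLIMIT_MEMLOCK.
 * Returns 1 if allowed, 0 if not, -1 if the limit cannot be read.
 */
int may_pin(size_t already, size_t want)
{
    struct rlimit rl;

    assert(RLIM_INFINITY != 0);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    return within_memlock_limit(rl.rlim_cur, already, want);
}
```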

> But if you merely require the user to have root privileges in order
> to create a Memory Region, and then take a first-come first-served
> attitude, I don't think you end up with something that is truly a
> general purpose capability.

We don't want code in the kernel which will permit hostile unprivileged
users to trivially cause the box to lock up.  RLIMIT_MEMLOCK and, if
necessary, CAP_IPC_LOCK sound appropriate here.

> A general purpose RDMA capability requires the ability to indefinitely
> pin large portions of user memory. It makes sense to integrate that
> with OS policy control over resource utilization and to integrate it with
> memory suspend/resume capabilities so that hotplug memory can
> be supported. What you can't do is downgrade a Memory Region so
> that it is no longer a memory region. Doing that means that you are
> not truly supporting RDMA.


From akpm at osdl.org  Mon Apr 25 17:11:45 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 17:11:45 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52vf6atnn8.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
Message-ID: <20050425171145.2f0fd7f8.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Andrew> ug.  What stops the memory from leaking if the process
>     Andrew> exits?
> 
>     Andrew> I hope this is a privileged operation?
> 
> I don't think it has to be privileged.  In my implementation, the
> driver keeps a per-process list of registered memory regions and
> unpins/cleans up on process exit.

How does the driver detect process exit?

>     Andrew> It would be better to obtain this memory via a mmap() of
>     Andrew> some special device node, so we can perform appropriate
>     Andrew> permission checking and clean everything up on unclean
>     Andrew> application exit.
> 
> This seems to interact poorly with how applications want to use RDMA,
> ie typically through a library interface such as MPI.  People doing
> HPC don't want to recode their apps to use a new allocator, they just
> want to link to a new MPI library and have the app go fast.

Fair enough.


From roland at topspin.com  Mon Apr 25 17:23:17 2005
From: roland at topspin.com (Roland Dreier)
Date: Mon, 25 Apr 2005 17:23:17 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425171145.2f0fd7f8.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 17:11:45 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org> <52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
Message-ID: <52acnmtmh6.fsf@topspin.com>

    Andrew> How does the driver detect process exit?

I already answered earlier but just to be clear: registration goes
through a character device, and all regions are cleaned up in the
->release() of that device.

I don't currently have any code accounting against RLIMIT_MEMLOCK or
testing CAP_FOO, but I have no problem adding whatever is thought
appropriate.  Userspace also has control over the permissions and
owner/group of the /dev node.

 - R.


From akpm at osdl.org  Mon Apr 25 17:37:57 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 17:37:57 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52acnmtmh6.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
Message-ID: <20050425173757.1dbab90b.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Andrew> How does the driver detect process exit?
> 
> I already answered earlier but just to be clear: registration goes
> through a character device, and all regions are cleaned up in the
> ->release() of that device.

yup.

> I don't currently have any code accounting against RLIMIT_MEMLOCK or
> testing CAP_FOO, but I have no problem adding whatever is thought
> appropriate.  Userspace also has control over the permissions and
> owner/group of the /dev node.

I guess device node permissions won't be appropriate here, if only because
it sounds like everyone will go and set them to 0666.

RLIMIT_MEMLOCK sounds like the appropriate mechanism.  We cannot rely upon
userspace running mlock(), so perhaps it is appropriate to run sys_mlock()
in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking.

However a hostile app can just go and run munlock() and then allocate
some more pinned-by-get_user_pages() memory.

umm, how about we

- force the special pages into a separate vma

- run get_user_pages() against it all

- use RLIMIT_MEMLOCK accounting to check whether the user is allowed to
  do this thing

- undo the RLIMIT_MEMLOCK accounting in ->release

This will all interact with user-initiated mlock/munlock in messy ways. 
Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is
unaffected by mlock/munlock activity is needed.

A bit of generalisation in do_mlock() should suit?
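
The accounting half of the proposal above (charge at registration, undo
in ->release) can be sketched as a small userspace model.  The names
pin_account, pin_charge() and pin_uncharge() are hypothetical; in the
kernel the counter would live in per-process state and the limit would
come from RLIMIT_MEMLOCK.  The key property is that only the driver
touches this counter, so user-initiated mlock/munlock cannot change it.

```c
#include <assert.h>

/* Per-process pin accounting, counted in pages. */
struct pin_account {
    unsigned long limit_pages;   /* derived from RLIMIT_MEMLOCK */
    unsigned long pinned_pages;  /* currently charged by the driver */
};

/*
 * Charge `npages` before calling get_user_pages().
 * Returns 0 on success, -1 if the limit would be exceeded
 * (the kernel would return an errno such as -ENOMEM here).
 */
int pin_charge(struct pin_account *acct, unsigned long npages)
{
    assert(acct->pinned_pages <= acct->limit_pages);
    if (npages > acct->limit_pages - acct->pinned_pages)
        return -1;
    acct->pinned_pages += npages;
    return 0;
}

/* Undo a charge on deregistration or in the ->release() handler. */
void pin_uncharge(struct pin_account *acct, unsigned long npages)
{
    assert(npages <= acct->pinned_pages);
    acct->pinned_pages -= npages;
}
```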


From iwamoto at valinux.co.jp  Mon Apr 25 19:03:38 2005
From: iwamoto at valinux.co.jp (IWAMOTO Toshihiro)
Date: Tue, 26 Apr 2005 11:03:38 +0900
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52vf6atnn8.fsf@topspin.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
Message-ID: <20050426020338.5909570488@sv1.valinux.co.jp>

At Mon, 25 Apr 2005 16:58:03 -0700,
Roland Dreier wrote:
>     Andrew> It would be better to obtain this memory via a mmap() of
>     Andrew> some special device node, so we can perform appropriate
>     Andrew> permission checking and clean everything up on unclean
>     Andrew> application exit.
> 
> This seems to interact poorly with how applications want to use RDMA,
> ie typically through a library interface such as MPI.  People doing
> HPC don't want to recode their apps to use a new allocator, they just
> want to link to a new MPI library and have the app go fast.

Such HPC users cannot use the memory hotremoval feature, and something
needs to be implemented so that the NUMA migration can handle such
memory properly, but I see your point.

If such memory were allocated by a driver, the memory could be placed
in non-hotremovable areas to avoid the above problems.

--
IWAMOTO Toshihiro


From timur.tabi at ammasso.com  Mon Apr 25 19:16:53 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 21:16:53 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426020338.5909570488@sv1.valinux.co.jp>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>
	<20050426020338.5909570488@sv1.valinux.co.jp>
Message-ID: <426DA495.4040700@ammasso.com>

IWAMOTO Toshihiro wrote:

> If such memory were allocated by a driver, the memory could be placed
> in non-hotremovable areas to avoid the above problems.

How can the driver allocate 3GB of pinned memory on a system with 3.5GB of RAM?  Can 
vmalloc() or get_free_pages() allocate that much memory?


From timur.tabi at ammasso.com  Mon Apr 25 19:21:03 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 21:21:03 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425173757.1dbab90b.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>	<20050425171145.2f0fd7f8.akpm@osdl.org>	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
Message-ID: <426DA58F.3020508@ammasso.com>

Andrew Morton wrote:

> RLIMIT_MEMLOCK sounds like the appropriate mechanism.  We cannot rely upon
> userspace running mlock(), so perhaps it is appropriate to run sys_mlock()
> in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking.

I don't see what's wrong with relying on userspace to call mlock().  First of all, all RDMA 
apps call a third-party API, like DAPL or MPI, to register memory.  The memory needs to be 
registered in order for the driver and adapter to know where it is.  During this 
registration, the memory is also pinned.  That's when we call mlock().

> 
> However a hostile app can just go and run munlock() and then allocate
> some more pinned-by-get_user_pages() memory.

Isn't mlock() on a per-process basis anyway?  How can one process call munlock() on 
another process' memory?

> umm, how about we
> 
> - force the special pages into a separate vma
> 
> - run get_user_pages() against it all
> 
> - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to
>   do this thing
> 
> - undo the RLIMIT_MEMLOCK accounting in ->release

Isn't this kinda what mlock() does already?  Create a new VMA and then VM_LOCK it?

> This will all interact with user-initiated mlock/munlock in messy ways. 
> Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is
> unaffected by mlock/munlock activity is needed.
> 
> A bit of generalisation in do_mlock() should suit?

Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap.


From steve.langdon at hp.com  Mon Apr 25 19:26:04 2005
From: steve.langdon at hp.com (Stephen Langdon)
Date: Mon, 25 Apr 2005 22:26:04 -0400
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426020338.5909570488@sv1.valinux.co.jp>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>
	<20050426020338.5909570488@sv1.valinux.co.jp>
Message-ID: <426DA6BC.9070703@hp.com>

I don't think that we should jump to the conclusion that in the long 
term HPC users cannot benefit from support of mechanisms such as 
hotremoval of memory or other forms of page migration in physical 
memory.  In an earlier exchange on the openib-general list Mike Krause 
sent the message quoted below on very much the same topic.  On the other 
hand I am willing to accept that there is practical value to 
implementations which are not (yet) sophisticated enough to support 
the migration functions.

Steve Langdon

> Michael Krause wrote: At 05:35 PM 3/14/2005, Caitlin Bestler wrote:
>
>>  
>>
>> > -----Original Message-----
>> > From: Troy Benjegerdes [ mailto:hozer at hozed.org]
>> > Sent: Monday, March 14, 2005 5:06 PM
>> > To: Caitlin Bestler
>> > Cc: openib-general at openib.org
>> > Subject: Re: [openib-general] Getting rid of pinned memory requirement
>> >
>> > >
>> > > The key is that the entire operation either has to be fast
>> > > enough so that no connection or application session layer
>> > > time-outs occur, or an end-to-end agreement to suspend the
>> > > connection is a requirement. The first option seems more
>> > > plausible to me, the second essentially
>> > > requires extending the CM protocol. That's a tall order even for
>> > > InfiniBand, and it's even worse for iWARP where the CM
>> > > functionality typically ends when the connection is established.
>> > 
>> > I'll buy the good network design argument.
>
>
> I and others designed InfiniBand RNR (Receiver not ready) operations 
> to allow one to adjust V-to-P mappings (not change the address that 
> was advertised) in order to allow an OS to safely play some games with 
> memory and not drop a connection.  The time values associated with RNR 
> allow a solution to tolerate up to infinite amount of time to perform 
> such operations but the envisioned goal was to do this on the order of 
> a handful of milliseconds in the worst case.  For iWARP, there was no 
> support for defining RNR functionality as indeed many people claimed 
> one could just drop in-bound segments and allow the retransmission 
> protocol to deal with the delay (even if this has performance 
> implications due to back-off algorithms though some claim SACK would 
> minimize this to a large extent).  Again, the idea was to minimize the 
> worst case to milliseconds of down time.  BTW, all of this assumed 
> that the OS would not perform these types of changes that often so the 
> long-term impact on an application would be minimum.
>
>> >
>> > I suppose if the kernel wants to revoke a card's pinned
>> > memory, we should be able to guarantee that it gets new
>> > pinned memory within a bounded time. What sort of timing do
>> > we need? Milliseconds?
>> > Microseconds?
>> >
>> > In the case of iWarp, isn't this just TCP underneath? If so,
>> > can't we just drop any packets in the pipe on the floor and
>> > let them get retransmitted? (I suppose the same argument goes
>> > for infiniband..
>> > what sort of a time window do we have for retransmission?)
>> >
>> > What are the limits on end-to-end flow control in IB and iWarp?
>> >
>>
>> From the RDMA Provider's perspective, the short answer is "quick 
>> enough so that I don't have to do anything heroic to keep the 
>> connection alive."
>
>
> It should not require anything heroic.  What it does require is a 
> local method to suspend the local QP(s) so that it cannot place or 
> read memory in the affected area.  That can take some time depending 
> upon the implementation.  There is then the time to overwrite the 
> mappings which again depending upon the implementation and the number 
> of mappings could be milliseconds in length.
>
>> With TCP you also have to add "and healthy". If you've ever had a 
>> long download that got effectively stalled by a burst of noise and 
>> you just hit the 'reload' button on your browser then you know what 
>> I'm talking about.
>>
>> But in transport neutral terms I would think that one RTT is 
>> definitely safe -- that much data could have
>> been dropped by one switch failure or one nasty spike in inbound noise.
>>
>> > >
>> > > Yes, there are limits on how much memory you can mlock, or even
>> > > allocate. Applications are required to register memory precisely
>> > > because the required guarantees are not there by default.
>> > Eliminating
>> > > those guarantees *is* effectively rewriting every RDMA application
>> > > without even letting them know.
>> >
>> > Some of this argument is a policy issue, which I would argue
>> > shouldn't be hard-coded in the code or in the network hardware.
>> >
>> > At least in my view, the guarantees are only there to make
>> > applications go fast. We are getting low latency and high
>> > performance with infiniband by making memory registration go
>> > really really slow. If, to make big HPC simulation
>> > applications work, we wind up doing memcpy() to put the data
>> > into a registered buffer because we can't register half of
>> > physical memory, the application isn't going very fast.
>> >
>>
>> What you are looking for is a distinction between registering
>> memory to *enable* the RNIC to optimize local access and
>> registering memory to enable its being advertised to the
>> remote end.
>>
>> Early implementations of RDMA, both IB and iWARP, have not
>> distinguished between the two. But theoretically *applications*
>> do not need memory regions that are not enabled for remote
>> access to be pinned. That is an RNIC requirement that could
>> evolve. But applications themselves *do* need remotely
>> accessible memory regions, portions of which they intend
>> to advertise with RKeys, to be truly available (i.e., pinned).
>>
>> You are also making a policy assumption that an application
>> that actually needs half of physical memory should be using
>> paged memory. Memory is cheap, and if performance is critical
>> why should this memory be swapped out to disk?
>>
>> Is the limitation on not being able to register half of
>> physical memory based upon some assumption that swapping
>> is a requirement? Or is it a limitation in the memory region
>> size? If it's the latter, you need to get the OS to support
>> larger page sizes.
>
>
> For some OS, you can pin very large areas.  I've seen 15/16 of memory 
> being able to be pinned with no adverse impacts on the applications.  
> For these OS, kernel memory is effectively pinned memory.  As such, 
> depending upon the mix of services being provided, the system may 
> operate quite nicely with such large amounts of memory being pinned.  
> As more services are "ported" to operate over RDMA technologies, 
> memory management isn't necessarily any harder; it just becomes 
> something people have to think more about.  Today's VM designs have 
> allowed people to get sloppy as they assume that swapping will occur 
> and since many platforms are not that loaded, they don't see any real 
> adverse impacts.  User-space RDMA applications require people to 
> think once again about memory management and that swapping isn't a 
> get-out-of-jail card.  One needs to develop resource management tools 
> to determine who obtains specified amounts of resources and their 
> priorities.  For the most part, this is somewhat a re-invention of 
> some thinking that went into the micro-kernel work in past years.  
> These problems are not intractable; they are only constrained by the 
> legacy inertia inherent in all technologies today.
>
> Mike
>
>  
>


IWAMOTO Toshihiro wrote:

>At Mon, 25 Apr 2005 16:58:03 -0700,
>Roland Dreier wrote:
>  
>
>>    Andrew> It would be better to obtain this memory via a mmap() of
>>    Andrew> some special device node, so we can perform appropriate
>>    Andrew> permission checking and clean everything up on unclean
>>    Andrew> application exit.
>>
>>This seems to interact poorly with how applications want to use RDMA,
>>ie typically through a library interface such as MPI.  People doing
>>HPC don't want to recode their apps to use a new allocator, they just
>>want to link to a new MPI library and have the app go fast.
>>    
>>
>
>Such HPC users cannot use the memory hotremoval feature, and something
>needs to be implemented so that the NUMA migration can handle such
>memory properly, but I see your point.
>
>If such memory were allocated by a driver, the memory could be placed
>in non-hotremovable areas to avoid the above problems.
>
>--
>IWAMOTO Toshihiro


From akpm at osdl.org  Mon Apr 25 20:16:29 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 20:16:29 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426DA58F.3020508@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<426DA58F.3020508@ammasso.com>
Message-ID: <20050425201629.11d9118f.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Andrew Morton wrote:
> 
> > RLIMIT_MEMLOCK sounds like the appropriate mechanism.  We cannot rely upon
> > userspace running mlock(), so perhaps it is appropriate to run sys_mlock()
> > in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking.
> 
> I don't see what's wrong with relying on userspace to call mlock().  First of all, all RDMA 
> apps call a third-party API, like DAPL or MPI, to register memory.  The memory needs to be 
> registered in order for the driver and adapter to know where it is.  During this 
> registration, the memory is also pinned.  That's when we call mlock().

All the above refers to well-behaved applications.

Now think about how the syscalls which you provide may be used by
applications which are *designed* to cripple or to compromise the machine.

> > 
> > However a hostile app can just go and run munlock() and then allocate
> > some more pinned-by-get_user_pages() memory.
> 
> Isn't mlock() on a per-process basis anyway?  How can one process call munlock() on 
> another process' memory?

I'm referring to an application which uses your syscalls to obtain pinned
memory and uses munlock() so that it may then use your syscalls to obtain
even more pinned memory.  With the objective of taking the machine down.
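
The attack can be shown with a small userspace model in which the only
limit check is the one mlock() performs.  Everything here is a
hypothetical stand-in (lock_model, lm_mlock, lm_munlock, lm_register):
registration locks the range and then pins it, but munlock() returns
the locked-bytes budget while the pin from get_user_pages() stays.

```c
#include <assert.h>
#include <stddef.h>

/* Per-process state: the mlock budget and what is really pinned. */
struct lock_model {
    size_t limit;   /* RLIMIT_MEMLOCK, in bytes */
    size_t locked;  /* bytes currently counted as mlock()ed */
    size_t pinned;  /* bytes held by get_user_pages() references */
};

/* mlock(): the only place the limit is checked in this model. */
int lm_mlock(struct lock_model *p, size_t len)
{
    assert(p->locked <= p->limit);
    if (len > p->limit - p->locked)
        return -1;
    p->locked += len;
    return 0;
}

/* munlock(): gives the budget back; pins are not affected. */
void lm_munlock(struct lock_model *p, size_t len)
{
    p->locked -= (len < p->locked) ? len : p->locked;
}

/* Registration as described in the thread: mlock, then pin. */
int lm_register(struct lock_model *p, size_t len)
{
    if (lm_mlock(p, len) != 0)
        return -1;
    p->pinned += len;
    return 0;
}
```

Alternating lm_register() and lm_munlock() lets pinned grow without
bound while locked never exceeds limit, which is why the accounting has
to be attached to the pin itself rather than to the mlock state.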

> > umm, how about we
> > 
> > - force the special pages into a separate vma
> > 
> > - run get_user_pages() against it all
> > 
> > - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to
> >   do this thing
> > 
> > - undo the RLIMIT_MEMLOCK accounting in ->release
> 
> Isn't this kinda what mlock() does already?  Create a new VMA and then VM_LOCK it?

kinda.  But applications can undo the mlock which the kernel did.

> > This will all interact with user-initiated mlock/munlock in messy ways. 
> > Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is
> > unaffected by mlock/munlock activity is needed.
> > 
> > A bit of generalisation in do_mlock() should suit?
> 
> Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap.

I haven't even thought about memory hotswap.  Surely it'll fail if the
pages are pinned by get_user_pages()?


From libor at topspin.com  Mon Apr 25 20:31:10 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 25 Apr 2005 20:31:10 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050412180447.E6958@topspin.com>;
	from libor@topspin.com on Tue, Apr 12, 2005 at 06:04:47PM -0700
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<20050412180447.E6958@topspin.com>
Message-ID: <20050425203110.A9729@topspin.com>

On Tue, Apr 12, 2005 at 06:04:47PM -0700, Libor Michalek wrote:
> On Mon, Apr 11, 2005 at 05:13:47PM -0700, Andrew Morton wrote:
> > Roland Dreier <roland at topspin.com> wrote:
> > >
> > >     Troy> Do we even need the mlock in userspace then?
> > > 
> > > Yes, because the kernel may go through and unmap pages from userspace
> > > while trying to swap.  Since we have the page locked in the kernel,
> > > the physical page won't go anywhere, but userspace might end up with a
> > > different page mapped at the same virtual address.
> 
> With the last few kernels I haven't had a chance to retest the problem
> that pushed us in the direction of using mlock. I will go back and do
> so with the latest kernel. Below I've given a quick description of the
> issue.
> 
> > That shouldn't happen.  If get_user_pages() has elevated the refcount on a
> > page then the following can happen:
> > 
> > - The VM may decide to add the page to swapcache (if it's not mmapped
> >   from a file).
> > 
> > - Once the page is backed by either swapcache or a (mmapped) file, the VM
> >   may decide to unmap the application's pte's.  A later minor fault by the
> >   app will cause the same physical page to be remapped.
> 
> The driver did use get_user_pages() to elevate the refcount on all the
> pages it was going to use for IO, as well as call set_page_dirty() since
> the pages were going to have data written to them from the device.
> 
> The problem we were seeing is that the minor fault by the app resulted
> in a new physical page getting mapped for the application. The page that
> had the elevated refcount was still waiting for the data to be written
> to by the driver at the time that the app accessed the page causing the
> minor fault. Obviously since the app had a new mapping the data written
> by the driver was lost.
> 
> It looks like code was added to try_to_unmap_one() to address this, so
> hopefully it's no longer an issue...

  I wrote a quick test module and program to confirm that the problem
we saw in older kernels with get_user_pages() no longer exists. The
module creates a character device with three different ioctl commands:

  - Pin the pages of a buffer using get_user_pages()
  - Check the pages by calling get_user_pages() a second time and
    comparing the new and original page list.
  - Release the pages using put_page()

  The program opens the character device, pins the pages and waits for a
signal. The signal is sent to the process after some other program has
been run to exercise the VM, at which point the program checks the pages.
On older kernels the check fails; on my 2.6.11 kernel the check succeeds.
So mlock is not needed on top of get_user_pages() as it was before.

  Thanks for the heads up.

  Module and program attached.

-Libor
-------------- next part --------------
/*
 * Copyright (c) 2005 Topspin Communications.  All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *	copyright notice, this list of conditions and the following
 *	disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *	copyright notice, this list of conditions and the following
 *	disclaimer in the documentation and/or other materials
 *	provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * $Id: $
 */
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/poll.h>
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/cdev.h>
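#include <linux/devfs_fs_kernel.h>
#include <linux/mm.h>		/* get_user_pages(), put_page() */
#include <linux/slab.h>		/* kmalloc(), kfree() */
#include <linux/sched.h>	/* current */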

#include <asm/uaccess.h>
#include <asm/highmem.h>

	
MODULE_AUTHOR("Libor Michalek");
MODULE_DESCRIPTION("Get pages test");
MODULE_LICENSE("GPL");

enum {
	TEST_MAJOR = 232,
	TEST_MINOR = 255
};

#define TEST_DEV MKDEV(TEST_MAJOR, TEST_MINOR)

enum {
	TEST_CMD_REGISTER   = 1,
	TEST_CMD_UNREGISTER = 2,
	TEST_CMD_CHECK      = 3
};

struct ioctl_arg {
	__u64 addr;
	__u64 size;
};

struct region_root {
	struct semaphore mutex;
	struct list_head regions; /* list of pinned regions. */
	struct file *filp;
	int nr_region;
};

struct test_region {
	unsigned long user;
	unsigned long addr;
	unsigned long size;
	int  nr_pages;
	struct page **pages;
	struct region_root *root;
	struct list_head region_list; /* member in root region list */
};

static void test_unlock(struct test_region *region)
{
	long i;

	list_del(&region->region_list);

	for (i = 0; i < region->nr_pages; i++)
		put_page(region->pages[i]);

	printk(KERN_ERR "TEST: Unlocked address <%016lx>\n", region->user);

	kfree(region->pages);
	kfree(region);
}

static struct test_region *test_lookup(struct region_root *root,
				       unsigned long addr)
{
	struct test_region *region;

	list_for_each_entry(region, &root->regions, region_list)
		if (region->user == addr)
			return region;

	return NULL;
}

static int test_lock(struct region_root *root,
		     unsigned long uaddr,
		     unsigned long size)
{
	struct test_region *region;
	int nr_pages;
	int result;

	region = kmalloc(sizeof(*region), GFP_KERNEL);
	if (!region)
		return -ENOMEM;

	region->user = uaddr;
	region->addr = uaddr & PAGE_MASK;
	region->size = PAGE_ALIGN(size + (uaddr & ~PAGE_MASK));
	region->root = root;

	nr_pages = (region->size + PAGE_SIZE-1) >> PAGE_SHIFT;

	region->pages = kmalloc(sizeof(struct page *) * nr_pages, GFP_KERNEL);
	if (!region->pages) {

		result = -ENOMEM;
		goto page_err;
	}

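	/* get_user_pages() must be called with mmap_sem held for read. */
	down_read(&current->mm->mmap_sem);
	region->nr_pages = get_user_pages(current, current->mm,
					  region->addr,
					  nr_pages,
					  1, 0,
					  region->pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (region->nr_pages != nr_pages) {
		int i;

		/* Drop the references on any pages that did get pinned. */
		for (i = 0; i < region->nr_pages; i++)
			put_page(region->pages[i]);
		result = -EFAULT;
		goto get_err;
	}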

	list_add_tail(&region->region_list, &root->regions);

	printk(KERN_ERR "TEST:   Locked address <%016lx>\n", region->user);

	return 0;
get_err:
	kfree(region->pages);
page_err:
	kfree(region);
	return result;
}

static int test_check(struct test_region *region)
{
	struct page **pages;
	int nr_pages;
	int result = 0;
	int i;

	pages = kmalloc(sizeof(struct page *) * region->nr_pages, GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

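	/* get_user_pages() must be called with mmap_sem held for read. */
	down_read(&current->mm->mmap_sem);
	nr_pages = get_user_pages(current, current->mm,
				  region->addr,
				  region->nr_pages,
				  1, 0,
				  pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (region->nr_pages != nr_pages) {
		/* Drop the references on any pages that did get pinned. */
		for (i = 0; i < nr_pages; i++)
			put_page(pages[i]);
		result = -EFAULT;
		goto get_err;
	}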

	for (i = 0; i < nr_pages; i++) {

		if (region->pages[i] != pages[i])
			printk(KERN_ERR "TEST: Check error <%p:%p> "
			       "page <%u> of <%u>\n",
			       pages[i], region->pages[i], i, nr_pages);
		put_page(pages[i]);
	}

get_err:
	kfree(pages);
	return result;
}

static long test_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	struct region_root *root = filp->private_data;
	struct test_region *region;
	struct ioctl_arg    ureq;
	int result = 0;

	if (!root)
		return -EINVAL;

	if (copy_from_user(&ureq, (void __user *)arg, sizeof(ureq)))
		return -EFAULT;

	down(&root->mutex);

	switch (cmd) {
	case TEST_CMD_REGISTER:

		result = test_lock(root, ureq.addr, ureq.size);
		break;
	case TEST_CMD_UNREGISTER:

		region = test_lookup(root, ureq.addr);
		if (!region)
			result = -ENOENT;
		else
			test_unlock(region);

		break;
	case TEST_CMD_CHECK:

		region = test_lookup(root, ureq.addr);
		if (!region)
			result = -ENOENT;
		else
			result = test_check(region);

		break;
	default:
		result = -ERANGE;
		break;
	}

	up(&root->mutex);
	return result;
}

static int test_open(struct inode *inode, struct file *filp)
{
	struct region_root *root;

	root = kmalloc(sizeof(*root), GFP_KERNEL);
	if (!root)
		return -ENOMEM;

	memset(root, 0, sizeof(*root));

	INIT_LIST_HEAD(&root->regions);
	init_MUTEX(&root->mutex);

	filp->private_data = root;
	root->filp = filp;

	printk(KERN_ERR "TEST: Created root struct\n");

	return 0;
}

static int test_close(struct inode *inode, struct file *filp)
{
	struct region_root *root = filp->private_data;
	struct test_region *region;

	down(&root->mutex);

	while (!list_empty(&root->regions)) {

		region = list_entry(root->regions.next,
				    struct test_region, region_list);
		test_unlock(region);
	}

	up(&root->mutex);

	kfree(root);

	filp->private_data = NULL;

	printk(KERN_ERR "TEST: Deleted root struct\n");
	return 0;
}

static struct file_operations test_fops = {
	.owner          = THIS_MODULE,
	.open 	        = test_open,
	.release        = test_close,
	.compat_ioctl   = test_ioctl,
	.unlocked_ioctl = test_ioctl,
};


static struct cdev test_cdev;

static int __init test_init(void)
{
	int result;

	result = register_chrdev_region(TEST_DEV, 1, "mltest");
	if (result) {
		printk(KERN_ERR "TEST: Error <%d> registering dev\n", result);
		goto err_chr;
	}

	cdev_init(&test_cdev, &test_fops);

	result = cdev_add(&test_cdev, TEST_DEV, 1);
	if (result) {
		printk(KERN_ERR "TEST: Error <%d> adding cdev\n", result);
		goto err_cdev;
	}

	return 0;
err_cdev:
	unregister_chrdev_region(TEST_DEV, 1);
err_chr:
	return result;
}

static void __exit test_cleanup(void)
{
	cdev_del(&test_cdev);
	unregister_chrdev_region(TEST_DEV, 1);
}

module_init(test_init);
module_exit(test_cleanup);
-------------- next part --------------
/*
 * Copyright (c) 2005 Topspin Communications.  All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *	copyright notice, this list of conditions and the following
 *	disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *	copyright notice, this list of conditions and the following
 *	disclaimer in the documentation and/or other materials
 *	provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * $Id: $
 */

#include <stdlib.h>
#include <string.h>
#include <glob.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <stdint.h>
#include <poll.h>
#include <unistd.h>
#include <signal.h>
#include <sys/ioctl.h>

#include <linux/types.h>

#define TEST_DEV_PATH "/dev/mltest"
#define TEST_SLEEP_UTIME 50000

struct ioctl_arg {
	__u64 addr;
	__u64 size;
};

enum {
	TEST_CMD_REGISTER   = 1,
	TEST_CMD_UNREGISTER = 2,
	TEST_CMD_CHECK      = 3
};

/* Written from the signal handler, so it must be volatile sig_atomic_t. */
static volatile sig_atomic_t hold = 1;

void sig_usr(int value)
{
	hold = 0;
}

int main(int argc, char **argv)
{
	struct ioctl_arg req;
	void *data;
	int   param_c = 0;
	int   size;
	int   fd;
	int   result;

	if (2 != argc ||
	    0 > (size = atoi(argv[++param_c]))) { 
		
		fprintf(stderr, "usage: %s <size>\n", argv[0]);
		fprintf(stderr, "  size  - allocated region size in bytes.\n");
		
		exit(1);
	}
	signal(SIGUSR1, sig_usr);

	data = malloc(size);
	if (!data) {
		fprintf(stderr, "Failed to allocate region of size <%d>\n",
			size);
		exit(1);
	}
	
	fd = open(TEST_DEV_PATH, O_RDWR);
	if (fd < 0) {
		
		fprintf(stderr, "Error <%d:%d> opening device <%s>\n",
			fd, errno, TEST_DEV_PATH);
		goto open_err;
	}

	req.addr = (unsigned long)data;
	req.size = size;

	result = ioctl(fd, TEST_CMD_REGISTER, &req);
	if (result) {

		fprintf(stderr, "Error <%d:%d> registering region\n",
			result, errno);
		goto ioctl_err;
	}

	fprintf(stdout,
		"Address <%016lx> registered, process <%d> waiting...\n",
		(unsigned long)data, getpid());

	while (hold) {

		usleep(TEST_SLEEP_UTIME);
	}

	fprintf(stdout, "Process continuing, checking address <%016lx>\n",
		(unsigned long)data);

	result = ioctl(fd, TEST_CMD_CHECK, &req);
	if (result) {

		fprintf(stderr, "Error <%d:%d> checking region\n",
			result, errno);
		goto ioctl_err;
	}

	result = ioctl(fd, TEST_CMD_UNREGISTER, &req);
	if (result) {

		fprintf(stderr, "Error <%d:%d> unregistering region\n", 
			result, errno);
		goto ioctl_err;
	}

ioctl_err:
	close(fd);
open_err:
	free(data);

	return 0;
}

From timur.tabi at ammasso.com  Mon Apr 25 20:38:26 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Mon, 25 Apr 2005 22:38:26 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425201629.11d9118f.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>	<20050425171145.2f0fd7f8.akpm@osdl.org>	<52acnmtmh6.fsf@topspin.com>	<20050425173757.1dbab90b.akpm@osdl.org>	<426DA58F.3020508@ammasso.com>
	<20050425201629.11d9118f.akpm@osdl.org>
Message-ID: <426DB7B2.7000409@ammasso.com>

Andrew Morton wrote:

> I'm referring to an application which uses your syscalls to obtain pinned
> memory and uses munlock() so that it may then use your syscalls to obtain
> even more pinned memory.  With the objective of taking the machine down.

I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of the 
application.  All we need to do is export those functions, and my driver can call them. 
However, that still doesn't prevent an app from calling munlock().

But I don't understand the distinction between having the driver call do_mlock() vs. the 
application calling mlock().  Won't we still have the same problems?  A malicious app can 
just call our driver instead of calling mlock() or munlock(). The driver won't know the 
difference between an authorized app and an unauthorized one.

Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one process can lock?
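
For reference, a process can read its own limit from userspace with
getrlimit(). The following is only a minimal sketch: the helper name
memlock_would_fit() is made up for illustration, and it ignores memory the
process has already locked.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: returns 1 if a request of `bytes` is within the
 * soft RLIMIT_MEMLOCK limit of the calling process, 0 if it is not, and
 * -1 if the limit cannot be read.  Memory that is already locked is not
 * taken into account here.
 */
int memlock_would_fit(size_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return bytes <= rl.rlim_cur ? 1 : 0;
}
```

A driver that locks memory on behalf of the application would need an
equivalent check on the kernel side for the limit to mean anything.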

> I haven't even thought about memory hotswap.  Surely it'll fail if the
> pages are pinned by get_user_pages()?

Any memory registered for RDMA devices obviously can't be swapped out.  Technically, the 
driver could detect that, and reject any attempt to transfer data to those regions until 
everything is remapped to other RAM.  But that's opening a whole new can of worms.  I 
don't know how the memory hotswap mechanism works, so I can't guess what recovery 
mechanisms can be implemented in the driver.


From libor at topspin.com  Mon Apr 25 20:55:12 2005
From: libor at topspin.com (Libor Michalek)
Date: Mon, 25 Apr 2005 20:55:12 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425162405.0889093e.akpm@osdl.org>;
	from akpm@osdl.org on Mon, Apr 25, 2005 at 04:24:05PM -0700
References: <20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6DFA.4090908@ammasso.com>
	<20050425153542.70197e6a.akpm@osdl.org>
	<20050425161713.A9002@topspin.com>
	<20050425162405.0889093e.akpm@osdl.org>
Message-ID: <20050425205512.B9729@topspin.com>

On Mon, Apr 25, 2005 at 04:24:05PM -0700, Andrew Morton wrote:
> Libor Michalek <libor at topspin.com> wrote:
> > On Mon, Apr 25, 2005 at 03:35:42PM -0700, Andrew Morton wrote:
> >
> > > Yes, we expect that all the pages which get_user_pages() pinned 
> > > will become unpinned within the context of the syscall which pinned
> > > the pages.  Or shortly after, in the case of async I/O.
> > 
> >   When a network protocol is making use of async I/O the amount of time
> > between posting the read request and getting the completion for that
> > request is unbounded since it depends on the other half of the connection
> > sending some data. In this case the buffer that was pinned during the
> > io_submit() may be pinned, and holding the pages, for a long time.
> 
> Sure.
> 
> > During
> > this time the process might fork, at this point any data received will be
> > placed into the wrong spot. 
> 
> Well the data is placed in _a_ spot.  That's only the "wrong" spot because
> you've defined it to be wrong!
> 
> IOW: what behaviour are you actually looking for here, and why, and does it
> matter?

  For example a network server app has an open connection on which it
uses async IO to submit two buffers for a read operation. Both buffers
are pinned using get_user_pages() and the connection waits for data to
arrive. The connection receives data, it is written into the first buffer,
the app is notified using async IO, and it retrieves the async IO
completion. The app reads the buffer which happens to contain a command
to spawn a child, the app forks a child. Now there is still a buffer
posted for read and if more data arrives on the connection that data is
copied to the pages which were saved when the buffer was pinned. The app
is notified, retrieves the async IO completion, but when it goes to read
that buffer it will not have the new data.
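
  The copy-on-write split behind this scenario can be observed from plain
userspace, without any pinning or DMA. The sketch below is an illustration
only, not part of the test module: the function name is made up, and it
merely shows that after fork() a write by one process is no longer visible
through the other process's mapping of the same virtual address.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child still saw the pre-fork contents after the parent
 * wrote to the buffer, 0 if it saw the parent's write, -1 on error.
 */
int cow_split_after_fork(void)
{
	int to_child[2];
	int status;
	char token = 0;
	char *buf;
	pid_t pid;
	ssize_t n;

	buf = malloc(4096);
	if (!buf)
		return -1;
	buf[0] = 'A';

	if (pipe(to_child) < 0) {
		free(buf);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		free(buf);
		return -1;
	}

	if (pid == 0) {
		/* Child: wait until the parent has modified its copy. */
		close(to_child[1]);
		if (read(to_child[0], &token, 1) != 1)
			_exit(2);
		_exit(buf[0] == 'A' ? 1 : 0);
	}

	/* Parent: this write breaks the copy-on-write sharing. */
	close(to_child[0]);
	buf[0] = 'B';
	n = write(to_child[1], &token, 1);
	close(to_child[1]);

	if (waitpid(pid, &status, 0) < 0 || n != 1 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) > 1) {
		free(buf);
		return -1;
	}

	free(buf);
	return WEXITSTATUS(status);
}
```

  A device that keeps writing to the page it pinned before the fork is in
the same position as the child here: it holds the old physical page, while
the process that wrote to the buffer now reads from a new one.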
  
> > > This is because there is no file descriptor or anything else associated
> > > with the pages which permits the kernel to clean stuff up on unclean
> > > application exit.  Also there are the obvious issues with permitting
> > > pinning of unbounded amounts of memory.
> > 
> >   Correct, the driver must be able to determine that the process has died
> > and clean up after it, so the pinned region in most implementations is
> > associated with an open file descriptor.
> 
> How is that association created?

  The kernel module which pinned the memory is responsible for unpinning
it if the file descriptor, which was used to deliver the command that
resulted in the pinning, is closed.

-Libor


From akpm at osdl.org  Mon Apr 25 21:33:15 2005
From: akpm at osdl.org (Andrew Morton)
Date: Mon, 25 Apr 2005 21:33:15 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <426DB7B2.7000409@ammasso.com>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org>
	<5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org>
	<52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org>
	<4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<426DA58F.3020508@ammasso.com>
	<20050425201629.11d9118f.akpm@osdl.org>
	<426DB7B2.7000409@ammasso.com>
Message-ID: <20050425213315.27db35db.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Andrew Morton wrote:
> 
> > I'm referring to an application which uses your syscalls to obtain pinned
> > memory and uses munlock() so that it may then use your syscalls to obtain
> > even more pinned memory.  With the objective of taking the machine down.
> 
> I'm in favor of having drivers call do_mlock() and do_munlock() on behalf of the 
> application.  All we need to do is export those functions, and my driver can call them. 
> However, that still doesn't prevent an app from calling munlock().

Precisely.  That's why I suggested that we have an alternative vma->vm_flag
bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL),
only userspace cannot alter it.

> But I don't understand the distinction between having the driver call do_mlock() vs. the 
> application calling mlock().  Won't we still have the same problems?  A malicious app can 
> just call our driver instead of calling mlock() or munlock(). The driver won't know the 
> difference between an authorized app and an unauthorized one.

The driver will set VM_LOCKED_KERNEL, not VM_LOCKED.

> Besides, isn't the whole point behind RLIMIT_MEMLOCK to limit how much one process can lock?

Sure.  The internal setting of VM_LOCKED_KERNEL should still use
RLIMIT_MEMLOCK accounting.
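
The accounting idea can be modelled in a few lines of ordinary C. This is
a userspace model only, not kernel code: the struct, the function names and
the assumption that user and kernel locks are charged against one shared
limit are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical per-process accounting state, in pages. */
struct lock_acct {
	size_t locked_user;    /* pages locked through mlock() */
	size_t locked_kernel;  /* pages locked by a driver for the app */
	size_t limit;          /* RLIMIT_MEMLOCK expressed in pages */
};

/* Charge npages of driver-initiated locking; 0 on success, -1 if over limit. */
int acct_kernel_lock(struct lock_acct *a, size_t npages)
{
	size_t used = a->locked_user + a->locked_kernel;

	if (npages > a->limit || used > a->limit - npages)
		return -1;
	a->locked_kernel += npages;
	return 0;
}

/* munlock() from userspace only uncharges what userspace locked itself. */
void acct_user_munlock(struct lock_acct *a, size_t npages)
{
	if (npages > a->locked_user)
		npages = a->locked_user;
	a->locked_user -= npages;
}
```

The point of the second function is that munlock() never touches the
kernel-owned count, so an application cannot free up limit by unlocking
memory that a driver pinned for it.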


From hch at infradead.org  Mon Apr 25 23:12:36 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Tue, 26 Apr 2005 07:12:36 +0100
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52r7gytnfn.fsf@topspin.com>
References: <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<52r7gytnfn.fsf@topspin.com>
Message-ID: <20050426061236.GA27220@infradead.org>

On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote:
> The idea is that applications manage the lifetime of pinned memory
> regions.  They can do things like post multiple I/O operations without
> any page-walking overhead, or pass a buffer descriptor to a remote
> host who will send data at some indeterminate time in the future.  In
> addition, InfiniBand has the notion of atomic operations, so a cluster
> application may be using some memory region to implement a global lock.
> 
> This might not be the most kernel-friendly design but it is pretty
> deeply ingrained in the design of RDMA transports like InfiniBand and
> iWARP (RDMA over IP).

Actually, no it isn't.   All these transports would work just fine with
the mmap a character device to hand out memory from the kernel approach
I told you to use multiple times and Andrew mentioned in this thread as well.
What doesn't work with that design are the braindead designed by committee
APIs in the RDMA world - but I don't think we should care about them too
much.


From caitlin.bestler at gmail.com  Tue Apr 26 06:45:42 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Tue, 26 Apr 2005 06:45:42 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426061236.GA27220@infradead.org>
References: <4263DEC5.5080909@ammasso.com> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<52r7gytnfn.fsf@topspin.com> <20050426061236.GA27220@infradead.org>
Message-ID: <469958e00504260645dd2218d@mail.gmail.com>

On 4/25/05, Christoph Hellwig <hch at infradead.org> wrote:
> On Mon, Apr 25, 2005 at 05:02:36PM -0700, Roland Dreier wrote:
> > The idea is that applications manage the lifetime of pinned memory
> > regions.  They can do things like post multiple I/O operations without
> > any page-walking overhead, or pass a buffer descriptor to a remote
> > host who will send data at some indeterminate time in the future.  In
> > addition, InfiniBand has the notion of atomic operations, so a cluster
> > application may be using some memory region to implement a global lock.
> >
> > This might not be the most kernel-friendly design but it is pretty
> > deeply ingrained in the design of RDMA transports like InfiniBand and
> > iWARP (RDMA over IP).
> 
> Actually, no it isn't.   All these transports would work just fine with
> the mmap a character device to hand out memory from the kernel approach
> I told you to use multiple times and Andrew mentioned in this thread as well.
> What doesn't work with that design are the braindead designed by committee
> APIs in the RDMA world - but I don't think we should care about them too
> much.
> 


RDMA registers and uses the memory the user specifies. That is why byte
granularity and multiple redundant registrations are explicitly specified.
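
As a rough illustration of what byte granularity means for pinning: a
registration of an arbitrary byte range has to be widened to whole pages,
much as the test module posted earlier does with PAGE_MASK and PAGE_ALIGN.
The helper below is a sketch under an assumed 4096-byte page size; the
names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the example */

/* Number of pages touched by the byte range [addr, addr + len). */
size_t pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	if (len == 0)
		return 0;
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + len - 1) / SKETCH_PAGE_SIZE;
	return (size_t)(last - first + 1);
}
```

A two-byte registration that straddles a page boundary therefore pins two
full pages, and two overlapping registrations pin the same page twice.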

The mechanism by which this requirement is implemented is of course
OS dependent. But the requirements are that the application specifies
what portion of their memory they want registered (or what set of physical
pages if they have sufficient privilege) and that request is either honored
or refused by a resource manager (one preferably as integrated with
general OS resource management as possible).

The other aspect is that remotely enabled memory regions and memory
windows must be enabled for hardware access for the duration of 
the region or window -- indefinitely until process death or explicit
termination by the application layer.

Theoretically there is nothing in the wire protocols that requires source
buffers to be pinned indefinitely, but that is the only way any RDMA
interface has ever worked -- so "brain death" must be pretty widespread.

The fact that this problem must be solved for remotely accessible
buffers, and that for cluster applications like MPI there is no distinction
between buffers used for inbound messages and outbound messages,
might have something to do with this.

User verbs needs to deal with these actual Memory Registration requirements,
including the very real application need for Memory Windows. The solution
should map to existing OS controls as much as possible.


From timur.tabi at ammasso.com  Tue Apr 26 07:07:03 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Tue, 26 Apr 2005 09:07:03 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425213315.27db35db.akpm@osdl.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>	<20050411142213.GC26127@kalmia.hozed.org>	<52mzs51g5g.fsf@topspin.com>	<20050411163342.GE26127@kalmia.hozed.org>	<5264yt1cbu.fsf@topspin.com>	<20050411180107.GF26127@kalmia.hozed.org>	<52oeclyyw3.fsf@topspin.com>	<20050411171347.7e05859f.akpm@osdl.org>	<4263DEC5.5080909@ammasso.com>	<20050418164316.GA27697@infradead.org>	<4263E445.8000605@ammasso.com>	<20050423194421.4f0d6612.akpm@osdl.org>	<426BABF4.3050205@ammasso.com>	<52is2bvvz5.fsf@topspin.com>	<20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>	<20050425171145.2f0fd7f8.akpm@osdl.org>	<52acnmtmh6.fsf@topspin.com>	<20050425173757.1dbab90b.akpm@osdl.org>	<426DA58F.3020508@ammasso.com>	<20050425201629.11d9118f.akpm@osdl.org>	<426DB7B2.7000409@ammasso.com>
	<20050425213315.27db35db.akpm@osdl.org>
Message-ID: <426E4B07.1040400@ammasso.com>

Andrew Morton wrote:

> Precisely.  That's why I suggested that we have an alternative vma->vm_flag
> bit which behaves in a similar manner to VM_LOCKED (say, VM_LOCKED_KERNEL),
> only userspace cannot alter it.

How about calling it VM_PINNED?  That way, we can define

Locked - won't be swapped to disk, but can be moved around in memory
Pinned - won't be swapped to disk or moved around in memory
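
Written out as a sketch, with flag values that are purely illustrative and
not the kernel's, and helper names invented for the example:

```c
/* Illustrative flag bits only; these are not the kernel's values. */
enum {
	SKETCH_VM_LOCKED = 1 << 0,  /* set and cleared by mlock()/munlock() */
	SKETCH_VM_PINNED = 1 << 1   /* set and cleared only by the kernel */
};

/* A page may be swapped out only if it is neither locked nor pinned. */
int sketch_may_swap(unsigned int flags)
{
	return !(flags & (SKETCH_VM_LOCKED | SKETCH_VM_PINNED));
}

/* A page may be moved around in memory unless it is pinned. */
int sketch_may_move(unsigned int flags)
{
	return !(flags & SKETCH_VM_PINNED);
}
```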


From halr at voltaire.com  Tue Apr 26 07:48:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Apr 2005 10:48:24 -0400
Subject: [openib-general] Re: [openib-commits] r2196 - in
	gen2/trunk/src/linux-kernel/infiniband: core include]
Message-ID: <1114526788.1764.200.camel@localhost.localdomain>

Hi Sean,

I may have missed this but how is the need for the non-natural
alignment accommodated now ?

I am unable to IPoIB ping from one node to another as the SA query for
PathRecord is not answered as something is now wrong in the query. When
I add back the packing (patch below), it works again.

I do think the packing is for more than just 64 bit architectures as I
am running this on an Intel 386.

-- Hal

Index: ib_mad.h
===================================================================
--- ib_mad.h	(revision 2209)
+++ ib_mad.h	(working copy)
@@ -134,12 +134,18 @@
 
 #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
 
+/*
+ * ib_sa_hdr and ib_sa_mad structures must be packed because they have 
+ * 64-bit fields that are only 32-bit aligned. 64-bit architectures will
+ * lay them out wrong otherwise.  (And unfortunately they are sent on 
+ * the wire so we can't change the layout)
+ */
 struct ib_sa_hdr {
 	u64			sm_key;
 	u16			attr_offset;
 	u16			reserved;
 	ib_sa_comp_mask		comp_mask;
-};
+} __attribute__ ((packed));
 
 struct ib_mad {
 	struct ib_mad_hdr	mad_hdr;
@@ -157,7 +163,7 @@
 	struct ib_rmpp_hdr	rmpp_hdr;
 	struct ib_sa_hdr	sa_hdr;
 	u8			data[200];
-};
+} __attribute__ ((packed));
 
 struct ib_vendor_mad {
 	struct ib_mad_hdr	mad_hdr;


--Forwarded Message--


From: sean.hefty at openib.org
To: openib-commits at openib.org
Subject: [openib-commits] r2196 - in gen2/trunk/src/linux-kernel/infiniband: core include
Date: 20 Apr 2005 10:58:56 -0700

Author: sean.hefty
Date: 2005-04-20 10:58:55 -0700 (Wed, 20 Apr 2005)
New Revision: 2196

Modified:
   gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c
   gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h
   gen2/trunk/src/linux-kernel/infiniband/include/ib_sa.h
Log:
Move SA MAD definitions to ib_mad.h.  Removed unneeded packed attribute
from MAD structure definitions.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


Modified: gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c
===================================================================
--- gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c	2005-04-20 17:35:52 UTC (rev 2195)
+++ gen2/trunk/src/linux-kernel/infiniband/core/sa_query.c	2005-04-20 17:58:55 UTC (rev 2196)
@@ -50,26 +50,6 @@
 MODULE_DESCRIPTION("InfiniBand subnet administration query support");
 MODULE_LICENSE("Dual BSD/GPL");
 
-/*
- * These two structures must be packed because they have 64-bit fields
- * that are only 32-bit aligned.  64-bit architectures will lay them
- * out wrong otherwise.  (And unfortunately they are sent on the wire
- * so we can't change the layout)
- */
-struct ib_sa_hdr {
-	u64			sm_key;
-	u16			attr_offset;
-	u16			reserved;
-	ib_sa_comp_mask		comp_mask;
-} __attribute__ ((packed));
-
-struct ib_sa_mad {
-	struct ib_mad_hdr	mad_hdr;
-	struct ib_rmpp_hdr	rmpp_hdr;
-	struct ib_sa_hdr	sa_hdr;
-	u8			data[200];
-} __attribute__ ((packed));
-
 struct ib_sa_sm_ah {
 	struct ib_ah        *ah;
 	struct kref          ref;

Modified: gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h
===================================================================
--- gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h	2005-04-20 17:35:52 UTC (rev 2195)
+++ gen2/trunk/src/linux-kernel/infiniband/include/ib_mad.h	2005-04-20 17:58:55 UTC (rev 2196)
@@ -117,7 +117,7 @@
 	u16	attr_id;
 	u16	resv;
 	u32	attr_mod;
-} __attribute__ ((packed));
+};
 
 struct ib_rmpp_hdr {
 	u8	rmpp_version;
@@ -126,26 +126,44 @@
 	u8	rmpp_status;
 	u32	seg_num;
 	u32	paylen_newwin;
-} __attribute__ ((packed));
+};
 
+typedef u64 __bitwise ib_sa_comp_mask;
+
+#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n))
+
+struct ib_sa_hdr {
+	u64			sm_key;
+	u16			attr_offset;
+	u16			reserved;
+	ib_sa_comp_mask		comp_mask;
+};
+
 struct ib_mad {
 	struct ib_mad_hdr	mad_hdr;
 	u8			data[232];
-} __attribute__ ((packed));
+};
 
 struct ib_rmpp_mad {
 	struct ib_mad_hdr	mad_hdr;
 	struct ib_rmpp_hdr	rmpp_hdr;
 	u8			data[220];
-} __attribute__ ((packed));
+};
 
+struct ib_sa_mad {
+	struct ib_mad_hdr	mad_hdr;
+	struct ib_rmpp_hdr	rmpp_hdr;
+	struct ib_sa_hdr	sa_hdr;
+	u8			data[200];
+};
+
 struct ib_vendor_mad {
 	struct ib_mad_hdr	mad_hdr;
 	struct ib_rmpp_hdr	rmpp_hdr;
 	u8			reserved;
 	u8			oui[3];
 	u8			data[216];
-} __attribute__ ((packed));
+};
 

From roland at topspin.com  Tue Apr 26 08:14:08 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 08:14:08 -0700
Subject: [openib-general] Re: [openib-commits] r2196 - in
	gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114526788.1764.200.camel@localhost.localdomain> (Hal
	Rosenstock's message of "26 Apr 2005 10:48:24 -0400")
References: <1114526788.1764.200.camel@localhost.localdomain>
Message-ID: <521x8xtvsv.fsf@topspin.com>

    Hal> Hi Sean, I may have missed this but how is the need for the
    Hal> non-natural alignment accommodated now?

    Hal> I am unable to IPoIB ping from one node to another as the SA
    Hal> query for PathRecord is not answered as something is now
    Hal> wrong in the query. When I add back the packing (patch
    Hal> below), it works again.

    Hal> I do think the packing is for more than just 64 bit
    Hal> architectures as I am running this on an Intel 386.

Yes, the two SA structures definitely need the packed attribute as
explained in the comment.  I believe that the other MAD structures do
not need the attribute but that needs to be checked.

 - R.


From robert.j.woodruff at intel.com  Tue Apr 26 08:14:33 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Tue, 26 Apr 2005 08:14:33 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52mzrmtnd9.fsf@topspin.com>
Message-ID: <ORSMSX408ryWtIIZS2T00000016@orsmsx408.amr.corp.intel.com>

Roland wrote,
>I think you've missed the point: unless a process sets VM_DONTCOPY on
>its RDMA memory regions, then incorrect memory mappings may be used if
>the app does something as simple as calling system("ls").

> - R.

This is the exact problem that we saw with the Mellanox vapi driver.
It set VM_DONTCOPY and the result was that when someone did
a system("ls"), the call often caused a segv in the child. The issue
seemed to be that someone had done a malloc of a buffer and then registered it,
which caused the entire page to be set to VM_DONTCOPY. Then someone
else (like the pthreads library) did a malloc that happened to reside
in the same page. When the user called system() which did a fork()/exec(),
the pthreads library would segv when trying to clean things up before
the exec(). 

We found that if we did not set VM_DONTCOPY, the child would no longer
segv, but in some instances the Mellanox card seemed to be hosed after
the system() call and would no longer transfer data. We never did
understand why. We then found that if we set VM_DONTCOPY on only the
register space pages (doorbells and such), but not on the registered
memory, it seemed to work OK.

woody


From timur.tabi at ammasso.com  Tue Apr 26 08:24:03 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Tue, 26 Apr 2005 10:24:03 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426061236.GA27220@infradead.org>
References: <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org>
	<4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org>
	<426BABF4.3050205@ammasso.com> <52is2bvvz5.fsf@topspin.com>
	<20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<52r7gytnfn.fsf@topspin.com> <20050426061236.GA27220@infradead.org>
Message-ID: <426E5D13.6000200@ammasso.com>

Christoph Hellwig wrote:

> What doesn't work with that design are the braindead designed by committee
> APIs in the RDMA world - but I don't think we should care about them too
> much.

I think you should.  The whole point behind RDMA is that these APIs exist and are being
used by real-world applications.  You can't just ignore them because they're inconvenient.
If you're not willing to cater to these APIs' needs, then you may as well tell all the
RDMA developers to forget about Linux and port everything to Windows instead.

The APIs are here to stay, and the whole point behind this thread is to discuss how Linux
can support them.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


From tduffy at sun.com  Tue Apr 26 08:27:21 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 08:27:21 -0700
Subject: [openib-general] Re: [openib-commits] r2196 - in
	gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114526788.1764.200.camel@localhost.localdomain>
References: <1114526788.1764.200.camel@localhost.localdomain>
Message-ID: <1114529241.11580.1.camel@duffman>

On Tue, 2005-04-26 at 10:48 -0400, Hal Rosenstock wrote:
> Hi Sean,
> 
> I may have missed this but how is the need for the non-natural
> alignment accommodated now?
> 
> I am unable to IPoIB ping from one node to another as the SA query for
> PathRecord is not answered as something is now wrong in the query. When
> I add back the packing (patch below), it works again.

Ah good. It is not just me.  Please apply.

-tduffy

From roland at topspin.com  Tue Apr 26 08:31:32 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 08:31:32 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050425173757.1dbab90b.akpm@osdl.org> (Andrew Morton's
	message of "Mon, 25 Apr 2005 17:37:57 -0700")
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411163342.GE26127@kalmia.hozed.org> <5264yt1cbu.fsf@topspin.com>
	<20050411180107.GF26127@kalmia.hozed.org> <52oeclyyw3.fsf@topspin.com>
	<20050411171347.7e05859f.akpm@osdl.org> <4263DEC5.5080909@ammasso.com>
	<20050418164316.GA27697@infradead.org> <4263E445.8000605@ammasso.com>
	<20050423194421.4f0d6612.akpm@osdl.org> <426BABF4.3050205@ammasso.com>
	<52is2bvvz5.fsf@topspin.com> <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
Message-ID: <52wtqpsgff.fsf@topspin.com>

    Andrew> umm, how about we

    Andrew> - force the special pages into a separate vma

    Andrew> - run get_user_pages() against it all

    Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user
    Andrew> is allowed to do this thing

    Andrew> - undo the RLIMIT_MEMLOCK accounting in ->release

    Andrew> This will all interact with user-initiated mlock/munlock
    Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag
    Andrew> which works like VM_LOCKED but is unaffected by
    Andrew> mlock/munlock activity is needed.

    Andrew> A bit of generalisation in do_mlock() should suit?

Yes, it seems that modifying do_mlock() to something like

	int do_mlock(unsigned long start, size_t len,
		     unsigned int set, unsigned int clear)

and then exporting a function along the lines of

	int do_mem_pin(unsigned long start, size_t len, int on)

that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on
flag.

Seem reasonable?  If so I can code this up.

 - R.


From libor at topspin.com  Tue Apr 26 08:42:34 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 26 Apr 2005 08:42:34 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <52wtqpsgff.fsf@topspin.com>;
	from roland@topspin.com on Tue, Apr 26, 2005 at 08:31:32AM -0700
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com>
Message-ID: <20050426084234.A10366@topspin.com>

On Tue, Apr 26, 2005 at 08:31:32AM -0700, Roland Dreier wrote:
>     Andrew> umm, how about we
> 
>     Andrew> - force the special pages into a separate vma
> 
>     Andrew> - run get_user_pages() against it all
> 
>     Andrew> - use RLIMIT_MEMLOCK accounting to check whether the user
>     Andrew> is allowed to do this thing
> 
>     Andrew> - undo the RLIMIT_MEMLOCK accounting in ->release
> 
>     Andrew> This will all interact with user-initiated mlock/munlock
>     Andrew> in messy ways. Maybe a new kernel-internal vma->vm_flag
>     Andrew> which works like VM_LOCKED but is unaffected by
>     Andrew> mlock/munlock activity is needed.
> 
>     Andrew> A bit of generalisation in do_mlock() should suit?
> 
> Yes, it seems that modifying do_mlock() to something like
> 
> 	int do_mlock(unsigned long start, size_t len,
> 		     unsigned int set, unsigned int clear)
> 
> and then exporting a function along the lines of
> 
> 	int do_mem_pin(unsigned long start, size_t len, int on)
> 
> that sets/clears (VM_LOCKED_KERNEL | VM_DONTCOPY) according to the on
> flag.

  Do you mean that the set/clear parameters to do_mlock() are the
actual flags which are set/cleared by the caller? Also, the issue
remains that the flags are not reference counted, which is a problem
if you are dealing with overlapping memory regions, or even if one
region ends and another begins on the same page. Since the desire is
to be able to pin any memory that a user can malloc, this is a likely
scenario.

-Libor
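
To make the reference-counting concern concrete, here is a minimal
userspace model, not kernel code. The TOY_* bit values and the
apply_flags() and toy_mem_pin() helpers are hypothetical stand-ins for
the proposed do_mlock(start, len, set, clear) and do_mem_pin(); they
only model the flag word of a single vma.

```c
#include <assert.h>

/* Hypothetical bit values; the real VM_* constants live in the kernel. */
#define TOY_VM_LOCKED_KERNEL 0x1u
#define TOY_VM_DONTCOPY      0x2u
#define TOY_PIN_FLAGS        (TOY_VM_LOCKED_KERNEL | TOY_VM_DONTCOPY)

/* Model of the generalised do_mlock(): clear some bits, then set others. */
unsigned int apply_flags(unsigned int flags, unsigned int set,
			 unsigned int clear)
{
	assert((set & clear) == 0);	/* setting and clearing a bit at once is a caller bug */
	return (flags & ~clear) | set;
}

/* Model of do_mem_pin(): 'on' selects whether the pin flags are set or cleared. */
unsigned int toy_mem_pin(unsigned int flags, int on)
{
	return on ? apply_flags(flags, TOY_PIN_FLAGS, 0)
		  : apply_flags(flags, 0, TOY_PIN_FLAGS);
}
```

With plain flags, pinning the same vma twice and then unpinning once
leaves the flag word at 0, even though one registration is still
outstanding. That is the case a count would have to cover.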


From roland at topspin.com  Tue Apr 26 08:49:17 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 08:49:17 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050426084234.A10366@topspin.com> (Libor Michalek's message
	of "Tue, 26 Apr 2005 08:42:34 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
Message-ID: <52mzrlsflu.fsf@topspin.com>

    Libor>   Do you mean that the set/clear parameters to do_mlock()
    Libor> are the actual flags which are set/cleared by the caller? 
    Libor> Also, the issue remains that the flags are not reference
    Libor> counted which is a problem if you are dealing with
    Libor> overlapping memory regions, or even if one region ends and
    Libor> another begins on the same page. Since the desire is to be
    Libor> able to pin any memory that a user can malloc this is a
    Libor> likely scenario.

Good point... we need to figure out how to handle:

    a) app registers 0x0000 through 0x17ff
    b) app registers 0x1800 through 0x2fff
    c) app unregisters 0x0000 through 0x17ff
    d) the page at 0x1000 must stay pinned

hmm...

 - R.


From mshefty at ichips.intel.com  Tue Apr 26 09:13:27 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 26 Apr 2005 09:13:27 -0700
Subject: [openib-general] Re: [openib-commits] r2196 -
 in	gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114526788.1764.200.camel@localhost.localdomain>
References: <1114526788.1764.200.camel@localhost.localdomain>
Message-ID: <426E68A7.6040305@ichips.intel.com>

Hal Rosenstock wrote:

> Hi Sean,
> 
> I may have missed this but how is the need for the non-natural
> alignment accommodated now?

Sorry about that.  I was looking at the layout for a single structure, 
and not nesting of structures, when I made this change, and my testing 
is on a 32-bit system.

- Sean
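
The nesting effect described here can be checked from userspace. The
following is a sketch only, assuming GCC or Clang (for
__attribute__((packed))). The struct names are stand-ins, and the MAD
header fields that are not visible in the diff above are assumptions
chosen to give a 24-byte MAD header and a 12-byte RMPP header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ib_mad_hdr; fields not shown in the diff are assumed. */
struct mad_hdr {
	uint8_t  base_version, mgmt_class, class_version, method;
	uint16_t status, class_specific;
	uint64_t tid;
	uint16_t attr_id, resv;
	uint32_t attr_mod;
};

/* Stand-in for ib_rmpp_hdr; the third field name is assumed. */
struct rmpp_hdr {
	uint8_t  rmpp_version, rmpp_type, rmpp_rtime_flags, rmpp_status;
	uint32_t seg_num, paylen_newwin;
};

struct sa_hdr_natural {
	uint64_t sm_key;
	uint16_t attr_offset;
	uint16_t reserved;
	uint64_t comp_mask;
};

struct sa_hdr_packed {
	uint64_t sm_key;
	uint16_t attr_offset;
	uint16_t reserved;
	uint64_t comp_mask;
} __attribute__((packed));

struct sa_mad_natural {
	struct mad_hdr        mad_hdr;
	struct rmpp_hdr       rmpp_hdr;
	struct sa_hdr_natural sa_hdr;
	uint8_t               data[200];
};

struct sa_mad_packed {
	struct mad_hdr        mad_hdr;
	struct rmpp_hdr       rmpp_hdr;
	struct sa_hdr_packed  sa_hdr;
	uint8_t               data[200];
} __attribute__((packed));

/* Offset the SA header has on the wire: directly after both headers. */
size_t wire_sa_hdr_offset(void)
{
	return sizeof(struct mad_hdr) + sizeof(struct rmpp_hdr);
}

/* Returns 1 if the packed layout places sa_hdr at the wire offset. */
int packed_matches_wire(void)
{
	assert(sizeof(struct mad_hdr) == 24 && sizeof(struct rmpp_hdr) == 12);
	return offsetof(struct sa_mad_packed, sa_hdr) == wire_sa_hdr_offset();
}

/* Padding the compiler inserts before sa_hdr when nothing is packed. */
size_t natural_padding_before_sa_hdr(void)
{
	return offsetof(struct sa_mad_natural, sa_hdr) - wire_sa_hdr_offset();
}
```

The SA header starts at byte 36, which is 4-byte but not 8-byte
aligned. On a typical x86-64 ABI natural_padding_before_sa_hdr() is
expected to be 4, because the 64-bit sm_key pulls the nested header up
to an 8-byte boundary. The result is ABI-dependent, so it is worth
printing on the target rather than assuming.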


From olivier.cozette at seanodes.com  Tue Apr 26 09:56:32 2005
From: olivier.cozette at seanodes.com (Olivier Cozette)
Date: Tue, 26 Apr 2005 18:56:32 +0200
Subject: [openib-general] kernel vapi
Message-ID: <1114534592.15717.33.camel@olivier.toulouse>

	Hello,

Sorry, but I don't know the right list for this problem, so I am posting
it here.

I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I
tried to port vping to kernel space
(ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping).

I have removed all the TCP code and all the thread-related code.

If I prevent any call to schedule() (an uninterruptible process, so the
PC is no longer usable), the server receives data from one userspace
client.

But if I call schedule() anywhere in the code, for example in
get_next_rq_cqe(), my module oopses with a problem in __switch_to.

With more debugging, I found that current->thread.io_bitmap_ptr has very
strange values (0x1, 0x3, 0x565554535251504f) or sometimes a good value
(in normal kernel space). When the value is bad, __switch_to crashes
within "memcpy(tss->io_bitmap, next->io_bitmap_ptr,
IO_BITMAP_SIZE*sizeof(u32));" or at a "mov %db6,%rax" instruction.

My module seems stable when I set current->thread.io_bitmap_ptr to NULL,
but it crashes when I type "top" or "ps aux".


So if anyone knows about this issue, please help me.

	Olivier


From halr at voltaire.com  Tue Apr 26 10:39:56 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Apr 2005 13:39:56 -0400
Subject: [openib-general] Re: [openib-commits] r2196 - in
	gen2/trunk/src/linux-kernel/infiniband: core include]
In-Reply-To: <1114529241.11580.1.camel@duffman>
References: <1114526788.1764.200.camel@localhost.localdomain>
	<1114529241.11580.1.camel@duffman>
Message-ID: <1114537000.1764.243.camel@localhost.localdomain>

On Tue, 2005-04-26 at 11:27, Tom Duffy wrote:
> On Tue, 2005-04-26 at 10:48 -0400, Hal Rosenstock wrote:
> > Hi Sean,
> > 
> > I may have missed this but how is the need for the non-natural
> > alignment accommodated now?
> > 
> > I am unable to IPoIB ping from one node to another as the SA query for
> > PathRecord is not answered as something is now wrong in the query. When
> > I add back the packing (patch below), it works again.
> 
> Ah good. It is not just me.  Please apply.

And I thought it was (just) me too :-) Applied.

-- Hal 


From akpm at osdl.org  Tue Apr 26 12:28:50 2005
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 26 Apr 2005 12:28:50 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <52mzrlsflu.fsf@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
Message-ID: <20050426122850.44d06fa6.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Libor>   Do you mean that the set/clear parameters to do_mlock()
>     Libor> are the actual flags which are set/cleared by the caller? 
>     Libor> Also, the issue remains that the flags are not reference
>     Libor> counted which is a problem if you are dealing with
>     Libor> overlapping memory regions, or even if one region ends and
>     Libor> another begins on the same page. Since the desire is to be
>     Libor> able to pin any memory that a user can malloc this is a
>     Libor> likely scenario.
> 
> Good point... we need to figure out how to handle:
> 
>     a) app registers 0x0000 through 0x17ff
>     b) app registers 0x1800 through 0x2fff
>     c) app unregisters 0x0000 through 0x17ff
>     d) the page at 0x1000 must stay pinned

The userspace library should be able to track the tree and the overlaps,
etc.  Things might become interesting when the memory is MAP_SHARED
pagecache and multiple independent processes are involved, although I guess
that'd work OK.

But afaict the problem wherein part of a page needs VM_DONTCOPY and the
other part does not cannot be solved.


From woodennickel at gmail.com  Tue Apr 26 12:57:15 2005
From: woodennickel at gmail.com (Bill Jordan)
Date: Tue, 26 Apr 2005 15:57:15 -0400
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
Message-ID: <5ebee0d1050426125764409335@mail.gmail.com>

On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:
> But afaict the problem wherein part of a page needs VM_DONTCOPY and the
> other part does not cannot be solved.

There may be an opportunity to create a solution where we can mark the
page as "copy on fork" so the child has a page with a copy of the
contents (at the time of the fork) instead of marking the page
copy-on-write.

-- 
Bill Jordan
InfiniCon Systems


From roland at topspin.com  Tue Apr 26 13:14:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 13:14:31 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org> (Andrew Morton's
	message of "Tue, 26 Apr 2005 12:28:50 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org>
Message-ID: <5264y9s3bs.fsf@topspin.com>

    Roland>     a) app registers 0x0000 through 0x17ff
    Roland>     b) app registers 0x1800 through 0x2fff
    Roland>     c) app unregisters 0x0000 through 0x17ff
    Roland>     d) the page at 0x1000 must stay pinned

    Andrew> The userspace library should be able to track the tree and
    Andrew> the overlaps, etc.  Things might become interesting when
    Andrew> the memory is MAP_SHARED pagecache and multiple
    Andrew> independent processes are involved, although I guess
    Andrew> that'd work OK.

I used to think I knew how to handle this, but in your scheme where
the kernel is doing accounting for pinned memory by marking vmas with
VM_LOCKED_KERNEL, at step c), I don't see why the kernel won't unlock
vmas covering 0x0000 through 0x1fff and credit 8K back to the
process's pinning count.

Sorry to be so dense but can you spell out what you think should
happen at steps a), b) and c) above?

    Andrew> But afaict the problem wherein part of a page needs
    Andrew> VM_DONTCOPY and the other part does not cannot be solved.

Yes, I agree.  If an app wants to register half a page and pass the
other half to a child process, I think the only answer is "don't do
that then."

 - R.


From timur.tabi at ammasso.com  Tue Apr 26 13:18:40 2005
From: timur.tabi at ammasso.com (Timur Tabi)
Date: Tue, 26 Apr 2005 15:18:40 -0500
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <5264y9s3bs.fsf@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>	<52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com>	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com>
Message-ID: <426EA220.6010007@ammasso.com>

Roland Dreier wrote:

> Yes, I agree.  If an app wants to register half a page and pass the
> other half to a child process, I think the only answer is "don't do
> that then."

How can the app know that, though?  It would have to allocate I/O buffers with knowledge 
of page boundaries.  Today, the apps just malloc() a bunch of memory and pay no attention 
to whether the beginning or the end of the buffer shares a page with some other, unrelated 
object.  We may as well tell the app that it needs to page-align all I/O buffers.

My point is that we can't just simply say, "Don't do that".  Some entity (the kernel, 
libraries, whatever) should be able to tell the app that its usage of memory is going to 
break in some unpredictable way.

-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13
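
For reference, page-aligning an I/O buffer from userspace could look
roughly like the sketch below. It assumes a POSIX system with
posix_memalign() and sysconf(); alloc_io_buffer() and
round_up_to_page() are hypothetical helper names, not part of any
verbs library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
size_t round_up_to_page(size_t len, size_t page)
{
	return (len + page - 1) / page * page;
}

/*
 * Allocate a buffer that starts on a page boundary and covers whole
 * pages, so no unrelated malloc() object can share its first or last
 * page.  Returns NULL on failure; *alloc_len receives the padded size.
 * Release the buffer with free().
 */
void *alloc_io_buffer(size_t len, size_t *alloc_len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	size_t padded;

	if (page <= 0 || len == 0)
		return NULL;
	assert((page & (page - 1)) == 0);	/* page sizes are powers of two */

	padded = round_up_to_page(len, (size_t)page);
	if (posix_memalign(&buf, (size_t)page, padded) != 0)
		return NULL;
	if (alloc_len)
		*alloc_len = padded;
	return buf;
}
```

The cost is up to one page of slack per buffer, which is the trade-off
behind the "page-align all I/O buffers" remark above.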


From akpm at osdl.org  Tue Apr 26 13:32:29 2005
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 26 Apr 2005 13:32:29 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <5264y9s3bs.fsf@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com>
Message-ID: <20050426133229.416a5e66.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Roland>     a) app registers 0x0000 through 0x17ff
>     Roland>     b) app registers 0x1800 through 0x2fff
>     Roland>     c) app unregisters 0x0000 through 0x17ff
>     Roland>     d) the page at 0x1000 must stay pinned
> 
>     Andrew> The userspace library should be able to track the tree and
>     Andrew> the overlaps, etc.  Things might become interesting when
>     Andrew> the memory is MAP_SHARED pagecache and multiple
>     Andrew> independent processes are involved, although I guess
>     Andrew> that'd work OK.
> 
> I used to think I knew how to handle this, but in your scheme where
> the kernel is doing accounting for pinned memory by marking vmas with
> VM_LOCKED_KERNEL, at step c), I don't see why the kernel won't unlock
> vmas covering 0x0000 through 0x1fff and credit 8K back to the
> process's pinning count.
> 
> Sorry to be so dense but can you spell out what you think should
> happen at steps a), b) and c) above?

Well I was vaguely proposing that the userspace library keep track of the
byteranges and the underlying page states.  So in the above scenario
userspace would leave the page at 0x1000 registered until all
registrations against that page have been undone.
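
One way a userspace library might track this is a per-page reference
count. The sketch below is an illustration under assumptions, not the
code of any existing library: it uses a fixed toy table, 4K pages as in
the example addresses above, and hypothetical names (pin_table,
pin_range, unpin_range, page_is_pinned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* 4K pages, as in the example addresses */
#define TOY_MAX_PAGES  1024	/* toy address space of 4 MiB */

struct pin_table {
	unsigned int refs[TOY_MAX_PAGES];	/* registrations per page */
};

/*
 * Register the byte range [start, last].  Returns how many pages went
 * from unpinned to pinned, i.e. how many the kernel would newly pin.
 */
size_t pin_range(struct pin_table *t, uintptr_t start, uintptr_t last)
{
	size_t newly = 0;
	uintptr_t pg;

	assert(start <= last && (last >> TOY_PAGE_SHIFT) < TOY_MAX_PAGES);
	for (pg = start >> TOY_PAGE_SHIFT; pg <= (last >> TOY_PAGE_SHIFT); pg++)
		if (t->refs[pg]++ == 0)
			newly++;
	return newly;
}

/*
 * Unregister [start, last].  Returns how many pages dropped to zero
 * registrations and may really be unpinned.
 */
size_t unpin_range(struct pin_table *t, uintptr_t start, uintptr_t last)
{
	size_t released = 0;
	uintptr_t pg;

	assert(start <= last && (last >> TOY_PAGE_SHIFT) < TOY_MAX_PAGES);
	for (pg = start >> TOY_PAGE_SHIFT; pg <= (last >> TOY_PAGE_SHIFT); pg++) {
		assert(t->refs[pg] > 0);	/* unbalanced unregister */
		if (--t->refs[pg] == 0)
			released++;
	}
	return released;
}

int page_is_pinned(const struct pin_table *t, uintptr_t addr)
{
	return t->refs[addr >> TOY_PAGE_SHIFT] > 0;
}
```

Walking through steps a) to c): registering 0x0000-0x17ff pins two
pages, registering 0x1800-0x2fff pins only one new page, and
unregistering 0x0000-0x17ff releases only page 0, so the page at 0x1000
stays pinned as required by d).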


From akpm at osdl.org  Tue Apr 26 13:37:52 2005
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 26 Apr 2005 13:37:52 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <426EA220.6010007@ammasso.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
Message-ID: <20050426133752.37d74805.akpm@osdl.org>

Timur Tabi <timur.tabi at ammasso.com> wrote:
>
> Roland Dreier wrote:
> 
>  > Yes, I agree.  If an app wants to register half a page and pass the
>  > other half to a child process, I think the only answer is "don't do
>  > that then."
> 
>  How can the app know that, though?  It would have to allocate I/O buffers with knowledge 
>  of page boundaries.  Today, the apps just malloc() a bunch of memory and pay no attention 
>  to whether the beginning or the end of the buffer shares a page with some other, unrelated 
>  object.  We may as well tell the app that it needs to page-align all I/O buffers.
> 
>  My point is that we can't just simply say, "Don't do that".  Some entity (the kernel, 
>  libraries, whatever) should be able to tell the app that its usage of memory is going to 
>  break in some unpredictable way.

Our point is that contemporary microprocessors cannot electrically do what
you want them to do!

Now, conceeeeeeiveably the kernel could keep track of the state of the
pages down to the byte level, and could keep track of all COWed pages and
could look at faulting addresses at the byte level and could copy sub-page
ranges by hand from one process's address space into another process's
after I/O completion.  I don't think we want to do that.

Methinks your specification is busted.


From roland at topspin.com  Tue Apr 26 14:23:28 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 14:23:28 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050426133229.416a5e66.akpm@osdl.org> (Andrew Morton's
	message of "Tue, 26 Apr 2005 13:32:29 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <20050426133229.416a5e66.akpm@osdl.org>
Message-ID: <521x8xs04v.fsf@topspin.com>

    Andrew> Well I was vaguely proposing that the userspace library
    Andrew> keep track of the byteranges and the underlying page
    Andrew> states.  So in the above scenario userspace would leave
    Andrew> the page at 0x1000 registered until all registrations
    Andrew> against that page have been undone.

OK, I already have code in userspace that keeps reference counts for
overlapping regions, etc.  However I'm not sure how to tie this in
with reliable accounting of pinned memory -- we don't want malicious
userspace code to be able to fool the accounting, right?

So I'm still trying to puzzle out what to do.  I don't want to keep a
complicated data structure in the kernel keeping track of what memory
has been registered.  Right now, I just keep a list of structs, one
for each region, and when a process dies, I just go through region by
region and do a put_page() to balance off the get_user_pages().

However I don't see how to make it work if I put the reference
counting for overlapping regions in userspace but still want mlock()
accounting in the kernel.  If a buggy/malicious app does:

    a) register from 0x0000 to 0x2fff
    b) register from 0x1000 to 0x1fff
    c) unregister from 0x0000 to 0x2fff

then it seems the kernel is screwed unless it counts how many times a
vma has been pinned.  And adding a pin_count member to vm_area_struct
seems like a pretty damn major step.

We definitely have to make sure that userspace is never able to
unpin a page that is still registered with RDMA hardware, because that
can lead to DMA into memory that someone else owns.  On the other
hand, we don't want userspace to be able to defeat resource accounting
by tricking the kernel into keeping page_count elevated after it
credits the memory back to a process's limit on locked pages.

The limit on the number of locked pages seems like a natural thing to
check against, but perhaps we need a different limit for the number of
pages pinned for use by RDMA hardware.  Sort of the same way that
there's a separate limit on the number of in-flight aios.

 - R.
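
A separate limit for RDMA-pinned pages could be modelled as a simple
charge/uncharge counter. This is a userspace sketch of the idea only;
struct pin_account, pin_charge() and pin_uncharge() are hypothetical
names, and nothing here reflects an existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Per-process accounting of pages pinned for RDMA (hypothetical). */
struct pin_account {
	size_t limit;	/* maximum pages this process may pin */
	size_t pinned;	/* pages currently charged; always <= limit */
};

/* Charge npages against the limit; fail without side effects if over. */
int pin_charge(struct pin_account *a, size_t npages)
{
	if (npages > a->limit - a->pinned)
		return -ENOMEM;
	a->pinned += npages;
	return 0;
}

/* Credit npages back; only valid for pages that were charged before. */
void pin_uncharge(struct pin_account *a, size_t npages)
{
	assert(npages <= a->pinned);	/* crediting more than charged is a bug */
	a->pinned -= npages;
}
```

To keep the accounting honest, the charge would have to be made for
pages that actually become pinned, and the credit only when the last
reference to a page is dropped, which is where the counting problem
above comes back in.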


From tduffy at sun.com  Tue Apr 26 15:36:33 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 15:36:33 -0700
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
Message-ID: <1114554993.22383.7.camel@duffman>

Has anybody seen this type of error when doing SDP?

 ERR: : VMA lock <508000:65536> error <-12> <16:0:8>
WARN: <1> <0404:2480> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <51c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <530000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <544000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <558000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <56c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <580000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <594000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <5a8000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <5bc000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <5d0000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <5e4000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <5f8000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <60c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <620000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <634000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <648000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <65c000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <670000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
 ERR: : VMA lock <684000:65536> error <-12> <16:0:8>
WARN: <1> <050e:11b1> Error <-12> IOCB lock <65536:0>
WARN: <1> <ff01:3b01> CM state <0> event <9> error <-2>

ttcp reports:

[root at sins-stinger-10 ~]# ./ttcp -r -l 65536 -a 20
ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
ttcp-r: socket
ttcp-r: accept from 192.168.0.26
ttcp-r: Event error <-12> <5275648>
ttcp-r: 0 bytes in 2.50 real seconds = 0.00 Mbit/sec +++
ttcp-r: 2 I/O calls, usec/call = 1248114.00, calls/sec = 0.80
ttcp-r: user: 0 sys: 41994 total: 41994 real: 2496228 (microseconds)

[root at flopteron2 ~]# ./ttcp -t -l 65536 -n 100000 -a 20 192.168.0.233
ttcp-t: buflen = 65536 nbuf = 100000 align = 16384/0 port = 5001 192.168.0.233
ttcp-t: socket
ttcp-t: connect
ttcp-t: Event error <-12> <5275648>
ttcp-t: 0 bytes in 2.56 real seconds = 0.00 Mbit/sec +++
ttcp-t: 2 I/O calls, usec/call = 1282312.00, calls/sec = 0.78
ttcp-t: user: 0 sys: 42993 total: 42993 real: 2564624 (microseconds)

BTW, this is with both ends on 2.6.11 with stock openib revision 2214, not
my modified 2.6.12-rc3 version.  Also, I am using opensm for my SM (back
to back configuration).

-tduffy

From tduffy at sun.com  Tue Apr 26 16:05:22 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 16:05:22 -0700
Subject: [openib-general] .org pavilion spot in LW 2005 in SF
In-Reply-To: <1110826633.21708.8.camel@duffman>
References: <1110826633.21708.8.camel@duffman>
Message-ID: <1114556723.22383.12.camel@duffman>

On Mon, 2005-03-14 at 10:57 -0800, Tom Duffy wrote:
> Duncan,
> 
> Hello, I am contacting you as a representative from the OpenIB.org
> alliance.  We are a non-profit organization that is dedicated to
> providing an open-source, multi-vendor, best-of-breed InfiniBand stack
> for the Linux kernel as well as all the related userland libraries and
> utilities.
> 
> Our website is http://www.openib.org.  All of our projects are available
> under the GPL as well as a BSD license.
> 
> We would like a slot in the .org pavilion for LinuxWorld 2005 in San
> Francisco.  The booth will have demos of InfiniBand in action using the
> recently accepted code in the 2.6.11 kernel running on multiple vendors'
> hardware.
> 
> Please "reply all" as I have CC'ed the developer list for OpenIB.

Duncan, are you still the person involved with setting up the .org
pavilion at LinuxWorld?  If not, could you please forward my message to
the appropriate person?

Thanks,

-tduffy

From tduffy at sun.com  Tue Apr 26 16:18:37 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 16:18:37 -0700
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
In-Reply-To: <20050426160835.A10906@topspin.com>
References: <1114554993.22383.7.camel@duffman>
	<20050426160835.A10906@topspin.com>
Message-ID: <1114557517.22383.19.camel@duffman>

On Tue, 2005-04-26 at 16:08 -0700, Libor Michalek wrote:
>   limit memorylocked unlimited

or in a real shell:

$ ulimit -l unlimited

;-)
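
For anyone else seeing error <-12> (ENOMEM) on the IOCB/VMA lock: here
is a rough sketch, assuming the cause is the locked-memory limit as
above, of checking that limit from C before starting a transfer.
getrlimit() and RLIMIT_MEMLOCK are standard; the helper name
memlock_allows() is made up for illustration and is not part of ttcp
or SDP.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: returns 1 if the soft RLIMIT_MEMLOCK limit
 * permits locking `bytes` bytes, 0 if it does not, and -1 if the
 * limit cannot be read.
 */
int memlock_allows(unsigned long long bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)	/* "ulimit -l unlimited" */
		return 1;
	return bytes <= (unsigned long long)rl.rlim_cur;
}
```

A caller could check memlock_allows(65536) for a single 64K buffer and
print a hint about "ulimit -l" when it returns 0.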


From tduffy at sun.com  Tue Apr 26 16:51:40 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 16:51:40 -0700
Subject: [openib-general] [PATCHv5][SDP] Allow SDP to compile on 2.6.12-rc3
In-Reply-To: <1114210652.5519.1.camel@duffman>
References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>
	<1114126674.6858.31.camel@duffman>  <1114210652.5519.1.camel@duffman>
Message-ID: <1114559500.22383.27.camel@duffman>

So, here is a version that is tested working on 2.6.12-rc3.  I have
combined the inet_sock struct with the sdp_opt struct.

I have not added #ifdefs to keep this building on 2.6.11, since the code
has changed too much for that to be worthwhile.  So, I don't expect this
patch to be applied to openib trunk until 2.6.12-final comes out.  Please
check that I didn't make any more dumb mistakes, and please test on
your configurations.
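
To make the combined-struct layout concrete, here is a small userspace
sketch (not part of the patch).  The three structs are tiny stand-ins
for the kernel types, and their fields other than the first members
are invented for illustration; only sdp_sk() and sk_sdp() mirror the
helpers added in sdp_conn.h below.  The casts are valid only because
each struct has the next one as its first member, which the static
assertions check at compile time.

```c
#include <stddef.h>

/* Stand-ins for the kernel structures; the real ones are much larger. */
struct sock      { int sk_err; };
struct inet_sock { struct sock sk;      /* must be first */ int id; };
struct sdp_opt   { struct inet_sock in; /* must be first */ int hashent; };

static inline struct sdp_opt *sdp_sk(struct sock *sk)
{
	return (struct sdp_opt *)sk;
}

static inline struct sock *sk_sdp(struct sdp_opt *conn)
{
	return (struct sock *)conn;
}

/* The casts above rely on both embedded structs sitting at offset 0. */
_Static_assert(offsetof(struct sdp_opt, in) == 0, "inet_sock not first");
_Static_assert(offsetof(struct inet_sock, sk) == 0, "sock not first");
```

With this layout a single allocation covers both the socket and the
connection state, so neither needs a pointer to the other.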

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c	(working copy)
@@ -1236,7 +1236,7 @@ int sdp_event_recv(struct sdp_opt *conn,
 			 * If data was consumed by the protocol, signal
 			 * the user.
 			 */
-			sdp_inet_wake_recv(conn->sk, conn->byte_strm);
+			sdp_inet_wake_recv(sk_sdp(conn), conn->byte_strm);
 	/*
 	 * It's possible that a new recv buffer advertisment opened up the 
 	 * recv window and we can flush buffered send data
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c	(working copy)
@@ -101,9 +101,9 @@ module_param(sdp_debug_level, int, 0);
  */
 void sdp_inet_wake_send(struct sock *sk)
 {
-	struct sdp_opt *conn;
+	struct sdp_opt *conn = sdp_sk(sk);
 
-	if (!sk || !(conn = SDP_GET_CONN(sk)))
+	if (sk == NULL)
 		return;
 
 	if (sk->sk_socket && test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
@@ -312,7 +312,7 @@ static int sdp_inet_release(struct socke
 	}
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "RELEASE: linger <%d:%lu> data <%d:%d>",
 		     sock_flag(sk, SOCK_LINGER), sk->sk_lingertime,
@@ -429,6 +429,7 @@ done:
 	sock_orphan(sk);
 	sdp_conn_unlock(conn);
 	sdp_conn_put(conn);
+	sock_put(sk);
 
 	return 0;
 }
@@ -446,7 +447,7 @@ static int sdp_inet_bind(struct socket *
 	int result;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "BIND: family <%d> addr <%08x:%04x>",
 		     addr->sin_family, addr->sin_addr.s_addr, addr->sin_port);
@@ -537,7 +538,7 @@ static int sdp_inet_connect(struct socke
 	int result;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "CONNECT: family <%d> addr <%08x:%04x>",
 		     addr->sin_family, addr->sin_addr.s_addr, addr->sin_port);
@@ -699,7 +700,7 @@ static int sdp_inet_listen(struct socket
 	int result;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "LISTEN: addr <%08x:%04x> backlog <%04x>",
 		     conn->src_addr, conn->src_port, backlog);
@@ -760,7 +761,7 @@ static int sdp_inet_accept(struct socket
 	long timeout;
 
 	listen_sk = listen_sock->sk;
-	listen_conn = SDP_GET_CONN(listen_sk);
+	listen_conn = sdp_sk(listen_sk);
 
 	sdp_dbg_ctrl(listen_conn, "ACCEPT: addr <%08x:%04x>",
 		     listen_conn->src_addr, listen_conn->src_port);
@@ -816,7 +817,7 @@ static int sdp_inet_accept(struct socket
 				goto listen_done;
 			}
 		} else {
-			accept_sk = accept_conn->sk;
+			accept_sk = sk_sdp(accept_conn);
 
 			switch (accept_conn->istate) {
 			case SDP_SOCK_ST_ACCEPTED:
@@ -913,7 +914,7 @@ static int sdp_inet_getname(struct socke
 	struct sdp_opt *conn;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "GETNAME: src <%08x:%04x> dst <%08x:%04x>",
 		     conn->src_addr, conn->src_port, 
@@ -953,7 +954,7 @@ static unsigned int sdp_inet_poll(struct
 	 * recheck the falgs on being woken.
 	 */
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_data(conn, "POLL: socket flags <%08lx>", sock->flags);
 
@@ -1040,7 +1041,7 @@ static int sdp_inet_ioctl(struct socket 
 	int value;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "IOCTL: command <%d> argument <%08lx>", cmd, arg);
 	/*
@@ -1162,7 +1163,7 @@ static int sdp_inet_setopt(struct socket
 	int result = 0;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "SETSOCKOPT: level <%d> option <%d>", 
 		     level, optname);
@@ -1229,7 +1230,7 @@ static int sdp_inet_getopt(struct socket
 	int len;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_ctrl(conn, "GETSOCKOPT: level <%d> option <%d>",
 		     level, optname);
@@ -1287,7 +1288,7 @@ static int sdp_inet_shutdown(struct sock
 	int result = 0;
 	struct sdp_opt *conn;
 
-	conn = SDP_GET_CONN(sock->sk);
+	conn = sdp_sk(sock->sk);
 
 	sdp_dbg_ctrl(conn, "SHUTDOWN: flag <%d>", flag);
 	/*
@@ -1422,7 +1423,7 @@ static int sdp_inet_create(struct socket
 	sock->ops = &lnx_stream_ops;
 	sock->state = SS_UNCONNECTED;
 
-	sock_graft(conn->sk, sock);
+	sock_graft(sk_sdp(conn), sock);
 
 	conn->pid = current->pid;
 
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c	(working copy)
@@ -185,7 +185,7 @@ int sdp_event_read(struct sdp_opt *conn,
 			 */
 			conn->byte_strm += result;
 
-			sdp_inet_wake_recv(conn->sk, conn->byte_strm);
+			sdp_inet_wake_recv(sk_sdp(conn), conn->byte_strm);
 		} else {
 			if (result < 0)
 				sdp_dbg_warn(conn, "Error <%d> receiving buff",
@@ -229,7 +229,7 @@ int sdp_event_read(struct sdp_opt *conn,
 
 		iocb->flags &= ~(SDP_IOCB_F_ACTIVE | SDP_IOCB_F_RDMA_R);
 
-		if (conn->sk->sk_rcvlowat > iocb->post)
+		if (sk_sdp(conn)->sk_rcvlowat > iocb->post)
 			break;
 
 		iocb = sdp_iocb_q_get_head(&conn->r_pend);
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c	(working copy)
@@ -1792,7 +1792,7 @@ static int sdp_inet_write_cancel(struct 
 	/*
 	 * lock the socket while we operate.
 	 */
-	conn = SDP_GET_CONN(si->sock->sk);
+	conn = sdp_sk(si->sock->sk);
 	sdp_conn_lock(conn);
 
 	sdp_dbg_ctrl(conn, "Cancel Write IOCB. <%08x:%04x> <%08x:%04x>",
@@ -2002,7 +2002,7 @@ int sdp_send_flush(struct sdp_opt *conn)
 	/*
 	 * see if there is enough buffer to wake/notify writers
 	 */
-	sdp_inet_wake_send(conn->sk); /*  conn->sk->write_space(conn->sk); */
+	sdp_inet_wake_send(sk_sdp(conn)); /*  conn->sk->write_space(conn->sk); */
 
 	return 0;
 done:
@@ -2031,7 +2031,7 @@ int sdp_inet_send(struct kiocb *req, str
 	oob = (msg->msg_flags & MSG_OOB);
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_data(conn, "send state <%04x> size <%Zu> flags <%08x>",
 		     conn->state, size, msg->msg_flags);
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c	(working copy)
@@ -40,6 +40,7 @@
 void sdp_cm_actv_error(struct sdp_opt *conn, int error)
 {
 	int result;
+	struct sock *sk;
 	/*
 	 * error value is positive error.
 	 *
@@ -95,11 +96,12 @@ void sdp_cm_actv_error(struct sdp_opt *c
 	conn->shutdown = SHUTDOWN_MASK;
 	conn->send_buf = 0;
 
-	if (conn->sk->sk_socket)
-		conn->sk->sk_socket->state = SS_UNCONNECTED;
+	sk = sk_sdp(conn);
+	if (sk->sk_socket)
+		sk->sk_socket->state = SS_UNCONNECTED;
 
 	sdp_iocb_q_cancel_all(conn, (0 - error));
-	sdp_inet_wake_error(conn->sk);
+	sdp_inet_wake_error(sk);
 	return;
 }
 
@@ -117,7 +119,7 @@ static int sdp_cm_actv_establish(struct 
 		     conn->src_addr, conn->src_port, 
 		     conn->dst_addr, conn->dst_port);
 
-	sk = conn->sk;
+	sk = sk_sdp(conn);
 
 	qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL);
 	if (!qp_attr)
@@ -550,7 +552,7 @@ int sdp_cm_connect(struct sdp_opt *conn)
 					  
 	result = sdp_link_path_lookup(htonl(conn->dst_addr),
 				      htonl(conn->src_addr),
-				      conn->sk->sk_bound_dev_if,
+				      sk_sdp(conn)->sk_bound_dev_if,
 				      sdp_cm_path_complete,
 				      conn,
 				      &conn->plid);
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	(working copy)
@@ -312,7 +312,7 @@ int sdp_inet_port_get(struct sdp_opt *co
 	static s32 rover = -1;
 	unsigned long flags;
 
-	sk = conn->sk;
+	sk = sk_sdp(conn);
 	/*
 	 * lock table
 	 */
@@ -323,7 +323,7 @@ int sdp_inet_port_get(struct sdp_opt *co
 	if (port > 0) {
 		for (look = dev_root_s.bind_list, port_ok = 1;
 		     look; look = look->bind_next) {
-			srch = look->sk;
+			srch = sk_sdp(look);
 			/*
 			 * 1) same port
 			 * 2) linux force reuse is off.
@@ -756,17 +756,6 @@ void sdp_conn_destruct(struct sdp_opt *c
 
 	if (dump)
 		sdp_conn_state_dump(conn);
-	/*
-	 * free the OS socket structure
-	 */
-	if (!conn->sk)
-		sdp_dbg_warn(conn, "destruct, no socket! continuing.");
-	else {
-		sk_free(conn->sk);
-		conn->sk = NULL;
-	}
-
-	kmem_cache_free(dev_root_s.conn_cache, conn);
 }
 
 /*
@@ -1112,6 +1101,12 @@ error_attr:
 	return result;
 }
 
+static struct proto sdp_sk_proto = {
+	.name		= "SDP",
+	.owner		= THIS_MODULE,
+	.obj_size	= sizeof(struct sdp_opt),
+};
+
 /*
  * sdp_conn_alloc - allocate a new socket, and init.
  */
@@ -1121,8 +1116,7 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	struct sock *sk;
 	int result;
 
-	sk = sk_alloc(dev_root_s.proto, priority, 
-		      sizeof(struct inet_sock), dev_root_s.sock_cache);
+	sk = sk_alloc(dev_root_s.proto, priority, &sdp_sk_proto, 1);
 	if (!sk) {
 		sdp_dbg_warn(NULL, "socket alloc error for protocol. <%d:%d>",
 			     dev_root_s.proto, priority);
@@ -1146,23 +1140,8 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	sk->sk_state_change = sdp_inet_wake_generic;
 	sk->sk_data_ready   = sdp_inet_wake_recv;
 	sk->sk_error_report = sdp_inet_wake_error;
-	/*
-	 * Allocate must be called from process context, since QP
-	 * create/modifies must be in that context.
-	 */
-	conn = kmem_cache_alloc(dev_root_s.conn_cache, priority);
-	if (!conn) {
-		sdp_dbg_warn(conn, "connection alloc error. <%d>", priority);
-		result = -ENOMEM;
-		goto error;
-	}
 
-	memset(conn, 0, sizeof(struct sdp_opt));
-	/*
-	 * The STRM interface specific data is map/cast over the TCP specific
-	 * area of the sock.
-	 */
-	SDP_SET_CONN(sk, conn);
+	conn = sdp_sk(sk);
 	SDP_CONN_ST_INIT(conn);
 
 	conn->cm_id       = NULL;
@@ -1179,7 +1158,6 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	conn->parent      = NULL;
 
 	conn->pid       = 0;
-	conn->sk        = sk;
 	conn->hashent   = SDP_DEV_SK_INVALID;
 	conn->istate    = SDP_SOCK_ST_CLOSED;
 	conn->flags     = 0;
@@ -1286,7 +1264,7 @@ struct sdp_opt *sdp_conn_alloc(int prior
 		sdp_dbg_warn(conn, "Error <%d> conn table insert <%d:%d>",
 			     result, dev_root_s.sk_entry,
 			     dev_root_s.sk_size);
-		goto error_conn;
+		goto error;
 	}
 	/*
 	 * set reference
@@ -1300,8 +1278,6 @@ struct sdp_opt *sdp_conn_alloc(int prior
 	 * done
 	 */
 	return conn;
-error_conn:
-	kmem_cache_free(dev_root_s.conn_cache, conn);
 error:
 	sk_free(sk);
 	return NULL;
@@ -1470,7 +1446,7 @@ int sdp_proc_dump_conn_data(char *buffer
 			continue;
 
 		conn = dev_root_s.sk_array[counter];
-		sk = conn->sk;
+		sk = sk_sdp(conn);
 
 		offset += sprintf((buffer + offset), SDP_PROC_CONN_DATA_FORM,
 				  conn->hashent,
@@ -1956,26 +1932,13 @@ int sdp_conn_table_init(int proto_family
 		goto error_iocb;
 	}
 
-	dev_root_s.conn_cache = kmem_cache_create("sdp_conn",
-						  sizeof(struct sdp_opt),
-						  0, SLAB_HWCACHE_ALIGN,
-						  NULL, NULL);
-	if (!dev_root_s.conn_cache) {
-		sdp_warn("Failed to initialize connection cache.");
+	sdp_dbg_init("Registering socket proto.");
+	if (proto_register(&sdp_sk_proto, 1) != 0) {
+		sdp_warn("Failed to register sdp proto.");
 		result = -ENOMEM;
 		goto error_conn;
 	}
 
-        dev_root_s.sock_cache = kmem_cache_create("sdp_sock",
-						  sizeof(struct inet_sock), 
-						  0, SLAB_HWCACHE_ALIGN,
-						  NULL, NULL);
-        if (!dev_root_s.sock_cache) {
-		sdp_warn("Failed to initialize sock cache.");
-		result = -ENOMEM;
-		goto error_sock;
-        }
-
 	/*
 	 * start listening
 	 */
@@ -2002,9 +1965,7 @@ int sdp_conn_table_init(int proto_family
 error_listen:
 	(void)ib_destroy_cm_id(dev_root_s.listen_id);
 error_cm_id:
-	kmem_cache_destroy(dev_root_s.sock_cache);
-error_sock:
-	kmem_cache_destroy(dev_root_s.conn_cache);
+	proto_unregister(&sdp_sk_proto);
 error_conn:
 	sdp_main_iocb_cleanup();
 error_iocb:
@@ -2045,14 +2006,7 @@ int sdp_conn_table_clear(void)
 	 * delete IOCB table
 	 */
 	sdp_main_iocb_cleanup();
-	/*
-	 * delete conn cache
-	 */
-	kmem_cache_destroy(dev_root_s.conn_cache);
-	/*
-	 * delete sock cache
-	 */
-	kmem_cache_destroy(dev_root_s.sock_cache);
+	proto_unregister(&sdp_sk_proto);
 	/*
 	 * stop listening
 	 */
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c	(working copy)
@@ -780,7 +780,7 @@ static int sdp_recv_buff_iocb_pending(st
 	 */
 	if (!iocb->len ||
 	    (!conn->src_recv &&
-	     !(conn->sk->sk_rcvlowat > iocb->post))) {
+	     !(sk_sdp(conn)->sk_rcvlowat > iocb->post))) {
 		/*
 		 * complete IOCB
 		 */
@@ -835,7 +835,7 @@ int sdp_recv_buff(struct sdp_opt *conn, 
 	 */
 	if (buff->flags & SDP_BUFF_F_OOB_PEND) {
 		conn->rcv_urg_cnt++;
-		sdp_inet_wake_urg(conn->sk);
+		sdp_inet_wake_urg(sk_sdp(conn));
 	}
 	/*
 	 * loop while there are available IOCB's, break if there is no
@@ -933,7 +933,7 @@ static int sdp_inet_read_cancel(struct k
 	/*
 	 * lock the socket while we operate.
 	 */
-	conn = SDP_GET_CONN(si->sock->sk);
+	conn = sdp_sk(si->sock->sk);
 	sdp_conn_lock(conn);
 
 	sdp_dbg_ctrl(conn, "Cancel Read IOCB. <%08x:%04x> <%08x:%04x>",
@@ -1086,7 +1086,7 @@ static int sdp_inet_recv_urg(struct sock
 	int result = 0;
 	u8 value;
 
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	if (sock_flag(sk, SOCK_URGINLINE) || !conn->rcv_urg_cnt)
 		return -EINVAL;
@@ -1173,7 +1173,7 @@ int sdp_inet_recv(struct kiocb  *req, st
 	struct sdpc_buff_q peek_queue;
 
 	sk = sock->sk;
-	conn = SDP_GET_CONN(sk);
+	conn = sdp_sk(sk);
 
 	sdp_dbg_data(conn, "state <%08x> size <%Zu> pending <%d> falgs <%08x>",
 		     conn->state, size, conn->byte_strm, flags);
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c	(working copy)
@@ -294,8 +294,8 @@ int sdp_wall_recv_close(struct sdp_opt *
 		/*
 		 * async notification. POLL_HUP on full duplex close only.
 		 */
-		sdp_inet_wake_generic(conn->sk);
-		sk_wake_async(conn->sk, 1, POLL_IN);
+		sdp_inet_wake_generic(sk_sdp(conn));
+		sk_wake_async(sk_sdp(conn), 1, POLL_IN);
 
 		break;
 	}
@@ -327,8 +327,8 @@ int sdp_wall_recv_closing(struct sdp_opt
 	/*
 	 * async notification. POLL_HUP on full duplex close only.
 	 */
-	sdp_inet_wake_generic(conn->sk);
-	sk_wake_async(conn->sk, 1, POLL_HUP);
+	sdp_inet_wake_generic(sk_sdp(conn));
+	sk_wake_async(sk_sdp(conn), 1, POLL_HUP);
 
 	return 0;
 }
@@ -368,7 +368,7 @@ int sdp_wall_recv_abort(struct sdp_opt *
 	 */
 	sdp_iocb_q_cancel_all(conn, -ECONNRESET);
 
-	sdp_inet_wake_error(conn->sk);
+	sdp_inet_wake_error(sk_sdp(conn));
 
 	return 0;
 }
@@ -402,7 +402,7 @@ void sdp_wall_recv_drop(struct sdp_opt *
 		break;
 	case SDP_SOCK_ST_CLOSING:
 		conn->istate = SDP_SOCK_ST_CLOSED;
-		sdp_inet_wake_generic(conn->sk);
+		sdp_inet_wake_generic(sk_sdp(conn));
 
 		break;
 	default:
@@ -418,7 +418,7 @@ void sdp_wall_recv_drop(struct sdp_opt *
 		 */
 		sdp_iocb_q_cancel_all(conn, -ECONNRESET);
 
-		sdp_inet_wake_error(conn->sk);
+		sdp_inet_wake_error(sk_sdp(conn));
 
 		break;
 	}
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h	(working copy)
@@ -146,16 +146,18 @@ enum sdp_mode {
  */
 #define SDP_MSG_EVENT_TABLE_SIZE 0x20
 
-/*
- * connection handle within a socket.
- */
-#define SDP_GET_CONN(sk) \
-       (*((struct sdp_opt **)&(sk)->sk_protinfo))
-#define SDP_SET_CONN(sk, conn) \
-       (*((struct sdp_opt **)&(sk)->sk_protinfo) = (conn))
+static inline struct sdp_opt *sdp_sk(struct sock *sk)
+{
+	return (struct sdp_opt *)sk;
+}
+
+static inline struct sock *sk_sdp(struct sdp_opt *conn)
+{
+	return (struct sock *)conn;
+}
 
 #define SDP_CONN_SET_ERR(conn, val) \
-        ((conn)->error = (conn)->sk->sk_err = (val))
+        ((conn)->error = sk_sdp(conn)->sk_err = (val))
 #define SDP_CONN_GET_ERR(conn) \
         ((conn)->error)
 
@@ -214,10 +216,15 @@ struct sdp_conn_lock {
  * SDP Connection structure.
  */
 struct sdp_opt {
+	/*
+	 * inet_sock must be first member of sdp_sock
+	 * NOTE: this depends on inet_sock having struct sock as its
+	 * first member
+	 */
+	struct inet_sock in;
 	__s32 hashent;     /* connection ID/hash entry */
 	atomic_t refcnt;   /* connection reference count. */
 
-	struct sock *sk;
 	/*
 	 * SDP specific data
 	 */
@@ -530,7 +537,7 @@ static inline int sdp_conn_error(struct 
 	 * lock, however the linux socket error, needs to be xchg'd since the
 	 * SO_ERROR getsockopt happens outside of the connection lock.
 	 */
-	int error = xchg(&conn->sk->sk_err, 0);
+	int error = xchg(&sk_sdp(conn)->sk_err, 0);
 	SDP_CONN_SET_ERR(conn, 0);
 
 	return -error;
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	(working copy)
@@ -40,6 +40,7 @@
 void sdp_cm_pass_error(struct sdp_opt *conn, int error)
 {
 	int result;
+	struct sock *sk;
 
 	sdp_dbg_ctrl(conn, 
 		     "passive error. src <%08x:%04x> dst <%08x:%04x> <%d>",
@@ -59,11 +60,12 @@ void sdp_cm_pass_error(struct sdp_opt *c
 	conn->shutdown = SHUTDOWN_MASK;
 	conn->send_buf = 0;
 
-	if (conn->sk->sk_socket)
-		conn->sk->sk_socket->state = SS_UNCONNECTED;
+	sk = sk_sdp(conn);
+	if (sk->sk_socket)
+		sk->sk_socket->state = SS_UNCONNECTED;
 
 	sdp_iocb_q_cancel_all(conn, (0 - error));
-	sdp_inet_wake_error(conn->sk);
+	sdp_inet_wake_error(sk);
 }
 
 /*
@@ -130,7 +132,7 @@ int sdp_cm_pass_establish(struct sdp_opt
 		goto error;
 	}
 
-	sdp_inet_wake_send(conn->sk);
+	sdp_inet_wake_send(sk_sdp(conn));
 
         kfree(qp_attr);
 	return 0;
@@ -320,8 +322,8 @@ static int sdp_cm_listen_lookup(struct s
 	/*
 	 * check backlog
 	 */
-	listen_sk = listen_conn->sk;
-	sk = conn->sk;
+	listen_sk = sk_sdp(listen_conn);
+	sk = sk_sdp(conn);
 
 	if (listen_conn->backlog_cnt > listen_conn->backlog_max) {
 		sdp_dbg_warn(listen_conn, 
@@ -356,13 +358,16 @@ static int sdp_cm_listen_lookup(struct s
 	 */
 	sk->sk_lingertime   = listen_sk->sk_lingertime;
 	sk->sk_rcvlowat     = listen_sk->sk_rcvlowat;
-	sk->sk_debug        = listen_sk->sk_debug;
-	sk->sk_localroute   = listen_sk->sk_localroute;
+	if (sock_flag(listen_sk, SOCK_DBG))
+		sock_set_flag(sk, SOCK_DBG);
+	if (sock_flag(listen_sk, SOCK_LOCALROUTE))
+		sock_set_flag(sk, SOCK_LOCALROUTE);
 	sk->sk_sndbuf       = listen_sk->sk_sndbuf;
 	sk->sk_rcvbuf       = listen_sk->sk_rcvbuf;
 	sk->sk_no_check     = listen_sk->sk_no_check;
 	sk->sk_priority     = listen_sk->sk_priority;
-	sk->sk_rcvtstamp    = listen_sk->sk_rcvtstamp;
+	if (sock_flag(listen_sk, SOCK_RCVTSTAMP))
+		sock_set_flag(sk, SOCK_RCVTSTAMP);
 	sk->sk_rcvtimeo     = listen_sk->sk_rcvtimeo;
 	sk->sk_sndtimeo     = listen_sk->sk_sndtimeo;
 	sk->sk_reuse        = listen_sk->sk_reuse;
@@ -501,7 +506,7 @@ int sdp_cm_req_handler(struct ib_cm_id *
 		goto done;
 	}
 	/*
-	 * Lock the new connection before modifyingg it into any tables.
+	 * Lock the new connection before modifying it into any tables.
 	 */
 	sdp_conn_lock(conn);
 	/*
Index: linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h
===================================================================
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(revision 2212)
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	(working copy)
@@ -197,11 +197,6 @@ struct sdev_root {
 	 * SDP wide listen
 	 */
 	struct ib_cm_id *listen_id;      /* listen handle */
-	/*
-	 * cache's
-	 */
-	kmem_cache_t *conn_cache;
-	kmem_cache_t *sock_cache;
 };
 
 #endif /* _SDP_DEV_H */


From akpm at osdl.org  Tue Apr 26 17:05:13 2005
From: akpm at osdl.org (Andrew Morton)
Date: Tue, 26 Apr 2005 17:05:13 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <521x8xs04v.fsf@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com>
	<20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com>
	<20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com>
	<20050426133229.416a5e66.akpm@osdl.org>
	<521x8xs04v.fsf@topspin.com>
Message-ID: <20050426170513.33b81f76.akpm@osdl.org>

Roland Dreier <roland at topspin.com> wrote:
>
>     Andrew> Well I was vaguely proposing that the userspace library
>     Andrew> keep track of the byteranges and the underlying page
>     Andrew> states.  So in the above scenario userspace would leave
>     Andrew> the page at 0x1000 registered until all registrations
>     Andrew> against that page have been undone.
> 
> OK, I already have code in userspace that keeps reference counts for
> overlapping regions, etc.  However I'm not sure how to tie this in
> with reliable accounting of pinned memory -- we don't want malicious
> userspace code to be able to fool the accounting, right?
> 
> So I'm still trying to puzzle out what to do.  I don't want to keep a
> complicated data structure in the kernel keeping track of what memory
> has been registered.  Right now, I just keep a list of structs, one
> for each region, and when a process dies, I just go through region by
> region and do a put_page() to balance off the get_user_pages().
> 
> However I don't see how to make it work if I put the reference
> counting for overlapping regions in userspace but want mlock()
> accounting in the kernel.  If a buggy/malicious app does:
> 
>     a) register from 0x0000 to 0x2fff
>     b) register from 0x1000 to 0x1fff
>     c) unregister from 0x0000 to 0x2fff

As far as the kernel is concerned, step b) should be a no-op.  (The kernel
might choose to split the vma, but that's not significant).

> then it seems the kernel is screwed unless it counts how many times a
> vma has been pinned.  And adding a pin_count member to vm_struct seems
> like a pretty damn major step.
> 
> We definitely have to make sure that userspace is never able to
> unpin a page that is still registered with RDMA hardware, because that
> can lead to DMA into memory that someone else owns.  On the other
> hand, we don't want userspace to be able to defeat resource accounting
> by tricking the kernel into keeping page_count elevated after it
> credits the memory back to a process's limit on locked pages.

The kernel can simply register and unregister ranges for RDMA.  So
effectively a particular page is in either the registered or unregistered
state.  Kernel accounting counts the number of registered pages and
compares this with rlimits.

On top of all that, your userspace library needs to keep track of when
pages should really be registered and unregistered with the kernel.  Using
overlap logic and per-page refcounting or whatever.

No?
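
A rough sketch of that split, under stated assumptions: 4 KiB pages, a
toy 16-page address space, and made-up names throughout
(lib_register(), lib_unregister(), kernel_register_page(),
kernel_unregister_page() are illustrations, not an existing API).  The
library keeps a refcount per page and calls into the "kernel" only when
a page goes from unused to used or back, so the kernel sees each page
as either registered or not.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define NPAGES     16	/* toy address space, enough for the example */

static unsigned int page_ref[NPAGES];	/* library's per-page refcounts */
static unsigned int kernel_pinned;	/* pages the "kernel" has registered */

/* Stand-ins for the real register/unregister calls into the kernel. */
static void kernel_register_page(size_t pg)   { (void)pg; kernel_pinned++; }
static void kernel_unregister_page(size_t pg) { (void)pg; kernel_pinned--; }

/* Register the inclusive byte range [start, end]. */
int lib_register(uintptr_t start, uintptr_t end)
{
	size_t pg, first = start >> PAGE_SHIFT, last = end >> PAGE_SHIFT;

	if (start > end || last >= NPAGES)
		return -1;
	for (pg = first; pg <= last; pg++)
		if (page_ref[pg]++ == 0)	/* first user of this page */
			kernel_register_page(pg);
	return 0;
}

/* Unregister the inclusive byte range [start, end]. */
int lib_unregister(uintptr_t start, uintptr_t end)
{
	size_t pg, first = start >> PAGE_SHIFT, last = end >> PAGE_SHIFT;

	if (start > end || last >= NPAGES)
		return -1;
	for (pg = first; pg <= last; pg++)	/* refuse unbalanced calls */
		if (page_ref[pg] == 0)
			return -1;
	for (pg = first; pg <= last; pg++)
		if (--page_ref[pg] == 0)	/* last user is gone */
			kernel_unregister_page(pg);
	return 0;
}
```

With the a)/b)/c) sequence quoted above, step b) makes no call into the
kernel, and step c) leaves only the page at 0x1000 registered.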


From tduffy at sun.com  Tue Apr 26 20:00:35 2005
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 26 Apr 2005 20:00:35 -0700
Subject: [openib-general] kernel vapi
In-Reply-To: <1114534592.15717.33.camel@olivier.toulouse>
References: <1114534592.15717.33.camel@olivier.toulouse>
Message-ID: <1114570835.22627.7.camel@duffman>

On Tue, 2005-04-26 at 18:56 +0200, Olivier Cozette wrote:
> 	Hello,
> 
> Sorry, but I don't know the right list for my problem, so I am posting it
> to this list.
> 
> I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I tried
> to port vping to kernel space
> (ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping).

Can you reproduce your problem on a 2.6.11 kernel with the gen2 code?
Unfortunately, gen1 is no longer supported.  Nor is the 2.4 kernel.

-tduffy

From tomduffy at gmail.com  Tue Apr 26 20:05:44 2005
From: tomduffy at gmail.com (Tom Duffy)
Date: Tue, 26 Apr 2005 20:05:44 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
Message-ID: <9d3b7de7050426200569e83b68@mail.gmail.com>

Has anybody else noticed that openib.org doesn't seem to render
properly on Firefox/Linux?  Check out this screenshot.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openib_banner.png
Type: image/png
Size: 11157 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050426/4ba75630/attachment.png>

From caitlin.bestler at gmail.com  Tue Apr 26 20:15:46 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Tue, 26 Apr 2005 20:15:46 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426122850.44d06fa6.akpm@osdl.org>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425153256.3850ee0a.akpm@osdl.org> <52vf6atnn8.fsf@topspin.com>
	<20050425171145.2f0fd7f8.akpm@osdl.org> <52acnmtmh6.fsf@topspin.com>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
Message-ID: <469958e00504262015772c9181@mail.gmail.com>

On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:
> Roland Dreier <roland at topspin.com> wrote:
> >
> >     Libor>   Do you mean that the set/clear parameters to do_mlock()
> >     Libor> are the actual flags which are set/cleared by the caller?
> >     Libor> Also, the issue remains that the flags are not reference
> >     Libor> counted which is a problem if you are dealing with
> >     Libor> overlapping memory regions, or even if one region ends and
> >     Libor> another begins on the same page. Since the desire is to be
> >     Libor> able to pin any memory that a user can malloc this is a
> >     Libor> likely scenario.
> >
> > Good point... we need to figure out how to handle:
> >
> >     a) app registers 0x0000 through 0x17ff
> >     b) app registers 0x1800 through 0x2fff
> >     c) app unregisters 0x0000 through 0x17ff
> >     d) the page at 0x1000 must stay pinned
> 
> The userspace library should be able to track the tree and the overlaps,
> etc.  Things might become interesting when the memory is MAP_SHARED
> pagecache and multiple independent processes are involved, although I guess
> that'd work OK.
> 
> But afaict the problem wherein part of a page needs VM_DONTCOPY and the
> other part does not cannot be solved.
> 

Which portion of the userspace library? HCA-dependent code, or common code?

The HCA-dependent code would fail to count when the same memory was
registered to different HCAs (for example to the internal network device and
the external network device).

The vendor-independent code *could* do it, but only by maintaining a 
complete list of all registrations that had been issued but not cancelled.
That data would be redundant with data kept at the verb layer, and by
the kernel.

It *would* work, but maintaining highly redundant data at multiple layers
is something that I generally try to avoid.


From caitlin.bestler at gmail.com  Tue Apr 26 20:21:23 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Tue, 26 Apr 2005 20:21:23 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426170513.33b81f76.akpm@osdl.org>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<20050426133229.416a5e66.akpm@osdl.org> <521x8xs04v.fsf@topspin.com>
	<20050426170513.33b81f76.akpm@osdl.org>
Message-ID: <469958e0050426202144a1fdf4@mail.gmail.com>

On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:

> >
> > However I don't see how to make it work if I put the reference
> > counting for overlapping regions in userspace but when I want mlock()
> > accounting in the kernel.  If a buggy/malicious app does:
> >
> >     a) register from 0x0000 to 0x2fff
> >     b) register from 0x1000 to 0x1fff
> >     c) unregister from 0x0000 to 0x2fff
> 
> As far as the kernel is concerned, step b) should be a no-op.  (The kernel
> might choose to split the vma, but that's not significant).
> 

If "register" and "unregister" are meant in the RDMA sense, then the above
sequence is entirely reasonable. The b) registration could be for a different
protection domain that did not require access to all of the larger region.

Unless a full counting lock is available from the kernel, the collective RDMA
components would be responsible for: a) pinning 0x0000 to 0x2fff, b) doing
nothing, c) unpinning 0x0000 to 0x0fff and 0x2000 to 0x2fff.
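
The a)/b)/c) sequence above can be modelled with a per-page count of
registrations. This is only a userspace sketch under stated assumptions:
4 KiB pages, a toy 16-page address space, and hypothetical helper names
(reg_range, unreg_range) that do not exist in any OpenIB code.

```c
#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define TOY_NPAGES     16	/* toy address space: 0x0000..0xffff */

/* Number of live registrations covering each page. */
unsigned int pin_count[TOY_NPAGES];

/* Register [start, end] (inclusive); returns how many pages became pinned. */
int reg_range(unsigned long start, unsigned long end)
{
	unsigned long p;
	int newly_pinned = 0;

	for (p = start >> TOY_PAGE_SHIFT; p <= end >> TOY_PAGE_SHIFT; p++)
		if (pin_count[p]++ == 0)
			newly_pinned++;
	return newly_pinned;
}

/* Unregister [start, end] (inclusive); returns how many pages were unpinned. */
int unreg_range(unsigned long start, unsigned long end)
{
	unsigned long p;
	int unpinned = 0;

	for (p = start >> TOY_PAGE_SHIFT; p <= end >> TOY_PAGE_SHIFT; p++)
		if (pin_count[p] > 0 && --pin_count[p] == 0)
			unpinned++;
	return unpinned;
}
```

With these helpers, registering 0x0000-0x2fff pins three pages,
registering 0x1000-0x1fff pins nothing new, and unregistering
0x0000-0x2fff unpins only the pages at 0x0000 and 0x2000; the page at
0x1000 keeps a count of one and stays pinned.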


From halr at voltaire.com  Wed Apr 27 03:53:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Apr 2005 06:53:31 -0400
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
In-Reply-To: <1114554993.22383.7.camel@duffman>
References: <1114554993.22383.7.camel@duffman>
Message-ID: <1114556937.1764.331.camel@localhost.localdomain>

On Tue, 2005-04-26 at 18:36, Tom Duffy wrote:
> BTW, this with both ends on 2.6.11 with stock openib revision 2214, not
> my modified 2.6.12-rc3 version.  Also, I am using opensm for my SM (back
> to back configuration).

I haven't run OpenSM on top of the latest changes in a while so I need
to ask:

Does ibstat/ibstatus show both ports as active with LIDs assigned?

Thanks.

-- Hal


From halr at voltaire.com  Wed Apr 27 06:09:07 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Apr 2005 09:09:07 -0400
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
In-Reply-To: <1114556937.1764.331.camel@localhost.localdomain>
References: <1114554993.22383.7.camel@duffman>
	<1114556937.1764.331.camel@localhost.localdomain>
Message-ID: <1114607232.1764.379.camel@localhost.localdomain>

On Wed, 2005-04-27 at 06:53, Hal Rosenstock wrote:
> On Tue, 2005-04-26 at 18:36, Tom Duffy wrote:
> > BTW, this with both ends on 2.6.11 with stock openib revision 2214, not
> > my modified 2.6.12-rc3 version.  Also, I am using opensm for my SM (back
> > to back configuration).
> 
> I haven't run OpenSM on top of the latest changes in a while

Just did this and OpenSM appears to work (as does IPoIB).

-- Hal

> so I need to ask:
> 
> Does ibstat/ibstatus show both ports as active with LIDs assigned ?
> 
> Thanks.
> 
> -- Hal


From libor at topspin.com  Tue Apr 26 16:08:35 2005
From: libor at topspin.com (Libor Michalek)
Date: Tue, 26 Apr 2005 16:08:35 -0700
Subject: [openib-general] [SDP] having moving data on ttcp.aio connection
In-Reply-To: <1114554993.22383.7.camel@duffman>;
	from tduffy@sun.com on Tue, Apr 26, 2005 at 03:36:33PM -0700
References: <1114554993.22383.7.camel@duffman>
Message-ID: <20050426160835.A10906@topspin.com>

On Tue, Apr 26, 2005 at 03:36:33PM -0700, Tom Duffy wrote:
> Has anybody seen this type of error when doing SDP?
> 
>  ERR: : VMA lock <508000:65536> error <-12> <16:0:8>
> WARN: <1> <0404:2480> Error <-12> IOCB lock <65536:0>
> 
> ttcp reports:
> 
> [root at sins-stinger-10 ~]# ./ttcp -r -l 65536 -a 20
> ttcp-r: buflen = 65536 nbuf = 0 align = 16384/0 port = 5001
> ttcp-r: socket
> ttcp-r: accept from 192.168.0.26
> ttcp-r: Event error <-12> <5275648>
> ttcp-r: 0 bytes in 2.50 real seconds = 0.00 Mbit/sec +++
> ttcp-r: 2 I/O calls, usec/call = 1248114.00, calls/sec = 0.80
> ttcp-r: user: 0 sys: 41994 total: 41994 real: 2496228 (microseconds)
> 
> BTW, this with both ends on 2.6.11 with stock openib revision 2214, not
> my modified 2.6.12-rc3 version.  Also, I am using opensm for my SM (back
> to back configuration).

  The error -12 is errno ENOMEM. The most common cause of ENOMEM with
AIO, especially on an unloaded system, is a failed mlock, and the most
common reason for that is attempting to lock more pages than the user is
allowed to lock. You should increase the amount of memory that the user
is allowed to lock by running the following command in each shell from
which you are running ttcp:

  limit memorylocked unlimited

I've already told Hal I'd add this to the README, I'll do it right now.
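
For reference, the limit command above is csh/tcsh syntax; the bash
equivalent is "ulimit -l unlimited". A program can also inspect the limit
itself before attempting a large transfer. The following is a minimal
sketch, assuming Linux and a hypothetical helper name (memlock_allows)
that is not part of SDP or ttcp. It only compares a single buffer size
against the soft limit and ignores pages the process has already locked.

```c
#include <sys/resource.h>

/*
 * Returns 1 if a buffer of len bytes fits under the soft RLIMIT_MEMLOCK
 * limit, 0 if it does not, and -1 if the limit could not be read.
 */
int memlock_allows(unsigned long long len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;	/* no limit configured */
	return len <= (unsigned long long)rl.rlim_cur;
}
```

A result of 0 for the 65536-byte buffers used in the ttcp run above would
be consistent with the ENOMEM seen there.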

-Libor


From roland at topspin.com  Tue Apr 26 19:13:24 2005
From: roland at topspin.com (Roland Dreier)
Date: Tue, 26 Apr 2005 19:13:24 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace
	verbs implementation
In-Reply-To: <20050426170513.33b81f76.akpm@osdl.org> (Andrew Morton's
	message of "Tue, 26 Apr 2005 17:05:13 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<521x8yv9vb.fsf@topspin.com> <20050425151459.1f5fb378.akpm@osdl.org>
	<426D6D68.6040504@ammasso.com> <20050425153256.3850ee0a.akpm@osdl.org>
	<52vf6atnn8.fsf@topspin.com> <20050425171145.2f0fd7f8.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <20050426133229.416a5e66.akpm@osdl.org>
	<521x8xs04v.fsf@topspin.com> <20050426170513.33b81f76.akpm@osdl.org>
Message-ID: <521x8xq857.fsf@topspin.com>

    Andrew> The kernel can simply register and unregister ranges for
    Andrew> RDMA.  So effectively a particular page is in either the
    Andrew> registered or unregistered state.  Kernel accounting
    Andrew> counts the number of registered pages and compares this
    Andrew> with rlimits.

    Andrew> On top of all that, your userspace library needs to keep
    Andrew> track of when pages should really be registered and
    Andrew> unregistered with the kernel.  Using overlap logic and
    Andrew> per-page refcounting or whatever.

This is OK as long as userspace is trusted.  However I don't see how
this works when we don't trust userspace.  The problem is that for an
RDMA device (IB HCA or iWARP RNIC), a process can create many memory
regions, each of which has a separate virtual to physical translation
map.  For example, an app can do:

    a) register 0x0000 through 0xffff and get memory handle 1
    b) register 0x0000 through 0xffff and get memory handle 2
    c) use memory handle 1 for communication with remote app A
    d) use memory handle 2 for communication with remote app B

Even though memory handles 1 and 2 both refer to exactly the same
memory, they may have different lifetimes, might be attached to
different connections, and so on.

Clearly the memory at 0x0000 must stay pinned as long as the RDMA
device thinks either memory handle 1 or memory handle 2 is valid.
Furthermore, the kernel must be the one keeping track of how many
regions refer to a given page because we can't allow userspace to be
able to tell a device to go DMA to memory it doesn't own any more.

Creation and destruction of these memory handles will always go
through the kernel driver, so this isn't so bad.  And get_user_pages()
is almost exactly what we need: it stacks perfectly, since it operates
on the page_count rather than just setting a bit in vm_flags.  The
main problem is that it doesn't check against RLIMIT_MEMLOCK.

The most reasonable thing to do would seem to be having the IB kernel
memory region code update current->mm->locked_vm and check it against
RLIMIT_MEMLOCK.  I guess it would be good to figure out an appropriate
abstraction to export rather than monkeying with current->mm directly.
We could also put this directly in get_user_pages(), but I'd be
worried about messing with current users.
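
The accounting idea above could look roughly like the following userspace
model. It is a sketch only: struct toy_mm and the toy_* helpers are
hypothetical stand-ins, not the kernel's mm_struct or any existing kernel
interface, and it assumes that every memory region is charged separately
even when two regions cover the same pages.

```c
#include <errno.h>

/* Toy stand-in for per-process accounting state. */
struct toy_mm {
	unsigned long locked_vm;	/* pages currently charged */
	unsigned long lock_limit;	/* RLIMIT_MEMLOCK expressed in pages */
};

/* Charge npages for a new memory region; fails with no side effects. */
int toy_charge_pinned(struct toy_mm *mm, unsigned long npages)
{
	if (mm->locked_vm > mm->lock_limit ||
	    npages > mm->lock_limit - mm->locked_vm)
		return -ENOMEM;
	mm->locked_vm += npages;
	return 0;
}

/* Drop the charge when a memory region is destroyed. */
void toy_uncharge_pinned(struct toy_mm *mm, unsigned long npages)
{
	mm->locked_vm -= (npages < mm->locked_vm) ? npages : mm->locked_vm;
}
```

In this model, with a 16-page limit, the first registration of 0x0000
through 0xffff succeeds and a second registration of the same range fails
with -ENOMEM until the first region is destroyed.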

I just don't see a way to make VM_KERNEL_LOCKED work.

It would also be nice to have a way for apps to set VM_DONTCOPY
appropriately.  Christoph's suggestion of extending mmap() and
mprotect() with PROT_DONTCOPY seems good to me, especially since it
means we don't have to export do_mlock() functionality to modules.

 - R.


From halr at voltaire.com  Wed Apr 27 07:15:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Apr 2005 10:15:22 -0400
Subject: [openib-general] [PATCH] mad.c: Minor cleanup during startup and
	shutdown
Message-ID: <1114611322.1764.385.camel@localhost.localdomain>

mad.c: Minor cleanup during startup and shutdown

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

Index: mad.c
===================================================================
--- mad.c	(revision 2212)
+++ mad.c	(working copy)
@@ -2534,14 +2534,6 @@
 	unsigned long flags;
 	char name[sizeof "ib_mad123"];
 
-	/* First, check if port already open at MAD layer */
-	port_priv = ib_get_mad_port(device, port_num);
-	if (port_priv) {
-		printk(KERN_DEBUG PFX "%s port %d already open\n",
-		       device->name, port_num);
-		return 0;
-	}
-
 	/* Create new device info */
 	port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL);
 	if (!port_priv) {
@@ -2666,7 +2658,7 @@
 
 static void ib_mad_init_device(struct ib_device *device)
 {
-	int ret, num_ports, cur_port, i, ret2;
+	int num_ports, cur_port, i;
 
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
@@ -2676,47 +2668,37 @@
 		cur_port = 1;
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		ret = ib_mad_port_open(device, cur_port);
-		if (ret) {
+		if (ib_mad_port_open(device, cur_port)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d\n",
 			       device->name, cur_port);
 			goto error_device_open;
 		}
-		ret = ib_agent_port_open(device, cur_port);
-		if (ret) {
+		if (ib_agent_port_open(device, cur_port)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d "
 			       "for agents\n",
 			       device->name, cur_port);
 			goto error_device_open;
 		}
 	}
+	return;
 
-	goto error_device_query;
-
 error_device_open:
 	while (i > 0) {
 		cur_port--;
-		ret2 = ib_agent_port_close(device, cur_port);
-		if (ret2) {
+		if (ib_agent_port_close(device, cur_port))
 			printk(KERN_ERR PFX "Couldn't close %s port %d "
 			       "for agents\n",
 			       device->name, cur_port);
-		}
-		ret2 = ib_mad_port_close(device, cur_port);
-		if (ret2) {
+		if (ib_mad_port_close(device, cur_port))
 			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
 			       device->name, cur_port);
-		}
 		i--;
 	}
-
-error_device_query:
-	return;
 }
 
 static void ib_mad_remove_device(struct ib_device *device)
 {
-	int ret = 0, i, num_ports, cur_port, ret2;
+	int i, num_ports, cur_port;
 
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
@@ -2726,21 +2708,13 @@
 		cur_port = 1;
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		ret2 = ib_agent_port_close(device, cur_port);
-		if (ret2) {
+		if (ib_agent_port_close(device, cur_port))
 			printk(KERN_ERR PFX "Couldn't close %s port %d "
 			       "for agents\n",
 			       device->name, cur_port);
-			if (!ret)
-				ret = ret2;
-		}
-		ret2 = ib_mad_port_close(device, cur_port);
-		if (ret2) {
+		if (ib_mad_port_close(device, cur_port))
 			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
 			       device->name, cur_port);
-			if (!ret)
-				ret = ret2;
-		}
 	}
 }
 

From olivier.cozette at seanodes.com  Wed Apr 27 08:14:32 2005
From: olivier.cozette at seanodes.com (Olivier Cozette)
Date: Wed, 27 Apr 2005 17:14:32 +0200
Subject: [openib-general] kernel vapi
In-Reply-To: <1114570835.22627.7.camel@duffman>
References: <1114534592.15717.33.camel@olivier.toulouse>
	<1114570835.22627.7.camel@duffman>
Message-ID: <1114614872.15717.37.camel@olivier.toulouse>

On Tuesday, 26 April 2005 at 20:00 -0700, Tom Duffy wrote:
> On Tue, 2005-04-26 at 18:56 +0200, Olivier Cozette wrote:
> > 	Hello,
> > 
> > Sorry, but I don't know the right list for my problem, so I am posting it
> > to this list.
> > 
> > I'm using a 2.4 kernel with openib 1.0 on an x86-64 SMP machine, and I tried
> > to port vping to kernel space
> > (ib-support/third_party/mellanox_thca_host_native_2_6/src/HCA/examples/vping).
> 
> Can you reproduce your problem on a 2.6.11 kernel with the gen2 code?
> Unfortunately, gen1 is no longer supported.  Nor is the 2.4 kernel.
> 
> -tduffy

	Hello,

Actually, I don't have a 2.6.11 kernel installed right now, but I will try
tomorrow.


	Olivier


From jlentini at netapp.com  Wed Apr 27 09:06:01 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 27 Apr 2005 12:06:01 -0400 (EDT)
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <9d3b7de7050426200569e83b68@mail.gmail.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
Message-ID: <Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>


I see it too.

On Tue, 26 Apr 2005, Tom Duffy wrote:

> Has anybody else noticed that openib.org doesn't seem to render
> properly on Firefox/Linux?  Check out this screenshot.
>


From jcarr at linuxmachines.com  Wed Apr 27 09:26:56 2005
From: jcarr at linuxmachines.com (Jeff Carr)
Date: Wed, 27 Apr 2005 09:26:56 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
Message-ID: <426FBD50.5070704@linuxmachines.com>

James Lentini wrote:
> 
> I see it too.
> 
> On Tue, 26 Apr 2005, Tom Duffy wrote:
> 
>> Has anybody else noticed that openib.org doesn't seem to render
>> properly on Firefox/Linux?  Check out this screenshot.

Not for me. Perhaps reload fixes it?

Jeff
(using debian sid - firefox 1.0.1)


From jlentini at netapp.com  Wed Apr 27 09:28:29 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 27 Apr 2005 12:28:29 -0400 (EDT)
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <426FBD50.5070704@linuxmachines.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com>
Message-ID: <Pine.LNX.4.61.0504271227170.5321@jlentini-linux.nane.netapp.com>


I'm using Firefox 1.0.3 on Fedora Core 3.

On Wed, 27 Apr 2005, Jeff Carr wrote:

> James Lentini wrote:
>> 
>> I see it too.
>> 
>> On Tue, 26 Apr 2005, Tom Duffy wrote:
>> 
>>> Has anybody else noticed that openib.org doesn't seem to render
>>> properly on Firefox/Linux?  Check out this screenshot.
>
> Not for me. Perhaps reload fixes it?
>
> Jeff
> (using debian sid - firefox 1.0.1)
>


From tduffy at sun.com  Wed Apr 27 10:13:15 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 10:13:15 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <426FBD50.5070704@linuxmachines.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com>
Message-ID: <1114621995.2221.2.camel@duffman>

On Wed, 2005-04-27 at 09:26 -0700, Jeff Carr wrote:
> Not for me. Perhaps reload fixes it?

The problem seems to stem from the fact that the horizontal blue bar
does not move when the font is increased or decreased.  Here is a series
of screenshots to demonstrate the issue:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openib-font-issue.jpg
Type: image/jpeg
Size: 27598 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050427/7304ef0a/attachment.jpg>

From roland at topspin.com  Wed Apr 27 10:25:54 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 27 Apr 2005 10:25:54 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <1114621995.2221.2.camel@duffman> (Tom Duffy's message of "Wed,
	27 Apr 2005 10:13:15 -0700")
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
Message-ID: <523btcp1wd.fsf@topspin.com>

    Tom> The problem seems to stem from the fact that the horizontal
    Tom> blue bar does not move when the font is increased or
    Tom> decreased.  Here is a series of screenshots to demonstrate
    Tom> the issue:

Looks like there's some absolute positioning hard-coded in the html:

    <div id="header" style="height: 78px;" id="header"> <a href="index.html"><img alt="OpenIB.org" src="images/openib.gif" style="border: 0px solid ; width: 128px; height: 56px;">

even better is:

    <meta name="generator" content="Windows Notepad" />

 - R.


From jcarr at linuxmachines.com  Wed Apr 27 10:42:34 2005
From: jcarr at linuxmachines.com (Jeff Carr)
Date: Wed, 27 Apr 2005 10:42:34 -0700
Subject: [openib-general] in need of a simple ulp
Message-ID: <426FCF0A.3070806@linuxmachines.com>

I'm new to IB so over the last few weeks I've read a large part of the 
openib archives. I have a small IB cluster of a few nodes running under 
2.6.11.7. IPoIB is working ok (SM dedicated on another node running 
2.6.9 & Mellanox IBGold).

To solve my particular problem, it would seem to me the best option 
would be to write a small ulp for this purpose. In this case, I have 
some pages in memory on one host that I need to offload to the other IB 
hosts for more cpu intensive processing. So, this is very similar to the 
way perf_main/simple_perf works, but I need a kernel space implementation.

IPoIB is rather performance bound, it would seem. I was hoping that I 
could write/find some simple code to help with this, but there are not 
many ULPs, and I notice from a previous thread last month that there doesn't 
seem to be much abstraction or shared code that can be used.

In my infant state of understanding IB, it would appear that I need to

1) to open & discover the HCA's
2) create some Protection Domains
3) create some address handles
4) create some CQ's
5) create some queue pairs (maybe some SRQ's also)

possibly 3-5 can be replaced with some sort of "Registered Memory 
Region" as per 11.2.8 of the IB spec. (?)

I've attempted to look in the subversion repository for anything close 
or for things to use as starting points but didn't seem to find anything 
yet. The full repository is large and not very cleanly organized for a 
new user; so if there is some starting code that may be of use, that is 
what I was looking for.

Also, if there is anyone that would be willing to work on this problem 
on a contractual basis (code will be GPL'd) then please contact me.

Thanks,
Jeff Carr


Some choice quotes from the archives:

Ah a "design spec".  I remember writing those.  And I remember
discussing them with other engineers for hours about the proper wording
for particular sections, and other trivial nonsense.

... if you are talking to the right people what does the name of
the room you happen to be in happen to do with anything?

Somebody needs to work on a history someday, as a sort of "don't do 
this" tutorial for the industry :-)

I've done a NIC design or two and once made the same mistake myself of
thinking I could guarantee data integrity -- turned out I could not. It
was good to know I was in such good company.

You laugh when you mention Cosmic Rays. Well, don't.

I don't think there's anything "illegal" in developing an SDP 
implementation, either under the GPL or otherwise. However, shipping it 
in products is another thing entirely.


From mshefty at ichips.intel.com  Wed Apr 27 10:49:12 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 27 Apr 2005 10:49:12 -0700
Subject: [openib-general] in need of a simple ulp
In-Reply-To: <426FCF0A.3070806@linuxmachines.com>
References: <426FCF0A.3070806@linuxmachines.com>
Message-ID: <426FD098.9090300@ichips.intel.com>

Jeff Carr wrote:
> In my infant state of understanding IB, it would appear that I need to
> 
> 1) to open & discover the HCA's
> 2) create some Protection Domains
> 3) create some address handles
> 4) create some CQ's
> 5) create some queue pairs (maybe some SRQ's also)
> 
> possibly 3-5 can be replaced with some sort of "Registered Memory 
> Region" as per 11.2.8 of the IB spec. (?)
> 
> I've attempted to look in the subversion repository for anything close 
> or for things to use as starting points but didn't seem to find anything 
> yet. The full repository is large and not very cleanly organized for a 
> new user; so if there is some starting code that may be of use, that is 
> what I was looking for.

Within the SVN repository, your best bet for finding things is to stay
within the gen2 branch.  For a relatively simple example that does
what you mention above, try:

https://openib.org/svn/gen2/utils/src/linux-kernel/infiniband/util/cmpost/

This is a simple CM test program for the kernel.

- Sean


From tduffy at sun.com  Wed Apr 27 11:16:34 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 11:16:34 -0700
Subject: [openib-general] [PATCH][SDP] rename struct sdp_opt to sdp_sock
In-Reply-To: <1114559500.22383.27.camel@duffman>
References: <000001c546ca$51807b80$8d5aa8c0@infiniconsys.com>
	<1114126674.6858.31.camel@duffman> <1114210652.5519.1.camel@duffman>
	<1114559500.22383.27.camel@duffman>
Message-ID: <1114625794.2221.11.camel@duffman>

This seems to be what all the cool kids are doing these days.

Applies after my 2.6.12-rc3 fixup patch.

Signed-off-by: Tom Duffy <tduffy at sun.com>

diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_actv.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_actv.c	2005-04-25 21:02:06.073010000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_actv.c	2005-04-26 21:28:40.305963000 -0700
@@ -37,7 +37,7 @@
 /*
  * Connection establishment functions
  */
-void sdp_cm_actv_error(struct sdp_opt *conn, int error)
+void sdp_cm_actv_error(struct sdp_sock *conn, int error)
 {
 	int result;
 	struct sock *sk;
@@ -108,7 +108,7 @@ void sdp_cm_actv_error(struct sdp_opt *c
 /*
  * sdp_cm_actv_establish - process an accepted connection request.
  */
-static int sdp_cm_actv_establish(struct sdp_opt *conn)
+static int sdp_cm_actv_establish(struct sdp_sock *conn)
 {
 	struct ib_qp_attr *qp_attr;
 	int attr_mask = 0;
@@ -271,7 +271,7 @@ static int sdp_cm_hello_ack_check(struct
  * sdp_cm_rep_handler - handler for active connection open completion
  */
 int sdp_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event,
-		       struct sdp_opt *conn)
+		       struct sdp_sock *conn)
 {
 	struct sdp_msg_hello_ack *hello_ack;
 	int result = -EPROTO;
@@ -351,7 +351,7 @@ static void sdp_cm_path_complete(u64 id,
 {
 	struct ib_cm_req_param param;
 	struct sdp_msg_hello *hello_msg;
-	struct sdp_opt *conn = (struct sdp_opt *) arg;
+	struct sdp_sock *conn = (struct sdp_sock *) arg;
 	struct sdpc_buff *buff;
 	int result = 0;
 	int expect;
@@ -531,7 +531,7 @@ done:
 /*
  * sdp_cm_connect - initiate a SDP connection with a hello message.
  */
-int sdp_cm_connect(struct sdp_opt *conn)
+int sdp_cm_connect(struct sdp_sock *conn)
 {
 	int result;
 	/*
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_conn.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_conn.c	2005-04-26 16:47:31.580003000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.c	2005-04-26 21:28:40.323961000 -0700
@@ -76,10 +76,10 @@ static u32 sdp_psn_generate(void)
 /*
  * sdp_inet_accept_q_put - put a conn into a listen conn's accept Q.
  */
-int sdp_inet_accept_q_put(struct sdp_opt *listen_conn,
-			  struct sdp_opt *accept_conn)
+int sdp_inet_accept_q_put(struct sdp_sock *listen_conn,
+			  struct sdp_sock *accept_conn)
 {
-	struct sdp_opt *next_conn;
+	struct sdp_sock *next_conn;
 
 	if (listen_conn->parent ||
 	    accept_conn->parent ||
@@ -107,10 +107,10 @@ int sdp_inet_accept_q_put(struct sdp_opt
 /*
  * sdp_inet_accept_q_get - get a conn from a listen conn's accept Q.
  */
-struct sdp_opt *sdp_inet_accept_q_get(struct sdp_opt *listen_conn)
+struct sdp_sock *sdp_inet_accept_q_get(struct sdp_sock *listen_conn)
 {
-	struct sdp_opt *prev_conn;
-	struct sdp_opt *accept_conn;
+	struct sdp_sock *prev_conn;
+	struct sdp_sock *accept_conn;
 
 	if (listen_conn->parent ||
 	    !listen_conn->accept_next ||
@@ -146,10 +146,10 @@ struct sdp_opt *sdp_inet_accept_q_get(st
 /*
  * sdp_inet_accept_q_remove - remove a conn from a conn's accept Q.
  */
-int sdp_inet_accept_q_remove(struct sdp_opt *accept_conn)
+int sdp_inet_accept_q_remove(struct sdp_sock *accept_conn)
 {
-	struct sdp_opt *next_conn;
-	struct sdp_opt *prev_conn;
+	struct sdp_sock *next_conn;
+	struct sdp_sock *prev_conn;
 
 	if (!accept_conn->parent)
 		return -EFAULT;
@@ -181,7 +181,7 @@ int sdp_inet_accept_q_remove(struct sdp_
 /*
  * sdp_inet_listen_start - start listening for new connections on a socket
  */
-int sdp_inet_listen_start(struct sdp_opt *conn)
+int sdp_inet_listen_start(struct sdp_sock *conn)
 {
 	unsigned long flags;
 
@@ -214,9 +214,9 @@ int sdp_inet_listen_start(struct sdp_opt
 /*
  * sdp_inet_listen_stop - stop listening for new connections on a socket
  */
-int sdp_inet_listen_stop(struct sdp_opt *listen_conn)
+int sdp_inet_listen_stop(struct sdp_sock *listen_conn)
 {
-	struct sdp_opt *accept_conn;
+	struct sdp_sock *accept_conn;
 	int result;
 	unsigned long flags;
 
@@ -274,9 +274,9 @@ int sdp_inet_listen_stop(struct sdp_opt 
 /*
  * sdp_inet_listen_lookup - lookup a connection in the listen list
  */
-struct sdp_opt *sdp_inet_listen_lookup(u32 addr, u16 port)
+struct sdp_sock *sdp_inet_listen_lookup(u32 addr, u16 port)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	unsigned long flags;
 	/*
 	 * table lock
@@ -299,11 +299,11 @@ struct sdp_opt *sdp_inet_listen_lookup(u
 /*
  * sdp_inet_port_get - bind a socket to a port.
  */
-int sdp_inet_port_get(struct sdp_opt *conn, u16 port)
+int sdp_inet_port_get(struct sdp_sock *conn, u16 port)
 {
 	struct sock *sk;
 	struct sock *srch;
-	struct sdp_opt *look;
+	struct sdp_sock *look;
 	s32 counter;
 	s32 low_port;
 	s32 top_port;
@@ -421,7 +421,7 @@ done:
 /*
  * sdp_inet_port_put - unbind a socket from a port.
  */
-int sdp_inet_port_put(struct sdp_opt *conn)
+int sdp_inet_port_put(struct sdp_sock *conn)
 {
 	unsigned long flags;
 
@@ -450,7 +450,7 @@ int sdp_inet_port_put(struct sdp_opt *co
 /*
  * sdp_inet_port_inherit - inherit a port from another socket (accept)
  */
-int sdp_inet_port_inherit(struct sdp_opt *parent, struct sdp_opt *child)
+int sdp_inet_port_inherit(struct sdp_sock *parent, struct sdp_sock *child)
 {
 	int result;
 	unsigned long flags;
@@ -486,7 +486,7 @@ done:
 /*
  * sdp_conn_table_insert - insert a connection into the connection table
  */
-static int sdp_conn_table_insert(struct sdp_opt *conn)
+static int sdp_conn_table_insert(struct sdp_sock *conn)
 {
 	u32 counter;
 	int result = -ENOMEM;
@@ -530,7 +530,7 @@ static int sdp_conn_table_insert(struct 
 /*
  * sdp_conn_table_remove - remove a connection from the connection table
  */
-int sdp_conn_table_remove(struct sdp_opt *conn)
+int sdp_conn_table_remove(struct sdp_sock *conn)
 {
 	int result = 0;
 	unsigned long flags;
@@ -567,9 +567,9 @@ done:
 /*
  * sdp_conn_table_lookup - look up connection in the connection table
  */
-struct sdp_opt *sdp_conn_table_lookup(s32 entry)
+struct sdp_sock *sdp_conn_table_lookup(s32 entry)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	unsigned long flags;
 	/*
 	 * lock table
@@ -617,7 +617,7 @@ static void sdp_desc_q_cancel_iocb(struc
 	}
 }
 
-void sdp_iocb_q_cancel_all_read(struct sdp_opt *conn, ssize_t error)
+void sdp_iocb_q_cancel_all_read(struct sdp_sock *conn, ssize_t error)
 {
 	sdp_iocb_q_cancel(&conn->r_pend, SDP_IOCB_F_ALL, error);
 	sdp_iocb_q_cancel(&conn->r_snk, SDP_IOCB_F_ALL, error);
@@ -625,7 +625,7 @@ void sdp_iocb_q_cancel_all_read(struct s
 	sdp_desc_q_cancel_iocb(&conn->r_src, error);
 }
 
-void sdp_iocb_q_cancel_all_write(struct sdp_opt *conn, ssize_t error)
+void sdp_iocb_q_cancel_all_write(struct sdp_sock *conn, ssize_t error)
 {
 	sdp_iocb_q_cancel(&conn->w_src, SDP_IOCB_F_ALL, error);
 
@@ -633,7 +633,7 @@ void sdp_iocb_q_cancel_all_write(struct 
 	sdp_desc_q_cancel_iocb(&conn->w_snk, error);
 }
 
-void sdp_iocb_q_cancel_all(struct sdp_opt *conn, ssize_t error)
+void sdp_iocb_q_cancel_all(struct sdp_sock *conn, ssize_t error)
 {
 	sdp_iocb_q_cancel_all_read(conn, error);
 	sdp_iocb_q_cancel_all_write(conn, error);
@@ -646,7 +646,7 @@ void sdp_iocb_q_cancel_all(struct sdp_op
 /*
  * sdp_conn_destruct - final destructor for connection.
  */
-void sdp_conn_destruct(struct sdp_opt *conn)
+void sdp_conn_destruct(struct sdp_sock *conn)
 {
 	int dump = 0;
 	int result;
@@ -761,7 +761,7 @@ void sdp_conn_destruct(struct sdp_opt *c
 /*
  * sdp_conn_internal_lock - lock the connection (use only from macro)
  */
-void sdp_conn_internal_lock(struct sdp_opt *conn, unsigned long *flags)
+void sdp_conn_internal_lock(struct sdp_sock *conn, unsigned long *flags)
 {
 	DECLARE_WAITQUEUE(wait, current);
 	unsigned long f = *flags;
@@ -785,7 +785,7 @@ void sdp_conn_internal_lock(struct sdp_o
 /*
  * sdp_conn_relock - test the connection (use only from macro)
  */
-void sdp_conn_relock(struct sdp_opt *conn)
+void sdp_conn_relock(struct sdp_sock *conn)
 {
 	unsigned long flags;
 	struct ib_wc entry;
@@ -849,7 +849,7 @@ void sdp_conn_relock(struct sdp_opt *con
 /*
  * sdp_conn_cq_drain - drain one of the the connection's CQs
  */
-int sdp_conn_cq_drain(struct ib_cq *cq, struct sdp_opt *conn)
+int sdp_conn_cq_drain(struct ib_cq *cq, struct sdp_sock *conn)
 {
 	struct ib_wc entry;
 	int result;
@@ -901,7 +901,7 @@ return calls;
 /*
  * sdp_conn_internal_unlock - lock the connection (use only from macro)
  */
-void sdp_conn_internal_unlock(struct sdp_opt *conn)
+void sdp_conn_internal_unlock(struct sdp_sock *conn)
 {
 	int calls = 0;
 	/*
@@ -921,7 +921,7 @@ void sdp_conn_internal_unlock(struct sdp
 /*
  * sdp_conn_lock_init - initialize connection lock
  */
-static void sdp_conn_lock_init(struct sdp_opt *conn)
+static void sdp_conn_lock_init(struct sdp_sock *conn)
 {
 	spin_lock_init(&(conn->lock.slock));
 	conn->lock.users = 0;
@@ -931,7 +931,7 @@ static void sdp_conn_lock_init(struct sd
 /*
  * sdp_conn_alloc_ib - allocate IB structures for a new connection.
  */
-int sdp_conn_alloc_ib(struct sdp_opt *conn, struct ib_device *device,
+int sdp_conn_alloc_ib(struct sdp_sock *conn, struct ib_device *device,
 		      u8 hw_port, u16 pkey)
 {
 	struct ib_qp_init_attr *init_attr;
@@ -1104,15 +1104,15 @@ error_attr:
 static struct proto sdp_sk_proto = {
 	.name		= "SDP",
 	.owner		= THIS_MODULE,
-	.obj_size	= sizeof(struct sdp_opt),
+	.obj_size	= sizeof(struct sdp_sock),
 };
 
 /*
  * sdp_conn_alloc - allocate a new socket, and init.
  */
-struct sdp_opt *sdp_conn_alloc(int priority)
+struct sdp_sock *sdp_conn_alloc(int priority)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	struct sock *sk;
 	int result;
 
@@ -1309,7 +1309,7 @@ error:
 int sdp_proc_dump_conn_main(char *buffer, int max_size, off_t start_index, 
 			    long *end_index)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	off_t counter = 0;
 	int   offset = 0;
 	u64   s_guid;
@@ -1414,7 +1414,7 @@ int sdp_proc_dump_conn_data(char *buffer
 			    long *end_index)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	off_t counter = 0;
 	int   offset = 0;
 	unsigned long flags;
@@ -1508,7 +1508,7 @@ done:
 int sdp_proc_dump_conn_rdma(char *buffer, int max_size, off_t start_index,
 			    long *end_index)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	off_t counter = 0;
 	int   offset = 0;
 	unsigned long flags;
@@ -1585,7 +1585,7 @@ done:
 int sdp_proc_dump_conn_sopt(char *buffer, int max_size, off_t start_index,
 			    long *end_index)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	off_t counter = 0;
 	int   offset = 0;
 	unsigned long flags;
@@ -1904,7 +1904,7 @@ int sdp_conn_table_init(int proto_family
 		goto error_size;
 	}
 
-	byte_size = conn_size * sizeof(struct sdp_opt *);
+	byte_size = conn_size * sizeof(struct sdp_sock *);
 	page_size = (byte_size >> 12) + ((0xfff & byte_size) > 0 ? 1 : 0);
 	for (dev_root_s.sk_ordr = 0;
 	     (1 << dev_root_s.sk_ordr) < page_size; dev_root_s.sk_ordr++) ;
@@ -1985,7 +1985,7 @@ int sdp_conn_table_clear(void)
  {
 	sdp_dbg_init("Deleting connection tables.");
 #if 0
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	/*
 	 * drain all the connections
 	 */
Only in linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/: .sdp_conn.c.swp
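
Note on the sdp_conn_table_init() hunk above: the table size in bytes is
rounded up to whole pages and then to the smallest power-of-two page count
(the allocation order). Below is a minimal standalone sketch of that
arithmetic, assuming 4 KiB pages as the hard-coded shift of 12 suggests.
The demo_* helper names are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes -> number of 4 KiB pages, rounded up. */
static size_t demo_pages_for_bytes(size_t byte_size)
{
	return (byte_size >> 12) + ((0xfff & byte_size) > 0 ? 1 : 0);
}

/* Hypothetical helper: smallest order such that (1 << order) >= pages. */
static unsigned int demo_order_for_pages(size_t pages)
{
	unsigned int order = 0;

	/* Keep the shift below well defined on 32-bit size_t. */
	assert(pages <= ((size_t)1 << 31));
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

The rename only changes the pointer type inside sizeof(), so byte_size and
the resulting order should be unchanged by this patch.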
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_conn.h linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_conn.h	2005-04-26 11:02:20.765001000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_conn.h	2005-04-26 21:28:40.331963000 -0700
@@ -146,12 +146,12 @@ enum sdp_mode {
  */
 #define SDP_MSG_EVENT_TABLE_SIZE 0x20
 
-static inline struct sdp_opt *sdp_sk(struct sock *sk)
+static inline struct sdp_sock *sdp_sk(struct sock *sk)
 {
-	return (struct sdp_opt *)sk;
+	return (struct sdp_sock *)sk;
 }
 
-static inline struct sock *sk_sdp(struct sdp_opt *conn)
+static inline struct sock *sk_sdp(struct sdp_sock *conn)
 {
 	return (struct sock *)conn;
 }
@@ -215,7 +215,7 @@ struct sdp_conn_lock {
 /*
  * SDP Connection structure.
  */
-struct sdp_opt {
+struct sdp_sock {
 	/*
 	 * inet_sock must be first member of sdp_sock
 	 * NOTE: this depends on inet_sock having struct sock as its
@@ -378,17 +378,17 @@ struct sdp_opt {
 	/*
 	 * table managment
 	 */
-	struct sdp_opt *lstn_next;    /* next conn in the chain */
-	struct sdp_opt **lstn_p_next; /* previous next conn in the chain */
+	struct sdp_sock *lstn_next;    /* next conn in the chain */
+	struct sdp_sock **lstn_p_next; /* previous next conn in the chain */
 
-	struct sdp_opt *bind_next;    /* next conn in the chain */
-	struct sdp_opt **bind_p_next; /* previous next conn in the chain */
+	struct sdp_sock *bind_next;    /* next conn in the chain */
+	struct sdp_sock **bind_p_next; /* previous next conn in the chain */
 	/*
 	 * listen/accept managment
 	 */
-	struct sdp_opt *parent;      /* listening socket queuing. */
-	struct sdp_opt *accept_next; /* sockets waiting for acceptance. */
-	struct sdp_opt *accept_prev; /* sockets waiting for acceptance. */
+	struct sdp_sock *parent;      /* listening socket queuing. */
+	struct sdp_sock *accept_next; /* sockets waiting for acceptance. */
+	struct sdp_sock *accept_prev; /* sockets waiting for acceptance. */
 	/*
 	 * OS info
 	 */
@@ -469,18 +469,18 @@ struct sdp_opt {
 /*
  * SDP connection lock
  */
-extern void sdp_conn_internal_lock(struct sdp_opt *conn, unsigned long *flags);
-extern void sdp_conn_internal_unlock(struct sdp_opt *conn);
-extern void sdp_conn_relock(struct sdp_opt *conn);
-extern void sdp_conn_destruct(struct sdp_opt *conn);
-extern int sdp_conn_cq_drain(struct ib_cq *cq, struct sdp_opt *conn);
+extern void sdp_conn_internal_lock(struct sdp_sock *conn, unsigned long *flags);
+extern void sdp_conn_internal_unlock(struct sdp_sock *conn);
+extern void sdp_conn_relock(struct sdp_sock *conn);
+extern void sdp_conn_destruct(struct sdp_sock *conn);
+extern int sdp_conn_cq_drain(struct ib_cq *cq, struct sdp_sock *conn);
 
 #define SDP_CONN_LOCK_IRQ(conn, flags) \
         spin_lock_irqsave(&((conn)->lock.slock), flags)
 #define SDP_CONN_UNLOCK_IRQ(conn, flags) \
         spin_unlock_irqrestore(&((conn)->lock.slock), flags)
 
-static inline void sdp_conn_lock(struct sdp_opt *conn)
+static inline void sdp_conn_lock(struct sdp_sock *conn)
 {
 	unsigned long flags;
 
@@ -496,7 +496,7 @@ static inline void sdp_conn_lock(struct 
 	spin_unlock_irqrestore(&(conn->lock.slock), flags);
 }
 
-static inline void sdp_conn_unlock(struct sdp_opt *conn)
+static inline void sdp_conn_unlock(struct sdp_sock *conn)
 {
 	unsigned long flags;
 
@@ -516,12 +516,12 @@ static inline void sdp_conn_unlock(struc
 /*
  * connection reference counting.
  */
-static inline void sdp_conn_hold(struct sdp_opt *conn)
+static inline void sdp_conn_hold(struct sdp_sock *conn)
 {
 	atomic_inc(&conn->refcnt);
 }
 
-static inline void sdp_conn_put(struct sdp_opt *conn)
+static inline void sdp_conn_put(struct sdp_sock *conn)
 {
 	if (atomic_dec_and_test(&conn->refcnt))
 		sdp_conn_destruct(conn);
@@ -530,7 +530,7 @@ static inline void sdp_conn_put(struct s
 /*
  * sdp_conn_error - get the connections error value destructively
  */
-static inline int sdp_conn_error(struct sdp_opt *conn)
+static inline int sdp_conn_error(struct sdp_sock *conn)
 {
 	/*
 	 * The connection error parameter is set and read under the connection
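
Note on sdp_sk()/sk_sdp() in the header above: the plain casts are only
valid because the socket structure is the first member of the connection
structure, as the comment in struct sdp_sock says. Below is a minimal
userspace sketch of that layout rule. The demo_* types are stand-ins
invented for illustration and are not the kernel's struct sock or
struct inet_sock.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */

/* Stand-ins for struct sock / struct inet_sock; not the kernel types. */
struct demo_sock {
	int state;
};

struct demo_inet_sock {
	struct demo_sock sk;		/* must stay the first member */
	int port;
};

struct demo_sdp_sock {
	struct demo_inet_sock inet;	/* must stay the first member */
	int refcnt;
};

/* Fail the build if someone reorders the members. */
static_assert(offsetof(struct demo_inet_sock, sk) == 0, "sk must be first");
static_assert(offsetof(struct demo_sdp_sock, inet) == 0, "inet must be first");

/* Same shape as sdp_sk(): generic socket pointer -> connection pointer. */
static struct demo_sdp_sock *demo_sdp_sk(struct demo_sock *sk)
{
	return (struct demo_sdp_sock *)sk;
}

/* Same shape as sk_sdp(): connection pointer -> generic socket pointer. */
static struct demo_sock *demo_sk_sdp(struct demo_sdp_sock *conn)
{
	return (struct demo_sock *)conn;
}
```

The compile-time checks are an addition of this sketch; the patch itself
relies on the comment alone.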
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_dev.h linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_dev.h	2005-04-25 11:55:01.370005000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_dev.h	2005-04-26 21:28:40.335964000 -0700
@@ -181,12 +181,12 @@ struct sdev_root {
 	u32 sk_ordr;  /* order size of region. */
 	u32 sk_rover; /* next potential available space. */
 	u32 sk_entry; /* number of socket table entries. */
-	struct sdp_opt **sk_array;	/* array of sockets. */
+	struct sdp_sock **sk_array;	/* array of sockets. */
 	/*
 	 * connection managment
 	 */
-	struct sdp_opt *listen_list;	/* list of listening connections */
-	struct sdp_opt *bind_list;	/* connections bound to a port. */
+	struct sdp_sock *listen_list;	/* list of listening connections */
+	struct sdp_sock *bind_list;	/* connections bound to a port. */
 	/*
 	 * list locks
 	 */
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_event.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_event.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_event.c	2005-04-25 11:55:01.374000000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_event.c	2005-04-26 21:28:40.340967000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_cq_event_locked - main per QP event handler
  */
-int sdp_cq_event_locked(struct ib_wc *comp, struct sdp_opt *conn)
+int sdp_cq_event_locked(struct ib_wc *comp, struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -131,7 +131,7 @@ done:
 void sdp_cq_event_handler(struct ib_cq *cq, void *arg)
 {
 	s32 hashent = (unsigned long)arg;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	s32 result;
 	unsigned long flags;
 
@@ -199,7 +199,7 @@ done:
  */
 
 static int sdp_cm_idle(struct ib_cm_id *cm_id, struct ib_cm_event *event,
-		       struct sdp_opt *conn)
+		       struct sdp_sock *conn)
 {
 	sdp_dbg_ctrl(conn, "CM IDLE. commID <%08x> event <%d> status <%d>",
 		     cm_id->local_id, event->event, event->param.send_status);
@@ -243,7 +243,7 @@ static int sdp_cm_idle(struct ib_cm_id *
 
 static int sdp_cm_established(struct ib_cm_id *cm_id,
 			      struct ib_cm_event *event,
-			      struct sdp_opt *conn)
+			      struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -309,7 +309,7 @@ done:
 }
 
 static int sdp_cm_dreq_rcvd(struct ib_cm_id *cm_id, struct ib_cm_event *event,
-			    struct sdp_opt *conn)
+			    struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -334,7 +334,7 @@ static int sdp_cm_dreq_rcvd(struct ib_cm
  * sdp_cm_timewait - handler for connection Time Wait completion
  */
 static int sdp_cm_timewait(struct ib_cm_id *cm_id, struct ib_cm_event *event,
-			   struct sdp_opt *conn)
+			   struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -414,7 +414,7 @@ error:
 int sdp_cm_event_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 {
 	s32 hashent = (unsigned long)cm_id->context;
-	struct sdp_opt *conn = NULL;
+	struct sdp_sock *conn = NULL;
 	int result = 0;
 
 	sdp_dbg_ctrl(NULL, "CM state <%d> event <%d> commID <%08x> ID <%d>",
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_inet.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_inet.c	2005-04-25 21:20:07.829965000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_inet.c	2005-04-26 21:28:40.348962000 -0700
@@ -101,7 +101,7 @@ module_param(sdp_debug_level, int, 0);
  */
 void sdp_inet_wake_send(struct sock *sk)
 {
-	struct sdp_opt *conn = sdp_sk(sk);
+	struct sdp_sock *conn = sdp_sk(sk);
 
 	if (sk == NULL)
 		return;
@@ -187,7 +187,7 @@ void sdp_inet_wake_urg(struct sock *sk)
 /*
  * sdp_inet_abort - abort an existing connection
  */
-static int sdp_inet_abort(struct sdp_opt *conn)
+static int sdp_inet_abort(struct sdp_sock *conn)
 {
 	int result;
 
@@ -234,7 +234,7 @@ static int sdp_inet_abort(struct sdp_opt
 /*
  * sdp_inet_disconnect - disconnect a connection
  */
-static int sdp_inet_disconnect(struct sdp_opt *conn)
+static int sdp_inet_disconnect(struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -299,7 +299,7 @@ static int sdp_inet_disconnect(struct sd
  */
 static int sdp_inet_release(struct socket *sock)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	struct sock *sk;
 	int  result;
 	long timeout;
@@ -441,7 +441,7 @@ static int sdp_inet_bind(struct socket *
 {
 	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	unsigned int addr_result = RTN_UNSPEC;
 	u16 bind_port;
 	int result;
@@ -533,7 +533,7 @@ static int sdp_inet_connect(struct socke
 {
 	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	long timeout;
 	int result;
 
@@ -696,7 +696,7 @@ done:
 static int sdp_inet_listen(struct socket *sock, int backlog)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	int result;
 
 	sk = sock->sk;
@@ -755,8 +755,8 @@ static int sdp_inet_accept(struct socket
 {
 	struct sock *listen_sk;
 	struct sock *accept_sk = NULL;
-	struct sdp_opt *listen_conn;
-	struct sdp_opt *accept_conn = NULL;
+	struct sdp_sock *listen_conn;
+	struct sdp_sock *accept_conn = NULL;
 	int result;
 	long timeout;
 
@@ -911,7 +911,7 @@ static int sdp_inet_getname(struct socke
 {
 	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 
 	sk = sock->sk;
 	conn = sdp_sk(sk);
@@ -946,7 +946,7 @@ static unsigned int sdp_inet_poll(struct
 				  poll_table *wait)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	unsigned int mask = 0;
 
 	/*
@@ -1035,7 +1035,7 @@ static int sdp_inet_ioctl(struct socket 
 			  unsigned long arg)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	struct sdpc_buff *buff;
 	int result = 0;
 	int value;
@@ -1158,7 +1158,7 @@ static int sdp_inet_setopt(struct socket
 			   char __user *optval, int optlen)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	int value;
 	int result = 0;
 
@@ -1225,7 +1225,7 @@ static int sdp_inet_getopt(struct socket
 			   char __user *optval, int __user *optlen)
 {
 	struct sock *sk;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	int value;
 	int len;
 
@@ -1286,7 +1286,7 @@ static int sdp_inet_getopt(struct socket
 static int sdp_inet_shutdown(struct socket *sock, int flag)
 {
 	int result = 0;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 
 	conn = sdp_sk(sock->sk);
 
@@ -1400,7 +1400,7 @@ static struct proto_ops lnx_stream_ops =
  */
 static int sdp_inet_create(struct socket *sock, int protocol)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 
 	sdp_dbg_ctrl(NULL, "SOCKET: type <%d> proto <%d> state <%u:%08lx>",
 		     sock->type, protocol, sock->state, sock->flags);
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_iocb.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_iocb.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_iocb.c	2005-04-25 11:55:01.388002000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_iocb.c	2005-04-26 21:28:40.353960000 -0700
@@ -420,7 +420,7 @@ static int sdp_mem_lock_cleanup(void)
 /*
  * sdp_iocb_register - register an IOCBs memory for advertisment
  */
-int sdp_iocb_register(struct sdpc_iocb *iocb, struct sdp_opt *conn)
+int sdp_iocb_register(struct sdpc_iocb *iocb, struct sdp_sock *conn)
 {
 	int result;
 
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_pass.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_pass.c	2005-04-25 21:05:20.526002000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_pass.c	2005-04-26 21:28:40.359963000 -0700
@@ -37,7 +37,7 @@
 /*
  * handle incomming passive connection error. (REJ)
  */
-void sdp_cm_pass_error(struct sdp_opt *conn, int error)
+void sdp_cm_pass_error(struct sdp_sock *conn, int error)
 {
 	int result;
 	struct sock *sk;
@@ -71,7 +71,7 @@ void sdp_cm_pass_error(struct sdp_opt *c
 /*
  * handle incomming passive connection establishment. (RTU)
  */
-int sdp_cm_pass_establish(struct sdp_opt *conn)
+int sdp_cm_pass_establish(struct sdp_sock *conn)
 {
         struct ib_qp_attr *qp_attr;
 	int attr_mask = 0;
@@ -147,7 +147,7 @@ done:
 /*
  * Functions to handle incoming passive connection requests. (REQ)
  */
-static int sdp_cm_accept(struct sdp_opt *conn)
+static int sdp_cm_accept(struct sdp_sock *conn)
 {
 	struct ib_cm_rep_param param;
 	struct sdp_msg_hello_ack *hello_ack;
@@ -290,9 +290,9 @@ error:
 	return result;
 }
 
-static int sdp_cm_listen_lookup(struct sdp_opt *conn)
+static int sdp_cm_listen_lookup(struct sdp_sock *conn)
 {
-	struct sdp_opt *listen_conn;
+	struct sdp_sock *listen_conn;
 	struct sock *listen_sk;
 	struct sock *sk;
 	int result;
@@ -478,7 +478,7 @@ static int sdp_cm_hello_check(struct sdp
 int sdp_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
 {
 	struct sdp_msg_hello *msg_hello = event->private_data;
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	int result;
 
 	sdp_dbg_ctrl(NULL, 
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_proto.h linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_proto.h
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_proto.h	2005-04-25 21:18:11.542976000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_proto.h	2005-04-26 21:28:40.366962000 -0700
@@ -114,23 +114,23 @@ int sdp_proc_dump_buff_pool(char *buffer
 /*
  * Wall between userspace protocol and SDP protocol proper
  */
-int sdp_wall_send_close(struct sdp_opt *conn);
+int sdp_wall_send_close(struct sdp_sock *conn);
 
-int sdp_wall_send_closing(struct sdp_opt *conn);
+int sdp_wall_send_closing(struct sdp_sock *conn);
 
-int sdp_wall_send_abort(struct sdp_opt *conn);
+int sdp_wall_send_abort(struct sdp_sock *conn);
 
-int sdp_wall_recv_close(struct sdp_opt *conn);
+int sdp_wall_recv_close(struct sdp_sock *conn);
 
-int sdp_wall_recv_closing(struct sdp_opt *conn);
+int sdp_wall_recv_closing(struct sdp_sock *conn);
 
-int sdp_wall_recv_abort(struct sdp_opt *conn);
+int sdp_wall_recv_abort(struct sdp_sock *conn);
 
-void sdp_wall_recv_drop(struct sdp_opt *conn);
+void sdp_wall_recv_drop(struct sdp_sock *conn);
 
-int sdp_wall_abort(struct sdp_opt *conn);
+int sdp_wall_abort(struct sdp_sock *conn);
 
-int sdp_recv_buff(struct sdp_opt *conn, struct sdpc_buff *buff);
+int sdp_recv_buff(struct sdp_sock *conn, struct sdpc_buff *buff);
 
 /*
  * Zcopy advertisment managment
@@ -184,7 +184,7 @@ void sdp_iocb_q_cancel(struct sdpc_iocb_
 
 void sdp_iocb_q_remove(struct sdpc_iocb *iocb);
 
-int sdp_iocb_register(struct sdpc_iocb *iocb, struct sdp_opt *conn);
+int sdp_iocb_register(struct sdpc_iocb *iocb, struct sdp_sock *conn);
 
 int sdp_iocb_release(struct sdpc_iocb *iocb);
 
@@ -276,13 +276,13 @@ int sdp_proc_dump_device(char *buffer,
 			 off_t start_index,
 			 long *end_index);
 
-int sdp_conn_table_remove(struct sdp_opt *conn);
+int sdp_conn_table_remove(struct sdp_sock *conn);
 
-struct sdp_opt *sdp_conn_table_lookup(s32 entry);
+struct sdp_sock *sdp_conn_table_lookup(s32 entry);
 
-struct sdp_opt *sdp_conn_alloc(int priority);
+struct sdp_sock *sdp_conn_alloc(int priority);
 
-int sdp_conn_alloc_ib(struct sdp_opt *conn,
+int sdp_conn_alloc_ib(struct sdp_sock *conn,
 		      struct ib_device *device, 
 		      u8 hw_port,
 		      u16 pkey);
@@ -300,41 +300,41 @@ void sdp_inet_wake_urg(struct sock *sk);
 /*
  * port/queue managment
  */
-int sdp_inet_accept_q_put(struct sdp_opt *listen_conn,
-			  struct sdp_opt *accept_conn);
+int sdp_inet_accept_q_put(struct sdp_sock *listen_conn,
+			  struct sdp_sock *accept_conn);
 
-struct sdp_opt *sdp_inet_accept_q_get(struct sdp_opt *listen_conn);
+struct sdp_sock *sdp_inet_accept_q_get(struct sdp_sock *listen_conn);
 
-int sdp_inet_accept_q_remove(struct sdp_opt *accept_conn);
+int sdp_inet_accept_q_remove(struct sdp_sock *accept_conn);
 
-int sdp_inet_listen_start(struct sdp_opt *listen_conn);
+int sdp_inet_listen_start(struct sdp_sock *listen_conn);
 
-int sdp_inet_listen_stop(struct sdp_opt *listen_conn);
+int sdp_inet_listen_stop(struct sdp_sock *listen_conn);
 
-struct sdp_opt *sdp_inet_listen_lookup(u32 addr, u16 port);
+struct sdp_sock *sdp_inet_listen_lookup(u32 addr, u16 port);
 
-int sdp_inet_port_get(struct sdp_opt *conn, u16 port);
+int sdp_inet_port_get(struct sdp_sock *conn, u16 port);
 
-int sdp_inet_port_put(struct sdp_opt *conn);
+int sdp_inet_port_put(struct sdp_sock *conn);
 
-int sdp_inet_port_inherit(struct sdp_opt *parent, struct sdp_opt *child);
+int sdp_inet_port_inherit(struct sdp_sock *parent, struct sdp_sock *child);
 
 /*
  * active connect functions
  */
-int sdp_cm_connect(struct sdp_opt *conn);
+int sdp_cm_connect(struct sdp_sock *conn);
 
 int sdp_cm_rep_handler(struct ib_cm_id *cm_id,
 		       struct ib_cm_event *event,
-		       struct sdp_opt *conn);
+		       struct sdp_sock *conn);
 
-void sdp_cm_actv_error(struct sdp_opt *conn, int error);
+void sdp_cm_actv_error(struct sdp_sock *conn, int error);
 /*
  * passive connect functions
  */
-void sdp_cm_pass_error(struct sdp_opt *conn, int error);
+void sdp_cm_pass_error(struct sdp_sock *conn, int error);
 
-int sdp_cm_pass_establish(struct sdp_opt *conn);
+int sdp_cm_pass_establish(struct sdp_sock *conn);
 
 int sdp_cm_req_handler(struct ib_cm_id *cm_id,
 		       struct ib_cm_event *event);
@@ -342,36 +342,36 @@ int sdp_cm_req_handler(struct ib_cm_id *
 /*
  * post functions
  */
-int sdp_recv_flush(struct sdp_opt *conn);
+int sdp_recv_flush(struct sdp_sock *conn);
 
-int sdp_send_flush(struct sdp_opt *conn);
+int sdp_send_flush(struct sdp_sock *conn);
 
-int sdp_send_ctrl_ack(struct sdp_opt *conn);
+int sdp_send_ctrl_ack(struct sdp_sock *conn);
 
-int sdp_send_ctrl_disconnect(struct sdp_opt *conn);
+int sdp_send_ctrl_disconnect(struct sdp_sock *conn);
 
-int sdp_send_ctrl_abort(struct sdp_opt *conn);
+int sdp_send_ctrl_abort(struct sdp_sock *conn);
 
-int sdp_send_ctrl_send_sm(struct sdp_opt *conn);
+int sdp_send_ctrl_send_sm(struct sdp_sock *conn);
 
-int sdp_send_ctrl_snk_avail(struct sdp_opt *conn,
+int sdp_send_ctrl_snk_avail(struct sdp_sock *conn,
 			    u32 size, 
 			    u32 rkey,
 			    u64 addr);
 
-int sdp_send_ctrl_resize_buff_ack(struct sdp_opt *conn, u32 size);
+int sdp_send_ctrl_resize_buff_ack(struct sdp_sock *conn, u32 size);
 
-int sdp_send_ctrl_rdma_rd(struct sdp_opt *conn, s32 size);
+int sdp_send_ctrl_rdma_rd(struct sdp_sock *conn, s32 size);
 
-int sdp_send_ctrl_rdma_wr(struct sdp_opt *conn, u32 size);
+int sdp_send_ctrl_rdma_wr(struct sdp_sock *conn, u32 size);
 
-int sdp_send_ctrl_mode_ch(struct sdp_opt *conn, u8 mode);
+int sdp_send_ctrl_mode_ch(struct sdp_sock *conn, u8 mode);
 
-int sdp_send_ctrl_src_cancel(struct sdp_opt *conn);
+int sdp_send_ctrl_src_cancel(struct sdp_sock *conn);
 
-int sdp_send_ctrl_snk_cancel(struct sdp_opt *conn);
+int sdp_send_ctrl_snk_cancel(struct sdp_sock *conn);
 
-int sdp_send_ctrl_snk_cancel_ack(struct sdp_opt *conn);
+int sdp_send_ctrl_snk_cancel_ack(struct sdp_sock *conn);
 
 /*
  * inet functions
@@ -380,20 +380,20 @@ int sdp_send_ctrl_snk_cancel_ack(struct 
 /*
  * event functions
  */
-int sdp_cq_event_locked(struct ib_wc *comp, struct sdp_opt *conn);
+int sdp_cq_event_locked(struct ib_wc *comp, struct sdp_sock *conn);
 
 void sdp_cq_event_handler(struct ib_cq *cq, void *arg);
 
 int sdp_cm_event_handler(struct ib_cm_id *cm_id,
 			 struct ib_cm_event *event);
 
-int sdp_event_recv(struct sdp_opt *conn, struct ib_wc *comp);
+int sdp_event_recv(struct sdp_sock *conn, struct ib_wc *comp);
 
-int sdp_event_send(struct sdp_opt *conn, struct ib_wc *comp);
+int sdp_event_send(struct sdp_sock *conn, struct ib_wc *comp);
 
-int sdp_event_read(struct sdp_opt *conn, struct ib_wc *comp);
+int sdp_event_read(struct sdp_sock *conn, struct ib_wc *comp);
 
-int sdp_event_write(struct sdp_opt *conn, struct ib_wc *comp);
+int sdp_event_write(struct sdp_sock *conn, struct ib_wc *comp);
 
 /*
  * DATA transport
@@ -409,11 +409,11 @@ int sdp_inet_recv(struct kiocb *iocb,
 		  size_t size, 
 		  int    flags);
 
-void sdp_iocb_q_cancel_all_read(struct sdp_opt *conn, ssize_t error);
+void sdp_iocb_q_cancel_all_read(struct sdp_sock *conn, ssize_t error);
 
-void sdp_iocb_q_cancel_all_write(struct sdp_opt *conn, ssize_t error);
+void sdp_iocb_q_cancel_all_write(struct sdp_sock *conn, ssize_t error);
 
-void sdp_iocb_q_cancel_all(struct sdp_opt *conn, ssize_t error);
+void sdp_iocb_q_cancel_all(struct sdp_sock *conn, ssize_t error);
 
 /*
  * link address information
@@ -443,7 +443,7 @@ int sdp_link_addr_cleanup(void);
 /*
  * Event handling function, demultiplexed base on Message ID
  */
-typedef int (*sdp_event_cb_func)(struct sdp_opt *conn, 
+typedef int (*sdp_event_cb_func)(struct sdp_sock *conn, 
 				 struct sdpc_buff *buff);
 
 /*
@@ -478,7 +478,7 @@ extern int sdp_debug_level;
 
 #define sdp_conn_dbg(level, type, conn, format, arg...) \
         do { \
-                struct sdp_opt *x = (conn); \
+                struct sdp_sock *x = (conn); \
                 if (x) { \
                         sdp_dbg_out(level, type, \
                                       "<%d> <%04x:%04x> " format, \
@@ -552,7 +552,7 @@ do {                                    
 /*
  * sdp_inet_write_space - writable space on send side
  */
-static inline int sdp_inet_write_space(struct sdp_opt *conn, int urg)
+static inline int sdp_inet_write_space(struct sdp_sock *conn, int urg)
 {
 	int size;
 
@@ -576,7 +576,7 @@ static inline int sdp_inet_write_space(s
 /*
  * sdp_inet_writable - return non-zero if socket is writable
  */
-static inline int sdp_inet_writable(struct sdp_opt *conn)
+static inline int sdp_inet_writable(struct sdp_sock *conn)
 {
 	if (SDP_ST_MASK_OPEN & conn->istate)
 		return (sdp_inet_write_space(conn, 0) <
@@ -588,7 +588,7 @@ static inline int sdp_inet_writable(stru
 /*
  * sdp_conn_stat_dump - dump stats to the log
  */
-static inline int sdp_conn_stat_dump(struct sdp_opt *conn)
+static inline int sdp_conn_stat_dump(struct sdp_sock *conn)
 {
 #ifdef _SDP_CONN_STATS_REC
 	int counter;
@@ -611,7 +611,7 @@ static inline int sdp_conn_stat_dump(str
 /*
  * sdp_conn_state_dump - dump state information to the log
  */
-static inline void sdp_conn_state_dump(struct sdp_opt *conn)
+static inline void sdp_conn_state_dump(struct sdp_sock *conn)
 {
 #ifdef _SDP_CONN_STATE_REC
 	int counter;
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_rcvd.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_rcvd.c	2005-04-25 11:55:01.474000000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_rcvd.c	2005-04-26 21:28:40.377962000 -0700
@@ -38,7 +38,7 @@
  * Specific MID handler functions. (RECV)
  */
 
-static int sdp_rcvd_disconnect(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_disconnect(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result = 0;
 
@@ -115,7 +115,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_abort(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_abort(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result = 0;
 
@@ -145,7 +145,7 @@ static int sdp_rcvd_abort(struct sdp_opt
 	return result;
 }
 
-static int sdp_rcvd_send_sm(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_send_sm(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct sdpc_iocb *iocb;
 	int result;
@@ -188,7 +188,7 @@ static int sdp_rcvd_send_sm(struct sdp_o
 	return 0;
 }
 
-static int sdp_rcvd_rdma_wr(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_rdma_wr(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_rwch *rwch;
 	struct sdpc_iocb *iocb;
@@ -249,7 +249,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_rdma_rd(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_rdma_rd(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_rrch *rrch;
 	struct sdpc_iocb *iocb;
@@ -331,7 +331,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_mode_change(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_mode_change(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_mch *mch;
 	int result;
@@ -427,7 +427,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_src_cancel(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_src_cancel(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct sdpc_advt *advt;
 	int result;
@@ -517,7 +517,7 @@ done:
 	return result;
 }
 
-static int sdp_rcvd_snk_cancel(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_snk_cancel(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct sdpc_advt *advt;
 	s32 counter;
@@ -591,7 +591,7 @@ done:
 /*
  * sdp_rcvd_snk_cancel_ack - sink cancel confirmantion
  */
-static int sdp_rcvd_snk_cancel_ack(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_snk_cancel_ack(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct sdpc_iocb *iocb;
 	int result;
@@ -629,7 +629,7 @@ done:
 /*
  * sdp_rcvd_resize_buff_ack - buffer size change request
  */
-static int sdp_rcvd_resize_buff_ack(struct sdp_opt *conn,
+static int sdp_rcvd_resize_buff_ack(struct sdp_sock *conn,
 				    struct sdpc_buff *buff)
 {
 	struct msg_hdr_crbh *crbh;
@@ -659,7 +659,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_suspend(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_suspend(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_sch *sch;
 
@@ -671,12 +671,12 @@ static int sdp_rcvd_suspend(struct sdp_o
 	return 0;
 }
 
-static int sdp_rcvd_suspend_ack(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_suspend_ack(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_rcvd_snk_avail(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_snk_avail(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_snkah *snkah;
 	struct sdpc_advt *advt;
@@ -810,7 +810,7 @@ error:
 	return result;
 }
 
-static int sdp_rcvd_src_avail(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_src_avail(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_srcah *srcah;
 	struct sdpc_advt *advt;
@@ -980,7 +980,7 @@ done:
 /*
  * sdp_rcvd_data - SDP data message event received
  */
-static int sdp_rcvd_data(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_data(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int ret_val;
 
@@ -1025,7 +1025,7 @@ static int sdp_rcvd_data(struct sdp_opt 
 /*
  * sdp_rcvd_unsupported - Valid messages we're not expecting
  */
-static int sdp_rcvd_unsupported(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_rcvd_unsupported(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	/*
 	 * Since the gateway only initates RDMA's but is never a target, and
@@ -1089,7 +1089,7 @@ static sdp_event_cb_func recv_event_func
 /*
  * sdp_event_recv - recv event demultiplexing into sdp messages
  */
-int sdp_event_recv(struct sdp_opt *conn, struct ib_wc *comp)
+int sdp_event_recv(struct sdp_sock *conn, struct ib_wc *comp)
 {
 	sdp_event_cb_func dispatch_func;
 	struct sdpc_buff *buff;
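
Note on sdp_event_recv() above: received messages are demultiplexed by
message ID through a table of sdp_event_cb_func pointers. Below is a
minimal standalone sketch of that kind of dispatch. The table size of 0x20
mirrors SDP_MSG_EVENT_TABLE_SIZE from sdp_conn.h. Everything else (the
demo_* names, the message ID values, the -1 error return) is assumed for
illustration and does not reflect the driver's actual values.

```c
#include <assert.h>
#include <stddef.h>

struct demo_conn {
	int disconnects;
	int data_msgs;
};

struct demo_buff {
	unsigned char mid;	/* message ID taken from the header */
};

typedef int (*demo_event_cb_func)(struct demo_conn *conn,
				  struct demo_buff *buff);

#define DEMO_EVENT_TABLE_SIZE 0x20

/* Arbitrary illustrative IDs, not the real SDP message IDs. */
enum { DEMO_MID_DISCONNECT = 0x01, DEMO_MID_DATA = 0x02 };

static int demo_rcvd_disconnect(struct demo_conn *conn, struct demo_buff *buff)
{
	(void)buff;
	conn->disconnects++;
	return 0;
}

static int demo_rcvd_data(struct demo_conn *conn, struct demo_buff *buff)
{
	(void)buff;
	conn->data_msgs++;
	return 0;
}

/* Unlisted slots are NULL and are rejected by the dispatcher. */
static const demo_event_cb_func demo_recv_event_funcs[DEMO_EVENT_TABLE_SIZE] = {
	[DEMO_MID_DISCONNECT] = demo_rcvd_disconnect,
	[DEMO_MID_DATA]       = demo_rcvd_data,
};

static int demo_event_recv(struct demo_conn *conn, struct demo_buff *buff)
{
	demo_event_cb_func dispatch_func;

	assert(conn != NULL && buff != NULL);
	if (buff->mid >= DEMO_EVENT_TABLE_SIZE)
		return -1;	/* ID outside the table */
	dispatch_func = demo_recv_event_funcs[buff->mid];
	if (dispatch_func == NULL)
		return -1;	/* no handler registered */
	return dispatch_func(conn, buff);
}
```

Since every handler takes the connection pointer, the typedef and all
handlers have to switch to struct sdp_sock in the same patch, which is
what the sdp_proto.h and sdp_rcvd.c hunks do.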
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_read.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_read.c	2005-04-25 11:55:01.478002000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_read.c	2005-04-26 21:28:40.383962000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_event_read_advt - RDMA read event handler for source advertisments
  */
-static int sdp_event_read_advt(struct sdp_opt *conn, struct ib_wc *comp)
+static int sdp_event_read_advt(struct sdp_sock *conn, struct ib_wc *comp)
 {
 	struct sdpc_advt *advt;
 	int result;
@@ -107,7 +107,7 @@ error:
 /*
  * sdp_event_read - RDMA read event handler
  */
-int sdp_event_read(struct sdp_opt *conn, struct ib_wc *comp)
+int sdp_event_read(struct sdp_sock *conn, struct ib_wc *comp)
 {
 	struct sdpc_iocb *iocb;
 	struct sdpc_buff *buff;
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_recv.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_recv.c	2005-04-25 11:55:01.489003000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_recv.c	2005-04-26 21:28:40.391966000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_post_recv_buff - post a single buffers for data recv
  */
-static int sdp_post_recv_buff(struct sdp_opt *conn)
+static int sdp_post_recv_buff(struct sdp_sock *conn)
 {
 	struct ib_recv_wr receive_param = { NULL };
 	struct ib_recv_wr *bad_wr;
@@ -115,7 +115,7 @@ error:
 /*
  * sdp_post_rdma_buff - post a single buffers for rdma read on a conn
  */
-static int sdp_post_rdma_buff(struct sdp_opt *conn)
+static int sdp_post_rdma_buff(struct sdp_sock *conn)
 {
 	struct ib_send_wr send_param = { NULL };
 	struct ib_send_wr *bad_wr;
@@ -222,7 +222,7 @@ done:
 /*
  * sdp_post_rdma_iocb_src - post a iocb for rdma read on a conn
  */
-static int sdp_post_rdma_iocb_src(struct sdp_opt *conn)
+static int sdp_post_rdma_iocb_src(struct sdp_sock *conn)
 {
 	struct ib_send_wr send_param = { NULL };
 	struct ib_send_wr *bad_wr;
@@ -353,7 +353,7 @@ done:
 /*
  * sdp_post_rdma_iocb_snk - post a iocb for rdma read on a conn
  */
-static int sdp_post_rdma_iocb_snk(struct sdp_opt *conn)
+static int sdp_post_rdma_iocb_snk(struct sdp_sock *conn)
 {
 	struct sdpc_iocb *iocb;
 	int result = 0;
@@ -452,7 +452,7 @@ error:
 /*
  * sdp_post_rdma - post a rdma based requests for a connection
  */
-static int sdp_post_rdma(struct sdp_opt *conn)
+static int sdp_post_rdma(struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -531,7 +531,7 @@ done:
 /*
  * sdp_recv_flush - post a certain number of buffers on a connection
  */
-int sdp_recv_flush(struct sdp_opt *conn)
+int sdp_recv_flush(struct sdp_sock *conn)
 {
 	int result = 0;
 	int counter;
@@ -695,7 +695,7 @@ static int sdp_read_buff_iocb(struct sdp
 /*
  * sdp_recv_buff_iocb_active - Ease AIO read pending pressure
  */
-static int sdp_recv_buff_iocb_active(struct sdp_opt *conn,
+static int sdp_recv_buff_iocb_active(struct sdp_sock *conn,
 				     struct sdpc_buff *buff)
 {
 	struct sdpc_iocb *iocb;
@@ -744,7 +744,7 @@ static int sdp_recv_buff_iocb_active(str
 /*
  * sdp_recv_buff_iocb_pending - Ease AIO read pending pressure
  */
-static int sdp_recv_buff_iocb_pending(struct sdp_opt *conn,
+static int sdp_recv_buff_iocb_pending(struct sdp_sock *conn,
 				      struct sdpc_buff *buff)
 {
 	struct sdpc_iocb *iocb;
@@ -804,7 +804,7 @@ static int sdp_recv_buff_iocb_pending(st
 /*
  * sdp_recv_buff - Process a new buffer based on queue type
  */
-int sdp_recv_buff(struct sdp_opt *conn, struct sdpc_buff *buff)
+int sdp_recv_buff(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result;
 	int buffered;
@@ -917,7 +917,7 @@ static int sdp_read_src_lookup(struct sd
 static int sdp_inet_read_cancel(struct kiocb *req, struct io_event *ev)
 {
 	struct sock_iocb *si = kiocb_to_siocb(req);
-	struct sdp_opt   *conn;
+	struct sdp_sock   *conn;
 	struct sdpc_iocb *iocb;
 	int result = 0;
 
@@ -1081,7 +1081,7 @@ static int sdp_inet_recv_urg_trav(struct
 static int sdp_inet_recv_urg(struct sock *sk, struct msghdr *msg, int size,
 			     int flags)
 {
-	struct sdp_opt *conn;
+	struct sdp_sock *conn;
 	struct sdpc_buff *buff;
 	int result = 0;
 	u8 value;
@@ -1155,7 +1155,7 @@ int sdp_inet_recv(struct kiocb  *req, st
 		  size_t size, int flags)
 {
 	struct sock      *sk;
-	struct sdp_opt   *conn;
+	struct sdp_sock   *conn;
 	struct sdpc_iocb *iocb;
 	struct sdpc_buff *buff;
 	struct sdpc_buff *head = NULL;
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_send.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_send.c	2005-04-25 11:55:01.511001000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_send.c	2005-04-26 21:28:40.403961000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_send_buff_post - Post a buffer send on a SDP connection
  */
-static int sdp_send_buff_post(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_send_buff_post(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct ib_send_wr send_param = { NULL };
 	struct ib_send_wr *bad_wr;
@@ -182,7 +182,7 @@ done:
 /*
  * sdp_send_data_buff_post - Post data for buffered transmission
  */
-static int sdp_send_data_buff_post(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_send_data_buff_post(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct sdpc_advt *advt;
 	int result;
@@ -292,7 +292,7 @@ error:
 /*
  * sdp_send_data_buff_snk - Post data for buffered transmission
  */
-static int sdp_send_data_buff_snk(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_send_data_buff_snk(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct ib_send_wr send_param = { NULL };
 	struct ib_send_wr *bad_wr;
@@ -428,7 +428,7 @@ error:
 /*
  * sdp_send_data_iocb_snk - process a zcopy write advert in the data path
  */
-static int sdp_send_data_iocb_snk(struct sdp_opt *conn, struct sdpc_iocb *iocb)
+static int sdp_send_data_iocb_snk(struct sdp_sock *conn, struct sdpc_iocb *iocb)
 {
 	struct ib_send_wr send_param = { NULL };
 	struct ib_send_wr *bad_wr;
@@ -553,7 +553,7 @@ error:
 /*
  * sdp_send_data_iocb_src - send a zcopy read advert in the data path
  */
-static int sdp_send_data_iocb_src(struct sdp_opt *conn, struct sdpc_iocb *iocb)
+static int sdp_send_data_iocb_src(struct sdp_sock *conn, struct sdpc_iocb *iocb)
 {
 	struct msg_hdr_srcah *src_ah;
 	struct sdpc_buff *buff;
@@ -764,7 +764,7 @@ static int sdp_send_iocb_buff_write(stru
 /*
  * sdp_send_data_iocb_buff - write multiple SDP buffers from an iocb
  */
-static int sdp_send_data_iocb_buff(struct sdp_opt *conn, struct sdpc_iocb *iocb)
+static int sdp_send_data_iocb_buff(struct sdp_sock *conn, struct sdpc_iocb *iocb)
 {
 	struct sdpc_buff *buff;
 	int result;
@@ -837,7 +837,7 @@ error:
 /*
  * sdp_send_data_iocb - Post IOCB data for transmission
  */
-static int sdp_send_data_iocb(struct sdp_opt *conn, struct sdpc_iocb *iocb)
+static int sdp_send_data_iocb(struct sdp_sock *conn, struct sdpc_iocb *iocb)
 {
 	int result = ENOBUFS;
 
@@ -928,7 +928,7 @@ done:
 /*
  * sdp_send_data_queue_test - send data buffer if conditions are met
  */
-static int sdp_send_data_queue_test(struct sdp_opt *conn,
+static int sdp_send_data_queue_test(struct sdp_sock *conn,
 				    struct sdpc_desc *element)
 {
 	int result;
@@ -959,7 +959,7 @@ static int sdp_send_data_queue_test(stru
 /*
  * sdp_send_data_queue_flush - Flush data from send queue, to send post
  */
-static int sdp_send_data_queue_flush(struct sdp_opt *conn)
+static int sdp_send_data_queue_flush(struct sdp_sock *conn)
 {
 	struct sdpc_desc *element;
 	int result = 0;
@@ -1001,7 +1001,7 @@ static int sdp_send_data_queue_flush(str
 /*
  * sdp_send_data_queue - send using the data queue if necessary
  */
-static int sdp_send_data_queue(struct sdp_opt *conn, struct sdpc_desc *element)
+static int sdp_send_data_queue(struct sdp_sock *conn, struct sdpc_desc *element)
 {
 	int result = 0;
 
@@ -1050,7 +1050,7 @@ done:
 /*
  * sdp_send_data_buff_get - get an appropriate write buffer for send
  */
-static inline struct sdpc_buff *sdp_send_data_buff_get(struct sdp_opt *conn)
+static inline struct sdpc_buff *sdp_send_data_buff_get(struct sdp_sock *conn)
 {
 	struct sdpc_buff *buff;
 
@@ -1075,7 +1075,7 @@ static inline struct sdpc_buff *sdp_send
 /*
  * sdp_send_data_buff_put - place a buffer into the send queue
  */
-static inline int sdp_send_data_buff_put(struct sdp_opt *conn,
+static inline int sdp_send_data_buff_put(struct sdp_sock *conn,
 					 struct sdpc_buff *buff, int size,
 					 int urg)
 {
@@ -1132,7 +1132,7 @@ static inline int sdp_send_data_buff_put
 /*
  * sdp_send_ctrl_buff_test - determine if it's OK to post a control msg
  */
-static int sdp_send_ctrl_buff_test(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_send_ctrl_buff_test(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result = 0;
 
@@ -1158,7 +1158,7 @@ error:
 /*
  * sdp_send_ctrl_buff_flush - Flush control buffers, to send post
  */
-static int sdp_send_ctrl_buff_flush(struct sdp_opt *conn)
+static int sdp_send_ctrl_buff_flush(struct sdp_sock *conn)
 {
 	struct sdpc_desc *element;
 	int result = 0;
@@ -1195,7 +1195,7 @@ static int sdp_send_ctrl_buff_flush(stru
 /*
  * sdp_send_ctrl_buff_buffered - Send a buffered control message
  */
-static int sdp_send_ctrl_buff_buffered(struct sdp_opt *conn,
+static int sdp_send_ctrl_buff_buffered(struct sdp_sock *conn,
 				       struct sdpc_buff *buff)
 {
 	int result = 0;
@@ -1231,7 +1231,7 @@ error:
 /*
  * sdp_send_ctrl_buff - Create and Send a buffered control message
  */
-static int sdp_send_ctrl_buff(struct sdp_opt *conn, u8 mid, int se, int sig)
+static int sdp_send_ctrl_buff(struct sdp_sock *conn, u8 mid, int se, int sig)
 {
 	int result = 0;
 	struct sdpc_buff *buff;
@@ -1285,7 +1285,7 @@ error:
 /*
  * do_send_ctrl_disconnect - Send a disconnect request
  */
-static int do_send_ctrl_disconnect(struct sdp_opt *conn)
+static int do_send_ctrl_disconnect(struct sdp_sock *conn)
 {
 	int result = 0;
 	struct sdpc_buff *buff;
@@ -1334,7 +1334,7 @@ error:
 /*
  * sdp_send_ctrl_disconnect - potentially send a disconnect request
  */
-int sdp_send_ctrl_disconnect(struct sdp_opt *conn)
+int sdp_send_ctrl_disconnect(struct sdp_sock *conn)
 {
 	int result;
 
@@ -1365,7 +1365,7 @@ int sdp_send_ctrl_disconnect(struct sdp_
 /*
  * sdp_send_ctrl_ack - Send a gratuitous Ack
  */
-int sdp_send_ctrl_ack(struct sdp_opt *conn)
+int sdp_send_ctrl_ack(struct sdp_sock *conn)
 {
 	/*
 	 * The gratuitous ack is not really and ack, but an update of the
@@ -1387,7 +1387,7 @@ int sdp_send_ctrl_ack(struct sdp_opt *co
 /*
  * sdp_send_ctrl_send_sm - Send a request for buffered mode
  */
-int sdp_send_ctrl_send_sm(struct sdp_opt *conn)
+int sdp_send_ctrl_send_sm(struct sdp_sock *conn)
 {
 	return sdp_send_ctrl_buff(conn, SDP_MID_SEND_SM, 1, 1);
 }
@@ -1395,7 +1395,7 @@ int sdp_send_ctrl_send_sm(struct sdp_opt
 /*
  * sdp_send_ctrl_src_cancel - Send a source cancel
  */
-int sdp_send_ctrl_src_cancel(struct sdp_opt *conn)
+int sdp_send_ctrl_src_cancel(struct sdp_sock *conn)
 {
 	return sdp_send_ctrl_buff(conn, SDP_MID_SRC_CANCEL, 1, 1);
 }
@@ -1403,7 +1403,7 @@ int sdp_send_ctrl_src_cancel(struct sdp_
 /*
  * sdp_send_ctrl_snk_cancel - Send a sink cancel
  */
-int sdp_send_ctrl_snk_cancel(struct sdp_opt *conn)
+int sdp_send_ctrl_snk_cancel(struct sdp_sock *conn)
 {
 	return sdp_send_ctrl_buff(conn, SDP_MID_SNK_CANCEL, 1, 1);
 }
@@ -1411,7 +1411,7 @@ int sdp_send_ctrl_snk_cancel(struct sdp_
 /*
  * sdp_send_ctrl_snk_cancel_ack - Send an ack for a sink cancel
  */
-int sdp_send_ctrl_snk_cancel_ack(struct sdp_opt *conn)
+int sdp_send_ctrl_snk_cancel_ack(struct sdp_sock *conn)
 {
 	return sdp_send_ctrl_buff(conn, SDP_MID_SNK_CANCEL_ACK, 1, 1);
 }
@@ -1419,7 +1419,7 @@ int sdp_send_ctrl_snk_cancel_ack(struct 
 /*
  * sdp_send_ctrl_abort - Send an abort message
  */
-int sdp_send_ctrl_abort(struct sdp_opt *conn)
+int sdp_send_ctrl_abort(struct sdp_sock *conn)
 {
 	/*
 	 * send
@@ -1430,7 +1430,7 @@ int sdp_send_ctrl_abort(struct sdp_opt *
 /*
  * sdp_send_ctrl_resize_buff_ack - Send an ack for a buffer size change
  */
-int sdp_send_ctrl_resize_buff_ack(struct sdp_opt *conn, u32 size)
+int sdp_send_ctrl_resize_buff_ack(struct sdp_sock *conn, u32 size)
 {
 	struct msg_hdr_crbah *crbah;
 	struct sdpc_buff *buff;
@@ -1481,7 +1481,7 @@ error:
 /*
  * sdp_send_ctrl_rdma_rd - Send an rdma read completion
  */
-int sdp_send_ctrl_rdma_rd(struct sdp_opt *conn, s32 size)
+int sdp_send_ctrl_rdma_rd(struct sdp_sock *conn, s32 size)
 {
 	struct msg_hdr_rrch *rrch;
 	struct sdpc_buff *buff;
@@ -1550,7 +1550,7 @@ error:
 /*
  * sdp_send_ctrl_rdma_wr - Send an rdma write completion
  */
-int sdp_send_ctrl_rdma_wr(struct sdp_opt *conn, u32 size)
+int sdp_send_ctrl_rdma_wr(struct sdp_sock *conn, u32 size)
 {
 	struct msg_hdr_rwch *rwch;
 	struct sdpc_buff *buff;
@@ -1607,7 +1607,7 @@ error:
 /*
  * sdp_send_ctrl_snk_avail - Send a sink available message
  */
-int sdp_send_ctrl_snk_avail(struct sdp_opt *conn, u32 size, u32 rkey, u64 addr)
+int sdp_send_ctrl_snk_avail(struct sdp_sock *conn, u32 size, u32 rkey, u64 addr)
 {
 	struct msg_hdr_snkah *snkah;
 	struct sdpc_buff *buff;
@@ -1670,7 +1670,7 @@ error:
 /*
  * sdp_send_ctrl_mode_ch - Send a mode change command
  */
-int sdp_send_ctrl_mode_ch(struct sdp_opt *conn, u8 mode)
+int sdp_send_ctrl_mode_ch(struct sdp_sock *conn, u8 mode)
 {
 	struct msg_hdr_mch *mch;
 	struct sdpc_buff *buff;
@@ -1776,7 +1776,7 @@ static int sdp_write_src_lookup(struct s
 static int sdp_inet_write_cancel(struct kiocb *req, struct io_event *ev)
 {
 	struct sock_iocb *si = kiocb_to_siocb(req);
-	struct sdp_opt   *conn;
+	struct sdp_sock   *conn;
 	struct sdpc_iocb *iocb;
 	int result = 0;
 
@@ -1915,7 +1915,7 @@ done:
 /*
  * sdp_send_flush_advt - Flush passive sink advertisments
  */
-static int sdp_send_flush_advt(struct sdp_opt *conn)
+static int sdp_send_flush_advt(struct sdp_sock *conn)
 {
 	struct sdpc_advt *advt;
 	int result;
@@ -1954,7 +1954,7 @@ static int sdp_send_flush_advt(struct sd
 /*
  * sdp_send_flush - Flush buffers from send queue, in to send post
  */
-int sdp_send_flush(struct sdp_opt *conn)
+int sdp_send_flush(struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -2016,7 +2016,7 @@ int sdp_inet_send(struct kiocb *req, str
 		  size_t size)
 {
 	struct sock      *sk;
-	struct sdp_opt   *conn;
+	struct sdp_sock   *conn;
 	struct sdpc_buff *buff;
 	struct sdpc_iocb *iocb;
 	int result = 0;
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_sent.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_sent.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_sent.c	2005-04-25 11:55:01.516002000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_sent.c	2005-04-26 21:28:40.409964000 -0700
@@ -38,7 +38,7 @@
  * Specific MID handler functions. (SEND)
  */
 
-static int sdp_sent_disconnect(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_disconnect(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result;
 
@@ -116,7 +116,7 @@ error:
 	return result;
 }
 
-static int sdp_sent_abort(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_abort(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result;
 
@@ -134,12 +134,12 @@ static int sdp_sent_abort(struct sdp_opt
 	return result;
 }
 
-static int sdp_sent_send_sm(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_send_sm(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_sent_rdma_wr(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_rdma_wr(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_rwch *rwch;
 
@@ -151,7 +151,7 @@ static int sdp_sent_rdma_wr(struct sdp_o
 	return 0;
 }
 
-static int sdp_sent_rdma_rd(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_rdma_rd(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_rrch *rrch;
 
@@ -163,7 +163,7 @@ static int sdp_sent_rdma_rd(struct sdp_o
 	return 0;
 }
 
-static int sdp_sent_mode_change(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_mode_change(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_mch *mch;
 
@@ -175,22 +175,22 @@ static int sdp_sent_mode_change(struct s
 	return 0;
 }
 
-static int sdp_sent_src_cancel(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_src_cancel(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_sent_snk_cancel(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_snk_cancel(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_sent_snk_cancel_ack(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_snk_cancel_ack(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_sent_resize_buff_ack(struct sdp_opt *conn,
+static int sdp_sent_resize_buff_ack(struct sdp_sock *conn,
 				    struct sdpc_buff *buff)
 {
 	struct msg_hdr_crbah *crbah;
@@ -203,7 +203,7 @@ static int sdp_sent_resize_buff_ack(stru
 	return 0;
 }
 
-static int sdp_sent_suspend(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_suspend(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_sch *sch;
 
@@ -215,12 +215,12 @@ static int sdp_sent_suspend(struct sdp_o
 	return 0;
 }
 
-static int sdp_sent_suspend_ack(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_suspend_ack(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	return 0;
 }
 
-static int sdp_sent_snk_avail(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_snk_avail(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_snkah *snkah;
 
@@ -232,7 +232,7 @@ static int sdp_sent_snk_avail(struct sdp
 	return 0;
 }
 
-static int sdp_sent_src_avail(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_src_avail(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	struct msg_hdr_srcah *srcah;
 
@@ -247,7 +247,7 @@ static int sdp_sent_src_avail(struct sdp
 /*
  * sdp_sent_data - SDP data message event received
  */
-static int sdp_sent_data(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_data(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	int result = 0;
 
@@ -259,7 +259,7 @@ static int sdp_sent_data(struct sdp_opt 
 /*
  * sdp_sent_unsupported - Valid messages we're not sending
  */
-static int sdp_sent_unsupported(struct sdp_opt *conn, struct sdpc_buff *buff)
+static int sdp_sent_unsupported(struct sdp_sock *conn, struct sdpc_buff *buff)
 {
 	/*
 	 * Since the gateway only initates RDMA's but is never a target, and
@@ -321,7 +321,7 @@ static sdp_event_cb_func send_event_func
 /*
  * sdp_event_send - send event handler
  */
-int sdp_event_send(struct sdp_opt *conn, struct ib_wc *comp)
+int sdp_event_send(struct sdp_sock *conn, struct ib_wc *comp)
 {
 	sdp_event_cb_func dispatch_func;
 	u32 free_count = 0;
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_wall.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_wall.c	2005-04-25 11:55:01.525001000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_wall.c	2005-04-26 21:28:40.416962000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_wall_send_close - callback to accept an active close
  */
-int sdp_wall_send_close(struct sdp_opt *conn)
+int sdp_wall_send_close(struct sdp_sock *conn)
 {
 	int result;
 
@@ -108,7 +108,7 @@ error:
 /*
  * sdp_wall_send_closing - callback to confirm a passive close
  */
-int sdp_wall_send_closing(struct sdp_opt *conn)
+int sdp_wall_send_closing(struct sdp_sock *conn)
 {
 	int result;
 
@@ -158,7 +158,7 @@ error:
 /*
  * sdp_wall_send_abort - callback to accept an active abort
  */
-int sdp_wall_send_abort(struct sdp_opt *conn)
+int sdp_wall_send_abort(struct sdp_sock *conn)
 {
 	int result = 0;
 
@@ -265,7 +265,7 @@ error:
 /*
  * sdp_wall_recv_close - callback to accept an active close
  */
-int sdp_wall_recv_close(struct sdp_opt *conn)
+int sdp_wall_recv_close(struct sdp_sock *conn)
 {
 	sdp_dbg_ctrl(conn, "Close recv. src <%08x:%04x> dst <%08x:%04x>",
 		     conn->src_addr, conn->src_port, 
@@ -306,7 +306,7 @@ int sdp_wall_recv_close(struct sdp_opt *
 /*
  * sdp_wall_recv_closing - callback for a close confirmation
  */
-int sdp_wall_recv_closing(struct sdp_opt *conn)
+int sdp_wall_recv_closing(struct sdp_sock *conn)
 {
 	sdp_dbg_ctrl(conn, "Closing recv. src <%08x:%04x> dst <%08x:%04x>",
 		     conn->src_addr, conn->src_port, 
@@ -336,7 +336,7 @@ int sdp_wall_recv_closing(struct sdp_opt
 /*
  * sdp_wall_recv_abort - abortive close notification
  */
-int sdp_wall_recv_abort(struct sdp_opt *conn)
+int sdp_wall_recv_abort(struct sdp_sock *conn)
 {
 	sdp_dbg_ctrl(conn, "Abort recv. src <%08x:%04x> dst <%08x:%04x>",
 		     conn->src_addr, conn->src_port, 
@@ -376,7 +376,7 @@ int sdp_wall_recv_abort(struct sdp_opt *
 /*
  * sdp_wall_recv_drop - drop SDP protocol reference to socket
  */
-void sdp_wall_recv_drop(struct sdp_opt *conn)
+void sdp_wall_recv_drop(struct sdp_sock *conn)
 {
 	int result;
 
@@ -433,7 +433,7 @@ void sdp_wall_recv_drop(struct sdp_opt *
 /*
  * sdp_wall_abort - intiate socket dropping
  */
-int sdp_wall_abort(struct sdp_opt *conn)
+int sdp_wall_abort(struct sdp_sock *conn)
 {
 	int result;
 
diff -up linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_write.c linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_write.c
--- linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp.pre/sdp_write.c	2005-04-25 11:55:01.528004000 -0700
+++ linux-2.6.12-rc3-openib/drivers/infiniband/ulp/sdp/sdp_write.c	2005-04-26 21:28:40.419965000 -0700
@@ -41,7 +41,7 @@
 /*
  * sdp_event_write - RDMA write event handler
  */
-int sdp_event_write(struct sdp_opt *conn, struct ib_wc *comp)
+int sdp_event_write(struct sdp_sock *conn, struct ib_wc *comp)
 {
 	struct sdpc_iocb *iocb;
 	struct sdpc_buff *buff;


From halr at voltaire.com  Wed Apr 27 11:49:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Apr 2005 14:49:24 -0400
Subject: [openib-general] [PATCH] ping: Use Sean's new nifty MAD helper
	functions
Message-ID: <1114627763.1764.628.camel@localhost.localdomain>

ping: Use Sean's new nifty MAD helper functions
Also, minor cleanup during startup and shutdown

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

Note that the patch below addresses a little more than half of Roland's
comments on this. The TODO list for this is:
1. finish addressing comments
2. wire MAD changes (discussed on list)
3. support for RMPP
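
As a rough illustration of the ownership rules that the reworked
pingd_recv_handler in the patch below follows, here is a user-space
sketch. It is not kernel code: every mock_* name, the live_* counters
and the fail_* knobs are made up for this example and only stand in for
ib_create_ah_from_wc, ib_create_send_mad, ib_post_send_mad,
ib_free_send_mad, ib_destroy_ah and ib_free_recv_mad. The sketch assumes
a single reply in flight at a time, which the real code does not.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real API. */
struct mock_ah  { int unused; };
struct mock_msg { struct mock_ah *ah; };

/* Counters of live objects, so a caller can check that nothing leaks. */
int live_ah, live_msg, live_recv;

/* Knobs that force a failure at a chosen step. */
bool fail_create_ah, fail_create_msg, fail_post;

/* The one reply currently in flight (assumption: at most one). */
static struct mock_msg *posted;

static struct mock_ah *mock_create_ah(void)
{
	struct mock_ah *ah = fail_create_ah ? NULL : malloc(sizeof *ah);

	if (ah)
		live_ah++;
	return ah;
}

static void mock_destroy_ah(struct mock_ah *ah)
{
	free(ah);
	live_ah--;
}

static struct mock_msg *mock_create_msg(struct mock_ah *ah)
{
	struct mock_msg *msg = fail_create_msg ? NULL : malloc(sizeof *msg);

	if (msg) {
		msg->ah = ah;
		live_msg++;
	}
	return msg;
}

static void mock_free_msg(struct mock_msg *msg)
{
	free(msg);
	live_msg--;
}

static int mock_post(struct mock_msg *msg)
{
	if (fail_post)
		return -1;
	posted = msg;
	return 0;
}

/* Receive path: build and post a reply, unwinding in reverse order. */
int mock_recv_handler(void)
{
	struct mock_ah *ah;
	struct mock_msg *msg;

	live_recv++;		/* the MAD layer hands us one receive buffer */

	ah = mock_create_ah();
	if (!ah)
		goto error1;

	msg = mock_create_msg(ah);
	if (!msg)
		goto error2;

	if (!mock_post(msg)) {
		live_recv--;	/* receive buffer is released right away */
		return 0;	/* msg and ah are released on send completion */
	}

	mock_free_msg(msg);
error2:
	mock_destroy_ah(ah);
error1:
	live_recv--;		/* every exit releases the receive buffer */
	return -1;
}

/* Send completion path: release what the receive path left in flight. */
void mock_send_handler(void)
{
	struct mock_msg *msg = posted;

	posted = NULL;
	mock_destroy_ah(msg->ah);
	mock_free_msg(msg);
}
```

The point is only the ordering: the receive buffer is released on every
exit from the receive path, while the address handle and the send buffer
outlive a successful post and are released by the send completion.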

Index: ping.c
===================================================================
--- ping.c	(revision 2212)
+++ ping.c	(working copy)
@@ -40,9 +40,20 @@
 #include <linux/utsname.h>
 #include <asm/bug.h>
 
-#include "ping_priv.h"
-#include "mad_priv.h"
+#include <ib_mad.h>
 
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("kernel IB ping agent");
+MODULE_AUTHOR("Shahar Frank");
+
+#define SPFX "ib_ping: "
+
+struct ib_ping_port_private {
+	struct list_head port_list;
+	int port_num;
+	struct ib_mad_agent *pingd_agent;     /* OpenIB Ping class */
+};
+
 static spinlock_t ib_ping_port_list_lock;
 static LIST_HEAD(ib_ping_port_list);
 
@@ -86,172 +97,79 @@
 	return entry;
 }
 
-static int ping_mad_send(struct ib_mad_agent *mad_agent,
-			 struct ib_ping_port_private *port_priv,
-			 struct ib_mad_private *mad_priv,
-			 struct ib_grh *grh,
-			 struct ib_wc *wc)
-{
-	struct ib_ping_send_wr *ping_send_wr;
-	struct ib_sge gather_list;
-	struct ib_send_wr send_wr;
-	struct ib_send_wr *bad_send_wr;
-	struct ib_ah_attr ah_attr;
-	unsigned long flags;
-	int ret = 1;
-
-	ping_send_wr = kmalloc(sizeof(*ping_send_wr), GFP_KERNEL);
-	if (!ping_send_wr)
-		goto out;
-	ping_send_wr->mad = mad_priv;
-
-	/* PCI mapping */
-	gather_list.addr = dma_map_single(mad_agent->device->dma_device,
-					  &mad_priv->mad,
-					  sizeof(mad_priv->mad),
-					  DMA_TO_DEVICE);
-	gather_list.length = sizeof(mad_priv->mad);
-	gather_list.lkey = mad_agent->mr->lkey;
-
-	send_wr.next = NULL;
-	send_wr.opcode = IB_WR_SEND;
-	send_wr.sg_list = &gather_list;
-	send_wr.num_sge = 1;
-	send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */
-	send_wr.wr.ud.timeout_ms = 0;
-	send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED;
-
-	ah_attr.dlid = wc->slid;
-	ah_attr.port_num = mad_agent->port_num;
-	ah_attr.src_path_bits = wc->dlid_path_bits;
-	ah_attr.sl = wc->sl;
-	ah_attr.static_rate = 0;
-	ah_attr.ah_flags = 0; /* No GRH */
-	if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_OPENIB_PING) {
-		if (wc->wc_flags & IB_WC_GRH) {
-			ah_attr.ah_flags = IB_AH_GRH;
-			/* Should sgid be looked up ? */
-			ah_attr.grh.sgid_index = 0;
-			ah_attr.grh.hop_limit = grh->hop_limit;
-			ah_attr.grh.flow_label = be32_to_cpu(
-				grh->version_tclass_flow)  & 0xfffff;
-			ah_attr.grh.traffic_class = (be32_to_cpu(
-				grh->version_tclass_flow) >> 20) & 0xff;
-			memcpy(ah_attr.grh.dgid.raw,
-			       grh->sgid.raw,
-			       sizeof(ah_attr.grh.dgid));
-		}
-	} else {
-		printk(KERN_ERR SPFX "Not OpenIB ping class 0x%x\n",
-		       mad_priv->mad.mad.mad_hdr.mgmt_class);
-		kfree(ping_send_wr);
-		goto out;
-	}
-
-	ping_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr);
-	if (IS_ERR(ping_send_wr->ah)) {
-		printk(KERN_ERR SPFX "No memory for address handle\n");
-		kfree(ping_send_wr);
-		goto out;
-	}
-
-	send_wr.wr.ud.ah = ping_send_wr->ah;
-	send_wr.wr.ud.pkey_index = wc->pkey_index;
-	send_wr.wr.ud.remote_qkey = IB_QP1_QKEY;
-	send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr;
-	send_wr.wr_id = (unsigned long)ping_send_wr;
-
-	pci_unmap_addr_set(ping_send_wr, mapping, gather_list.addr);
-
-	/* Send */
-	spin_lock_irqsave(&port_priv->send_list_lock, flags);
-	if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) {
-		spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
-		dma_unmap_single(mad_agent->device->dma_device,
-				 pci_unmap_addr(ping_send_wr, mapping),
-				 sizeof(mad_priv->mad),
-				 DMA_TO_DEVICE);
-		ib_destroy_ah(ping_send_wr->ah);
-		kfree(ping_send_wr);
-	} else {
-		list_add_tail(&ping_send_wr->send_list,
-			      &port_priv->send_posted_list);
-		spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
-		ret = 0;
-	}
-
-out:
-	return ret;
-}
-
 static void pingd_recv_handler(struct ib_mad_agent *mad_agent,
 			       struct ib_mad_recv_wc *mad_recv_wc)
 {
-	struct ib_ping_port_private	*port_priv;
-	struct ib_vendor_mad	*vend;
-	struct ib_mad_private *recv = container_of(mad_recv_wc,
-					struct ib_mad_private,
-					header.recv_wc);
+	struct ib_ping_port_private *port_priv;
+	struct ib_ah *ah;
+	struct ib_mad_send_buf *msg;
+	struct ib_vendor_mad *vend;
+	struct ib_send_wr *bad_send_wr;
+	int ret;
 
 	/* Find matching MAD agent */
 	port_priv = ib_get_ping_port(NULL, 0, mad_agent);
 	if (!port_priv) {
-		kmem_cache_free(ib_mad_cache, recv);
 		printk(KERN_ERR SPFX "pingd_recv_handler: no matching MAD "
 		       "agent %p\n", mad_agent);
-		return;
+		goto error1;
 	}
 
-	vend = (struct ib_vendor_mad *)mad_recv_wc->recv_buf.mad;
+	ah = ib_create_ah_from_wc(mad_agent->qp->pd, mad_recv_wc->wc,
+				  mad_recv_wc->recv_buf.grh,
+				  mad_agent->port_num);
+	if (IS_ERR(ah)) {
+		printk(KERN_ERR SPFX "pingd_recv_handler: failed to create AH from recv WC\n");
+		goto error1;
+	}
 
+	msg = ib_create_send_mad(mad_agent, mad_recv_wc->wc->src_qp,
+				 mad_recv_wc->wc->pkey_index, ah,
+				 offsetof(struct ib_vendor_mad, data),
+				 mad_recv_wc->mad_len -
+					offsetof(struct ib_vendor_mad, data),
+				 GFP_KERNEL);
+	if (IS_ERR(msg)) {
+		printk(KERN_ERR SPFX "pingd_recv_handler: failed to create response MAD\n");
+		goto error2;
+	}
+
+	vend = (struct ib_vendor_mad *) msg->mad;
+	memcpy(vend, mad_recv_wc->recv_buf.mad, sizeof(*vend));
 	vend->mad_hdr.method |= IB_MGMT_METHOD_RESP;
 	vend->mad_hdr.status = 0;
 	if (!system_utsname.domainname[0])
 		strncpy(vend->data, system_utsname.nodename, sizeof vend->data);
 	else
 		snprintf(vend->data, sizeof vend->data, "%s.%s",
-			system_utsname.nodename, system_utsname.domainname);
+			 system_utsname.nodename, system_utsname.domainname);
 
 	/* Send response */
-	if (ping_mad_send(mad_agent, port_priv, recv,
-			  mad_recv_wc->recv_buf.grh, mad_recv_wc->wc)) {
-		kmem_cache_free(ib_mad_cache, recv);
-		printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n");
+	ret = ib_post_send_mad(mad_agent, &msg->send_wr, &bad_send_wr);
+	if (!ret) {
+		ib_free_recv_mad(mad_recv_wc);
+		return;
 	}
+
+	ib_free_send_mad(msg);
+	printk(KERN_ERR SPFX "pingd_recv_handler: reply failed\n");
+
+error2:
+	ib_destroy_ah(ah);
+error1:
+	ib_free_recv_mad(mad_recv_wc);
 }
 
 static void pingd_send_handler(struct ib_mad_agent *mad_agent,
 			       struct ib_mad_send_wc *mad_send_wc)
 {
-	struct ib_ping_port_private	*port_priv;
-	struct ib_ping_send_wr		*ping_send_wr;
-	unsigned long			flags;
+	struct ib_mad_send_buf *msg;
 
-	/* Find matching MAD agent */
-	port_priv = ib_get_ping_port(NULL, 0, mad_agent);
-	if (!port_priv) {
-		printk(KERN_ERR SPFX "pingd_send_handler: no matching MAD "
-		       "agent %p\n", mad_agent);
-		return;
-	}
-
-	ping_send_wr = (struct ib_ping_send_wr *)(unsigned long)mad_send_wc->wr_id;
-	spin_lock_irqsave(&port_priv->send_list_lock, flags);
-	/* Remove completed send from posted send MAD list */
-	list_del(&ping_send_wr->send_list);
-	spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
-
-	/* Unmap PCI */
-	dma_unmap_single(mad_agent->device->dma_device,
-			 pci_unmap_addr(ping_send_wr, mapping),
-			 sizeof(ping_send_wr->mad->mad),
-			 DMA_TO_DEVICE);
-
-	ib_destroy_ah(ping_send_wr->ah);
-
-	/* Release allocated memory */
-	kmem_cache_free(ib_mad_cache, ping_send_wr->mad);
-	kfree(ping_send_wr);
+	msg = (struct ib_mad_send_buf *) (unsigned long) mad_send_wc->wr_id;
+	ib_destroy_ah(msg->send_wr.wr.ud.ah);
+	if (mad_send_wc->status != IB_WC_SUCCESS)
+		printk(KERN_ERR SPFX "pingd_send_handler: Error sending MAD: %d\n", mad_send_wc->status);
+	ib_free_send_mad(msg);
 }
 
 static int ib_ping_port_open(struct ib_device *device, int port_num)
@@ -261,14 +179,6 @@
 	struct ib_mad_reg_req pingd_reg_req;
 	unsigned long flags;
 
-	/* First, check if port already open */
-	port_priv = ib_get_ping_port(device, port_num, NULL);
-	if (port_priv) {
-		printk(KERN_DEBUG SPFX "%s port %d already open\n",
-		       device->name, port_num);
-		return 0;
-	}
-
 	/* Create new device info */
 	port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL);
 	if (!port_priv) {
@@ -279,9 +189,6 @@
 
 	memset(port_priv, 0, sizeof *port_priv);
 	port_priv->port_num = port_num;
-	spin_lock_init(&port_priv->send_list_lock);
-	INIT_LIST_HEAD(&port_priv->send_posted_list);
-
 	pingd_reg_req.mgmt_class = IB_MGMT_CLASS_OPENIB_PING;
 	pingd_reg_req.mgmt_class_version = 1;
 	pingd_reg_req.oui[0] = (IB_OPENIB_OUI >> 16) & 0xff;
@@ -336,7 +243,7 @@
 
 static void ib_ping_init_device(struct ib_device *device)
 {
-	int ret, num_ports, cur_port, i, ret2;
+	int num_ports, cur_port, i;
 
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
@@ -347,34 +254,28 @@
 	}
 
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		ret = ib_ping_port_open(device, cur_port);
-		if (ret) {
+		if (ib_ping_port_open(device, cur_port)) {
 			printk(KERN_ERR SPFX "Couldn't open %s port %d\n",
 			       device->name, cur_port);
 			goto error_device_open;
 		}
 	}
-	goto error_device_query;
+	return;
 
 error_device_open:
 	while (i > 0) {
 		cur_port--;
-		ret2 = ib_ping_port_close(device, cur_port);
-		if (ret2) {
-			printk(KERN_ERR PFX "Couldn't close %s port %d "
+		if (ib_ping_port_close(device, cur_port))
+			printk(KERN_ERR SPFX "Couldn't close %s port %d "
 			       "for ping agent\n",
 			       device->name, cur_port);
-		}
 		i--;
 	}
-
-error_device_query:
-	return;
 }
 
 static void ib_ping_remove_device(struct ib_device *device)
 {
-	int ret = 0, i, num_ports, cur_port, ret2;
+	int i, num_ports, cur_port;
 
 	if (device->node_type == IB_NODE_SWITCH) {
 		num_ports = 1;
@@ -384,14 +284,10 @@
 		cur_port = 1;
 	}
 	for (i = 0; i < num_ports; i++, cur_port++) {
-		ret2 = ib_ping_port_close(device, cur_port);
-		if (ret2) {
+		if (ib_ping_port_close(device, cur_port))
 			printk(KERN_ERR SPFX "Couldn't close %s port %d "
 			       "for ping agent\n",
 			       device->name, cur_port);
-			if (!ret)
-				ret = ret2;
-		}
 	}
 }
 
Index: mad.c
===================================================================
--- mad.c	(revision 2217)
+++ mad.c	(working copy)
@@ -44,7 +44,6 @@
 
 
 kmem_cache_t *ib_mad_cache;
-EXPORT_SYMBOL(ib_mad_cache);
 
 static struct list_head ib_mad_port_list;
 static u32 ib_mad_client_id = 0;
Index: ping_priv.h
===================================================================
--- ping_priv.h	(revision 2212)
+++ ping_priv.h	(working copy)
@@ -1,61 +0,0 @@
-/*
- * Copyright (c) 2004, 2005 Mellanox Technologies Ltd.  All rights reserved.
- * Copyright (c) 2004, 2005 Infinicon Corporation.  All rights reserved.
- * Copyright (c) 2004, 2005 Intel Corporation.  All rights reserved.
- * Copyright (c) 2004, 2005 Topspin Corporation.  All rights reserved.
- * Copyright (c) 2004, 2005 Voltaire Corporation.  All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- *     Redistribution and use in source and binary forms, with or
- *     without modification, are permitted provided that the following
- *     conditions are met:
- *
- *      - Redistributions of source code must retain the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer.
- *
- *      - Redistributions in binary form must reproduce the above
- *        copyright notice, this list of conditions and the following
- *        disclaimer in the documentation and/or other materials
- *        provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
- * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
- * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- * $Id$
- */
-
-#ifndef __IB_PING_PRIV_H__
-#define __IB_PING_PRIV_H__
-
-#include <linux/pci.h>
-
-#define SPFX "ib_ping: "
-
-struct ib_ping_send_wr {
-	struct list_head send_list;
-	struct ib_ah *ah;
-	struct ib_mad_private *mad;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
-};
-
-struct ib_ping_port_private {
-	struct list_head port_list;
-	struct list_head send_posted_list;
-	spinlock_t send_list_lock;
-	int port_num;
-	struct ib_mad_agent *pingd_agent;     /* OpenIB Ping class */
-};
-
-#endif	/* __IB_PING_PRIV_H__ */


From jcarr at linuxmachines.com  Wed Apr 27 11:56:01 2005
From: jcarr at linuxmachines.com (Jeff Carr)
Date: Wed, 27 Apr 2005 11:56:01 -0700
Subject: [openib-general] in need of a simple ulp
In-Reply-To: <426FD098.9090300@ichips.intel.com>
References: <426FCF0A.3070806@linuxmachines.com>
	<426FD098.9090300@ichips.intel.com>
Message-ID: <426FE041.7000409@linuxmachines.com>

Sean Hefty wrote:

> Within the SVN repository, your best bet for finding things is staying 
>  within the gen2 branch.  

I figured that, but when I check out the repository with svn it doesn't 
seem to keep the correct dates on the files (it doesn't preserve the 
mtime). That makes it hard to figure out what has been modified recently!

> For a relatively simple example that does
> what you mention above, try:
> 
> https://openib.org/svn/gen2/utils/src/linux-kernel/infiniband/util/cmpost/
> 
> This is a simple CM test program for the kernel.

Thanks, I'll take a look.
Jeff


From jcarr at linuxmachines.com  Wed Apr 27 12:25:31 2005
From: jcarr at linuxmachines.com (Jeff Carr)
Date: Wed, 27 Apr 2005 12:25:31 -0700
Subject: [openib-general] correct method to update 2.6.11.7 to gen2
Message-ID: <426FE72B.6030301@linuxmachines.com>

In trying to build the modules from:
https://openib.org/svn/gen2/utils/src/linux-kernel/infiniband/util/

I would like (need in this case) to use the newest IB kernel code. There 
seem to be three places in the gen2 tree for this code.
Is this correct: ?

the newest code is here:
https://openib.org/svn/gen2/branches/roland-uverbs/src/linux-kernel/infiniband/

a somewhat older version is here:
https://openib.org/svn/gen2/branches/roland-merge/src/linux-kernel/infiniband/

a real old version is here:
https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/


Thanks,
Jeff


From roland at topspin.com  Wed Apr 27 12:37:59 2005
From: roland at topspin.com (Roland Dreier)
Date: Wed, 27 Apr 2005 12:37:59 -0700
Subject: [openib-general] correct method to update 2.6.11.7 to gen2
In-Reply-To: <426FE72B.6030301@linuxmachines.com> (Jeff Carr's message of
	"Wed, 27 Apr 2005 12:25:31 -0700")
References: <426FE72B.6030301@linuxmachines.com>
Message-ID: <52ll74nh7s.fsf@topspin.com>

    Jeff> I would like (need in this case) to use the newest IB kernel
    Jeff> code. There seem to be three places in the gen2 tree for
    Jeff> this code.  Is this correct: ?

Not quite...

    Jeff> https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/

This is the newest.

 - R.


From tduffy at sun.com  Wed Apr 27 13:28:04 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 13:28:04 -0700
Subject: [openib-general] [PATCH][DAPL] make dapl build outside of kernel
	tree
Message-ID: <1114633684.20016.8.camel@duffman>

Until DAPL is in the trunk, it should be able to be built outside of
your normal kernel tree.  These changes make that possible.  Now, you
can type something like:

$ KERNELDIR=/path/to/kernel/dir/or/object/dir make

in each of dat, dat-provider, and patches (to get ib_at).

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: gen2/users/jlentini/linux-kernel/dat-provider/Makefile
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/Makefile	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/Makefile	(working copy)
@@ -1,18 +1,3 @@
-
-obj-$(CONFIG_INFINIBAND_DAT_PROVIDER) += ib_dat_provider.o
-
-#debug
-KDAPL_DEBUG = 1
-ifeq (1,$(KDAPL_DEBUG))
-  EXTRA_CFLAGS += -O0 -g
-  EXTRA_CFLAGS += -DDAPL_DBG # -DDAPL_DBG_IO_TRC
-endif
-
-EXTRA_CFLAGS += 				\
-    -DDAPL_ATS					\
-    -Idrivers/infiniband/include		\
-    -Idrivers/dat
-
 PROVIDER_MODULES := \
 	dapl_openib_qp			\
 	dapl_openib_util		\
@@ -106,5 +91,25 @@ PROVIDER_MODULES := \
 
 PROVIDER_OBJS := $(foreach s, $(PROVIDER_MODULES), $(s).o)
 
-ib_dat_provider-y:= $(PROVIDER_OBJS)
+KDAPL_DEBUG = 1
+ifeq (1,$(KDAPL_DEBUG))
+  EXTRA_CFLAGS += -O0 -g
+  EXTRA_CFLAGS += -DDAPL_DBG # -DDAPL_DBG_IO_TRC
+endif
+
+EXTRA_CFLAGS += -DDAPL_ATS -Idrivers/infiniband/include -I$(obj)/../dat -I$(obj)/../patches/
+
+ifneq ($(KERNELRELEASE),)
+        obj-m := ib_dat_provider.o
+        ib_dat_provider-objs := $(PROVIDER_OBJS)
+else
+	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
+	PWD := $(shell pwd)
+
+default:
+	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
+
+endif
 
+clean:
+	rm -f *.o *.ko
Index: gen2/users/jlentini/linux-kernel/patches/at.c
===================================================================
--- gen2/users/jlentini/linux-kernel/patches/at.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/patches/at.c	(working copy)
@@ -45,7 +45,7 @@
 #include <ib_verbs.h>
 #include <ib_sa.h>
 
-#include "../ulp/ipoib/ipoib.h"
+#include <ipoib.h>
 #include <ib_at.h>
 
 MODULE_AUTHOR("Shahar Frank");
Index: gen2/users/jlentini/linux-kernel/patches/Makefile
===================================================================
--- gen2/users/jlentini/linux-kernel/patches/Makefile	(revision 0)
+++ gen2/users/jlentini/linux-kernel/patches/Makefile	(revision 0)
@@ -0,0 +1,16 @@
+EXTRA_CFLAGS += -Werror -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib/ -I$(obj)
+
+ifneq ($(KERNELRELEASE),)
+	obj-m := ib_at.o
+	ib_at-objs := at.o
+else
+	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
+	PWD := $(shell pwd)
+
+default:
+	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
+
+endif
+
+clean:
+	rm -f *.o *.ko
Index: gen2/users/jlentini/linux-kernel/dat/Makefile
===================================================================
--- gen2/users/jlentini/linux-kernel/dat/Makefile	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat/Makefile	(working copy)
@@ -1,13 +1,16 @@
+EXTRA_CFLAGS += -Werror -I$(obj)
 
-EXTRA_CFLAGS += \
-    -Idrivers/dat      	\
-    -Werror
+ifneq ($(KERNELRELEASE),)
+	obj-m := dat.o
+	dat-objs := consumer.o core.o dictionary.o dr.o provider.o
+else
+	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
+	PWD := $(shell pwd)
 
-obj-$(CONFIG_DAT) += dat.o
+default:
+	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
 
-dat-y := \
-    consumer.o		\
-    core.o 		\
-    dictionary.o	\
-    dr.o		\
-    provider.o
+endif
+
+clean:
+	rm -f *.o *.ko


From tduffy at sun.com  Wed Apr 27 13:53:21 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 13:53:21 -0700
Subject: [openib-general] [PATCH][DAPL] Fix sparse warnings on dapl builds
Message-ID: <1114635201.20016.13.camel@duffman>

This patch fixes all the sparse warnings during build of dat,
dat-provider, and ib_at.
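
As a rough illustration of the kinds of changes involved (a userspace
sketch, not code from the tree; struct node and the node_list_*
helpers are hypothetical names), the warnings addressed here are
typically "Using plain integer as NULL pointer", "symbol was not
declared. Should it be static?", and non-ANSI function declarations:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
	struct node *next;
	void *data;
};

/*
 * File-local object: marking it static keeps it out of the global
 * namespace, which is what sparse suggests when no header declares
 * the symbol.
 */
static struct node list_head;

/*
 * "(void)" makes this a proper prototype; empty parentheses in a C
 * definition leave the argument list unspecified.
 */
static void node_list_init(void)
{
	list_head.next = NULL;	/* pointers are cleared with NULL, not 0 */
	list_head.data = NULL;
}

/* Link a node at the front of the list. */
static void node_list_push(struct node *n)
{
	assert(n != NULL);
	n->next = list_head.next;
	list_head.next = n;
}

/* Return the node carrying "data", or NULL if there is none. */
static struct node *node_list_find(const void *data)
{
	struct node *n;

	for (n = list_head.next; n; n = n->next)
		if (n->data == data)
			return n;
	return NULL;	/* not found: NULL rather than a plain 0 */
}
```

None of these changes alter behaviour; they only make the intent
(pointer vs. integer, file-local vs. exported) visible to the checker.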

Signed-off-by: Tom Duffy <tduffy at sun.com>

Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c	(working copy)
@@ -263,7 +263,7 @@ dapl_ep_connect(DAT_EP_HANDLE ep_handle,
 						       connect_evd_handle,
 						       DAT_CONNECTION_EVENT_UNREACHABLE,
 						       (DAT_HANDLE) ep_ptr, 0,
-						       0);
+						       NULL);
 			dat_status = DAT_SUCCESS;
 		}
 	} else {
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c	(working copy)
@@ -44,7 +44,7 @@ MODULE_DESCRIPTION("DAT Provider for Inf
 MODULE_AUTHOR("James Lentini");
 
 int g_dapl_dbg_type = 0;
-MODULE_PARM(g_dapl_dbg_type, "i");
+module_param(g_dapl_dbg_type, int, 0644);
 MODULE_PARM_DESC(g_dapl_dbg_type, "Enable dapl debug types");
 
 static int dapl_init(void);
@@ -209,13 +209,13 @@ void DAT_PROVIDER_FINI_FUNC_NAME(const D
 	(void)dapl_provider_list_remove(provider_info->ia_name);
 }
 
-struct ib_client dapl_client = {
+static struct ib_client dapl_client = {
 	.name = "dapl",
 	.add = dapl_add_one,
 	.remove = dapl_remove_one
 };
 
-char *dev_name_suffix_table[3] = {
+static char *dev_name_suffix_table[3] = {
 	"",
 	"a",
 	"b"
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c	(working copy)
@@ -53,7 +53,7 @@ DAPL_PROVIDER_LIST g_dapl_provider_list;
 
 DAT_PROVIDER g_dapl_provider_template = {
 	NULL,
-	0,
+	NULL,
 	&dapl_ia_open,
 	&dapl_ia_query,
 	&dapl_ia_close,
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c	(working copy)
@@ -628,7 +628,7 @@ dapl_ib_setup_conn_listener(DAPL_IA * ia
 	if (status) {
 		/* need to destroy CM ID ??? */
 
-		sp_ptr->cm_srvc_handle = 0;
+		sp_ptr->cm_srvc_handle = NULL;
 
 		if (status == -EBUSY)
 			return DAT_CONN_QUAL_IN_USE;
@@ -799,22 +799,6 @@ DAT_RETURN dapl_ib_accept_connection(DAT
 	return DAT_SUCCESS;
 }
 
-DAT_RETURN dapl_ib_comm_established(DAPL_EP * ep_ptr)
-{
-	int status;
-	DAT_RETURN dat_status = DAT_SUCCESS;
-
-	status = ib_send_cm_rtu(ep_ptr->cm_handle, NULL, 0);
-	if (status) {
-		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
-			     " dapl_ib_comm_established: ib_send_cm_rtu failed: %d cm_handle: %x\n",
-			     status, ep_ptr->cm_handle);
-		return DAT_ERROR(DAT_INSUFFICIENT_RESOURCES, 0);
-	}
-
-	return dat_status;
-}
-
 /*
  * ib_cm_get_remote_gid 
  */
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c	(working copy)
@@ -683,7 +683,7 @@ dapl_ib_mw_unbind(DAPL_RMR * rmr,
 	mw_bind_prop.mw_access_flags = 0;
 	mw_bind_prop.send_flags =
 	    (is_signaled == DAT_TRUE) ? IB_SEND_SIGNALED : 0;
-	mw_bind_prop.mr = 0;
+	mw_bind_prop.mr = NULL;
 	mw_bind_prop.wr_id = (u64) (uintptr_t) cookie;
 	ib_status = ib_bind_mw(ep->qp_handle, rmr->mw_handle, &mw_bind_prop);
 	if (ib_status < 0) {
@@ -954,16 +954,6 @@ dapl_ib_get_async_event(ib_error_record_
 }
 
 DAT_RETURN
-dapl_ib_ncompletion_notify(ib_hca_handle_t hca_handle,
-			   ib_cq_handle_t cq_handle, DAT_COUNT num)
-{
-	int ib_status;
-
-	ib_status = ib_req_ncomp_notif(cq_handle, num);
-	return dapl_ib_status_convert(ib_status);
-}
-
-DAT_RETURN
 dapl_ib_get_hca_ids(ib_hca_handle_t hca, u8 port, union ib_gid * gid, u16 * lid)
 {
 	int status;
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c	(working copy)
@@ -52,7 +52,7 @@
 #include "dapl.h"
 #include "dapl_timer_util.h"
 
-struct timer_head {
+static struct timer_head {
 	DAPL_LLIST_HEAD timer_list_head;
 	spinlock_t lock;
 	DAPL_OS_WAIT_OBJECT wait_object;
@@ -63,7 +63,7 @@ typedef struct timer_head DAPL_TIMER_HEA
 
 void dapl_timer_thread(void *arg);
 
-void dapl_timer_init()
+void dapl_timer_init(void)
 {
 	/*
 	 * Set up the timer thread elements. The timer thread isn't
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h	(working copy)
@@ -105,7 +105,7 @@ typedef struct ib_shm_transport {
 	ib_mr_handle_t mr_handle;
 } ib_shm_transport_t;
 
-#define 	 IB_INVALID_HANDLE	       0
+#define 	 IB_INVALID_HANDLE	       NULL
 
 #define 	 IB_MAX_REQ_PDATA_SIZE	    92
 #define 	 IB_MAX_REP_PDATA_SIZE	    196
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c	(working copy)
@@ -190,7 +190,7 @@ dapl_cr_accept(DAT_CR_HANDLE cr_handle,
 							   request_evd_handle,
 							   DAT_CONNECTION_EVENT_ACCEPT_COMPLETION_ERROR,
 							   (DAT_HANDLE) ep_ptr,
-							   0, 0);
+							   0, NULL);
 
 			cr_ptr->header.magic = DAPL_MAGIC_CR_DESTROYED;
 		} else {
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c	(working copy)
@@ -49,7 +49,7 @@ DAPL_RMR *dapl_rmr_alloc(DAPL_PZ * pz)
 	rmr->header.handle_type = DAT_HANDLE_TYPE_RMR;
 	rmr->header.owner_ia = pz->header.owner_ia;
 	rmr->header.user_context.as_64 = 0;
-	rmr->header.user_context.as_ptr = 0;
+	rmr->header.user_context.as_ptr = NULL;
 	dapl_llist_init_entry(&rmr->header.ia_list_entry);
 	dapl_ia_link_rmr(rmr->header.owner_ia, rmr);
 	spin_lock_init(&rmr->header.lock);
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c	(working copy)
@@ -368,7 +368,7 @@ void dapl_ep_timeout(uintptr_t arg)
 	(void)dapl_evd_post_connection_event((DAPL_EVD *) ep_ptr->param.
 					     connect_evd_handle,
 					     DAT_CONNECTION_EVENT_TIMED_OUT,
-					     (DAT_HANDLE) ep_ptr, 0, 0);
+					     (DAT_HANDLE) ep_ptr, 0, NULL);
 }
 
 /*
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c	(working copy)
@@ -358,7 +358,7 @@ void dapl_evd_eh_print_cqe(ib_work_compl
 		"OP_COMP_AND_SWAP",
 		"OP_FETCH_AND_ADD",
 		"OP_BIND_MW",
-		0
+		NULL
 	};
 	dapl_dbg_log(DAPL_DBG_TYPE_CALLBACK,
 		     "\t >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<\n");
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c	(working copy)
@@ -144,7 +144,7 @@ dapl_ep_disconnect(DAT_EP_HANDLE ep_hand
 		evd_ptr = (DAPL_EVD *) ep_ptr->param.connect_evd_handle;
 		dapl_evd_post_connection_event(evd_ptr,
 					       DAT_CONNECTION_EVENT_DISCONNECTED,
-					       (DAT_HANDLE) ep_ptr, 0, 0);
+					       (DAT_HANDLE) ep_ptr, 0, NULL);
 		dat_status = DAT_SUCCESS;
 		goto bail;
 	}
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c	(working copy)
@@ -145,7 +145,7 @@ dapl_hash_rehash(DAPL_HASH_ELEM * elemen
 			return;
 		}
 	}
-	*head = 0;
+	*head = NULL;
 }
 
 /*
@@ -209,7 +209,7 @@ dapl_hash_add(DAPL_HASH_TABLEP p_table,
 		 */
 		p_table->table[hashValue].key = key;
 		p_table->table[hashValue].datum = datum;
-		p_table->table[hashValue].next_element = 0;
+		p_table->table[hashValue].next_element = NULL;
 		p_table->num_entries++;
 		status = DAT_TRUE;
 	} else {
@@ -222,7 +222,7 @@ dapl_hash_add(DAPL_HASH_TABLEP p_table,
 			DAPL_HASH_ELEM *lastelement;
 			newelement->key = key;
 			newelement->datum = datum;
-			newelement->next_element = 0;
+			newelement->next_element = NULL;
 			for (lastelement = &p_table->table[hashValue];
 			     lastelement->next_element;
 			     lastelement = lastelement->next_element) {
@@ -354,7 +354,7 @@ DAT_RETURN dapl_hash_create(DAT_COUNT ta
 	for (i = 0; i < table_size; i++) {
 		p_table->table[i].datum = NO_DATUM_VALUE;
 		p_table->table[i].key = 0;
-		p_table->table[i].next_element = 0;
+		p_table->table[i].next_element = NULL;
 	}
 
 	*pp_table = p_table;
Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c	(working copy)
@@ -71,7 +71,7 @@ void dapl_llist_init_entry(DAPL_LLIST_EN
 {
 	entry->blink = NULL;
 	entry->flink = NULL;
-	entry->data = 0;
+	entry->data = NULL;
 	entry->list_head = NULL;
 }
 
Index: gen2/users/jlentini/linux-kernel/patches/at.c
===================================================================
--- gen2/users/jlentini/linux-kernel/patches/at.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/patches/at.c	(working copy)
@@ -118,7 +118,7 @@ struct async {
 	int sa_id;
 };
 
-struct async pending_reqs;	/* dummy head for cyclic list */
+static struct async pending_reqs;	/* dummy head for cyclic list */
 
 struct ib_at_src {
 	u32 ip;
@@ -320,7 +320,7 @@ static void req_free(struct async *pend)
 
 	pend->status = IB_AT_STATUS_INVALID;
 	pend->type = IBAT_REQ_NONE;
-	pend->sa_query = 0;
+	pend->sa_query = NULL;
 }
 
 static int req_start(struct async *q, struct async *pend,
@@ -336,7 +336,7 @@ static int req_start(struct async *q, st
 
 	if (parent) {
 		DEBUG("wait on parent %p", parent);
-		pend->next = pend->prev = 0;
+		pend->next = pend->prev = NULL;
 		pend->parent = parent;
 		pend->waiting = parent->waiting;
 		parent->waiting = pend;
@@ -344,8 +344,8 @@ static int req_start(struct async *q, st
 		return 0;	/* waiting on other request */
 	}
 
-	pend->waiting = 0;
-	pend->parent = 0;
+	pend->waiting = NULL;
+	pend->parent = NULL;
 
 	DEBUG("link to pending list %p", q);
 	pend->next = q;
@@ -396,7 +396,7 @@ static void req_end(struct async *pend, 
 		if (!*rr)
 			WARN("pending request not found in parent request!");
 
-		pend->waiting = 0;
+		pend->waiting = NULL;
 		DEBUG("child %p removed from parent %p list",
 			pend, pend->parent);
 	}
@@ -405,10 +405,10 @@ static void req_end(struct async *pend, 
 		DEBUG("pend %p ending child req %p", pend, waiting);
 		pend->waiting = waiting->waiting;
 
-		waiting->waiting = 0;
-		waiting->parent = 0;
+		waiting->waiting = NULL;
+		waiting->parent = NULL;
 
-		req_end(waiting, nrec, 0);
+		req_end(waiting, nrec, NULL);
 	}
 
 	if (pend->next) {
@@ -483,7 +483,7 @@ static struct async *lookup_pending(stru
 			break;
 
 	spin_unlock_irqrestore(&q->lock, flags);
-	return a == q ? 0 : a;
+	return a == q ? NULL : a;
 }
 
 static struct async *lookup_req_id(struct async *q, u64 id)
@@ -498,7 +498,7 @@ static struct async *lookup_req_id(struc
 			break;
 
 	spin_unlock_irqrestore(&q->lock, flags);
-	return a == q ? 0 : a;
+	return a == q ? NULL : a;
 }
 
 static void flush_pending(struct async *q)
@@ -509,7 +509,7 @@ static void flush_pending(struct async *
 	DEBUG("flushing pending q %p", q);
 	spin_lock_irqsave(&q->lock, flags);
 	while ((a = q->next) != q)
-		req_end(a, -EINTR, 0);
+		req_end(a, -EINTR, NULL);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
@@ -561,7 +561,7 @@ route_req_complete(struct route_req *req
 	for (pend = req->pend.waiting; pend; pend = pend->waiting)	
 		route_req_output(req, pend->data);
 
-	req_end(&req->pend, 1, 0);
+	req_end(&req->pend, 1, NULL);
 }
 
 static void
@@ -587,7 +587,7 @@ path_req_complete(int status, struct ib_
 		return;
 	}
 
-	req->pend.sa_query = 0;
+	req->pend.sa_query = NULL;
 
 	req->pend.nelem = path_req_output(req, resp, 1,
 					  req->pend.data, req->pend.nelem);
@@ -597,7 +597,7 @@ path_req_complete(int status, struct ib_
 		pend->nelem = path_req_output(req, resp, 1,
 					      pend->data, pend->nelem);
 
-	req_end(&req->pend, req->pend.nelem, 0);
+	req_end(&req->pend, req->pend.nelem, NULL);
 	spin_unlock_irqrestore(&pending_reqs.lock, flags);
 }
 
@@ -624,7 +624,7 @@ static void ib_at_sweep(void *data)
 			     (req->dst_ip & 0xff000000) >> 24,
 			     jiffies, pend->start);
 
-			req_end(pend, -ETIMEDOUT, 0);
+			req_end(pend, -ETIMEDOUT, NULL);
 		}
 	}
 
@@ -902,7 +902,7 @@ int ib_at_cancel(u64 req_id)
 
 	/* Promote first child to be pending req */
 	if ((child = a->waiting)) {
-		child->parent = 0;
+		child->parent = NULL;
 
 		/* link child after parent in pending list */
 		child->next = a->next;
@@ -910,10 +910,10 @@ int ib_at_cancel(u64 req_id)
 		a->next->prev = child;
 		a->next = child;
 
-		a->waiting = 0;		/* clear to avoid cancelling childs */
+		a->waiting = NULL;	/* clear to avoid cancelling childs */
 	}
 
-	req_end(a, -EINTR, 0);
+	req_end(a, -EINTR, NULL);
 
 	spin_unlock_irqrestore(&pending_reqs.lock, flags);
 
Index: gen2/users/jlentini/linux-kernel/dat/dr.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat/dr.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat/dr.c	(working copy)
@@ -86,7 +86,7 @@ DAT_RETURN dat_dr_fini(void)
  * Function: dat_dr_insert
  ************************************************************************/
 
-extern DAT_RETURN
+DAT_RETURN
 dat_dr_insert(const DAT_PROVIDER_INFO * info, DAT_DR_ENTRY * entry)
 {
 	DAT_RETURN status;
@@ -134,7 +134,7 @@ dat_dr_insert(const DAT_PROVIDER_INFO * 
  * Function: dat_dr_remove
  ************************************************************************/
 
-extern DAT_RETURN dat_dr_remove(const DAT_PROVIDER_INFO * info)
+DAT_RETURN dat_dr_remove(const DAT_PROVIDER_INFO * info)
 {
 	DAT_DR_ENTRY *data;
 	DAT_DICTIONARY_ENTRY dict_entry;
@@ -180,7 +180,7 @@ extern DAT_RETURN dat_dr_remove(const DA
  * Function: dat_dr_provider_open
  ************************************************************************/
 
-extern DAT_RETURN
+DAT_RETURN
 dat_dr_provider_open(const DAT_PROVIDER_INFO * info,
 		     DAT_IA_OPEN_FUNC * p_ia_open_func)
 {
@@ -206,7 +206,7 @@ dat_dr_provider_open(const DAT_PROVIDER_
  * Function: dat_dr_provider_close
  ************************************************************************/
 
-extern DAT_RETURN dat_dr_provider_close(const DAT_PROVIDER_INFO * info)
+DAT_RETURN dat_dr_provider_close(const DAT_PROVIDER_INFO * info)
 {
 	DAT_RETURN status;
 	DAT_DR_ENTRY *data;
Index: gen2/users/jlentini/linux-kernel/dat/core.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat/core.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat/core.c	(working copy)
@@ -77,7 +77,7 @@ static DAT_MODULE_STATE g_module_state =
 
 static DAT_DBG_CLASS g_dbg_class = DAT_DBG_CLASS_ERROR;
 
-MODULE_PARM(g_dbg_class, "i");
+module_param(g_dbg_class, int, 0644);
 MODULE_PARM_DESC(g_dbg_class,
 		 "Bit mask to specify class of DAT debug messages.");
 
Index: gen2/users/jlentini/linux-kernel/dat/consumer.c
===================================================================
--- gen2/users/jlentini/linux-kernel/dat/consumer.c	(revision 2219)
+++ gen2/users/jlentini/linux-kernel/dat/consumer.c	(working copy)
@@ -48,7 +48,7 @@
  *
  ***********************************************************************/
 
-DAT_RETURN dat_strerror_major(DAT_RETURN value, const char **message)
+static DAT_RETURN dat_strerror_major(DAT_RETURN value, const char **message)
 {
 	switch (DAT_GET_TYPE(value)) {
 	case DAT_SUCCESS:
@@ -168,7 +168,7 @@ DAT_RETURN dat_strerror_major(DAT_RETURN
  * Function: dat_strerror_minor
  *
  ***********************************************************************/
-DAT_RETURN dat_strerror_minor(DAT_RETURN value, const char **message)
+static DAT_RETURN dat_strerror_minor(DAT_RETURN value, const char **message)
 {
 	switch (DAT_GET_SUBTYPE(value)) {
 
@@ -1480,17 +1480,6 @@ DAT_RETURN dat_cr_handoff(DAT_CR_HANDLE 
 	return DAT_CR_HANDOFF(cr_handle, handoff);
 }
 
-DAT_RETURN dat_evd_kquery(DAT_EVD_HANDLE evd_handle,
-			  DAT_EVD_PARAM_MASK evd_param_mask,
-			  DAT_EVD_PARAM * evd_param)
-{
-	if (evd_handle == NULL) {
-		return DAT_ERROR(DAT_INVALID_HANDLE,
-				 DAT_INVALID_HANDLE_EVD_REQUEST);
-	}
-	return DAT_EVD_QUERY(evd_handle, evd_param_mask, evd_param);
-}
-
 DAT_RETURN dat_lmr_query(DAT_LMR_HANDLE lmr_handle,
 			 DAT_LMR_PARAM_MASK lmv_param_mask,
 			 DAT_LMR_PARAM * lmr_param)


From mlleinin at hpcn.ca.sandia.gov  Wed Apr 27 13:54:07 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Wed, 27 Apr 2005 13:54:07 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <523btcp1wd.fsf@topspin.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
Message-ID: <1114635247.12544.549.camel@localhost>

On Wed, 2005-04-27 at 10:25 -0700, Roland Dreier wrote:
>     Tom> The problem seems to stem from the fact that the horizontal
>     Tom> blue bar does not move when the font is increased or
>     Tom> decreased.  Here is a series of screenshots to demonstrate
>     Tom> the issue:
> 
> Looks like there's some absolute positioning hard-coded in the html:
> 
>     <div id="header" style="height: 78px;" id="header"> <a href="index.html"><img alt="OpenIB.org" src="images/openib.gif" style="border: 0px solid ; width: 128px; height: 56px;">
> 
> even better is:
> 
>     <meta name="generator" content="Windows Notepad" />
> 
  We'll look into it.

	- Matt


From mplee at hpcn.ca.sandia.gov  Mon Apr 25 18:23:55 2005
From: mplee at hpcn.ca.sandia.gov (Michael Lee)
Date: Mon, 25 Apr 2005 18:23:55 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <523btcp1wd.fsf@topspin.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
Message-ID: <1114478635.6749.26.camel@acheron.ca.sandia.gov>

I removed the hard-coded positioning of the blue header bar...let me
know if it's okay.  Also, if anyone's interested in looking it over and
providing feedback, we had designed another OpenIB page that we were
considering rolling out.  Take a look:

http://hpcn.ca.sandia.gov/~cstanak/index.php

Michael


On Wed, 2005-04-27 at 10:25 -0700, Roland Dreier wrote:
>     Tom> The problem seems to stem from the fact that the horizontal
>     Tom> blue bar does not move when the font is increased or
>     Tom> decreased.  Here is a series of screenshots to demonstrate
>     Tom> the issue:
> 
> Looks like there's some absolute positioning hard-coded in the html:
> 
>     <div id="header" style="height: 78px;" id="header"> <a href="index.html"><img alt="OpenIB.org" src="images/openib.gif" style="border: 0px solid ; width: 128px; height: 56px;">
> 
> even better is:
> 
>     <meta name="generator" content="Windows Notepad" />
> 
>  - R.


From tduffy at sun.com  Wed Apr 27 17:50:43 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 17:50:43 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <1114478635.6749.26.camel@acheron.ca.sandia.gov>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
Message-ID: <1114649443.14721.6.camel@duffman>

On Mon, 2005-04-25 at 18:23 -0700, Michael Lee wrote:
> I removed the hard-coded positioning of the blue header bar...let me
> know if it's okay.

That seems better, although you can still go too small on the font and
it will overlap the logo.  May be a problem on small screens like cell
phones or pdas.

> Also, if anyone's interested in looking it over and
> providing feedback, we had designed another OpenIB page that we were
> considering rolling out.  Take a look:
> 
> http://hpcn.ca.sandia.gov/~cstanak/index.php

I really like the new design better.

BTW, tangentially, who gets the bugzilla at openib.org mail?

-tduffy

From libor at topspin.com  Wed Apr 27 18:22:54 2005
From: libor at topspin.com (Libor Michalek)
Date: Wed, 27 Apr 2005 18:22:54 -0700
Subject: [openib-general] [PATCH][SDP] AIO buffer corruption
Message-ID: <20050427182254.A11429@topspin.com>


  Patch to fix the problem a few people reported as ttcp.aio.c 
aborting with an error (-104) on longer AIO runs.

  The bug is in the calculation of an AIO buffer's starting address. 
It could cause data to be written past the end of the 
AIO buffer, corrupting whatever happened to be there. In the case of
ttcp.aio.c this happened to be the iocb array, which, once corrupted,
would generate this error when passed to io_submit.
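
  A minimal userspace sketch of the address arithmetic, under the
assumption that the iocb records the start of the user buffer in
addr and the bytes already consumed in post (struct sketch_iocb and
both helpers are hypothetical; the real SDP iocb has more fields
and may track this differently):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the iocb fields involved. */
struct sketch_iocb {
	unsigned long addr;	/* assumed: start of the user buffer */
	size_t post;		/* assumed: bytes already consumed */
	size_t len;		/* bytes still outstanding */
};

/*
 * Build an iocb after "copied" bytes were already handled
 * synchronously, so iov_base has advanced by that amount.
 * With apply_fix set, the original buffer start is recovered by
 * subtracting "copied", as the patch does.
 */
static struct sketch_iocb sketch_iocb_init(unsigned long iov_base,
					   size_t size, size_t copied,
					   int apply_fix)
{
	struct sketch_iocb iocb;

	assert(copied <= size);
	iocb.post = copied;
	iocb.len = size - copied;
	iocb.addr = apply_fix ? iov_base - copied : iov_base;
	return iocb;
}

/* One past the last byte the request would touch. */
static unsigned long sketch_iocb_end(const struct sketch_iocb *iocb)
{
	return iocb->addr + iocb->post + iocb->len;
}
```

  Under these assumptions, the unfixed variant counts the already
copied bytes twice, so the request ends "copied" bytes beyond the
real end of the buffer; the fixed variant ends exactly at it.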

-Libor

Signed-off-by: Libor Michalek <libor at topspin.com>

Index: sdp_recv.c
===================================================================
--- sdp_recv.c	(revision 2220)
+++ sdp_recv.c	(working copy)
@@ -674,14 +674,16 @@
 #ifndef _SDP_DATA_PATH_NULL
 		memcpy((addr + offset), buff->data, copy);
 #endif
-    
+
 		buff->data += copy;
 		iocb->post += copy;
 		iocb->len  -= copy;
 
 		offset     += copy;
 		offset     &= (~PAGE_MASK);
-		
+
+		iocb->io_addr += copy;
+
 		sdp_kunmap(iocb->page_array[counter++]);
 	}
 	/*
@@ -1443,7 +1445,8 @@
 			iocb->size = size;
 			iocb->req  = req;
 			iocb->key  = req->ki_key;
-			iocb->addr = (unsigned long)msg->msg_iov->iov_base;
+			iocb->addr = ((unsigned long)msg->msg_iov->iov_base -
+				      copied);
 
 			req->ki_cancel = sdp_inet_read_cancel;
 
Index: sdp_send.c
===================================================================
--- sdp_send.c	(revision 2220)
+++ sdp_send.c	(working copy)
@@ -751,6 +751,7 @@
 		buff->tail      += copy;
 		iocb->post      += copy;
 		iocb->len       -= copy;
+		iocb->io_addr   += copy;
 
 		offset += copy;
 		offset &= (~PAGE_MASK);
@@ -2195,7 +2196,7 @@
 		iocb->size = size;
 		iocb->req  = req;
 		iocb->key  = req->ki_key;
-		iocb->addr = (unsigned long)msg->msg_iov->iov_base;
+		iocb->addr = (unsigned long)msg->msg_iov->iov_base - copied;
       
 		req->ki_cancel = sdp_inet_write_cancel;
 

From mplee at hpcn.ca.sandia.gov  Mon Apr 25 17:08:54 2005
From: mplee at hpcn.ca.sandia.gov (Michael Lee)
Date: Mon, 25 Apr 2005 17:08:54 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <1114649443.14721.6.camel@duffman>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
	<1114649443.14721.6.camel@duffman>
Message-ID: <1114474135.7180.14.camel@acheron.ca.sandia.gov>

> That seems better, although you can still go too small on the font and
> it will overlap the logo.  May be a problem on small screens like cell
> phones or pdas.

I noticed that shrinking the font had the same problem, but I didn't
give it much thought at the time since I had to reduce it six times for
the fonts to screw up and about eight times for the second line to cause
problems.  However, I never considered palm or cellphone users.  If it's
crucial to the community, I'll look into it further.  I only say "if"
because we may end up going w/ the other site design or something else.


> I really like the new design better.

I'm glad you like it...it'll make the designer happy.  I basically stole
the original site from someone else because I was fielding too many
complaints regarding the original Plone site.  I do like Notepad
though...

Another idea I'm not sure if the community would be interested in is if
we just completely converted the entire site to tiki-wiki.  I installed
a wiki after the fact, but it never seemed to quite mesh well w/ the
look and feel of either the existing site or the new one you looked at
earlier.  Tiki-wiki has a lot of portal-esque features I disabled when
it was installed because they didn't seem necessary.  But if you guys
think using tiki-wiki exclusively is the way to go, we're willing to
revisit it on our side.

> 
> BTW, tangentially, who gets the bugzilla at openib.org mail?
Bugzilla is aliased to myself and Matt.  I'd be happy to add different
people where feasible.

Michael


From tduffy at sun.com  Wed Apr 27 19:12:11 2005
From: tduffy at sun.com (Tom Duffy)
Date: Wed, 27 Apr 2005 19:12:11 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <1114474135.7180.14.camel@acheron.ca.sandia.gov>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
	<1114649443.14721.6.camel@duffman>
	<1114474135.7180.14.camel@acheron.ca.sandia.gov>
Message-ID: <1114654331.19213.3.camel@duffman>

On Mon, 2005-04-25 at 17:08 -0700, Michael Lee wrote:
> I noticed that shrinking the font had the same problem, but I didn't
> give it much thought at the time since I had to reduce it six times for
> the fonts to screw up and about eight times for the second line to cause
> problems.  However, I never considered palm or cellphone users.  If it's
> crucial to the community, I'll look into it further.  I only say "if"
> because we may end up going w/ the other site design or something else.

I don't know how many IB hackers are going to be visiting the site on
their cell phones, but whatever.

> Another idea I'm not sure if the community would be interested in is if
> we just completely converted the entire site to tiki-wiki.  I installed
> a wiki after the fact, but it never seemed to quite mesh well w/ the
> look and feel of either the existing site or the new one you looked at
> earlier.  Tiki-wiki has a lot of portal-esque features I disabled when
> it was installed because they didn't seem necessary.  But if you guys
> think using tiki-wiki exclusively is the way to go, we're willing to
> revisit it on our side.

I would like the FAQ to at least be a wiki.  The whole site might be
overkill.

> > 
> > BTW, tangentially, who gets the bugzilla at openib.org mail?
> Bugzilla is aliased to myself and Matt.  I'd be happy to add different
> people where feasible.

I propose that we make it go to openib-general until the volume gets too
large.

-tduffy

From iod00d at hp.com  Wed Apr 27 23:20:14 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 27 Apr 2005 23:20:14 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <1114478635.6749.26.camel@acheron.ca.sandia.gov>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com>
	<1114621995.2221.2.camel@duffman> <523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
Message-ID: <20050428062014.GM17101@esmail.cup.hp.com>

On Mon, Apr 25, 2005 at 06:23:55PM -0700, Michael Lee wrote:
> I removed the hard-coded positioning of the blue header bar...let me
> know if it's okay.

It looks ok.

> Also, if anyone's interested in looking it over and
> providing feedback, we had designed another OpenIB page that we were
> considering rolling out.  Take a look:
> 
> http://hpcn.ca.sandia.gov/~cstanak/index.php

I prefer the original. Having two Nav BARs (one horizontal and
a different one vertical) just makes it harder to find the
right things. Violates KISS.

thanks,
grant


From roland at topspin.com  Thu Apr 28 06:45:27 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 28 Apr 2005 06:45:27 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com> <1114621995.2221.2.camel@duffman>
	<523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
	<1114649443.14721.6.camel@duffman>
	<1114474135.7180.14.camel@acheron.ca.sandia.gov>
Message-ID: <52is27m2vc.fsf@topspin.com>

    Michael> Another idea I'm not sure if the community would be
    Michael> interested in is if we just completely converted the
    Michael> entire site to tiki-wiki.  I installed a wiki after the
    Michael> fact, but it never seemed to quite mesh well w/ the look
    Michael> and feel of either the existing site or the new one you
    Michael> looked at earlier.  Tiki-wiki has a lot of portal-esque
    Michael> features I disabled when it was installed because they
    Michael> didn't seem necessary.  But if you guys think using
    Michael> tiki-wiki exclusively is the way to go, we're willing to
    Michael> revisit it on our side.

I definitely like the idea of making the whole site a wiki, with the
main pages locked.  For example http://hula-project.org/ is running on
MediaWiki, and I think it looks really good.

I'm not sure that I would pick TikiWiki -- it's probably possible to
fix the theme but even the tikiwiki.org mothership looks pretty
cluttered and ugly to me.

I've been planning to try and kick-start the wiki part of openib.org
by writing some content but unfortunately I've never gotten around to it.

 - R.


From paul.baxter at dsl.pipex.com  Thu Apr 28 07:04:00 2005
From: paul.baxter at dsl.pipex.com (Paul Baxter)
Date: Thu, 28 Apr 2005 15:04:00 +0100
Subject: [openib-general] rendering openib.org on Firefox/Linux
References: <9d3b7de7050426200569e83b68@mail.gmail.com><Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com><426FBD50.5070704@linuxmachines.com>
	<1114621995.2221.2.camel@duffman><523btcp1wd.fsf@topspin.com><1114478635.6749.26.camel@acheron.ca.sandia.gov><1114649443.14721.6.camel@duffman><1114474135.7180.14.camel@acheron.ca.sandia.gov>
	<52is27m2vc.fsf@topspin.com>
Message-ID: <004201c54bfb$218891c0$8000000a@blorp>

> I'm not sure that I would pick TikiWiki -- it's probably possible to
> fix the theme but even the tikiwiki.org mothership looks pretty
> cluttered and ugly to me.
>
> I've been planning to try and kick-start the wiki part of openib.org
> by writing some content but unfortunately I've never gotten around to it.

Great idea.

What is the state of documentation, and is there any chance that some of the
docs at the sourceforge site can be 'borrowed with their blessing', updated
and modified for openib purposes as a set of backgrounders on IB usage?


From ftillier at infiniconsys.com  Thu Apr 28 07:59:43 2005
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Thu, 28 Apr 2005 07:59:43 -0700
Subject: [openib-general] FMR and how they work
Message-ID: <000501c54c02$ea900600$fede1142@infiniconsys.com>

When you deregister a FMR, what information from that FMR can the HCA cache
- just the MTTs or also the MPTs?  Can an incoming RDMA transfer access the
pages previously referenced by that FMR?  If so, for how long?  When is it
safe to unpin the pages?

What about when you modify the FMR?  How long are previous mappings "at
risk" of transfers?

I'm kinda confused because the Mellanox docs indicate that the HCA can cache
stuff and that the pages can't be unpinned until the cache is flushed.  If
that's true, I don't see how FMRs can be used because they could be subject
to malicious use by a remote host.

Help?

- Fab


From halr at voltaire.com  Thu Apr 28 08:31:21 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Apr 2005 11:31:21 -0400
Subject: [openib-general] Extending umad for RMPP support
Message-ID: <1114702281.1764.703.camel@localhost.localdomain>

Hi Roland,

I am looking at adding RMPP support into umad (so OpenSM can support SA
GetTableResp for real).

Currently, the 256 byte MAD is the first field in the ib_user_mad
structure. In order to support variable length MADs (for RMPP), I would
propose moving this to the end of the struct and supporting a variable
size. (Obviously, there are a number of other associated changes with
doing this in the code). Just wanted to double check this with you to
make sure this approach is acceptable before investing time in doing it.

Of course, the ABI version will be bumped and rmpp_version will be added
as a field to the ib_user_mad_reg_req structure.
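
A minimal sketch of what such a layout could look like. The struct and
field names below are placeholders for illustration only, not the
actual ib_user_mad definition; the point is just that the
variable-length MAD goes last, as a flexible array member, with a
length field telling the reader how much data follows.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size part; the real fields would differ. */
struct demo_user_mad_hdr {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t length;	/* number of bytes in data[] */
};

/* Variable-length MAD payload placed at the end of the struct. */
struct demo_user_mad {
	struct demo_user_mad_hdr hdr;
	uint8_t data[];
};

/* Allocate a zeroed MAD with room for data_len payload bytes. */
static struct demo_user_mad *demo_user_mad_alloc(size_t data_len)
{
	struct demo_user_mad *mad = calloc(1, sizeof(*mad) + data_len);

	if (mad)
		mad->hdr.length = (uint32_t)data_len;
	return mad;
}
```

With the data last, a 256 byte MAD and a multi-segment RMPP transfer
share the same header layout and differ only in hdr.length.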

Thanks.

-- Hal


From roland at topspin.com  Thu Apr 28 08:45:45 2005
From: roland at topspin.com (Roland Dreier)
Date: Thu, 28 Apr 2005 08:45:45 -0700
Subject: [openib-general] Extending umad for RMPP support
In-Reply-To: <1114702281.1764.703.camel@localhost.localdomain> (Hal
	Rosenstock's message of "28 Apr 2005 11:31:21 -0400")
References: <1114702281.1764.703.camel@localhost.localdomain>
Message-ID: <52d5senbva.fsf@topspin.com>

    Hal> Currently, the 256 byte MAD is the first field in the
    Hal> ib_user_mad structure. In order to support variable length
    Hal> MADs (for RMPP), I would propose moving this to the end of
    Hal> the struct and supporting a variable size. (Obviously, there
    Hal> are a number of other associated changes with doing this in
    Hal> the code). Just wanted to double check this with you to make
    Hal> sure this approach is acceptable before investing time in
    Hal> doing it.

Yes, that seems to be the only way to do it.

 - R.


From Diego at Mellanox.com  Thu Apr 28 09:30:51 2005
From: Diego at Mellanox.com (Diego Crupnicoff)
Date: Thu, 28 Apr 2005 09:30:51 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA04D57363@MTIEX01>


> The userspace library should be able to track the tree and
> the overlaps, etc.  Things might become interesting when the 
> memory is MAP_SHARED pagecache and multiple independent 
> processes are involved, although I guess that'd work OK.
> 
> But afaict the problem wherein part of a page needs
> VM_DONTCOPY and the other part does not cannot be solved.

Not sure it was such a good idea, but at some point we thought of forcing
the copy of (***only***) such pages at fork time (leaving of course the
original one for the parent). This eliminates the COW that would have messed
the parent's mapping, and still allows the child process to access the
"un-registered" portions of the page.

BTW: We did try to "motivate" applications to do whole page registrations
only, so as to avoid this issue altogether. But that did not work. Some (hard
to ignore) applications want byte granularity.
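
For reference, rounding a byte-granular region out to whole pages
amounts to something like the sketch below. It is an illustration only;
the demo_* names and the explicit page_size parameter are assumptions
made for the example. Note that it does not solve the shared-page
problem: the unregistered bytes on the first and last page still end up
inside the rounded region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Page-aligned region covering a byte-granular request. */
struct demo_span {
	uintptr_t start;
	size_t len;
};

/* Round [addr, addr + len) outward to page boundaries. */
static struct demo_span demo_page_span(uintptr_t addr, size_t len,
				       size_t page_size)
{
	struct demo_span span;
	uintptr_t mask = (uintptr_t)page_size - 1;
	uintptr_t end;

	/* page_size must be a non-zero power of two for the masking to work */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	end = (addr + len + mask) & ~mask;	/* round end up */
	span.start = addr & ~mask;		/* round start down */
	span.len = (size_t)(end - span.start);
	return span;
}
```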



From iod00d at hp.com  Thu Apr 28 09:38:02 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 28 Apr 2005 09:38:02 -0700
Subject: [openib-general] rendering openib.org on Firefox/Linux
In-Reply-To: <52is27m2vc.fsf@topspin.com>
References: <9d3b7de7050426200569e83b68@mail.gmail.com>
	<Pine.LNX.4.61.0504271205500.5321@jlentini-linux.nane.netapp.com>
	<426FBD50.5070704@linuxmachines.com>
	<1114621995.2221.2.camel@duffman> <523btcp1wd.fsf@topspin.com>
	<1114478635.6749.26.camel@acheron.ca.sandia.gov>
	<1114649443.14721.6.camel@duffman>
	<1114474135.7180.14.camel@acheron.ca.sandia.gov>
	<52is27m2vc.fsf@topspin.com>
Message-ID: <20050428163802.GC20957@esmail.cup.hp.com>

On Thu, Apr 28, 2005 at 06:45:27AM -0700, Roland Dreier wrote:
> I definitely like the idea of making the whole site a wiki, with the
> main pages locked.  For example http://hula-project.org/ is running on
> MediaWiki, and I think it looks really good.
> 
> I'm not sure that I would pick TikiWiki -- it's probably possible to
> fix the theme but even the tikiwiki.org mothership looks pretty
> cluttered and ugly to me.

parisc-linux.org started using WikiWikiWeb 4-6 months ago:
	http://wiki.parisc-linux.org/

But we've had to authenticate and authorize users because of spammers
abusing the open access. Yes - spam happens on wiki pages too.
Because of this, I consider Wiki a variation on CVS/Subversion
that targets webpages instead of source code. Another sort
of internet based groupware if you will.

If someone wants to poke at WikiWikiWeb on wiki.p-l.o, I can
arrange access. Please contact me off-list.

> I've been planning to try and kick-start the wiki part of openib.org
> by writing some content but unfortunately I've never gotten around to it.

Well, that's probably a good thing. Let other folks worry about
the web page mechanics for now until the rate of code change goes down.
Whatever the web maniacs running parisc-linux.org do is fine with
me as long as I can easily modify and test web content and don't
have to learn a new system every 6 months.

grant


From mheffner at californiadigital.com  Thu Apr 28 11:18:47 2005
From: mheffner at californiadigital.com (Mike Heffner)
Date: Thu, 28 Apr 2005 14:18:47 -0400
Subject: [openib-general] Example of VAPI_ATOMIC_FETCH_AND_ADD
Message-ID: <42712907.6050701@californiadigital.com>


Does anyone have an example of how to correctly post a 
VAPI_ATOMIC_FETCH_AND_ADD send request? Specifically: how are the fields 
of the VAPI_sr_desc_t supposed to be filled out and what memory portions
must be wired.


Thanks,

Mike

-- 

   Mike Heffner <mheffner at californiadigital.com>
   California Digital Corporation
   Blacksburg, VA USA

   Voice: (540) 443-3500 #603


From halr at voltaire.com  Thu Apr 28 12:55:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Apr 2005 15:55:44 -0400
Subject: [openib-general] User MAD registration with RMPP
Message-ID: <1114718144.4477.11.camel@localhost.localdomain>

Hi,

One more issue in terms of supporting RMPP to user space:

Should the user registration check that RMPP is allowed for that
management class (being registered) ? (It seems to me that allowing the
user to enable RMPP on a management class which does not support RMPP
might be dangerous).

If so, the list would be SA (class 3) and vendor range 2 (0x30-0x4f).
Correct ?
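
A sketch of the check being asked about, for illustration only. The
function and macro names are hypothetical; the class values are the
ones listed above (SA class 3 and vendor range 2, 0x30-0x4f).

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CLASS_SUBN_ADM		0x03
#define DEMO_CLASS_VENDOR2_START	0x30
#define DEMO_CLASS_VENDOR2_END		0x4f

/* Return true if the management class is one that has RMPP defined. */
static bool demo_class_allows_rmpp(uint8_t mgmt_class)
{
	if (mgmt_class == DEMO_CLASS_SUBN_ADM)
		return true;
	return mgmt_class >= DEMO_CLASS_VENDOR2_START &&
	       mgmt_class <= DEMO_CLASS_VENDOR2_END;
}
```

Registration would then reject a request that asks for RMPP on any
class for which this returns false.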

Thanks.

-- Hal


From itamar at mellanox.co.il  Thu Apr 28 13:29:14 2005
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Thu, 28 Apr 2005 23:29:14 +0300
Subject: [openib-general] [PATCH][DAPL] Fix sparse warnings on dapl builds
Message-ID: <91DB792C7985D411BEC300B40080D29CC359F7@mtvex01.mtv.mtl.com>

Thanks for the patch.
Did you run kdapltest or only compile the code?
kdapltest -T Q and kdapltest -T T with 1 ep should work now.
If you need help with the test, let me know.

Some general notes:

1)
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c
> ===================================================================
> --- 
   
    <<  snip >>

> @@ -799,22 +799,6 @@ DAT_RETURN dapl_ib_accept_connection(DAT
>  	return DAT_SUCCESS;
>  }
>  
> -DAT_RETURN dapl_ib_comm_established(DAPL_EP * ep_ptr)
> -{
> -	int status;
> -	DAT_RETURN dat_status = DAT_SUCCESS;
> -
> -	status = ib_send_cm_rtu(ep_ptr->cm_handle, NULL, 0);
> -	if (status) {
> -		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
> -			     " dapl_ib_comm_established: 
> ib_send_cm_rtu failed: %d cm_handle: %x\n",
> -			     status, ep_ptr->cm_handle);
> -		return DAT_ERROR(DAT_INSUFFICIENT_RESOURCES, 0);
> -	}
> -
> -	return dat_status;
> -}
> -
>  /*
>   * ib_cm_get_remote_gid 
>   */

This function is currently not in use, but I think we will need it if we
want to support merging dto evds with communication evds.

2)
 
> Index: gen2/users/jlentini/linux-kernel/dat/consumer.c
> ===================================================================
> -DAT_RETURN dat_evd_kquery(DAT_EVD_HANDLE evd_handle,
> -			  DAT_EVD_PARAM_MASK evd_param_mask,
> -			  DAT_EVD_PARAM * evd_param)
> -{
> -	if (evd_handle == NULL) {
> -		return DAT_ERROR(DAT_INVALID_HANDLE,
> -				 DAT_INVALID_HANDLE_EVD_REQUEST);
> -	}
> -	return DAT_EVD_QUERY(evd_handle, evd_param_mask, evd_param);
> -}
> -
Why did you delete this function?
Maybe there is a bug with this function, but this is a real function in the
API.
I think that the bug is a mix-up between dat_evd_kquery and dat_evd_query.
I will look into it later.


 -Itamar


From tduffy at sun.com  Thu Apr 28 13:15:04 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 28 Apr 2005 13:15:04 -0700
Subject: [openib-general] [PATCH][DAPL] Fix sparse warnings on dapl builds
In-Reply-To: <91DB792C7985D411BEC300B40080D29CC359F7@mtvex01.mtv.mtl.com>
References: <91DB792C7985D411BEC300B40080D29CC359F7@mtvex01.mtv.mtl.com>
Message-ID: <1114719304.5658.7.camel@duffman>

On Thu, 2005-04-28 at 23:29 +0300, Itamar Rabenstein wrote:
> Thanks for the patch.
> Did you run kdapltest or only compile the code?
> kdapltest -T Q and kdapltest -T T with 1 ep should work now.
> If you need help with the test, let me know.

I just compiled the code and loaded the modules into my kernel.  Where
can I find said test?

> This function is currently not in use, but I think we will need it if we
> want to support merging dto evds with communication evds.

Ok, merge it in at that point.  For now, it is not used, leave it out.

> Why did you delete this function?
> Maybe there is a bug with this function, but this is a real function in the
> API.
> I think that the bug is a mix-up between dat_evd_kquery and dat_evd_query.
> I will look into it later.

I deleted it because it is not used.

-tduffy

From mshefty at ichips.intel.com  Thu Apr 28 13:18:15 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Apr 2005 13:18:15 -0700
Subject: [openib-general] Re: User MAD registration with RMPP
In-Reply-To: <1114718144.4477.11.camel@localhost.localdomain>
References: <1114718144.4477.11.camel@localhost.localdomain>
Message-ID: <42714507.3010807@ichips.intel.com>

Hal Rosenstock wrote:

> Should the user registration check that RMPP is allowed for that
> management class (being registered) ? (It seems to me that allowing the
> user to enable RMPP on a management class which does not support RMPP
> might be dangerous).
> 
> If so, the list would be SA (class 3) and vendor range 2 (0x30-0x4f).
> Correct ?

I'm not sure that letting a user take over a management class that 
doesn't usually support RMPP would necessarily be dangerous; it just 
may not work for the user.  For the most part, the RMPP relies on the 
registration request to determine when to invoke RMPP.

The exception to this is that the code needs to calculate the offset of 
the user data separate from the common MAD headers.  This is done in 
the data_offset() function in mad_rmpp.c.
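
A rough sketch of that kind of calculation, for illustration only. It
is not the code in mad_rmpp.c; the demo_* names are hypothetical, and
the header sizes (24-byte common MAD header, 12-byte RMPP header,
20-byte SA header, 4-byte vendor OUI header) are assumed from the IBA
MAD formats rather than taken from this thread.

```c
#include <stdint.h>

enum {
	DEMO_MAD_HDR  = 24,	/* common MAD header */
	DEMO_RMPP_HDR = 12,	/* RMPP header */
	DEMO_SA_HDR   = 20,	/* SA class-specific header */
	DEMO_OUI_HDR  = 4,	/* vendor range 2 reserved byte + OUI */
};

/* Offset of the user data from the start of the MAD, by class. */
static int demo_data_offset(uint8_t mgmt_class)
{
	if (mgmt_class == 0x03)				/* SA */
		return DEMO_MAD_HDR + DEMO_RMPP_HDR + DEMO_SA_HDR;
	if (mgmt_class >= 0x30 && mgmt_class <= 0x4f)	/* vendor range 2 */
		return DEMO_MAD_HDR + DEMO_RMPP_HDR + DEMO_OUI_HDR;
	/* No class-specific header known: data follows the RMPP header. */
	return DEMO_MAD_HDR + DEMO_RMPP_HDR;
}
```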

The classes that you mention above are the only classes that I know of 
that have RMPP defined.

- Sean


From itamar at mellanox.co.il  Thu Apr 28 13:48:20 2005
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Thu, 28 Apr 2005 23:48:20 +0300
Subject: [openib-general] [PATCH][DAPL] Fix sparse warnings on dapl builds
Message-ID: <91DB792C7985D411BEC300B40080D29CC359F8@mtvex01.mtv.mtl.com>

 
> I just compiled the code and loaded the modules into my kernel.  Where
> can I find said test?
> 

Look at the README file located at
https://openib.org/svn/gen2/users/jlentini/linux-kernel
There are instructions on how to build the test.
A simple test is (for port 1):
server : kdapltest -T S -D mthca0a -d
client : kdapltest -T Q -s #.#.#.# -D mthca0a -d
You should fill in the IPoIB IP address of the server port.

> Ok, merge it in at that point.  For now, it is not used, leave it out.
> 

O.K.

> 
> I deleted it because it is not used.
> 
As I said, it is a bug; we need this function.
I will send a patch soon.


  -Itamar 


From halr at voltaire.com  Thu Apr 28 13:27:01 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Apr 2005 16:27:01 -0400
Subject: [openib-general] Re: User MAD registration with RMPP
In-Reply-To: <42714507.3010807@ichips.intel.com>
References: <1114718144.4477.11.camel@localhost.localdomain>
	<42714507.3010807@ichips.intel.com>
Message-ID: <1114720020.4477.25.camel@localhost.localdomain>

On Thu, 2005-04-28 at 16:18, Sean Hefty wrote:
> Hal Rosenstock wrote:
> 
> > Should the user registration check that RMPP is allowed for that
> > management class (being registered) ? (It seems to me that allowing the
> > user to enable RMPP on a management class which does not support RMPP
> > might be dangerous).
> > 
> > If so, the list would be SA (class 3) and vendor range 2 (0x30-0x4f).
> > Correct ?
> 
> I'm not sure that letting a user take over a management class that 
> doesn't usually support RMPP would necessarily be dangerous; it just 
> may not work for the user.  

I'm not sure either but...

There's a difference between doesn't usually support it and isn't
supposed to support it.

> For the most part, the RMPP relies on the 
> registration request to determine when to invoke RMPP.

> The exception to this is that the code needs to calculate the offset of 
> the user data separate from the common MAD headers.  This is done in 
> the data_offset() function in mad_rmpp.c.

So do you think nothing bad would happen (other than perhaps to that
user) ? I was concerned about both transmit and receive and whether it
is just better to protect against this. There's a downside in that if
additional classes support RMPP then the code needs updating...

Is it worth checking for this ?

> The classes that you mention above are the only classes that I know of 
> that have RMPP defined.

Thanks.

-- Hal

> - Sean


From tduffy at sun.com  Thu Apr 28 13:35:42 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 28 Apr 2005 13:35:42 -0700
Subject: [openib-general] [DAPL] error trying to build kdapltest
Message-ID: <1114720542.5658.8.camel@duffman>

/build1/tduffy/openib-work/gen2/users/jlentini/linux-kernel/test/dapltest/kdapl/../test/dapl_performance_stats.c: In function ‘DT_performance_stats_data_print’:
/build1/tduffy/openib-work/gen2/users/jlentini/linux-kernel/test/dapltest/kdapl/../test/dapl_performance_stats.c:202: error: SSE register return with SSE disabled

From itamar at mellanox.co.il  Thu Apr 28 14:06:13 2005
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Fri, 29 Apr 2005 00:06:13 +0300
Subject: [openib-general] [DAPL] error trying to build kdapltest
Message-ID: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>

> /build1/tduffy/openib-work/gen2/users/jlentini/linux-kernel/test/dapltest/kdapl/../test/dapl_performance_stats.c: In function ‘DT_performance_stats_data_print’:
> /build1/tduffy/openib-work/gen2/users/jlentini/linux-kernel/test/dapltest/kdapl/../test/dapl_performance_stats.c:202: error: SSE register return with SSE disabled


I think the problem is related to the use of double in the kernel.
On the i386 arch we need to add this to the makefile:
ifeq (${IS_i686},i686)
# Override -msoft-float in arch/i386/Makefile
EXTRA_CFLAGS += -mhard-float
endif 

I am not sure that you have a flag like this.
I am now working on a new version of kdapltest without any use of doubles
in the kernel.
I think it will be ready early next week.
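
One way to avoid doubles is scaled integer arithmetic. The sketch below
is only an illustration of that idea, not the actual kdapltest change;
the function names are hypothetical. In kernel code on a 32-bit arch the
64-bit division would need a helper such as do_div() rather than a
plain '/'.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Transfer rate in hundredths of a (decimal) MB/s.
 * Bytes per microsecond is the same as 10^6 bytes per second,
 * so only the factor of 100 for two decimal places is needed.
 */
static uint64_t demo_rate_centi_mb_per_sec(uint64_t bytes, uint64_t usec)
{
	if (usec == 0)
		return 0;
	return (bytes * 100) / usec;
}

/* Print the scaled value with two decimal places, integers only. */
static int demo_format_rate(char *buf, size_t len, uint64_t centi)
{
	return snprintf(buf, len, "%" PRIu64 ".%02" PRIu64 " MB/s",
			centi / 100, centi % 100);
}
```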

  Itamar
 

From mshefty at ichips.intel.com  Thu Apr 28 14:02:12 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Apr 2005 14:02:12 -0700
Subject: [openib-general] Re: User MAD registration with RMPP
In-Reply-To: <1114720020.4477.25.camel@localhost.localdomain>
References: <1114718144.4477.11.camel@localhost.localdomain>	
	<42714507.3010807@ichips.intel.com>
	<1114720020.4477.25.camel@localhost.localdomain>
Message-ID: <42714F54.4040905@ichips.intel.com>

Hal Rosenstock wrote:
>>>Should the user registration check that RMPP is allowed for that
>>>management class (being registered) ? (It seems to me that allowing the
>>>user to enable RMPP on a management class which does not support RMPP
>>>might be dangerous).
>>>
>>>If so, the list would be SA (class 3) and vendor range 2 (0x30-0x4f).
>>>Correct ?
>>
>>I'm not sure that letting a user take over a management class that 
>>doesn't usually support RMPP would necessarily be dangerous; it just 
>>may not work for the user.  
> 
> I'm not sure either but...
> 
> There's a difference between doesn't usually support it and isn't
> supposed to support it.

Good point.  I think that it makes sense to check against this for
classes that aren't supposed to support it, and to make sure that the
ones that do support it ask for it.


> So do you think nothing bad would happen (other than perhaps to that
> user) ? I was concerned about both transmit and receive and whether it
> is just better to protect against this. There's a downside in that if
> additional classes support RMPP then the code needs updating...

I thought about the receive side as well, but figured that receivers 
should handle mis-formatted MADs.  So, I _think_ that the user would 
just fail to communicate with anyone...

One consideration is that for the two class sets (SA and vendor2) that 
support RMPP, both have defined extra header information that must be 
duplicated in each MAD.  If another RMPP class comes along and needs to 
add additional header information, it will result in code changes anyway...

One possible solution around this would be for clients to specify the 
size of the common header when registering.  I didn't think it was 
worth changing the API for this though.

- Sean


From robert.j.woodruff at intel.com  Thu Apr 28 14:33:02 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 28 Apr 2005 14:33:02 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
Message-ID: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>

 
Hi Libor, 

I was trying to backport SDP to a 2.6.9 kernel and ran into
a compile error when compiling  sdp_conn.c and sdp_inet.c 
They seem to have a reference to a
value of ECANCELLED.

I did a search and found it was defined in asm-parisc/errno.h
but nowhere else.

#define ECANCELLED 253

Is that the value you intended in these calls, and if so, can I just
define it somewhere like sdp_main.h for my backport version?
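
For illustration, a backport shim could look like the sketch below. It
assumes that the generic ECANCELED (one 'L') is an acceptable stand-in
where the two-'L' spelling is missing, rather than the parisc value of
253; whether that is the intended value is exactly the open question
above. In kernel code the include would be <linux/errno.h>.

```c
#include <errno.h>

/*
 * Backport shim (assumption): map the two-'L' spelling onto ECANCELED
 * when the architecture's errno.h does not already provide it.
 */
#ifndef ECANCELLED
#define ECANCELLED ECANCELED
#endif
```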

woody


From mshefty at ichips.intel.com  Thu Apr 28 15:32:51 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Apr 2005 15:32:51 -0700
Subject: [openib-general] FMR and how they work
In-Reply-To: <000501c54c02$ea900600$fede1142@infiniconsys.com>
References: <000501c54c02$ea900600$fede1142@infiniconsys.com>
Message-ID: <42716493.8040308@ichips.intel.com>

Fab Tillier wrote:

> When you deregister a FMR, what information from that FMR can the HCA cache
> - just the MTTs or also the MPTs?  Can an incoming RDMA transfer access the
> pages previously referenced by that FMR?  If so, for how long?  When is it
> safe to unpin the pages?
> 
> What about when you modify the FMR?  How long are previous mappings "at
> risk" of transfers?
> 
> I'm kinda confused because the Mellanox docs indicate that the HCA can cache
> stuff and that the pages can't be unpinned until the cache is flushed.  If
> that's true, I don't see how FMRs can be used because they could be subject
> to malicious use by a remote host.

Based on my understanding of Mellanox FMRs, I thought that this was true.
The buffers are still available to the remote host even after 
deregistration occurs.  Of course my understanding could be wrong...

- Sean


From tduffy at sun.com  Thu Apr 28 15:31:23 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 28 Apr 2005 15:31:23 -0700
Subject: [openib-general] [DAPL] ran kdapl test, got slab corruption
In-Reply-To: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
References: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
Message-ID: <1114727483.25364.7.camel@duffman>

On Fri, 2005-04-29 at 00:06 +0300, Itamar Rabenstein wrote:
> I think the problem is related to the use of double in the kernel.
> On the i386 arch we need to add this to the makefile:
> ifeq (${IS_i686},i686)
> # Override -msoft-float in arch/i386/Makefile
> EXTRA_CFLAGS += -mhard-float
> endif 
> 
> I am not sure that you have a flag like this.
> I am now working on a new version of kdapltest without any use of doubles
> in the kernel.
> I think it will be ready early next week.

If I add:

EXTRA_CFLAGS += -msse

I can compile on x86_64.

Now that I run the test, I get the following:

[root at flopteron2 ~]# ./kdapltest -T S -D mthca0a -d
Server_Cmd.debug:       1
Server_Cmd.dapl_name: mthca0a
DT_cs_Server: IA mthca0a opened
DT_cs_Server: PZ created
DT_cs_Server: EP created
DT_cs_Server: PSP created
*****  DAPL  Characteristics  *****
Provider: mthca0a  Version 1.0  DAPL 1.2
Adapter: Generic InfiniBand HCA by DAPL Reference Implementation Version
0.0
Supporting:
        64512 EPs with 65535 DTOs and 0 RDMA/RDs each
        65408 EVDs of up to 65535 entries  (default S/R size is 16/16)
        IOVs of up to 59 elements
        131056 LMRs (and 131056 RMRs) of up to 0xffffffffffffffff bytes
        Maximum MTU 0x80000000 bytes, RDMA 0x80000000 bytes
        Maximum Private data size 92 bytes
***** ***** ***** ***** ***** *****
DT_cs_Server: Posting 2 recvs
Dapltest: Service Point Ready - mthca0a
DT_cs_Server: Waiting for Connection Request
DT_cs_Server: Accepting Connection Request
DT_cs_Server: Awaiting connection ...
DT_cs_Server: Connected!
DAT_STATE: DAT_EP_STATE_CONNECTED
DAT_STATE: Inbound DTO Status: Active
DAT_STATE: Outbound DTO Status: Idle
DT_cs_Server: Waiting for Client_Info
DT_cs_Server: Got Client_Info
DT_cs_Server: Waiting for Client_Cmd_Info
DT_cs_Server: Send Server_Info
Client Requests Server to Quit
DT_cs_Server: Waiting for clients to all go away...
DT_cs_Server: Cleaning up ...
DT_cs_Server: IA mthca0a closed
DT_cs_Server (mthca0a):  Exiting.
TEST INSTANCE 1

[root at sins-stinger-10 ~]# ./kdapltest -T Q -s 192.168.0.26 -D mthca0a -d
Server Name: 192.168.0.26
Server Net Address: 192.168.0.26
DT_cs_Client: Starting Test ...
DT_cs_Client: IA mthca0a opened
DT_cs_Client: EP created
*****  DAPL  Characteristics  *****
Provider: mthca0a  Version 1.0  DAPL 1.2
Adapter: Generic InfiniBand HCA by DAPL Reference Implementation Version
0.0
Supporting:
        64512 EPs with 65535 DTOs and 0 RDMA/RDs each
        65408 EVDs of up to 65535 entries  (default S/R size is 16/16)
        IOVs of up to 28 elements
        131056 LMRs (and 131056 RMRs) of up to 0xffffffffffffffff bytes
        Maximum MTU 0x80000000 bytes, RDMA 0x80000000 bytes
        Maximum Private data size 92 bytes
***** ***** ***** ***** ***** *****
DT_cs_Client: Posting 1 recv buffer
DT_cs_Client: Connect Endpoint
DT_cs_Client: Await connection ...
DT_cs_Client: Connected!
DAT_STATE: DAT_EP_STATE_CONNECTED
DAT_STATE: Inbound DTO Status: Active
DAT_STATE: Outbound DTO Status: Idle
DT_cs_Client: Sending Client_Info
DT_cs_Client: Sent Client_Info - awaiting completion
DT_cs_Client: Sending Command
DT_cs_Client: Sent Command - awaiting completion
DT_cs_Client: Waiting for Server_Info
DT_cs_Client: Server_Info Received
DT_cs_Client: Version OK!
-------------------------------------
Server_Info.dapltest_version   : 6
Server_Info.is_little_endian   : 1
-------------------------------------
Client_Info.dapltest_version   : 6
Client_Info.is_little_endian   : 1
Client_Info.test_type          : 4
Quit_Cmd.server_name: 192.168.0.26
Quit_Cmd.device_name: mthca0a
DT_cs_Client: Cleaning Up ...
DT_cs_Client: IA mthca0a closed
DT_cs_Client: ========== End of Work -- Client Exiting
TEST INSTANCE 1

Unfortunately, I get this error in dmesg:

dapl_ib_disconnect_clean: ep_ptr 0xffff81003f085320 has invalid CM handle

Also, this is bad: slab corruption

Slab corruption: start=ffff810077455eb8, len=312
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff882c8252>](req_comp_work+0x42/0x90 [ib_at])
050: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00
120: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00
Prev obj: start=ffff810077455d68, len=312
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<0000000000000000>](0x0)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=ffff81003aecbdb0, len=288
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<ffffffff882c8264>](req_comp_work+0x54/0x90 [ib_at])
040: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b
110: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b a5
Prev obj: start=ffff81003aecbc78, len=288
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<0000000000000000>](0x0)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b


-tduffy
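
For readers decoding the dump above: Linux slab debugging fills a freed
object with the poison byte 0x6b and ends it with 0xa5, so the rows
containing 00 bytes mark places that were written after the object was
freed (offsets 0x58 and 0x128 in the first object). What follows is a
minimal user-space sketch of that kind of check, not the kernel's own
code; the function names are hypothetical.

```c
#include <stddef.h>

#define POISON_FREE 0x6b /* fill byte for a freed slab object */
#define POISON_END  0xa5 /* final byte of a poisoned object */

/* Fill a buffer the way a freed, poisoned object would look. */
static void fill_poison(unsigned char *obj, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        obj[i] = (i == len - 1) ? POISON_END : POISON_FREE;
}

/* Return the offset of the first byte that no longer matches the
 * poison pattern, or -1 if the object is untouched. */
static long first_poison_mismatch(const unsigned char *obj, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char expected = (i == len - 1) ? POISON_END : POISON_FREE;

        if (obj[i] != expected)
            return (long)i;
    }
    return -1;
}
```

A mismatch at a given offset only says that something wrote there after
the free; the "Last user" line in the dump is the hint for where to
start looking.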

From tduffy at sun.com  Thu Apr 28 15:38:31 2005
From: tduffy at sun.com (Tom Duffy)
Date: Thu, 28 Apr 2005 15:38:31 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
Message-ID: <1114727911.25364.10.camel@duffman>

On Thu, 2005-04-28 at 14:33 -0700, Woodruff, Robert J wrote:
>  Hi Libor, 
> 
> I was trying to backport SDP to a 2.6.9 kernel and ran into
> a compile error when compiling  sdp_conn.c and sdp_inet.c 
> They seem to have a reference to a
> value of ECANCELLED.
> 
> I did a search and found it was defined in asm-parisc/errno.h
> but nowhere else.
> 
> #define ECANCELLED 253

You misspelled it, SDP uses ECANCELED.  Which seems to be 125 in
asm-generic/errno.h

-tduffy

From iod00d at hp.com  Thu Apr 28 15:47:21 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 28 Apr 2005 15:47:21 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
Message-ID: <20050428224721.GL20957@esmail.cup.hp.com>

On Thu, Apr 28, 2005 at 02:33:02PM -0700, Woodruff, Robert J wrote:
>  
> Hi Libor, 
> 
> I was trying to backport SDP to a 2.6.9 kernel and ran into
> a compile error when compiling  sdp_conn.c and sdp_inet.c 
> They seem to have a reference to a
> value of ECANCELLED.
> 
> I did a search and found it was defined in asm-parisc/errno.h
> but nowhere else.
> 
> #define ECANCELLED 253

That's funny. parisc is the only one that was (sortof) wrong.

I just added an alias to parisc-linux so ECANCELLED is an alias
for ECANCELED:
	http://lists.parisc-linux.org/pipermail/parisc-linux-cvs/2005-April/035557.html

IIRC, parisc-linux defined ECANCELLED because that's what HPUX used.
And according to SuSv3, is an acceptable definition of the same error.
But Linux uses ECANCELED (with one "el").

In my spare time, I've been trying to build openib.org code on
parisc-linux. Last time I tried, I was running into the same
issue that Tom Duffy already pointed out about one of the 
new vm interfaces not being supported. I have to fix that
bit in parisc-linux port as well before I can move forward.

> Is that the value you intended in these calls and if so, can I just
> define it in somewhere like sdp_main.h for my backport version ?

Please do not. Just fix the code to use ECANCELED since that's
what SuSv3 specifies and linux prefers:
include/asm-generic/errno.h:#define     ECANCELED       125     /* Operation Canceled */

grant


From robert.j.woodruff at intel.com  Thu Apr 28 15:42:55 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 28 Apr 2005 15:42:55 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <1114727911.25364.10.camel@duffman>
Message-ID: <ORSMSX408FRaqbC8wSA00000018@orsmsx408.amr.corp.intel.com>

Tom Duffy wrote, 
>You misspelled it, SDP uses ECANCELED.  Which seems to be 125 in
>asm-generic/errno.h

>-tduffy

Yes. thanks, I found it after I sent the email. 


From libor at topspin.com  Thu Apr 28 16:33:05 2005
From: libor at topspin.com (Libor Michalek)
Date: Thu, 28 Apr 2005 16:33:05 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <20050428224721.GL20957@esmail.cup.hp.com>;
	from iod00d@hp.com on Thu, Apr 28, 2005 at 03:47:21PM -0700
References: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
	<20050428224721.GL20957@esmail.cup.hp.com>
Message-ID: <20050428163305.A12393@topspin.com>

On Thu, Apr 28, 2005 at 03:47:21PM -0700, Grant Grundler wrote:
> On Thu, Apr 28, 2005 at 02:33:02PM -0700, Woodruff, Robert J wrote:
> >  
> > Hi Libor, 
> > 
> > I was trying to backport SDP to a 2.6.9 kernel and ran into
> > a compile error when compiling  sdp_conn.c and sdp_inet.c 
> > They seem to have a reference to a
> > value of ECANCELLED.
> > 
> > I did a search and found it was defined in asm-parisc/errno.h
> > but nowhere else.
> > 
> > #define ECANCELLED 253
> 
> > Is that the value you intended in these calls and if so, can I just
> > define it in somewhere like sdp_main.h for my backport version ?
> 
> Please do not. Just fix the code to use ECANCELED since that's
> what SuSv3 specifies and linux prefers:
> include/asm-generic/errno.h:#define     ECANCELED       125

  In 2.6.9 the constant ECANCELED does not exist; it was only introduced
in 2.6.10, so for the patch Woody is trying to create I would just say
add the #define to sdp_main.h

-Libor
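
A minimal sketch of the guarded define suggested above, assuming it goes
into a backport-only header such as sdp_main.h. The value 125 is the one
quoted from asm-generic/errno.h in this thread; other architectures may
use a different number, so treat it as an assumption. The helper
function is hypothetical and only shows the constant in use. In kernel
code the include would be <linux/errno.h> rather than <errno.h>.

```c
#include <errno.h>

/* Backport shim: define ECANCELED only if the headers lack it. */
#ifndef ECANCELED
#define ECANCELED 125 /* asm-generic value; arch headers may differ */
#endif

/* Hypothetical helper: kernel-style negative errno for a canceled op. */
static int sdp_compat_canceled(void)
{
    return -ECANCELED;
}
```

The #ifndef guard keeps the shim harmless on kernels that already
provide the constant.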


From robert.j.woodruff at intel.com  Thu Apr 28 16:37:13 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 28 Apr 2005 16:37:13 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <20050428163305.A12393@topspin.com>
Message-ID: <ORSMSX4081XvpFVjCRG00000019@orsmsx408.amr.corp.intel.com>

Libor wrote,
>  In 2.6.9 the constant ECANCELED does not exist; it was only introduced
>in 2.6.10, so for the patch Woody is trying to create I would just say
>add the #define to sdp_main.h

>-Libor

Since we already have several kernel fixups needed to backport
to 2.6.9, I just went ahead and added ECANCELED to my
asm-generic/errno.h and can add that to the patch for kernel fixups
for the back port. 


From iod00d at hp.com  Thu Apr 28 17:18:16 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 28 Apr 2005 17:18:16 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <20050428163305.A12393@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F00043EA659@orsmsx408>
	<20050428224721.GL20957@esmail.cup.hp.com>
	<20050428163305.A12393@topspin.com>
Message-ID: <20050429001816.GQ20957@esmail.cup.hp.com>

On Thu, Apr 28, 2005 at 04:33:05PM -0700, Libor Michalek wrote:
> > Please do not. Just fix the code to use ECANCELED since that's
> > what SuSv3 specifies and linux prefers:
> > include/asm-generic/errno.h:#define     ECANCELED       125
> 
>   In 2.6.9 the constant ECANCELED does not exist; it was only introduced
> in 2.6.10, so for the patch Woody is trying to create I would just say
> add the #define to sdp_main.h

Why not patch the include/asm*/errno.h files?
alpha, mips, parisc, sparc64 use different values.
Even if you don't care about those arches on 2.6.9,
it would still make more sense to patch at least
the asm-generic/errno.h.

grant


From akpm at osdl.org  Thu Apr 28 17:56:09 2005
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 28 Apr 2005 17:56:09 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <1113840973.6274.84.camel@laptopd505.fenrus.org>
References: <200544159.Ahk9l0puXy39U6u6@topspin.com>
	<20050411142213.GC26127@kalmia.hozed.org>
	<52mzs51g5g.fsf@topspin.com> <4263DBBF.9040801@ammasso.com>
	<1113840973.6274.84.camel@laptopd505.fenrus.org>
Message-ID: <20050428175609.1893c8bd.akpm@osdl.org>

Arjan van de Ven <arjan at infradead.org> wrote:
>
> > Why do you call mlock() and get_user_pages()?  In our code, we only call mlock(), and the 
> > memory is pinned. 
> 
> this is a myth; linux is free to move the page about in physical memory
> even if it's mlock()ed!!

eh?  I guess the kernel _is_ free to move the page about, but it doesn't.

We might do at some time in the future for memory hotplug, I guess.


From iod00d at hp.com  Thu Apr 28 22:10:09 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 28 Apr 2005 22:10:09 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
Message-ID: <20050429051009.GT20957@esmail.cup.hp.com>

Hi,

I'm trying out uverbs stuff for the first time on ia64 
with svn 2225 and 2.6.11 kernel. I'm not sure what needs to happen
to ibv_pingpong to work. I was using the email roland sent out
announcing libuverbs support as a guide:
    http://openib.org/pipermail/openib-general/2005-February/004454.html

gsyprf3:/usr/src/linux-2.6# modprobe ib_mthca msi_x=1
ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004)
ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0)
GSI 60 (level, low) -> CPU 1 (0x0100) vector 69
ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69
gsyprf3:/usr/src/linux-2.6# modprobe ib_ipoib
gsyprf3:/usr/src/linux-2.6# ifconfig ib0 10.0.0.51 netmask 255.255.255.0 broadcast 10.0.0.255
gsyprf3:/usr/src/linux-2.6# modprobe ib_uverbs
gsyprf3:/usr/src/linux-2.6# lsmod
Module                  Size  Used by
ib_uverbs              50920  0 
ib_ipoib               90488  0 
ib_sa                  23980  1 ib_ipoib
ib_mthca              234383  0 
ib_mad                 82808  2 ib_sa,ib_mthca
ib_core                85416  5 ib_uverbs,ib_ipoib,ib_sa,ib_mthca,ib_mad
sg                     83200  0 
qla2300               127272  0 
qla2xxx               250463  1 qla2300
scsi_transport_fc      45672  1 qla2xxx
e1000                 187588  0 
tg3                   197888  0 
e100                   79630  0 
dm_mod                136584  0 
gsyprf3:/usr/src/linux-2.6# ls -l /dev/infiniband/uverbs*
crw-r--r--  1 root root 231, 128 Apr 28 12:28 /dev/infiniband/uverbs0
crw-r--r--  1 root root 231, 129 Apr 28 12:28 /dev/infiniband/uverbs1
gsyprf3:/usr/src/linux-2.6# ibv_pingpong
libibverbs: Warning: no driver for uverbs0
No IB devices found

What obvious, silly thing did I overlook?

thanks,
grant


From halr at voltaire.com  Fri Apr 29 03:31:04 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Apr 2005 06:31:04 -0400
Subject: [openib-general] [DAPL] ran kdapl test, got slab corruption
In-Reply-To: <1114727483.25364.7.camel@duffman>
References: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
	<1114727483.25364.7.camel@duffman>
Message-ID: <1114770000.4477.850.camel@localhost.localdomain>

On Thu, 2005-04-28 at 18:31, Tom Duffy wrote:
> On Fri, 2005-04-29 at 00:06 +0300, Itamar Rabenstein wrote:
> > I think the problem is related to the use of double in kernel
> > in i386 arch we need to add to makefile :
> > ifeq (${IS_i686},i686)
> > # Override -msoft-float in arch/i386/Makefile
> > EXTRA_CFLAGS += -mhard-float
> > endif 
> > 
> > I am not sure that you have a flag like this .
> > I am working now on a new version of kdapltest without any use of doubles
> > in kernel.
> > I think it will be ready early next week.
> 
> If I add:
> 
> EXTRA_CFLAGS += -msse
> 
> I can compile on x86_64.

I don't need to do this on x86_64. Not sure why. Could this be due to a
compiler difference ? I am using gcc version 3.4.2 20041017 (Red Hat
3.4.2-6.fc3). I also build "in tree" rather than out of tree.

> Now that I run the test, I get the following:
> 
> [root at flopteron2 ~]# ./kdapltest -T S -D mthca0a -d
> Server_Cmd.debug:       1
> Server_Cmd.dapl_name: mthca0a
> DT_cs_Server: IA mthca0a opened
> DT_cs_Server: PZ created
> DT_cs_Server: EP created
> DT_cs_Server: PSP created
> *****  DAPL  Characteristics  *****
> Provider: mthca0a  Version 1.0  DAPL 1.2
> Adapter: Generic InfiniBand HCA by DAPL Reference Implementation Version
> 0.0
> Supporting:
>         64512 EPs with 65535 DTOs and 0 RDMA/RDs each
>         65408 EVDs of up to 65535 entries  (default S/R size is 16/16)
>         IOVs of up to 59 elements
>         131056 LMRs (and 131056 RMRs) of up to 0xffffffffffffffff bytes
>         Maximum MTU 0x80000000 bytes, RDMA 0x80000000 bytes
>         Maximum Private data size 92 bytes
> ***** ***** ***** ***** ***** *****
> DT_cs_Server: Posting 2 recvs
> Dapltest: Service Point Ready - mthca0a
> DT_cs_Server: Waiting for Connection Request
> DT_cs_Server: Accepting Connection Request
> DT_cs_Server: Awaiting connection ...
> DT_cs_Server: Connected!
> DAT_STATE: DAT_EP_STATE_CONNECTED
> DAT_STATE: Inbound DTO Status: Active
> DAT_STATE: Outbound DTO Status: Idle
> DT_cs_Server: Waiting for Client_Info
> DT_cs_Server: Got Client_Info
> DT_cs_Server: Waiting for Client_Cmd_Info
> DT_cs_Server: Send Server_Info
> Client Requests Server to Quit
> DT_cs_Server: Waiting for clients to all go away...
> DT_cs_Server: Cleaning up ...
> DT_cs_Server: IA mthca0a closed
> DT_cs_Server (mthca0a):  Exiting.
> TEST INSTANCE 1
> 
> [root at sins-stinger-10 ~]# ./kdapltest -T Q -s 192.168.0.26 -D mthca0a -d
> Server Name: 192.168.0.26
> Server Net Address: 192.168.0.26
> DT_cs_Client: Starting Test ...
> DT_cs_Client: IA mthca0a opened
> DT_cs_Client: EP created
> *****  DAPL  Characteristics  *****
> Provider: mthca0a  Version 1.0  DAPL 1.2
> Adapter: Generic InfiniBand HCA by DAPL Reference Implementation Version
> 0.0
> Supporting:
>         64512 EPs with 65535 DTOs and 0 RDMA/RDs each
>         65408 EVDs of up to 65535 entries  (default S/R size is 16/16)
>         IOVs of up to 28 elements
>         131056 LMRs (and 131056 RMRs) of up to 0xffffffffffffffff bytes
>         Maximum MTU 0x80000000 bytes, RDMA 0x80000000 bytes
>         Maximum Private data size 92 bytes
> ***** ***** ***** ***** ***** *****
> DT_cs_Client: Posting 1 recv buffer
> DT_cs_Client: Connect Endpoint
> DT_cs_Client: Await connection ...
> DT_cs_Client: Connected!
> DAT_STATE: DAT_EP_STATE_CONNECTED
> DAT_STATE: Inbound DTO Status: Active
> DAT_STATE: Outbound DTO Status: Idle
> DT_cs_Client: Sending Client_Info
> DT_cs_Client: Sent Client_Info - awaiting completion
> DT_cs_Client: Sending Command
> DT_cs_Client: Sent Command - awaiting completion
> DT_cs_Client: Waiting for Server_Info
> DT_cs_Client: Server_Info Received
> DT_cs_Client: Version OK!
> -------------------------------------
> Server_Info.dapltest_version   : 6
> Server_Info.is_little_endian   : 1
> -------------------------------------
> Client_Info.dapltest_version   : 6
> Client_Info.is_little_endian   : 1
> Client_Info.test_type          : 4
> Quit_Cmd.server_name: 192.168.0.26
> Quit_Cmd.device_name: mthca0a
> DT_cs_Client: Cleaning Up ...
> DT_cs_Client: IA mthca0a closed
> DT_cs_Client: ========== End of Work -- Client Exiting
> TEST INSTANCE 1
> 
> Unfortunately, I get this error in dmesg:
> 
> dapl_ib_disconnect_clean: ep_ptr 0xffff81003f085320 has invalid CM handle

Do you know if the DREQ actually has been sent on IB ?

In any case, that message is really debug and can be eliminated.

> Also, this is bad: slab corruption
> 
> Slab corruption: start=ffff810077455eb8, len=312
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff882c8252>](req_comp_work+0x42/0x90 [ib_at])
> 050: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00
> 120: 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 00 00 00
> Prev obj: start=ffff810077455d68, len=312
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<0000000000000000>](0x0)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> Slab corruption: start=ffff81003aecbdb0, len=288
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<ffffffff882c8264>](req_comp_work+0x54/0x90 [ib_at])
> 040: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b
> 110: 00 00 00 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b a5
> Prev obj: start=ffff81003aecbc78, len=288
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<0000000000000000>](0x0)
> 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

Is this associated with the above (quit test) ? Can this be reproduced ?
I will inspect the code to see how this could occur. Is it between 2
x86_64 machines ?

-- Hal

> -tduffy
> 


From halr at voltaire.com  Fri Apr 29 06:44:12 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Apr 2005 09:44:12 -0400
Subject: [openib-general] Some Initial RMPP Comments and Questions
Message-ID: <1114782148.4477.888.camel@localhost.localdomain>

Hi Sean,

I just started playing with RMPP and have one comment and a couple of
questions. I am just doing some preliminary testing with SA
GetTable/GetTableResp.

1. I see a NWL in some ACKs of 0x41 when ACKing segment 1. So the
receive side supports 64 incoming MADs. Is that correct ? (That seems
good to me).

2. Since RMPP supports streaming (I know OpenIB doesn't do this on
transmit), what does the receiver do with an incoming RMPP stream whose
PayloadLength is not specified in the FIRST DATA packet ? (I think this
may be needed for interoperability with third party RMPP implementations
which do this).

3. How does the RMPP receiver handle discrepancies between the FIRST
DATA PayloadLength and the LAST DATA PayloadLength ?

Thanks.

-- Hal
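
The arithmetic behind question 1 can be sketched as follows, assuming
the open window runs from the segment after the acknowledged one up to
and including NewWindowLast (NWL). The function name is hypothetical and
is not part of the OpenIB MAD code.

```c
#include <stdint.h>

/* Number of segments the sender may still transmit after an ACK:
 * everything after the acknowledged segment, up to NewWindowLast. */
static uint32_t rmpp_open_window(uint32_t acked_seg, uint32_t new_window_last)
{
    if (new_window_last <= acked_seg)
        return 0; /* window closed */
    return new_window_last - acked_seg;
}
```

With an ACK of segment 1 and NWL = 0x41 (65), this gives 64 segments,
which matches the figure in question 1.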


From woodennickel at gmail.com  Fri Apr 29 07:26:41 2005
From: woodennickel at gmail.com (Bill Jordan)
Date: Fri, 29 Apr 2005 10:26:41 -0400
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050426133752.37d74805.akpm@osdl.org>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<52acnmtmh6.fsf@topspin.com> <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com> <20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>
Message-ID: <5ebee0d105042907265ff58a73@mail.gmail.com>

On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:

> Our point is that contemporary microprocessors cannot electrically do what
> you want them to do!
> 
> Now, conceeeeeeiveably the kernel could keep track of the state of the
> pages down to the byte level, and could keep track of all COWed pages and
> could look at faulting addresses at the byte level and could copy sub-page
> ranges by hand from one process's address space into another process's
> after I/O completion.  I don't think we want to do that.
> 
> Methinks your specification is busted.

I agree in principle. However, I expect this issue will come up with
more and more new specifications, and if it isn't addressed once in
the linux kernel, it will be kludged and broken many times in many
drivers.

I believe we need a kernel-level interface that will pin user pages,
and lock the user vma in a single step. The interface should be used
by drivers when the hardware mappings are done. If the process is
split into a user operation to lock the memory, and a driver operation
to map the hardware, there will always be opportunity for abuse.

Reference counting needs to be done by this interface to allow
different hardware to interoperate.

The interface can't overload the VM_LOCKED flag, or rely on any other
attributes that the user can tinker with via any other interface.

And as much as I hate to admit it, I think on a fork, we will need to
copy parts of pages at the beginning or end of user I/O buffers.

-- 
Bill Jordan
InfiniCon Systems


From roland at topspin.com  Fri Apr 29 07:48:09 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 07:48:09 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
Message-ID: <52pswdk5au.fsf@topspin.com>

    Grant> gsyprf3:/usr/src/linux-2.6# ibv_pingpong libibverbs:
    Grant> Warning: no driver for uverbs0

    Grant> What obvious, silly thing did I overlook?

Where is mthca.so installed?  By default libibverbs only looks in
$PREFIX/lib/infiniband, but you can add any path via the
OPENIB_DRIVER_PATH environment variable.

 - R.
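
A rough sketch of the lookup described above, for illustration only.
It assumes the extra value is a colon-separated list of directories,
that those directories are tried before the default one, and that
PREFIX is /usr; none of this is taken from the libibverbs source, and
the function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define PATH_LEN 256
#define DEFAULT_DRIVER_DIR "/usr/lib/infiniband" /* assumes PREFIX=/usr */

/* Build candidate file names for a driver: each directory from
 * extra_path (colon-separated, may be NULL) followed by the default
 * directory. Returns the number of candidates written to out. */
static int driver_candidates(const char *extra_path, const char *driver,
                             char out[][PATH_LEN], int max)
{
    int n = 0;
    const char *p = extra_path;

    while (p && *p && n < max) {
        size_t len = strcspn(p, ":");

        if (len > 0)
            snprintf(out[n++], PATH_LEN, "%.*s/%s", (int)len, p, driver);
        p += len;
        if (*p == ':')
            p++;
    }
    if (n < max)
        snprintf(out[n++], PATH_LEN, "%s/%s", DEFAULT_DRIVER_DIR, driver);
    return n;
}
```

A caller would pass getenv("OPENIB_DRIVER_PATH") as extra_path and try
each candidate in turn.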


From robert.j.woodruff at intel.com  Fri Apr 29 08:17:32 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 29 Apr 2005 08:17:32 -0700
Subject: [openib-general] Compiling SDP on 2.6.9
In-Reply-To: <20050429001816.GQ20957@esmail.cup.hp.com>
Message-ID: <ORSMSX408HM3SOlbpH70000001a@orsmsx408.amr.corp.intel.com>

 Grant wrote, 
>Why not patch the include/asm*/errno.h files?
>alpha, mips, parisc, sparc64 use different values.
>Even if you don't care about those arches on 2.6.9,
>it would still make more sense to patch at least
>the asm-generic/errno.h.

>grant

Good point. I will patch the files for all the architectures. 

woody


From woodennickel at gmail.com  Fri Apr 29 08:47:17 2005
From: woodennickel at gmail.com (Bill Jordan)
Date: Fri, 29 Apr 2005 11:47:17 -0400
Subject: [openib-general] Status of SRP
Message-ID: <5ebee0d105042908471d076ff@mail.gmail.com>

What is the status of SRP? I see code in the roland-merge branch, but
none in the trunk.

-- 
Bill Jordan
InfiniCon Systems


From caitlin.bestler at gmail.com  Fri Apr 29 08:56:20 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Fri, 29 Apr 2005 08:56:20 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <5ebee0d105042907265ff58a73@mail.gmail.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
Message-ID: <469958e005042908566f177b50@mail.gmail.com>

On 4/29/05, Bill Jordan <woodennickel at gmail.com> wrote:
> On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:
> 
> > Our point is that contemporary microprocessors cannot electrically do what
> > you want them to do!
> >
> > Now, conceeeeeeiveably the kernel could keep track of the state of the
> > pages down to the byte level, and could keep track of all COWed pages and
> > could look at faulting addresses at the byte level and could copy sub-page
> > ranges by hand from one process's address space into another process's
> > after I/O completion.  I don't think we want to do that.
> >
> > Methinks your specification is busted.
> 
> I agree in principle. However, I expect this issue will come up with
> more and more new specifications, and if it isn't addressed once in
> the linux kernel, it will be kludged and broken many times in many
> drivers.
> 
> I believe we need a kernel-level interface that will pin user pages,
> and lock the user vma in a single step. The interface should be used
> by drivers when the hardware mappings are done. If the process is
> split into a user operation to lock the memory, and a driver operation
> to map the hardware, there will always be opportunity for abuse.
> 
> Reference counting needs to be done by this interface to allow
> different hardware to interoperate.
> 
> The interface can't overload the VM_LOCKED flag, or rely on any other
> attributes that the user can tinker with via any other interface.
> 
> And as much as I hate to admit it, I think on a fork, we will need to
> copy parts of pages at the beginning or end of user I/O buffers.
> 

I agree with all but the last part. In my opinion there is no need to deal
with fork issues as long as solutions do not result in failures. There is
*no* basis for a child process to expect that it will inherit RDMA resources.
A child process that uses such resources will get undefined results, nothing
further needs to be stated, and no heroic efforts are required to avoid them.

What is definitely needed is kernel counting of locks on user pages.
Finer granularity is not expected, it is the RDMA hardware that works
at finer granularity. All it needs is to know what bus address a given
virtual page maps to -- and it needs to know that said mapping will
not change without advance notice.

Further, any revocation of an existing mapping (to deal with hot page
swapping or whatever) cannot expect the RDMA hardware to respond
any faster than it would to invalidating a memory region.

The RDMA hardware has an inherent need to cache translations.
That is why it cannot guarantee that it will cease updating a memory
region the nanosecond that a request is made to invalidate an STag.
Instead it is allowed to block on such a request and only guarantees
to have ceased access when the invalidate request completes.

The same need for a delay exists for any interface that moves memory
around, or requests to reclaim memory from the application.

This also applies on process death. The hardware cannot stop on a dime.
The best it can do is stop promptly, and give an unambiguous indication
to the OS as to when it has stopped.
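
The per-page counting of locks argued for above could look roughly like
this. It is a user-space model under stated assumptions (4 KB pages, a
small fixed table, hypothetical function names), not an existing kernel
interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumes 4 KB pages */
#define NPAGES     1024 /* size of this toy address space, in pages */

static unsigned int pin_count[NPAGES];

/* Translate a byte range into first/last page numbers. */
static int page_span(uintptr_t addr, size_t len,
                     uintptr_t *first, uintptr_t *last)
{
    if (len == 0)
        return -1;
    *first = addr >> PAGE_SHIFT;
    *last = (addr + len - 1) >> PAGE_SHIFT;
    return (*last < NPAGES) ? 0 : -1;
}

/* Take one pin reference on every page the range touches. */
static int pin_range(uintptr_t addr, size_t len)
{
    uintptr_t first, last, pfn;

    if (page_span(addr, len, &first, &last) != 0)
        return -1;
    for (pfn = first; pfn <= last; pfn++)
        pin_count[pfn]++;
    return 0;
}

/* Drop one reference per page; fail if any page is not pinned. */
static int unpin_range(uintptr_t addr, size_t len)
{
    uintptr_t first, last, pfn;

    if (page_span(addr, len, &first, &last) != 0)
        return -1;
    for (pfn = first; pfn <= last; pfn++)
        if (pin_count[pfn] == 0)
            return -1;
    for (pfn = first; pfn <= last; pfn++)
        pin_count[pfn]--;
    return 0;
}

/* A page may be moved or reclaimed only when this returns 0. */
static int page_is_pinned(uintptr_t addr)
{
    uintptr_t pfn = addr >> PAGE_SHIFT;

    return pfn < NPAGES && pin_count[pfn] > 0;
}
```

Because the count is kept per page, two devices (or two registrations)
can pin overlapping buffers and each can release independently, which is
the interoperation point raised earlier in the thread.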


From iod00d at hp.com  Fri Apr 29 09:21:27 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 29 Apr 2005 09:21:27 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <52pswdk5au.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
	<52pswdk5au.fsf@topspin.com>
Message-ID: <20050429162127.GA24871@esmail.cup.hp.com>

On Fri, Apr 29, 2005 at 07:48:09AM -0700, Roland Dreier wrote:
>     Grant> gsyprf3:/usr/src/linux-2.6# ibv_pingpong libibverbs:
>     Grant> Warning: no driver for uverbs0
> 
>     Grant> What obvious, silly thing did I overlook?
> 
> Where is mthca.so installed?

sorry - it wasn't installed. I missed the "and src/userspace/libmthca"
in the original instructions. Here's a summary of what I have done
so far:

	cd /usr/src/openib_gen2/src/userspace/libibverbs
	./autogen.sh
	dpkg-buildpackage -rfakeroot -uc -us
	dpkg -i ../libibverbs-dev_0.1.0-1_ia64.deb
	dpkg -i ../libibverbs-dev_0.1.0-1_ia64.deb
	dpkg -i ../ibverbs-examples_0.1.0-1_ia64.deb

	cd /usr/src/openib_gen2/src/userspace/libmthca
	./autogen.sh
	dpkg-buildpackage -rfakeroot -uc -us
	dpkg -i ../libmthca1_0.1.0-1_ia64.deb

> By default libibverbs only looks in
> $PREFIX/lib/infiniband, but you can add any path via the
> OPENIB_DRIVER_PATH environment variable.

It's now installed here:
gsyprf3:/usr/src/openib_gen2/src/userspace# dpkg -S mthca.so
libmthca1: /usr/lib/infiniband/mthca.so

gsyprf3:~# dpkg -S ibv_pingpong
ibverbs-examples: /usr/bin/ibv_pingpong

gsyprf3:~# ibv_pingpong
libibverbs: Warning: no driver for uverbs0
No IB devices found

Debian packages can "suggest" or "require" libmthca1 pkg be installed.
But that won't help if the two packages don't use the same PREFIX.
config.status logfile indicates autogen.sh uses '--prefix=/usr'.
That's true for both libmthca/config.status and libibverbs/config.status.

Turns out I had ibv_pingpong still in /usr/local/bin/ from a "manual"
build I had done as recommended in the original announcement.
Deleting that got me past this...but...

I think users of other distros are going to be just as confused.
Can the "no driver for uverbs0" error message indicate the *userspace*
driver is missing?
And indicate where it's looking?
e.g. something like:
libibverbs: no userspace driver found in /lib/infiniband
	for /dev/infiniband/uverbs0. $OPENIB_DRIVER_PATH is set?


But it's still not working:
yprf3:~# OPENIB_DRIVER_PATH=/usr/lib/infiniband/ ibv_pingpong
Couldn't get context for mthca0

What silly detail am I missing now?

I'm not getting any console output either when running this.


And one minor diff appended to help people if they forget
to modprobe ib_uverbs.

thanks,
grant


Index: src/init.c
===================================================================
--- src/init.c	(revision 2231)
+++ src/init.c	(working copy)
@@ -214,7 +214,7 @@
 
 	cls = sysfs_open_class("infiniband_verbs");
 	if (!cls) {
-		fprintf(stderr, PFX "Fatal: couldn't open infiniband sysfs class.\n");
+		fprintf(stderr, PFX "Fatal: couldn't open sysfs class \"infiniband_verbs\".\n");
 		return;
 	}
 

From robert.j.woodruff at intel.com  Fri Apr 29 09:25:04 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 29 Apr 2005 09:25:04 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
Message-ID: <1AC79F16F5C5284499BB9591B33D6F000441F575@orsmsx408>

 
Grant wrote,
>But it's still not working:
>yprf3:~# OPENIB_DRIVER_PATH=/usr/lib/infiniband/ ibv_pingpong
>Couldn't get context for mthca0

Did you create the dev nodes ?
for port 0, 

mkdir /dev/infiniband
/bin/mknod /dev/infiniband/uverbs0 c 231 128

woody


From mshefty at ichips.intel.com  Fri Apr 29 09:29:18 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 29 Apr 2005 09:29:18 -0700
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <1114782148.4477.888.camel@localhost.localdomain>
References: <1114782148.4477.888.camel@localhost.localdomain>
Message-ID: <427260DE.6090109@ichips.intel.com>

Hal Rosenstock wrote:

> Hi Sean,
> 
> I just started playing with RMPP and have one comment and a couple of
> questions. I am just doing some preliminary testing with SA
> GetTable/GetTableResp.
> 
> 1. I see a NWL in some ACKs of 0x41 when ACKing segment 1. So the
> receive side supports 64 incoming MADs. Is that correct ? (That seems
> good to me).

This is set in the call window_size().  The window size is set to the 
size of the QP >> 3.  I made this value up.
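
A small sketch of the arithmetic, assuming the NewWindowLast value in an ACK
is the acknowledged segment number plus the window size. That relationship,
the function names, and the clamp to a minimum window of 1 are assumptions
made for illustration; only the ">> 3" comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Receive window: one eighth of the QP size (the ">> 3" described above). */
static uint32_t sketch_window_size(uint32_t qp_size)
{
	uint32_t w = qp_size >> 3;

	return w ? w : 1;	/* assumption: never advertise an empty window */
}

/* NewWindowLast to advertise when ACKing segment seg_num. */
static uint32_t sketch_new_window_last(uint32_t seg_num, uint32_t qp_size)
{
	assert(seg_num >= 1);	/* RMPP segment numbers start at 1 */
	return seg_num + sketch_window_size(qp_size);
}
```

Under these assumptions a QP size of 512 gives a window of 64, and ACKing
segment 1 advertises 65 (0x41), which is consistent with the value Hal
reported. The QP size of 512 is itself a guess used to make the numbers line
up.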

> 2. Since RMPP supports streaming (I know OpenIB doesn't do this on
> transmit), what does the receiver do with an incoming RMPP stream whose
> PayloadLength is not specified in the FIRST DATA packet ? (I think this
> may be needed for interoperability with third party RMPP implementations
> which do this).

The RMPP receive code doesn't use the FIRST payload length when 
receiving data.  It looks for the LAST bit to be set in the incoming 
data MAD.

Also, if you can think of a way to support this on the send side, 
you/I/someone could add this.  I think that this would require 
extending the send MAD API.

> 3. How does the RMPP receiver handle discrepancies between the FIRST
> DATA PayloadLength and the LAST DATA PayloadLength ?

Currently it doesn't.  The payload length in the LAST data packet is 
used to calculate the padding for the total MAD length.  See 
get_mad_len().  The only check that's done is to ensure that the 
calculated padding is smaller than the MAD's data size.
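
The padding calculation can be sketched roughly as follows. This is not the
body of get_mad_len(); the function name, the parameters, and the choice to
fall back to zero padding when the LAST payload length is out of range are
all assumptions. The sketch also assumes the payload length counts only the
valid data bytes in the last segment.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total reassembled length: one header plus seg_count data segments,
 * minus the unused tail of the last segment.
 */
static size_t sketch_mad_len(size_t hdr_size, size_t data_size,
			     size_t seg_count, size_t last_payload_len)
{
	size_t pad = 0;

	assert(data_size > 0 && seg_count >= 1);

	/*
	 * Only trust the LAST payload length if the padding it implies is
	 * smaller than one data segment; otherwise assume no padding.
	 */
	if (last_payload_len > 0 && last_payload_len <= data_size)
		pad = data_size - last_payload_len;

	return hdr_size + seg_count * data_size - pad;
}
```

For example, with an illustrative 56-byte header, 200-byte data segments,
three segments, and 50 valid bytes in the last one, the padding is 150 and
the total is 506 bytes. Note that the FIRST payload length plays no part
here, matching the answer to question 3.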

- Sean


From iod00d at hp.com  Fri Apr 29 09:37:03 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 29 Apr 2005 09:37:03 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000441F575@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F000441F575@orsmsx408>
Message-ID: <20050429163703.GC24871@esmail.cup.hp.com>

On Fri, Apr 29, 2005 at 09:25:04AM -0700, Woodruff, Robert J wrote:
> Did you create the dev nodes ?
> for port 0, 
> 
> mkdir /dev/infiniband
> /bin/mknod /dev/infiniband/uverbs0 c 231 128

yup. I didn't miss that bit from the instructions:
gsyprf3:~# ls -l /dev/infiniband/
total 0
crw-rw-rw-  1 root root 231, 128 Apr 28 12:28 uverbs0
crw-rw-rw-  1 root root 231, 129 Apr 28 12:28 uverbs1

thanks,
grant


From roland at topspin.com  Fri Apr 29 09:36:43 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 09:36:43 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <20050429162127.GA24871@esmail.cup.hp.com> (Grant Grundler's
	message of "Fri, 29 Apr 2005 09:21:27 -0700")
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
	<52pswdk5au.fsf@topspin.com>
	<20050429162127.GA24871@esmail.cup.hp.com>
Message-ID: <52ll71k09w.fsf@topspin.com>

    Grant> Users of other distro's I think are going to be just as
    Grant> confused.  Can the "no driver for uverbs0" error message
    Grant> indicate the *userspace* driver is missing?  And indicate
    Grant> where it's looking?

I just updated the code so the error messages are:

    libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'.

    libibverbs: Warning: no userspace device-specific driver found for uverbs0
            driver search path: /usr/lib/infiniband:/usr/local/lib/infiniband
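
For illustration, a message of this shape could be produced along the
following lines. This is a sketch, not the libibverbs source: the function
name format_no_driver_warning() and its parameters are hypothetical, and the
order in which the default directory and the extra path are joined is an
assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the "no driver" warning into buf, including the directories that
 * were searched.  extra_path may be NULL or empty when no additional path
 * is configured.  Returns what snprintf() returns.
 */
static int format_no_driver_warning(char *buf, size_t len, const char *dev,
				    const char *default_path,
				    const char *extra_path)
{
	assert(buf && dev && default_path);

	if (extra_path && strlen(extra_path) > 0)
		return snprintf(buf, len,
				"libibverbs: Warning: no userspace "
				"device-specific driver found for %s\n"
				"\tdriver search path: %s:%s\n",
				dev, default_path, extra_path);

	return snprintf(buf, len,
			"libibverbs: Warning: no userspace "
			"device-specific driver found for %s\n"
			"\tdriver search path: %s\n",
			dev, default_path);
}
```

A caller would typically pass getenv("OPENIB_DRIVER_PATH") as extra_path and
write the resulting buffer to stderr.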

    Grant> But it's still not working: yprf3:~#
    Grant> OPENIB_DRIVER_PATH=/usr/lib/infiniband/ ibv_pingpong
    Grant> Couldn't get context for mthca0

Does the kernel driver match the userspace code -- ie are they both
the latest subversion?  I haven't started being anal about bumping the
ABI version when changing the interface, so it's possible to get out
of sync with this development code.

 - R.


From roland at topspin.com  Fri Apr 29 09:37:36 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 09:37:36 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000441F575@orsmsx408> (Robert J.
	Woodruff's message of "Fri, 29 Apr 2005 09:25:04 -0700")
References: <1AC79F16F5C5284499BB9591B33D6F000441F575@orsmsx408>
Message-ID: <52hdhpk08f.fsf@topspin.com>

    Robert> Did you create the dev nodes ?

That can't be it -- the error message Grant is getting means the
userspace library opened a dev node.

 - R.


From jlentini at netapp.com  Fri Apr 29 09:43:45 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 29 Apr 2005 12:43:45 -0400 (EDT)
Subject: [openib-general] Re: [PATCH][DAPL] make dapl build outside of
	kernel tree
In-Reply-To: <1114633684.20016.8.camel@duffman>
References: <1114633684.20016.8.camel@duffman>
Message-ID: <Pine.LNX.4.61.0504291239120.5321@jlentini-linux.nane.netapp.com>


Tom,

Did you try building using the method described in the README file?

svn/gen2/users/jlentini/linux-kernel/README

The procedure there has been working for me. I'd rather continue using 
it than change. I like the idea that the build setup is exactly as it 
would be if it were part of the trunk.

james

On Wed, 27 Apr 2005, Tom Duffy wrote:

> Until DAPL is in the trunk, it should be able to be built outside of
> your normal kernel tree.  These changes make that possible.  Now, you
> can type something like:
>
> $ KERNELDIR=/path/to/kernel/dir/or/object/dir make
>
> in each of dat, dat-provider, and patches (to get ib_at).
>
> Signed-off-by: Tom Duffy <tduffy at sun.com>
>
> Index: gen2/users/jlentini/linux-kernel/dat-provider/Makefile
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/Makefile	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/Makefile	(working copy)
> @@ -1,18 +1,3 @@
> -
> -obj-$(CONFIG_INFINIBAND_DAT_PROVIDER) += ib_dat_provider.o
> -
> -#debug
> -KDAPL_DEBUG = 1
> -ifeq (1,$(KDAPL_DEBUG))
> -  EXTRA_CFLAGS += -O0 -g
> -  EXTRA_CFLAGS += -DDAPL_DBG # -DDAPL_DBG_IO_TRC
> -endif
> -
> -EXTRA_CFLAGS += 				\
> -    -DDAPL_ATS					\
> -    -Idrivers/infiniband/include		\
> -    -Idrivers/dat
> -
> PROVIDER_MODULES := \
> 	dapl_openib_qp			\
> 	dapl_openib_util		\
> @@ -106,5 +91,25 @@ PROVIDER_MODULES := \
>
> PROVIDER_OBJS := $(foreach s, $(PROVIDER_MODULES), $(s).o)
>
> -ib_dat_provider-y:= $(PROVIDER_OBJS)
> +KDAPL_DEBUG = 1
> +ifeq (1,$(KDAPL_DEBUG))
> +  EXTRA_CFLAGS += -O0 -g
> +  EXTRA_CFLAGS += -DDAPL_DBG # -DDAPL_DBG_IO_TRC
> +endif
> +
> +EXTRA_CFLAGS += -DDAPL_ATS -Idrivers/infiniband/include -I$(obj)/../dat -I$(obj)/../patches/
> +
> +ifneq ($(KERNELRELEASE),)
> +        obj-m := ib_dat_provider.o
> +        ib_dat_provider-objs := $(PROVIDER_OBJS)
> +else
> +	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
> +	PWD := $(shell pwd)
> +
> +default:
> +	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
> +
> +endif
>
> +clean:
> +	rm -f *.o *.ko
> Index: gen2/users/jlentini/linux-kernel/patches/at.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/patches/at.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/patches/at.c	(working copy)
> @@ -45,7 +45,7 @@
> #include <ib_verbs.h>
> #include <ib_sa.h>
>
> -#include "../ulp/ipoib/ipoib.h"
> +#include <ipoib.h>
> #include <ib_at.h>
>
> MODULE_AUTHOR("Shahar Frank");
> Index: gen2/users/jlentini/linux-kernel/patches/Makefile
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/patches/Makefile	(revision 0)
> +++ gen2/users/jlentini/linux-kernel/patches/Makefile	(revision 0)
> @@ -0,0 +1,16 @@
> +EXTRA_CFLAGS += -Werror -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib/ -I$(obj)
> +
> +ifneq ($(KERNELRELEASE),)
> +	obj-m := ib_at.o
> +	ib_at-objs := at.o
> +else
> +	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
> +	PWD := $(shell pwd)
> +
> +default:
> +	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
> +
> +endif
> +
> +clean:
> +	rm -f *.o *.ko
> Index: gen2/users/jlentini/linux-kernel/dat/Makefile
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat/Makefile	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat/Makefile	(working copy)
> @@ -1,13 +1,16 @@
> +EXTRA_CFLAGS += -Werror -I$(obj)
>
> -EXTRA_CFLAGS += \
> -    -Idrivers/dat      	\
> -    -Werror
> +ifneq ($(KERNELRELEASE),)
> +	obj-m := dat.o
> +	dat-objs := consumer.o core.o dictionary.o dr.o provider.o
> +else
> +	KERNELDIR ?= /lib/modules/$(shell uname -r)/build
> +	PWD := $(shell pwd)
>
> -obj-$(CONFIG_DAT) += dat.o
> +default:
> +	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
>
> -dat-y := \
> -    consumer.o		\
> -    core.o 		\
> -    dictionary.o	\
> -    dr.o		\
> -    provider.o
> +endif
> +
> +clean:
> +	rm -f *.o *.ko
>


From roland at topspin.com  Fri Apr 29 09:45:50 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 09:45:50 -0700
Subject: RDMA memory registration (was: [openib-general] Re:
	[PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
In-Reply-To: <469958e005042908566f177b50@mail.gmail.com> (Caitlin Bestler's
	message of "Fri, 29 Apr 2005 08:56:20 -0700")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
Message-ID: <52d5sdjzup.fsf_-_@topspin.com>

Is there anything wrong with the following plan?

1) For memory registration, use get_user_pages() in the kernel.  Use
   locked_vm and RLIMIT_MEMLOCK to limit the amount of memory pinned
   by a given process.  One disadvantage of this is that the
   accounting will overestimate the amount of pinned memory if a
   process pins the same page twice, but this doesn't seem that bad to
   me -- it errs on the side of safety.

2) For fork() support:

   a) Extend mprotect() with PROT_DONTCOPY so processes can avoid
      copy-on-write problems.

   b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect()
      with PROT_ALWAYSCOPY so processes can mark pages to be
      pre-copied into child processes, to handle the case where only
      half a page is registered.

I believe this puts the code that must be trusted into the kernel and
gives userspace primitives that let apps handle the rest.
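
To make the accounting in (1) concrete, here is a userspace sketch of
charging pinned pages against a per-process limit. The struct and function
names are hypothetical; locked_pages stands in for locked_vm and limit_pages
for RLIMIT_MEMLOCK expressed in pages. It is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct pin_account {
	size_t locked_pages;	/* pages charged so far */
	size_t limit_pages;	/* per-process limit, in pages */
};

/* Charge npages for a new registration; -1 if it would exceed the limit. */
static int charge_pages(struct pin_account *a, size_t npages)
{
	assert(a->locked_pages <= a->limit_pages);
	if (npages > a->limit_pages - a->locked_pages)
		return -1;
	a->locked_pages += npages;
	return 0;
}

/* Return npages to the budget when a registration is destroyed. */
static void uncharge_pages(struct pin_account *a, size_t npages)
{
	assert(npages <= a->locked_pages);
	a->locked_pages -= npages;
}
```

Because each registration is charged in full, registering the same four
pages twice charges eight pages even though only four distinct pages are
pinned. That is the overestimate mentioned in (1): the process hits its
limit early, never late.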

 - R.


From jlentini at netapp.com  Fri Apr 29 09:52:21 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 29 Apr 2005 12:52:21 -0400 (EDT)
Subject: [openib-general] Re: [PATCH][DAPL] Fix sparse warnings on dapl
	builds
In-Reply-To: <1114635201.20016.13.camel@duffman>
References: <1114635201.20016.13.camel@duffman>
Message-ID: <Pine.LNX.4.61.0504291249400.5321@jlentini-linux.nane.netapp.com>


Tom,

How did you produce these errors?

I don't see any warning when I build the code w/ kbuild:

# cd root_of_my_2.6.11_linux_src_tree

# make drivers/dat/
# make drivers/infiniband/ulp/dat-provider/

james

On Wed, 27 Apr 2005, Tom Duffy wrote:

> This patch fixes all the sparse warnings during build of dat,
> dat-provider, and ib_at.
>
> Signed-off-by: Tom Duffy <tduffy at sun.com>
>
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_connect.c	(working copy)
> @@ -263,7 +263,7 @@ dapl_ep_connect(DAT_EP_HANDLE ep_handle,
> 						       connect_evd_handle,
> 						       DAT_CONNECTION_EVENT_UNREACHABLE,
> 						       (DAT_HANDLE) ep_ptr, 0,
> -						       0);
> +						       NULL);
> 			dat_status = DAT_SUCCESS;
> 		}
> 	} else {
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_module.c	(working copy)
> @@ -44,7 +44,7 @@ MODULE_DESCRIPTION("DAT Provider for Inf
> MODULE_AUTHOR("James Lentini");
>
> int g_dapl_dbg_type = 0;
> -MODULE_PARM(g_dapl_dbg_type, "i");
> +module_param(g_dapl_dbg_type, int, 0644);
> MODULE_PARM_DESC(g_dapl_dbg_type, "Enable dapl debug types");
>
> static int dapl_init(void);
> @@ -209,13 +209,13 @@ void DAT_PROVIDER_FINI_FUNC_NAME(const D
> 	(void)dapl_provider_list_remove(provider_info->ia_name);
> }
>
> -struct ib_client dapl_client = {
> +static struct ib_client dapl_client = {
> 	.name = "dapl",
> 	.add = dapl_add_one,
> 	.remove = dapl_remove_one
> };
>
> -char *dev_name_suffix_table[3] = {
> +static char *dev_name_suffix_table[3] = {
> 	"",
> 	"a",
> 	"b"
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_provider.c	(working copy)
> @@ -53,7 +53,7 @@ DAPL_PROVIDER_LIST g_dapl_provider_list;
>
> DAT_PROVIDER g_dapl_provider_template = {
> 	NULL,
> -	0,
> +	NULL,
> 	&dapl_ia_open,
> 	&dapl_ia_query,
> 	&dapl_ia_close,
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_cm.c	(working copy)
> @@ -628,7 +628,7 @@ dapl_ib_setup_conn_listener(DAPL_IA * ia
> 	if (status) {
> 		/* need to destroy CM ID ??? */
>
> -		sp_ptr->cm_srvc_handle = 0;
> +		sp_ptr->cm_srvc_handle = NULL;
>
> 		if (status == -EBUSY)
> 			return DAT_CONN_QUAL_IN_USE;
> @@ -799,22 +799,6 @@ DAT_RETURN dapl_ib_accept_connection(DAT
> 	return DAT_SUCCESS;
> }
>
> -DAT_RETURN dapl_ib_comm_established(DAPL_EP * ep_ptr)
> -{
> -	int status;
> -	DAT_RETURN dat_status = DAT_SUCCESS;
> -
> -	status = ib_send_cm_rtu(ep_ptr->cm_handle, NULL, 0);
> -	if (status) {
> -		dapl_dbg_log(DAPL_DBG_TYPE_ERR,
> -			     " dapl_ib_comm_established: ib_send_cm_rtu failed: %d cm_handle: %x\n",
> -			     status, ep_ptr->cm_handle);
> -		return DAT_ERROR(DAT_INSUFFICIENT_RESOURCES, 0);
> -	}
> -
> -	return dat_status;
> -}
> -
> /*
>  * ib_cm_get_remote_gid
>  */
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.c	(working copy)
> @@ -683,7 +683,7 @@ dapl_ib_mw_unbind(DAPL_RMR * rmr,
> 	mw_bind_prop.mw_access_flags = 0;
> 	mw_bind_prop.send_flags =
> 	    (is_signaled == DAT_TRUE) ? IB_SEND_SIGNALED : 0;
> -	mw_bind_prop.mr = 0;
> +	mw_bind_prop.mr = NULL;
> 	mw_bind_prop.wr_id = (u64) (uintptr_t) cookie;
> 	ib_status = ib_bind_mw(ep->qp_handle, rmr->mw_handle, &mw_bind_prop);
> 	if (ib_status < 0) {
> @@ -954,16 +954,6 @@ dapl_ib_get_async_event(ib_error_record_
> }
>
> DAT_RETURN
> -dapl_ib_ncompletion_notify(ib_hca_handle_t hca_handle,
> -			   ib_cq_handle_t cq_handle, DAT_COUNT num)
> -{
> -	int ib_status;
> -
> -	ib_status = ib_req_ncomp_notif(cq_handle, num);
> -	return dapl_ib_status_convert(ib_status);
> -}
> -
> -DAT_RETURN
> dapl_ib_get_hca_ids(ib_hca_handle_t hca, u8 port, union ib_gid * gid, u16 * lid)
> {
> 	int status;
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_timer_util.c	(working copy)
> @@ -52,7 +52,7 @@
> #include "dapl.h"
> #include "dapl_timer_util.h"
>
> -struct timer_head {
> +static struct timer_head {
> 	DAPL_LLIST_HEAD timer_list_head;
> 	spinlock_t lock;
> 	DAPL_OS_WAIT_OBJECT wait_object;
> @@ -63,7 +63,7 @@ typedef struct timer_head DAPL_TIMER_HEA
>
> void dapl_timer_thread(void *arg);
>
> -void dapl_timer_init()
> +void dapl_timer_init(void)
> {
> 	/*
> 	 * Set up the timer thread elements. The timer thread isn't
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_openib_util.h	(working copy)
> @@ -105,7 +105,7 @@ typedef struct ib_shm_transport {
> 	ib_mr_handle_t mr_handle;
> } ib_shm_transport_t;
>
> -#define 	 IB_INVALID_HANDLE	       0
> +#define 	 IB_INVALID_HANDLE	       NULL
>
> #define 	 IB_MAX_REQ_PDATA_SIZE	    92
> #define 	 IB_MAX_REP_PDATA_SIZE	    196
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_cr_accept.c	(working copy)
> @@ -190,7 +190,7 @@ dapl_cr_accept(DAT_CR_HANDLE cr_handle,
> 							   request_evd_handle,
> 							   DAT_CONNECTION_EVENT_ACCEPT_COMPLETION_ERROR,
> 							   (DAT_HANDLE) ep_ptr,
> -							   0, 0);
> +							   0, NULL);
>
> 			cr_ptr->header.magic = DAPL_MAGIC_CR_DESTROYED;
> 		} else {
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_rmr_util.c	(working copy)
> @@ -49,7 +49,7 @@ DAPL_RMR *dapl_rmr_alloc(DAPL_PZ * pz)
> 	rmr->header.handle_type = DAT_HANDLE_TYPE_RMR;
> 	rmr->header.owner_ia = pz->header.owner_ia;
> 	rmr->header.user_context.as_64 = 0;
> -	rmr->header.user_context.as_ptr = 0;
> +	rmr->header.user_context.as_ptr = NULL;
> 	dapl_llist_init_entry(&rmr->header.ia_list_entry);
> 	dapl_ia_link_rmr(rmr->header.owner_ia, rmr);
> 	spin_lock_init(&rmr->header.lock);
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_util.c	(working copy)
> @@ -368,7 +368,7 @@ void dapl_ep_timeout(uintptr_t arg)
> 	(void)dapl_evd_post_connection_event((DAPL_EVD *) ep_ptr->param.
> 					     connect_evd_handle,
> 					     DAT_CONNECTION_EVENT_TIMED_OUT,
> -					     (DAT_HANDLE) ep_ptr, 0, 0);
> +					     (DAT_HANDLE) ep_ptr, 0, NULL);
> }
>
> /*
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c	(working copy)
> @@ -358,7 +358,7 @@ void dapl_evd_eh_print_cqe(ib_work_compl
> 		"OP_COMP_AND_SWAP",
> 		"OP_FETCH_AND_ADD",
> 		"OP_BIND_MW",
> -		0
> +		NULL
> 	};
> 	dapl_dbg_log(DAPL_DBG_TYPE_CALLBACK,
> 		     "\t >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<\n");
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_ep_disconnect.c	(working copy)
> @@ -144,7 +144,7 @@ dapl_ep_disconnect(DAT_EP_HANDLE ep_hand
> 		evd_ptr = (DAPL_EVD *) ep_ptr->param.connect_evd_handle;
> 		dapl_evd_post_connection_event(evd_ptr,
> 					       DAT_CONNECTION_EVENT_DISCONNECTED,
> -					       (DAT_HANDLE) ep_ptr, 0, 0);
> +					       (DAT_HANDLE) ep_ptr, 0, NULL);
> 		dat_status = DAT_SUCCESS;
> 		goto bail;
> 	}
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_hash.c	(working copy)
> @@ -145,7 +145,7 @@ dapl_hash_rehash(DAPL_HASH_ELEM * elemen
> 			return;
> 		}
> 	}
> -	*head = 0;
> +	*head = NULL;
> }
>
> /*
> @@ -209,7 +209,7 @@ dapl_hash_add(DAPL_HASH_TABLEP p_table,
> 		 */
> 		p_table->table[hashValue].key = key;
> 		p_table->table[hashValue].datum = datum;
> -		p_table->table[hashValue].next_element = 0;
> +		p_table->table[hashValue].next_element = NULL;
> 		p_table->num_entries++;
> 		status = DAT_TRUE;
> 	} else {
> @@ -222,7 +222,7 @@ dapl_hash_add(DAPL_HASH_TABLEP p_table,
> 			DAPL_HASH_ELEM *lastelement;
> 			newelement->key = key;
> 			newelement->datum = datum;
> -			newelement->next_element = 0;
> +			newelement->next_element = NULL;
> 			for (lastelement = &p_table->table[hashValue];
> 			     lastelement->next_element;
> 			     lastelement = lastelement->next_element) {
> @@ -354,7 +354,7 @@ DAT_RETURN dapl_hash_create(DAT_COUNT ta
> 	for (i = 0; i < table_size; i++) {
> 		p_table->table[i].datum = NO_DATUM_VALUE;
> 		p_table->table[i].key = 0;
> -		p_table->table[i].next_element = 0;
> +		p_table->table[i].next_element = NULL;
> 	}
>
> 	*pp_table = p_table;
> Index: gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat-provider/dapl_llist.c	(working copy)
> @@ -71,7 +71,7 @@ void dapl_llist_init_entry(DAPL_LLIST_EN
> {
> 	entry->blink = NULL;
> 	entry->flink = NULL;
> -	entry->data = 0;
> +	entry->data = NULL;
> 	entry->list_head = NULL;
> }
>
> Index: gen2/users/jlentini/linux-kernel/patches/at.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/patches/at.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/patches/at.c	(working copy)
> @@ -118,7 +118,7 @@ struct async {
> 	int sa_id;
> };
>
> -struct async pending_reqs;	/* dummy head for cyclic list */
> +static struct async pending_reqs;	/* dummy head for cyclic list */
>
> struct ib_at_src {
> 	u32 ip;
> @@ -320,7 +320,7 @@ static void req_free(struct async *pend)
>
> 	pend->status = IB_AT_STATUS_INVALID;
> 	pend->type = IBAT_REQ_NONE;
> -	pend->sa_query = 0;
> +	pend->sa_query = NULL;
> }
>
> static int req_start(struct async *q, struct async *pend,
> @@ -336,7 +336,7 @@ static int req_start(struct async *q, st
>
> 	if (parent) {
> 		DEBUG("wait on parent %p", parent);
> -		pend->next = pend->prev = 0;
> +		pend->next = pend->prev = NULL;
> 		pend->parent = parent;
> 		pend->waiting = parent->waiting;
> 		parent->waiting = pend;
> @@ -344,8 +344,8 @@ static int req_start(struct async *q, st
> 		return 0;	/* waiting on other request */
> 	}
>
> -	pend->waiting = 0;
> -	pend->parent = 0;
> +	pend->waiting = NULL;
> +	pend->parent = NULL;
>
> 	DEBUG("link to pending list %p", q);
> 	pend->next = q;
> @@ -396,7 +396,7 @@ static void req_end(struct async *pend,
> 		if (!*rr)
> 			WARN("pending request not found in parent request!");
>
> -		pend->waiting = 0;
> +		pend->waiting = NULL;
> 		DEBUG("child %p removed from parent %p list",
> 			pend, pend->parent);
> 	}
> @@ -405,10 +405,10 @@ static void req_end(struct async *pend,
> 		DEBUG("pend %p ending child req %p", pend, waiting);
> 		pend->waiting = waiting->waiting;
>
> -		waiting->waiting = 0;
> -		waiting->parent = 0;
> +		waiting->waiting = NULL;
> +		waiting->parent = NULL;
>
> -		req_end(waiting, nrec, 0);
> +		req_end(waiting, nrec, NULL);
> 	}
>
> 	if (pend->next) {
> @@ -483,7 +483,7 @@ static struct async *lookup_pending(stru
> 			break;
>
> 	spin_unlock_irqrestore(&q->lock, flags);
> -	return a == q ? 0 : a;
> +	return a == q ? NULL : a;
> }
>
> static struct async *lookup_req_id(struct async *q, u64 id)
> @@ -498,7 +498,7 @@ static struct async *lookup_req_id(struc
> 			break;
>
> 	spin_unlock_irqrestore(&q->lock, flags);
> -	return a == q ? 0 : a;
> +	return a == q ? NULL : a;
> }
>
> static void flush_pending(struct async *q)
> @@ -509,7 +509,7 @@ static void flush_pending(struct async *
> 	DEBUG("flushing pending q %p", q);
> 	spin_lock_irqsave(&q->lock, flags);
> 	while ((a = q->next) != q)
> -		req_end(a, -EINTR, 0);
> +		req_end(a, -EINTR, NULL);
> 	spin_unlock_irqrestore(&q->lock, flags);
> }
>
> @@ -561,7 +561,7 @@ route_req_complete(struct route_req *req
> 	for (pend = req->pend.waiting; pend; pend = pend->waiting)
> 		route_req_output(req, pend->data);
>
> -	req_end(&req->pend, 1, 0);
> +	req_end(&req->pend, 1, NULL);
> }
>
> static void
> @@ -587,7 +587,7 @@ path_req_complete(int status, struct ib_
> 		return;
> 	}
>
> -	req->pend.sa_query = 0;
> +	req->pend.sa_query = NULL;
>
> 	req->pend.nelem = path_req_output(req, resp, 1,
> 					  req->pend.data, req->pend.nelem);
> @@ -597,7 +597,7 @@ path_req_complete(int status, struct ib_
> 		pend->nelem = path_req_output(req, resp, 1,
> 					      pend->data, pend->nelem);
>
> -	req_end(&req->pend, req->pend.nelem, 0);
> +	req_end(&req->pend, req->pend.nelem, NULL);
> 	spin_unlock_irqrestore(&pending_reqs.lock, flags);
> }
>
> @@ -624,7 +624,7 @@ static void ib_at_sweep(void *data)
> 			     (req->dst_ip & 0xff000000) >> 24,
> 			     jiffies, pend->start);
>
> -			req_end(pend, -ETIMEDOUT, 0);
> +			req_end(pend, -ETIMEDOUT, NULL);
> 		}
> 	}
>
> @@ -902,7 +902,7 @@ int ib_at_cancel(u64 req_id)
>
> 	/* Promote first child to be pending req */
> 	if ((child = a->waiting)) {
> -		child->parent = 0;
> +		child->parent = NULL;
>
> 		/* link child after parent in pending list */
> 		child->next = a->next;
> @@ -910,10 +910,10 @@ int ib_at_cancel(u64 req_id)
> 		a->next->prev = child;
> 		a->next = child;
>
> -		a->waiting = 0;		/* clear to avoid cancelling childs */
> +		a->waiting = NULL;	/* clear to avoid cancelling childs */
> 	}
>
> -	req_end(a, -EINTR, 0);
> +	req_end(a, -EINTR, NULL);
>
> 	spin_unlock_irqrestore(&pending_reqs.lock, flags);
>
> Index: gen2/users/jlentini/linux-kernel/dat/dr.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat/dr.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat/dr.c	(working copy)
> @@ -86,7 +86,7 @@ DAT_RETURN dat_dr_fini(void)
>  * Function: dat_dr_insert
>  ************************************************************************/
>
> -extern DAT_RETURN
> +DAT_RETURN
> dat_dr_insert(const DAT_PROVIDER_INFO * info, DAT_DR_ENTRY * entry)
> {
> 	DAT_RETURN status;
> @@ -134,7 +134,7 @@ dat_dr_insert(const DAT_PROVIDER_INFO *
>  * Function: dat_dr_remove
>  ************************************************************************/
>
> -extern DAT_RETURN dat_dr_remove(const DAT_PROVIDER_INFO * info)
> +DAT_RETURN dat_dr_remove(const DAT_PROVIDER_INFO * info)
> {
> 	DAT_DR_ENTRY *data;
> 	DAT_DICTIONARY_ENTRY dict_entry;
> @@ -180,7 +180,7 @@ extern DAT_RETURN dat_dr_remove(const DA
>  * Function: dat_dr_provider_open
>  ************************************************************************/
>
> -extern DAT_RETURN
> +DAT_RETURN
> dat_dr_provider_open(const DAT_PROVIDER_INFO * info,
> 		     DAT_IA_OPEN_FUNC * p_ia_open_func)
> {
> @@ -206,7 +206,7 @@ dat_dr_provider_open(const DAT_PROVIDER_
>  * Function: dat_dr_provider_close
>  ************************************************************************/
>
> -extern DAT_RETURN dat_dr_provider_close(const DAT_PROVIDER_INFO * info)
> +DAT_RETURN dat_dr_provider_close(const DAT_PROVIDER_INFO * info)
> {
> 	DAT_RETURN status;
> 	DAT_DR_ENTRY *data;
> Index: gen2/users/jlentini/linux-kernel/dat/core.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat/core.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat/core.c	(working copy)
> @@ -77,7 +77,7 @@ static DAT_MODULE_STATE g_module_state =
>
> static DAT_DBG_CLASS g_dbg_class = DAT_DBG_CLASS_ERROR;
>
> -MODULE_PARM(g_dbg_class, "i");
> +module_param(g_dbg_class, int, 0644);
> MODULE_PARM_DESC(g_dbg_class,
> 		 "Bit mask to specify class of DAT debug messages.");
>
> Index: gen2/users/jlentini/linux-kernel/dat/consumer.c
> ===================================================================
> --- gen2/users/jlentini/linux-kernel/dat/consumer.c	(revision 2219)
> +++ gen2/users/jlentini/linux-kernel/dat/consumer.c	(working copy)
> @@ -48,7 +48,7 @@
>  *
>  ***********************************************************************/
>
> -DAT_RETURN dat_strerror_major(DAT_RETURN value, const char **message)
> +static DAT_RETURN dat_strerror_major(DAT_RETURN value, const char **message)
> {
> 	switch (DAT_GET_TYPE(value)) {
> 	case DAT_SUCCESS:
> @@ -168,7 +168,7 @@ DAT_RETURN dat_strerror_major(DAT_RETURN
>  * Function: dat_strerror_minor
>  *
>  ***********************************************************************/
> -DAT_RETURN dat_strerror_minor(DAT_RETURN value, const char **message)
> +static DAT_RETURN dat_strerror_minor(DAT_RETURN value, const char **message)
> {
> 	switch (DAT_GET_SUBTYPE(value)) {
>
> @@ -1480,17 +1480,6 @@ DAT_RETURN dat_cr_handoff(DAT_CR_HANDLE
> 	return DAT_CR_HANDOFF(cr_handle, handoff);
> }
>
> -DAT_RETURN dat_evd_kquery(DAT_EVD_HANDLE evd_handle,
> -			  DAT_EVD_PARAM_MASK evd_param_mask,
> -			  DAT_EVD_PARAM * evd_param)
> -{
> -	if (evd_handle == NULL) {
> -		return DAT_ERROR(DAT_INVALID_HANDLE,
> -				 DAT_INVALID_HANDLE_EVD_REQUEST);
> -	}
> -	return DAT_EVD_QUERY(evd_handle, evd_param_mask, evd_param);
> -}
> -
> DAT_RETURN dat_lmr_query(DAT_LMR_HANDLE lmr_handle,
> 			 DAT_LMR_PARAM_MASK lmv_param_mask,
> 			 DAT_LMR_PARAM * lmr_param)
>


From libor at topspin.com  Fri Apr 29 10:04:25 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 29 Apr 2005 10:04:25 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <469958e005042908566f177b50@mail.gmail.com>;
	from caitlin.bestler@gmail.com on Fri, Apr 29, 2005 at 08:56:20AM
	-0700
References: <20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
Message-ID: <20050429100425.A13041@topspin.com>

On Fri, Apr 29, 2005 at 08:56:20AM -0700, Caitlin Bestler wrote:
> On 4/29/05, Bill Jordan <woodennickel at gmail.com> wrote:
> > On 4/26/05, Andrew Morton <akpm at osdl.org> wrote:
> > 
> > > Our point is that contemporary microprocessors cannot electrically
> > > do what you want them to do!
> > >
> > > Now, conceeeeeeiveably the kernel could keep track of the state of the
> > > pages down to the byte level, and could keep track of all COWed pages and
> > > could look at faulting addresses at the byte level and could copy sub-page
> > > ranges by hand from one process's address space into another process's
> > > after I/O completion.  I don't think we want to do that.
> > >
> > > Methinks your specification is busted.
> > 
> > I agree in principle. However, I expect this issue will come up with
> > more and more new specifications, and if it isn't addressed once in
> > the linux kernel, it will be kludged and broken many times in many
> > drivers.
> > 
> > I believe we need a kernel-level interface that will pin user pages,
> > and lock the user vma in a single step. The interface should be used
> > by drivers when the hardware mappings are done. If the process is
> > split into a user operation to lock the memory, and a driver operation
> > to map the hardware, there will always be opportunity for abuse.
> > 
> > Reference counting needs to be done by this interface to allow
> > different hardware to interoperate.
> > 
> > The interface can't overload the VM_LOCKED flag, or rely on any other
> > attributes that the user can tinker with via any other interface.
> > 
> > And as much as I hate to admit it, I think on a fork, we will need to
> > copy parts of pages at the beginning or end of user I/O buffers.
> > 
> 
> I agree with all but the last part, in my opinion there is no need to deal
> with fork issues as long as solutions do not result in failures. There is
> *no* basis for a child process to expect that it will inherit RDMA resources.
> A child process that uses such resources will get undefined results, nothing
> further needs to be stated, and no heroic efforts are required to avoid them.

  However, you have a potential problem with registered buffers that
do not begin or end on a page boundary, which is common with malloc.
If the buffer resides on a portion of a page, and you mark the vm
which contains that entire page VM_DONTCOPY, to ensure that the parent
has access to the exact physical page after the fork, the child will
not be able to access anything on that entire page. So if the child
expects to access data on the same page that happens to contain the
registered buffer it will get a segmentation violation.

The four situations we've discussed are:

  1) Physical page does not get used for anything else.
  2) Process's virtual to physical mapping remains fixed.
  3) Same virtual to physical mapping after forking a child.
  4) Forked child has access to all non-registered memory of
     the parent.

The first two are now taken care of with get_user_pages (we used to
use VM_LOCKED for the second case), the third case is handled by setting
the vm to VM_DONTCOPY, and on the fourth case we've always punted,
but the real answer is to break partial pages into separate vms and
mark them ALWAYS_COPY.
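
To make the partial-page case concrete, here is a minimal userspace
sketch (not kernel code) that works out which pages a registered buffer
touches and how many unregistered bytes share its first and last page.
The struct and function names are made up for illustration, and the
page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a registered buffer overlaps its pages. */
struct reg_span {
	uintptr_t first_page;	/* address of the first page touched */
	uintptr_t end_page;	/* address just past the last page touched */
	size_t head_slack;	/* unregistered bytes before the buffer on its first page */
	size_t tail_slack;	/* unregistered bytes after the buffer on its last page */
};

/* Compute the page-aligned span covering [addr, addr + len). */
static struct reg_span reg_span_of(uintptr_t addr, size_t len, size_t page_size)
{
	struct reg_span s;
	uintptr_t mask = (uintptr_t)page_size - 1;
	uintptr_t end = addr + len;

	assert(len > 0);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	s.first_page = addr & ~mask;		/* round start down */
	s.end_page = (end + mask) & ~mask;	/* round end up */
	s.head_slack = (size_t)(addr - s.first_page);
	s.tail_slack = (size_t)(s.end_page - end);
	return s;
}
```

Any non-zero head_slack or tail_slack is memory the child would lose
access to if the whole span were marked VM_DONTCOPY, which is the
fourth situation in the list.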

-Libor


From libor at topspin.com  Fri Apr 29 10:23:58 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 29 Apr 2005 10:23:58 -0700
Subject: RDMA memory registration (was: [openib-general] Re:
	[PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
In-Reply-To: <52d5sdjzup.fsf_-_@topspin.com>;
	from roland@topspin.com on Fri, Apr 29, 2005 at 09:45:50AM -0700
References: <52wtqpsgff.fsf@topspin.com> <20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com>
Message-ID: <20050429102358.B13041@topspin.com>

On Fri, Apr 29, 2005 at 09:45:50AM -0700, Roland Dreier wrote:
> Is there anything wrong with the following plan?
> 
> 1) For memory registration, use get_user_pages() in the kernel.  Use
>    locked_vm and RLIMIT_MEMLOCK to limit the amount of memory pinned
>    by a given process.  One disadvantage of this is that the
>    accounting will overestimate the amount of pinned memory if a
>    process pins the same page twice, but this doesn't seem that bad to
>    me -- it errs on the side of safety.

  I think the overestimate will be fine in practice. If a process is
locking a lot of memory it will most likely be in big chunks, so not
much page overlap there. If the process is locking lots of tiny buffers
with lots of page overlap, the total locked amount will most likely be
small. Although it is odd that you could end up with a total locked
amount larger than the number of physical pages in the system...
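
A small sketch of the accounting in question, assuming each
registration is simply charged for every page it touches with no
de-duplication across registrations. The function name is hypothetical
and this is plain userspace arithmetic, not the kernel's locked_vm code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pages charged for one registration: every page the buffer touches. */
static size_t pages_charged(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t first, last;

	assert(len > 0);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	first = addr / page_size;		/* index of first page touched */
	last = (addr + len - 1) / page_size;	/* index of last page touched */
	return (size_t)(last - first + 1);
}
```

With 4096-byte pages, two 64-byte buffers at 0x1000 and 0x1040 are each
charged one page, so the process is charged two pages while only one
physical page is pinned. That is the overestimate being discussed.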

> 2) For fork() support:
> 
>    a) Extend mprotect() with PROT_DONTCOPY so processes can avoid
>       copy-on-write problems.
> 
>    b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect()
>       with PROT_ALWAYSCOPY so processes can mark pages to be
>       pre-copied into child processes, to handle the case where only
>       half a page is registered.
> 
> I believe this puts the code that must be trusted into the kernel and
> gives userspace primitives that let apps handle the rest.

  I'm assuming that for libibverbs memory registration you plan on hiding
the mprotect in the library? Without reference counting at the kernel
level this could yield unexpected results in a perfectly legitimate app.

  For example if the app is managing a buffer it will pass to another
device, but also wants to move data in/out with RDMA hardware, the user
marks it themselves with DONTCOPY, registers with libibverbs, performs
IO, unregisters with libibverbs. At this point the user expects the buffer
to have DONTCOPY set, but it does not because of the unregister... Not
that it's likely, but it's a valid thing to do. However, since I don't
have a better suggestion, I'm in favour of using mprotect as you outlined. 
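
For what it's worth, the kind of reference counting I have in mind
could look roughly like the sketch below. This is purely hypothetical
bookkeeping, not an existing kernel or libibverbs interface: whoever
takes the first reference sets DONTCOPY, and only whoever drops the
last one clears it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-range bookkeeping; not an existing kernel or libibverbs structure. */
struct dontcopy_ref {
	unsigned int users;	/* how many parties currently want DONTCOPY set */
};

/* Returns true when the caller is the first user and must set the flag. */
static bool dontcopy_get(struct dontcopy_ref *ref)
{
	return ref->users++ == 0;
}

/* Returns true when the caller was the last user and may clear the flag. */
static bool dontcopy_put(struct dontcopy_ref *ref)
{
	assert(ref->users > 0);
	return --ref->users == 0;
}
```

In the scenario above the app's own marking takes the first reference,
the library's register and unregister take and drop a second one, and
the flag stays set until the app itself drops its reference.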


-Libor


From tduffy at sun.com  Fri Apr 29 10:29:39 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 29 Apr 2005 10:29:39 -0700
Subject: [openib-general] Re: [PATCH][DAPL] make dapl build outside of
	kernel tree
In-Reply-To: <Pine.LNX.4.61.0504291239120.5321@jlentini-linux.nane.netapp.com>
References: <1114633684.20016.8.camel@duffman>
	<Pine.LNX.4.61.0504291239120.5321@jlentini-linux.nane.netapp.com>
Message-ID: <1114795779.24949.3.camel@duffman>

On Fri, 2005-04-29 at 12:43 -0400, James Lentini wrote:
> Tom,
> 
> Did you try building using the method described in the README file?

No.

> svn/gen2/users/jlentini/linux-kernel/README
> 
> The procedure there has been working for me. I'd rather continue using 
> it than change. I like the idea that the build setup is exactly as it 
> would be if it were part of the trunk.

Once the code is integrated into trunk, it should follow the in kernel
method.  Until then, it should build outside of the tree.

I think the Makefiles in your branch should build outside the tree.  If
you want Makefiles that build inside the tree, provide those as a patch.

-tduffy

From tduffy at sun.com  Fri Apr 29 10:32:38 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 29 Apr 2005 10:32:38 -0700
Subject: [openib-general] [DAPL] ran kdapl test, got slab corruption
In-Reply-To: <1114770000.4477.850.camel@localhost.localdomain>
References: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
	<1114727483.25364.7.camel@duffman>
	<1114770000.4477.850.camel@localhost.localdomain>
Message-ID: <1114795958.24949.7.camel@duffman>

On Fri, 2005-04-29 at 06:31 -0400, Hal Rosenstock wrote:
> I don't need to do this on x86_64. Not sure why. Could this be due to a
> compiler difference ? I am using gcc version 3.4.2 20041017 (Red Hat
> 3.4.2-6.fc3). I also build "in tree" rather than out of tree.

I am building out of tree, using gcc 4.0 on a 2.6.12-rc3 based tree.

> Do you know if the DREQ actually has been sent on IB ?

How would I verify this?

> Is this associated with the above (quit test) ?

yes

> Can this be reproduced ?

I will try again.  You can try by turning on SLAB debugging in the
kernel build.

> I will inspect the code to see how this could occur. Is it between 2
> x86_64 machines ?

Yes, back to back running opensm on one of the nodes.

-tduffy

From tduffy at sun.com  Fri Apr 29 10:36:17 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 29 Apr 2005 10:36:17 -0700
Subject: [openib-general] Re: [PATCH][DAPL] Fix sparse warnings on dapl
	builds
In-Reply-To: <Pine.LNX.4.61.0504291249400.5321@jlentini-linux.nane.netapp.com>
References: <1114635201.20016.13.camel@duffman>
	<Pine.LNX.4.61.0504291249400.5321@jlentini-linux.nane.netapp.com>
Message-ID: <1114796177.24949.10.camel@duffman>

On Fri, 2005-04-29 at 12:52 -0400, James Lentini wrote:
> Tom,
> 
> How did you produce these errors?

You will have to download the sparse checker and run with

$ make C=1

-tduffy

From roland at topspin.com  Fri Apr 29 10:38:11 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 10:38:11 -0700
Subject: [openib-general] [DAPL] ran kdapl test, got slab corruption
In-Reply-To: <1114795958.24949.7.camel@duffman> (Tom Duffy's message of
	"Fri, 29 Apr 2005 10:32:38 -0700")
References: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
	<1114727483.25364.7.camel@duffman>
	<1114770000.4477.850.camel@localhost.localdomain>
	<1114795958.24949.7.camel@duffman>
Message-ID: <527jiljxfg.fsf@topspin.com>

    Tom> I am building out of tree, using gcc 4.0 on a 2.6.12-rc3
    Tom> based tree.

Brave man... I assume you have the fix for
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21173
applied to your compiler?

 - R.


From iod00d at hp.com  Fri Apr 29 10:56:50 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 29 Apr 2005 10:56:50 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <52ll71k09w.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
	<52pswdk5au.fsf@topspin.com>
	<20050429162127.GA24871@esmail.cup.hp.com>
	<52ll71k09w.fsf@topspin.com>
Message-ID: <20050429175650.GG24871@esmail.cup.hp.com>

On Fri, Apr 29, 2005 at 09:36:43AM -0700, Roland Dreier wrote:
>     Grant> Users of other distro's I think are going to be just as
>     Grant> confused.  Can the "no driver for uverbs0" error message
>     Grant> indicate the *userspace* driver is missing?  And indicate
>     Grant> where it's looking?
> 
> I just updated the code so the error messages are:
> 
>     libibverbs: Fatal: couldn't open sysfs class 'infiniband_verbs'.
> 
>     libibverbs: Warning: no userspace device-specific driver found for uverbs0
>             driver search path: /usr/lib/infiniband:/usr/local/lib/infiniband

Much nicer. Thanks!

>     Grant> But it's still not working: yprf3:~#
>     Grant> OPENIB_DRIVER_PATH=/usr/lib/infiniband/ ibv_pingpong
>     Grant> Couldn't get context for mthca0
> 
> Does the kernel driver match the userspace code -- ie are they both
> the latest subversion?

hrm...I'm using r2229 for userspace...and looks like r2225
for kernel. Let me sync up and report back again.

> I haven't started being anal about bumping the
> ABI version when changing the interface, so it's possible to get out
> of sync with this development code.

That's fine until this openib.org code gets into kernel.org.
For development you can break it as much as you need to.

However, it's painful for distro's every time some ABI in
kernel.org breaks....they won't throw rotten tomatoes at you at
OLS or LWE 2006 if you don't break the ABI more than once a year.
:^)

thanks,
grant


From Brice.Goglin at ens-lyon.org  Fri Apr 29 11:22:24 2005
From: Brice.Goglin at ens-lyon.org (Brice Goglin)
Date: Fri, 29 Apr 2005 20:22:24 +0200
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <52d5sdjzup.fsf_-_@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>	<20050425173757.1dbab90b.akpm@osdl.org>
	<52wtqpsgff.fsf@topspin.com>	<20050426084234.A10366@topspin.com>
	<52mzrlsflu.fsf@topspin.com>	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com>	<426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>	<5ebee0d105042907265ff58a73@mail.gmail.com>	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com>
Message-ID: <42727B60.7010507@ens-lyon.org>

Roland Dreier a écrit :
> 2) For fork() support:
> 
>    a) Extend mprotect() with PROT_DONTCOPY so processes can avoid
>       copy-on-write problems.
> 
>    b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect()
>       with PROT_ALWAYSCOPY so processes can mark pages to be
>       pre-copied into child processes, to handle the case where only
>       half a page is registered.
> 
> I believe this puts the code that must be trusted into the kernel and
> gives userspace primitives that let apps handle the rest.

Do you plan to work with David Addison from Quadrics ?
For sure, your hardware has very different capabilities.
But ioproc_ops is a really nice solution and might help a lot
when dealing with deregistration and fork.

For instance, instead of adding PROT_DONT/ALWAYSCOPY, you may use
an ioproc hook in the fork path. This hook (a function in your driver)
would be called for each registered page. It will decide whether
the page should be pre-copied or not and update the registration
table (or whatever stores address translations in the NIC).
In addition, the driver would probably pre-copy cow pages when
registering them.

It's nice to see these two works coming to LKML at the same time.
It would be great if we could merge them and get a generic solution
that's suitable to both registration based cards (IB/Myri/Ammasso)
and MMU-based cards (Quadrics).

Brice


From roland at topspin.com  Fri Apr 29 11:28:02 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 11:28:02 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <20050429175650.GG24871@esmail.cup.hp.com> (Grant Grundler's
	message of "Fri, 29 Apr 2005 10:56:50 -0700")
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
	<52pswdk5au.fsf@topspin.com>
	<20050429162127.GA24871@esmail.cup.hp.com>
	<52ll71k09w.fsf@topspin.com>
	<20050429175650.GG24871@esmail.cup.hp.com>
Message-ID: <523bt9jv4d.fsf@topspin.com>

    Grant> hrm...I'm using r2229 for userspace...and looks like r2225
    Grant> for kernel. Let me sync up and report back again.

Hmm... those two revs don't seem to have any ABI changes between
them.  If it's still breaking, can you send me the output from running
the test with "strace -ewrite=all"?

    Grant> That's fine until this openib.org code gets into
    Grant> kernel.org.  For development you can break it as much as
    Grant> you need to.

    Grant> However, it's painful for distro's every time some ABI in
    Grant> kernel.org breaks....they won't throw rotten tomatoes at
    Grant> you at OLS or LWE 2006 if you don't break the ABI more than
    Grant> once a year.  :^)

Understood -- once this code is merged then I'll be much more careful.

 - R.


From roland at topspin.com  Fri Apr 29 11:31:35 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 11:31:35 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <42727B60.7010507@ens-lyon.org> (Brice Goglin's message of
	"Fri, 29 Apr 2005 20:22:24 +0200")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050425173757.1dbab90b.akpm@osdl.org> <52wtqpsgff.fsf@topspin.com>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
Message-ID: <52y8b1ige0.fsf@topspin.com>

    Brice> Do you plan to work with David Addison from Quadrics ?  For
    Brice> sure, your hardware has very different capabilities.  But
    Brice> ioproc_ops is a really nice solution and might help a lot
    Brice> when dealing with deregistration and fork.

I'm following the discussion with interest.  Some hardware (eg
Mellanox HCAs) has the ability to use these hooks to avoid pinning
pages at all, but in general IB and iWARP need to pin pages so the
mapping doesn't change.

    Brice> For instance, instead of adding PROT_DONT/ALWAYSCOPY, you
    Brice> may use an ioproc hook in the fork path. This hook (a
    Brice> function in your driver) would be called for each
    Brice> registered page. It will decide whether the page should be
    Brice> pre-copied or not and update the registration table (or
    Brice> whatever stores address translations in the NIC).  In
    Brice> addition, the driver would probably pre-copy cow pages when
    Brice> registering them.

This sort of monkeying around with the VM from driver code seems much
more complicated than letting userspace handle it.

 - R.


From iod00d at hp.com  Fri Apr 29 11:38:46 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 29 Apr 2005 11:38:46 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <52ll71k09w.fsf@topspin.com>
References: <1AC79F16F5C5284499BB9591B33D6F00043B37C9@orsmsx408>
	<20050429051009.GT20957@esmail.cup.hp.com>
	<52pswdk5au.fsf@topspin.com>
	<20050429162127.GA24871@esmail.cup.hp.com>
	<52ll71k09w.fsf@topspin.com>
Message-ID: <20050429183846.GH24871@esmail.cup.hp.com>

On Fri, Apr 29, 2005 at 09:36:43AM -0700, Roland Dreier wrote:
>     Grant> But it's still not working: yprf3:~#
>     Grant> OPENIB_DRIVER_PATH=/usr/lib/infiniband/ ibv_pingpong
>     Grant> Couldn't get context for mthca0
> 
> Does the kernel driver match the userspace code -- ie are they both
> the latest subversion?

The only diff between r2225 and r2229 was this:
grundler at gsyprf3:/usr/src/linux-2.6/drivers/infiniband-r2229$ svn up -r 2229
U  core/cm.c
Updated to revision 2229.

That was it...I now get:
gsyprf3:/usr/src/linux-2.6# ibv_pingpong 
local address:  LID 0x000b, QPN 0x000406, PSN 0x5e48e6


And the other box, ionize, was ok (both using r2229). My bad.
I reloaded the kernel modules in ionize again just to be sure.

And WOOT! It Works. :^)

Reminder, I'm using:
	kernel.org 2.6.11, openib.org r2229, gcc 3.3.5 (Debian 1:3.3.5-12),
	rx2600 (ZX1 chipset), 1.5Ghz Madisons,
	MT23108 (PCI-X, Cougar) in dual rope slot.

ionize:/usr/src/openib_gen2/src/userspace# ibv_pingpong 10.0.0.51
  local address:  LID 0x000d, QPN 0x040406, PSN 0xfe84d1
  remote address: LID 0x000b, QPN 0x000406, PSN 0x5e48e6
8192000 bytes in 0.04 seconds = 1588.06 Mbit/sec
1000 iters in 0.04 seconds = 41.27 usec/iter

gsyprf3:/usr/src/linux-2.6# ibv_pingpong 
  local address:  LID 0x000b, QPN 0x000406, PSN 0x5e48e6
  remote address: LID 0x000d, QPN 0x040406, PSN 0xfe84d1
8192000 bytes in 0.04 seconds = 1592.19 Mbit/sec
1000 iters in 0.04 seconds = 41.16 usec/iter


And again with more iterations:
ionize:/usr/src/openib_gen2/src/userspace# ibv_pingpong -n 100000 10.0.0.51
  local address:  LID 0x000d, QPN 0x050406, PSN 0x5bf9ea
  remote address: LID 0x000b, QPN 0x010406, PSN 0x1e8764
819200000 bytes in 4.09 seconds = 1600.59 Mbit/sec
100000 iters in 4.09 seconds = 40.94 usec/iter


And a few more just for fun:

ionize:~# ibv_pingpong -s 128 -n 100000 10.0.0.51
  local address:  LID 0x000d, QPN 0x060406, PSN 0xbed9c7
  remote address: LID 0x000b, QPN 0x020406, PSN 0xf4b9db
25600000 bytes in 1.69 seconds = 120.95 Mbit/sec
100000 iters in 1.69 seconds = 16.93 usec/iter
ionize:~# ibv_pingpong -s 64 -n 100000 10.0.0.51
  local address:  LID 0x000d, QPN 0x070406, PSN 0x972e9e
  remote address: LID 0x000b, QPN 0x030406, PSN 0x3c7543
12800000 bytes in 1.65 seconds = 62.10 Mbit/sec
100000 iters in 1.65 seconds = 16.49 usec/iter

ionize:~# ibv_pingpong -s 16384 -n 10000 10.0.0.51
  local address:  LID 0x000d, QPN 0x080406, PSN 0xbe0533
  remote address: LID 0x000b, QPN 0x040406, PSN 0xca01ea
327680000 bytes in 1.16 seconds = 2267.26 Mbit/sec
10000 iters in 1.16 seconds = 115.62 usec/iter
ionize:~# ibv_pingpong -s 32768 -n 10000
  local address:  LID 0x000d, QPN 0x0a0406, PSN 0x7fc006
  remote address: LID 0x000b, QPN 0x060406, PSN 0xfdb40d
655360000 bytes in 1.51 seconds = 3471.72 Mbit/sec
10000 iters in 1.51 seconds = 151.02 usec/iter
ionize:~# ibv_pingpong -s 65536 -n 10000
  local address:  LID 0x000d, QPN 0x0b0406, PSN 0xd79a37
  remote address: LID 0x000b, QPN 0x070406, PSN 0x940ac7
1310720000 bytes in 2.84 seconds = 3691.26 Mbit/sec
10000 iters in 2.84 seconds = 284.07 usec/iter
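
For anyone checking the numbers: the byte counts printed above are
consistent with message size x iterations x 2 (one message in each
direction per iteration), and the 8192000-byte runs imply a 4096-byte
default size. A quick sketch of the arithmetic follows. The function
name is made up here, and the result only approximately matches the
printed Mbit/sec because the printed usec/iter is rounded.

```c
#include <assert.h>

/*
 * Mbit/sec for a ping-pong run, assuming each iteration moves msg_size
 * bytes in each direction (so 2 * msg_size bytes per iteration).
 */
static double pingpong_mbit_per_sec(unsigned long msg_size,
				    unsigned long iters,
				    double usec_per_iter)
{
	double bytes, usec;

	assert(iters > 0);
	assert(usec_per_iter > 0.0);

	bytes = (double)msg_size * (double)iters * 2.0;
	usec = usec_per_iter * (double)iters;
	return bytes * 8.0 / usec;	/* bits per microsecond == Mbit/sec */
}
```

For example, 128-byte messages at 16.93 usec/iter work out to roughly
121 Mbit/sec, in line with the 120.95 Mbit/sec reported for that run.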

thanks,
grant


From halr at voltaire.com  Fri Apr 29 11:47:09 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Apr 2005 14:47:09 -0400
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <427260DE.6090109@ichips.intel.com>
References: <1114782148.4477.888.camel@localhost.localdomain>
	<427260DE.6090109@ichips.intel.com>
Message-ID: <1114800429.4477.923.camel@localhost.localdomain>

On Fri, 2005-04-29 at 12:29, Sean Hefty wrote: 
> > 2. Since RMPP supports streaming (I know OpenIB doesn't do this on
> > transmit), what does the receiver do with an incoming RMPP stream whose
> > PayloadLength is not specified in the FIRST DATA packet ? (I think this
> > may be needed for interoperability with third party RMPP implementations
> > which do this).
> 
> The RMPP receive code doesn't use the FIRST payload length when 
> receiving data.  It looks for the LAST bit to be set in the incoming 
> data MAD.

OK. That avoids the issue of the inconsistent FIRST length.

Wouldn't the FIRST length be useful as a hint for buffer size if present
?

> Also, if you can think of a way to support this on the send side, 
> you/I/someone could add this.  I think that this would require 
> extending the send MAD API.

This being supporting the PayloadLength in the first DATA packet of a
send ?

> > 3. How does the RMPP receiver handle discrepancies between the FIRST
> > DATA PayloadLength and the LAST DATA PayloadLength ?
> 
> Currently it doesn't.  The payload length in the LAST data packet is 
> used to calculate the padding for the total MAD length.  See 
> get_mad_len().  The only check that's done is to ensure that the 
> calculated padding is smaller than the MAD's data size.

This seems safer and avoids the inconsistency issue.

The only issue seems to be if we wanted to support a peek function to
know the buffer to obtain before copying it.

-- Hal


From tduffy at sun.com  Fri Apr 29 11:51:03 2005
From: tduffy at sun.com (Tom Duffy)
Date: Fri, 29 Apr 2005 11:51:03 -0700
Subject: [openib-general] [DAPL] ran kdapl test, got slab corruption
In-Reply-To: <527jiljxfg.fsf@topspin.com>
References: <91DB792C7985D411BEC300B40080D29CC359F9@mtvex01.mtv.mtl.com>
	<1114727483.25364.7.camel@duffman>
	<1114770000.4477.850.camel@localhost.localdomain>
	<1114795958.24949.7.camel@duffman>  <527jiljxfg.fsf@topspin.com>
Message-ID: <1114800663.4691.8.camel@duffman>

On Fri, 2005-04-29 at 10:38 -0700, Roland Dreier wrote:
> Brave man... I assume you have the fix for
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21173
> applied to your compiler?

More like blissfully ignorant.

As it turns out, gcc version 4.0.0 20050423 (Red Hat 4.0.0-1) does
contain a patch to fix this bug.  From the spec file:

<snip>
Patch21: gcc4-pr20742.patch
Patch22: gcc4-pr21102.patch
Patch23: gcc4-pr21099.patch
Patch24: gcc4-pr21173.patch
</snip>

-tduffy

From halr at voltaire.com  Fri Apr 29 11:52:23 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Apr 2005 14:52:23 -0400
Subject: [openib-general] Receiving and Sending user MADs
Message-ID: <1114800509.4477.926.camel@localhost.localdomain>

linux-kernel/docs/user_mad.txt states:

Receiving MADs

        struct ib_user_mad mad;
        ret = read(fd, &mad, sizeof mad);
        if (ret != sizeof mad)
                perror("read");
...
Sending MADs

        ret = write(fd, &mad, sizeof mad);
        if (ret != sizeof mad)
                perror("write");

Should this still be true (and validated) for non RMPP users ?

Thanks.

-- Hal


From roland at topspin.com  Fri Apr 29 12:02:57 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 12:02:57 -0700
Subject: [openib-general] Re: Receiving and Sending user MADs
In-Reply-To: <1114800509.4477.926.camel@localhost.localdomain> (Hal
	Rosenstock's message of "29 Apr 2005 14:52:23 -0400")
References: <1114800509.4477.926.camel@localhost.localdomain>
Message-ID: <52u0lpiexq.fsf@topspin.com>

    Hal> Should this still be true (and validated) for non RMPP users?

I don't see any reason why it would change, do you?

 - R.


From sean.hefty at intel.com  Fri Apr 29 12:11:53 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 29 Apr 2005 12:11:53 -0700
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <1114800429.4477.923.camel@localhost.localdomain>
Message-ID: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>

>On Fri, 2005-04-29 at 12:29, Sean Hefty wrote:
>> > 2. Since RMPP supports streaming (I know OpenIB doesn't do this on
>> > transmit), what does the receiver do with an incoming RMPP stream whose
>> > PayloadLength is not specified in the FIRST DATA packet ? (I think this
>> > may be needed for interoperability with third party RMPP
>implementations
>> > which do this).
>>
>> The RMPP receive code doesn't use the FIRST payload length when
>> receiving data.  It looks for the LAST bit to be set in the incoming
>> data MAD.
>
>OK. That avoids the issue of the inconsistent FIRST length.
>
>Wouldn't the FIRST length be useful as a hint for buffer size if present
>?

The receiving side doesn't perform a data copy.  It collects the separate
MAD buffers together in a list and hands those to the user.

The length of the data buffer that would be needed to copy the received
data into a single buffer is set in the mad_len field of struct
ib_mad_recv_wc.  The call ib_coalesce_recv_mad() is intended to perform
the data copy for the user, but isn't implemented yet.
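
As a rough illustration of how such a length can be derived from the
LAST segment's payload length, here is a standalone sketch. It is not
the actual get_mad_len() code: the function and parameter names are
made up, and treating an out-of-range payload length as "no padding"
is an assumption based on the check described in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total length of a reassembled transfer.
 *   hdr_size    - bytes of headers counted once (illustrative value)
 *   data_size   - bytes of data carried by each segment
 *   seg_count   - number of segments received
 *   last_paylen - payload length from the LAST segment
 */
static size_t rmpp_total_len(size_t hdr_size, size_t data_size,
			     size_t seg_count, size_t last_paylen)
{
	size_t pad;

	assert(seg_count > 0 && data_size > 0);

	/* Padding must be smaller than data_size; otherwise assume none. */
	if (last_paylen == 0 || last_paylen > data_size)
		pad = 0;
	else
		pad = data_size - last_paylen;

	return hdr_size + seg_count * data_size - pad;
}
```

With a 36-byte header, 220 data bytes per segment, 3 segments and a
LAST payload length of 100, this gives 36 + 3 * 220 - 120 = 576 bytes.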

>> Also, if you can think of a way to support this on the send side,
>> you/I/someone could add this.  I think that this would require
>> extending the send MAD API.
>
>This being supporting the PayloadLength in the first DATA packet of a
>send ?

I was referring to streaming sends.

The issue is that there would need to be a way to join multiple send
requests together as a single transfer.  I haven't given this much
thought.

I guess one way to support something like this is to conceptually have some
sort of send_id that is used to associate multiple send requests.  Multiple
calls to ib_post_send_mad could chain the requests together until the
transfer is complete...


>> > 3. How does the RMPP receiver handle discrepancies between the FIRST
>> > DATA PayloadLength and the LAST DATA PayloadLength ?
>>
>> Currently it doesn't.  The payload length in the LAST data packet is
>> used to calculate the padding for the total MAD length.  See
>> get_mad_len().  The only check that's done is to ensure that the
>> calculated padding is smaller than the MAD's data size.
>
>This seems safer and avoids the inconsistency issue.
>
>The only issue seems to be if we wanted to support a peek function to
>know the buffer to obtain before copying it.

See above... this is known before any data copying is done.

- Sean


From halr at voltaire.com  Fri Apr 29 12:23:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Apr 2005 15:23:22 -0400
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>
References: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>
Message-ID: <1114802602.4477.931.camel@localhost.localdomain>

On Fri, 2005-04-29 at 15:11, Sean Hefty wrote:
> The receiving side doesn't perform a data copy.  It collects the separate
> MAD buffers together in a list and hands those to the user.

Yes. I meant user_mad.c needs a buffer to copy into on ib_umad_read and
hence the size needs to be known ahead of time. That's where I think a
peek might be useful (for RMPP, not for fixed size MADs).

> I was referring to streaming sends.
> 
> The issue is that there would need to be a way to join multiple
> send requests together as a single transfer.  I haven't given this much
> thought.
> 
> I guess one way to support something like this is to conceptually have some
> sort of send_id that is used to associate multiple send requests.  Multiple
> calls to ib_post_send_mad could chain the requests together until the
> transfer is complete...

I'm not sure how important streaming RMPP is. I would defer this.

-- Hal


From iod00d at hp.com  Fri Apr 29 12:33:54 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 29 Apr 2005 12:33:54 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <42727B60.7010507@ens-lyon.org>
References: <20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
Message-ID: <20050429193354.GJ24871@esmail.cup.hp.com>

On Fri, Apr 29, 2005 at 08:22:24PM +0200, Brice Goglin wrote:
> For instance, instead of adding PROT_DONT/ALWAYSCOPY, you may use
> an ioproc hook in the fork path. This hook (a function in your driver)
> would be called for each registered page. It will decide whether
> the page should be pre-copied or not and update the registration
> table (or whatever stores address translations in the NIC).
> In addition, the driver would probably pre-copy cow pages when
> registering them.

This doesn't scale well as more cards are added to the box.
I think I understand why it's good for single cards though.

> It's nice to see these two works coming to LKML at the same time.
> It would be great if we could merge them and get a generic solution
> that's suitable to both registration based cards (IB/Myri/Ammasso)
> and MMU-based cards (Quadrics).

Aren't the mellanox mem-free cards more or less MMUs as well?
I had that impression after attending Dror Goldenberg's talk
though I don't think he asserted that.
Openib.org developers conf (Feb 2005) slideset is here:
	http://www.openib.org/docs/oib_wkshp_022005/memfree-hca-mellanox-dgoldenberg.pdf

Being mostly clueless about Quadrics implementation, I'm probably
missing something that makes Quadrics a MMU but not the IB variants.
Can someone clue me in please?

thanks,
grant


From woodennickel at gmail.com  Fri Apr 29 12:43:10 2005
From: woodennickel at gmail.com (Bill Jordan)
Date: Fri, 29 Apr 2005 15:43:10 -0400
Subject: RDMA memory registration (was: [openib-general] Re:
	[PATCH][RFC][0/4] InfiniBand userspace verbs implementation)
In-Reply-To: <52d5sdjzup.fsf_-_@topspin.com>
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com>
Message-ID: <5ebee0d1050429124333776354@mail.gmail.com>

On 4/29/05, Roland Dreier <roland at topspin.com> wrote:
>   b) (maybe someday?) Add a VM_ALWAYSCOPY flag and extend mprotect()
>      with PROT_ALWAYSCOPY so processes can mark pages to be
>      pre-copied into child processes, to handle the case where only
>      half a page is registered.

Are you suggesting making the partial pages their own VMA, or marking
the entire buffer with this flag? I originally thought the entire
buffer should be copy on fork (instead of copy on write), and I
believe this is the path Mellanox was pursuing with the VM_NO_COW flag.
However, if applications are registering gigs of ram, it would be very
bad to have the entire area copied on fork.

On the other hand, I've always wondered about the choice to leave
holes in the child process's address space. I would have chosen to map
the zero page instead.

-- 
Bill Jordan
InfiniCon Systems


From roland at topspin.com  Fri Apr 29 12:45:38 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 12:45:38 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5ebee0d1050429124333776354@mail.gmail.com> (Bill Jordan's
	message of "Fri, 29 Apr 2005 15:43:10 -0400")
References: <20050425135401.65376ce0.akpm@osdl.org>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com>
	<5ebee0d1050429124333776354@mail.gmail.com>
Message-ID: <52ll71icyl.fsf@topspin.com>

    Bill> Are you suggesting making the partial pages their own VMA,
    Bill> or marking the entire buffer with this flag? I originally
    Bill> thought the entire buffer should be copy on fork (instead of
    Bill> copy on write), and I believe this is the path Mellanox was
    Bill> pursuing with the VM_NO_COW flag.  However, if applications
    Bill> are registering gigs of ram, it would be very bad to have
    Bill> the entire area copied on fork.

It's up to userspace really but I would expect that the partial pages
would be in a vma by themselves.
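
To make the split concrete, here is a small Python sketch that divides a
registered buffer into the pages it only partly covers and the pages it
covers completely.  It is a model of the address arithmetic only:
split_registration is a hypothetical helper, the 4096-byte page size is an
assumption, and which flag each range would get is up to whoever sets up
the vmas.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # [begin, end) byte addresses, page aligned


def split_registration(start: int, length: int,
                       page_size: int = 4096) -> Tuple[List[Range], Optional[Range]]:
    """Return (partial_ranges, full_range) for the buffer [start, start+length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    end = start + length
    lo = start - start % page_size                      # start rounded down
    hi = -(-end // page_size) * page_size               # end rounded up
    full_lo = -(-start // page_size) * page_size        # start rounded up
    full_hi = end - end % page_size                     # end rounded down
    if full_lo >= full_hi:
        # No page is covered completely; everything touched is partial.
        return [(lo, hi)], None
    partial: List[Range] = []
    if lo < full_lo:
        partial.append((lo, full_lo))   # partly covered first page
    if full_hi < hi:
        partial.append((full_hi, hi))   # partly covered last page
    return partial, (full_lo, full_hi)
```

The partial ranges are the candidates for their own vma (the pre-copied
case), while the full range is the part that could be excluded from the
child without hiding unrelated data.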

 - R.


From lindahl at pathscale.com  Fri Apr 29 13:18:35 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 29 Apr 2005 13:18:35 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <20050429193354.GJ24871@esmail.cup.hp.com>
References: <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org>
	<5264y9s3bs.fsf@topspin.com> <426EA220.6010007@ammasso.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
Message-ID: <20050429201835.GA3341@greglaptop.internal.keyresearch.com>

On Fri, Apr 29, 2005 at 12:33:54PM -0700, Grant Grundler wrote:

> Being mostly clueless about Quadrics implementation, I'm probably
> missing something that makes Quadrics a MMU but not the IB variants.
> Can someone clue me in please?

As far as I can tell it's mostly a marketing distinction. Many
Quadrics customers run with memory registration, and Mellanox could
probably alter their firmware to not require registration.  Myricom
certainly can, and in fact Patrick Geoffray claimed they were doing so
in their MX software. The only one I know of that isn't that flexible
is PathScale's InfiniPath. Ours is a pure hardware mechanism, but it
requires memory registration and is clearly not an MMU.

Confused yet?

-- greg


From woodennickel at gmail.com  Fri Apr 29 13:40:20 2005
From: woodennickel at gmail.com (Bill Jordan)
Date: Fri, 29 Apr 2005 16:40:20 -0400
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <20050429201835.GA3341@greglaptop.internal.keyresearch.com>
References: <52mzrlsflu.fsf@topspin.com> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
Message-ID: <5ebee0d105042913401601a46c@mail.gmail.com>

On 4/29/05, Greg Lindahl <lindahl at pathscale.com> wrote:
> On Fri, Apr 29, 2005 at 12:33:54PM -0700, Grant Grundler wrote:
> 
> > Being mostly clueless about Quadrics implementation, I'm probably
> > missing something that makes Quadrics a MMU but not the IB variants.
> > Can someone clue me in please?
> 
> As far as I can tell it's mostly a marketing distinction. Many
> Quadrics customers run with memory registration, and Mellanox could
> probably alter their firmware to not require registration.  Myricom
> certainly can, and in fact Patrick Geoffray claimed they were doing so
> in their MX software. The only one I know of that isn't that flexible
> is PathScale's InfiniPath. Ours is a pure hardware mechanism, but it
> requires memory registration and is clearly not an MMU.
> 
> Confused yet?

I'm very confused at this point. Can you briefly explain how this
works, or point me to a description? I don't see how you could do user
level I/O without registering the memory with the hardware. I'm
especially confused by the comment (may not have been yours) that the
memory doesn't have to be pinned.
-- 
Bill Jordan
InfiniCon Systems


From roland at topspin.com  Fri Apr 29 13:46:31 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 13:46:31 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5ebee0d105042913401601a46c@mail.gmail.com> (Bill Jordan's
	message of "Fri, 29 Apr 2005 16:40:20 -0400")
References: <52mzrlsflu.fsf@topspin.com> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
	<5ebee0d105042913401601a46c@mail.gmail.com>
Message-ID: <52ekctia54.fsf@topspin.com>

    Bill> I'm very confused at this point. Can you briefly explain how
    Bill> this works, or point me to a description? I don't see how
    Bill> you could do user level I/O without registering the memory
    Bill> with the hardware. I'm especially confused by the comment
    Bill> (may not have been yours) that the memory doesn't have to be
    Bill> pinned.  -- Bill Jordan InfiniCon Systems

You add a hook to the kernel so it tells you if a page is about to be
paged out or otherwise moved.  Then you set a bit in the adapter's page
table so that it won't try to access that page without telling you.
If the adapter asks for the page, you get the kernel to fault the page
in and program the new physical mapping in the adapter.
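
A toy model of that sequence, written in Python purely for illustration.
AdapterPageTable, invalidate and the fault_in callback are hypothetical
names; the real mechanism would live in the kernel hook and the adapter's
firmware or hardware, not in code like this.

```python
from typing import Callable, Dict, List


class AdapterPageTable:
    """Adapter-side translation table kept in sync with the host mm."""

    def __init__(self, fault_in: Callable[[int], int]) -> None:
        # fault_in(vpage) asks the "kernel" to make the page resident and
        # returns its new physical page number.
        self._fault_in = fault_in
        self._table: Dict[int, List] = {}   # vpage -> [ppage, valid]
        self.faults = 0

    def map(self, vpage: int, ppage: int) -> None:
        self._table[vpage] = [ppage, True]

    def invalidate(self, vpage: int) -> None:
        """Called from the kernel hook before the page is paged out or moved."""
        if vpage in self._table:
            self._table[vpage][1] = False

    def translate(self, vpage: int) -> int:
        """Adapter access: fault the page back in if its entry is invalid."""
        ppage, valid = self._table[vpage]
        if not valid:
            self.faults += 1
            ppage = self._fault_in(vpage)
            self._table[vpage] = [ppage, True]
        return ppage
```

Only accesses to invalidated entries take the slow path; a resident page
translates without involving the host.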

 - R.


From trimmer at infiniconsys.com  Fri Apr 29 13:49:44 2005
From: trimmer at infiniconsys.com (Rimmer, Todd)
Date: Fri, 29 Apr 2005 16:49:44 -0400
Subject: [openib-general] Re: RDMA memory registration
Message-ID: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>

> You add a hook to the kernel so it tells you if a page is about to be
> paged out or otherwise moved.  Then you set a bit in the adapter's page
> table so that it won't try to access that page without telling you.
> If the adapter asks for the page, you get the kernel to fault the page
> in and program the new physical mapping in the adapter.

But that implies the hardware has an MMU and it also puts an interrupt in the path per page sent.

Wasn't the assertion that there was no MMU in the hardware?

Todd Rimmer


From roland at topspin.com  Fri Apr 29 13:59:22 2005
From: roland at topspin.com (Roland Dreier)
Date: Fri, 29 Apr 2005 13:59:22 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
	(Todd Rimmer's message of "Fri, 29 Apr 2005 16:49:44 -0400")
References: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
Message-ID: <5264y5i9jp.fsf@topspin.com>

    Todd> But that implies the hardware has an MMU and it also puts an
    Todd> interrupt in the path per page sent.

Well, there's one interrupt per non-resident page sent.  But nearly
all of the time the page will be present.

    Todd> Wasn't the assertion that there was no MMU in the hardware?

I don't think so.  Greg's original message said this doesn't work for
PathScale's part precisely because they don't have an MMU.

 - R.


From rminnich at lanl.gov  Fri Apr 29 14:04:01 2005
From: rminnich at lanl.gov (Ronald G. Minnich)
Date: Fri, 29 Apr 2005 15:04:01 -0600 (MDT)
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5ebee0d105042913401601a46c@mail.gmail.com>
References: <52mzrlsflu.fsf@topspin.com> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
	<5ebee0d105042913401601a46c@mail.gmail.com>
Message-ID: <Pine.LNX.4.58.0504291502540.23205@enigma.lanl.gov>


On Fri, 29 Apr 2005, Bill Jordan wrote:

> I'm very confused at this point. Can you briefly explain how this works,
> or point me to a description? I don't see how you could do user level
> I/O without registering the memory with the hardware. I'm especially
> confused by the comment (may not have been yours) that the memory
> doesn't have to be pinned. 

you modify the mm layer of linux, so that the PTEs on the Quadrics card 
are in sync with the PTEs in the mm layer. Then you are in a position to 
have a NIC incite page faults for incoming packets. 

I think greg got it right -- in practice, it's not done any more. Quadrics 
has a kernel-patch-free source base now, I'm told.

ron


From lindahl at pathscale.com  Fri Apr 29 14:04:40 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 29 Apr 2005 14:04:40 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5264y5i9jp.fsf@topspin.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
	<5264y5i9jp.fsf@topspin.com>
Message-ID: <20050429210439.GA3650@greglaptop.internal.keyresearch.com>

>     Todd> But that implies the hardware has an MMU and it also puts an
>     Todd> interrupt in the path per page sent.
> 
> Well, there's one interrupt per non-resident page sent.  But nearly
> all of the time the page will be present.

It doesn't imply that there's an MMU, either. I know that Myricom uses
a little lookup routine in software on their nic, which most people
wouldn't call an MMU. I don't know what Mellanox does for this, they
don't talk much about what's hardware and what's software on their
nic. I think Quadrics actually uses the TLB of their risc cpu on their
nic for this lookup, but that's just a guess.

-- greg


From rminnich at lanl.gov  Fri Apr 29 14:05:51 2005
From: rminnich at lanl.gov (Ronald G. Minnich)
Date: Fri, 29 Apr 2005 15:05:51 -0600 (MDT)
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
Message-ID: <Pine.LNX.4.58.0504291505190.23205@enigma.lanl.gov>


On Fri, 29 Apr 2005, Rimmer, Todd wrote:

> But that implies the hardware has an MMU and it also puts an interrupt
> in the path per page sent.

yes. it does. and it doesn't do per page sent, just per page that has no 
pte on the nic when received.

ron


From rminnich at lanl.gov  Fri Apr 29 14:07:40 2005
From: rminnich at lanl.gov (Ronald G. Minnich)
Date: Fri, 29 Apr 2005 15:07:40 -0600 (MDT)
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <20050429210439.GA3650@greglaptop.internal.keyresearch.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
	<5264y5i9jp.fsf@topspin.com>
	<20050429210439.GA3650@greglaptop.internal.keyresearch.com>
Message-ID: <Pine.LNX.4.58.0504291507240.23205@enigma.lanl.gov>


On Fri, 29 Apr 2005, Greg Lindahl wrote:

> It doesn't imply that there's an MMU, either. I know that Myricom uses a
> little lookup routine in software on their nic, which most people
> wouldn't call an MMU. I don't know what Mellanox does for this, they
> don't talk much about what's hardware and what's software on their nic.
> I think Quadrics actually uses the TLB of their risc cpu on their nic
> for this lookup, but that's just a guess.

but only quadrics rewrites the mm layer code ..

ron


From libor at topspin.com  Fri Apr 29 14:16:07 2005
From: libor at topspin.com (Libor Michalek)
Date: Fri, 29 Apr 2005 14:16:07 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <Pine.LNX.4.58.0504291507240.23205@enigma.lanl.gov>;
	from rminnich@lanl.gov on Fri, Apr 29, 2005 at 03:07:40PM -0600
References: <5D78D28F88822E4D8702BB9EEF1A436773DC44@mercury.infiniconsys.com>
	<5264y5i9jp.fsf@topspin.com>
	<20050429210439.GA3650@greglaptop.internal.keyresearch.com>
	<Pine.LNX.4.58.0504291507240.23205@enigma.lanl.gov>
Message-ID: <20050429141607.C13041@topspin.com>

On Fri, Apr 29, 2005 at 03:07:40PM -0600, Ronald G. Minnich wrote:
> On Fri, 29 Apr 2005, Greg Lindahl wrote:
> 
> > It doesn't imply that there's an MMU, either. I know that Myricom uses a
> > little lookup routine in software on their nic, which most people
> > wouldn't call an MMU. I don't know what Mellanox does for this, they
> > don't talk much about what's hardware and what's software on their nic.
> > I think Quadrics actually uses the TLB of their risc cpu on their nic
> > for this lookup, but that's just a guess.
> 
> but only quadrics rewrites the mm layer code ..

  Mellanox, although they have the capability, does not use the feature.
In the existing model the mellanox hardware assumes that the page is
present, hence the entire discussion about how to make sure the page
stays put and that the user mapping to that page stays put.

-Libor


From caitlin.bestler at gmail.com  Fri Apr 29 14:34:06 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Fri, 29 Apr 2005 14:34:06 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <52ekctia54.fsf@topspin.com>
References: <52mzrlsflu.fsf@topspin.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
	<5ebee0d105042913401601a46c@mail.gmail.com>
	<52ekctia54.fsf@topspin.com>
Message-ID: <469958e0050429143414b1b3c@mail.gmail.com>

On 4/29/05, Roland Dreier <roland at topspin.com> wrote:
>     Bill> I'm very confused at this point. Can you briefly explain how
>     Bill> this works, or point me to a description? I don't see how
>     Bill> you could do user level I/O without registering the memory
>     Bill> with the hardware. I'm especially confused by the comment
>     Bill> (may not have been yours) that the memory doesn't have to be
>     Bill> pinned.  -- Bill Jordan InfiniCon Systems
> 
> You add a hook to the kernel so it tells you if a page is about to be
> paged out or otherwise moved.  Then you set a bit in the adapter's page
> table so that it won't try to access that page without telling you.
> If the adapter asks for the page, you get the kernel to fault the page
> in and program the new physical mapping in the adapter.
> 

Yes, and you could even have a system that was capable of doing
DMA to a user virtual map (in fact some minis back around 1980
had exactly that capability).

But there are *two* issues involved here:

    One is that the RDMA hardware, however it is marketed, essentially
    needs to act as an MMU. That means that it has to be synchronized
    with normal MMU. The traditional sledge-hammer approach to 



From rminnich at lanl.gov  Fri Apr 29 14:41:11 2005
From: rminnich at lanl.gov (Ronald G. Minnich)
Date: Fri, 29 Apr 2005 15:41:11 -0600 (MDT)
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <469958e0050429143414b1b3c@mail.gmail.com>
References: <52mzrlsflu.fsf@topspin.com>
	<20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
	<5ebee0d105042913401601a46c@mail.gmail.com>
	<52ekctia54.fsf@topspin.com>
	<469958e0050429143414b1b3c@mail.gmail.com>
Message-ID: <Pine.LNX.4.58.0504291540530.23205@enigma.lanl.gov>


On Fri, 29 Apr 2005, Caitlin Bestler wrote:

>     One is that the RDMA hardware, however it is marketed, essentially
>     needs to act as an MMU. That means that it has to be synchronized
>     with normal MMU. The traditional sledge-hammer approach to 

ah ha! his RDMA mmu just crashed his mm layer. It happens. 

ron


From caitlin.bestler at gmail.com  Fri Apr 29 14:42:55 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Fri, 29 Apr 2005 14:42:55 -0700
Subject: [openib-general] Re: RDMA memory registration
In-Reply-To: <469958e0050429143414b1b3c@mail.gmail.com>
References: <52mzrlsflu.fsf@topspin.com>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<52d5sdjzup.fsf_-_@topspin.com> <42727B60.7010507@ens-lyon.org>
	<20050429193354.GJ24871@esmail.cup.hp.com>
	<20050429201835.GA3341@greglaptop.internal.keyresearch.com>
	<5ebee0d105042913401601a46c@mail.gmail.com>
	<52ekctia54.fsf@topspin.com>
	<469958e0050429143414b1b3c@mail.gmail.com>
Message-ID: <469958e00504291442195457cb@mail.gmail.com>

Oops, hit send too soon. Finishing the response...

On 4/29/05, Caitlin Bestler <caitlin.bestler at gmail.com> wrote:
> On 4/29/05, Roland Dreier <roland at topspin.com> wrote:
> >     Bill> I'm very confused at this point. Can you briefly explain how
> >     Bill> this works, or point me to a description? I don't see how
> >     Bill> you could do user level I/O without registering the memory
> >     Bill> with the hardware. I'm especially confused by the comment
> >     Bill> (may not have been yours) that the memory doesn't have to be
> >     Bill> pinned.  -- Bill Jordan InfiniCon Systems
> >
> > You add a hook to the kernel so it tells you if a page is about to be
> > paged out or otherwise moved.  Then you set a bit in the adapter's page
> > table so that it won't try to access that page without telling you.
> > If the adapter asks for the page, you get the kernel to fault the page
> > in and program the new physical mapping in the adapter.
> >
> 
> Yes, and you could even have a system that was capable of doing
> DMA to a user virtual map (in fact some minis back around 1980
> had exactly that capability).
> 
> But there are *two* issues involved here:
> 
>     One is that the RDMA hardware, however it is marketed, essentially
>     needs to act as an MMU. That means that it has to be synchronized
>     with normal MMU. The traditional sledge-hammer approach to
> 
    "synchronizing" is to require that the mapping be frozen. You *could*
    define a method that attempts to be more dynamic in this synchronization,
    but since it is an ex post facto mechanism that must work with multiple
    hardware cards it needs to be defined recognizing that it is not
    instantaneous.
    It is virtually the same problem as memory suspend in general; basically
    the RDMA hardware's MMU is not making calculations for each and every
    access to the host bus.

    Secondly there is the problem that an advertised buffer is implicitly a
    promise to the peer that the buffer is available. Using RNRs (or dropping
    TCP segments for iWARP) while paging an image from disk is just not
    playing fair. No host should advertise 20 GB of buffers to its peer when it
    only has 2 GB of physical memory backing it up. When an application
    registers memory it believes it has permission from the OS to advertise
    buffers within it. RNRs are appropriate to move memory around, not to
    allow a host to overadvertise.


From mshefty at ichips.intel.com  Fri Apr 29 15:09:48 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 29 Apr 2005 15:09:48 -0700
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <1114802602.4477.931.camel@localhost.localdomain>
References: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>
	<1114802602.4477.931.camel@localhost.localdomain>
Message-ID: <4272B0AC.1050405@ichips.intel.com>

Hal Rosenstock wrote:
>>The receiving side doesn't perform a data copy.  It collects the separate
>>MAD buffers together in a list and hands those to the user.
> 
> 
> Yes. I meant user_mad.c needs a buffer to copy into on ib_umad_read and
> hence the size needs to be known ahead of time. That's where I think a
> peek might be useful (for RMPP, not for fixed size MADs).

By the time the kernel client gets the MAD, it's been reassembled, and 
the exact size that's needed is known.  So, I don't think that this is 
an issue for kernel clients.

Maybe you could add a peek for usermode or have read return the correct 
size if the requested size is too small.  Since you have to do a data 
copy for usermode anyway, I think it makes sense to just return the 
coalesced buffer.  This makes me think that I should implement 
ib_coalesce_recv_mad() now.
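
The "return the correct size if the requested size is too small" behaviour
might look like the Python sketch below.  It is only a model of the
contract: read_coalesced is a hypothetical name, not the ib_umad_read
interface, and how the needed size is actually reported to userspace is
left open.

```python
from typing import Optional, Tuple


def read_coalesced(coalesced: bytes, requested: int) -> Tuple[Optional[bytes], int]:
    """Return (data, needed).

    data is None when the caller's buffer is too small; needed always
    reports the full size so the caller can retry with a larger buffer.
    """
    needed = len(coalesced)
    if requested < needed:
        return None, needed
    return coalesced, needed
```

A caller would try once with its usual buffer size and retry only when
data comes back empty, which avoids a separate peek call.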

> I'm not sure how important streaming RMPP is. I would defer this.

I agree.  :)

- Sean


From robert.j.woodruff at intel.com  Fri Apr 29 16:08:03 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 29 Apr 2005 16:08:03 -0700
Subject: [openib-general] Re: user-mode verbs on Itanium
In-Reply-To: <20050429183846.GH24871@esmail.cup.hp.com>
Message-ID: <ORSMSX408ryWtIIZS2T0000001b@orsmsx408.amr.corp.intel.com>

 
Grant wrote,
>And WOOT! It Works. :^)

>Reminder, I'm using:
>	kernel.org 2.6.11, openib.org r2229, gcc 3.3.5 (Debian 1:3.3.5-12),
>	rx2600 (ZX1 chipset), 1.5Ghz Madisons,
>	MT23108 (PCI-X, Cougar) in dual rope slot.

>ionize:/usr/src/openib_gen2/src/userspace# ibv_pingpong 10.0.0.51
>  local address:  LID 0x000d, QPN 0x040406, PSN 0xfe84d1
>  remote address: LID 0x000b, QPN 0x000406, PSN 0x5e48e6
>8192000 bytes in 0.04 seconds = 1588.06 Mbit/sec
>1000 iters in 0.04 seconds = 41.27 usec/iter

>gsyprf3:/usr/src/linux-2.6# ibv_pingpong 
>  local address:  LID 0x000b, QPN 0x000406, PSN 0x5e48e6
>  remote address: LID 0x000d, QPN 0x040406, PSN 0xfe84d1
>8192000 bytes in 0.04 seconds = 1592.19 Mbit/sec
>1000 iters in 0.04 seconds = 41.16 usec/iter

Good to hear that someone else has been able to get uverbs running on
IPF. I am still having some problems, but I am also trying to
run on a backport to an earlier 2.6.9 kernel and I was using older SVN code,
2214.
I will upgrade to the latest SVN version and try a stock 2.6.11 kernel
first, before attempting to backport. Unfortunately, I need to go and
rebuild our development system that had SVN on it, as its disk
decided to crash this afternoon.

woody


From caitlin.bestler at gmail.com  Fri Apr 29 17:31:44 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Fri, 29 Apr 2005 17:31:44 -0700
Subject: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs
	implementation
In-Reply-To: <20050429100425.A13041@topspin.com>
References: <20050425173757.1dbab90b.akpm@osdl.org>
	<20050426084234.A10366@topspin.com> <52mzrlsflu.fsf@topspin.com>
	<20050426122850.44d06fa6.akpm@osdl.org> <5264y9s3bs.fsf@topspin.com>
	<426EA220.6010007@ammasso.com> <20050426133752.37d74805.akpm@osdl.org>
	<5ebee0d105042907265ff58a73@mail.gmail.com>
	<469958e005042908566f177b50@mail.gmail.com>
	<20050429100425.A13041@topspin.com>
Message-ID: <469958e00504291731eb8287c@mail.gmail.com>

On 4/29/05, Libor Michalek <libor at topspin.com> wrote:

> 
>   However, you have a potential problem with registered buffers that
> do not begin or end on a page boundary, which is common with malloc.
> If the buffer resides on a portion of a page, and you mark the vm
> which contains that entire page VM_DONTCOPY, to ensure that the parent
> has access to the exact physical page after the fork, the child will
> not be able to access anything on that entire page. So if the child
> expects to access data on the same page that happens to contain the
> registered buffer it will get a segment violation.
> 
> The four situations we've discussed are:
> 
>   1) Physical page does not get used for anything else.
>   2) Process's virtual to physical mapping remains fixed.
>   3) Same virtual to physical mapping after forking a child.
>   4) Forked child has access to all non-registered memory of
>      the parent.
> 
> The first two are now taken care of with get_user_pages (we used to
> use VM_LOCKED for the second case), the third case is handled by setting
> the vm to VM_DONTCOPY, and on the fourth case we've always punted,
> but the real answer is to break partial pages into separate vms and
> mark them ALWAYS_COPY.
> 
> -Libor
> 
> 
Attempting to provide *any* support for applications that fork children
after doing RDMA registrations is a rathole best avoided. The general
rule that application developers should follow is to do RDMA *only*
in the child processes.

Keep in mind that it is not only the memory regions that must be dealt
with, but also control data invisible to the user (the QP context, etc.). This
data is frequently interlinked between kernel-resident and user-resident
data (for example, a QP context has the PD ID somewhere on-chip or in
kernel, while the Send Queue ring needs to be in user memory). Having
two different user processes that both think they have the user half of
this type of split data structure is just asking for trouble, even if you
manage to get the copy-on-write timing problems all solved.

All of this can be avoided by a simple rule: don't fork after opening
an RDMA device.


From halr at voltaire.com  Sat Apr 30 04:33:09 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Apr 2005 07:33:09 -0400
Subject: [openib-general] Re: Some Initial RMPP Comments and Questions
In-Reply-To: <4272B0AC.1050405@ichips.intel.com>
References: <ORSMSX401FRaqbC8wSA0000000d@orsmsx401.amr.corp.intel.com>
	<1114802602.4477.931.camel@localhost.localdomain>
	<4272B0AC.1050405@ichips.intel.com>
Message-ID: <1114860788.4477.1779.camel@localhost.localdomain>

On Fri, 2005-04-29 at 18:09, Sean Hefty wrote:
> By the time the kernel client gets the MAD, it's been reassembled, and 
> the exact size that's needed is known.  So, I don't think that this is 
> an issue for kernel clients.
> 
> Maybe you could add a peek for usermode or have read return the correct 
> size if the requested size is too small.  

I especially like the requested size too small touch. That saves a
potential extra operation.

> Since you have to do a data 
> copy for usermode anyway, I think it makes sense to just return the 
> coalesced buffer.  This makes me think that I should implement 
> ib_coalesce_recv_mad() now.

That would make the receive side of user space much easier. Shall I
defer implementing this part until coalesce is implemented ?

Thanks.

-- Hal


From info at qsv14.com  Sat Apr 30 06:59:21 2005
From: info at qsv14.com (info at qsv14.com)
Date: 30 Apr 2005 22:59:21 +0900
Subject: [openib-general] $BFMA3$N%a!<%k$r$5$;$FD:$-$^$7$?!#(B
Message-ID: <20050430135921.32553.qmail@mail.qsv14.com>

