[ofa-general] [PATCHv3 RFC] Scalable Reliable Connection: API and documentation

Michael S. Tsirkin mst at dev.mellanox.co.il
Wed Aug 8 00:19:10 PDT 2007


Add API extensions and documentation to support Scalable Reliable
Connections.

Signed-off-by: Michael S. Tsirkin <mst at dev.mellanox.co.il>

---

Here's an updated revision of the RFC.
Changes since v2:
- Remove max_src_domains - this breaks library ABI
  and is unlikely to be useful anyway
- Add device capability flag to enable detecting SRC support
- Fill in some implementation bits in libibverbs
- Better document cleanup process in the examples section
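
For reference, an application would check for SRC support roughly like
this (illustrative only):

	struct ibv_device_attr attr;

	if (!ibv_query_device(ctx, &attr) &&
	    (attr.device_cap_flags & IBV_DEVICE_SRC))
		/* SRC verbs are available */;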

diff --git a/SRC.txt b/SRC.txt
new file mode 100644
index 0000000..b3c0459
--- /dev/null
+++ b/SRC.txt
@@ -0,0 +1,206 @@
+Here's some documentation on Scalable Reliable Connections.
+
+ * * *
+
+SRC is an extension supported by recent Mellanox hardware
+which is geared toward reducing the number of QPs
+required for all-to-all communication on systems
+with a high number of jobs per node.
+
+===================================================================
+Motivation:
+===================================================================
+Given N nodes with J jobs per node, the number of QPs required
+for all-to-all communication is:
+
+With RC:
+		O((N * J) ^ 2)
+
+	Since each job out of O(N * J) jobs must create a single QP
+	to communicate with each one of O(N * J) other jobs.
+
+With SRC:
+		O(N ^ 2 * J)
+
+	This is achieved by using a single send queue (per job, out of O(N * J) jobs)
+	to send data to all J jobs running on a specific node (out of O(N) nodes).
+	Hardware uses a new "SRQ number" field in the packet header to
+	multiplex receive WRs and WCs to the private memory of each job.
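+
+	For example (illustrative numbers only): with N = 100 nodes and
+	J = 8 jobs per node, RC requires on the order of
+	(100 * 8) ^ 2 = 640,000 QPs, while SRC requires on the order of
+	100 ^ 2 * 8 = 80,000 QPs.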
+
+This is a similar idea to IB RD.
+Q: Why not use RD then?
+A: Because no hardware supports it.
+
+Details:
+
+===================================================================
+Verbs extension:
+===================================================================
+
+- There is a new transport/QP type "SRC".
+- There is a new object type "SRC domain"
+- Each SRQ gets new (optional) attributes:
+        SRC domain
+        SRC SRQ number
+        SRC CQ
+  An SRQ must have either all three of these attributes or none of them.
+
+- QPs of type SRC have the same attributes as regular RC QPs
+  connected to an SRQ, except that:
+  A. Each SRC QP has a new required attribute, "SRC domain"
+  B. SRC QPs do *not* have an "SRQ" attribute
+  	(no specific SRQ is associated with them; see the sketch below)
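+
+  The sketch below (illustrative only: error handling is omitted, and
+  ctx, pd, cq and fd are assumed to already exist) shows how the new
+  objects fit together:
+
+	struct ibv_src_domain *dom;
+	struct ibv_srq *srq;
+	struct ibv_qp *qp;
+	struct ibv_srq_init_attr srq_attr = {
+		.attr = { .max_wr = 128, .max_sge = 1 }
+	};
+	struct ibv_qp_init_attr qp_attr = {
+		.send_cq = cq,
+		.recv_cq = cq,
+		.cap     = { .max_send_wr = 64, .max_send_sge = 1 },
+		.qp_type = IBV_QPT_SRC,
+	};
+
+	/* One SRC domain shared by all jobs on the node */
+	dom = ibv_open_src_domain(ctx, fd, O_CREAT);
+	/* Per-job SRQ, tied to the domain and to the CQ receiving the WCs */
+	srq = ibv_create_src_srq(pd, dom, cq, &srq_attr);
+	/* SRC QP: has an SRC domain but no SRQ of its own */
+	qp_attr.src_domain = dom;
+	qp = ibv_create_qp(pd, &qp_attr);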
+
+===================================================================
+Protocol extension:
+===================================================================
+SRC QP behaviour: Requestor
+- The post send WR for this QP type is extended with an SRQ number field.
+  This number is sent as part of the packet header (see the sketch below).
+- SRC packets follow the rules for RC packets on the wire exactly.
+  What differs is their handling at the responder side.
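+
+  A minimal post send sketch (illustrative only: qp is an SRC QP in
+  the RTS state, buf/len/mr describe a registered buffer, and srq_num
+  is the remote SRQ number, obtained out of band):
+
+	struct ibv_sge sge = {
+		.addr   = (uintptr_t) buf,
+		.length = len,
+		.lkey   = mr->lkey
+	};
+	struct ibv_send_wr wr = {
+		.sg_list            = &sge,
+		.num_sge            = 1,
+		.opcode             = IBV_WR_SEND,
+		.send_flags         = IBV_SEND_SIGNALED,
+		.src_remote_srq_num = srq_num,	/* new field in this RFC */
+	}, *bad_wr;
+
+	if (ibv_post_send(qp, &wr, &bad_wr))
+		/* handle the error */;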
+
+SRC QP behaviour: Responder
+Each incoming packet passes transport checks with respect
+to the SRC QP, following RC rules, exactly.
+
+After this, the SRQ number in the packet header is used to look up
+a specific SRQ. The SRC domain of the resulting SRQ must be equal
+to the SRC domain of the QP; otherwise a NAK is sent
+and the QP moves to the error state.
+
+If the SRC domains match, receive WR and receive WC processing
+are as follows:
+
+- RC Send
+  - Rather than using an SRQ attached to the QP (SRC QPs have none),
+    the SRQ is looked up by the SRQ number in the packet.
+    The receive WR is taken from this SRQ.
+  - Completions are generated on the SRC CQ specified in the SRQ.
+
+- RDMA/Atomic
+  - Rather than using the PD to which the QP is attached,
+    the SRQ is looked up by the SRQ number in the packet.
+    The PD of this SRQ is used for protection checks.
+
+===================================================================
+Pseudo code:
+===================================================================
+
+Consider again a setup where there are N nodes with J jobs per node.
+All N * J jobs need to perform all-to-all communication.
+Using RC QPs, this would call for O((N * J) ^ 2) QPs.
+Here is how SRC can be used to reduce the number of QPs to O(N ^ 2 * J).
+
+At startup:
+1. All jobs on each node share a single SRC domain
+2. Each job creates a CQ for receive WCs
+3. Each job creates an SRQ attached to this CQ and to the shared domain (see the sketch below)
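+
+  A sketch of these startup steps (the file name below is purely
+  hypothetical - any inode the jobs of one node agree on will do;
+  ctx, pd and srq_attr are as in the earlier sketch, and it is
+  assumed that the provider fills in srq->src_srq_num):
+
+	int fd = open("/tmp/src-domain", O_RDWR | O_CREAT, 0600);
+	struct ibv_src_domain *dom = ibv_open_src_domain(ctx, fd, O_CREAT);
+	struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
+	struct ibv_srq *srq = ibv_create_src_srq(pd, dom, cq, &srq_attr);
+	/* srq->src_srq_num is the number remote senders put in their WRs */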
+
+When job j1 needs to transmit to job j2 on remote node n for the first time:
+1. Test: does job j1 have an existing connection to some job on node n?
+        - If no:
+		j1 creates an SRC QP qp1 (send QP)
+			qp1 is only used to post send WRs
+		j2 creates an SRC QP qp2
+			qp2 is part of the shared SRC domain
+			qp2 is only used to do transport checks:
+				neither send nor receive WRs are posted on qp2
+		j1 and j2 create a connection between qp1 and qp2
+	- If yes:
+		let qp1 be the QP which belongs to j1 and is connected
+			to some qp on node n
+
+2. j1 gets the SRQ number of j2's SRQ from j2
+3. j1 can now use QP qp1 from step 1
+   and the SRQ number from step 2 to send data to j2
+
+Cleanup:
+When job j1 no longer needs to communicate with any jobs on node n,
+it disconnects qp1 from qp2, and asks j2 to destroy qp2.
+
+Note: both qp1 and qp2 must exist for the communication to take place.
+Thus, j2 should not destroy qp2 (and in particular, should not exit)
+until j1 has completed communication with node n and has
+asked j2 to disconnect.
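+
+  A cleanup sketch (assuming j1 and j2 signal each other out of band):
+
+	/* j1, once it no longer talks to any job on node n */
+	ibv_destroy_qp(qp1);
+	/* j2, only after j1 has confirmed that it is done */
+	ibv_destroy_qp(qp2);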
+
+===================================================================
+
+Resources used (CQs are ignored below):
+Each node:
+- An SRC domain - for a total of N domains
+- A receive QP for each (remote) job - for a total of N * (N * J) recv QPs
+
+Each job:
+- An SRQ - for a total of N * J SRQs
+- A send QP for each (remote) node - for a total of N * (N * J) send QPs
+
+===================================================================
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index acc1b82..d7e3269 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -92,7 +92,8 @@ enum ibv_device_cap_flags {
 	IBV_DEVICE_SYS_IMAGE_GUID	= 1 << 11,
 	IBV_DEVICE_RC_RNR_NAK_GEN	= 1 << 12,
 	IBV_DEVICE_SRQ_RESIZE		= 1 << 13,
-	IBV_DEVICE_N_NOTIFY_CQ		= 1 << 14
+	IBV_DEVICE_N_NOTIFY_CQ		= 1 << 14,
+	IBV_DEVICE_SRC		        = 1 << 15
 };
 
 enum ibv_atomic_cap {
@@ -370,6 +371,11 @@ struct ibv_ah_attr {
 	uint8_t			port_num;
 };
 
+struct ibv_src_domain {
+	struct ibv_context     *context;
+	uint32_t		handle;
+};
+
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
 	IBV_SRQ_LIMIT	= 1 << 1
@@ -389,7 +395,8 @@ struct ibv_srq_init_attr {
 enum ibv_qp_type {
 	IBV_QPT_RC = 2,
 	IBV_QPT_UC,
-	IBV_QPT_UD
+	IBV_QPT_UD,
+	IBV_QPT_SRC
 };
 
 struct ibv_qp_cap {
@@ -408,6 +415,7 @@ struct ibv_qp_init_attr {
 	struct ibv_qp_cap	cap;
 	enum ibv_qp_type	qp_type;
 	int			sq_sig_all;
+	struct ibv_src_domain  *src_domain;
 };
 
 enum ibv_qp_attr_mask {
@@ -526,6 +534,7 @@ struct ibv_send_wr {
 			uint32_t	remote_qkey;
 		} ud;
 	} wr;
+	uint32_t		src_remote_srq_num;
 };
 
 struct ibv_recv_wr {
@@ -553,6 +562,10 @@ struct ibv_srq {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	uint32_t		src_srq_num;
+	struct ibv_src_domain  *src_domain;
+	struct ibv_cq	       *src_cq;
 };
 
 struct ibv_qp {
@@ -570,6 +583,8 @@ struct ibv_qp {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	struct ibv_src_domain  *src_domain;
 };
 
 struct ibv_comp_channel {
@@ -652,6 +667,8 @@ struct ibv_context_ops {
 	int			(*resize_cq)(struct ibv_cq *cq, int cqe);
 	int			(*destroy_cq)(struct ibv_cq *cq);
 	struct ibv_srq *	(*create_srq)(struct ibv_pd *pd,
+					      struct ibv_src_domain *src_domain,
+					      struct ibv_cq *src_cq,
 					      struct ibv_srq_init_attr *srq_init_attr);
 	int			(*modify_srq)(struct ibv_srq *srq,
 					      struct ibv_srq_attr *srq_attr,
@@ -680,6 +697,9 @@ struct ibv_context_ops {
 	int			(*detach_mcast)(struct ibv_qp *qp, union ibv_gid *gid,
 						uint16_t lid);
 	void			(*async_event)(struct ibv_async_event *event);
+	struct ibv_src_domain *	(*open_src_domain)(struct ibv_context *context,
+						int fd, int oflag);
+	int			(*close_src_domain)(struct ibv_src_domain *d);
 };
 
 struct ibv_context {
@@ -912,6 +932,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
 			       struct ibv_srq_init_attr *srq_init_attr);
 
 /**
+ * ibv_create_src_srq - Creates an SRQ associated with the specified protection
+ *   domain and SRC domain.
+ * @pd: The protection domain associated with the SRQ.
+ * @src_domain: The SRC domain associated with the SRQ.
+ * @src_cq: CQ to report completions for SRC packets on.
+ *
+ * @srq_init_attr: A list of initial attributes required to create the SRQ.
+ *
+ * srq_attr->max_wr and srq_attr->max_sge are read to determine the
+ * requested size of the SRQ, and set to the actual values allocated
+ * on return.  If ibv_create_src_srq() succeeds, then max_wr and max_sge
+ * will always be at least as large as the requested values.
+ */
+struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd,
+				   struct ibv_src_domain *src_domain,
+				   struct ibv_cq *src_cq,
+			           struct ibv_srq_init_attr *srq_init_attr);
+
+/**
  * ibv_modify_srq - Modifies the attributes for the specified SRQ.
  * @srq: The SRQ to modify.
  * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
@@ -1074,6 +1113,42 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
  */
 int ibv_fork_init(void);
 
+/**
+ * ibv_open_src_domain - open an SRC domain
+ * Returns a reference to an SRC domain.
+ *
+ * @context: Device context
+ * @fd: descriptor for inode associated with the domain
+ *     If fd == -1, no inode is associated with the domain; in this case,
+ *     the only legal value for oflag is O_CREAT
+ *
+ * @oflag: oflag values are constructed by OR-ing flags from the following list
+ *
+ * O_CREAT
+ *     If a domain belonging to the device named by context is already
+ *     associated with the inode, this flag has no effect, except as noted
+ *     under O_EXCL below. Otherwise, a new SRC domain is created and is
+ *     associated with the inode specified by fd.
+ * 
+ * O_EXCL
+ *     If O_EXCL and O_CREAT are set, open will fail if a domain associated with
+ *     the inode exists. The check for the existence of the domain and creation
+ *     of the domain if it does not exist is atomic with respect to other
+ *     processes executing open with fd naming the same inode.
+ */
+struct ibv_src_domain *ibv_open_src_domain(struct ibv_context *context,
+					   int fd, int oflag);
+
+/**
+ * ibv_close_src_domain - close an SRC domain
+ * If this is the last reference, destroys the domain.
+ * 
+ * @d: reference to SRC domain to close
+ *
+ * close is implicitly performed at process exit.
+ */
+int ibv_close_src_domain(struct ibv_src_domain *d);
+
 END_C_DECLS
 
 #  undef __attribute_const
diff --git a/src/libibverbs.map b/src/libibverbs.map
index 3a346ed..def9ee8 100644
--- a/src/libibverbs.map
+++ b/src/libibverbs.map
@@ -24,6 +24,7 @@ IBVERBS_1.0 {
 		ibv_get_cq_event;
 		ibv_ack_cq_events;
 		ibv_create_srq;
+		ibv_create_src_srq;
 		ibv_modify_srq;
 		ibv_query_srq;
 		ibv_destroy_srq;
@@ -35,6 +36,8 @@ IBVERBS_1.0 {
 		ibv_destroy_ah;
 		ibv_attach_mcast;
 		ibv_detach_mcast;
+		ibv_open_src_domain;
+		ibv_close_src_domain;
 		ibv_cmd_get_context;
 		ibv_cmd_query_device;
 		ibv_cmd_query_port;
diff --git a/src/verbs.c b/src/verbs.c
index f5cf4d3..a85f458 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -359,11 +359,39 @@ struct ibv_srq *__ibv_create_srq(struct ibv_pd *pd,
 	if (!pd->context->ops.create_srq)
 		return NULL;
 
-	srq = pd->context->ops.create_srq(pd, srq_init_attr);
+	srq = pd->context->ops.create_srq(pd, NULL, NULL, srq_init_attr);
 	if (srq) {
 		srq->context          = pd->context;
 		srq->srq_context      = srq_init_attr->srq_context;
 		srq->pd               = pd;
+		srq->src_domain       = NULL;
+		srq->src_cq           = NULL;
+		srq->events_completed = 0;
+		pthread_mutex_init(&srq->mutex, NULL);
+		pthread_cond_init(&srq->cond, NULL);
+	}
+
+	return srq;
+}
+default_symver(__ibv_create_srq, ibv_create_srq);
+
+struct ibv_srq *__ibv_create_src_srq(struct ibv_pd *pd,
+				     struct ibv_src_domain *src_domain,
+				     struct ibv_cq *src_cq,
+				     struct ibv_srq_init_attr *srq_init_attr)
+{
+	struct ibv_srq *srq;
+
+	if (!pd->context->ops.create_srq)
+		return NULL;
+
+	srq = pd->context->ops.create_srq(pd, src_domain, src_cq, srq_init_attr);
+	if (srq) {
+		srq->context          = pd->context;
+		srq->srq_context      = srq_init_attr->srq_context;
+		srq->pd               = pd;
+		srq->src_domain       = src_domain;
+		srq->src_cq           = src_cq;
 		srq->events_completed = 0;
 		pthread_mutex_init(&srq->mutex, NULL);
 		pthread_cond_init(&srq->cond, NULL);
@@ -541,3 +569,22 @@ int __ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid)
 	return qp->context->ops.detach_mcast(qp, gid, lid);
 }
 default_symver(__ibv_detach_mcast, ibv_detach_mcast);
+
+struct ibv_src_domain *__ibv_open_src_domain(struct ibv_context *context,
+					     int fd, int oflag)
+{
+	struct ibv_src_domain *d;
+
+	d = context->ops.open_src_domain(context, fd, oflag);
+	if (d)
+		d->context = context;
+
+	return d;
+}
+default_symver(__ibv_open_src_domain, ibv_open_src_domain);
+
+int __ibv_close_src_domain(struct ibv_src_domain *d)
+{
+	return d->context->ops.close_src_domain(d);
+}
+default_symver(__ibv_close_src_domain, ibv_close_src_domain);

-- 
MST


