[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Pavel Shamis (Pasha) pasha at dev.mellanox.co.il
Thu Dec 20 07:14:48 PST 2007


Adding the Open MPI and MVAPICH communities to the thread.

Pasha (Pavel Shamis)

Jack Morgenstein wrote:
> background:  see "XRC Cleanup order issue thread" at
>
> 	http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>
> (A userspace process which created the receiving XRC QP on a given host dies before
> other processes which still need to receive XRC messages on their SRQs that are
> "paired" with the now-destroyed receiving XRC QP.)
>
> Solution: Add a userspace verb (as part of the XRC suite) which enables a user process
> to create a kernel-owned XRC QP belonging to the required XRC domain.
>
> This QP will be destroyed when the XRC domain is closed (i.e., as part of an
> ibv_close_xrc_domain call, but only when the domain's reference count goes to zero).
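>
> To illustrate the intended lifetime, a rough sketch (headers and error handling
> elided; ibv_open_xrc_domain/ibv_close_xrc_domain are the domain verbs from the
> existing XRC suite, ctx is an opened device context, and the file naming the
> domain is arbitrary):
>
> 	int fd = open("/tmp/xrc_domain", O_RDWR | O_CREAT, 0666);
> 	struct ibv_xrc_domain *xrcd = ibv_open_xrc_domain(ctx, fd, O_CREAT);
> 						/* domain refcount++ */
>
> 	/* ... allocate the kernel-owned rcv QP in xrcd with the new verb
> 	 *     below, create XRC SRQs, receive traffic ... */
>
> 	ibv_close_xrc_domain(xrcd);	/* refcount--; when it reaches zero,
> 					 * the kernel destroys the rcv QP */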
>
> Below, I give the new userspace API for this function.  Any feedback will be appreciated.
> This API will be implemented in the upcoming OFED 1.3 release, so we need feedback ASAP.
>
> Notes:
> 1. There is no query or destroy verb for this QP. There is also no userspace object for the
>    QP. Userspace has ONLY the raw QP number to use when creating the (X)RC connection.
>
> 2. Since the QP is "owned" by kernel space, async events for this QP are also handled in kernel
>    space (i.e., reported in /var/log/messages). There are no completion events for the QP, since
>    it does not send, and all receive completions are reported in the XRC SRQ's CQ.
>
>    If this QP enters the error state, the remote sending QP will start receiving RETRY_EXCEEDED
>    errors, so the sending application will be aware of the failure.
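>
>    For example, the sending side would see the failure as a retry-exceeded
>    completion when polling its send CQ (a minimal sketch; send_cq is the CQ
>    attached to the sending QP):
>
> 	struct ibv_wc wc;
>
> 	if (ibv_poll_cq(send_cq, 1, &wc) > 0 &&
> 	    wc.status == IBV_WC_RETRY_EXC_ERR) {
> 		/* the receiving XRC QP on the peer is gone or in error --
> 		 * tear down this connection and/or re-establish it */
> 	}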
>
> - Jack
> ======================================================================================
> /**
>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP to serve as a receive-side-only QP,
>  *	and moves the created QP through the RESET->INIT and INIT->RTR transitions.
>  *      (The RTR->RTS transition is not needed, since this QP does no sending).
>  * 	The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
>  * 	for actually receiving the transmissions and generating all completions on the
>  *	receiving side.
>  *
>  * 	This QP is created in kernel space, and persists until the XRC domain is closed
>  *	(i.e., until its reference count goes to zero).
>  *
>  * @pd: protection domain to use.  At the lower layer, this provides access to the
>  *	userspace object, and is used to determine the device (see Note 1 below).
>  * @xrc_domain: xrc domain to use for the QP.
>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>  * @attr_mask:  bitmap indicating which attributes are provided in the attr struct.
>  * 	used for validity checking.
>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed to the remote node.
>  *               The remote node will use xrc_rcv_qpn in ibv_post_send when sending to
>  *		 XRC SRQs on this host in the same XRC domain.
>  *
>  * RETURNS: success (0), or a (negative) error value.
>  */
>
> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
> 			 struct ibv_xrc_domain *xrc_domain,
> 			 struct ibv_qp_attr *attr,
> 			 enum ibv_qp_attr_mask attr_mask,
> 			 uint32_t *xrc_rcv_qpn);
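>
> For example, once xrc_rcv_qpn has been passed to the remote node (out of band),
> the sender uses it as the destination QP number when connecting its XRC send QP,
> and names the target SRQ in each work request (a sketch only -- the
> xrc_remote_srq_num send-WR field is from the XRC send API in this suite):
>
> 	struct ibv_qp_attr attr;
> 	struct ibv_send_wr wr, *bad_wr;
>
> 	/* on the remote (sending) node: connect the XRC send QP to the
> 	 * kernel-owned rcv QP allocated above */
> 	memset(&attr, 0, sizeof attr);
> 	attr.qp_state    = IBV_QPS_RTR;
> 	attr.dest_qp_num = xrc_rcv_qpn;	/* value returned by the new verb */
> 	/* ... ah_attr, path_mtu, rq_psn, etc. filled as for a normal RC QP ... */
> 	ibv_modify_qp(send_qp, &attr,
> 		      IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
> 		      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
> 		      IBV_QP_MIN_RNR_TIMER | IBV_QP_MAX_DEST_RD_ATOMIC);
>
> 	/* each send names the XRC SRQ that should receive it */
> 	memset(&wr, 0, sizeof wr);
> 	wr.opcode             = IBV_WR_SEND;
> 	wr.xrc_remote_srq_num = remote_srq_num;
> 	/* ... sg_list/num_sge set up as usual ... */
> 	ibv_post_send(send_qp, &wr, &bad_wr);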
>
> Notes:
>
> 1. Although the kernel creates the QP in the kernel's own PD, we still need the PD
>    parameter to determine the device.
>
> 2. I chose to reuse struct ibv_qp_attr, which is used in modify-QP, rather than create
>    a new structure for this purpose.  This also guards against API changes in case I
>    find during development that more modify-QP parameters must be specified for this
>    operation to work.
>
> 3. Table of the ibv_qp_attr parameters showing what values to set:
>
> struct ibv_qp_attr {
> 	enum ibv_qp_state	qp_state;		Not needed
> 	enum ibv_qp_state	cur_qp_state;		Not needed
> 		-- Driver starts from RESET and takes qp to RTR.
> 	enum ibv_mtu		path_mtu;		Yes
> 	enum ibv_mig_state	path_mig_state;		Yes
> 	uint32_t		qkey;			Yes
> 	uint32_t		rq_psn;			Yes
> 	uint32_t		sq_psn;			Not needed
> 	uint32_t		dest_qp_num;		Yes -- this is the remote side QP for the RC conn.
> 	int			qp_access_flags;	Yes
> 	struct ibv_qp_cap	cap;			Need only XRC domain. 
> 							Other caps will use hard-coded values:
> 								max_send_wr = 1;
> 								max_recv_wr = 0;
> 								max_send_sge = 1;
> 								max_recv_sge = 0;
> 								max_inline_data = 0;
> 	struct ibv_ah_attr	ah_attr;		Yes
> 	struct ibv_ah_attr	alt_ah_attr;		Optional
> 	uint16_t		pkey_index;		Yes
> 	uint16_t		alt_pkey_index;		Optional
> 	uint8_t			en_sqd_async_notify;	Not needed (No sq)
> 	uint8_t			sq_draining;		Not needed (No sq)
> 	uint8_t			max_rd_atomic;		Not needed (No sq)
> 	uint8_t			max_dest_rd_atomic;	Yes -- Total max outstanding RDMAs expected
> 							for ALL srq destinations using this receive QP.
> 							(if you are only using SENDs, this value can be 0).
> 	uint8_t			min_rnr_timer;		default - 0
> 	uint8_t			port_num;		Yes
> 	uint8_t			timeout;		Yes
> 	uint8_t			retry_cnt;		Yes
> 	uint8_t			rnr_retry;		Yes
> 	uint8_t			alt_port_num;		Optional
> 	uint8_t			alt_timeout;		Optional
> };
>
> 4. Attribute mask bits to set (userspace ibv_qp_attr_mask values, since this is the
>    userspace API):
> 	For the RESET-to-INIT transition:
> 		IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT
>
> 	For the INIT-to-RTR transition:
> 		IBV_QP_AV | IBV_QP_PATH_MTU |
> 		IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER
> 	   If you are using RDMA or atomics, also set:
> 		IBV_QP_MAX_DEST_RD_ATOMIC
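>
> Putting the table and the mask bits together, a receive-side call would look
> roughly like this (a sketch only -- pd and xrc_domain were obtained earlier, and
> remote_qp_num, remote_psn, and remote_lid stand for values obtained through the
> usual out-of-band exchange):
>
> 	struct ibv_qp_attr attr;
> 	uint32_t xrc_rcv_qpn;
> 	int ret;
>
> 	memset(&attr, 0, sizeof attr);
> 	attr.path_mtu           = IBV_MTU_2048;
> 	attr.path_mig_state     = IBV_MIG_MIGRATED;	/* no alternate path */
> 	attr.rq_psn             = remote_psn;
> 	attr.dest_qp_num        = remote_qp_num;	/* remote side's QP */
> 	attr.qp_access_flags    = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
> 	attr.ah_attr.dlid       = remote_lid;
> 	attr.ah_attr.port_num   = 1;
> 	attr.pkey_index         = 0;
> 	attr.max_dest_rd_atomic = 4;	/* 0 if only SENDs are used */
> 	attr.min_rnr_timer      = 0;
> 	attr.port_num           = 1;
> 	attr.timeout            = 14;
> 	attr.retry_cnt          = 7;
> 	attr.rnr_retry          = 7;
>
> 	ret = ibv_alloc_xrc_rcv_qp(pd, xrc_domain, &attr,
> 				   IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX |
> 				   IBV_QP_PORT | IBV_QP_AV | IBV_QP_PATH_MTU |
> 				   IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
> 				   IBV_QP_MIN_RNR_TIMER |
> 				   IBV_QP_MAX_DEST_RD_ATOMIC,
> 				   &xrc_rcv_qpn);
> 	if (ret == 0) {
> 		/* hand xrc_rcv_qpn to the remote node out of band */
> 	}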
>
>


-- 
Pavel Shamis (Pasha)
Mellanox Technologies



