[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Jack Morgenstein jackm at dev.mellanox.co.il
Mon Dec 31 03:39:40 PST 2007


> Tang, Changqing wrote:
> >         If I have an MPI server process on a node, many other MPI
> > client processes will dynamically connect to and disconnect from
> > the server. The server uses the same XRC domain.
> >
> >         Will this cause "kernel" QPs to accumulate for such an
> > application? We want the server to run 365 days a year.
> >
> > I have some questions about the scenario above. Did you call
> > MPI disconnect on both ends (server/client) before the
> > client exits? (Must we do so?)
> 
> Yes, both ends will call disconnect. But for us, MPI_Comm_disconnect() call
> is not a collective call, it is just a local operation.
> 
> --CQ
>
Possible solution (still under internal review):

  Each user process registers with the XRC QP:
    a. Each process registers ONCE. If it registers multiple times, there is no reference increment --
       rather, the registration succeeds, but only one entry per PID is kept for the QP.
    b. The kernel can clean up a registration if the process dies suddenly.
    c. The QP cannot be destroyed while any user processes are still registered with it.
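
A minimal user-space sketch of rules a-c above, as a toy model only. The struct,
the model_* function names, and the MAX_REG limit are hypothetical; they are not
part of the proposed API, and the real bookkeeping would live in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_REG 8               /* hypothetical per-QP limit, for this sketch only */

/* Toy model of one receive-side XRC QP and its registered processes. */
struct xrc_rcv_qp_model {
	bool   alive;           /* false once the QP has been destroyed */
	pid_t  pids[MAX_REG];   /* one entry per registered PID */
	size_t npids;
};

/* Rule (a): registering twice succeeds, but no second entry is added. */
static int model_register(struct xrc_rcv_qp_model *qp, pid_t pid)
{
	assert(qp != NULL);
	if (!qp->alive)
		return -1;
	for (size_t i = 0; i < qp->npids; i++)
		if (qp->pids[i] == pid)
			return 0;       /* already registered: no refcount */
	if (qp->npids == MAX_REG)
		return -1;
	qp->pids[qp->npids++] = pid;
	return 0;
}

/* Rule (c): the QP is destroyed only when the last registered PID leaves. */
static int model_unregister(struct xrc_rcv_qp_model *qp, pid_t pid)
{
	assert(qp != NULL);
	if (!qp->alive)
		return -1;
	for (size_t i = 0; i < qp->npids; i++) {
		if (qp->pids[i] == pid) {
			qp->pids[i] = qp->pids[--qp->npids];
			if (qp->npids == 0)
				qp->alive = false;  /* last one out destroys the QP */
			return 0;
		}
	}
	return -1;                      /* this PID was never registered */
}

/* Rule (b): cleanup for a process that died without unregistering. */
static void model_process_exit(struct xrc_rcv_qp_model *qp, pid_t pid)
{
	(void)model_unregister(qp, pid);
}
```

Because registration is a set membership rather than a count, a single unregister
(or a single process-exit cleanup) always removes the process completely.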

libibverbs API is as follows:

======================================================================================
/**
 * ibv_xrc_rcv_qp_alloc - creates an XRC QP for serving as a receive-side only QP,
 *	and moves the created qp through the RESET->INIT and INIT->RTR transitions.
 *      (The RTR->RTS transition is not needed, since this QP does no sending).
 * 	The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
 * 	for actually receiving the transmissions and generating all completions on the
 *	receiving side.
 *
 * 	This QP is created in kernel space, and persists until the last process registered
 *      for the QP calls ibv_xrc_rcv_qp_unregister() (at which time the QP is destroyed).
 *
 * @pd: protection domain to use.  At lower layer, this provides access to userspace obj
 * @xrc_domain: xrc domain to use for the QP.
 * @attr: modify-qp attributes needed to bring the QP to RTR.
 * @attr_mask:  bitmap indicating which attributes are provided in the attr struct.
 * 	used for validity checking.
 * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the remote node (sender).
 *		 The remote node will use xrc_rcv_qpn in ibv_post_send when sending to
 *		 XRC SRQ's on this host in the same xrc domain.
 *
 * RETURNS: success (0), or a (negative) error value.
 *
 * NOTE: this verb also registers the calling user-process with the QP at its creation time
 *       (implicit call to ibv_xrc_rcv_qp_register), to avoid race conditions.
 *       The creating process will need to call ibv_xrc_rcv_qp_unregister() for the QP to release it from
 *       this process.
 */

int ibv_xrc_rcv_qp_alloc(struct ibv_pd *pd,
			 struct ibv_xrc_domain *xrc_domain,
			 struct ibv_qp_attr *attr,
			 enum ibv_qp_attr_mask attr_mask,
			 uint32_t *xrc_rcv_qpn);

=====================================================================

/**
 * ibv_xrc_rcv_qp_register: registers a user process with an XRC QP which serves as
 *         a receive-side only QP.
 *
 * @xrc_domain: xrc domain the QP belongs to (for verification).
 * @xrc_qp_num: The (24 bit) number of the XRC QP.
 *
 * RETURNS: success (0), 
 *          or error (-EINVAL), if:
 *            1. There is no such QP_num allocated.
 *            2. The QP is allocated, but is not a receive XRC QP
 *            3. The XRC QP does not belong to the given domain.
 */
int ibv_xrc_rcv_qp_register(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num);

=====================================================================
/**
 * ibv_xrc_rcv_qp_unregister: detaches a user process from an XRC QP serving as
 *         a receive-side only QP. If as a result, there are no remaining userspace processes
 *	   registered for this XRC QP, it is destroyed.
 *
 * @xrc_domain: xrc domain the QP belongs to (for verification).
 * @xrc_qp_num: The (24 bit) number of the XRC QP.
 *
 * RETURNS: success (0), 
 *          or error (-EINVAL), if:
 *            1. There is no such QP_num allocated.
 *            2. The QP is allocated, but is not a receive XRC QP
 *            3. The XRC QP does not belong to the given domain.
 * NOTE: I don't see any reason to return a special code if the QP is destroyed -- the unregister simply
 *       succeeds.
 */
int ibv_xrc_rcv_qp_unregister(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num);
=============================================================================================

Usage:

1. Sender creates an XRC QP (sending QP)
2. Sender sends some receiving process on a remote node (say R1) a request to provide an XRC QP and XRC SRQ for
   receiving messages (the request includes the sending QP number).
3. R1 calls ibv_xrc_rcv_qp_alloc() to create a receiving XRC QP in kernel space, and move
   that QP up to RTR state. This function also registers process R1 with the XRC QP.
4. R1 calls ibv_create_xrc_srq() to create an SRQ for receiving messages via the just-created XRC QP.
5. R1 responds to request, providing the XRC qp number, and XRC SRQ number to be used in communication.
6. Sender then may wish to communicate with another receiving process on the remote host (say R2).
   It sends a request to R2 containing the remote XRC QP number (obtained from R1)
   which it will use to send messages.
7. R2 creates an XRC SRQ (if one does not already exist for the domain), and also
   calls ibv_xrc_rcv_qp_register() to register the process R2 with the XRC QP created by R1.
8. If R1 no longer needs to communicate with the sender, it calls ibv_xrc_rcv_qp_unregister() for the QP.
   The QP will not yet be destroyed, since R2 is still registered with it.
9. If R2 no longer needs to communicate with the sender, it calls ibv_xrc_rcv_qp_unregister() for the QP.
   At this point, the QP is destroyed, since no processes remain registered with it.
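
The receiver side of steps 2-7 might be sketched as below. This is only an
illustration under assumptions: the RFC defines no wire format, so the two
message structs, struct xrc_ops, handle_conn_req(), and the fake_* stubs are all
hypothetical. The stubs stand in for ibv_xrc_rcv_qp_alloc() and
ibv_xrc_rcv_qp_register() so that the sketch compiles without libibverbs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical request from the sender (steps 2 and 6). */
struct xrc_conn_req {
	uint32_t sender_qpn;  /* sending XRC QP number */
	uint32_t rcv_qpn;     /* 0: "create a receive QP"; else: reuse this one.
	                       * 0 is safe as a sentinel since XRC QPNs have the MSB set. */
};

/* Hypothetical reply from the receiver (step 5). */
struct xrc_conn_rep {
	uint32_t rcv_qpn;     /* XRC receive QP number the sender should target */
	uint32_t srq_num;     /* XRC SRQ number of the replying process */
};

/* Stand-ins for the proposed verbs (pd, xrc_domain, attr omitted for brevity). */
struct xrc_ops {
	int (*alloc)(uint32_t *rcv_qpn);  /* would wrap ibv_xrc_rcv_qp_alloc() */
	int (*reg)(uint32_t rcv_qpn);     /* would wrap ibv_xrc_rcv_qp_register() */
};

/*
 * R1's path (step 3) allocates the QP; R2's path (step 7) registers with the
 * QP that R1 created. A real handler would presumably also use
 * req->sender_qpn as the destination QP number when bringing the QP to RTR.
 */
static int handle_conn_req(const struct xrc_ops *ops,
			   const struct xrc_conn_req *req,
			   uint32_t my_srq_num,
			   struct xrc_conn_rep *rep)
{
	uint32_t qpn = req->rcv_qpn;
	int rc = (qpn == 0) ? ops->alloc(&qpn) : ops->reg(qpn);

	if (rc)
		return rc;
	rep->rcv_qpn = qpn;
	rep->srq_num = my_srq_num;
	return 0;
}

/* Fake verbs for demonstration: exactly one QP, number 0x800001, exists. */
static int fake_alloc(uint32_t *rcv_qpn) { *rcv_qpn = 0x800001; return 0; }
static int fake_reg(uint32_t rcv_qpn) { return rcv_qpn == 0x800001 ? 0 : -EINVAL; }
```

With these stubs, a request carrying rcv_qpn == 0 takes the alloc path, a request
carrying the number returned to the sender takes the register path, and a request
naming an unknown QP fails with -EINVAL, matching the error cases listed for
ibv_xrc_rcv_qp_register().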

NOTES:
1. The problem of the QP being destroyed and quickly re-allocated does not exist -- the upper bits of the
   QP number are incremented at each allocation (except for the MSB which is always 1 for XRC QPs).  Thus,
   even if the same QP is re-allocated, its QP number (stored in the QP object) will be different than
   expected (unless it is re-destroyed/re-allocated several hundred times).

2. With this model, we do not need a heartbeat: if a receiving process dies, all XRC QPs it has registered for will
   be unregistered as part of process cleanup in kernel space.
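
To illustrate note 1, here is one way such a QP number could be laid out. The
24-bit width and the always-set MSB come from the text above; the split into a
14-bit table index and a 9-bit generation counter is purely an assumption chosen
to give "several hundred" generations, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QPN_BITS    24u
#define XRC_FLAG    (1u << (QPN_BITS - 1))        /* MSB: always 1 for XRC QPs */
#define INDEX_BITS  14u                           /* assumption: QP table index width */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)
#define GEN_BITS    (QPN_BITS - 1 - INDEX_BITS)   /* 9 bits: 512 generations */
#define GEN_MASK    ((1u << GEN_BITS) - 1)

/* Compose a QP number from a table slot and an allocation generation. */
static uint32_t make_xrc_qpn(uint32_t index, uint32_t generation)
{
	assert(index <= INDEX_MASK);
	return XRC_FLAG | ((generation & GEN_MASK) << INDEX_BITS) | index;
}

/* QP number the same table slot receives on its next allocation. */
static uint32_t next_xrc_qpn(uint32_t qpn)
{
	uint32_t index = qpn & INDEX_MASK;
	uint32_t gen   = (qpn >> INDEX_BITS) & GEN_MASK;

	return make_xrc_qpn(index, gen + 1);
}
```

Under this layout, make_xrc_qpn(5, 0) is 0x800005 and the next allocation of the
same slot yields 0x804005, so a stale number held by a sender no longer matches.
The original number recurs only after the slot has been re-allocated 512 times.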

- Jack



