[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Ishai Rabinovitz ishai at mellanox.co.il
Thu Jan 3 00:59:11 PST 2008


Please see my comments (prefixed with [Ishai]).

-----Original Message-----
From: Tang, Changqing [mailto:changquing.tang at hp.com] 
Sent: Wednesday, January 02, 2008 17:27
To: Jack Morgenstein; Pavel Shamis
Cc: Ishai Rabinovitz; Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process


This interface is OK for me.

Now every rank on a node that wants to receive messages from the same remote rank must know the same receiving QP number and register for receiving using that QP number.

If rank B does not register (the receiving QP has been created by another rank A on the node) and the sender knows B's SRQ number, can B still receive a message
that the sender sends to it? (I hope: no register, no receive.)

[Ishai] I guess that, from the MPI layer's perspective, the sender cannot know B's SRQ number until it asks B for it. So B can register with this QP before sending the SRQ number.

I would like to hear the opinion of other MPI teams or other XRC users.

[Ishai] We already discussed these issues with the Open MPI IB group, and it looks fine to them. I'm sending this mail to Prof. Panda so he can comment on it as well.

--CQ



> -----Original Message-----
> From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> Sent: Monday, December 31, 2007 5:40 AM
> To: pasha at mellanox.co.il
> Cc: ishai at mellanox.co.il; Gleb Natapov; Roland Dreier; Tang, 
> Changqing; general at lists.openfabrics.org
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP 
> independent of any one user process
>
> > Tang, Changqing wrote:
> > >         If I have an MPI server process on a node, many other MPI
> > > client processes will dynamically connect to and disconnect from the
> > > server. The server uses the same XRC domain.
> > >
> > >         Will this cause "kernel" QPs to accumulate for such an
> > > application? We want the server to run 365 days a year.
> > >
> > > I have a question about the scenario above. Did you call the
> > > MPI disconnect on both ends (server/client) before the client
> > > exits (and must we do so)?
> >
> > Yes, both ends will call disconnect. But for us, the
> > MPI_Comm_disconnect() call is not a collective call; it is just a
> > local operation.
> >
> > --CQ
> >
> Possible solution (so far reviewed only internally):
>
>   Each user process registers with the XRC QP:
>     a. Each process registers ONCE. If it registers multiple times, there is
>        no reference increment -- rather the registration succeeds, but only
>        one PID entry is kept per QP.
>     b. Cleanup is possible in the event of a process dying suddenly.
>     c. The QP cannot be destroyed while there are any user processes still
>        registered with it.
>
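
The three registration rules above can be sketched as a small user-space toy model. This is only an illustration of the proposed semantics, not the kernel implementation: every toy_* name, the fixed-size PID table, and the -ENOMEM and "unregister of an unknown PID" behaviours are assumptions that do not appear in the RFC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define TOY_MAX_REGS 64   /* hypothetical cap, for this sketch only */

/* Toy model of one receive-side XRC QP and the processes registered with it. */
struct toy_xrc_rcv_qp {
    bool   alive;                 /* false once the QP has been destroyed */
    size_t nregs;                 /* number of distinct registered PIDs   */
    pid_t  pids[TOY_MAX_REGS];
};

/* Creation registers the creating process implicitly. */
static void toy_qp_init(struct toy_xrc_rcv_qp *qp, pid_t creator)
{
    assert(qp != NULL);
    qp->alive = true;
    qp->nregs = 1;
    qp->pids[0] = creator;
}

/* Returns the table index of pid, or -1 if it is not registered. */
static int toy_find(const struct toy_xrc_rcv_qp *qp, pid_t pid)
{
    for (size_t i = 0; i < qp->nregs; i++)
        if (qp->pids[i] == pid)
            return (int)i;
    return -1;
}

/* Rule (a): registering twice succeeds but keeps a single PID entry. */
static int toy_register(struct toy_xrc_rcv_qp *qp, pid_t pid)
{
    if (!qp->alive)
        return -EINVAL;           /* no such QP any more */
    if (toy_find(qp, pid) >= 0)
        return 0;                 /* already registered: no ref increment */
    if (qp->nregs == TOY_MAX_REGS)
        return -ENOMEM;           /* assumption: not specified in the RFC */
    qp->pids[qp->nregs++] = pid;
    return 0;
}

/* Rule (c): the QP is destroyed only when the last PID unregisters. */
static int toy_unregister(struct toy_xrc_rcv_qp *qp, pid_t pid)
{
    if (!qp->alive)
        return -EINVAL;
    int i = toy_find(qp, pid);
    if (i < 0)
        return -EINVAL;           /* assumption: not specified in the RFC */
    qp->pids[i] = qp->pids[--qp->nregs];   /* swap-remove the entry */
    if (qp->nregs == 0)
        qp->alive = false;        /* last registrant gone: destroy */
    return 0;
}
```

In this model a process that registers twice still needs only one unregister call, which is the practical consequence of keeping one PID entry per QP instead of a reference count.
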
> libibverbs API is as follows:
>
> =====================================================================
> /**
>  * ibv_xrc_rcv_qp_alloc - creates an XRC QP for serving as a receive-side
>  *      only QP, and moves the created QP through the RESET->INIT and
>  *      INIT->RTR transitions. (The RTR->RTS transition is not needed,
>  *      since this QP does no sending.)
>  *      The sending XRC QP uses this QP as destination, while specifying
>  *      an XRC SRQ for actually receiving the transmissions and generating
>  *      all completions on the receiving side.
>  *
>  *      This QP is created in kernel space, and persists until the last
>  *      process registered for the QP calls ibv_xrc_rcv_qp_unregister()
>  *      (at which time the QP is destroyed).
>  *
>  * @pd: protection domain to use.  At lower layer, this provides access to
>  *      userspace obj
>  * @xrc_domain: xrc domain to use for the QP.
>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>  * @attr_mask: bitmap indicating which attributes are provided in the attr
>  *      struct. Used for validity checking.
>  * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the
>  *      remote node (sender). The remote node will use xrc_rcv_qpn in
>  *      ibv_post_send when sending to XRC SRQs on this host in the same
>  *      xrc domain.
>  *
>  * RETURNS: success (0), or a (negative) error value.
>  *
>  * NOTE: this verb also registers the calling user process with the QP at
>  *       its creation time (implicit call to ibv_xrc_rcv_qp_register), to
>  *       avoid race conditions. The creating process will need to call
>  *       ibv_xrc_rcv_qp_unregister() for the QP to release it from this
>  *       process.
>  */
>
> int ibv_xrc_rcv_qp_alloc(struct ibv_pd *pd,
>                          struct ibv_xrc_domain *xrc_domain,
>                          struct ibv_qp_attr *attr,
>                          enum ibv_qp_attr_mask attr_mask,
>                          uint32_t *xrc_rcv_qpn);
>
> =====================================================================
>
> /**
>  * ibv_xrc_rcv_qp_register: registers a user process with an XRC QP which
>  *         serves as a receive-side only QP.
>  *
>  * @xrc_domain: xrc domain the QP belongs to (for verification).
>  * @xrc_qp_num: The (24 bit) number of the XRC QP.
>  *
>  * RETURNS: success (0),
>  *          or error (-EINVAL), if:
>  *            1. There is no such QP_num allocated.
>  *            2. The QP is allocated, but is not a receive XRC QP.
>  *            3. The XRC QP does not belong to the given domain.
>  */
> int ibv_xrc_rcv_qp_register(struct ibv_xrc_domain *xrc_domain,
>                             uint32_t xrc_qp_num);
>
> =====================================================================
> /**
>  * ibv_xrc_rcv_qp_unregister: detaches a user process from an XRC QP serving
>  *         as a receive-side only QP. If, as a result, there are no remaining
>  *         userspace processes registered for this XRC QP, it is destroyed.
>  *
>  * @xrc_domain: xrc domain the QP belongs to (for verification).
>  * @xrc_qp_num: The (24 bit) number of the XRC QP.
>  *
>  * RETURNS: success (0),
>  *          or error (-EINVAL), if:
>  *            1. There is no such QP_num allocated.
>  *            2. The QP is allocated, but is not an XRC QP
>  *            3. The XRC QP does not belong to the given domain.
>  * NOTE: I don't see any reason to return a special code if the QP is
>  *       destroyed -- the unregister simply succeeds.
>  */
> int ibv_xrc_rcv_qp_unregister(struct ibv_xrc_domain *xrc_domain,
>                               uint32_t xrc_qp_num);
> =====================================================================
>
> Usage:
>
> 1. Sender creates an XRC QP (sending QP).
> 2. Sender sends some receiving process on a remote node (say R1) a request
>    to provide an XRC QP and XRC SRQ for receiving messages (the request
>    includes the sending QP number).
> 3. R1 calls ibv_xrc_rcv_qp_alloc() to create a receiving XRC QP in kernel
>    space, and move that QP up to RTR state. This function also registers
>    process R1 with the XRC QP.
> 4. R1 calls ibv_create_xrc_srq() to create an SRQ for receiving messages
>    via the just-created XRC QP.
> 5. R1 responds to the request, providing the XRC QP number and XRC SRQ
>    number to be used in communication.
> 6. Sender then may wish to communicate with another receiving process on
>    the remote host (say R2). It sends a request to R2 containing the remote
>    XRC QP number (obtained from R1) which it will use to send messages.
> 7. R2 creates an XRC SRQ (if one does not already exist for the domain),
>    and also calls ibv_xrc_rcv_qp_register() to register the process R2 with
>    the XRC QP created by R1.
> 8. If R1 no longer needs to communicate with the sender, it calls
>    ibv_xrc_rcv_qp_unregister() for the QP. The QP will not yet be destroyed,
>    since R2 is still registered with it.
> 9. If R2 no longer needs to communicate with the sender, it calls
>    ibv_xrc_rcv_qp_unregister() for the QP. At this point, the QP is
>    destroyed, since no processes remain registered with it.
>
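
The request/response exchange in steps 2, 5 and 6 might be carried by messages shaped roughly as below. This is a sketch under stated assumptions: the RFC does not define any wire format, so the struct layouts, all toy_* names, and the use of 0 as a "no receive QP yet" sentinel are hypothetical. The sentinel leans on note 1 below, which says the MSB is always 1 for XRC QPs, so 0 can never be a receive XRC QP number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: the sender has no receive QP on that host yet. */
#define TOY_NO_RCV_QPN 0u

/* Steps 2 and 6: sender -> receiving process (R1 or R2). */
struct toy_conn_request {
    uint32_t sender_qpn;   /* number of the sending XRC QP                */
    uint32_t rcv_qpn;      /* TOY_NO_RCV_QPN, or the QP obtained from R1  */
};

/* Step 5: receiving process -> sender. */
struct toy_conn_reply {
    uint32_t rcv_qpn;      /* receive XRC QP number to use as destination */
    uint32_t srq_num;      /* XRC SRQ number of the replying process      */
};

/* What a receiving process should do on getting a request. */
enum toy_action {
    TOY_ALLOC_RCV_QP,          /* step 3: call ibv_xrc_rcv_qp_alloc()     */
    TOY_REGISTER_WITH_RCV_QP   /* step 7: call ibv_xrc_rcv_qp_register()  */
};

/* Step 2: first contact with a host, no receive QP exists for us yet. */
static struct toy_conn_request toy_first_request(uint32_t sender_qpn)
{
    struct toy_conn_request req = { sender_qpn, TOY_NO_RCV_QPN };
    return req;
}

/* Step 6: reuse the receive QP number that R1 returned in step 5. */
static struct toy_conn_request
toy_next_request(uint32_t sender_qpn, const struct toy_conn_reply *first_reply)
{
    assert(first_reply != NULL);
    struct toy_conn_request req = { sender_qpn, first_reply->rcv_qpn };
    return req;
}

/* Receiver side: allocate a new receive QP or register with an existing one. */
static enum toy_action toy_receiver_action(const struct toy_conn_request *req)
{
    assert(req != NULL);
    return req->rcv_qpn == TOY_NO_RCV_QPN ? TOY_ALLOC_RCV_QP
                                          : TOY_REGISTER_WITH_RCV_QP;
}
```

Whichever branch is taken, the receiver would reply with its own SRQ number only after the alloc or register call has succeeded, which matches the "no register, no receive" ordering discussed at the top of this mail.
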
> NOTES:
> 1. The problem of the QP being destroyed and quickly re-allocated does not
>    exist -- the upper bits of the QP number are incremented at each
>    allocation (except for the MSB, which is always 1 for XRC QPs). Thus,
>    even if the same QP is re-allocated, its QP number (stored in the QP
>    object) will be different than expected (unless it is
>    re-destroyed/re-allocated several hundred times).
>
> 2. With this model, we do not need a heartbeat: if a receiving process
>    dies, all XRC QPs it has registered for will be unregistered as part of
>    process cleanup in kernel space.
>
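
Note 1 can be illustrated with a sketch of how a 24-bit QP number might be composed from a fixed MSB, a per-slot generation counter in the upper bits, and a slot index in the lower bits. The RFC does not give the actual bit split, so the 14-bit index and 9-bit generation field here are assumptions, picked only so that the counter wraps after 512 reallocations, in line with "several hundred times"; the toy_* names are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_QPN_BITS   24u
#define TOY_INDEX_BITS 14u                                  /* assumed */
#define TOY_GEN_BITS   (TOY_QPN_BITS - 1u - TOY_INDEX_BITS) /* 9 bits  */

#define TOY_XRC_FLAG   (1u << (TOY_QPN_BITS - 1u))  /* MSB, always 1   */
#define TOY_INDEX_MASK ((1u << TOY_INDEX_BITS) - 1u)
#define TOY_GEN_MASK   ((1u << TOY_GEN_BITS) - 1u)

/*
 * Build the QP number for hardware slot `slot` on its `generation`-th
 * allocation. The generation field wraps, so the same number reappears
 * only after 2^TOY_GEN_BITS reallocations of the same slot.
 */
static uint32_t toy_make_qpn(uint32_t slot, uint32_t generation)
{
    assert(slot <= TOY_INDEX_MASK);
    return TOY_XRC_FLAG
         | ((generation & TOY_GEN_MASK) << TOY_INDEX_BITS)
         | slot;
}

/* A stale number names the right slot but carries an old generation. */
static int toy_qpn_is_current(uint32_t qpn, uint32_t current_generation)
{
    uint32_t slot = qpn & TOY_INDEX_MASK;
    return qpn == toy_make_qpn(slot, current_generation);
}
```

With this layout, a register or unregister call that presents a QP number from before a destroy and re-allocate fails the comparison against the number stored in the QP object, so it cannot attach to the new QP by accident.
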
> - Jack
>
>
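
The cleanup described in note 2 can be sketched as one pass over a table of receive QPs when a process exits. As before, this is a user-space toy model and not kernel code: the toy_* names, the fixed-size table, and the use of PID 0 as an empty-slot marker are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define TOY_MAX_PIDS 8    /* hypothetical per-QP cap, for this sketch only */

struct toy_rcv_qp {
    int   alive;                  /* 0 once the QP has been destroyed */
    pid_t pids[TOY_MAX_PIDS];     /* 0 marks an empty slot            */
};

/*
 * Unregister `pid` from every QP it is registered with, destroying any QP
 * that is left with no registered process. Returns the number of QPs
 * destroyed by this call.
 */
static size_t toy_process_exit(struct toy_rcv_qp *qps, size_t nqps, pid_t pid)
{
    size_t destroyed = 0;

    assert(pid != 0);             /* 0 is reserved as the empty marker */
    for (size_t q = 0; q < nqps; q++) {
        size_t remaining = 0;
        int was_registered = 0;

        if (!qps[q].alive)
            continue;
        for (size_t i = 0; i < TOY_MAX_PIDS; i++) {
            if (qps[q].pids[i] == pid) {
                qps[q].pids[i] = 0;       /* drop the dead process */
                was_registered = 1;
            } else if (qps[q].pids[i] != 0) {
                remaining++;
            }
        }
        if (was_registered && remaining == 0) {
            qps[q].alive = 0;             /* last registrant is gone */
            destroyed++;
        }
    }
    return destroyed;
}
```

Because the pass is driven by process exit, no surviving process needs to detect the failure, which is why the model can do without a heartbeat.
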


