[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
Ishai Rabinovitz
ishai at mellanox.co.il
Thu Jan 3 00:59:11 PST 2008
Please see my comments below (prefixed with [Ishai]).
-----Original Message-----
From: Tang, Changqing [mailto:changquing.tang at hp.com]
Sent: Wednesday, 02 January 2008 17:27
To: Jack Morgenstein; Pavel Shamis
Cc: Ishai Rabinovitz; Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
This interface is OK for me.
Now, every rank on a node that wants to receive messages from the same remote rank must know the same receiving QP number, and must register for receiving using this QP number.
If rank B does not register (the receiving QP has been created by another rank A on the node), and the sender knows B's SRQ number, can B still receive a message the sender sends to it? (I hope that with no registration, there is no receive.)
[Ishai] I guess that from the MPI layer's perspective, the sender cannot know B's SRQ number until it asks B for it. So B can register with this QP before sending the SRQ number.
I would like to hear the opinion of other MPI teams, or other XRC users.
[Ishai] We have already discussed these issues with the Open MPI IB group, and it looks fine to them. I'm sending this mail to Prof. Panda as well, so he can comment on it too.
--CQ
> -----Original Message-----
> From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> Sent: Monday, December 31, 2007 5:40 AM
> To: pasha at mellanox.co.il
> Cc: ishai at mellanox.co.il; Gleb Natapov; Roland Dreier; Tang,
> Changqing; general at lists.openfabrics.org
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> > Tang, Changqing wrote:
> > > If I have an MPI server process on a node, many other MPI client
> > > processes will dynamically connect/disconnect with the server.
> > > The server uses the same XRC domain.
> > >
> > > Will this cause "kernel" QPs to accumulate for such an
> > > application? We want the server to run 365 days a year.
> > >
> > > I have a question about the scenario above: did you call the MPI
> > > disconnect on both ends (server/client) before the client exits
> > > (must we do this)?
> >
> > Yes, both ends will call disconnect. But for us, the
> > MPI_Comm_disconnect() call is not a collective call; it is just a
> > local operation.
> >
> > --CQ
> >
> Possible solution (only internally reviewed as yet):
>
> Each user process registers with the XRC QP:
>   a. Each process registers ONCE. If it registers multiple times, there is
>      no reference-count increment -- rather, the registration succeeds, but
>      only one PID entry is kept per QP (see the short sketch after this list).
>   b. Cleanup is possible in the event of a process dying suddenly.
>   c. The QP cannot be destroyed while there are any user processes still
>      registered with it.
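>
> For example, a sketch of these registration semantics in terms of the API
> proposed below (the qpn variable and the surrounding error handling are
> illustrative only):
>
>     ibv_xrc_rcv_qp_register(xrc_domain, qpn);   /* first call: PID entry added   */
>     ibv_xrc_rcv_qp_register(xrc_domain, qpn);   /* succeeds again, but no second
>                                                    entry and no ref-count bump   */
>     /* ... later, a single unregister is enough to release this process; the
>      * QP itself is destroyed only when no registered processes remain. */
>     ibv_xrc_rcv_qp_unregister(xrc_domain, qpn);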
>
> libibverbs API is as follows:
>
> ======================================================================
> /**
>  * ibv_xrc_rcv_qp_alloc - creates an XRC QP for serving as a receive-side-only
>  *      QP, and moves the created QP through the RESET->INIT and INIT->RTR
>  *      transitions. (The RTR->RTS transition is not needed, since this QP does
>  *      no sending.) The sending XRC QP uses this QP as its destination, while
>  *      specifying an XRC SRQ for actually receiving the transmissions and
>  *      generating all completions on the receiving side.
>  *
>  *      This QP is created in kernel space, and persists until the last process
>  *      registered for the QP calls ibv_xrc_rcv_qp_unregister() (at which time
>  *      the QP is destroyed).
>  *
>  * @pd: protection domain to use. At the lower layer, this provides access to
>  *      the userspace object.
>  * @xrc_domain: xrc domain to use for the QP.
>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>  * @attr_mask: bitmap indicating which attributes are provided in the attr
>  *      struct; used for validity checking.
>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed to the
>  *      remote node (sender). The remote node will use xrc_rcv_qpn in
>  *      ibv_post_send when sending to XRC SRQs on this host in the same
>  *      xrc domain.
>  *
>  * RETURNS: success (0), or a (negative) error value.
>  *
>  * NOTE: this verb also registers the calling user process with the QP at its
>  *      creation time (an implicit call to ibv_xrc_rcv_qp_register), to avoid
>  *      race conditions. The creating process will need to call
>  *      ibv_xrc_rcv_qp_unregister() on the QP to release it from this process.
>  */
>
> int ibv_xrc_rcv_qp_alloc(struct ibv_pd *pd,
>                          struct ibv_xrc_domain *xrc_domain,
>                          struct ibv_qp_attr *attr,
>                          enum ibv_qp_attr_mask attr_mask,
>                          uint32_t *xrc_rcv_qpn);
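>
> For illustration, a rough sketch of how a receiving process might call this
> proposed verb. The attribute set (covering both the INIT and RTR transitions),
> the variable names, and the sender details are assumptions, not part of the
> proposal itself:
>
>     struct ibv_qp_attr attr = {
>             .pkey_index         = 0,
>             .port_num           = 1,
>             .qp_access_flags    = 0,
>             .path_mtu           = IBV_MTU_1024,
>             .dest_qp_num        = sender_qpn,   /* sending XRC QP on the remote node */
>             .rq_psn             = 0,
>             .max_dest_rd_atomic = 1,
>             .min_rnr_timer      = 12,
>             .ah_attr            = { .dlid = sender_lid, .port_num = 1 },
>     };
>     enum ibv_qp_attr_mask mask =
>             IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS |
>             IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
>             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
>     uint32_t xrc_rcv_qpn;
>
>     if (ibv_xrc_rcv_qp_alloc(pd, xrc_domain, &attr, mask, &xrc_rcv_qpn) == 0) {
>             /* success: pass xrc_rcv_qpn (plus an XRC SRQ number) back to the
>              * sender, e.g. over the MPI out-of-band channel */
>     }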
>
> =====================================================================
>
> /**
>  * ibv_xrc_rcv_qp_register: registers a user process with an XRC QP which
>  *      serves as a receive-side-only QP.
>  *
>  * @xrc_domain: xrc domain the QP belongs to (for verification).
>  * @xrc_qp_num: The (24 bit) number of the XRC QP.
>  *
>  * RETURNS: success (0),
>  *      or error (-EINVAL), if:
>  *      1. There is no such QP_num allocated.
>  *      2. The QP is allocated, but is not a receive XRC QP.
>  *      3. The XRC QP does not belong to the given domain.
>  */
> int ibv_xrc_rcv_qp_register(struct ibv_xrc_domain *xrc_domain,
>                             uint32_t xrc_qp_num);
>
> =====================================================================
> /**
>  * ibv_xrc_rcv_qp_unregister: detaches a user process from an XRC QP serving
>  *      as a receive-side-only QP. If, as a result, there are no remaining
>  *      userspace processes registered for this XRC QP, it is destroyed.
>  *
>  * @xrc_domain: xrc domain the QP belongs to (for verification).
>  * @xrc_qp_num: The (24 bit) number of the XRC QP.
>  *
>  * RETURNS: success (0),
>  *      or error (-EINVAL), if:
>  *      1. There is no such QP_num allocated.
>  *      2. The QP is allocated, but is not an XRC QP.
>  *      3. The XRC QP does not belong to the given domain.
>  *
>  * NOTE: I don't see any reason to return a special code if the QP is
>  *      destroyed -- the unregister simply succeeds.
>  */
> int ibv_xrc_rcv_qp_unregister(struct ibv_xrc_domain *xrc_domain,
>                               uint32_t xrc_qp_num);
> ======================================================================
>
> Usage (a rough C sketch of the receiver side follows the numbered steps):
>
> 1. Sender creates an XRC QP (sending QP).
> 2. Sender sends some receiving process on a remote node (say R1) a request to
>    provide an XRC QP and XRC SRQ for receiving messages (the request includes
>    the sending QP number).
> 3. R1 calls ibv_xrc_rcv_qp_alloc() to create a receiving XRC QP in kernel
>    space, and move that QP up to the RTR state. This function also registers
>    process R1 with the XRC QP.
> 4. R1 calls ibv_create_xrc_srq() to create an SRQ for receiving messages via
>    the just-created XRC QP.
> 5. R1 responds to the request, providing the XRC QP number and XRC SRQ number
>    to be used in communication.
> 6. The sender may then wish to communicate with another receiving process on
>    the remote host (say R2). It sends a request to R2 containing the remote
>    XRC QP number (obtained from R1) which it will use to send messages.
> 7. R2 creates an XRC SRQ (if one does not already exist for the domain), and
>    also calls ibv_xrc_rcv_qp_register() to register the process R2 with the
>    XRC QP created by R1.
> 8. If R1 no longer needs to communicate with the sender, it calls
>    ibv_xrc_rcv_qp_unregister() for the QP. The QP will not yet be destroyed,
>    since R2 is still registered with it.
> 9. If R2 no longer needs to communicate with the sender, it calls
>    ibv_xrc_rcv_qp_unregister() for the QP. At this point, the QP is destroyed,
>    since no processes remain registered with it.
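>
> As a rough C sketch of the receiver side (R1 and R2) under the proposed API;
> the ibv_create_xrc_srq() argument list, the attr/attr_mask contents, and the
> reply_to_sender() out-of-band exchange are illustrative assumptions only:
>
>     /* R1: steps 3-5 -- allocate the shared receive QP (this also registers R1) */
>     uint32_t rcv_qpn;
>     if (ibv_xrc_rcv_qp_alloc(pd, xrc_domain, &attr, attr_mask, &rcv_qpn))
>             return -1;
>
>     struct ibv_srq *srq = ibv_create_xrc_srq(pd, xrc_domain, cq, &srq_init_attr);
>     if (!srq)
>             return -1;
>     reply_to_sender(rcv_qpn, srq);   /* hypothetical MPI-level reply carrying
>                                         the QP number and the SRQ number */
>
>     /* R2: step 7 -- reuse R1's receive QP; just register this process with it */
>     if (ibv_xrc_rcv_qp_register(xrc_domain, rcv_qpn))
>             return -1;
>
>     /* Steps 8-9: each process drops its reference when it is done with the
>      * sender. The kernel destroys the QP only after the last registered
>      * process has unregistered (or died and been cleaned up). */
>     ibv_xrc_rcv_qp_unregister(xrc_domain, rcv_qpn);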
>
> NOTES:
> 1. The problem of the QP being destroyed and quickly re-allocated does not
>    exist -- the upper bits of the QP number are incremented at each allocation
>    (except for the MSB, which is always 1 for XRC QPs). Thus, even if the same
>    QP is re-allocated, its QP number (stored in the QP object) will be
>    different than expected (unless it is re-destroyed/re-allocated several
>    hundred times).
>
> 2. With this model, we do not need a heartbeat: if a receiving process dies,
>    all XRC QPs it has registered for will be unregistered as part of process
>    cleanup in kernel space.
>
> - Jack
>
>