[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
Tang, Changqing
changquing.tang at hp.com
Thu Jan 3 09:50:14 PST 2008
OK, thanks for the clarification.
When can we test the code via OFED?
--CQ
> -----Original Message-----
> From: Ishai Rabinovitz [mailto:ishai at mellanox.co.il]
> Sent: Thursday, January 03, 2008 9:55 AM
> To: Tang, Changqing; panda at cse.ohio-state.edu; Jack
> Morgenstein; Pavel Shamis
> Cc: Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
> Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> CQ, you are right.
>
> And there is no race because the register and deregister are
> locked in the kernel using the same spin lock.
>
> So in the MPI implementation, when C finds out that the QP is
> no longer valid, it should send a reject back to A, and then
> A asks C to open a new QP as well.
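>
> A minimal sketch of that sender-side recovery, assuming hypothetical
> MPI-internal helpers for the message exchange (only the
> reject-then-reopen flow is from this thread):
>
> /* Hypothetical helpers: send_register_request(), recv_reply(),
>  * request_new_rcv_qp().  A asks C to register the QP number it
>  * cached from B; if C rejects (the QP was already destroyed), A
>  * asks C to open a fresh receiving QP and uses the new number. */
> uint32_t qpn = cached_rcv_qpn;            /* number A got from B */
> send_register_request(rank_c, qpn);
> if (recv_reply(rank_c) == REPLY_REJECT)
>         qpn = request_new_rcv_qp(rank_c); /* C allocates a new QP */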
>
> Ishai
>
> > -----Original Message-----
> > From: Tang, Changqing [mailto:changquing.tang at hp.com]
> > Sent: Thursday, January 03, 2008 5:49 PM
> > To: Ishai Rabinovitz; panda at cse.ohio-state.edu; Jack Morgenstein;
> > Pavel Shamis
> > Cc: Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > independent of any one user process
> >
> >
> > Thanks for the comment.
> >
> > Another issue I have, after thinking about the interface some more:
> >
> > Rank A is the sender; ranks B and C are two ranks on a remote
> > node. At first, B creates the receiving QP, makes a connection
> > to A, and registers the QP number for receiving, and A gets the
> > receiving QP number from B. After some communication between A
> > and B, B decides to close the connection and unregisters the QP
> > number. Then A and C want to talk, so A tells C the receiving
> > QP number, and C tries to register the QP number.
> >
> > I wonder whether, at the time C tries to register the QP number,
> > the receiving QP has already been destroyed by the kernel: when B
> > unregisters the QP number, the reference count drops to zero, and
> > the kernel will clean the QP up.
> >
> > Am I right?
> >
> >
> > --CQ
> >
> >
> >
> > > -----Original Message-----
> > > From: Ishai Rabinovitz [mailto:ishai at mellanox.co.il]
> > > Sent: Thursday, January 03, 2008 2:59 AM
> > > To: panda at cse.ohio-state.edu; Tang, Changqing; Jack
> > > Morgenstein; Pavel Shamis
> > > Cc: Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > > independent of any one user process
> > >
> > > Please see my comments (prefix [Ishai])
> > >
> > > -----Original Message-----
> > > From: Tang, Changqing [mailto:changquing.tang at hp.com]
> > > Sent: Wednesday, January 02, 2008 5:27 PM
> > > To: Jack Morgenstein; Pavel Shamis
> > > Cc: Ishai Rabinovitz; Gleb Natapov; Roland Dreier;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > > independent of any one user process
> > >
> > >
> > > This interface is OK for me.
> > >
> > > Now, every rank on a node that wants to receive messages from
> > > the same remote rank must know the same receiving QP number,
> > > and register for receiving using this QP number.
> > >
> > > If rank B does not register (the receiving QP has been created
> > > by another rank A on the node), and the sender knows B's SRQ
> > > number, can B still receive a message the sender sends to it?
> > > (I hope: no register, no receive.)
> > >
> > > [Ishai] I guess that, from the MPI layer's perspective, the
> > > sender cannot know B's SRQ number until it asks B for it. So B
> > > can register with this QP before sending the SRQ number.
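> > >
> > > A sketch of that ordering, with advertise_srq_number() as a
> > > hypothetical MPI-level helper (not part of the proposed verbs):
> > >
> > > /* Register with the receiving QP *before* advertising the SRQ
> > >  * number, so "no register, no receive" holds by construction. */
> > > if (ibv_xrc_rcv_qp_register(xrc_domain, rcv_qpn))
> > >         return -1;    /* do not advertise if registration fails */
> > > advertise_srq_number(sender_rank, my_srq_num);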
> > >
> > > I hope to hear opinions from the other MPI teams, or from other XRC users.
> > >
> > > [Ishai] We already discussed this issue with the Open MPI IB
> > > group, and it looks fine to them. I'm sending this mail to
> > > Prof. Panda, so he can comment on it as well.
> > >
> > > --CQ
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> > > > Sent: Monday, December 31, 2007 5:40 AM
> > > > To: pasha at mellanox.co.il
> > > > Cc: ishai at mellanox.co.il; Gleb Natapov; Roland Dreier; Tang,
> > > > Changqing; general at lists.openfabrics.org
> > > > Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > > > independent of any one user process
> > > >
> > > > > Tang, Changqing wrote:
> > > > > > If I have MPI server processes on a node, many other MPI
> > > > > > client processes will dynamically connect/disconnect with
> > > > > > the server. The server uses the same XRC domain.
> > > > > >
> > > > > > Will this cause "kernel" QPs to accumulate for such an
> > > > > > application? We want the server to run 365 days a year.
> > > > > >
> > > > > > I have a question about the scenario above: did you call
> > > > > > MPI disconnect on both ends (server/client) before the
> > > > > > client exits (must we do it)?
> > > > >
> > > > > Yes, both ends will call disconnect. But for us, the
> > > > > MPI_Comm_disconnect() call is not a collective call; it is
> > > > > just a local operation.
> > > > >
> > > > > --CQ
> > > > >
> > > > Possible solution (so far only internally reviewed):
> > > >
> > > > Each user process registers with the XRC QP:
> > > > a. Each process registers ONCE. If it registers multiple
> > > >    times, there is no reference increment -- rather, the
> > > >    registration succeeds, but only one PID entry is kept
> > > >    per QP.
> > > > b. Cleanup is possible in the event of a process dying
> > > >    suddenly.
> > > > c. The QP cannot be destroyed while any user processes are
> > > >    still registered with it.
> > > >
> > > > The libibverbs API is as follows:
> > > >
> > > > ======================================================================
> > > > /**
> > > > * ibv_xrc_rcv_qp_alloc - creates an XRC QP for serving as a
> > > > *   receive-side-only QP, and moves the created QP through the
> > > > *   RESET->INIT and INIT->RTR transitions. (The RTR->RTS
> > > > *   transition is not needed, since this QP does no sending.)
> > > > *   The sending XRC QP uses this QP as its destination, while
> > > > *   specifying an XRC SRQ for actually receiving the
> > > > *   transmissions and generating all completions on the
> > > > *   receiving side.
> > > > *
> > > > *   This QP is created in kernel space, and persists until the
> > > > *   last process registered for the QP calls
> > > > *   ibv_xrc_rcv_qp_unregister() (at which time the QP is
> > > > *   destroyed).
> > > > *
> > > > * @pd: protection domain to use. At the lower layer, this
> > > > *   provides access to the userspace object.
> > > > * @xrc_domain: xrc domain to use for the QP.
> > > > * @attr: modify-qp attributes needed to bring the QP to RTR.
> > > > * @attr_mask: bitmap indicating which attributes are provided
> > > > *   in the attr struct; used for validity checking.
> > > > * @xrc_rcv_qpn: qp_num of the created QP (on success). To be
> > > > *   passed to the remote node (sender). The remote node will
> > > > *   use xrc_rcv_qpn in ibv_post_send when sending to XRC SRQs
> > > > *   on this host in the same xrc domain.
> > > > *
> > > > * RETURNS: success (0), or a (negative) error value.
> > > > *
> > > > * NOTE: this verb also registers the calling user process with
> > > > *   the QP at its creation time (an implicit call to
> > > > *   ibv_xrc_rcv_qp_register), to avoid race conditions. The
> > > > *   creating process will need to call
> > > > *   ibv_xrc_rcv_qp_unregister() to release the QP from this
> > > > *   process.
> > > > */
> > > >
> > > > int ibv_xrc_rcv_qp_alloc(struct ibv_pd *pd,
> > > >                          struct ibv_xrc_domain *xrc_domain,
> > > >                          struct ibv_qp_attr *attr,
> > > >                          enum ibv_qp_attr_mask attr_mask,
> > > >                          uint32_t *xrc_rcv_qpn);
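> > > >
> > > > A possible call sequence on the receiving side (the attribute
> > > > values and mask below are illustrative assumptions; the verb
> > > > takes whatever the usual INIT/RTR transitions require):
> > > >
> > > > struct ibv_qp_attr attr = {
> > > >         .qp_state           = IBV_QPS_RTR,
> > > >         .path_mtu           = IBV_MTU_2048,
> > > >         .dest_qp_num        = remote_send_qpn, /* from sender */
> > > >         .rq_psn             = 0,
> > > >         .min_rnr_timer      = 12,
> > > >         .max_dest_rd_atomic = 4,
> > > >         .ah_attr = { .dlid = remote_lid, .port_num = 1 },
> > > > };
> > > > uint32_t rcv_qpn;
> > > > int err = ibv_xrc_rcv_qp_alloc(pd, xrc_domain, &attr,
> > > >                 IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
> > > >                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
> > > >                 IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER,
> > > >                 &rcv_qpn);
> > > > /* on success, pass rcv_qpn to the remote sender */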
> > > >
> > > >
> > > > ======================================================================
> > > >
> > > > /**
> > > > * ibv_xrc_rcv_qp_register: registers a user process with an
> > > > *   XRC QP which serves as a receive-side-only QP.
> > > > *
> > > > * @xrc_domain: xrc domain the QP belongs to (for verification).
> > > > * @xrc_qp_num: The (24-bit) number of the XRC QP.
> > > > *
> > > > * RETURNS: success (0),
> > > > *   or error (-EINVAL), if:
> > > > *     1. There is no such QP_num allocated.
> > > > *     2. The QP is allocated, but is not a receive XRC QP.
> > > > *     3. The XRC QP does not belong to the given domain.
> > > > */
> > > > int ibv_xrc_rcv_qp_register(struct ibv_xrc_domain *xrc_domain,
> > > >                             uint32_t xrc_qp_num);
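> > > >
> > > > For example, R2's registration (step 7 of the usage below) might
> > > > handle the -EINVAL case by falling back to the recovery discussed
> > > > earlier in this thread (request_new_rcv_qp() is a hypothetical
> > > > MPI-level helper):
> > > >
> > > > if (ibv_xrc_rcv_qp_register(xrc_domain, remote_rcv_qpn)) {
> > > >         /* QP already destroyed (e.g. R1 unregistered first):
> > > >          * reject, and have the sender restart with a new QP. */
> > > >         request_new_rcv_qp(sender_rank);
> > > > }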
> > > >
> > > >
> > > > ======================================================================
> > > > /**
> > > > * ibv_xrc_rcv_qp_unregister: detaches a user process from an
> > > > *   XRC QP serving as a receive-side-only QP. If, as a result,
> > > > *   there are no remaining userspace processes registered for
> > > > *   this XRC QP, it is destroyed.
> > > > *
> > > > * @xrc_domain: xrc domain the QP belongs to (for verification).
> > > > * @xrc_qp_num: The (24-bit) number of the XRC QP.
> > > > *
> > > > * RETURNS: success (0),
> > > > *   or error (-EINVAL), if:
> > > > *     1. There is no such QP_num allocated.
> > > > *     2. The QP is allocated, but is not an XRC QP.
> > > > *     3. The XRC QP does not belong to the given domain.
> > > > *
> > > > * NOTE: I don't see any reason to return a special code if the
> > > > *   QP is destroyed -- the unregister simply succeeds.
> > > > */
> > > > int ibv_xrc_rcv_qp_unregister(struct ibv_xrc_domain *xrc_domain,
> > > >                               uint32_t xrc_qp_num);
> > > > ======================================================================
> > > >
> > > > Usage:
> > > >
> > > > 1. Sender creates an XRC QP (sending QP).
> > > > 2. Sender sends some receiving process on a remote node (say
> > > >    R1) a request to provide an XRC QP and XRC SRQ for receiving
> > > >    messages (the request includes the sending QP number).
> > > > 3. R1 calls ibv_xrc_rcv_qp_alloc() to create a receiving XRC
> > > >    QP in kernel space, and move that QP up to the RTR state.
> > > >    This function also registers process R1 with the XRC QP.
> > > > 4. R1 calls ibv_create_xrc_srq() to create an SRQ for receiving
> > > >    messages via the just-created XRC QP.
> > > > 5. R1 responds to the request, providing the XRC QP number and
> > > >    XRC SRQ number to be used in communication.
> > > > 6. Sender then may wish to communicate with another receiving
> > > >    process on the remote host (say R2). It sends a request to
> > > >    R2 containing the remote XRC QP number (obtained from R1)
> > > >    which it will use to send messages.
> > > > 7. R2 creates an XRC SRQ (if one does not already exist for the
> > > >    domain), and also calls ibv_xrc_rcv_qp_register() to register
> > > >    process R2 with the XRC QP created by R1.
> > > > 8. If R1 no longer needs to communicate with the sender, it
> > > >    calls ibv_xrc_rcv_qp_unregister() for the QP. The QP will not
> > > >    yet be destroyed, since R2 is still registered with it.
> > > > 9. If R2 no longer needs to communicate with the sender, it
> > > >    calls ibv_xrc_rcv_qp_unregister() for the QP. At this point,
> > > >    the QP is destroyed, since no processes remain registered
> > > >    with it.
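> > > >
> > > > Put together, steps 3-9 on the receiving host might look like
> > > > the sketch below. The reply/reject helpers are hypothetical, and
> > > > the ibv_create_xrc_srq() argument list follows the XRC SRQ
> > > > patches under discussion:
> > > >
> > > > /* R1: steps 3-5 -- allocate the receive QP and an XRC SRQ */
> > > > uint32_t rcv_qpn;
> > > > if (ibv_xrc_rcv_qp_alloc(pd, xrc_domain, &attr, attr_mask,
> > > >                          &rcv_qpn))
> > > >         return -1;
> > > > struct ibv_srq *srq = ibv_create_xrc_srq(pd, xrc_domain, cq,
> > > >                                          &srq_init_attr);
> > > > reply_to_sender(rcv_qpn, srq);           /* hypothetical helper */
> > > >
> > > > /* R2: step 7 -- attach to the QP that R1 created */
> > > > if (ibv_xrc_rcv_qp_register(xrc_domain, rcv_qpn))
> > > >         reject_sender();                 /* hypothetical helper */
> > > >
> > > > /* Steps 8-9: each receiver detaches when done; the kernel
> > > >  * destroys the QP only when the last one unregisters. */
> > > > ibv_xrc_rcv_qp_unregister(xrc_domain, rcv_qpn);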
> > > >
> > > > NOTES:
> > > > 1. The problem of the QP being destroyed and quickly
> > > >    re-allocated does not exist -- the upper bits of the QP
> > > >    number are incremented at each allocation (except for the
> > > >    MSB, which is always 1 for XRC QPs). Thus, even if the same
> > > >    QP is re-allocated, its QP number (stored in the QP object)
> > > >    will be different than expected (unless it is
> > > >    re-destroyed/re-allocated several hundred times).
> > > >
> > > > 2. With this model, we do not need a heartbeat: if a receiving
> > > >    process dies, all XRC QPs it has registered for will be
> > > >    unregistered as part of process cleanup in kernel space.
> > > >
> > > > - Jack
> > > >
> > > >
> > >
> >
>