[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Ishai Rabinovitz ishai at mellanox.co.il
Thu Jan 3 07:55:07 PST 2008


CQ, you are right.

And there is no race, because register and deregister are serialized in the kernel by the same spin lock.
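
To make the locking argument concrete, here is a minimal userspace sketch, not the kernel code. It assumes a pthread mutex in place of the kernel spin lock. The struct rcv_qp_model type and the model_* functions are hypothetical names invented for illustration, and returning -EINVAL for an unregister by a process that never registered is also an assumption. Because register and unregister take the same lock, a late register either sees the QP alive and joins it, or sees it already destroyed and fails.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REG 8

/* Hypothetical userspace model of one kernel-owned receive QP. */
struct rcv_qp_model {
    pthread_mutex_t lock;   /* stands in for the kernel spin lock */
    int alive;              /* 1 while the QP exists, 0 once destroyed */
    uint32_t qpn;           /* QP number handed to the sender */
    int pids[MAX_REG];      /* one entry per registered process */
    int npids;
};

/* Models alloc: creates the QP and implicitly registers the creator. */
static void model_alloc(struct rcv_qp_model *qp, uint32_t qpn, int creator_pid)
{
    assert(qp != NULL);
    pthread_mutex_init(&qp->lock, NULL);
    qp->alive = 1;
    qp->qpn = qpn;
    qp->pids[0] = creator_pid;
    qp->npids = 1;
}

/* Models register: fails with -EINVAL if the QP is gone or the number is wrong. */
static int model_register(struct rcv_qp_model *qp, uint32_t qpn, int pid)
{
    int rc = 0;

    pthread_mutex_lock(&qp->lock);
    if (!qp->alive || qp->qpn != qpn) {
        rc = -EINVAL;
    } else {
        int found = 0;
        for (int i = 0; i < qp->npids; i++)
            if (qp->pids[i] == pid)
                found = 1;
        /* A repeated register succeeds but keeps a single PID entry. */
        if (!found) {
            if (qp->npids == MAX_REG)
                rc = -ENOMEM;
            else
                qp->pids[qp->npids++] = pid;
        }
    }
    pthread_mutex_unlock(&qp->lock);
    return rc;
}

/* Models unregister: the last registrant leaving destroys the QP. */
static int model_unregister(struct rcv_qp_model *qp, uint32_t qpn, int pid)
{
    int rc = -EINVAL;

    pthread_mutex_lock(&qp->lock);
    if (qp->alive && qp->qpn == qpn) {
        for (int i = 0; i < qp->npids; i++) {
            if (qp->pids[i] == pid) {
                qp->pids[i] = qp->pids[--qp->npids];
                rc = 0;
                break;
            }
        }
        if (rc == 0 && qp->npids == 0)
            qp->alive = 0;  /* destroyed under the same lock */
    }
    pthread_mutex_unlock(&qp->lock);
    return rc;
}
```

In CQ's scenario, B's unregister drops the registrant count to zero and marks the QP destroyed, so C's later model_register() returns -EINVAL rather than racing with the teardown.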

So in the MPI implementation, when C finds out that the QP is no longer valid, it should send a reject back to A, and A should then ask C to open a new QP.
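
A rough sketch of that reject-and-retry handshake, under stated assumptions: plain function calls stand in for the MPI-level messages between A and C, and every name here (struct reply, receiver_handle_join, receiver_handle_alloc, sender_connect, the stub_* helpers) is hypothetical. The register and alloc callbacks stand in for ibv_xrc_rcv_qp_register() and ibv_xrc_rcv_qp_alloc(), with their real arguments omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum reply_kind { REPLY_ACCEPT, REPLY_REJECT };

struct reply {
    enum reply_kind kind;
    uint32_t rcv_qpn;       /* meaningful only for REPLY_ACCEPT */
};

/* Hypothetical stand-ins for the proposed register and alloc verbs. */
typedef int (*register_fn)(uint32_t qpn);
typedef int (*alloc_fn)(uint32_t *qpn_out);

/* Receiver (rank C): try to join the receive QP that the sender named. */
static struct reply receiver_handle_join(register_fn reg, uint32_t qpn)
{
    struct reply r = { REPLY_REJECT, 0 };

    if (reg(qpn) == 0) {
        r.kind = REPLY_ACCEPT;
        r.rcv_qpn = qpn;
    }
    return r;               /* a failed register becomes a reject */
}

/* Receiver (rank C): the sender asked for a brand-new receive QP. */
static struct reply receiver_handle_alloc(alloc_fn alloc)
{
    struct reply r = { REPLY_REJECT, 0 };
    uint32_t qpn = 0;

    if (alloc(&qpn) == 0) {
        r.kind = REPLY_ACCEPT;
        r.rcv_qpn = qpn;
    }
    return r;
}

/*
 * Sender (rank A): first ask C to join the QP number cached from B.
 * On a reject, ask C to open a new QP. Returns 0 and stores the usable
 * QP number, or -1 if both attempts fail.
 */
static int sender_connect(register_fn reg, alloc_fn alloc,
                          uint32_t cached_qpn, uint32_t *qpn_out)
{
    struct reply r;

    assert(qpn_out != NULL);
    r = receiver_handle_join(reg, cached_qpn);
    if (r.kind == REPLY_REJECT)
        r = receiver_handle_alloc(alloc);
    if (r.kind != REPLY_ACCEPT)
        return -1;
    *qpn_out = r.rcv_qpn;
    return 0;
}

/* Demonstration stubs: the cached QP is stale, and alloc always succeeds. */
static int stub_register_stale(uint32_t qpn)
{
    (void)qpn;
    return -EINVAL;
}

static int stub_alloc(uint32_t *qpn_out)
{
    *qpn_out = 0x800002u;   /* arbitrary illustrative QP number */
    return 0;
}
```

With the stale-QP stub, sender_connect() falls through to the alloc path and ends up with the new QP number, which A must then use in place of the one it learned from B.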

Ishai 

> -----Original Message-----
> From: Tang, Changqing [mailto:changquing.tang at hp.com] 
> Sent: Thursday, January 03, 2008 17:49
> To: Ishai Rabinovitz; panda at cse.ohio-state.edu; Jack 
> Morgenstein; Pavel Shamis
> Cc: Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
> Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP 
> independent of any one user process
> 
> 
> Thanks for the comment.
> 
> I have another issue after thinking more about the interface.
> 
> Rank A is the sender; ranks B and C are two ranks on a remote
> node. At first, B creates the receiving QP, makes a
> connection to A, and registers the QP number for receiving.
> A gets the receiving QP number from B. After some
> communication between A and B, B decides to close the
> connection and unregisters the QP number. Then A and C want
> to talk, so A tells C the receiving QP number, and C tries to
> register the QP number.
> 
> I suspect that by the time C tries to register the QP number,
> the receiving QP has already been destroyed by the kernel: when
> B unregisters the QP number, the reference count drops to zero,
> and the kernel cleans it up.
> 
> Am I right?
> 
> 
> --CQ
> 
> 
> 
> > -----Original Message-----
> > From: Ishai Rabinovitz [mailto:ishai at mellanox.co.il]
> > Sent: Thursday, January 03, 2008 2:59 AM
> > To: panda at cse.ohio-state.edu; Tang, Changqing; Jack Morgenstein;
> > Pavel Shamis
> > Cc: Gleb Natapov; Roland Dreier; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP 
> > independent of any one user process
> >
> > Please see my comments (prefix [Ishai])
> >
> > -----Original Message-----
> > From: Tang, Changqing [mailto:changquing.tang at hp.com]
> > Sent: Wednesday, January 02, 2008 17:27
> > To: Jack Morgenstein; Pavel Shamis
> > Cc: Ishai Rabinovitz; Gleb Natapov; Roland Dreier; 
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] [RFC] XRC -- make receiving XRC QP 
> > independent of any one user process
> >
> >
> > This interface is OK for me.
> >
> > Now, every rank on a node that wants to receive messages from the
> > same remote rank must know the same receiving QP number, and must
> > register for receiving using this QP number.
> >
> > If rank B does not register (the receiving QP has been created by
> > another rank A on the node), and the sender knows B's SRQ number,
> > can B still receive a message that the sender sends to it?
> > (I hope: no register, no receive.)
> >
> > [Ishai] I guess that from the MPI layer's perspective, the sender
> > cannot know B's SRQ number until it asks B for it. So B can
> > register with this QP before sending the SRQ number.
> >
> > I would like to hear the opinion of other MPI teams, or of other
> > XRC users.
> >
> > [Ishai] We already discussed these issues with the Open MPI IB
> > group, and it looks fine to them. I'm sending this mail to Prof.
> > Panda, so he can comment on it as well.
> >
> > --CQ
> >
> >
> >
> > > -----Original Message-----
> > > From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> > > Sent: Monday, December 31, 2007 5:40 AM
> > > To: pasha at mellanox.co.il
> > > Cc: ishai at mellanox.co.il; Gleb Natapov; Roland Dreier; Tang, 
> > > Changqing; general at lists.openfabrics.org
> > > Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP 
> > > independent of any one user process
> > >
> > > > Tang, Changqing wrote:
> > > > > >         If I have an MPI server process on a node, many other
> > > > > > MPI client processes will dynamically connect/disconnect with
> > > > > > the server. The server uses the same XRC domain.
> > > > > >
> > > > > >         Will this cause "kernel" QPs to accumulate for such an
> > > > > > application? We want the server to run 365 days a year.
> > > > > >
> > > > > > I have a question about the scenario above. Do you call MPI
> > > > > > disconnect on both ends (server/client) before the client
> > > > > > exits? (Must we do so?)
> > > > >
> > > > > Yes, both ends will call disconnect. But for us,
> > > > > MPI_Comm_disconnect() is not a collective call; it is just a
> > > > > local operation.
> > > >
> > > > --CQ
> > > >
> > > Possible solution (internal review as yet):
> > >
> > >   Each user process registers with the XRC QP:
> > > >     a. Each process registers ONCE. If it registers multiple times,
> > > >        there is no reference increment -- rather, the registration
> > > >        succeeds, but only one PID entry is kept per QP.
> > > >     b. Can have cleanup in the event of a process dying suddenly.
> > > >     c. The QP cannot be destroyed while there are any user processes
> > > >        still registered with it.
> > >
> > > libibverbs API is as follows:
> > >
> > > ==============================================================
> > > ========================
> > > /**
> > >  * ibv_xrc_rcv_qp_alloc - creates an XRC QP for serving as a 
> > > receive-side only QP,
> > >  *      and moves the created qp through the RESET->INIT and
> > > INIT->RTR transitions.
> > >  *      (The RTR->RTS transition is not needed, since this QP
> > > does no sending).
> > >  *      The sending XRC QP uses this QP as destination, while
> > > specifying an XRC SRQ
> > >  *      for actually receiving the transmissions and
> > > generating all completions on the
> > >  *      receiving side.
> > >  *
> > >  *      This QP is created in kernel space, and persists
> > > until the last process registered
> > >  *      for the QP calls ibv_xrc_rcv_qp_unregister() (at
> > > which time the QP is destroyed).
> > >  *
> > >  * @pd: protection domain to use.  At lower layer, this provides 
> > > access to userspace obj
> > >  * @xrc_domain: xrc domain to use for the QP.
> > >  * @attr: modify-qp attributes needed to bring the QP to RTR.
> > >  * @attr_mask:  bitmap indicating which attributes are
> > provided in the
> > > attr struct.
> > >  *      used for validity checking.
> > >  * @xrc_rcv_qpn: qp_num of created QP (if success). To be 
> passed to 
> > > the remote node (sender).
> > >  *               The remote node will use xrc_rcv_qpn in
> > > ibv_post_send when sending to
> > >  *               XRC SRQ's on this host in the same xrc domain.
> > >  *
> > >  * RETURNS: success (0), or a (negative) error value.
> > >  *
> > >  * NOTE: this verb also registers the calling user-process
> > with the QP
> > > at its creation time
> > >  *       (implicit call to ibv_xrc_rcv_qp_register), to avoid
> > > race conditions.
> > >  *       The creating process will need to call
> > > ibv_xrc_qp_unregister() for the QP to release it from
> > >  *       this process.
> > >  */
> > >
> > > int ibv_xrc_rcv_qp_alloc(struct ibv_pd *pd,
> > >                          struct ibv_xrc_domain *xrc_domain,
> > >                          struct ibv_qp_attr *attr,
> > >                          enum ibv_qp_attr_mask attr_mask,
> > >                          uint32_t *xrc_rcv_qpn);
> > >
> > >
> > 
> =====================================================================
> > >
> > > /**
> > >  * ibv_xrc_rcv_qp_register: registers a user process with 
> an XRC QP 
> > > which serves as
> > >  *         a receive-side only QP.
> > >  *
> > >  * @xrc_domain: xrc domain the QP belongs to (for verification).
> > >  * @xrc_qp_num: The (24 bit) number of the XRC QP.
> > >  *
> > >  * RETURNS: success (0),
> > >  *          or error (-EINVAL), if:
> > >  *            1. There is no such QP_num allocated.
> > >  *            2. The QP is allocated, but is not an receive XRC QP
> > >  *            3. The XRC QP does not belong to the given domain.
> > >  */
> > > int ibv_xrc_rcv_qp_register(struct ibv_xrc_domain *xrc_domain, 
> > > uint32_t xrc_qp_num);
> > >
> > >
> > 
> =====================================================================
> > > /**
> > >  * ibv_xrc_rcv_qp_unregister: detaches a user process from
> > an XRC QP
> > > serving as
> > >  *         a receive-side only QP. If as a result, there are
> > > no remaining userspace processes
> > >  *         registered for this XRC QP, it is destroyed.
> > >  *
> > >  * @xrc_domain: xrc domain the QP belongs to (for verification).
> > >  * @xrc_qp_num: The (24 bit) number of the XRC QP.
> > >  *
> > >  * RETURNS: success (0),
> > >  *          or error (-EINVAL), if:
> > >  *            1. There is no such QP_num allocated.
> > >  *            2. The QP is allocated, but is not an XRC QP
> > >  *            3. The XRC QP does not belong to the given domain.
> > >  * NOTE: I don't see any reason to return a special code if
> > the QP is
> > > destroyed -- the unregister simply
> > >  *       succeeds.
> > >  */
> > > int ibv_xrc_rcv_qp_unregister(struct ibv_xrc_domain *xrc_domain, 
> > > uint32_t xrc_qp_num); 
> > > ==============================================================
> > > ===============================
> > >
> > > Usage:
> > >
> > > > 1. Sender creates an XRC QP (sending QP).
> > > > 2. Sender sends some receiving process on a remote node (say R1) a
> > > >    request to provide an XRC QP and XRC SRQ for receiving messages
> > > >    (the request includes the sending QP number).
> > > > 3. R1 calls ibv_xrc_rcv_qp_alloc() to create a receiving XRC QP in
> > > >    kernel space, and move that QP up to RTR state. This function also
> > > >    registers process R1 with the XRC QP.
> > > > 4. R1 calls ibv_create_xrc_srq() to create an SRQ for receiving
> > > >    messages via the just-created XRC QP.
> > > > 5. R1 responds to the request, providing the XRC QP number and XRC SRQ
> > > >    number to be used in communication.
> > > > 6. Sender then may wish to communicate with another receiving process
> > > >    on the remote host (say R2). It sends a request to R2 containing
> > > >    the remote XRC QP number (obtained from R1) which it will use to
> > > >    send messages.
> > > > 7. R2 creates an XRC SRQ (if one does not already exist for the
> > > >    domain), and also calls ibv_xrc_rcv_qp_register() to register the
> > > >    process R2 with the XRC QP created by R1.
> > > > 8. If R1 no longer needs to communicate with the sender, it calls
> > > >    ibv_xrc_rcv_qp_unregister() for the QP. The QP will not yet be
> > > >    destroyed, since R2 is still registered with it.
> > > > 9. If R2 no longer needs to communicate with the sender, it calls
> > > >    ibv_xrc_rcv_qp_unregister() for the QP. At this point, the QP is
> > > >    destroyed, since no processes remain registered with it.
> > >
> > > NOTES:
> > > > 1. The problem of the QP being destroyed and quickly re-allocated
> > > >    does not exist -- the upper bits of the QP number are incremented
> > > >    at each allocation (except for the MSB, which is always 1 for XRC
> > > >    QPs). Thus, even if the same QP is re-allocated, its QP number
> > > >    (stored in the QP object) will be different than expected (unless
> > > >    it is re-destroyed/re-allocated several hundred times).
> > > >
> > > > 2. With this model, we do not need a heartbeat: if a receiving
> > > >    process dies, all XRC QPs it has registered for will be
> > > >    unregistered as part of process cleanup in kernel space.
> > >
> > > - Jack
> > >
> > >
> >
> 


