[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Gleb Natapov glebn at voltaire.com
Mon Dec 24 22:43:06 PST 2007


On Mon, Dec 24, 2007 at 11:49:37PM +0000, Tang, Changqing wrote:
> 
> 
> > -----Original Message-----
> > From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
> > Sent: Monday, December 24, 2007 8:03 AM
> > To: Tang, Changqing
> > Cc: Jack Morgenstein; Roland Dreier;
> > general at lists.openfabrics.org; Open MPI Developers;
> > mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > independent of any one user process
> >
> > Hi CQ,
> > Tang, Changqing wrote:
> > >         If I have a MPI server processes on a node, many other MPI
> > > client processes will dynamically connect/disconnect with
> > the server. The server use same XRC domain.
> > >
> > >         Will this cause accumulating the "kernel" QP for such
> > > application ? we want the server to run 365 days a year.
> > >
> > I have some question about the scenario above. Did you call
> > for the mpi disconnect on the both ends (server/client)
> > before the client exit (did we must to do it?)
> 
> Yes, both ends will call disconnect. But for us, MPI_Comm_disconnect() call
> is not a collective call, it is just a local operation.
But the spec says that MPI_Comm_disconnect() is a collective call:
http://www.mpi-forum.org/docs/mpi-20-html/node114.htm#Node114
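Something along these lines is what the standard has in mind; the service
name and the publish/lookup mechanism below are only illustrative, and the
server has to make the matching MPI_Comm_disconnect() call on its side of
the intercommunicator:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm server;
        char port[MPI_MAX_PORT_NAME];

        MPI_Init(&argc, &argv);

        /* client side: attach to a long-running server */
        MPI_Lookup_name("xrc_server", MPI_INFO_NULL, port);  /* illustrative name */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);

        /* ... exchange messages with the server over 'server' ... */

        /* collective over the intercommunicator: the server must call
         * MPI_Comm_disconnect() on its side as well */
        MPI_Comm_disconnect(&server);

        MPI_Finalize();
        return 0;
    }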

> 
> --CQ
> 
> 
> >
> > Regards,
> > Pasha.
> > >
> > > Thanks.
> > > --CQ
> > >
> > >
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
> > >> Sent: Thursday, December 20, 2007 9:15 AM
> > >> To: Jack Morgenstein
> > >> Cc: Tang, Changqing; Roland Dreier;
> > >> general at lists.openfabrics.org; Open MPI Developers;
> > >> mvapich-discuss at cse.ohio-state.edu
> > >> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > >> independent of any one user process
> > >>
> > >> Adding Open MPI and MVAPICH community to the thread.
> > >>
> > >> Pasha (Pavel Shamis)
> > >>
> > >> Jack Morgenstein wrote:
> > >>
> > >>> background:  see "XRC Cleanup order issue thread" at
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > http://lists.openfabrics.org/pipermail/general/2007-December/043935.h
> > >> t
> > >>
> > >>> ml
> > >>>
> > >>> (userspace process which created the receiving XRC qp on a
> > >>>
> > >> given host
> > >>
> > >>> dies before other processes which still need to receive XRC
> > >>>
> > >> messages
> > >>
> > >>> on their SRQs which are "paired" with the now-destroyed
> > >>>
> > >> receiving XRC
> > >>
> > >>> QP.)
> > >>>
> > >>> Solution: Add a userspace verb (as part of the XRC suite) which
> > >>> enables the user process to create an XRC QP owned by the
> > >>>
> > >> kernel -- which belongs to the required XRC domain.
> > >>
> > >>> This QP will be destroyed when the XRC domain is closed
> > >>>
> > >> (i.e., as part
> > >>
> > >>> of a ibv_close_xrc_domain call, but only when the domain's
> > >>>
> > >> reference count goes to zero).
> > >>
> > >>> Below, I give the new userspace API for this function.  Any
> > >>>
> > >> feedback will be appreciated.
> > >>
> > >>> This API will be implemented in the upcoming OFED 1.3
> > >>>
> > >> release, so we need feedback ASAP.
> > >>
> > >>> Notes:
> > >>> 1. There is no query or destroy verb for this QP. There is also no
> > >>>    userspace object for the QP. Userspace has ONLY the raw qp number
> > >>>    to use when creating the (X)RC connection.
> > >>> 2. Since the QP is "owned" by kernel space, async events for this QP
> > >>>    are also handled in kernel space (i.e., reported in
> > >>>    /var/log/messages). There are no completion events for the QP,
> > >>>    since it does not send, and all receive completions are reported
> > >>>    in the XRC SRQ's cq.
> > >>>    If this QP enters the error state, the remote QP which sends will
> > >>>    start receiving RETRY_EXCEEDED errors, so the application will be
> > >>>    aware of the failure.
> > >>>
> > >>> - Jack
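As an aside, on the sending side that failure would surface as failed send
completions.  A minimal sketch of the detection; the CQ argument and the
reporting are assumptions of mine, not part of the proposal:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Poll the send CQ of the XRC QP that targets the kernel-owned receive
     * QP and watch for the RETRY_EXCEEDED errors described above. */
    static void check_send_completions(struct ibv_cq *send_cq)
    {
            struct ibv_wc wc;
            int n;

            while ((n = ibv_poll_cq(send_cq, 1, &wc)) > 0) {
                    if (wc.status == IBV_WC_RETRY_EXC_ERR)
                            /* remote receive QP is in the error state (or its
                             * XRC domain was closed) -- re-establish or bail */
                            fprintf(stderr,
                                    "wr %llu: retry exceeded, remote rcv QP unreachable\n",
                                    (unsigned long long) wc.wr_id);
            }
            if (n < 0)
                    fprintf(stderr, "ibv_poll_cq failed\n");
    }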
> > >>>
> > >>>
> > >>> ======================================================================
> > >>> /**
> > >>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a
> > >>>  *    receive-side only QP, and moves the created qp through the
> > >>>  *    RESET->INIT and INIT->RTR transitions.
> > >>>  *      (The RTR->RTS transition is not needed, since this QP does
> > >>>  *      no sending).
> > >>>  *    The sending XRC QP uses this QP as destination, while
> > >>>  *    specifying an XRC SRQ for actually receiving the transmissions
> > >>>  *    and generating all completions on the receiving side.
> > >>>  *
> > >>>  *    This QP is created in kernel space, and persists until the XRC
> > >>>  *    domain is closed.
> > >>>  *    (i.e., its reference count goes to zero).
> > >>>  *
> > >>>  * @pd: protection domain to use.  At lower layer, this provides
> > >>>  *    access to userspace obj
> > >>>  * @xrc_domain: xrc domain to use for the QP.
> > >>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
> > >>>  * @attr_mask:  bitmap indicating which attributes are provided in
> > >>>  *    the attr struct.  used for validity checking.
> > >>>  * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to
> > >>>  *    the remote node. The remote node will use xrc_rcv_qpn in
> > >>>  *    ibv_post_send when sending to XRC SRQ's on this host in the
> > >>>  *    same xrc domain.
> > >>>  *
> > >>>  * RETURNS: success (0), or a (negative) error value.
> > >>>  */
> > >>>
> > >>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
> > >>>                        struct ibv_xrc_domain *xrc_domain,
> > >>>                        struct ibv_qp_attr *attr,
> > >>>                        enum ibv_qp_attr_mask attr_mask,
> > >>>                        uint32_t *xrc_rcv_qpn);
> > >>>
> > >>> Notes:
> > >>>
> > >>> 1. Although the kernel creates the qp in the kernel's own PD, we
> > >>>    still need the PD parameter to determine the device.
> > >>>
> > >>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
> > >>>    rather than create a new structure for this purpose.  This also
> > >>>    guards against API changes in the event that during development I
> > >>>    notice that more modify-qp parameters must be specified for this
> > >>>    operation to work.
> > >>>
> > >>> 3. Table of the ibv_qp_attr parameters showing what values to set:
> > >>>
> > >>> struct ibv_qp_attr {
> > >>>       enum ibv_qp_state       qp_state;               Not needed
> > >>>       enum ibv_qp_state       cur_qp_state;           Not needed
> > >>>               -- Driver starts from RESET and takes qp to RTR.
> > >>>       enum ibv_mtu            path_mtu;               Yes
> > >>>       enum ibv_mig_state      path_mig_state;         Yes
> > >>>       uint32_t                qkey;                   Yes
> > >>>       uint32_t                rq_psn;                 Yes
> > >>>       uint32_t                sq_psn;                 Not needed
> > >>>       uint32_t                dest_qp_num;            Yes
> > >>>               -- this is the remote side QP for the RC conn.
> > >>>       int                     qp_access_flags;        Yes
> > >>>       struct ibv_qp_cap       cap;                    Need only XRC domain.
> > >>>               Other caps will use hard-coded values:
> > >>>                   max_send_wr = 1;
> > >>>                   max_recv_wr = 0;
> > >>>                   max_send_sge = 1;
> > >>>                   max_recv_sge = 0;
> > >>>                   max_inline_data = 0;
> > >>>       struct ibv_ah_attr      ah_attr;                Yes
> > >>>       struct ibv_ah_attr      alt_ah_attr;            Optional
> > >>>       uint16_t                pkey_index;             Yes
> > >>>       uint16_t                alt_pkey_index;         Optional
> > >>>       uint8_t                 en_sqd_async_notify;    Not needed (No sq)
> > >>>       uint8_t                 sq_draining;            Not needed (No sq)
> > >>>       uint8_t                 max_rd_atomic;          Not needed (No sq)
> > >>>       uint8_t                 max_dest_rd_atomic;     Yes
> > >>>               -- Total max outstanding RDMAs expected for ALL srq
> > >>>               destinations using this receive QP.
> > >>>               (if you are only using SENDs, this value can be 0).
> > >>>       uint8_t                 min_rnr_timer;          default - 0
> > >>>       uint8_t                 port_num;               Yes
> > >>>       uint8_t                 timeout;                Yes
> > >>>       uint8_t                 retry_cnt;              Yes
> > >>>       uint8_t                 rnr_retry;              Yes
> > >>>       uint8_t                 alt_port_num;           Optional
> > >>>       uint8_t                 alt_timeout;            Optional
> > >>> };
> > >>>
> > >>> 4. Attribute mask bits to set:
> > >>>       For RESET_to_INIT transition:
> > >>>               IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
> > >>>
> > >>>       For INIT_to_RTR transition:
> > >>>               IB_QP_AV | IB_QP_PATH_MTU |
> > >>>               IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
> > >>>          If you are using RDMA or atomics, also set:
> > >>>               IB_QP_MAX_DEST_RD_ATOMIC
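For concreteness, this is roughly how I would expect a caller to drive the
proposed verb, following the table in note 3 and the mask bits in note 4.
The pd/XRC-domain setup, the helper name, and all concrete attribute values
below are placeholders of mine, not part of the proposal:

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>   /* plus the XRC extensions from the OFED 1.3 patches */

    /*
     * Sketch only.  'pd' and 'xrc_domain' are assumed to have been created
     * already; 'dlid' and 'dest_qpn' describe the remote sending QP and
     * would be exchanged out of band.
     */
    static int create_xrc_rcv_qp(struct ibv_pd *pd,
                                 struct ibv_xrc_domain *xrc_domain,
                                 uint16_t dlid, uint32_t dest_qpn,
                                 uint32_t *rcv_qpn)
    {
            struct ibv_qp_attr attr;
            enum ibv_qp_attr_mask mask;

            memset(&attr, 0, sizeof attr);

            /* RESET->INIT attributes */
            attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
            attr.pkey_index      = 0;
            attr.port_num        = 1;

            /* INIT->RTR attributes (values here are placeholders) */
            attr.path_mtu           = IBV_MTU_1024;
            attr.dest_qp_num        = dest_qpn;   /* remote side's sending QP */
            attr.rq_psn             = 0;
            attr.min_rnr_timer      = 0;
            attr.max_dest_rd_atomic = 4;          /* only needed for RDMA/atomics */
            attr.ah_attr.is_global  = 0;
            attr.ah_attr.dlid       = dlid;
            attr.ah_attr.sl         = 0;
            attr.ah_attr.port_num   = 1;

            mask = IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
                   IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                   IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER |
                   IBV_QP_MAX_DEST_RD_ATOMIC;

            /* The kernel-owned QP then lives until the XRC domain's reference
             * count drops to zero. */
            return ibv_alloc_xrc_rcv_qp(pd, xrc_domain, &attr, mask, rcv_qpn);
    }

The returned *rcv_qpn is then what gets shipped to the remote node, which
uses it as dest_qp_num when bringing up its sending XRC QP.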
> > >>>
> > >>>
> > >>>
> > >> --
> > >> Pavel Shamis (Pasha)
> > >> Mellanox Technologies
> > >>
> > >>
> > >>
> > >
> > >
> >
> >
> > --
> > Pavel Shamis (Pasha)
> > Mellanox Technologies
> >
> >

--
			Gleb.


