[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
Tang, Changqing
changquing.tang at hp.com
Mon Dec 24 15:49:37 PST 2007
> -----Original Message-----
> From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
> Sent: Monday, December 24, 2007 8:03 AM
> To: Tang, Changqing
> Cc: Jack Morgenstein; Roland Dreier;
> general at lists.openfabrics.org; Open MPI Developers;
> mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> Hi CQ,
> Tang, Changqing wrote:
> > If I have an MPI server process on a node, many other MPI
> > client processes will dynamically connect/disconnect with
> > the server. The server uses the same XRC domain.
> >
> > Will this cause the "kernel" QPs to accumulate for such an
> > application? We want the server to run 365 days a year.
> >
> I have a question about the scenario above. Did you call
> MPI disconnect on both ends (server/client) before the
> client exits (and must we do that?)
Yes, both ends will call disconnect. But for us, the MPI_Comm_disconnect()
call is not a collective call; it is just a local operation.
--CQ
>
> Regards,
> Pasha.
> >
> > Thanks.
> > --CQ
> >
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
> >> Sent: Thursday, December 20, 2007 9:15 AM
> >> To: Jack Morgenstein
> >> Cc: Tang, Changqing; Roland Dreier;
> >> general at lists.openfabrics.org; Open MPI Developers;
> >> mvapich-discuss at cse.ohio-state.edu
> >> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> >> independent of any one user process
> >>
> >> Adding Open MPI and MVAPICH community to the thread.
> >>
> >> Pasha (Pavel Shamis)
> >>
> >> Jack Morgenstein wrote:
> >>
> >>> background: see "XRC Cleanup order issue thread" at
> >>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
> >>>
> >>> (userspace process which created the receiving XRC qp on a given host
> >>> dies before other processes which still need to receive XRC messages
> >>> on their SRQs which are "paired" with the now-destroyed receiving XRC
> >>> QP.)
> >>>
> >>> Solution: Add a userspace verb (as part of the XRC suite) which
> >>> enables the user process to create an XRC QP owned by the kernel --
> >>> which belongs to the required XRC domain.
> >>>
> >>> This QP will be destroyed when the XRC domain is closed (i.e., as part
> >>> of an ibv_close_xrc_domain call, but only when the domain's reference
> >>> count goes to zero).
> >>>
> >>> Below, I give the new userspace API for this function. Any feedback
> >>> will be appreciated. This API will be implemented in the upcoming
> >>> OFED 1.3 release, so we need feedback ASAP.
> >>
> >>> Notes:
> >>> 1. There is no query or destroy verb for this QP. There is also no
> >>>    userspace object for the QP. Userspace has ONLY the raw qp number
> >>>    to use when creating the (X)RC connection.
> >>> 2. Since the QP is "owned" by kernel space, async events for this QP
> >>>    are also handled in kernel space (i.e., reported in
> >>>    /var/log/messages). There are no completion events for the QP,
> >>>    since it does not send, and all receive completions are reported
> >>>    in the XRC SRQ's cq.
> >>>    If this QP enters the error state, the remote QP which sends will
> >>>    start receiving RETRY_EXCEEDED errors, so the application will be
> >>>    aware of the failure.
> >>>
> >>> - Jack
> >>>
> >>> ======================================================================
> >>> /**
> >>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a
> >>>  *   receive-side only QP, and moves the created qp through the
> >>>  *   RESET->INIT and INIT->RTR transitions.
> >>>  *   (The RTR->RTS transition is not needed, since this QP does no
> >>>  *   sending.)
> >>>  *   The sending XRC QP uses this QP as destination, while specifying
> >>>  *   an XRC SRQ for actually receiving the transmissions and
> >>>  *   generating all completions on the receiving side.
> >>>  *
> >>>  *   This QP is created in kernel space, and persists until the XRC
> >>>  *   domain is closed (i.e., its reference count goes to zero).
> >>>  *
> >>>  * @pd: protection domain to use. At lower layer, this provides
> >>>  *   access to userspace obj
> >>>  * @xrc_domain: xrc domain to use for the QP.
> >>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
> >>>  * @attr_mask: bitmap indicating which attributes are provided in the
> >>>  *   attr struct; used for validity checking.
> >>>  * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to
> >>>  *   the remote node. The remote node will use xrc_rcv_qpn in
> >>>  *   ibv_post_send when sending to XRC SRQ's on this host in the same
> >>>  *   xrc domain.
> >>>  *
> >>>  * RETURNS: success (0), or a (negative) error value.
> >>>  */
> >>>
> >>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
> >>>                          struct ibv_xrc_domain *xrc_domain,
> >>>                          struct ibv_qp_attr *attr,
> >>>                          enum ibv_qp_attr_mask attr_mask,
> >>>                          uint32_t *xrc_rcv_qpn);
> >>>
> >>> Notes:
> >>>
> >>> 1. Although the kernel creates the qp in the kernel's own PD, we
> >>>    still need the PD parameter to determine the device.
> >>>
> >>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
> >>>    rather than create a new structure for this purpose. This also
> >>>    guards against API changes in the event that during development I
> >>>    notice that more modify-qp parameters must be specified for this
> >>>    operation to work.
> >>>
> >>> 3. Table of the ibv_qp_attr parameters showing what values to set:
> >>>
> >>> struct ibv_qp_attr {
> >>>     enum ibv_qp_state   qp_state;             Not needed
> >>>     enum ibv_qp_state   cur_qp_state;         Not needed
> >>>                                               -- Driver starts from RESET and takes qp to RTR.
> >>>     enum ibv_mtu        path_mtu;             Yes
> >>>     enum ibv_mig_state  path_mig_state;       Yes
> >>>     uint32_t            qkey;                 Yes
> >>>     uint32_t            rq_psn;               Yes
> >>>     uint32_t            sq_psn;               Not needed
> >>>     uint32_t            dest_qp_num;          Yes -- this is the remote side QP for the RC conn.
> >>>     int                 qp_access_flags;      Yes
> >>>     struct ibv_qp_cap   cap;                  Need only XRC domain. Other caps will use
> >>>                                               hard-coded values:
> >>>                                                   max_send_wr = 1;
> >>>                                                   max_recv_wr = 0;
> >>>                                                   max_send_sge = 1;
> >>>                                                   max_recv_sge = 0;
> >>>                                                   max_inline_data = 0;
> >>>     struct ibv_ah_attr  ah_attr;              Yes
> >>>     struct ibv_ah_attr  alt_ah_attr;          Optional
> >>>     uint16_t            pkey_index;           Yes
> >>>     uint16_t            alt_pkey_index;       Optional
> >>>     uint8_t             en_sqd_async_notify;  Not needed (No sq)
> >>>     uint8_t             sq_draining;          Not needed (No sq)
> >>>     uint8_t             max_rd_atomic;        Not needed (No sq)
> >>>     uint8_t             max_dest_rd_atomic;   Yes -- Total max outstanding RDMAs expected for
> >>>                                               ALL srq destinations using this receive QP.
> >>>                                               (If you are only using SENDs, this value can be 0.)
> >>>     uint8_t             min_rnr_timer;        default - 0
> >>>     uint8_t             port_num;             Yes
> >>>     uint8_t             timeout;              Yes
> >>>     uint8_t             retry_cnt;            Yes
> >>>     uint8_t             rnr_retry;            Yes
> >>>     uint8_t             alt_port_num;         Optional
> >>>     uint8_t             alt_timeout;          Optional
> >>> };
> >>>
> >>> 4. Attribute mask bits to set:
> >>> For RESET_to_INIT transition:
> >>> IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
> >>>
> >>> For INIT_to_RTR transition:
> >>> IB_QP_AV | IB_QP_PATH_MTU |
> >>> IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
> >>> If you are using RDMA or atomics, also set:
> >>> IB_QP_MAX_DEST_RD_ATOMIC
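
A minimal usage sketch of the proposed verb, assembled from the parameter table
and the mask bits quoted above. The helper name, the concrete values chosen
(port 1, 2048-byte MTU, the access flags, a max_dest_rd_atomic of 4), and the
userspace IBV_QP_* spelling of the mask constants are assumptions for
illustration; the remote QP number, LID and PSN are presumed to be exchanged
out of band (e.g., during MPI connection setup).

/* Illustrative sketch (not part of the RFC text above): bring up a
 * kernel-owned XRC receive QP with the proposed verb.  Assumes pd and
 * xrc_domain were already created, and that the remote (sending) QP
 * number, LID and PSN were exchanged out of band.  Mask constants use
 * the userspace IBV_QP_* spelling; the RFC lists kernel IB_QP_* names. */
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

static int setup_xrc_rcv_qp(struct ibv_pd *pd,
                            struct ibv_xrc_domain *xrc_domain,
                            uint32_t remote_qpn, uint16_t remote_lid,
                            uint32_t remote_psn, uint32_t *rcv_qpn)
{
    struct ibv_qp_attr attr;
    enum ibv_qp_attr_mask mask;

    memset(&attr, 0, sizeof attr);

    /* RESET->INIT attributes */
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
    attr.pkey_index      = 0;
    attr.port_num        = 1;                 /* assumed HCA port */

    /* INIT->RTR attributes */
    attr.path_mtu           = IBV_MTU_2048;   /* assumed path MTU */
    attr.dest_qp_num        = remote_qpn;     /* remote side's sending QP */
    attr.rq_psn             = remote_psn;
    attr.min_rnr_timer      = 0;
    attr.max_dest_rd_atomic = 4;              /* only needed for RDMA/atomics */
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.sl         = 0;
    attr.ah_attr.port_num   = 1;

    mask = IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
           IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
           IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER |
           IBV_QP_MAX_DEST_RD_ATOMIC;

    /* The kernel creates the QP, brings it to RTR and ties its lifetime
     * to the XRC domain; only the raw QP number is returned to userspace. */
    return ibv_alloc_xrc_rcv_qp(pd, xrc_domain, &attr, mask, rcv_qpn);
}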
> >>>
> >>>
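A companion sketch for the sender side, illustrating the failure mode described
in note 2 of the quoted RFC: once the kernel-owned receive QP enters the error
state, send completions come back with a retry-exceeded status. The helper name
and polling loop are illustrative only; IBV_WC_RETRY_EXC_ERR is the libibverbs
completion status that corresponds to the RETRY_EXCEEDED error mentioned above.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: poll the sender's CQ and watch for retry-exceeded
 * completions, which signal that the peer's XRC receive QP is gone. */
static int drain_send_completions(struct ibv_cq *send_cq)
{
    struct ibv_wc wc;
    int n;

    while ((n = ibv_poll_cq(send_cq, 1, &wc)) > 0) {
        if (wc.status == IBV_WC_RETRY_EXC_ERR) {
            /* Remote receive QP entered the error state (or the peer
             * died); the application should tear down / re-establish
             * the connection at its own level. */
            fprintf(stderr, "send wr %llu failed: retry exceeded\n",
                    (unsigned long long) wc.wr_id);
            return -1;
        }
    }
    return n;   /* 0: CQ drained; negative: poll error */
}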
> >> --
> >> Pavel Shamis (Pasha)
> >> Mellanox Technologies
> >>
> >>
> >>
> >
> >
>
>
> --
> Pavel Shamis (Pasha)
> Mellanox Technologies
>
>