[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
Pavel Shamis (Pasha)
pasha at dev.mellanox.co.il
Mon Dec 24 06:03:09 PST 2007
Hi CQ,
Tang, Changqing wrote:
> If I have an MPI server process on a node, many other MPI client processes will dynamically
> connect/disconnect with the server. The server uses the same XRC domain.
>
> Will this cause the "kernel" QPs to accumulate for such an application? We want the server
> to run 365 days a year.
>
I have a question about the scenario above: do you call the MPI
disconnect on both ends (server/client) before the client exits? (And is
that something we must do?)
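I.e., roughly this pattern (a sketch only; the communicator variable
names are just for illustration):

    /* client side, before exiting; server_comm is the inter-communicator
     * obtained from MPI_Comm_connect() */
    MPI_Comm_disconnect(&server_comm);
    MPI_Finalize();

    /* server side, on the matching inter-communicator returned by
     * MPI_Comm_accept() */
    MPI_Comm_disconnect(&client_comm);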
Regards,
Pasha.
>
> Thanks.
> --CQ
>
>> -----Original Message-----
>> From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
>> Sent: Thursday, December 20, 2007 9:15 AM
>> To: Jack Morgenstein
>> Cc: Tang, Changqing; Roland Dreier;
>> general at lists.openfabrics.org; Open MPI Developers;
>> mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
>> independent of any one user process
>>
>> Adding Open MPI and MVAPICH community to the thread.
>>
>> Pasha (Pavel Shamis)
>>
>> Jack Morgenstein wrote:
>>
>>> background: see "XRC Cleanup order issue thread" at
>>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>>>
>>> (userspace process which created the receiving XRC qp on a given host
>>> dies before other processes which still need to receive XRC messages
>>> on their SRQs which are "paired" with the now-destroyed receiving XRC
>>> QP.)
>>>
>>> Solution: Add a userspace verb (as part of the XRC suite) which
>>> enables the user process to create an XRC QP owned by the kernel --
>>> which belongs to the required XRC domain.
>>> This QP will be destroyed when the XRC domain is closed (i.e., as
>>> part of an ibv_close_xrc_domain call, but only when the domain's
>>> reference count goes to zero).
>>>
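>>> As a rough lifecycle sketch (assuming the existing OFED XRC domain
>>> calls ibv_open_xrc_domain()/ibv_close_xrc_domain(); the variable
>>> names here are illustrative only):
>>>
>>>     /* every process sharing the domain opens it on the same file */
>>>     struct ibv_xrc_domain *xrcd = ibv_open_xrc_domain(ctx, fd, O_CREAT);
>>>
>>>     /* ... create XRC SRQs and the kernel-owned receive QP here ... */
>>>
>>>     /* drops the domain's reference count; the kernel-owned receive
>>>      * QP is destroyed only when the count reaches zero */
>>>     ibv_close_xrc_domain(xrcd);
>>>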
>>> Below, I give the new userspace API for this function. Any feedback
>>> will be appreciated.
>>> This API will be implemented in the upcoming OFED 1.3 release, so we
>>> need feedback ASAP.
>>>
>>> Notes:
>>> 1. There is no query or destroy verb for this QP. There is also no
>>>    userspace object for the QP. Userspace has ONLY the raw qp number
>>>    to use when creating the (X)RC connection.
>>> 2. Since the QP is "owned" by kernel space, async events for this QP
>>>    are also handled in kernel space (i.e., reported in
>>>    /var/log/messages). There are no completion events for the QP,
>>>    since it does not send, and all receive completions are reported
>>>    in the XRC SRQ's cq.
>>>    If this QP enters the error state, the remote QP which sends will
>>>    start receiving RETRY_EXCEEDED errors, so the application will be
>>>    aware of the failure.
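>>>
>>>    On the sending side this would surface roughly as follows (a
>>>    sketch only, not part of this patch; it assumes a normal poll
>>>    loop on the sender's send CQ):
>>>
>>>        struct ibv_wc wc;
>>>
>>>        if (ibv_poll_cq(send_cq, 1, &wc) > 0 &&
>>>            wc.status == IBV_WC_RETRY_EXC_ERR) {
>>>                /* the remote, kernel-owned XRC receive QP has entered
>>>                 * the error state or become unreachable; tear down
>>>                 * and re-establish the connection */
>>>        }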
>>>
>>> - Jack
>>>
>>>
>>> ======================================================================
>>> /**
>>> * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a
>>> *    receive-side only QP, and moves the created qp through the
>>> *    RESET->INIT and INIT->RTR transitions. (The RTR->RTS transition
>>> *    is not needed, since this QP does no sending).
>>> *    The sending XRC QP uses this QP as destination, while specifying
>>> *    an XRC SRQ for actually receiving the transmissions and
>>> *    generating all completions on the receiving side.
>>> *
>>> *    This QP is created in kernel space, and persists until the XRC
>>> *    domain is closed (i.e., its reference count goes to zero).
>>> *
>>> * @pd: protection domain to use. At lower layer, this provides
>>> *    access to userspace obj
>>> * @xrc_domain: xrc domain to use for the QP.
>>> * @attr: modify-qp attributes needed to bring the QP to RTR.
>>> * @attr_mask: bitmap indicating which attributes are provided in the
>>> *    attr struct; used for validity checking.
>>> * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to
>>> *    the remote node. The remote node will use xrc_rcv_qpn in
>>> *    ibv_post_send when sending to XRC SRQ's on this host in the same
>>> *    xrc domain.
>>> *
>>> * RETURNS: success (0), or a (negative) error value.
>>> */
>>>
>>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>>>                          struct ibv_xrc_domain *xrc_domain,
>>>                          struct ibv_qp_attr *attr,
>>>                          enum ibv_qp_attr_mask attr_mask,
>>>                          uint32_t *xrc_rcv_qpn);
>>>
>>> Notes:
>>>
>>> 1. Although the kernel creates the qp in the kernel's own PD, we
>>>    still need the PD parameter to determine the device.
>>>
>>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
>>>    rather than create a new structure for this purpose. This also
>>>    guards against API changes in the event that during development I
>>>    notice that more modify-qp parameters must be specified for this
>>>    operation to work.
>>>
>>> 3. Table of the ibv_qp_attr parameters showing what values to set:
>>>
>>> struct ibv_qp_attr {
>>>     enum ibv_qp_state    qp_state;             Not needed
>>>     enum ibv_qp_state    cur_qp_state;         Not needed
>>>                            -- Driver starts from RESET and takes qp to RTR.
>>>     enum ibv_mtu         path_mtu;             Yes
>>>     enum ibv_mig_state   path_mig_state;       Yes
>>>     uint32_t             qkey;                 Yes
>>>     uint32_t             rq_psn;               Yes
>>>     uint32_t             sq_psn;               Not needed
>>>     uint32_t             dest_qp_num;          Yes
>>>                            -- this is the remote side QP for the RC conn.
>>>     int                  qp_access_flags;      Yes
>>>     struct ibv_qp_cap    cap;                  Need only XRC domain.
>>>                            Other caps will use hard-coded values:
>>>                              max_send_wr     = 1;
>>>                              max_recv_wr     = 0;
>>>                              max_send_sge    = 1;
>>>                              max_recv_sge    = 0;
>>>                              max_inline_data = 0;
>>>     struct ibv_ah_attr   ah_attr;              Yes
>>>     struct ibv_ah_attr   alt_ah_attr;          Optional
>>>     uint16_t             pkey_index;           Yes
>>>     uint16_t             alt_pkey_index;       Optional
>>>     uint8_t              en_sqd_async_notify;  Not needed (No sq)
>>>     uint8_t              sq_draining;          Not needed (No sq)
>>>     uint8_t              max_rd_atomic;        Not needed (No sq)
>>>     uint8_t              max_dest_rd_atomic;   Yes
>>>                            -- Total max outstanding RDMAs expected for
>>>                               ALL srq destinations using this receive QP.
>>>                               (if you are only using SENDs, this value can be 0).
>>>     uint8_t              min_rnr_timer;        default - 0
>>>     uint8_t              port_num;             Yes
>>>     uint8_t              timeout;              Yes
>>>     uint8_t              retry_cnt;            Yes
>>>     uint8_t              rnr_retry;            Yes
>>>     uint8_t              alt_port_num;         Optional
>>>     uint8_t              alt_timeout;          Optional
>>> };
>>>
>>> 4. Attribute mask bits to set:
>>>    For RESET_to_INIT transition:
>>>        IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
>>>
>>>    For INIT_to_RTR transition:
>>>        IB_QP_AV | IB_QP_PATH_MTU |
>>>        IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
>>>    If you are using RDMA or atomics, also set:
>>>        IB_QP_MAX_DEST_RD_ATOMIC
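>>>
>>> Putting notes 3 and 4 together, a minimal calling sketch might look
>>> like the following (illustrative only, not from the patch; pd,
>>> xrc_domain and the remote_* values are placeholders obtained out of
>>> band, and IBV_QP_* are the userspace spellings of the IB_QP_* mask
>>> bits above):
>>>
>>>     struct ibv_qp_attr attr;
>>>     uint32_t xrc_rcv_qpn;
>>>     int ret;
>>>
>>>     memset(&attr, 0, sizeof attr);
>>>     /* RESET->INIT attributes */
>>>     attr.qp_access_flags       = IBV_ACCESS_REMOTE_WRITE |
>>>                                  IBV_ACCESS_REMOTE_READ;
>>>     attr.pkey_index            = 0;
>>>     attr.port_num              = 1;
>>>     /* INIT->RTR attributes */
>>>     attr.path_mtu              = IBV_MTU_1024;
>>>     attr.dest_qp_num           = remote_qp_num;   /* peer's sending QP */
>>>     attr.rq_psn                = remote_psn;
>>>     attr.min_rnr_timer         = 0;
>>>     attr.max_dest_rd_atomic    = 4;               /* 0 if SENDs only */
>>>     attr.ah_attr.is_global     = 0;
>>>     attr.ah_attr.dlid          = remote_lid;
>>>     attr.ah_attr.sl            = 0;
>>>     attr.ah_attr.src_path_bits = 0;
>>>     attr.ah_attr.port_num      = 1;
>>>     /* qkey, path_mig_state, timeout, retry_cnt, rnr_retry, etc. would
>>>      * be filled in the same way, per the table in note 3 */
>>>
>>>     ret = ibv_alloc_xrc_rcv_qp(pd, xrc_domain, &attr,
>>>                                IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX |
>>>                                IBV_QP_PORT | IBV_QP_AV | IBV_QP_PATH_MTU |
>>>                                IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
>>>                                IBV_QP_MIN_RNR_TIMER |
>>>                                IBV_QP_MAX_DEST_RD_ATOMIC,
>>>                                &xrc_rcv_qpn);
>>>     if (!ret) {
>>>         /* pass xrc_rcv_qpn to the remote node; it becomes the
>>>          * dest_qp_num of the remote side's sending XRC QP */
>>>     }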
>>>
>>>
>> --
>> Pavel Shamis (Pasha)
>> Mellanox Technologies
>>
>>
>>
>
>
--
Pavel Shamis (Pasha)
Mellanox Technologies