[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Pavel Shamis (Pasha) pasha at dev.mellanox.co.il
Mon Dec 24 06:03:09 PST 2007


Hi CQ,
Tang, Changqing wrote:
>         If I have an MPI server process on a node, many other MPI client processes will dynamically
> connect/disconnect with the server. The server uses the same XRC domain.
>
>         Will this cause accumulation of "kernel" QPs for such an application? We want the server to run
> 365 days a year.
>   
I have a question about the scenario above: do you call MPI disconnect
on both ends (server/client) before the client exits, and is that
required?

Regards,
Pasha.
>
> Thanks.
> --CQ
>
>
>
>
>   
>> -----Original Message-----
>> From: Pavel Shamis (Pasha) [mailto:pasha at dev.mellanox.co.il]
>> Sent: Thursday, December 20, 2007 9:15 AM
>> To: Jack Morgenstein
>> Cc: Tang, Changqing; Roland Dreier;
>> general at lists.openfabrics.org; Open MPI Developers;
>> mvapich-discuss at cse.ohio-state.edu
>> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
>> independent of any one user process
>>
>> Adding Open MPI and MVAPICH community to the thread.
>>
>> Pasha (Pavel Shamis)
>>
>> Jack Morgenstein wrote:
>>> background:  see "XRC Cleanup order issue thread" at
>>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>>>
>>> (userspace process which created the receiving XRC qp on a given host
>>> dies before other processes which still need to receive XRC messages
>>> on their SRQs which are "paired" with the now-destroyed receiving XRC
>>> QP.)
>>>
>>> Solution: Add a userspace verb (as part of the XRC suite) which
>>> enables the user process to create an XRC QP owned by the kernel --
>>> one which belongs to the required XRC domain.
>>> This QP will be destroyed when the XRC domain is closed (i.e., as
>>> part of an ibv_close_xrc_domain call, but only when the domain's
>>> reference count goes to zero).
>>>
>>> Below, I give the new userspace API for this function.  Any feedback
>>> will be appreciated.
>>> This API will be implemented in the upcoming OFED 1.3 release, so we
>>> need feedback ASAP.
>>>
>>> Notes:
>>> 1. There is no query or destroy verb for this QP. There is also no
>>>    userspace object for the QP. Userspace has ONLY the raw qp number
>>>    to use when creating the (X)RC connection.
>>> 2. Since the QP is "owned" by kernel space, async events for this QP
>>>    are also handled in kernel space (i.e., reported in
>>>    /var/log/messages). There are no completion events for the QP,
>>>    since it does not send, and all receive completions are reported
>>>    in the XRC SRQ's cq.
>>>    If this QP enters the error state, the remote QP which sends will
>>>    start receiving RETRY_EXCEEDED errors, so the application will be
>>>    aware of the failure.
>>>
>>> - Jack
>>>
>>> ======================================================================
>>>
>>> /**
>>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a
>>>  *    receive-side-only QP, and moves the created qp through the
>>>  *    RESET->INIT and INIT->RTR transitions.
>>>  *    (The RTR->RTS transition is not needed, since this QP does no
>>>  *    sending.)
>>>  *    The sending XRC QP uses this QP as destination, while
>>>  *    specifying an XRC SRQ for actually receiving the transmissions
>>>  *    and generating all completions on the receiving side.
>>>  *
>>>  *    This QP is created in kernel space, and persists until the XRC
>>>  *    domain is closed (i.e., its reference count goes to zero).
>>>  *
>>>  * @pd: protection domain to use.  At the lower layer, this provides
>>>  *    access to the userspace object.
>>>  * @xrc_domain: xrc domain to use for the QP.
>>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>>>  * @attr_mask: bitmap indicating which attributes are provided in the
>>>  *    attr struct; used for validity checking.
>>>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed
>>>  *    to the remote node. The remote node will use xrc_rcv_qpn in
>>>  *    ibv_post_send when sending to XRC SRQs on this host in the same
>>>  *    xrc domain.
>>>  *
>>>  * RETURNS: success (0), or a (negative) error value.
>>>  */
>>>
>>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>>>                          struct ibv_xrc_domain *xrc_domain,
>>>                          struct ibv_qp_attr *attr,
>>>                          enum ibv_qp_attr_mask attr_mask,
>>>                          uint32_t *xrc_rcv_qpn);
>>>
>>> Notes:
>>>
>>> 1. Although the kernel creates the qp in the kernel's own PD, we
>>>    still need the PD parameter to determine the device.
>>>
>>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
>>>    rather than create a new structure for this purpose.  This also
>>>    guards against API changes in the event that during development I
>>>    notice that more modify-qp parameters must be specified for this
>>>    operation to work.
>>>
>>> 3. Table of the ibv_qp_attr parameters showing what values to set:
>>>
>>> struct ibv_qp_attr {
>>>     enum ibv_qp_state    qp_state;            Not needed
>>>     enum ibv_qp_state    cur_qp_state;        Not needed
>>>             -- Driver starts from RESET and takes the qp to RTR.
>>>     enum ibv_mtu         path_mtu;            Yes
>>>     enum ibv_mig_state   path_mig_state;      Yes
>>>     uint32_t             qkey;                Yes
>>>     uint32_t             rq_psn;              Yes
>>>     uint32_t             sq_psn;              Not needed
>>>     uint32_t             dest_qp_num;         Yes
>>>             -- This is the remote side's QP for the RC connection.
>>>     int                  qp_access_flags;     Yes
>>>     struct ibv_qp_cap    cap;                 Need only XRC domain.
>>>             Other caps will use hard-coded values:
>>>                 max_send_wr     = 1;
>>>                 max_recv_wr     = 0;
>>>                 max_send_sge    = 1;
>>>                 max_recv_sge    = 0;
>>>                 max_inline_data = 0;
>>>     struct ibv_ah_attr   ah_attr;             Yes
>>>     struct ibv_ah_attr   alt_ah_attr;         Optional
>>>     uint16_t             pkey_index;          Yes
>>>     uint16_t             alt_pkey_index;      Optional
>>>     uint8_t              en_sqd_async_notify; Not needed (no sq)
>>>     uint8_t              sq_draining;         Not needed (no sq)
>>>     uint8_t              max_rd_atomic;       Not needed (no sq)
>>>     uint8_t              max_dest_rd_atomic;  Yes
>>>             -- Total max outstanding RDMAs expected for ALL srq
>>>                destinations using this receive QP. (If you are only
>>>                using SENDs, this value can be 0.)
>>>     uint8_t              min_rnr_timer;       default - 0
>>>     uint8_t              port_num;            Yes
>>>     uint8_t              timeout;             Yes
>>>     uint8_t              retry_cnt;           Yes
>>>     uint8_t              rnr_retry;           Yes
>>>     uint8_t              alt_port_num;        Optional
>>>     uint8_t              alt_timeout;         Optional
>>> };
>>>
>>> 4. Attribute mask bits to set:
>>>       For RESET_to_INIT transition:
>>>               IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
>>>
>>>       For INIT_to_RTR transition:
>>>               IB_QP_AV | IB_QP_PATH_MTU |
>>>               IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
>>>          If you are using RDMA or atomics, also set:
>>>               IB_QP_MAX_DEST_RD_ATOMIC
>>>
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>>
>>>       
>> --
>> Pavel Shamis (Pasha)
>> Mellanox Technologies
>>
>>
>>     
>
>   


-- 
Pavel Shamis (Pasha)
Mellanox Technologies