[ofa-general] error with ibv_poll_cq() call
Tang, Changqing
changquing.tang at hp.com
Wed Mar 26 00:09:25 PDT 2008
Well, I said the *peer* process may have destroyed the remote QP. The process
calling ibv_poll_cq() still has its QP in the RTS state.
And though I use OFED 1.3, the HCA is not ConnectX.
Any idea?
--CQ
> -----Original Message-----
> From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> Sent: Wednesday, March 26, 2008 2:01 AM
> To: general at lists.openfabrics.org
> Cc: Tang, Changqing; Roland Dreier
> Subject: Re: [ofa-general] error with ibv_poll_cq() call
>
> On Wednesday 26 March 2008 05:56, Tang, Changqing wrote:
> >
> > Hi,
> > We are debugging our dynamic-process code. When we call
> >
> > ret = ibv_poll_cq(cq_hndl, 1, &compl);
> >
> > The peer process may have destroyed the QP.
> >
> > However, ibv_poll_cq() returns -2 in 'ret', while 'errno' is still 0.
> >
> > What could be the reason for this error?
> >
> > There is a posted send pending for completion, so the error should be
> > reported via the completion status, not by the polling function itself.
> >
> > Thanks for any help. This is OFED 1.3.
>
> Roland,
> It looks like we have a race condition in mlx4_destroy_qp.
> We clean the CQ BEFORE modifying the QP to reset (done in the
> kernel as part of the ibv_cmd_destroy_qp() flow).
>
> CQ's problem has exposed this bug. mlx4_cq_clean needs to be invoked
> **after** the destroy:
>
> Index: libmlx4/src/verbs.c
> ===================================================================
> --- libmlx4.orig/src/verbs.c 2008-03-26 09:00:08.000000000 +0200
> +++ libmlx4/src/verbs.c 2008-03-26 09:00:52.449586000 +0200
> @@ -558,11 +558,6 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
>  	struct mlx4_qp *qp = to_mqp(ibqp);
>  	int ret;
>  
> -	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
> -		      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
> -	if (ibqp->send_cq != ibqp->recv_cq)
> -		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
> -
>  	mlx4_lock_cqs(ibqp);
>  	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
>  	mlx4_unlock_cqs(ibqp);
> @@ -576,6 +571,11 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
>  		return ret;
>  	}
>  
> +	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
> +		      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
> +	if (ibqp->send_cq != ibqp->recv_cq)
> +		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
> +
>  	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
>  		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
>  	free(qp->sq.wrid);