[ofa-general] error with ibv_poll_cq() call

Tang, Changqing changquing.tang at hp.com
Wed Mar 26 00:09:25 PDT 2008


Well, as I said, the "PEER" process may have destroyed the remote QP.
The process calling ibv_poll_cq() still has the QP in RTS state.
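This is roughly how we confirm the local state (a simplified sketch;
'qp' stands for our local QP handle, real error handling elided):

    /* assumes <infiniband/verbs.h>; 'qp' is our local QP handle */
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE, &init) == 0 &&
        attr.qp_state == IBV_QPS_RTS) {
            /* peer already destroyed its QP, yet our side still
               reports RTS -- errors should arrive as completions */
    }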

And although I use OFED 1.3, the HCA is not ConnectX.

Any idea?


--CQ



> -----Original Message-----
> From: Jack Morgenstein [mailto:jackm at dev.mellanox.co.il]
> Sent: Wednesday, March 26, 2008 2:01 AM
> To: general at lists.openfabrics.org
> Cc: Tang, Changqing; Roland Dreier
> Subject: Re: [ofa-general] error with ibv_poll_cq() call
>
> On Wednesday 26 March 2008 05:56, Tang, Changqing wrote:
> >
> > Hi,
> >         We are debugging our dynamic process code. When we call
> >
> > ret = ibv_poll_cq(cq_hndl, 1, &compl);
> >
> > The peer process may have destroyed the QP.
> >
> > However, ibv_poll_cq() returns -2 in 'ret', and 'errno' is still 0.
> >
> > What could be the reason for this error?
> >
> > There is a posted send pending for completion, so the error should
> > be reported via the completion status, not by the polling function itself.
> >
> > Thanks for any help. This is OFED 1.3.
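> >
> > For reference, our polling logic looks roughly like this (a
> > simplified sketch; 'cq_hndl' is the CQ handle from above, and the
> > handle_*() helpers are placeholders for our own code):
> >
> > struct ibv_wc compl;
> > int ret;
> >
> > ret = ibv_poll_cq(cq_hndl, 1, &compl);
> > if (ret < 0)
> >         /* verbs/driver failure -- this is where we see -2 */
> >         handle_poll_error(ret);
> > else if (ret == 1 && compl.status != IBV_WC_SUCCESS)
> >         /* expected path for a dead peer, e.g. IBV_WC_RETRY_EXC_ERR
> >            or IBV_WC_WR_FLUSH_ERR */
> >         handle_completion_error(&compl);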
>
> Roland,
> It looks like we have a race condition in mlx4_destroy_qp.
> We clean the cq BEFORE modifying the QP to reset (done in the
> kernel as part of the ibv_cmd_destroy_qp() flow).
>
> CQ's problem has exposed this bug.  mlx4_cq_clean needs to be invoked
> **after** the destroy:
>
> Index: libmlx4/src/verbs.c
> ===================================================================
> --- libmlx4.orig/src/verbs.c    2008-03-26 09:00:08.000000000 +0200
> +++ libmlx4/src/verbs.c 2008-03-26 09:00:52.449586000 +0200
> @@ -558,11 +558,6 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
>         struct mlx4_qp *qp = to_mqp(ibqp);
>         int ret;
>
> -       mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
> -                      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
> -       if (ibqp->send_cq != ibqp->recv_cq)
> -               mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
> -
>         mlx4_lock_cqs(ibqp);
>         mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
>         mlx4_unlock_cqs(ibqp);
> @@ -576,6 +571,11 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
>                 return ret;
>         }
>
> +       mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
> +                      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
> +       if (ibqp->send_cq != ibqp->recv_cq)
> +               mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
> +
>         if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
>                 mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
>         free(qp->sq.wrid);
>