[ofa-general] error with ibv_poll_cq() call
Jack Morgenstein
jackm at dev.mellanox.co.il
Wed Mar 26 00:01:25 PDT 2008
On Wednesday 26 March 2008 05:56, Tang, Changqing wrote:
>
> Hi,
> We are debugging our dynamic process code. We call
>
> ret = ibv_poll_cq(cq_hndl, 1, &compl);
>
> at a point where the peer process may already have destroyed its QP.
>
> However, ibv_poll_cq() returns -2 in 'ret', and 'errno' is still 0.
>
> What could be the reason for this error?
>
> There is a posted send pending completion, so the error should be
> reported via the completion status, not by the polling function
> itself.
>
> Thanks for any help. This is OFED 1.3.
Roland,
It looks like we have a race condition in mlx4_destroy_qp(). We clean the
CQ BEFORE modifying the QP to reset (which is done in the kernel as part of
the ibv_cmd_destroy_qp() flow), so a completion that arrives in between
is left behind in the CQ.
Changqing's problem has exposed this bug. mlx4_cq_clean() needs to be
invoked **after** the destroy:
Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2008-03-26 09:00:08.000000000 +0200
+++ libmlx4/src/verbs.c	2008-03-26 09:00:52.449586000 +0200
@@ -558,11 +558,6 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	struct mlx4_qp *qp = to_mqp(ibqp);
 	int ret;
 
-	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
-		      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
-	if (ibqp->send_cq != ibqp->recv_cq)
-		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
-
 	mlx4_lock_cqs(ibqp);
 	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
 	mlx4_unlock_cqs(ibqp);
@@ -576,6 +571,11 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 		return ret;
 	}
 
+	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
+		      ibqp->srq ? to_msrq(ibqp->srq) : NULL);
+	if (ibqp->send_cq != ibqp->recv_cq)
+		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
+
 	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
 		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
 	free(qp->sq.wrid);