[ofa-general] error with ibv_poll_cq() call

Jack Morgenstein jackm at dev.mellanox.co.il
Wed Mar 26 00:01:25 PDT 2008


On Wednesday 26 March 2008 05:56, Tang, Changqing wrote:
> 
> Hi,
>         We are debugging our dynamic process code, when we call
> 
> ret = ibv_poll_cq(cq_hndl, 1, &compl);
> 
> The peer process may have destroyed the QP.
> 
> However, ibv_poll_cq() returns -2 in 'ret', and 'errno' is still 0.
> 
> What could be the reason for this error?
> 
> There is a posted send pending completion, so the error should be
> reported via the completion status, not by the polling function
> itself.
> 
> Thanks for any help. This is OFED 1.3

Roland,
It looks like we have a race condition in mlx4_destroy_qp.  We clean the
CQ BEFORE modifying the QP to reset (done in the kernel as part of
the ibv_cmd_destroy_qp() flow), so a completion that arrives in between
is left in the CQ for a QP that no longer exists.

Changqing's problem has exposed this bug.  mlx4_cq_clean needs to be invoked
**after** the destroy:

Index: libmlx4/src/verbs.c
===================================================================
--- libmlx4.orig/src/verbs.c	2008-03-26 09:00:08.000000000 +0200
+++ libmlx4/src/verbs.c	2008-03-26 09:00:52.449586000 +0200
@@ -558,11 +558,6 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 	struct mlx4_qp *qp = to_mqp(ibqp);
 	int ret;
 
-	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
-		       ibqp->srq ? to_msrq(ibqp->srq) : NULL);
-	if (ibqp->send_cq != ibqp->recv_cq)
-		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
-
 	mlx4_lock_cqs(ibqp);
 	mlx4_clear_qp(to_mctx(ibqp->context), ibqp->qp_num);
 	mlx4_unlock_cqs(ibqp);
@@ -576,6 +571,11 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp)
 		return ret;
 	}
 
+	mlx4_cq_clean(to_mcq(ibqp->recv_cq), ibqp->qp_num,
+		       ibqp->srq ? to_msrq(ibqp->srq) : NULL);
+	if (ibqp->send_cq != ibqp->recv_cq)
+		mlx4_cq_clean(to_mcq(ibqp->send_cq), ibqp->qp_num, NULL);
+
 	if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC)
 		mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db);
 	free(qp->sq.wrid);
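
To make the ordering issue concrete, here is a toy model of the race in
plain C. It is only a sketch and does not use libmlx4 or libibverbs: every
toy_* name is hypothetical, the "hardware" is a function call, and the
value -2 is chosen only to mirror the return value reported above. The
model assumes a completion can still arrive after the QP has been removed
from the userspace lookup table but before the QP is reset.

```c
#include <string.h>

#define TOY_CQ_SIZE  8
#define TOY_POLL_ERR (-2)   /* assumption: mirrors the value seen in this thread */

/* Hypothetical CQ: just the QP number of each pending completion. */
struct toy_cq {
	unsigned qpn[TOY_CQ_SIZE];
	int count;
};

/* Hypothetical QP state. */
struct toy_qp {
	unsigned qpn;
	int live;       /* "hardware" may still generate completions */
	int in_table;   /* userspace lookup table still knows this QP */
};

/* "Hardware" side: queue a completion only while the QP is not reset. */
static void toy_hw_complete(struct toy_cq *cq, const struct toy_qp *qp)
{
	if (qp->live && cq->count < TOY_CQ_SIZE)
		cq->qpn[cq->count++] = qp->qpn;
}

/* Drop every pending completion that belongs to qpn. */
static void toy_cq_clean(struct toy_cq *cq, unsigned qpn)
{
	int i, kept = 0;

	for (i = 0; i < cq->count; i++)
		if (cq->qpn[i] != qpn)
			cq->qpn[kept++] = cq->qpn[i];
	cq->count = kept;
}

/*
 * Poll one entry: 0 if the CQ is empty, 1 for a good completion,
 * TOY_POLL_ERR if the completion names a QP that cannot be looked up.
 */
static int toy_poll(struct toy_cq *cq, const struct toy_qp *qp)
{
	unsigned qpn;

	if (cq->count == 0)
		return 0;
	qpn = cq->qpn[0];
	memmove(&cq->qpn[0], &cq->qpn[1], (cq->count - 1) * sizeof(cq->qpn[0]));
	cq->count--;
	if (qpn != qp->qpn || !qp->in_table)
		return TOY_POLL_ERR;
	return 1;
}

/* Destroy the QP with the clean either before or after the reset, then poll. */
static int toy_destroy_then_poll(int clean_after_reset)
{
	struct toy_cq cq = { {0}, 0 };
	struct toy_qp qp = { 42, 1, 1 };

	toy_hw_complete(&cq, &qp);      /* a completion is already pending */

	if (!clean_after_reset)
		toy_cq_clean(&cq, qp.qpn);  /* old order: clean first */

	qp.in_table = 0;                /* QP removed from the lookup table */
	toy_hw_complete(&cq, &qp);      /* late completion lands in the window */
	qp.live = 0;                    /* reset: no further completions */

	if (clean_after_reset)
		toy_cq_clean(&cq, qp.qpn);  /* new order: clean last */

	return toy_poll(&cq, &qp);
}
```

In this model toy_destroy_then_poll(0) returns -2, because the late
completion survives the early clean and names a QP that can no longer be
looked up, while toy_destroy_then_poll(1) returns 0 (empty CQ), because
the clean runs once no further completions can arrive.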





