[ofa-general] Application blocked in mthca_poll_cq

Bharath Ramesh bramesh at vt.edu
Mon Nov 5 09:37:04 PST 2007


I am not sure which version of OFED is being used, but it's most likely
OFED-1.2. Is there any way to find out which version of OFED is installed?
libmthca.so points to libmthca-rdmav2.so; I am not sure if this helps. My
application is multithreaded. Every time this happens and I attach gdb to
the process, I find that mthca_poll_cq is the one blocking, and sometimes
the call is blocking on pthread_spin_unlock, which is surprising, as I
wouldn't expect pthread_spin_unlock to block. I am sure that I am not doing
any use-after-free. I don't destroy the CQ until the application is
terminating, and this situation occurs well before the application
terminates.

Thanks,

Bharath


-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Monday, November 05, 2007 11:49 AM
To: Bharath Ramesh
Cc: OFA-General
Subject: Re: [ofa-general] Application blocked in mthca_poll_cq

 > Every now and then I notice that my application blocks inside
 > mthca_poll_cq. When I attach gdb to the process I find it's blocking on a
 > call to pthread_spin_lock/pthread_spin_unlock. I am not sure if this is a
 > bug or something wrong with what I am doing. I am calling ibv_poll_cq with
 > the number of entries as 1. Any help on this would be much appreciated. I
 > am not able to replicate it in a separate test program.
 > There is no other call to ibv_poll_cq.

What version of libmthca are you using?  libmthca 1.0.2 and earlier had a
bug that could cause this in rare circumstances (if you destroy two QPs
simultaneously from different threads and the two QPs are such that the
receive CQ of one QP is the send CQ of the other and vice versa).  To be
honest I doubt you're hitting this.

The only operations in libmthca that hit the CQ spinlock are:
 - polling the CQ
 - resizing a CQ
 - modifying a QP to RESET
 - destroying a QP
All of that code seems to take and release the CQ spinlock properly.

I assume your application is multithreaded?  When it gets stuck it would be
useful to know which other thread is holding the CQ lock that poll_cq is
blocked on; I don't know of a really good way to figure that out though.

Is it possible that you have a use-after-free where you destroy a CQ and
then call poll with a pointer to the freed CQ?

 - R.



