[ofa-general] perfquery causes kernel to be stuck in ib_unregister_mad_agent() function

Hal Rosenstock hrosenstock at xsigo.com
Tue Apr 29 03:49:20 PDT 2008


Hi Jean-Francois,

On Tue, 2008-04-29 at 10:17 +0200, Jean-Francois.Neyroud wrote:
> If I attempt to query the performance counters on all nodes of a
> cluster (40 nodes) at the same time, perfquery causes the kernel to
> get stuck in the ib_unregister_mad_agent() function.
> 
> It is impossible to interrupt perfquery with CTRL-C or CTRL-Z; it is stuck in the kernel.
> # pgrep perfquery
> 27578
> # cat /proc/27578/wchan
> ib_unregister_mad_agent
> 
> I have this problem with OFED-1.2.5 and 1.3, and with both mthca and
> ConnectX; not tested with other HCAs or OFED versions.
> 
> Reproducer with 2 nodes and no switch:
> 
> # for i in `seq 1 100`; do perfquery >/dev/null 2>&1 & done
> 
> # pgrep perfquery | while read pid; do echo "$pid: `cat /proc/$pid/wchan`"; echo; done | dshbak -c
> ----------------
> [14936,14938-15029]
> ----------------
>  0
> ----------------
> 
> ----------------
> ----------------
> 14937
> ----------------
>  flush_cpu_workqueue
> 
> 
> Is anyone familiar with this problem?
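
For reference, the wait-channel listing above can be gathered without the extra blank lines that the `echo` in the quoted pipeline produces. This is only a sketch: `wchan_of_pids` and its optional proc-root argument are hypothetical names chosen for illustration, and a PID that has already exited is reported as "(gone)".

```shell
#!/bin/sh
# Read PIDs on stdin, one per line, and print "pid: wchan" for each.
# wchan_of_pids is an illustrative helper, not part of infiniband-diags.
# The optional first argument overrides /proc (useful for testing).
wchan_of_pids() {
    proc=${1:-/proc}
    while read -r pid; do
        # Skip empty input lines
        [ -n "$pid" ] || continue
        if [ -r "$proc/$pid/wchan" ]; then
            wchan=$(cat "$proc/$pid/wchan")
        else
            # Process exited between pgrep and the read
            wchan="(gone)"
        fi
        echo "$pid: $wchan"
    done
}
```

It could be used as `pgrep perfquery | wchan_of_pids`; appending `| sort -k2 | uniq -c -f1` should give a count per wait channel, at the cost of showing only one PID per group.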

This could be related to the lock dependency issue discussed in the
following thread:

http://lists.openfabrics.org/pipermail/general/2008-January/044723.html

You might want to look at the following commit for the actual fix:

commit 2fe7e6f7c9f55eac24c5b3cdf56af29ab9b0ca81
Author: Roland Dreier <rolandd at cisco.com>
Date:   Fri Jan 25 14:15:42 2008 -0800

    IB/umad: Simplify and fix locking
    
    In addition to being overly complex, the locking in user_mad.c is
    broken: there were multiple reports of deadlocks and lockdep warnings.
    In particular it seems that a single thread may end up trying to take
    the same rwsem for reading more than once, which is explicitly
    forbidden in the comments in <linux/rwsem.h>.
    
    To solve this, we change the locking to use plain mutexes instead of
    rwsems.  There is one mutex per open file, which protects the contents
    of the struct ib_umad_file, including the array of agents and list of
    queued packets; and there is one mutex per struct ib_umad_port, which
    protects the contents, including the list of open files.  We never
    hold the file mutex across calls to functions like ib_unregister_mad_agent()
,
    which can call back into other ib_umad code to queue a packet, and we
    always hold the port mutex as long as we need to make sure that a
    device is not hot-unplugged from under us.
    
    This even makes things nicer for users of the -rt patch, since we
    remove calls to downgrade_write() (which is not implemented in -rt).
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

I don't think this change was incorporated into either OFED 1.2.5 or 1.3.
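
A rough way to check a particular kernel source tree is to look for the old rwsem-based locking in user_mad.c. The sketch below assumes that an unfixed file still mentions rw_semaphore or downgrade_write(), as the commit message implies; it is a heuristic, not a definitive test, and `umad_locking_state` is a hypothetical helper name.

```shell
#!/bin/sh
# Heuristic: does user_mad.c still use the old rwsem locking?
# Assumption: the unfixed file references rw_semaphore or downgrade_write(),
# which the commit message says were replaced by plain mutexes.
# umad_locking_state is an illustrative name, not part of OFED.
umad_locking_state() {
    src=$1
    if [ ! -r "$src" ]; then
        echo "unknown"    # file missing or unreadable
        return 2
    fi
    if grep -Eq 'rw_semaphore|downgrade_write' "$src"; then
        echo "rwsem"      # old locking found, fix likely missing
        return 1
    fi
    echo "mutex"          # no rwsem usage found, fix likely present
    return 0
}
```

In a kernel.org-style tree this would be run as `umad_locking_state drivers/infiniband/core/user_mad.c`; the location of the file inside an OFED source package may differ.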

-- Hal

> 
> Jean-Francois.
