[ofa-general] perfquery causes kernel to be stuck in ib_unregister_mad_agent() function
Hal Rosenstock
hrosenstock at xsigo.com
Tue Apr 29 03:49:20 PDT 2008
Hi Jean-Francois,
On Tue, 2008-04-29 at 10:17 +0200, Jean-Francois.Neyroud wrote:
> If I attempt to query the performance counters on all nodes of a
> cluster (40 nodes) at the same time, perfquery causes the kernel to be
> stuck in the ib_unregister_mad_agent() function.
>
> It is impossible to interrupt perfquery with CTRL-C or CTRL-Z; it is stuck in the kernel.
> # pgrep perfquery
> 27578
> # cat /proc/27578/wchan
> ib_unregister_mad_agent
>
> I have this problem with OFED 1.2.5 and 1.3, and with mthca or ConnectX;
> I have not tested other HCAs or other OFED versions.
>
> Reproducer with 2 nodes and no switch:
>
> # for i in `seq 1 100`; do perfquery >/dev/null 2>&1 & done
>
> # pgrep perfquery | while read pid; do echo "$pid: `cat /proc/$pid/wchan`"; echo; done | dshbak -c
> ----------------
> [14936,14938-15029]
> ----------------
> 0
> ----------------
>
> ----------------
> ----------------
> 14937
> ----------------
> flush_cpu_workqueue
>
>
> Does anyone know of this problem?
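As an aside, the per-PID wchan check in the reproducer can also be done without dshbak. The following is only a minimal sketch, assuming a Linux /proc layout; the helper names read_wchan and group_by_wchan are made up for illustration, and treating an empty wchan file as "0" is an assumption.

```python
import os
from collections import defaultdict


def read_wchan(pid, proc_root="/proc"):
    """Return the wait channel of a PID, or None if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            # Assumption: an empty file is treated like "0" (not blocked).
            return f.read().strip() or "0"
    except OSError:
        # The process exited, or we lack permission to read its entry.
        return None


def group_by_wchan(pids, proc_root="/proc"):
    """Group PIDs by the kernel function they are currently sleeping in."""
    groups = defaultdict(list)
    for pid in pids:
        wchan = read_wchan(pid, proc_root)
        if wchan is not None:
            groups[wchan].append(pid)
    return {name: sorted(members) for name, members in groups.items()}
```

Feeding it the PIDs printed by pgrep perfquery gives one entry per wait channel, so a process blocked in ib_unregister_mad_agent or flush_cpu_workqueue stands out from the ones reporting 0.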
This could be related to the lock dependency issue discussed in the
following thread:
http://lists.openfabrics.org/pipermail/general/2008-January/044723.html
You might want to look at the following commit for the actual fix:
commit 2fe7e6f7c9f55eac24c5b3cdf56af29ab9b0ca81
Author: Roland Dreier <rolandd at cisco.com>
Date: Fri Jan 25 14:15:42 2008 -0800
IB/umad: Simplify and fix locking
In addition to being overly complex, the locking in user_mad.c is
broken: there were multiple reports of deadlocks and lockdep warnings.
In particular it seems that a single thread may end up trying to take
the same rwsem for reading more than once, which is explicitly
forbidden in the comments in <linux/rwsem.h>.
To solve this, we change the locking to use plain mutexes instead of
rwsems. There is one mutex per open file, which protects the contents
of the struct ib_umad_file, including the array of agents and list of
queued packets; and there is one mutex per struct ib_umad_port, which
protects the contents, including the list of open files. We never
hold the file mutex across calls to functions like
ib_unregister_mad_agent(), which can call back into other ib_umad code
to queue a packet, and we
always hold the port mutex as long as we need to make sure that a
device is not hot-unplugged from under us.
This even makes things nicer for users of the -rt patch, since we
remove calls to downgrade_write() (which is not implemented in -rt).
Signed-off-by: Roland Dreier <rolandd at cisco.com>
I don't think this change was incorporated into either OFED 1.2.5 or 1.3.
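To illustrate the rule from the commit message (do not hold the file mutex across a call that can call back into code taking the same mutex), here is a minimal userspace sketch in Python. It is not the kernel code: UmadFile, Agent, and unregister_agent are hypothetical stand-ins, the per-port mutex is left out, and threading.Lock is used only because, like a kernel mutex, it is not reentrant.

```python
import threading


class Agent:
    """Hypothetical stand-in for a registered MAD agent."""

    def __init__(self, agent_id, owner):
        self.agent_id = agent_id
        self.owner = owner


def unregister_agent(agent):
    """Stand-in for ib_unregister_mad_agent().

    Assumption for this sketch: tearing down the agent flushes pending
    work, which calls back into the owning file to queue a packet.
    """
    agent.owner.queue_packet(("flushed", agent.agent_id))


class UmadFile:
    """Stand-in for one open file with its own mutex."""

    def __init__(self):
        self.mutex = threading.Lock()  # not reentrant
        self.agents = {}
        self.queue = []

    def register(self, agent_id):
        with self.mutex:
            self.agents[agent_id] = Agent(agent_id, self)

    def queue_packet(self, packet):
        # Callback path: takes the file mutex itself.
        with self.mutex:
            self.queue.append(packet)

    def unregister(self, agent_id):
        # Detach the agent from the table while holding the mutex...
        with self.mutex:
            agent = self.agents.pop(agent_id, None)
        # ...then call the teardown with the mutex released, so the
        # callback into queue_packet() can take it without deadlocking.
        if agent is not None:
            unregister_agent(agent)
        return agent is not None
```

If unregister() called unregister_agent() inside the with block instead, queue_packet() would block forever on the mutex its own caller holds, which is the kind of self-deadlock the commit is meant to rule out.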
-- Hal
>
> Jean-Francois.