[ewg] Re: [ofa-general] soft lockup in the kernel mad layer
Or Gerlitz
ogerlitz at voltaire.com
Tue Jul 1 02:44:04 PDT 2008
Or Gerlitz wrote:
> While doing some tests against nodes with new HCA firmware (ConnectX FW 2.5), which seems to respond very slowly to node info queries, I think I have hit one or more bugs in the kernel MAD code. The IB bits used on this node are not the mainline kernel ones but rather
> git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
> commit 564e9e9383272f4311fd87ff4e5447cfcebad73a
>
Jack, Vlad
Looking now at the ofed_1_3/linux-2.6.git tree, I don't see the commit
below there. Am I correct?
Is it because the fix went into the kernel after the "feature freeze"
of OFED 1.3 and was never pulled into OFED, since you don't pick up
all the fixes that get into the kernel during an OFED cycle?
Or.
> commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
> Author: Sean Hefty <sean.hefty at intel.com>
> Date: Fri Nov 30 17:30:18 2007 -0800
>
> IB/mad: Fix incorrect access to items on local_list
>
> In cancel_mads(), MADs are moved from the wait_list and local_list
> to a cancel_list for processing. However, the structures on these two
> lists are not the same. The wait_list references struct
> ib_mad_send_wr_private, but local_list references struct
> ib_mad_local_private. Cancel_mads() treats all items moved to the
> cancel_list as struct ib_mad_send_wr_private. This leads to a system
> crash when requests are moved from the local_list to the cancel_list.
>
> Fix this by leaving local_list alone. All requests on the local_list
> have completed and are just awaiting processing by a queued worker thread.
>
> Bug (crash) reported by Dotan Barak <dotanb at dev.mellanox.co.il>.
> Problem with local_list access reported by Robert Reynolds
> <rreynolds at opengridcomputing.com>.
>
> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> Signed-off-by: Roland Dreier <rolandd at cisco.com>
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 5eace99..fbe16d5 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -2275,8 +2275,6 @@ static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv)
>
> /* Empty wait list to prevent receives from finding a request */
> list_splice_init(&mad_agent_priv->wait_list, &cancel_list);
> - /* Empty local completion list as well */
> - list_splice_init(&mad_agent_priv->local_list, &cancel_list);
> spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
>
> /* Report all cancelled requests */
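
For anyone following the thread, the mix-up the commit message describes can be sketched in user space. This is only an illustration under assumptions: the types and helpers below (send_wr_sketch, local_sketch, agent_sketch, cancel_mads_sketch) are hypothetical, simplified stand-ins and not the real definitions from mad.c or <linux/list.h>, and all locking is omitted. The point is that the cancel path converts every list node back to one struct type, so only nodes that really live inside that type may be spliced onto the cancel list.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's intrusive list. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/* Move every entry of src to the front of dst and leave src empty. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
    struct list_head *first = src->next, *last = src->prev, *at = dst->next;

    if (list_empty(src))
        return;
    first->prev = dst;
    dst->next = first;
    last->next = at;
    at->prev = last;
    list_init(src);
}

enum { STATUS_PENDING = 0, STATUS_FLUSHED = 1 };

/* Hypothetical layouts: the list link sits at a different offset in each. */
struct send_wr_sketch {
    int status;
    struct list_head agent_list;       /* link used on wait_list */
};

struct local_sketch {
    struct list_head completion_list;  /* link used on local_list */
    int pending;
};

struct agent_sketch {
    struct list_head wait_list;        /* holds send_wr_sketch entries */
    struct list_head local_list;       /* holds local_sketch entries */
};

static void agent_init(struct agent_sketch *agent)
{
    list_init(&agent->wait_list);
    list_init(&agent->local_list);
}

/* Flush only wait_list entries; returns how many were flushed. */
static int cancel_mads_sketch(struct agent_sketch *agent)
{
    struct list_head cancel_list;
    int flushed = 0;

    assert(agent != NULL);
    list_init(&cancel_list);
    list_splice_init(&agent->wait_list, &cancel_list);
    /* local_list is deliberately NOT spliced: its nodes are embedded in
     * local_sketch, so the container_of() below would compute a bogus
     * send_wr_sketch pointer for them. */

    while (!list_empty(&cancel_list)) {
        struct list_head *pos = cancel_list.next;
        struct send_wr_sketch *wr =
            container_of(pos, struct send_wr_sketch, agent_list);

        list_del(pos);
        wr->status = STATUS_FLUSHED;
        flushed++;
    }
    return flushed;
}
```

With two requests on wait_list and one on local_list, cancel_mads_sketch() flushes the two waiting requests and leaves the local one queued, which mirrors what the patch does by dropping the second list_splice_init() call.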