[openib-general] [PATCH] ib_cancel_mad API

Wed Sep 29 13:12:36 PDT 2004

    Sean> I'm not sure why this is an issue.  The user receives
    Sean> exactly one callback for every sent MAD.  Even if the MAD is
    Sean> found, the cancel operation will not complete until after
    Sean> all posted work requests have completed.

OK, think about the following scenario.  We're in ib_mad_complete_send_wr():

	/* Remove send from MAD agent and notify client of completion. */
	list_del(&mad_send_wr->agent_send_list);
	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);

/* HERE ===> */

	if (mad_send_wr->status != IB_WC_SUCCESS)
		mad_send_wc->status = mad_send_wr->status;
	mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc);

and the kernel gets preempted or a long-running interrupt comes along
where I've marked.  Now, on another CPU (or after the preemption), a
consumer calls ib_cancel_mad(), which does:

	spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags);
	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
			    agent_send_list) {
		if (mad_send_wr->wr_id == wr_id)
			goto found;
	}
	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
	return -EINVAL;

and doesn't find the work request, since ib_mad_complete_send_wr() has
already removed it from the list, so it returns -EINVAL.

The consumer says, "oh, OK, I have no pending requests so I can free
my context."  Then the first thread continues and proceeds to call the
consumer's send_handler function.
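
To make the interleaving concrete, here is a minimal user-space sketch that replays the three steps above in order on one thread. Everything in it is a stand-in I made up for illustration (the sim_* names, the on_list flag in place of send_list, the context_freed flag in place of the consumer really freeing memory); it is not the MAD code, just a model of the ordering problem.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these names exist in the MAD layer. */
struct sim_send {
	unsigned long wr_id;
	bool on_list;		/* models membership on the agent's send_list */
};

struct sim_consumer {
	bool context_freed;	/* set once the consumer tears down its context */
	int late_callbacks;	/* send_handler calls that arrived after teardown */
};

/* First half of completion: unlink the send and (conceptually) drop the lock. */
static void sim_complete_begin(struct sim_send *s)
{
	assert(s->on_list);
	s->on_list = false;
}

/* Second half of completion: call the consumer's send handler. */
static void sim_complete_finish(struct sim_consumer *c)
{
	if (c->context_freed)
		c->late_callbacks++;	/* in the kernel this would touch freed memory */
}

/* Cancel modelled on the quoted loop: -EINVAL when wr_id is not on the list. */
static int sim_cancel(struct sim_send *s, unsigned long wr_id)
{
	if (s->on_list && s->wr_id == wr_id) {
		s->on_list = false;
		return 0;
	}
	return -EINVAL;
}

/* Replay the interleaving; returns how many callbacks hit a freed context. */
static int sim_replay_race(void)
{
	struct sim_send send = { .wr_id = 42, .on_list = true };
	struct sim_consumer consumer = { .context_freed = false, .late_callbacks = 0 };

	sim_complete_begin(&send);		/* CPU 0 stalls at "HERE" */
	if (sim_cancel(&send, 42) == -EINVAL)	/* CPU 1 finds nothing to cancel */
		consumer.context_freed = true;	/* consumer decides it is idle */
	sim_complete_finish(&consumer);		/* CPU 0 resumes and calls back */

	return consumer.late_callbacks;
}
```

Calling sim_replay_race() reports one callback delivered after the consumer believed it had nothing outstanding, which is exactly the window between the list_del() and the send_handler call.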

We can say that the consumer has to have reference counting or
otherwise protect itself against this, but it makes more sense to me
to avoid this sort of bug in common code rather than debugging every
consumer...
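
As one possible shape for handling this in common code, here is a rough sketch. It is only an assumption on my part, not what the patch does: the completion path records that a callback is still owed before it drops the lock, so cancel can tell "nothing outstanding" apart from "handler about to run". The sketch_* names, the callbacks_pending counter, and the choice of -EBUSY are all hypothetical, and a single flag stands in for the real per-wr_id list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model; not the MAD layer's real data structures. */
struct sketch_agent {
	bool on_list;		/* the send is still queued on the send list */
	int callbacks_pending;	/* sends unlinked whose handler has not returned yet */
};

/* Completion, first half: would run with send_list_lock held. */
static void sketch_complete_begin(struct sketch_agent *a)
{
	assert(a->on_list);
	a->on_list = false;
	a->callbacks_pending++;	/* recorded before the lock would be dropped */
}

/* Completion, second half: runs after the consumer's handler has returned. */
static void sketch_complete_end(struct sketch_agent *a)
{
	assert(a->callbacks_pending > 0);
	a->callbacks_pending--;
}

/*
 * Cancel: 0 if the send was still queued, -EBUSY if a callback is still
 * owed to the consumer, -EINVAL only when nothing at all is outstanding.
 */
static int sketch_cancel(struct sketch_agent *a)
{
	if (a->on_list) {
		a->on_list = false;
		return 0;
	}
	if (a->callbacks_pending > 0)
		return -EBUSY;
	return -EINVAL;
}
```

With something along these lines, a consumer that sees -EBUSY knows it must not free its context yet; only -EINVAL would mean the handler has already run or was never going to.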

 - R.