[Openib-windows] A problem in ib_close_al

Fabian Tillier ftillier at silverstorm.com
Mon Jul 24 15:41:19 PDT 2006


Hi Leo,

On 7/23/06, Leonid Keller <leonid at mellanox.co.il> wrote:
> Hi Fab,
> Seems like I found the reason of the stuck on shutdown.
> Find attached 2 patches for problems, which I come across on during
> investigating of this case.
> Here are short description.
> 1. (a bug responsible for the stuck)
>        If a send MAD times out, it sends once more, so one can get 2
> responds for it.
>        __process_recv_resp() stores the second respond on place of they
> first one (h_send->p_resp_mad)
>        and the latter gets never released !
> 2. there are 2 more places, that *seems* like "forget" to release
> response MADs - please, check me.
> 3. synchronization problems - we have talked them already over. I
> suggest a variant of a fix.

I just checked in the special QP synchronization and MAD tracking
fixes in revisions 427 and 426, respectively.

I had missed the synchronization problem in the user-mode MAD pool.
This has been checked in as revision 428.

> One more question:
> Is it possible for internal send in __mad_svc_send_done to have a
> response on it ?

Internal MADs don't get responses - they are any non-data RMPP MADs
(RMPP ACK, ABORT, and STOP MADs.)  Only RMPP ACK messages are sent in
IBAL, but it handles receiving any ABORT and STOP messages.

> If - yes, one needs to release it.
>
> I'll appreciate, if you can look these patches ASAP - we need them for
> the release.

In the al_mad.patch:

> Index: al_mad.c
> ===================================================================
> --- al_mad.c	(revision 425)
> +++ al_mad.c	(working copy)
> @@ -1257,6 +1257,9 @@
>  		AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC, ("canceling MAD\n") );
>  		h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item );
>  		h_send->canceled = TRUE;
> +		/* __check_send_queue() skips MADs with 'retry_time=MAX_TIME'
> +		before it checks 'canceled' field, so part of the request will be skipped */
> +		h_send->retry_time = 0;
>  	}
>  	cl_spinlock_release( &h_mad_svc->obj.lock );

You can't set retry_time to zero here - the MAD will be referenced in
the send completion callback.  If retry_time is MAX_TIME, it means the
send is posted on the QP, but hasn't yet completed.  It should just be
flagged as cancelled.  The send completion will finish processing it.

> @@ -2260,8 +2264,14 @@
>  		/* The send is currently active.  Do not report it. */
>  		AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC,
>  			("resp send active TID:0x%I64x\n", p_mad_hdr->trans_id) );
> +		p_resp_mad = h_send->p_resp_mad;
>  		h_send->p_resp_mad = p_mad_element;
>  		cl_spinlock_release( &h_mad_svc->obj.lock );
> +		if (p_resp_mad)
> +		{
> +			/* we got a second response to that send --> drop the first one */
> +			ib_put_mad( p_resp_mad );
> +		}
>  	}
>  	else
>  	{

You can put the MAD back into its pool while holding the MAD service
lock.  Did you actually see this happen, where a duplicate response
was received?

> @@ -2273,6 +2283,12 @@
>  			(cl_list_item_t*)&h_send->pool_item );
>  		cl_spinlock_release( &h_mad_svc->obj.lock );
>
> +		if (h_send->p_resp_mad)
> +		{
> +			/* we got a second response to that send --> drop the first one */
> +			ib_put_mad( h_send->p_resp_mad );
> +		}
> +
>  		/* Report the receive. */
>  		h_mad_svc->pfn_user_recv_cb( h_mad_svc, (void*)h_mad_svc->obj.context,
>  			p_mad_element );

If the send is complete, it is removed from the list and a duplicate
response will not find it in the send list (see the call to
__mad_svc_match_recv earlier in that function).

> @@ -3092,9 +3108,7 @@
>  	{
>  		h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item );
>
> -		h_mad_svc->pfn_user_send_cb( h_mad_svc, (void*)h_mad_svc->obj.context,
> -			h_send->p_send_mad );
> -		__cleanup_mad_send( h_mad_svc, h_send );
> +		__notify_send_comp( h_mad_svc, h_send, h_send->p_send_mad->status);
>  		p_list_item = cl_qlist_remove_head( &timeout_list );
>  	}
>  	AL_EXIT( AL_DBG_MAD_SVC );

I don't understand what you were trying to do here.  If a MAD times
out, it cannot have a response associated with it, so calling
__notify_send_comp doesn't accomplish anything.  Am I missing
something?  Did you see a timeout where the response mad was there?

Thanks,

- Fab




More information about the ofw mailing list