[Openib-windows] A problem in ib_close_al
Fabian Tillier
ftillier at silverstorm.com
Mon Jul 24 15:41:19 PDT 2006
Hi Leo,
On 7/23/06, Leonid Keller <leonid at mellanox.co.il> wrote:
> Hi Fab,
> It seems I have found the cause of the hang on shutdown.
> Attached are 2 patches for problems that I came across while
> investigating this case.
> Here is a short description.
> 1. (the bug responsible for the hang)
> If a send MAD times out, it is sent once more, so one can get 2
> responses for it.
> __process_recv_resp() stores the second response in place of the
> first one (h_send->p_resp_mad)
> and the first one never gets released!
> 2. There are 2 more places that *seem* to "forget" to release
> response MADs - please check me.
> 3. Synchronization problems - we have already talked them over. I
> suggest a variant of a fix.
I just checked in the special QP synchronization and MAD tracking
fixes in revisions 427 and 426, respectively.
I had missed the synchronization problem in the user-mode MAD pool.
This has been checked in as revision 428.
> One more question:
> Is it possible for an internal send in __mad_svc_send_done to have a
> response on it?
Internal MADs don't get responses. They are the non-data RMPP MADs
(RMPP ACK, ABORT, and STOP). IBAL itself only sends RMPP ACK messages,
but it handles any ABORT and STOP messages it receives.
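The distinction can be sketched as a small classifier. This is only an illustration: the enum names and is_internal_rmpp() are hypothetical and are not IBAL identifiers, and the numeric values are the RMPPType encodings as I understand them from the InfiniBand specification (0 meaning the MAD is not RMPP at all).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed RMPPType encodings; names are illustrative, not IBAL's. */
enum {
    RMPP_TYPE_NONE  = 0,   /* not an RMPP MAD */
    RMPP_TYPE_DATA  = 1,
    RMPP_TYPE_ACK   = 2,
    RMPP_TYPE_STOP  = 3,
    RMPP_TYPE_ABORT = 4
};

/*
 * Returns true for the non-data RMPP types (ACK, STOP, ABORT).
 * These are the MADs that never get a response of their own.
 */
static bool is_internal_rmpp(uint8_t rmpp_type)
{
    assert(rmpp_type <= RMPP_TYPE_ABORT);   /* reject undefined types */

    return rmpp_type == RMPP_TYPE_ACK ||
           rmpp_type == RMPP_TYPE_STOP ||
           rmpp_type == RMPP_TYPE_ABORT;
}
```

Under this model a send for which is_internal_rmpp() is true never has a response MAD attached, so there is nothing to release for it at send completion.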
> If yes, one needs to release it.
>
> I'd appreciate it if you could look at these patches ASAP - we need
> them for the release.
In the al_mad.patch:
> Index: al_mad.c
> ===================================================================
> --- al_mad.c (revision 425)
> +++ al_mad.c (working copy)
> @@ -1257,6 +1257,9 @@
> AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC, ("canceling MAD\n") );
> h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item );
> h_send->canceled = TRUE;
> + /* __check_send_queue() skips MADs with 'retry_time=MAX_TIME'
> + before it checks 'canceled' field, so part of the request will be skipped */
> + h_send->retry_time = 0;
> }
> cl_spinlock_release( &h_mad_svc->obj.lock );
You can't set retry_time to zero here - the MAD will be referenced in
the send completion callback. If retry_time is MAX_TIME, it means the
send is posted on the QP, but hasn't yet completed. It should just be
flagged as cancelled. The send completion will finish processing it.
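A minimal model of that ownership rule, as a sketch only: send_model_t, model_cancel(), model_timer_may_process() and model_send_done() are hypothetical stand-ins rather than the real al_mad.c code, and the behavior of the non-canceled branch (arming a retry time) is an assumption made to keep the example complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the sentinel: "posted on the QP, not yet completed". */
#define MAX_TIME UINT64_MAX

/* Hypothetical, stripped-down stand-in for al_mad_send_t. */
typedef struct {
    uint64_t retry_time;   /* MAX_TIME while the send is in flight */
    bool     canceled;
    bool     reported;     /* completion handed back to the user */
} send_model_t;

/* Cancel path: only flag the send, leave retry_time untouched. */
static void model_cancel(send_model_t *s)
{
    s->canceled = true;
}

/* Timer path: an in-flight send belongs to the completion callback. */
static bool model_timer_may_process(const send_model_t *s)
{
    return s->retry_time != MAX_TIME;
}

/* Send-completion path: this is where a canceled send is finished. */
static void model_send_done(send_model_t *s, uint64_t now, uint64_t timeout)
{
    assert(s->retry_time == MAX_TIME);   /* must still be in flight */

    if (s->canceled) {
        s->reported = true;              /* finish it, no retry armed */
        return;
    }
    s->retry_time = now + timeout;       /* assumed: wait for a response */
}
```

If the cancel path zeroed retry_time, model_timer_may_process() would return true and the timer could clean up the send while the completion callback still expects to touch it.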
> @@ -2260,8 +2264,14 @@
> /* The send is currently active. Do not report it. */
> AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_MAD_SVC,
> ("resp send active TID:0x%I64x\n", p_mad_hdr->trans_id) );
> + p_resp_mad = h_send->p_resp_mad;
> h_send->p_resp_mad = p_mad_element;
> cl_spinlock_release( &h_mad_svc->obj.lock );
> + if (p_resp_mad)
> + {
> + /* we got a second response to that send --> drop the first one */
> + ib_put_mad( p_resp_mad );
> + }
> }
> else
> {
You can put the MAD back into its pool while holding the MAD service
lock. Did you actually see this happen, where a duplicate response
was received?
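In other words, the hunk could drop the earlier response inside the critical section instead of carrying it out in a local variable. A sketch of that shape, with toy types: lock_model_t, mad_model_t, resp_send_t, put_mad_model() and store_response() are all hypothetical and stand in for the spinlock, the MAD element, al_mad_send_t, ib_put_mad() and the patched branch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real IBAL types. */
typedef struct { int held; }    lock_model_t;
typedef struct { int in_pool; } mad_model_t;

typedef struct {
    lock_model_t  lock;
    mad_model_t  *p_resp_mad;   /* response stored while send is active */
} resp_send_t;

static void lock_acquire(lock_model_t *l) { assert(!l->held); l->held = 1; }
static void lock_release(lock_model_t *l) { assert(l->held);  l->held = 0; }

/* Models returning a MAD element to its pool. */
static void put_mad_model(mad_model_t *m) { m->in_pool = 1; }

/* Store a response, dropping any earlier one while the lock is held. */
static void store_response(resp_send_t *s, mad_model_t *new_resp)
{
    lock_acquire(&s->lock);
    if (s->p_resp_mad != NULL)
        put_mad_model(s->p_resp_mad);   /* earlier response is released */
    s->p_resp_mad = new_resp;
    lock_release(&s->lock);
}
```

The old pointer is never visible outside the lock, so there is no window in which two paths could both try to release it.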
> @@ -2273,6 +2283,12 @@
> (cl_list_item_t*)&h_send->pool_item );
> cl_spinlock_release( &h_mad_svc->obj.lock );
>
> + if (h_send->p_resp_mad)
> + {
> + /* we got a second response to that send --> drop the first one */
> + ib_put_mad( h_send->p_resp_mad );
> + }
> +
> /* Report the receive. */
> h_mad_svc->pfn_user_recv_cb( h_mad_svc, (void*)h_mad_svc->obj.context,
> p_mad_element );
If the send is complete, it is removed from the list and a duplicate
response will not find it in the send list (see the call to
__mad_svc_match_recv earlier in that function).
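A toy model of that lookup, to make the argument concrete. It assumes matching is done on the transaction ID alone and uses a fixed array in place of the real send list; send_list_t, list_add(), match_recv() and complete_send() are hypothetical names, not the IBAL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SENDS 8

/* Outstanding sends, identified here only by transaction ID. */
typedef struct {
    uint64_t tid[MAX_SENDS];
    size_t   count;
} send_list_t;

static bool list_add(send_list_t *l, uint64_t tid)
{
    if (l->count == MAX_SENDS)
        return false;
    l->tid[l->count] = tid;
    l->count++;
    return true;
}

/* Returns the index of the matching send, or -1 if none is tracked. */
static int match_recv(const send_list_t *l, uint64_t tid)
{
    for (size_t i = 0; i < l->count; i++) {
        if (l->tid[i] == tid)
            return (int)i;
    }
    return -1;
}

/* Completing a send removes it, so later responses cannot match it. */
static void complete_send(send_list_t *l, int idx)
{
    assert(idx >= 0 && (size_t)idx < l->count);
    l->count--;
    l->tid[idx] = l->tid[l->count];   /* move last entry into the gap */
}
```

Once complete_send() has run, a duplicate response with the same transaction ID gets -1 from match_recv(), so the completed send cannot already be holding a response when the receive is reported.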
> @@ -3092,9 +3108,7 @@
> {
> h_send = PARENT_STRUCT( p_list_item, al_mad_send_t, pool_item );
>
> - h_mad_svc->pfn_user_send_cb( h_mad_svc, (void*)h_mad_svc->obj.context,
> - h_send->p_send_mad );
> - __cleanup_mad_send( h_mad_svc, h_send );
> + __notify_send_comp( h_mad_svc, h_send, h_send->p_send_mad->status);
> p_list_item = cl_qlist_remove_head( &timeout_list );
> }
> AL_EXIT( AL_DBG_MAD_SVC );
I don't understand what you were trying to do here. If a MAD times
out, it cannot have a response associated with it, so calling
__notify_send_comp doesn't accomplish anything. Am I missing
something? Did you see a timeout where a response MAD was present?
Thanks,
- Fab