[openib-general][patch review] srp: fmr implementation,

Fri May 5 11:53:50 PDT 2006

Roland Dreier wrote:
>  > 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd)
>  >     a. abort get timeout or
>  >     b. req->cmd_done or
>  >     c. !req->tsk_status
>  > 2. we should do step (1) for .eh_abort_handler(scmnd) only and don't
>  > do step 1 for .eh_device_reset_handler(scmnd) since same scsi command
>  > is used for all .eh_handler()
>  > 3. scsi command is used in all .eh_handler() will be freed by scsi
>  > midlayer at the end of error handling sequences
>  > 4. If we don't do step 1, scsi command which is used in all
>  > .eh_handler() and freed is still in our pending queue and is
>  > referenced in srp_reconnect_target() / reinit request ring
> 
> So I finally got a chance to look at this in detail.  It does look
> like we should remove the request in (1) if the command finishes or
> the abort succeeds.  However if the abort times out then then command
> is still out there -- shouldn't we wait for the
> eh_device_reset_handler and then flush all matching commands there?
> 

We should remove the request in (1) if the command finishes 
or the abort succeeds for eh_abort_handler only

For abort times out case, we can do either remove the 
command in eh_abort_handler or wait for 
eh_device_reset_handler (this may be better since it gives a 
command more time with a chance to complete.

> And I don't understand (4) -- isn't srp_reconnect_target() being
> called from srp_reset_host() as part of the error handling sequence?
> Unless I'm misreading the code in scsi_error.c, commands don't get
> freed (assuming all aborts and device resets fail) before then.
> 
> What am I missing?  In your case, where the abort and device reset
> fail and then the host reset gets called, where was the command
> getting freed?

reading scsi_error.c again, I find this logic for our case 
(please correct me if I'm wrong)
1. eh_abort_handler and eh_device_reset_handler fail with 
timeout; eh_host_reset_handler successes
2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur
3. either scsi_eh_try_stu or scsi_eh_tur will reuse the scsi 
command and call scsi_send_eh_cmnd to send STU or TUR command
4. scsi_send_eh_cmnd calls srp_queuecommand which will get 
new req, reformat scsi_done pointer to scsi_eh_done, and add 
req to req_queue for this same scsi command with different 
opcode (ie. STU or TUR)
5. In my case I got QP event 1 - so scsi_send_eh_cmnd will 
get to timeout case and call eh_abort_handler for this scsi 
command with opcode STU or TUR
6. scsi_eh_try_stu & scsi_eh_tur will retrieve the old scsi 
command back with scsi_set_cmd_retry; however, srp already 
change and can not retrieve the old scsi_done and 
host_scribble pointer
8. scsi_eh_host_reset fail and scsi_eh_offline_sdevs is called
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd which 
moves the scsi command to done_q and scsi command is freed 
in done_q
10. However the srp req carries this scsi command still in 
our req_queue. The next eh_host_reset_handler will re-init 
the req_queue and use the scsi command pointer (this is the 
crash use-after-freed that we see)

Bottom line my previous patch still does not address the 
logic above - I'll rework the patch and send to you later 
for review

Vu