[openib-general][patch review] srp: fmr implementation,
Vu Pham
vuhuong at mellanox.com
Fri May 5 11:53:50 PDT 2006
Roland Dreier wrote:
> > 1. srp_unmap_data() and srp_remove_req() for .eh_abort_handler(scmnd)
> > a. abort get timeout or
> > b. req->cmd_done or
> > c. !req->tsk_status
> > 2. we should do step (1) for .eh_abort_handler(scmnd) only and don't
> > do step 1 for .eh_device_reset_handler(scmnd) since same scsi command
> > is used for all .eh_handler()
> > 3. scsi command is used in all .eh_handler() will be freed by scsi
> > midlayer at the end of error handling sequences
> > 4. If we don't do step 1, scsi command which is used in all
> > .eh_handler() and freed is still in our pending queue and is
> > referenced in srp_reconnect_target() / reinit request ring
>
> So I finally got a chance to look at this in detail. It does look
> like we should remove the request in (1) if the command finishes or
> the abort succeeds. However if the abort times out then then command
> is still out there -- shouldn't we wait for the
> eh_device_reset_handler and then flush all matching commands there?
>
We should remove the request in (1) if the command finishes
or the abort succeeds for eh_abort_handler only
For abort times out case, we can do either remove the
command in eh_abort_handler or wait for
eh_device_reset_handler (this may be better since it gives a
command more time with a chance to complete.
> And I don't understand (4) -- isn't srp_reconnect_target() being
> called from srp_reset_host() as part of the error handling sequence?
> Unless I'm misreading the code in scsi_error.c, commands don't get
> freed (assuming all aborts and device resets fail) before then.
>
> What am I missing? In your case, where the abort and device reset
> fail and then the host reset gets called, where was the command
> getting freed?
reading scsi_error.c again, I find this logic for our case
(please correct me if I'm wrong)
1. eh_abort_handler and eh_device_reset_handler fail with
timeout; eh_host_reset_handler successes
2. scsi_eh_host_reset goes on with scsi_eh_try_stu & scsi_eh_tur
3. either scsi_eh_try_stu or scsi_eh_tur will reuse the scsi
command and call scsi_send_eh_cmnd to send STU or TUR command
4. scsi_send_eh_cmnd calls srp_queuecommand which will get
new req, reformat scsi_done pointer to scsi_eh_done, and add
req to req_queue for this same scsi command with different
opcode (ie. STU or TUR)
5. In my case I got QP event 1 - so scsi_send_eh_cmnd will
get to timeout case and call eh_abort_handler for this scsi
command with opcode STU or TUR
6. scsi_eh_try_stu & scsi_eh_tur will retrieve the old scsi
command back with scsi_set_cmd_retry; however, srp already
change and can not retrieve the old scsi_done and
host_scribble pointer
8. scsi_eh_host_reset fail and scsi_eh_offline_sdevs is called
9. scsi_eh_offline_sdevs calls scsi_eh_finish_cmd which
moves the scsi command to done_q and scsi command is freed
in done_q
10. However the srp req carries this scsi command still in
our req_queue. The next eh_host_reset_handler will re-init
the req_queue and use the scsi command pointer (this is the
crash use-after-freed that we see)
Bottom line my previous patch still does not address the
logic above - I'll rework the patch and send to you later
for review
Vu
More information about the general
mailing list