[openib-general][patch review] srp: fmr implementation,

Fri May 5 18:33:09 PDT 2006

    Vu> It still does not address the issue pointed out from my
    Vu> previous email - the first eh_host_reset_handler() success,
    Vu> right away scsi_eh_host_reset() send start-stop-unit or
    Vu> test-unit-ready command using the same scsi command. This stu
    Vu> or tur command stuck in our queue, get timeout and get
    Vu> aborted.  The abortion of stu or tur command once again get
    Vu> timeout. The original scsi command get freed. We delay the
    Vu> clean-up of the associated request in
    Vu> eh_device_reset_handler() instead of in eh_abort_handler() so
    Vu> it's still in our queue. The lun is marked offline. The next
    Vu> eh_device_reset_handler() for the same lun won't be
    Vu> called. The next eh_reset_host_handler() will hit
    Vu> used-after-free bug.  You can see the log below

I'm still confused.  Even the original eh_host_reset_handler
implementation throws away all commands in the SRP queue, because
it does:

	for (i = 0; i < SRP_SQ_SIZE - 1; ++i)
		target->req_ring[i].next = i + 1;
	target->req_ring[SRP_SQ_SIZE - 1].next = -1;
	INIT_LIST_HEAD(&target->req_queue);

and the new patched version does

	list_for_each_entry(req, &target->req_queue, list) {
		req->scmnd->result = DID_RESET << 16;
		req->scmnd->scsi_done(req->scmnd);
		srp_unmap_data(req->scmnd, target, req);
	}

on top of that.

So after srp_reconnect_target() returns, SRP has no requests in its
queue.  The only way a command could get back into the queue is if
the SCSI midlayer passes it into the queuecommand function again.
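
To make the argument concrete, here is a rough userspace sketch of the
reset path as I understand it.  This is not the ib_srp code: every
sketch_* name, SKETCH_SQ_SIZE, and the in_flight/pending bookkeeping
are made up for illustration, and the real driver walks a list_head
rather than scanning the ring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SQ_SIZE 8	/* hypothetical stand-in for SRP_SQ_SIZE */

struct sketch_req {
	int next;	/* index of the next free slot, -1 ends the list */
	int in_flight;	/* 1 while the request sits on the pending queue */
};

struct sketch_target {
	struct sketch_req req_ring[SKETCH_SQ_SIZE];
	int req_head;	/* first free slot, -1 if the ring is full */
	int pending;	/* number of requests currently queued */
};

/* Complete every pending request, then rebuild the free list from
 * scratch.  Returns how many requests were thrown away. */
static int sketch_reset(struct sketch_target *t)
{
	int i, completed = 0;

	assert(t != NULL);

	/* Stands in for the list_for_each_entry() loop that finishes
	 * each command with DID_RESET and unmaps its data. */
	for (i = 0; i < SKETCH_SQ_SIZE; ++i) {
		if (t->req_ring[i].in_flight) {
			t->req_ring[i].in_flight = 0;
			++completed;
		}
	}

	/* Same shape as the free-list rebuild quoted above. */
	for (i = 0; i < SKETCH_SQ_SIZE - 1; ++i)
		t->req_ring[i].next = i + 1;
	t->req_ring[SKETCH_SQ_SIZE - 1].next = -1;
	t->req_head = 0;
	t->pending = 0;

	return completed;
}

/* Zero the target and build the initial free list. */
static void sketch_init(struct sketch_target *t)
{
	memset(t, 0, sizeof(*t));
	sketch_reset(t);
}

/* Stands in for queuecommand: take a free slot and mark it pending.
 * Returns the slot index, or -1 if no slot is free. */
static int sketch_queue(struct sketch_target *t)
{
	int i = t->req_head;

	if (i < 0)
		return -1;
	t->req_head = t->req_ring[i].next;
	t->req_ring[i].in_flight = 1;
	t->pending++;
	return i;
}
```

In this model, once sketch_reset() returns, pending is 0 and no slot
is marked in flight, so a stale request can only show up if something
calls sketch_queue() again after the reset.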

I know I'm being dense but could you explain it one more time?

Also, this really worries me:

    Vu> May  5 16:36:24 lab105 kernel: ib_mthca 0000:05:00.0: CQ overrun on CQN 040082

Do you know what's causing this?

 - R.