[openib-general] [PATCH] IB/srp: don't schedule reconnect from srp, scsi does it for us (was Re: FW: OFED 1.1 rc3 srp driver panic)
Michael S. Tsirkin
mst at mellanox.co.il
Thu Sep 7 05:00:01 PDT 2006
Quoting r. ishai at dev.mellanox.co.il <ishai at dev.mellanox.co.il>:
> Subject: Re: FW: OFED 1.1 rc3 srp driver panic
>
> I think I found the race that causes this NULL dereference.
>
> 1) There is a connection error.
>
> 2) srp_completion gets a bad status and schedules a call to srp_reconnect_work.
>
> 3) srp_reconnect_work runs and calls srp_reconnect_target.
>
> 4) srp_reconnect_target starts to run and changes the target state to
> SRP_TARGET_CONNECTING, but there is a context switch before it gets to
> call srp_reset_req.
>
> 5) The SCSI error handling calls srp_reset_host.
>
> 6) srp_reset_host calls srp_reconnect_target, which returns -EAGAIN
> (because the target state is not SRP_TARGET_LIVE).
>
> 7) srp_reset_host returns FAILED and therefore the device goes offline.
>
> 8) Because the device goes offline, its commands are freed (in the
> SCSI mid-layer).
>
> 9) The first invocation of srp_reconnect_target resumes and calls
> srp_reset_req, which tries to access the commands that were already freed.
>
> 10) NULL deref.
>
> Ishai
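For reference, the two paths that race in steps 4)-7) look roughly like
this (a simplified sketch of ib_srp.c, not a verbatim copy):

static int srp_reconnect_target(struct srp_target_port *target)
{
	/* ... */
	spin_lock_irq(target->scsi_host->host_lock);
	if (target->state != SRP_TARGET_LIVE) {
		/* step 6): the work-queue invocation already moved the
		 * state to SRP_TARGET_CONNECTING, so this call fails */
		spin_unlock_irq(target->scsi_host->host_lock);
		return -EAGAIN;
	}
	target->state = SRP_TARGET_CONNECTING;
	spin_unlock_irq(target->scsi_host->host_lock);

	/* ... tear down the connection, srp_reset_req() each queued
	 * command, then reconnect ... */
	return 0;
}

static int srp_reset_host(struct scsi_cmnd *scmnd)
{
	struct srp_target_port *target = host_to_target(scmnd->device->host);

	/* step 7): -EAGAIN turns into FAILED here, and the mid-layer
	 * then takes the device offline and frees its commands */
	return srp_reconnect_target(target) ? FAILED : SUCCESS;
}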
It seems that we don't really need to schedule srp_reconnect_work on
error, since the SCSI mid-layer's error handling will get us into
srp_reconnect_target later anyway (see the sketch below).
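The hookup is the host template we already register; roughly (again
simplified, not verbatim):

static struct scsi_host_template srp_template = {
	/* ... */
	.eh_abort_handler		= srp_abort,
	.eh_device_reset_handler	= srp_reset_device,
	/* host reset is where the reconnect happens for us */
	.eh_host_reset_handler		= srp_reset_host,
	/* ... */
};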
So it seems we can address these crashes and simplify srp
in the following way:
---
IB/srp: don't schedule reconnect from srp, scsi does it for us
If there is a problem with the connection, the SCSI mid-layer will
eventually call srp_reset_host, which calls srp_reconnect_target, so we
do not need to schedule srp_reconnect_work from srp_completion.
Removing this prevents srp_reset_host from failing when a reconnect
scheduled from srp_completion is already in progress, which in turn was
causing crashes as both the SCSI mid-layer and srp_reconnect_target
were cancelling commands.
Signed-off-by: Ishai Rabinovitz <ishai at mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Index: last_stable/drivers/infiniband/ulp/srp/ib_srp.c
===================================================================
--- last_stable.orig/drivers/infiniband/ulp/srp/ib_srp.c	2006-09-06 15:37:50.000000000 +0300
+++ last_stable/drivers/infiniband/ulp/srp/ib_srp.c	2006-09-07 11:16:28.000000000 +0300
@@ -799,13 +799,6 @@ static void srp_process_rsp(struct srp_t
 	spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
 }
 
-static void srp_reconnect_work(void *target_ptr)
-{
-	struct srp_target_port *target = target_ptr;
-
-	srp_reconnect_target(target);
-}
-
 static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc)
 {
 	struct srp_iu *iu;
@@ -858,7 +851,6 @@ static void srp_completion(struct ib_cq
 {
 	struct srp_target_port *target = target_ptr;
 	struct ib_wc wc;
-	unsigned long flags;
 
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	while (ib_poll_cq(cq, 1, &wc) > 0) {
@@ -866,10 +858,6 @@ static void srp_completion(struct ib_cq
 			printk(KERN_ERR PFX "failed %s status %d\n",
 			       wc.wr_id & SRP_OP_RECV ? "receive" : "send",
 			       wc.status);
-			spin_lock_irqsave(target->scsi_host->host_lock, flags);
-			if (target->state == SRP_TARGET_LIVE)
-				schedule_work(&target->work);
-			spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
 			break;
 		}
 
@@ -1724,8 +1712,6 @@ static ssize_t srp_create_target(struct
 	target->scsi_host = target_host;
 	target->srp_host = host;
 
-	INIT_WORK(&target->work, srp_reconnect_work, target);
-
 	INIT_LIST_HEAD(&target->free_reqs);
 	INIT_LIST_HEAD(&target->req_queue);
 	for (i = 0; i < SRP_SQ_SIZE; ++i) {
--
MST