[openib-general] [SRP] [RFC] Needed changes to support fail-over drivers

Ishai Rabinovitz ishai at mellanox.co.il
Mon Jul 24 09:56:02 PDT 2006


Hi,

The current SRP initiator code cannot work with several fail-over mechanisms. 

The current SRP driver behaves as follows when a target goes offline and then
comes back online:
1) The target is offline.
2) The initiator tries to reconnect and fails.
3) The initiator calls srp_remove_work, which removes the scsi_host.
4) The target is back online.
5) The user (or the ibsrpdm daemon) is expected to execute a new add_target.
6) This creates a new scsi_host for this target (with new device names and a
new index in the scsi_host directory in sysfs).

Fail-over drivers (e.g., MPP, which is used by Engenio, and XVM, which is used
by SGI) have problems with this behavior (item 3). They need the scsi_host to
keep existing and to return errors in the meantime, until the connection to the
target resumes.

In addition, removing and re-allocating the scsi_host is a "heavy" operation
compared with only disconnecting and reconnecting the connection.

In order to support these tools I propose the following changes that will allow
the user to move the srp initiator to a disconnected state (when the target
leaves the fabric) and reconnect it later (when the target returns to the
fabric).

Once these changes are in the ib_srp module, the ibsrpdm daemon will be able to
monitor the presence of targets in the fabric and use this interface when
targets leave or rejoin the fabric.

Here is a description of the new design (I have already implemented most of the
code):

1) Split the function srp_reconnect_target into two functions:
_srp_disconnect_target and _srp_reconnect_target 

2) Add two new states: SRP_TARGET_DISCONNECTED (the state after
_srp_disconnect_target has executed and before _srp_reconnect_target is
executed) and SRP_TARGET_DISCONNECTING (the state while in srp_remove_target).

3) Add new input files in sysfs:
/sys/class/scsi_host/host?/{disconnect_target,reconnect_target,erase_target}

4) Writing the string "remove" to /sys/class/scsi_host/host?/disconnect_target
calls srp_disconnect_target, which moves the corresponding target to the
SRP_TARGET_DISCONNECTED state (after closing the CM connection and resetting
all pending requests). From then on, when the SCSI mid-layer calls queuecommand
for this host, the result is DID_NO_CONNECT. This causes the SCSI mid-layer to
return an I/O error to the user without starting the SCSI error recovery chain.

5) Writing anything to /sys/class/scsi_host/host?/reconnect_target calls
_srp_reconnect_target, which moves the target back to the SRP_TARGET_LIVE
state.

6) Writing "erase" to /sys/class/scsi_host/host?/erase_target calls
srp_remove_work, which removes the scsi_host.

7) Add output files in sysfs that present the HCA and port that the initiator
used to connect to the target. Using these files and the target GUID, ibsrpdm
can determine on which scsi_host to perform the reconnect_target.
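
For illustration only, the state handling described in items 4 and 5 could be
modelled in user space roughly as follows. This is not the ib_srp code: every
sketch_* name is hypothetical, the two SKETCH_DID_* constants are local copies
of the SCSI host-byte values DID_OK and DID_NO_CONNECT, only the two states the
example needs are modelled, and the rule that a reconnect is accepted only from
the disconnected state is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the SCSI host-byte codes DID_OK and DID_NO_CONNECT. */
enum { SKETCH_DID_OK = 0x00, SKETCH_DID_NO_CONNECT = 0x01 };

/* Only the states this sketch needs; the real driver has more. */
enum sketch_state {
    SKETCH_TARGET_LIVE,
    SKETCH_TARGET_DISCONNECTED
};

struct sketch_target {
    enum sketch_state state;
    int pending;            /* requests issued and not yet completed */
};

/* Item 4: drop the connection and reset pending requests, keep the host. */
static void sketch_disconnect(struct sketch_target *t)
{
    assert(t != NULL);
    t->pending = 0;
    t->state = SKETCH_TARGET_DISCONNECTED;
}

/*
 * Item 5: move the target back to the live state.
 * Returns 0 on success, -1 if the target was not disconnected
 * (an assumption of this sketch, not taken from the proposal).
 */
static int sketch_reconnect(struct sketch_target *t)
{
    assert(t != NULL);
    if (t->state != SKETCH_TARGET_DISCONNECTED)
        return -1;
    t->state = SKETCH_TARGET_LIVE;
    return 0;
}

/*
 * Stand-in for queuecommand: the host byte occupies bits 16-23 of the
 * result, so a disconnected target fails fast with DID_NO_CONNECT << 16.
 */
static int sketch_queuecommand(struct sketch_target *t)
{
    assert(t != NULL);
    if (t->state != SKETCH_TARGET_LIVE)
        return SKETCH_DID_NO_CONNECT << 16;
    t->pending++;
    return SKETCH_DID_OK << 16;
}
```

The point of the model is that the target object (and hence the scsi_host)
survives the disconnect, so a fail-over driver keeps seeing the same host and
simply gets errors until the reconnect succeeds.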

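From user space, a daemon such as ibsrpdm might drive the proposed files with a
small helper like the one below. This is a sketch under assumptions: the
helper names write_attr and host_attr_path are hypothetical, the attribute
files exist only once the proposed changes are in ib_srp, and the host number
is whatever the caller has already identified.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Write `value` to the attribute file at `path`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_attr(const char *path, const char *value)
{
    FILE *f;
    int err = 0;

    assert(path != NULL && value != NULL);
    f = fopen(path, "w");
    if (!f)
        return -errno;
    if (fputs(value, f) == EOF)
        err = -EIO;
    /* Buffered data is flushed on close, so check that result as well. */
    if (fclose(f) == EOF && err == 0)
        err = -errno;
    return err;
}

/*
 * Build "/sys/class/scsi_host/host<N>/<attr>" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int host_attr_path(char *buf, size_t len, int host, const char *attr)
{
    int n = snprintf(buf, len, "/sys/class/scsi_host/host%d/%s", host, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With these, disconnecting host 3 would amount to building the path with
host_attr_path(buf, sizeof(buf), 3, "disconnect_target") and then calling
write_attr(buf, "remove"); erase_target would take "erase" in the same way.
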
Please comment.

-- 
Ishai Rabinovitz



