[ewg][PATCH][0/2] SRP multipath failover within 60 seconds
Vu Pham
vuhuong at mellanox.com
Wed Feb 6 01:01:53 PST 2008
The following patches help SRP/dm-multipath fail over within 60
seconds (bugzilla #577) without data corruption or read/write errors:
1. srp_disconnect_without_wait.patch - SRP sends the disconnect request
without waiting for the CM timewait exit event, since SRP currently does
not re-use the cm_id and qp/cq of a connection (the patch
srp_1_recreate_at_reconnect.patch, already in kernel_patches/fixes,
recreates the cm_id and qp/cq for a connection at reconnect).
2. srp_qp_in_err_timer_reconnect_target.patch - on detecting a
post_send/post_receive error, SRP sets qp_in_error, sets a timer to
reconnect to the target, and returns SCSI_MLQUEUE_HOST_BUSY to block the
queue; it returns DID_NO_CONNECT when the target state is DEAD or REMOVED.
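The decision described in patch 2 can be modelled roughly as below. This is only a simplified Python sketch for illustration, not the patch code itself: the names Target, Verdict, on_post_error and queue_verdict are hypothetical, and checking DEAD/REMOVED before qp_in_error is an assumption about ordering that the description above does not state.

```python
from enum import Enum, auto


class TargetState(Enum):
    LIVE = auto()
    CONNECTING = auto()
    DEAD = auto()
    REMOVED = auto()


class Verdict(Enum):
    SEND = auto()        # hand the command to the HCA as usual
    HOST_BUSY = auto()   # stands in for SCSI_MLQUEUE_HOST_BUSY
    NO_CONNECT = auto()  # stands in for completing with DID_NO_CONNECT


class Target:
    """Hypothetical stand-in for the per-target state SRP keeps."""

    def __init__(self):
        self.state = TargetState.LIVE
        self.qp_in_error = False
        self.reconnect_timer_armed = False


def on_post_error(target):
    # A post_send/post_receive failure marks the QP as broken and
    # schedules a reconnect attempt.
    target.qp_in_error = True
    target.reconnect_timer_armed = True


def queue_verdict(target):
    # Assumed ordering: a dead or removed target fails the command at once,
    # so dm-multipath can switch paths.
    if target.state in (TargetState.DEAD, TargetState.REMOVED):
        return Verdict.NO_CONNECT
    # While the QP is in error, ask the SCSI midlayer to retry later.
    if target.qp_in_error:
        return Verdict.HOST_BUSY
    return Verdict.SEND
```

The point of the split is that HOST_BUSY keeps commands queued while a reconnect is still possible, whereas NO_CONNECT lets the multipath layer fail the path once the target is given up.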
Here is my multipath.conf:
defaults {
        udev_dir                /dev
        polling_interval        5
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           5
        user_friendly_names     no
}
I also set srp_daemon.sh to rescan the fabric every 60 seconds (instead
of the default 300 seconds).
I ran a data integrity test against /dev/mapper/<devices> while running
{disable path 1, sleep 90, enable path 1, sleep 60, disable path 2,
sleep 90, enable path 2, sleep 60} in a loop.
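One pass of that disable/enable schedule could be driven by something like the following Python sketch. How a path is actually disabled or enabled is not stated here, so disable and enable are hypothetical callbacks supplied by the caller; only the ordering and the 90/60 second waits come from the description above.

```python
import time


def failover_cycle(disable, enable, sleep=time.sleep,
                   paths=(1, 2), down_secs=90, up_secs=60):
    """Run one pass: take each path down, wait, bring it back, wait."""
    for path in paths:
        disable(path)     # hypothetical hook: take the path offline
        sleep(down_secs)  # leave time for failover to the other path
        enable(path)      # hypothetical hook: bring the path back
        sleep(up_secs)    # leave time for the path to be reinstated


if __name__ == "__main__":
    # Dry run that only prints the schedule instead of touching hardware.
    failover_cycle(
        disable=lambda p: print("disable path", p),
        enable=lambda p: print("enable path", p),
        sleep=lambda s: print("sleep", s),
    )
```

Calling failover_cycle repeatedly while the integrity test runs reproduces the loop; one pass takes 300 seconds of sleep in total.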
RHEL 5 and 5.1 work very well (no data corruption and no read/write
failures reported).
On SLES 10 SP1, it works well as long as I run *multipath* every 60
seconds. I think I misconfigured multipathd somehow. (Here is how I set
it up: using the same multipath.conf as above, chkconfig boot.multipath
on and chkconfig multipathd on.)
-vu