[ewg][PATCH][0/2] SRP multipath failover within 60 seconds
Vu Pham
vuhuong at mellanox.com
Wed Feb 6 01:25:46 PST 2008
EXTRA NOTES:
1. Pull the cable and plug it back in (or use ibportstate disable/enable):
a. Within 30 seconds, I/Os resume on the same path (with the same
cm_id, qp and cq)
b. Within 30-45 seconds, I/Os resume on the same path (with a
new cm_id, qp and cq)
c. After more than 45 seconds, I/Os fail over to the next path
2. After running the test for a while, I stopped it, ran
*multipath -F* and unloaded the ib_srp module. With RHEL 5 and 5.1,
I can unload ib_srp cleanly; however, I got an *srp is in use*
error on SLES 10 SP1.
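
The three recovery windows in note 1 can be written down as a small lookup. This is only a sketch in Python of the behavior observed in the test, not of anything in ib_srp; the function name is made up for illustration, and treating exactly 30 and 45 seconds as belonging to the shorter window is an assumption, since the notes do not say which side the boundaries fall on.

```python
def expected_recovery(outage_seconds: float) -> str:
    """Map a link outage duration to the recovery behavior seen in the test.

    Boundary handling (30 and 45 count as the shorter window) is assumed.
    """
    if outage_seconds < 0:
        raise ValueError("outage duration cannot be negative")
    if outage_seconds <= 30:
        return "resume on same path, same cm_id/qp/cq"
    if outage_seconds <= 45:
        return "resume on same path, new cm_id/qp/cq"
    return "fail over to next path"


if __name__ == "__main__":
    for secs in (10, 40, 90):
        print(secs, "->", expected_recovery(secs))
```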
-vu
> The following patches help SRP/dm-multipath fail over within 60
> seconds (bugzilla #577) without data corruption or read/write errors:
>
> 1. srp_disconnect_without_wait.patch - srp sends the disconnect request
> without waiting for the CM timewait exit event, since srp currently does
> not re-use the cm_id and qp/cq of a connection (the patch
> srp_1_recreate_at_reconnect.patch, already in kernel_patches/fixes,
> recreates the cm_id and qp/cq for a connection at reconnect)
> 2. srp_qp_in_err_timer_reconnect_target.patch - when it detects a
> post_send/post_receive error, srp sets qp_in_error, sets a timer to
> reconnect to the target, and returns SCSI_MLQUEUE_HOST_BUSY to lock the
> queue; it returns DID_NO_CONNECT when the target state is DEAD or REMOVED
>
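
The decision described for patch 2 can be modeled roughly as follows. This is a Python sketch of the logic as stated in the patch description, not the kernel C code from the patch; the class and function names, the LIVE state, and the "QUEUED" result are illustrative assumptions, and the return values are plain strings standing in for the kernel constants.

```python
from dataclasses import dataclass
from enum import Enum


class TargetState(Enum):
    LIVE = "live"        # assumed name for a healthy target
    DEAD = "dead"
    REMOVED = "removed"


@dataclass
class Target:
    state: TargetState = TargetState.LIVE
    qp_in_error: bool = False
    reconnect_timer_armed: bool = False


def on_post_error(target: Target) -> None:
    """A post_send/post_receive error marks the QP and arms the reconnect timer."""
    target.qp_in_error = True
    target.reconnect_timer_armed = True


def queuecommand(target: Target) -> str:
    """Return the label of the result a new SCSI command would get."""
    if target.state in (TargetState.DEAD, TargetState.REMOVED):
        return "DID_NO_CONNECT"          # fail the command so multipath can switch paths
    if target.qp_in_error:
        return "SCSI_MLQUEUE_HOST_BUSY"  # hold the command until reconnect finishes
    return "QUEUED"                      # normal path (illustrative label)


if __name__ == "__main__":
    t = Target()
    print(queuecommand(t))
    on_post_error(t)
    print(queuecommand(t))
    t.state = TargetState.DEAD
    print(queuecommand(t))
```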
> Here is my multipath.conf:
> defaults {
>     udev_dir /dev
>     polling_interval 5
>     selector "round-robin 0"
>     path_grouping_policy multibus
>     getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
>     prio_callout /bin/true
>     path_checker readsector0
>     rr_min_io 100
>     rr_weight priorities
>     failback immediate
>     no_path_retry 5
>     user_friendly_names no
> }
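
As I understand the multipath.conf settings above, polling_interval and no_path_retry together bound how long multipathd keeps queueing I/O once every path is down: roughly one retry per polling interval. The arithmetic below is a sketch under that assumption; the helper name is made up, and real timing also depends on when the path checker notices the failure.

```python
def queue_window_seconds(polling_interval: int, no_path_retry: int) -> int:
    """Approximate time I/O stays queued after all paths fail.

    Assumes one retry per polling interval, which is an approximation.
    """
    if polling_interval <= 0 or no_path_retry < 0:
        raise ValueError("polling_interval must be > 0 and no_path_retry >= 0")
    return polling_interval * no_path_retry


if __name__ == "__main__":
    # Values from the defaults section quoted above.
    print(queue_window_seconds(polling_interval=5, no_path_retry=5))
```

With the values from this configuration the product is 25 seconds, which fits inside the 60 second failover target.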
> I also set srp_daemon.sh to rescan the fabric every 60 seconds (instead
> of the default 300 seconds)
>
> I ran a data integrity test against /dev/mapper/<devices> while
> repeating {disable path 1, sleep 90, enable path 1, sleep 60, disable
> path 2, sleep 90, enable path 2, sleep 60} in a loop
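
The disable/enable loop can be sketched as a small driver. This is a Python outline of the schedule only; how a path is actually disabled or enabled (for example with ibportstate) is left to a caller-supplied function, and all names here are hypothetical.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, int, int]  # (action, path number, seconds to sleep afterwards)


def failover_cycle(paths=(1, 2), down_secs: int = 90, up_secs: int = 60) -> List[Step]:
    """Build one pass of the schedule: disable, wait, enable, wait, per path."""
    steps: List[Step] = []
    for path in paths:
        steps.append(("disable", path, down_secs))
        steps.append(("enable", path, up_secs))
    return steps


def run_cycles(set_path_state: Callable[[int, bool], None],
               cycles: int = 1,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the schedule; set_path_state(path, enabled) is supplied by the caller."""
    for _ in range(cycles):
        for action, path, pause in failover_cycle():
            set_path_state(path, action == "enable")
            sleep(pause)


if __name__ == "__main__":
    # Dry run: print the actions instead of touching any port, and skip the sleeps.
    run_cycles(lambda p, on: print("path", p, "enabled" if on else "disabled"),
               sleep=lambda s: None)
```

One pass takes 300 seconds of sleeping, so each path spends 90 seconds down per pass, longer than the 45 second threshold after which I/O moves to the other path.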
>
> RHEL 5 and 5.1 work very well (no data corruption or read/write
> failures reported). SLES 10 SP1 works well as long as I run *multipath*
> every 60 seconds. I think I misconfigured multipathd somehow (here is
> how I set it up: using the same multipath.conf above, chkconfig
> boot.multipath on and chkconfig multipathd on)
>
> -vu