[ewg] Re: SRP HA dm_multipath testing and questions
Ishai Rabinovitz
ishai at dev.mellanox.co.il
Tue Apr 10 07:21:07 PDT 2007
Scott Weitzenkamp (sweitzen) wrote:
> I've been testing SRP HA and dm_multipath with:
> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID
> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID
> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs
>
> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on",
> then rebooted. On SLES 10, I ran "chkconfig boot.multipath on" and
> "chkconfig multipathd on", then rebooted. Ishai, I don't seem to need
> 91-srp.rules, are you using the boot.multipath and multipathd scripts?
On RHEL4 you really do not need 91-srp.rules, and it is not used (see /etc/init.d/openibd).
On SLES10 I was sure that you needed it, but I checked and you are correct. I do not yet see how it works, but it seems that when boot.multipath is used there is no need for 91-srp.rules. I will look into this more deeply and update the documentation and the openibd script accordingly.
>
> On both RHEL4 networks, I get IB port load balancing and failover, on
> SLES10 I only see failover. I'm not sure if this is a function of
> RHEL4-vs-SLES10, or RAID vs JBOD.
>
Maybe this is because you removed 91-srp.rules (did you remove it?).
How did you test the failover and failback?
> Traffic failover is very slow (a few minutes), what do others see?
>
What do you mean by slow? When do you start counting?
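To make the numbers comparable, here is one possible way to pin down the counting. This is only a sketch under assumptions: it presumes you log a timestamp (in seconds, on a single clock) for each completed I/O on the multipath device and note the time at which the fault was injected (for example, the cable pull). The function name `failover_times` and its arguments are hypothetical and not part of any OFED or multipath tool. It reports the stall both from the fault and from the last I/O that completed before the stall.

```python
from typing import Optional, Sequence, Tuple


def failover_times(
    fault_time: float, completions: Sequence[float]
) -> Optional[Tuple[float, float]]:
    """Return (seconds_from_fault, seconds_from_last_io) for the longest
    I/O stall that ends after the fault, or None if no such stall exists.

    completions: timestamps of completed I/Os (hypothetical log format).
    """
    ordered = sorted(completions)
    best = None  # (stall_start, stall_end) of the longest gap seen so far
    for prev, cur in zip(ordered, ordered[1:]):
        if cur <= fault_time:
            continue  # this gap ended before the fault was injected
        if best is None or (cur - prev) > (best[1] - best[0]):
            best = (prev, cur)
    if best is None:
        return None  # I/O never resumed, or too few samples
    stall_start, stall_end = best
    return (stall_end - fault_time, stall_end - stall_start)


if __name__ == "__main__":
    # I/O completes once per second, fault at t=2.5, I/O resumes at t=95.
    print(failover_times(2.5, [0, 1, 2, 3, 95, 96]))
```

The two values differ by however long I/O kept completing after the fault, which is why it matters to say which one is being quoted.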
> I will be testing DDN IB storage, EMC DMX, and RHEL5 soon.
>
> I'm getting an Oops on RHEL4 U3 x86_64 on both test networks:
>
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:<4>NMI Watchdog detected LOCKUP, CPU=1, registers:
> CPU 1
> Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs
> lockd nfs_acl sunrpc rdma_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) iw_cm(U)
> ib_addr(U) ib_local_sa(U) ds yenta_socket pcmcia_core dm_mirror
> dm_round_robin dm_multipath dm_mod button battery ac ohci_hcd hw_random
> shpchp ib_mthca(U) ib_ipoib(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U)
> ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 tg3 floppy sg ext3 jbd mptscsih
> mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
> Pid: 3990, comm: scsi_eh_3 Not tainted 2.6.9-34.ELsmp
> RIP: 0010:[<ffffffff802409bf>] <ffffffff802409bf>{serial_in+83}
> RSP: 0018:000001007f203c10 EFLAGS: 00000002
> RAX: 00000000ffffff00 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804b59a0
> RBP: ffffffff804b59a0 R08: 000000000000003a R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002706
> R13: ffffffff8045afc5 R14: 0000000000000009 R15: 000000000000002d
> FS: 0000002a958a07a0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000036ce02e728 CR3: 00000000cff00000 CR4: 00000000000006e0
> Process scsi_eh_3 (pid: 3990, threadinfo 000001007f202000, task 000001007f1957f0)
> Stack: ffffffff80242ab2 0000000d000402dc ffffffff803f88e0 00000000000402dc
> 0000000000040309 0000000000000030 000001017bf79830 000000000000c000
> ffffffff8013764c 0000000000040309
> Call Trace:<ffffffff80242ab2>{serial8250_console_write+113}
> <ffffffff8013764c>{__call_console_drivers+68}
> <ffffffff801378b9>{release_console_sem+276}
> <ffffffff80137b44>{vprintk+498}
> <ffffffff80137bee>{printk+141}
> <ffffffff8013346f>{__wake_up+54}
> <ffffffff802498bc>{freed_request+105}
> <ffffffffa01e24e4>{:dm_multipath:multipath_end_io+0}
> <ffffffffa0007350>{:scsi_mod:scsi_prep_fn+120}
> <ffffffff80247f53>{elv_next_request+68}
> <ffffffffa00076c6>{:scsi_mod:scsi_request_fn+66}
> <ffffffff8024a107>{blk_insert_request+160}
> <ffffffffa0006d15>{:scsi_mod:scsi_requeue_command+48}
> <ffffffffa000720f>{:scsi_mod:scsi_io_completion+866}
> <ffffffffa00064c7>{:scsi_mod:scsi_error_handler+2809}
> <ffffffff80110e17>{child_rip+8}
> <ffffffffa00059ce>{:scsi_mod:scsi_error_handler+0}
> <ffffffff80110e0f>{child_rip+0}
>
> Code: 0f b6 c0 c3 0f b6 4f 22 0f b6 47 23 41 89 d0 d3 e6 83 f8 02
> Kernel panic - not syncing: nmi watchdog
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>
Please open a bugzilla about this deadlock.
Can you reproduce it?