[ewg] Re: SRP HA dm_multipath testing and questions

Chieng Etta etta at systemfabricworks.com
Tue Apr 10 09:58:11 PDT 2007


Please see below.

Thanks,
Etta

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org
[mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ishai Rabinovitz
Sent: Tuesday, April 10, 2007 9:21 AM
To: Scott Weitzenkamp (sweitzen)
Cc: Roland Dreier (rdreier); ewg at lists.openfabrics.org; openib
Subject: [ewg] Re: SRP HA dm_multipath testing and questions


Scott Weitzenkamp (sweitzen) wrote:
> I've been testing SRP HA and dm_multipath with:
> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun T4 RAID
> - RHEL4 U3 x86_64, Cisco FC Gateway, and Sun 3510 RAID
> - SLES10 x86_64, Cisco FC Gateway, and 3 JBODs
>  
> On RHEL4, I edited /etc/multipath.conf, ran "chkconfig multipathd on", 
> then rebooted.  On SLES 10, I ran "chkconfig boot.multipath on" and 
> "chkconfig multipathd on", then rebooted.  Ishai, I don't seem to need 
> 91-srp.rules, are you using the boot.multipath and multipathd scripts?
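[For convenience, the enablement steps quoted above, collected into one sequence (service names exactly as given in the message; run as root, after editing /etc/multipath.conf for the storage in use):]

```shell
# RHEL4: enable the multipath daemon at boot, then reboot
chkconfig multipathd on

# SLES10: also enable the early-boot multipath setup script
chkconfig boot.multipath on
chkconfig multipathd on
```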

On RHEL4 you really do not need 91-srp.rules; it is not used (see
/etc/init.d/openibd).
On SLES10 I was sure you needed it, but I checked and you are correct. I
don't see how it works, but when boot.multipath is used there is apparently
no need for 91-srp.rules. I will look into this more deeply and update the
documentation and the openibd script accordingly.

[EC] I just verified this on SLES10 x86_64.  Multipath worked fine using
boot.multipath without 91-srp.rules.
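[For context, 91-srp.rules is a small udev rules file that OFED installs so
that multipath picks up a new SRP SCSI disk as soon as it appears. Its exact
contents vary by OFED version; the rule below is an illustrative sketch, not
the literal shipped file:]

```
# 91-srp.rules (sketch): when a whole-disk SCSI device is added,
# hand it to multipath so the new path is claimed immediately.
ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath -v0 /dev/%k"
```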

Ishai, in the SRP release notes (section 6, srp_daemon, item a), the first
line should be changed to '"srp_daemon -a -o" is equivalent to "ibsrpdm"'.
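[To illustrate the equivalence stated above: both commands query the fabric
for SRP targets and print their connection info, then exit. Running them
requires an active IB port, so this is a usage sketch rather than a
reproducible test:]

```shell
# Print SRP target info once and exit (-a: all info, -o: once):
srp_daemon -a -o

# The traditional discovery tool; the output should match:
ibsrpdm
```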


>  
> On both RHEL4 networks, I get IB port load balancing and failover, on 
> SLES10 I only see failover. I'm not sure if this is a function of 
> RHEL4-vs-SLES10, or RAID vs JBOD.
>  

Maybe this is because you removed 91-srp.rules (did you remove it?).
How did you test failover and failback?
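[One way to exercise failover and failback by hand, assuming infiniband-diags
is installed; the LID and port number below are placeholders, to be replaced
with real values found via ibswitches/iblinkinfo:]

```shell
# Show current path states for all multipath maps:
multipath -ll

# With I/O running, disable one fabric port
# (LID 5, port 1 are example values):
ibportstate 5 1 disable

# I/O should continue on the surviving path; watch it move:
multipath -ll

# Re-enable the port and check for failback:
ibportstate 5 1 enable
multipath -ll
```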

> Traffic failover is very slow (a few minutes), what do others see?
>  

What do you mean by slow? When do you start counting?

> I will be testing DDN IB storage, EMC DMX, and RHEL5 soon.
>  
> I'm getting an Oops on RHEL4 U3 x86_64 on both test networks:
>  
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:0): rejecting I/O to offline device
> scsi3 (0:<4>NMI Watchdog detected LOCKUP, CPU=1, registers:
> CPU 1
> Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc rdma_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_local_sa(U) ds yenta_socket pcmcia_core dm_mirror dm_round_robin dm_multipath dm_mod button battery ac ohci_hcd hw_random shpchp ib_mthca(U) ib_ipoib(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 tg3 floppy sg ext3 jbd mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
> Pid: 3990, comm: scsi_eh_3 Not tainted 2.6.9-34.ELsmp
> RIP: 0010:[<ffffffff802409bf>] <ffffffff802409bf>{serial_in+83}
> RSP: 0018:000001007f203c10  EFLAGS: 00000002
> RAX: 00000000ffffff00 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804b59a0
> RBP: ffffffff804b59a0 R08: 000000000000003a R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000002706
> R13: ffffffff8045afc5 R14: 0000000000000009 R15: 000000000000002d
> FS:  0000002a958a07a0(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000036ce02e728 CR3: 00000000cff00000 CR4: 00000000000006e0
> Process scsi_eh_3 (pid: 3990, threadinfo 000001007f202000, task 000001007f1957f0)
> Stack: ffffffff80242ab2 0000000d000402dc ffffffff803f88e0 00000000000402dc
>        0000000000040309 0000000000000030 000001017bf79830 000000000000c000
>        ffffffff8013764c 0000000000040309
> Call Trace:<ffffffff80242ab2>{serial8250_console_write+113} <ffffffff8013764c>{__call_console_drivers+68}
>        <ffffffff801378b9>{release_console_sem+276} <ffffffff80137b44>{vprintk+498}
>        <ffffffff80137bee>{printk+141} <ffffffff8013346f>{__wake_up+54}
>        <ffffffff802498bc>{freed_request+105} <ffffffffa01e24e4>{:dm_multipath:multipath_end_io+0}
>        <ffffffffa0007350>{:scsi_mod:scsi_prep_fn+120} <ffffffff80247f53>{elv_next_request+68}
>        <ffffffffa00076c6>{:scsi_mod:scsi_request_fn+66} <ffffffff8024a107>{blk_insert_request+160}
>        <ffffffffa0006d15>{:scsi_mod:scsi_requeue_command+48}
>        <ffffffffa000720f>{:scsi_mod:scsi_io_completion+866}
>        <ffffffffa00064c7>{:scsi_mod:scsi_error_handler+2809}
>        <ffffffff80110e17>{child_rip+8} <ffffffffa00059ce>{:scsi_mod:scsi_error_handler+0}
>        <ffffffff80110e0f>{child_rip+0}
>  
> Code: 0f b6 c0 c3 0f b6 4f 22 0f b6 47 23 41 89 d0 d3 e6 83 f8 02
> Kernel panic - not syncing: nmi watchdog
>  
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  

Please open a bugzilla about this deadlock.
Can you reproduce it?
_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg



