[openib-general][patch review] srp: fmr implementation,

Vu Pham vuhuong at mellanox.com
Tue Apr 11 17:26:28 PDT 2006


Hi Roland,

Sorry for taking so long to respond. Thanks for all the enhancements.
I have cc'ed some Engenio engineers who can help get the latest FW to you.

>
> This mostly works for me, but I still see one weird problem.  If I
> make an FMR to cover IO of size more than 58 * 4096 bytes, the IO
> never completes.  The SCSI midlayer times it out and aborts it, and
> the target responds to the task management command.  I'm having a hard
> time imagining that this is an SRP initiator or even low-level HCA
> driver bug -- it seems more likely to be a target bug (I am using an
> Engenio target to test, and I may have down-rev firmware).
>   
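
For reference, the 58 * 4096 threshold quoted above is 237568 bytes
(232 KiB), i.e. 58 pages of 4 KiB when the buffer is page aligned. A
minimal sketch of the page count for an arbitrary buffer follows;
pages_spanned() is a hypothetical helper written for illustration, not
code from the patch, and it assumes a power-of-two page size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the SRP patch): number of pages touched by
 * a buffer of `len` bytes starting at byte address `addr`.
 * Assumes page_size is a power of two.
 */
static size_t pages_spanned(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t first, last;

    if (len == 0)
        return 0;

    /* Round the first and last byte addresses down to their page start. */
    first = addr & ~(page_size - 1);
    last = (addr + len - 1) & ~(page_size - 1);

    return (size_t)((last - first) / page_size + 1);
}
```

An aligned IO of exactly 58 * 4096 bytes spans 58 pages; one more byte, or
an unaligned start address, pushes the mapping past that count.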

If you have Santricity, you can check the current controller firmware 
version and update it to the latest.


> I would be very happy to hear test reports with other targets,
>   

Here is my status of testing this patch.
On an x86-64 system, I got a data corruption problem reported after ~4 hrs
of running Engenio's Smash test tool against Engenio storage.
On an ia64 system, I got multiple async event 3 (IB_EVENT_QP_ACCESS_ERR)
and event 1 (IB_EVENT_QP_FATAL); finally the error handling path kicked in
and the system panicked. Please see the log below (I tested with Mellanox's
srp target reference implementation; I don't see this error without the
patch).
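
To make the numbers in the log easier to read, here is a small sketch
that maps them to names. The tables are local copies of the first few
values of enum ib_event_type and enum ib_wc_status from the kernel's
ib_verbs.h; the function names are made up for this example. It assumes
that the number after "failed receive status" is the work completion
status.

```c
#include <stddef.h>

/* Local copy of the first values of enum ib_event_type (illustration only). */
static const char *srp_log_event_name(int event)
{
    switch (event) {
    case 0: return "IB_EVENT_CQ_ERR";
    case 1: return "IB_EVENT_QP_FATAL";
    case 2: return "IB_EVENT_QP_REQ_ERR";
    case 3: return "IB_EVENT_QP_ACCESS_ERR";
    default: return "unknown";
    }
}

/* Local copy of the first values of enum ib_wc_status (illustration only). */
static const char *srp_log_wc_status_name(int status)
{
    switch (status) {
    case 0: return "IB_WC_SUCCESS";
    case 1: return "IB_WC_LOC_LEN_ERR";
    case 2: return "IB_WC_LOC_QP_OP_ERR";
    case 3: return "IB_WC_LOC_EEC_OP_ERR";
    case 4: return "IB_WC_LOC_PROT_ERR";
    case 5: return "IB_WC_WR_FLUSH_ERR";
    default: return "unknown";
    }
}
```

Under that assumption, "failed receive status 5" is a flushed receive,
which is what one would expect once the QP has gone into the error state.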

Apr  7 18:15:10 lab105 kernel: ib_srp: QP event 3
Apr  7 18:15:10 lab105 kernel: ib_srp: failed receive status 5
Apr  7 18:15:13 lab105 kernel: ib_srp: connection closed
Apr  7 18:15:43 lab105 kernel: SRP abort called
Apr  7 18:15:43 lab105 kernel: Abort for req_index 0
Apr  7 18:15:43 lab105 kernel: SRP abort called
Apr  7 18:15:43 lab105 kernel: Abort for req_index 1
Apr  7 18:15:43 lab105 kernel: SRP abort called
Apr  7 18:15:43 lab105 kernel: Abort for req_index 2
Apr  7 18:15:43 lab105 kernel: SRP reset_device called
Apr  7 18:15:43 lab105 kernel: Abort for req_index 1
Apr  7 18:15:43 lab105 kernel: SRP reset_device called
Apr  7 18:15:43 lab105 kernel: Abort for req_index 2
Apr  7 18:15:48 lab105 kernel: SRP reset_device called
Apr  7 18:15:48 lab105 kernel: Abort for req_index 0
Apr  7 18:15:48 lab105 kernel: ib_srp: failed receive status 5
Apr  7 18:15:50 lab105 kernel: ib_srp: connection closed
Apr  7 18:15:53 lab105 kernel: ib_srp: SRP reset_host called
Apr  7 18:15:55 lab105 kernel: ib_srp: connection closed
Apr  7 18:16:05 lab105 kernel: ib_mthca 0000:05:00.0: CQ overrun on CQN 
000082
Apr  7 18:16:05 lab105 kernel: ib_srp: QP event 1
Apr  7 18:16:05 lab105 last message repeated 3 times
Apr  7 18:16:15 lab105 kernel: SRP abort called
Apr  7 18:16:15 lab105 kernel: Abort for req_index 0
Apr  7 18:16:20 lab105 kernel: ib_srp: QP event 1
Apr  7 18:16:20 lab105 kernel: ib_srp: QP event 1
Apr  7 18:16:30 lab105 kernel: SRP abort called
Apr  7 18:16:30 lab105 kernel: Abort for req_index 1
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:7: scsi: Device offlined - not 
ready after error recovery
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:6: scsi: Device offlined - not 
ready after error recovery
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical 
block 0
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical 
block 1
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical 
block 0
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical 
block 1
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical 
block 0
Apr  7 18:16:35 lab105 kernel: ib_srp: QP event 1
Apr  7 18:16:35 lab105 kernel: ib_srp: QP event 1
Apr  7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr  7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical 
block 0
Apr  7 18:17:05 lab105 kernel: SRP abort called
Apr  7 18:17:05 lab105 kernel: Abort for req_index 2
Apr  7 18:17:10 lab105 kernel: SRP reset_device called
Apr  7 18:17:10 lab105 kernel: Abort for req_index 2
Apr  7 18:17:15 lab105 kernel: ib_srp: SRP reset_host called
Apr  7 18:17:17 lab105 kernel: ib_srp: connection closed
Apr  7 18:17:17 lab105 kernel: Unable to handle kernel paging request at 
virtual address 6b6b6b6b6b6b6b6b
Apr  7 18:17:17 lab105 kernel: scsi_eh_2[14050]: Oops 11012296146944 [1]
Apr  7 18:17:17 lab105 kernel: Modules linked in: ib_srp ib_cm ib_sa 
ib_umad evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp 
parport thermal processor fan button ipv6 binfmt_misc ib_mthca ib_mad 
ib_core usbhid ehci_hcd uhci_hcd usbcore i2c_i801 i2c_core e1000 
nls_iso8859_1 nls_cp437 dm_mod reiserfs mptspi mptscsih mptbase sd_mod 
scsi_mod
Apr  7 18:17:17 lab105 kernel:
Apr  7 18:17:17 lab105 kernel: Pid: 14050, CPU 0, comm:            scsi_eh_2
Apr  7 18:17:17 lab105 kernel: psr : 0000121008026018 ifs : 
800000000000050d ip  : [<a0000002022e5571>]    Not tainted
Apr  7 18:17:17 lab105 kernel: ip is at srp_reconnect_target+0x2b1/0x5c0 
[ib_srp]
Apr  7 18:17:17 lab105 kernel: unat: 0000000000000000 pfs : 
000000000000050d rsc : 0000000000000003
Apr  7 18:17:17 lab105 kernel: rnat: 0000000000000000 bsps: 
0000000000000000 pr  : 0000000000009941
Apr  7 18:17:17 lab105 kernel: ldrs: 0000000000000000 ccv : 
0000000000000000 fpsr: 0009804c8a70433f
Apr  7 18:17:17 lab105 kernel: csd : 0000000000000000 ssd : 0000000000000000
Apr  7 18:17:17 lab105 kernel: b0  : a0000002022e54e0 b6  : 
a000000100003320 b7  : a0000002020a36a0
Apr  7 18:17:17 lab105 kernel: f6  : 1003e6b6b6b6b6b6b6b6b f7  : 
0ffe7e694bf1a00000000
Apr  7 18:17:17 lab105 kernel: f8  : 1003e0000000000002418 f9  : 
1003e0000000000000021
Apr  7 18:17:17 lab105 kernel: f10 : 1000483fffffff96976e2 f11 : 
1003e0000000000000021
Apr  7 18:17:17 lab105 kernel: r1  : a0000002022e8278 r2  : 
e0000001c6e75a18 r3  : e0000001d0a64a10
Apr  7 18:17:17 lab105 kernel: r8  : e0000001c6e75a68 r9  : 
e0000001c6e758b8 r10 : a0000001008e9f00
Apr  7 18:17:17 lab105 kernel: r11 : 0000000000000001 r12 : 
e0000001c7f5fd30 r13 : e0000001c7f58000
Apr  7 18:17:17 lab105 kernel: r14 : a0000001008e9f08 r15 : 
e0000001c7f58000 r16 : 0000000000000000
Apr  7 18:17:17 lab105 kernel: r17 : 0000000000000000 r18 : 
e0000001c7f58d84 r19 : a0000001008e9f10
Apr  7 18:17:17 lab105 kernel: r20 : ffffffffffffffff r21 : 
0000000000000008 r22 : e000000004790000
Apr  7 18:17:17 lab105 kernel: r23 : e0000001e05f7cd0 r24 : 
0000000000000080 r25 : e00000000479001f
Apr  7 18:17:17 lab105 kernel: r26 : a0000002020a36a0 r27 : 
e0000001efcca1e0 r28 : e0000001efcca000
Apr  7 18:17:17 lab105 kernel: r29 : e0000001e05f7c30 r30 : 
e0000001d0a64a88 r31 : e0000001d0a649f0
Apr  7 18:17:17 lab105 kernel:
Apr  7 18:17:17 lab105 kernel: Call Trace:
Apr  7 18:17:17 lab105 kernel:  [<a0000001000136a0>] show_stack+0x80/0xa0
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5f8b0 bsp=e0000001c7f59110
Apr  7 18:17:17 lab105 kernel:  [<a000000100013f00>] show_regs+0x840/0x880
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fa80 bsp=e0000001c7f590b0
Apr  7 18:17:17 lab105 kernel:  [<a000000100036fd0>] die+0x1b0/0x240
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fa90 bsp=e0000001c7f59068
Apr  7 18:17:17 lab105 kernel:  [<a00000010005a770>] 
ia64_do_page_fault+0x970/0xae0
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fab0 bsp=e0000001c7f59000
Apr  7 18:17:17 lab105 kernel:  [<a00000010000be60>] 
ia64_leave_kernel+0x0/0x280
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fb60 bsp=e0000001c7f59000
Apr  7 18:17:17 lab105 kernel:  [<a0000002022e5570>] 
srp_reconnect_target+0x2b0/0x5c0 [ib_srp]
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fd30 bsp=e0000001c7f58f90
Apr  7 18:17:17 lab105 kernel:  [<a0000002022e58e0>] 
srp_reset_host+0x60/0xa0 [ib_srp]
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fdf0 bsp=e0000001c7f58f68
Apr  7 18:17:17 lab105 kernel:  [<a000000201b280d0>] 
scsi_try_host_reset+0xd0/0x240 [scsi_mod]
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fdf0 bsp=e0000001c7f58f38
Apr  7 18:17:17 lab105 kernel:  [<a000000201b2a8a0>] 
scsi_error_handler+0x1880/0x22c0 [scsi_mod]
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fdf0 bsp=e0000001c7f58e50
Apr  7 18:17:17 lab105 kernel:  [<a0000001000cc540>] kthread+0x220/0x280
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fe10 bsp=e0000001c7f58e10
Apr  7 18:17:17 lab105 kernel:  [<a000000100011a60>] 
kernel_thread_helper+0xe0/0x100
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fe30 bsp=e0000001c7f58de0
Apr  7 18:17:17 lab105 kernel:  [<a000000100009120>] 
start_kernel_thread+0x20/0x40
Apr  7 18:17:17 lab105 kernel:                                 
sp=e0000001c7f5fe30 bsp=e0000001c7f5
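
A note on the faulting address: 6b6b6b6b6b6b6b6b is the byte 0x6b
repeated, which is the pattern the slab allocator writes into freed
objects when slab poisoning is enabled. Assuming this kernel was built
with slab debugging, the oops may mean that srp_reconnect_target followed
a pointer read from memory that had already been freed. The check below
is a sketch of that pattern test; looks_like_free_poison() is a
hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Fill byte used for freed slab objects when slab poisoning is enabled. */
#define POISON_FREE_BYTE 0x6bU

/*
 * Hypothetical helper: returns 1 if every byte of a 64-bit address equals
 * the free-poison byte, 0 otherwise.
 */
static int looks_like_free_poison(uint64_t addr)
{
    int i;

    for (i = 0; i < 8; i++) {
        if (((addr >> (8 * i)) & 0xffU) != POISON_FREE_BYTE)
            return 0;
    }
    return 1;
}
```

The address from the oops matches the pattern; an ordinary kernel address
from the register dump, such as e0000001c7f58000, does not.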


