[openib-general][patch review] srp: fmr implementation,
Vu Pham
vuhuong at mellanox.com
Tue Apr 11 17:26:28 PDT 2006
Hi Roland,
Sorry it took me this long to respond. Thanks for all the enhancements.
I have cc'ed an Engenio engineer who can help send the latest firmware
to you.
>
> This mostly works for me, but I still see one weird problem. If I
> make an FMR to cover IO of size more than 58 * 4096 bytes, the IO
> never completes. The SCSI midlayer times it out and aborts it, and
> the target responds to the task management command. I'm having a hard
> time imagining that this is an SRP initiator or even low-level HCA
> driver bug -- it seems more likely to be a target bug (I am using an
> Engenio target to test, and I may have down-rev firmware).
>
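For reference, the threshold quoted above works out to 58 * 4096 = 237568 bytes (232 KiB). The sketch below only restates that arithmetic; the helper names are hypothetical, and it assumes a page-aligned buffer and a 4096-byte FMR page size, which is taken from the figure in the quoted text rather than from the patch itself.

```python
PAGE_SIZE = 4096      # page size used in the quoted report
WORKING_PAGES = 58    # largest mapping reported to complete

def fmr_pages(io_bytes, page_size=PAGE_SIZE):
    """Pages needed to cover io_bytes (hypothetical helper, aligned buffer assumed)."""
    return -(-io_bytes // page_size)  # ceiling division

def exceeds_observed_limit(io_bytes):
    """True if the I/O is larger than the 58-page size reported to work."""
    return io_bytes > WORKING_PAGES * PAGE_SIZE

print(WORKING_PAGES * PAGE_SIZE)
print(fmr_pages(256 * 1024), exceeds_observed_limit(256 * 1024))
```

A 256 KiB request, for example, needs 64 pages and so falls in the range where the quoted report sees the I/O hang.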
If you have Santricity, you can check the current controller firmware
version and update it to the latest.
> I would be very happy to hear test reports with other targets,
>
Here is the status of my testing of this patch.
On an x86-64 system, Engenio's Smash test tool reported data corruption
after ~4 hrs of running against Engenio storage.
On an ia64 system I got multiple async events 3 (IB_EVENT_QP_ACCESS_ERR)
and event 1 (IB_EVENT_QP_FATAL); finally the error handling path kicked
in and the system panicked. Please see the log below (I tested with
Mellanox's SRP target reference implementation; I don't see this error
without the patch).
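As an aid for reading the log, here is a minimal sketch that tallies the "ib_srp: QP event N" lines by name. Only the two codes named in this message are mapped; the function name and the sample list are illustrative, not part of ib_srp.

```python
import re
from collections import Counter

# Only the two codes named in this message are mapped; any other value
# is reported as unknown rather than guessed.
QP_EVENT_NAMES = {
    1: "IB_EVENT_QP_FATAL",
    3: "IB_EVENT_QP_ACCESS_ERR",
}

QP_EVENT_RE = re.compile(r"ib_srp: QP event (\d+)")

def tally_qp_events(lines):
    """Count 'ib_srp: QP event N' log lines, keyed by event name."""
    counts = Counter()
    for line in lines:
        match = QP_EVENT_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[QP_EVENT_NAMES.get(code, f"unknown({code})")] += 1
    return counts

sample = [
    "Apr 7 18:15:10 lab105 kernel: ib_srp: QP event 3",
    "Apr 7 18:15:10 lab105 kernel: ib_srp: failed receive status 5",
    "Apr 7 18:16:05 lab105 kernel: ib_srp: QP event 1",
]
print(dict(tally_qp_events(sample)))
```

Note that syslog's "last message repeated 3 times" lines are not expanded by this sketch, so the counts are a lower bound.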
Apr 7 18:15:10 lab105 kernel: ib_srp: QP event 3
Apr 7 18:15:10 lab105 kernel: ib_srp: failed receive status 5
Apr 7 18:15:13 lab105 kernel: ib_srp: connection closed
Apr 7 18:15:43 lab105 kernel: SRP abort called
Apr 7 18:15:43 lab105 kernel: Abort for req_index 0
Apr 7 18:15:43 lab105 kernel: SRP abort called
Apr 7 18:15:43 lab105 kernel: Abort for req_index 1
Apr 7 18:15:43 lab105 kernel: SRP abort called
Apr 7 18:15:43 lab105 kernel: Abort for req_index 2
Apr 7 18:15:43 lab105 kernel: SRP reset_device called
Apr 7 18:15:43 lab105 kernel: Abort for req_index 1
Apr 7 18:15:43 lab105 kernel: SRP reset_device called
Apr 7 18:15:43 lab105 kernel: Abort for req_index 2
Apr 7 18:15:48 lab105 kernel: SRP reset_device called
Apr 7 18:15:48 lab105 kernel: Abort for req_index 0
Apr 7 18:15:48 lab105 kernel: ib_srp: failed receive status 5
Apr 7 18:15:50 lab105 kernel: ib_srp: connection closed
Apr 7 18:15:53 lab105 kernel: ib_srp: SRP reset_host called
Apr 7 18:15:55 lab105 kernel: ib_srp: connection closed
Apr 7 18:16:05 lab105 kernel: ib_mthca 0000:05:00.0: CQ overrun on CQN
000082
Apr 7 18:16:05 lab105 kernel: ib_srp: QP event 1
Apr 7 18:16:05 lab105 last message repeated 3 times
Apr 7 18:16:15 lab105 kernel: SRP abort called
Apr 7 18:16:15 lab105 kernel: Abort for req_index 0
Apr 7 18:16:20 lab105 kernel: ib_srp: QP event 1
Apr 7 18:16:20 lab105 kernel: ib_srp: QP event 1
Apr 7 18:16:30 lab105 kernel: SRP abort called
Apr 7 18:16:30 lab105 kernel: Abort for req_index 1
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: scsi: Device offlined - not
ready after error recovery
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: scsi: Device offlined - not
ready after error recovery
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical
block 0
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical
block 1
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical
block 0
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical
block 1
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:7: rejecting I/O to offline device
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdj, logical
block 0
Apr 7 18:16:35 lab105 kernel: ib_srp: QP event 1
Apr 7 18:16:35 lab105 kernel: ib_srp: QP event 1
Apr 7 18:16:35 lab105 kernel: sd 2:0:0:6: rejecting I/O to offline device
Apr 7 18:16:35 lab105 kernel: Buffer I/O error on device sdi, logical
block 0
Apr 7 18:17:05 lab105 kernel: SRP abort called
Apr 7 18:17:05 lab105 kernel: Abort for req_index 2
Apr 7 18:17:10 lab105 kernel: SRP reset_device called
Apr 7 18:17:10 lab105 kernel: Abort for req_index 2
Apr 7 18:17:15 lab105 kernel: ib_srp: SRP reset_host called
Apr 7 18:17:17 lab105 kernel: ib_srp: connection closed
Apr 7 18:17:17 lab105 kernel: Unable to handle kernel paging request at
virtual address 6b6b6b6b6b6b6b6b
Apr 7 18:17:17 lab105 kernel: scsi_eh_2[14050]: Oops 11012296146944 [1]
Apr 7 18:17:17 lab105 kernel: Modules linked in: ib_srp ib_cm ib_sa
ib_umad evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp
parport thermal processor fan button ipv6 binfmt_misc ib_mthca ib_mad
ib_core usbhid ehci_hcd uhci_hcd usbcore i2c_i801 i2c_core e1000
nls_iso8859_1 nls_cp437 dm_mod reiserfs mptspi mptscsih mptbase sd_mod
scsi_mod
Apr 7 18:17:17 lab105 kernel:
Apr 7 18:17:17 lab105 kernel: Pid: 14050, CPU 0, comm: scsi_eh_2
Apr 7 18:17:17 lab105 kernel: psr : 0000121008026018 ifs :
800000000000050d ip : [<a0000002022e5571>] Not tainted
Apr 7 18:17:17 lab105 kernel: ip is at srp_reconnect_target+0x2b1/0x5c0
[ib_srp]
Apr 7 18:17:17 lab105 kernel: unat: 0000000000000000 pfs :
000000000000050d rsc : 0000000000000003
Apr 7 18:17:17 lab105 kernel: rnat: 0000000000000000 bsps:
0000000000000000 pr : 0000000000009941
Apr 7 18:17:17 lab105 kernel: ldrs: 0000000000000000 ccv :
0000000000000000 fpsr: 0009804c8a70433f
Apr 7 18:17:17 lab105 kernel: csd : 0000000000000000 ssd : 0000000000000000
Apr 7 18:17:17 lab105 kernel: b0 : a0000002022e54e0 b6 :
a000000100003320 b7 : a0000002020a36a0
Apr 7 18:17:17 lab105 kernel: f6 : 1003e6b6b6b6b6b6b6b6b f7 :
0ffe7e694bf1a00000000
Apr 7 18:17:17 lab105 kernel: f8 : 1003e0000000000002418 f9 :
1003e0000000000000021
Apr 7 18:17:17 lab105 kernel: f10 : 1000483fffffff96976e2 f11 :
1003e0000000000000021
Apr 7 18:17:17 lab105 kernel: r1 : a0000002022e8278 r2 :
e0000001c6e75a18 r3 : e0000001d0a64a10
Apr 7 18:17:17 lab105 kernel: r8 : e0000001c6e75a68 r9 :
e0000001c6e758b8 r10 : a0000001008e9f00
Apr 7 18:17:17 lab105 kernel: r11 : 0000000000000001 r12 :
e0000001c7f5fd30 r13 : e0000001c7f58000
Apr 7 18:17:17 lab105 kernel: r14 : a0000001008e9f08 r15 :
e0000001c7f58000 r16 : 0000000000000000
Apr 7 18:17:17 lab105 kernel: r17 : 0000000000000000 r18 :
e0000001c7f58d84 r19 : a0000001008e9f10
Apr 7 18:17:17 lab105 kernel: r20 : ffffffffffffffff r21 :
0000000000000008 r22 : e000000004790000
Apr 7 18:17:17 lab105 kernel: r23 : e0000001e05f7cd0 r24 :
0000000000000080 r25 : e00000000479001f
Apr 7 18:17:17 lab105 kernel: r26 : a0000002020a36a0 r27 :
e0000001efcca1e0 r28 : e0000001efcca000
Apr 7 18:17:17 lab105 kernel: r29 : e0000001e05f7c30 r30 :
e0000001d0a64a88 r31 : e0000001d0a649f0
Apr 7 18:17:17 lab105 kernel:
Apr 7 18:17:17 lab105 kernel: Call Trace:
Apr 7 18:17:17 lab105 kernel: [<a0000001000136a0>] show_stack+0x80/0xa0
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5f8b0 bsp=e0000001c7f59110
Apr 7 18:17:17 lab105 kernel: [<a000000100013f00>] show_regs+0x840/0x880
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fa80 bsp=e0000001c7f590b0
Apr 7 18:17:17 lab105 kernel: [<a000000100036fd0>] die+0x1b0/0x240
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fa90 bsp=e0000001c7f59068
Apr 7 18:17:17 lab105 kernel: [<a00000010005a770>]
ia64_do_page_fault+0x970/0xae0
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fab0 bsp=e0000001c7f59000
Apr 7 18:17:17 lab105 kernel: [<a00000010000be60>]
ia64_leave_kernel+0x0/0x280
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fb60 bsp=e0000001c7f59000
Apr 7 18:17:17 lab105 kernel: [<a0000002022e5570>]
srp_reconnect_target+0x2b0/0x5c0 [ib_srp]
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fd30 bsp=e0000001c7f58f90
Apr 7 18:17:17 lab105 kernel: [<a0000002022e58e0>]
srp_reset_host+0x60/0xa0 [ib_srp]
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fdf0 bsp=e0000001c7f58f68
Apr 7 18:17:17 lab105 kernel: [<a000000201b280d0>]
scsi_try_host_reset+0xd0/0x240 [scsi_mod]
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fdf0 bsp=e0000001c7f58f38
Apr 7 18:17:17 lab105 kernel: [<a000000201b2a8a0>]
scsi_error_handler+0x1880/0x22c0 [scsi_mod]
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fdf0 bsp=e0000001c7f58e50
Apr 7 18:17:17 lab105 kernel: [<a0000001000cc540>] kthread+0x220/0x280
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fe10 bsp=e0000001c7f58e10
Apr 7 18:17:17 lab105 kernel: [<a000000100011a60>]
kernel_thread_helper+0xe0/0x100
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fe30 bsp=e0000001c7f58de0
Apr 7 18:17:17 lab105 kernel: [<a000000100009120>]
start_kernel_thread+0x20/0x40
Apr 7 18:17:17 lab105 kernel:
sp=e0000001c7f5fe30 bsp=e0000001c7f5
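
A note on reading the oops: the faulting address 6b6b6b6b6b6b6b6b is the byte 0x6b repeated, which is the value the Linux slab debugging code writes into freed objects (POISON_FREE). Assuming slab poisoning was enabled on this kernel, that would be consistent with srp_reconnect_target following a pointer loaded from already-freed memory, though the log alone does not prove it. A minimal sketch of the check, with a hypothetical helper name:

```python
POISON_FREE = 0x6B  # byte the Linux slab debug code writes into freed objects

def looks_like_freed_slab_pointer(addr, width_bytes=8):
    """True if every byte of addr is the slab free-poison byte."""
    return addr.to_bytes(width_bytes, "little") == bytes([POISON_FREE]) * width_bytes

# Faulting address from the oops, and the r12 value from the register dump.
print(looks_like_freed_slab_pointer(0x6B6B6B6B6B6B6B6B))
print(looks_like_freed_slab_pointer(0xE0000001C7F5FD30))
```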