[openib-general][patch review] srp: fmr implementation
Vu Pham
vuhuong at mellanox.com
Wed Apr 12 15:03:24 PDT 2006
>
> Vu> Here is my status of testing this patch. On an x86-64 system I
> Vu> got a data corruption problem reported after ~4 hrs of running
> Vu> Engenio's Smash test tool against Engenio storage. On an ia64
> Vu> system I got multiple async event 3 (IB_EVENT_QP_ACCESS_ERR)
> Vu> and event 1 (IB_EVENT_QP_FATAL); finally the error handling
> Vu> path kicked in and the system panicked. Please see the log
> Vu> below (I tested with Mellanox's SRP target reference
> Vu> implementation - I don't see this error without the patch).
>
> Hmm, that's interesting. Did you see this type of problem with the
> original FMR patch you wrote (and did you do this level of stress
> testing)? I'm wondering whether the issue is in the SRP driver, or
> whether there is a bug in the FMR stuff at a lower level.
>
I stress-tested on x86_64 and did not see the data corruption problem.
I restarted the test with your patch and have seen no problem so far
(~15 hrs). When I tested with my original patch on ia64 I hit a
different problem:
swapper[0]: Oops 8813272891392 [1]
Modules linked in: ib_srp ib_sa ib_cm ib_umad evdev joydev sg st sr_mod
ide_cd cdrom usbserial parport_pc lp parport thermal processor ipv6 fan
button ib_mthca ib_mad ib_core bd
Pid: 0, CPU 0, comm: swapper
psr : 0000101008022038 ifs : 8000000000000003 ip : [<a0000001002f68f0>]    Not tainted
ip is at __copy_user+0x890/0x960
unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003
rnat: e0000001fd1cbb64 bsps: a0000001008e9ef8 pr : 80000000a96627a7
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001003019f0 b6 : a000000100003320 b7 : a000000100302120
f6 : 000000000000000000000 f7 : 1003eff23971ce39d6000
f8 : 1003ef840500400886000 f9 : 100068000000000000000
f10 : 10005fffffffff0000000 f11 : 1003e0000000000000080
r1 : a000000100ae8b50 r2 : 0d30315052534249 r3 : 0d3031505253424a
r8 : a000000100902570 r9 : 2d3031504db9c249 r10 : 0000000000544f53
r11 : e000000004998000 r12 : a0000001007bfb20 r13 : a0000001007b8000
r14 : a0007ffffdc00000 r15 : a000000100902540 r16 : a000000100902570
r17 : 0000000000000000 r18 : ffffffffffffffff r19 : e5c738e7c46c654d
r20 : e5c738e758000000 r21 : ff23971ce39d6000 r22 : c202802004430000
r23 : e0000001e2fafd78 r24 : 6203002002030000 r25 : e0000001e6fec18b
r26 : ffffffffffffff80 r27 : 0000000000000000 r28 : 0d30315052534000
r29 : 0000000000000001 r30 : ffffffffffffffff r31 : a0000001007480c8
Call Trace:
[<a0000001000136a0>] show_stack+0x80/0xa0
sp=a0000001007bf6a0 bsp=a0000001007b94c0
[<a000000100013f00>] show_regs+0x840/0x880
sp=a0000001007bf870 bsp=a0000001007b9460
[<a000000100036fd0>] die+0x1b0/0x240
sp=a0000001007bf880 bsp=a0000001007b9418
[<a00000010005a770>] ia64_do_page_fault+0x970/0xae0
sp=a0000001007bf8a0 bsp=a0000001007b93a8
[<a00000010000be60>] ia64_leave_kernel+0x0/0x280
sp=a0000001007bf950 bsp=a0000001007b93a8
[<a0000001002f68f0>] __copy_user+0x890/0x960
sp=a0000001007bfb20 bsp=a0000001007b9390
[<a0000001003019f0>] unmap_single+0x90/0x2a0
sp=a0000001007bfb20 bsp=a0000001007b9388
[<a0000001007bf960>] init_task+0x7960/0x8000
sp=a0000001007bfb20 bsp=a0000001007b90e0
[<a0000001003019f0>] unmap_single+0x90/0x2a0
sp=a0000001007bfb20 bsp=a0000001007b8e38
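For context on where "the FMR stuff at a lower level" comes in: as I read
the patch, each command's pages are mapped through the kernel FMR pool and
unmapped again on completion. Below is only a rough sketch of that map
step, assuming the standard ib_fmr_pool API is what is used; the function
and variable names (map_cmd_pages, pool, dma_pages, page_cnt, io_addr) are
illustrative, not the exact code in the patch.

    #include <linux/err.h>
    #include <rdma/ib_fmr_pool.h>

    /*
     * Map one command's DMA page list through the FMR pool.  The caller
     * keeps the returned handle and calls ib_fmr_pool_unmap() when the
     * command completes.
     */
    static struct ib_pool_fmr *map_cmd_pages(struct ib_fmr_pool *pool,
                                             u64 *dma_pages, int page_cnt,
                                             u64 io_addr)
    {
            struct ib_pool_fmr *pfmr;

            pfmr = ib_fmr_pool_map_phys(pool, dma_pages, page_cnt, io_addr);
            if (IS_ERR(pfmr))
                    return pfmr;    /* caller falls back to direct mapping */

            /*
             * pfmr->fmr->rkey is what ends up in the SRP memory descriptor;
             * an rkey that has already been unmapped or remapped would be
             * one way to get IB_EVENT_QP_ACCESS_ERR on the initiator side.
             */
            return pfmr;
    }

If the ib_fmr_pool_unmap() on completion ever raced with the target's
RDMA, that could match the access errors above, but I have not confirmed
that.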
> What kind of HCAs were you using? I assume on ia64 you're using
> PCI-X, what about on x86-64? PCIe or not? Memfree or not?
>
PCI-X on ia64, and a MemFree PCIe HCA on x86_64.
> Another thing that might be useful if it's convenient for you would be
> to use an IB analyzer and trigger on a NAK to see what happens on the
> wire around the IB_EVENT_QP_ACCESS_ERR.
I'll capture some logs with the analyzer when it's available.
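For reference, the "async event 3" / "event 1" numbers in my earlier log
are just ib_event_type values, presumably printed by the QP's async event
callback. A minimal sketch of such a callback, as it would be passed in
ib_qp_init_attr.event_handler (the function name and message prefix here
are illustrative):

    #include <linux/kernel.h>
    #include <rdma/ib_verbs.h>

    /*
     * QP async event callback.  event->event is the numeric code seen
     * in the log: 1 == IB_EVENT_QP_FATAL, 3 == IB_EVENT_QP_ACCESS_ERR.
     */
    static void srp_qp_event_sketch(struct ib_event *event, void *context)
    {
            printk(KERN_ERR "SRP: QP async event %d\n", event->event);
    }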
Vu