[openib-general][patch review] srp: fmr implementation
Vu Pham
vuhuong at mellanox.com
Wed Apr 12 15:03:24 PDT 2006
>
> Vu> Here is my status of testing this patch. On an x86-64 system I
> Vu> got a data corruption problem reported after ~4 hrs of running
> Vu> Engenio's Smash test tool against Engenio storage. On an ia64
> Vu> system I got multiple async event 3 (IB_EVENT_QP_ACCESS_ERR)
> Vu> and event 1 (IB_EVENT_QP_FATAL); finally the error handling
> Vu> path kicked in and the system panicked. Please see the log
> Vu> below (I tested with Mellanox's SRP target reference
> Vu> implementation - I don't see this error without the patch).
>
> Hmm, that's interesting. Did you see this type of problem with the
> original FMR patch you wrote (and did you do this level of stress
> testing)? I'm wondering whether the issue is in the SRP driver, or
> whether there is a bug in the FMR stuff at a lower level.
>
I stress-tested on x86_64 and did not see the data corruption problem.
I restarted the test with your patch and have seen no problem so far
(~15 hrs). When I tested with my original patch on ia64 I hit a
different problem:
swapper[0]: Oops 8813272891392 [1]
Modules linked in: ib_srp ib_sa ib_cm ib_umad evdev joydev sg st sr_mod
ide_cd cdrom usbserial parport_pc lp parport thermal processor ipv6 fan
button ib_mthca ib_mad ib_core bd
Pid: 0, CPU 0, comm: swapper
psr : 0000101008022038 ifs : 8000000000000003 ip : [<a0000001002f68f0>]    Not tainted
ip is at __copy_user+0x890/0x960
unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003
rnat: e0000001fd1cbb64 bsps: a0000001008e9ef8 pr : 80000000a96627a7
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001003019f0 b6 : a000000100003320 b7 : a000000100302120
f6 : 000000000000000000000 f7 : 1003eff23971ce39d6000
f8 : 1003ef840500400886000 f9 : 100068000000000000000
f10 : 10005fffffffff0000000 f11 : 1003e0000000000000080
r1 : a000000100ae8b50 r2 : 0d30315052534249 r3 : 0d3031505253424a
r8 : a000000100902570 r9 : 2d3031504db9c249 r10 : 0000000000544f53
r11 : e000000004998000 r12 : a0000001007bfb20 r13 : a0000001007b8000
r14 : a0007ffffdc00000 r15 : a000000100902540 r16 : a000000100902570
r17 : 0000000000000000 r18 : ffffffffffffffff r19 : e5c738e7c46c654d
r20 : e5c738e758000000 r21 : ff23971ce39d6000 r22 : c202802004430000
r23 : e0000001e2fafd78 r24 : 6203002002030000 r25 : e0000001e6fec18b
r26 : ffffffffffffff80 r27 : 0000000000000000 r28 : 0d30315052534000
r29 : 0000000000000001 r30 : ffffffffffffffff r31 : a0000001007480c8
Call Trace:
[<a0000001000136a0>] show_stack+0x80/0xa0
sp=a0000001007bf6a0 bsp=a0000001007b94c0
[<a000000100013f00>] show_regs+0x840/0x880
sp=a0000001007bf870 bsp=a0000001007b9460
[<a000000100036fd0>] die+0x1b0/0x240
sp=a0000001007bf880 bsp=a0000001007b9418
[<a00000010005a770>] ia64_do_page_fault+0x970/0xae0
sp=a0000001007bf8a0 bsp=a0000001007b93a8
[<a00000010000be60>] ia64_leave_kernel+0x0/0x280
sp=a0000001007bf950 bsp=a0000001007b93a8
[<a0000001002f68f0>] __copy_user+0x890/0x960
sp=a0000001007bfb20 bsp=a0000001007b9390
[<a0000001003019f0>] unmap_single+0x90/0x2a0
sp=a0000001007bfb20 bsp=a0000001007b9388
[<a0000001007bf960>] init_task+0x7960/0x8000
sp=a0000001007bfb20 bsp=a0000001007b90e0
[<a0000001003019f0>] unmap_single+0x90/0x2a0
sp=a0000001007bfb20 bsp=a0000001007b8e38
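For context on where "the FMR stuff at a lower level" comes in: as I read
the patch, each command's pages are mapped through the kernel FMR pool and
unmapped again on completion. Below is only a rough sketch of that map
step, assuming the standard ib_fmr_pool API is what is used; the function
and variable names (map_cmd_pages, pool, dma_pages, page_cnt, io_addr) are
illustrative, not the exact code in the patch.

    #include <linux/err.h>
    #include <rdma/ib_fmr_pool.h>

    /*
     * Map one command's DMA page list through the FMR pool.  The caller
     * keeps the returned handle and calls ib_fmr_pool_unmap() when the
     * command completes.
     */
    static struct ib_pool_fmr *map_cmd_pages(struct ib_fmr_pool *pool,
                                             u64 *dma_pages, int page_cnt,
                                             u64 io_addr)
    {
            struct ib_pool_fmr *pfmr;

            pfmr = ib_fmr_pool_map_phys(pool, dma_pages, page_cnt, io_addr);
            if (IS_ERR(pfmr))
                    return pfmr;    /* caller falls back to direct mapping */

            /*
             * pfmr->fmr->rkey is what ends up in the SRP memory descriptor;
             * an rkey that has already been unmapped or remapped would be
             * one way to get IB_EVENT_QP_ACCESS_ERR on the initiator side.
             */
            return pfmr;
    }

If the ib_fmr_pool_unmap() on completion ever raced with the target's
RDMA, that could match the access errors above, but I have not confirmed
that.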
> What kind of HCAs were you using? I assume on ia64 you're using
> PCI-X, what about on x86-64? PCIe or not? Memfree or not?
>
PCI-X on ia64, and a MemFree PCIe HCA on x86_64.
> Another thing that might be useful if it's convenient for you would be
> to use an IB analyzer and trigger on a NAK to see what happens on the
> wire around the IB_EVENT_QP_ACCESS_ERR.
I'll capture some logs with the analyzer when it's available.
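For reference, the "async event 3" / "event 1" numbers in my earlier log
are just ib_event_type values, presumably printed by the QP's async event
callback. A minimal sketch of such a callback, as it would be passed in
ib_qp_init_attr.event_handler (the function name and message prefix here
are illustrative):

    #include <linux/kernel.h>
    #include <rdma/ib_verbs.h>

    /*
     * QP async event callback.  event->event is the numeric code seen
     * in the log: 1 == IB_EVENT_QP_FATAL, 3 == IB_EVENT_QP_ACCESS_ERR.
     */
    static void srp_qp_event_sketch(struct ib_event *event, void *context)
    {
            printk(KERN_ERR "SRP: QP async event %d\n", event->event);
    }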
Vu