[ofa-general] iSER data corruption issues

Pete Wyckoff pw at osc.edu
Thu Oct 4 09:14:07 PDT 2007


rdreier at cisco.com wrote on Wed, 03 Oct 2007 15:01 -0700:
>  > Machines are opteron, fedora 7 up-to-date with its openfab libs,
>  > kernel 2.6.23-rc6 on target.  Either 2.6.23-rc6 or 2.6.22 or
>  > 2.6.18-rhel5 on initiator.  For some reason, it is much easier to
>  > produce with the rhel5 kernel.
> 
> There was a bug in mthca that caused data corruption with FMRs on
> Sinai (1-port PCIe) HCAs.  It was fixed in commit 608d8268 ("IB/mthca:
> Fix data corruption after FMR unmap on Sinai") which went in shortly
> before 2.6.21 was released.  I don't know if the RHEL5 2.6.18 kernel
> has this fix or not -- but if you still see the problem on 2.6.22 and
> later kernels then this isn't the fix anyway.

This is definitely it.  The same test setup runs for an hour with
this patch, but fails in tens of seconds without it.  Thanks for
pointing it out.

This rhel5 kernel is 2.6.18-8.1.6.  Perhaps there are newer ones
about that have this critical patch included.  I'm going to add a
Big Fat Warning on the iser distribution about pre-2.6.21 kernels.
That rhel5 kernel also crashes if the iSER connection drops in a
certain easy-to-reproduce way, another reason to avoid it.
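
For anyone wanting to automate that warning, here is a rough sketch
of a version check in Python.  The function names are made up for
illustration, and the 2.6.21 threshold is taken only from the commit
history mentioned above.  It is a heuristic at best: a distro kernel
may carry the fix as a backport, and an -rc of 2.6.21 may predate it.

```python
import re

# Assumption from this thread: the mthca FMR fix went in shortly
# before 2.6.21 was released, so anything older is suspect.
FIXED_IN = (2, 6, 21)


def kernel_version_tuple(release):
    """Parse the leading numeric part of a release string such as
    '2.6.18-8.1.6' or '2.6.23-rc6' into a tuple of ints."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def predates_fmr_fix(release):
    """True if the upstream version number is older than 2.6.21.

    This cannot see distro backports, and it does not flag 2.6.21-rc
    kernels, so treat the result as a hint rather than proof."""
    return kernel_version_tuple(release) < FIXED_IN
```

On the initiator one could feed it the output of platform.release()
and print the warning when it returns True.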

Regarding the "larger" test I talked about that fails even on modern
kernels, I'm still not able to reproduce that on my setup.  I ran it
literally all night with a hacked target that calculated the return
buffer rather than accessing the disk.  For now I'm calling that a
separate bug and will investigate it further.
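
The computed-buffer idea can be sketched roughly as follows, in
Python for brevity.  This is not the actual target code; the block
size, the pattern (each block filled with its own LBA) and all the
names are assumptions for illustration.  The point is that both ends
can derive the expected data from the block address alone, so the
disk is taken out of the picture.

```python
import struct

BLOCK_SIZE = 512  # assumed block size; must be a multiple of 8


def expected_block(lba, block_size=BLOCK_SIZE):
    """Fill one block with its own LBA, repeated as little-endian
    64-bit words."""
    word = struct.pack("<Q", lba)
    return word * (block_size // len(word))


def expected_buffer(start_lba, num_blocks, block_size=BLOCK_SIZE):
    """What the hacked target would return for a read of num_blocks
    blocks starting at start_lba."""
    return b"".join(
        expected_block(start_lba + i, block_size) for i in range(num_blocks)
    )


def first_mismatch(data, start_lba, block_size=BLOCK_SIZE):
    """Initiator-side check: return the LBA of the first block that
    differs from the expected pattern, or None if all blocks match."""
    for offset in range(0, len(data), block_size):
        lba = start_lba + offset // block_size
        if data[offset:offset + block_size] != expected_block(lba, block_size):
            return lba
    return None
```

Because the pattern encodes the LBA, a mismatch also shows whether
the bad block holds data belonging to some other address, which is
the sort of thing a stale mapping would produce.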

Thanks to Tom and Tom for helping debug this.

		-- Pete


