[Openib-windows] wsd over mt23108 data corruption issue (x64)

Fabian Tillier ftillier at silverstorm.com
Tue May 9 09:46:30 PDT 2006


Hi Guy,

On 5/9/06, Guy Corem <guyc at voltaire.com> wrote:
>
> Hi Fab and Leonid,
>
> While testing an MPI utility that uses large buffers (>= 64MB), I've
> encountered a data corruption bug.
>
> I was able to reproduce it directly over WSD.
>
> Simplest reproducing scenario:
>
> Use pcattcp from
> http://www.pcausa.com/Utilities/ttcpdown1.htm (compiled to
> 32 bit or 64 bit)
>
> Use a large file (>= 64MB)
>
> I've used a 90MB text file
>
> Receiver command line: pcattcp.exe -r -s > file2
>
> Sender command line: pcattcp.exe -s -l 150000000 -n 1 -t receiver_ip < file
>
> When comparing both files, I see that file2 has 512 bytes misplaced after
> about 45MB (512 bytes were sent twice)
>
> The problem doesn't occur on 32 bit machines.
>
> The problem doesn't occur with new mthca low level driver.
>
> I suspect a memory registration problem, but haven't been able to track it
> down yet.

Memory registration works on full-page (4KB) granularity, so a
sub-page mixup is unlikely to be registration related.  If you saw 4KB
repeated or put in the wrong place, then I'd agree with you.
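
To tell the two cases apart, it may help to measure where the files
differ and whether the damage lines up with page boundaries.  Below is a
minimal sketch in Python, not a tested tool.  The function names are
made up for illustration, and it assumes both files fit in memory and
that a byte-wise comparison is good enough to locate the damaged region:

```python
PAGE_SIZE = 4096  # registration granularity discussed above (4KB pages)


def mismatched_runs(a: bytes, b: bytes):
    """Return (offset, length) for each contiguous run of differing bytes.

    Only the common prefix length of the two buffers is compared.
    """
    runs = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the data
        runs.append((start, n - start))
    return runs


def looks_page_granular(offset, length, page_size=PAGE_SIZE):
    """True if a run starts on a page boundary and spans whole pages."""
    return length > 0 and offset % page_size == 0 and length % page_size == 0
```

Reading `file` and `file2` from your test with `open(name, "rb").read()`
and passing them to `mismatched_runs()` should give the offset and size
of the bad region.  A run that is not a multiple of 4096 bytes, or that
does not start on a 4096-byte boundary, would point away from
registration.  Note that bytes inside a misplaced block can match by
chance, so one bad block may show up as several short adjacent runs, and
the plain loop will be slow on a 90MB file.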

This sounds like a timing issue - the timing on 32-bit is different
than on 64-bit.  Likewise, the timings of the two HCA drivers are
different.  I suspect that 32-bit and MTHCA runs are just masking the
problem, and that it's not actually an issue with the HCA driver.

> Did you encounter similar problems in the past with huge buffers?

I have not personally seen this issue, but that may be because my systems
have different timings such that the problem doesn't occur.

Erez Haba is working on a similar issue, but in his case it looks like
a problem in the WSD switch code, and with transfers smaller than
64MB.

- Fab


