[ofa-general] iSER data corruption issues
Tom Tucker
tom at opengridcomputing.com
Wed Oct 3 11:02:22 PDT 2007
On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote:
> How does the requester (in IB speak) know that an RDMA Write
> operation has completed on the responder?
>
> We have a software iSER target, available at git.osc.edu/tgt or
> browse at http://git.osc.edu/?p=tgt.git . Using the existing
> in-kernel iSER initiator code, very rarely data corruption occurs,
> in that the received data from SCSI read operations does not match
> what was expected. Sometimes it appears as if random kernel memory
> has been scribbled on by an errant RDMA write from the target. My
> current working theory is that the RDMA write has not completed by the
> time the initiator looks at its incoming data buffer.
>
> Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write
> work requests are used. After everything is connected up, a SCSI
> read sequence looks like:
>
> initiator: register pages with FMR, write test pattern
> initiator: Send request to target
> target: Recv request
> target: RDMA Write response to initiator
> target: Wait for CQ entry for local RDMA Write completion
Pete:
I don't think this should be necessary...
> target: Send response to initiator
...as long as the send is posted on the same SQ as the write.
> initiator: Recv response, access buffer
>
> On very rare occasions, this buffer will have the test pattern, not
> the data that the target just sent.
>
> Machines are opteron, fedora 7 up-to-date with its openfab libs,
> kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or
> 2.6.18-rhel5 on initiator. For some reason, it is much easier to
> produce with the rhel5 kernel. One site with fast disks can see
> similar corruption with 2.6.23-rc6, however. Target is pure
> userspace. Initiator is in kernel and is poked by "lmdd" (like
> normal dd) through an iSCSI block device (/dev/sdb).
>
> The IB spec seems to indicate that the contents of the RDMA Write
> buffer should be stable after completion of a subsequent send
> message (o9-20). In fact, the "Wait for CQ entry" step on the
> target should be unnecessary, no?
I think so too.
>
> Could there be some caching issues that the initiator is missing?
> I've added print[fk]s to the initiator and target to verify that the
> sequence of events is truly as above, and that the virtual addresses
> are as expected on both sides.
>
> Any suggestions or advice would help. Thanks,
>
If your theory is correct, the data should eventually show up. Does it?
Does your code check for errors on dma_map_single/page?
> -- Pete
>
>
> P.S. Here are some debugging printfs not in the git.
>
> Userspace code does 200 read()s of length 8000, but complains about
> the result somewhere in the 14th read, from 112000 to 120000, and
> exits early. Expected pattern is a series of 400000 4-byte words,
> incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ...,
> 0x001869fc:
>
> % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10
> off=112000 want=1c000 got=3b3b3b3b
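For anyone following along, the check lmdd is reporting can be sketched
in a few lines of Python. This is only an illustration of the pattern as
described above (each 4-byte word holds its own byte offset), not lmdd's
actual code. The function names and the little-endian word layout are
assumptions made for the example.

```python
import struct

WORD = 4  # bytes per pattern word


def make_pattern(nbytes, start=0):
    """Build the expected data: each 4-byte word holds its own byte offset."""
    return b"".join(
        struct.pack("<I", off) for off in range(start, start + nbytes, WORD)
    )


def first_mismatch(buf, start=0):
    """Return (want, got) for the first bad word, or None if all words match.

    'want' is also the byte offset of the bad word within the device.
    """
    for i in range(0, len(buf) - len(buf) % WORD, WORD):
        (got,) = struct.unpack_from("<I", buf, i)
        want = start + i
        if got != want:
            return (want, got)
    return None


# Simulate the failure: 200 reads of 8000 bytes, with the one page at
# 0x1c000 still holding a 0x3b fill instead of the pattern.
data = bytearray(make_pattern(200 * 8000))
data[0x1C000:0x1C000 + 0x1000] = b"\x3b" * 0x1000
want, got = first_mismatch(bytes(data))
print("want=%x got=%x" % (want, got))
```

Note that the bad word sits at byte 0x1c000 = 114688, which is a 4096-byte
page boundary inside the 8000-byte read that starts at 112000.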
>
> Initiator generates a series of SCSI operations, as driven by
> readahead and the block queue scheduler. You can see that it starts
> reading 4 pages, then 1 page, then 23 pages, then 1 page and so on,
> in order. These sizes and offsets vary from run to run. Each line
> here is printed after the SCSI read response has been received. It
> prints the first word in the buffer, and you can see the test
> pattern where data should be:
>
> tag 02 va 36061000 len 4000 word0 00000000 ref 1
> tag 03 va 36065000 len 1000 word0 00004000 ref 1
> tag 04 va 36066000 len 17000 word0 00005000 ref 1
> tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
Is it interesting that the bad word occurs on the first page of the new
map?
> tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
> tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1
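To make the "new map" question concrete, here is a rough Python sketch
that walks the initiator log above. It assumes the va, len and word0
fields are all hex, that word0 should equal the running byte offset, and
it treats any va that does not continue from the end of the previous
buffer as the start of a new mapping. That last definition is my
assumption for the example, not something the iSER code reports.

```python
import re

INITIATOR_LOG = """\
tag 02 va 36061000 len 4000 word0 00000000 ref 1
tag 03 va 36065000 len 1000 word0 00004000 ref 1
tag 04 va 36066000 len 17000 word0 00005000 ref 1
tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1
tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1
"""

LINE = re.compile(r"tag (\w+) va (\w+) len (\w+) word0 (\w+)")


def check_initiator_log(text):
    """Return a list of (tag, want, got, new_map), one entry per request.

    want:    running byte offset, i.e. what word0 should contain
    got:     word0 as printed by the initiator
    new_map: True if va is not contiguous with the previous buffer
    """
    rows = []
    offset = 0
    prev_end = None
    for m in LINE.finditer(text):
        tag = m.group(1)
        va, length, got = (int(x, 16) for x in m.group(2, 3, 4))
        new_map = prev_end is not None and va != prev_end
        rows.append((tag, offset, got, new_map))
        offset += length
        prev_end = va + length
    return rows


for tag, want, got, new_map in check_initiator_log(INITIATOR_LOG):
    status = "ok " if want == got else "BAD"
    print("tag %s %s want=%08x got=%08x new_map=%s" % (tag, status, want, got, new_map))
```

By this measure tag 05 is the only bad request and it does start at a
discontiguous address, but tag 07 also starts at a discontiguous address
and its first word is fine, so a new mapping alone does not explain it.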
>
> The userspace target code prints a line when it starts the RDMA
> write, then a line when the RDMA write completes locally, then a
> line when it sends the response. The tags are what the initiator
> assigned to each request. The target thinks it is sending a
> 4096-byte buffer that has 0x1c000 as its first word, but the
> initiator did not see it:
>
> tag 02 va 36061000 len 4000 word0 00000000 rdmaw
> tag 02 rdmaw completion
> tag 02 resp
> tag 03 va 36065000 len 1000 word0 00004000 rdmaw
> tag 03 rdmaw completion
> tag 03 resp
> tag 04 va 36066000 len 17000 word0 00005000 rdmaw
> tag 04 rdmaw completion
> tag 04 resp
> tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
> tag 05 rdmaw completion
> tag 05 resp
> tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
> tag 06 rdmaw completion
> tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
> tag 07 rdmaw completion
> tag 06 resp
> tag 07 resp
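The ordering claim can be checked mechanically against the target log
above. The sketch below is Python written just for this log format; the
event names are labels I made up for the three kinds of line. It only
confirms the target's local view (write posted, write completed locally,
response sent) and says nothing about when the data became visible in
initiator memory.

```python
import re

TARGET_LOG = """\
tag 02 va 36061000 len 4000 word0 00000000 rdmaw
tag 02 rdmaw completion
tag 02 resp
tag 03 va 36065000 len 1000 word0 00004000 rdmaw
tag 03 rdmaw completion
tag 03 resp
tag 04 va 36066000 len 17000 word0 00005000 rdmaw
tag 04 rdmaw completion
tag 04 resp
tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
tag 05 rdmaw completion
tag 05 resp
tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
tag 06 rdmaw completion
tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
tag 07 rdmaw completion
tag 06 resp
tag 07 resp
"""

EXPECTED = ["rdmaw", "completion", "resp"]


def events_by_tag(text):
    """Return {tag: [event, ...]} with events in the order they were logged."""
    events = {}
    for line in text.splitlines():
        m = re.match(r"tag (\w+) (.*)", line.strip())
        if not m:
            continue
        tag, rest = m.groups()
        if rest == "rdmaw completion":
            kind = "completion"
        elif rest == "resp":
            kind = "resp"
        elif rest.endswith("rdmaw"):
            kind = "rdmaw"
        else:
            continue  # unrecognized line, skip it
        events.setdefault(tag, []).append(kind)
    return events


events = events_by_tag(TARGET_LOG)
out_of_order = sorted(tag for tag, seq in events.items() if seq != EXPECTED)
print("tags with unexpected ordering:", out_of_order)
```

Every tag, including 05, shows write, local completion, then response, in
that order, even where the 06 and 07 requests overlap.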
>