[ofa-general] iSER data corruption issues

Pete Wyckoff pw at osc.edu
Wed Oct 3 10:42:52 PDT 2007


How does the requester (in IB speak) know that an RDMA Write
operation has completed on the responder?

We have a software iSER target, available at git.osc.edu/tgt or
browse at http://git.osc.edu/?p=tgt.git .  Using the existing
in-kernel iSER initiator code, data corruption occurs very rarely:
the data received from SCSI read operations does not match what was
expected.  Sometimes it appears as if random kernel memory has been
scribbled on by an errant RDMA write from the target.  My current
working theory is that the RDMA write has not completed by the time
the initiator looks at its incoming data buffer.

Single RC QP, single CQ, no SRQ.  Only Send, Receive, and RDMA Write
work requests are used.  After everything is connected up, a SCSI
read sequence looks like:

    initiator: register pages with FMR, write test pattern
    initiator: Send request to target
    target:    Recv request
    target:    RDMA Write response to initiator
    target:    Wait for CQ entry for local RDMA Write completion
    target:    Send response to initiator
    initiator: Recv response, access buffer

On very rare occasions, this buffer will have the test pattern, not
the data that the target just sent.
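
A minimal sketch of the check implied here, written in Python for
brevity rather than the kernel C of the initiator.  The 0x3b fill
byte is inferred from the 3b3b3b3b words in the debug output later in
this message; the 4096-byte page size and both function names are
assumptions for illustration only.

```python
PAGE_SIZE = 4096   # assumed page size
POISON = 0x3B      # fill byte, inferred from the 3b3b3b3b words in the logs


def poison_buffer(npages):
    """Return a buffer pre-filled with the test pattern (done before the Send)."""
    return bytearray([POISON]) * (npages * PAGE_SIZE)


def stale_pages(buf):
    """Return indices of pages whose first word still holds the test pattern.

    Only the first word of each page is examined, mirroring the debug
    printout, so a page that was partially written is not detected.
    """
    word = bytes([POISON]) * 4
    return [i for i in range(len(buf) // PAGE_SIZE)
            if buf[i * PAGE_SIZE:i * PAGE_SIZE + 4] == word]
```

Any page reported by stale_pages() after the response has arrived
was, as far as this check can tell, never written by the target.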

Machines are Opterons running an up-to-date Fedora 7 with its
OpenFabrics libraries, kernel 2.6.23-rc6 on target.  Either
2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator.  For some reason,
it is much easier to reproduce with the rhel5 kernel.  One site with
fast disks can see similar corruption with 2.6.23-rc6, however.
Target is pure userspace.  Initiator is in kernel and is poked by
"lmdd" (like normal dd) through an iSCSI block device (/dev/sdb).

The IB spec seems to indicate that the contents of the RDMA Write
buffer should be stable after completion of a subsequent send
message (o9-20).  In fact, the "Wait for CQ entry" step on the
target should be unnecessary, no?

Could there be some caching issues that the initiator is missing?
I've added print[fk]s to the initiator and target to verify that the
sequence of events is truly as above, and that the virtual addresses
are as expected on both sides.

Any suggestions or advice would help.  Thanks,

		-- Pete


P.S.  Here is output from some debugging printfs not in the git tree.

Userspace code does 200 read()s of length 8000, but complains about
the result somewhere in the 14th read, from 112000 to 120000, and
exits early.  Expected pattern is a series of 400000 4-byte words,
incrementing by 4, starting from 0.  So 0x00000000, 0x00000004, ...,
0x001869fc:

% lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10
off=112000 want=1c000 got=3b3b3b3b
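
The pattern described above (each 4-byte word holds its own byte
offset) can be sketched as follows.  This is an illustration, not
lmdd's code: the function names are made up, and little-endian word
order is an assumption based on the x86-64 machines in use.

```python
import struct

WORD = 4  # bytes per pattern word


def make_pattern(offset, length):
    """Expected bytes for [offset, offset+length): word value == byte offset.

    Little-endian packing is an assumption, not taken from lmdd.
    """
    assert offset % WORD == 0 and length % WORD == 0
    n = length // WORD
    return struct.pack("<%dI" % n, *range(offset, offset + length, WORD))


def first_mismatch(data, offset):
    """Return (byte_offset, want, got) for the first wrong word, else None."""
    for i in range(0, len(data) - WORD + 1, WORD):
        (got,) = struct.unpack_from("<I", data, i)
        want = offset + i
        if got != want:
            return (offset + i, want, got)
    return None
```

Note that 0x1c000 is byte 114688, which lies 2688 bytes into the
8000-byte read that starts at 112000, so the first bad word coincides
with the start of a 4096-byte-aligned region rather than with the
start of the read.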

Initiator generates a series of SCSI operations, as driven by
readahead and the block queue scheduler.  You can see that it starts
reading 4 pages, then 1 page, then 23 pages, then 1 page and so on,
in order.  These sizes and offsets vary from run to run.  Each line
here is printed after the SCSI read response has been received.  It
prints the first word in the buffer, and you can see the test
pattern where data should be:

tag 02 va 36061000 len  4000 word0 00000000 ref 1
tag 03 va 36065000 len  1000 word0 00004000 ref 1
tag 04 va 36066000 len 17000 word0 00005000 ref 1
tag 05 va 7b6bc000 len  1000 word0 3b3b3b3b ref 1
tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1
tag 07 va 7bdc2000 len 20000 word0 0003c000 ref 1
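
Because the requests are issued in order, each request's first word
should equal the sum of the lengths before it.  The sketch below
checks that mechanically.  It assumes the line format shown above,
that all numeric fields are hex, and that the log starts at file
offset 0; the function name is hypothetical.

```python
import re

# Matches lines such as: "tag 05 va 7b6bc000 len  1000 word0 3b3b3b3b ref 1"
LINE = re.compile(r"tag (\w+) va (\w+) len\s+(\w+) word0 (\w+)")


def check_initiator_log(text):
    """Return [(tag, expected_word0, seen_word0)] for requests with a bad first word.

    Assumes requests appear in ascending file order starting at offset 0.
    """
    bad = []
    offset = 0
    for m in LINE.finditer(text):
        tag, _va, length, word0 = m.groups()
        seen = int(word0, 16)
        if seen != offset:
            bad.append((tag, offset, seen))
        offset += int(length, 16)   # next request starts where this one ends
    return bad
```

Run against the six lines above, this flags only tag 05: expected
0x1c000, saw 0x3b3b3b3b.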

The userspace target code prints a line when it starts the RDMA
write, then a line when the RDMA write completes locally, then a
line when it sends the response.  The tags are what the initiator
assigned to each request.  The target thinks it is sending a
4096-byte buffer that has 0x1c000 as its first word, but the
initiator did not see it:

tag 02 va 36061000 len  4000 word0 00000000 rdmaw
tag 02 rdmaw completion
tag 02 resp
tag 03 va 36065000 len  1000 word0 00004000 rdmaw
tag 03 rdmaw completion
tag 03 resp
tag 04 va 36066000 len 17000 word0 00005000 rdmaw
tag 04 rdmaw completion
tag 04 resp
tag 05 va 7b6bc000 len  1000 word0 0001c000 rdmaw
tag 05 rdmaw completion
tag 05 resp
tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
tag 06 rdmaw completion
tag 07 va 7bdc2000 len 20000 word0 0003c000 rdmaw
tag 07 rdmaw completion
tag 06 resp
tag 07 resp
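
The per-tag ordering in the target log (RDMA write posted, local
completion, then response) can be verified with a small script.  This
is only a sketch that assumes the three line shapes shown above; the
function name is hypothetical.

```python
def check_target_order(lines):
    """Return tags whose events are not exactly rdmaw -> completion -> resp.

    The event is taken to be the last token on each line, which holds
    for the three line shapes in the log above.
    """
    expected = ["rdmaw", "completion", "resp"]
    seen = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or parts[0] != "tag":
            continue   # skip anything that is not a log line
        seen.setdefault(parts[1], []).append(parts[-1])
    return [tag for tag, events in seen.items() if events != expected]
```

For the log above this returns an empty list, including for tags 06
and 07, whose events interleave but keep the required order per tag.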
