[ofa-general] iSER data corruption issues

Pete Wyckoff pw at osc.edu
Wed Oct 3 13:15:47 PDT 2007


tom at opengridcomputing.com wrote on Wed, 03 Oct 2007 13:02 -0500:
> On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote: 
> > My current working theory is that the RDMA write has not completed
> > by the time the initiator looks at its incoming data buffer.
[..]
> If your theory is correct, the data should eventually show up. Does it?

Good point.  It does not eventually show up.  I added five 1-second
busy-loop delays, checking each time to see if the values ever
change.  They don't.
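
For anyone wanting to repeat that check, here is a rough userspace
analogue of it.  This is only a sketch, not the code I ran in the
kernel: the helper name word_ever_changes() is made up for
illustration, and it sleeps with nanosleep() instead of busy-looping.
It re-reads one word of the buffer a fixed number of times and reports
whether the value ever differs from the first read.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper (not from the iSER code): re-read *word up to
 * `tries` times, sleeping `delay_ms` milliseconds before each read,
 * and return true as soon as the value differs from the first read.
 * The volatile qualifier forces a fresh load on every iteration.
 */
static bool word_ever_changes(const volatile uint32_t *word,
                              int tries, long delay_ms)
{
    assert(word != NULL);

    const uint32_t first = *word;
    const struct timespec ts = {
        .tv_sec  = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L,
    };

    for (int i = 0; i < tries; i++) {
        nanosleep(&ts, NULL);
        if (*word != first)
            return true;    /* late data showed up */
    }
    return false;           /* value never changed */
}
```

Calling word_ever_changes(&buf[0], 5, 1000) would correspond to the
five 1-second checks described above; a false return matches what I
see, i.e. the data never arrives late.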

> Does your code check for errors on dma_map_single/page? 

This is drivers/infiniband/ulp/iser/iser_verbs.c, in
iser_reg_page_vec, as called from iser_reg_rdma_mem.  It uses
ib_fmr_pool_map_phys, and would complain if it saw an error.  These
are page cache pages, and the FMR calls seem to take physical pages,
but never map them into DMA addresses.  Should be no mapping
required for opteron and arbel, though.  I could be misunderstanding
something here.

I don't see any major differences between this old 2.6.18-rhel5 and
2.6.23-rc6, except for a call to dma_sync_single() in
mthca_arbel_map_phys_fmr(), which I'm guessing is a noop on this
platform (swiotlb).  Unfortunately 2.6.23-rc6 does not break at my
site.  At the other site with fast disks, adding any sort of kernel
debugging apparently causes the problem to go away.  Frustrating.

> > tag 02 va 36061000 len  4000 word0 00000000 ref 1
> > tag 03 va 36065000 len  1000 word0 00004000 ref 1
> > tag 04 va 36066000 len 17000 word0 00005000 ref 1
> > tag 05 va 7b6bc000 len  1000 word0 3b3b3b3b ref 1
> 
> Is it interesting that the bad word occurs on the first page of the new
> map?

One would think so, but it is not always the first page.  Sometimes,
less often, it is the first word of a page in the middle of a map.
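
To make the pattern in that dump explicit, here is a small sketch of
the check it implies.  It assumes, based only on the quoted lines, that
the first word of each segment should equal the running byte offset of
that segment in the transfer (0, 0x4000, 0x5000, ...).  The struct and
function names are made up for illustration and are not from the iSER
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of the dump: segment length and first word observed. */
struct seg {
    uint32_t len;    /* segment length in bytes */
    uint32_t word0;  /* first 32-bit word read from the segment */
};

/*
 * Hypothetical checker: walk the segments in order, keeping a running
 * byte offset, and return the index of the first segment whose word0
 * does not equal that offset.  Returns -1 if every segment matches.
 */
static int first_bad_segment(const struct seg *segs, size_t n)
{
    assert(segs != NULL || n == 0);

    uint32_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        if (segs[i].word0 != offset)
            return (int)i;
        offset += segs[i].len;
    }
    return -1;
}
```

Fed the four quoted lines (lengths 0x4000, 0x1000, 0x17000, 0x1000),
this flags the last one: the expected value there is 0x1c000, not
3b3b3b3b.  It only looks at the first word of each segment, so it would
miss the less common case of a bad word at the start of a page in the
middle of a map; that would need the same comparison done at every
page boundary.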

I'll keep digging.  Thanks,

		-- Pete


