[ofa-general] iSER data corruption issues

Wed Oct 3 20:09:54 PDT 2007

 > Fair enough, but the FMR *pools* still worry me, because they manage
 > internal registrations and defer their manipulation. Depending on lots
 > of things beyond the consumer's control, they sometimes don't even
 > close the handles advertised to the RDMA peer.

The FMR pool stuff (especially with caching turned off, which is how
the iSER initiator uses the API) isn't really doing anything
particularly fancy.  It just keeps a list of FMRs that are available
to remap, and
batches up the unregistration.  It is true that an R_Key may remain
valid after an FMR is unmapped, but that's the whole point of FMRs: if
you don't batch up the real flushing to amortize the cost, they're no
better than regular MRs really.
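
To make the batching concrete, here is a toy model in Python.  It is
only a sketch of the idea described above, not the kernel's FMR pool
code: the class, its methods and the dirty_watermark threshold are all
names invented for this illustration, and a "key" here is just an
integer standing in for an R_Key.

```python
class ToyFmrPool:
    """Toy model of an FMR pool that batches unregistration (hypothetical)."""

    def __init__(self, size, dirty_watermark):
        self.free = list(range(size))    # FMR ids available to remap
        self.dirty = []                  # unmapped by the consumer, not yet flushed
        self.fmr_key = {}                # FMR id -> key currently programmed
        self.live_keys = {}              # key -> address the "HCA" still honours
        self.next_key = 1
        self.dirty_watermark = dirty_watermark
        self.flushes = 0

    def map(self, addr):
        """Take a free FMR, point it at addr, and return (fmr, key)."""
        if not self.free:
            self.flush()                 # try to reclaim dirty FMRs first
        if not self.free:
            raise RuntimeError("pool exhausted")
        fmr = self.free.pop()
        key = self.next_key
        self.next_key += 1
        self.fmr_key[fmr] = key
        self.live_keys[key] = addr
        return fmr, key

    def unmap(self, fmr):
        """Consumer is done with the FMR; the real invalidation is deferred."""
        self.dirty.append(fmr)
        if len(self.dirty) >= self.dirty_watermark:
            self.flush()

    def flush(self):
        """Invalidate every dirty FMR in one batch and return them to the free list."""
        if not self.dirty:
            return
        for fmr in self.dirty:
            del self.live_keys[self.fmr_key.pop(fmr)]
        self.free.extend(self.dirty)
        self.dirty.clear()
        self.flushes += 1

    def key_valid(self, key):
        return key in self.live_keys


pool = ToyFmrPool(size=4, dirty_watermark=2)
fmr_a, key_a = pool.map(0x1000)
pool.unmap(fmr_a)
still_valid_after_unmap = pool.key_valid(key_a)   # flush has not run yet
fmr_b, key_b = pool.map(0x2000)
pool.unmap(fmr_b)                                 # second dirty FMR triggers the batch
valid_after_flush = pool.key_valid(key_a)
```

In this model key_a stays usable between the first unmap and the
batched flush, which is the window being discussed: one flush pays for
two unmaps, at the price of the key outliving the unmap call.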

 > So, what else sends an RDMA write into the weeds? Short of writing
 > to the wrong address, it sure sounds like a dma consistency thing to
 > me. The connection wasn't lost, so it's not an error.

I don't have that feeling.  x86 systems are really pretty strongly
consistent with respect to DMA when you're not using any of the
GART/IOMMU stuff, so I think it's more likely that either the wrong
address is being given to the HCA somehow, or the mthca FMR
implementation is making the HCA write to the wrong address.

Especially since the correct data never shows up even after a long
time, it seems that the data must just be going to the wrong place.

Given that there was an FMR bug with 1-port Mellanox HCAs that caused
iSER corruption, I would like to make sure that the same thing isn't
hitting here as well.  Reproducing on 2.6.22 or 2.6.23-rcX (which have
the bug fixed) would rule that out, as would seeing the bug on
anything but a 1-port Mellanox HCA.
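
The elimination logic above can be written down as a small check.
This is just a sketch of the reasoning, assuming the fix is present
from 2.6.22 onward as stated; the function names and arguments are
made up for illustration, and the caller is assumed to already know
the kernel release string, the HCA vendor and the port count.

```python
import re

# First kernel release said above to contain the fix.
FIXED_IN = (2, 6, 22)


def parse_kernel_version(release):
    """Turn a release string such as '2.6.23-rc3' into (2, 6, 23)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def known_fmr_bug_possible(release, is_mellanox, port_count):
    """True only if the already-fixed 1-port FMR bug could still explain the corruption."""
    old_kernel = parse_kernel_version(release) < FIXED_IN
    return bool(old_kernel and is_mellanox and port_count == 1)
```

If this returns False for a setup that still shows the corruption,
the known bug is ruled out and something else is sending the data to
the wrong place.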

 - R.