[ofa-general] [PATCH 2/3] RDMA/cxgb3: fix page shift calculation in build_phys_page_list()
Bryan S Rosenburg
rosnbrg at us.ibm.com
Sat Jan 26 10:05:15 PST 2008
Roland Dreier <rdreier at cisco.com> wrote on 01/26/2008 12:04:21 AM:
> > Incidentally, I've been tracking down exactly the bug that Steve fixed,
> > but in mthca_reg_phys_mr() rather than in the cxgb3
> > build_phys_page_list(). I'll submit a patch for mthca, unless someone
> > else applies Steve's fix there soon.
>
> Strange... is this causing problems in practice? I do agree that the
> mthca code, when passed a list like
>
> > 0: addr 0x10001000, size 0x1000
> > 1: addr 0x10002000, size 0x1000
>
> will end up with a page list
>
> > 0: 0x10000000
> > 1: 0x10002000
>
> and a page size of 0x2000, but it seems to me that should work fine --
> the memory region will end up starting at an offset of 0x1000 and
> having size 0x2000, which gives the region precisely as intended (just
> not in the most obvious way).
Roland, you're quite right that the non-obvious page list is not
necessarily a problem. It causes a failure only if the virtual address
that is eventually mapped to this region has an alignment with respect to
large-page boundaries that is different from the alignment of the physical
address. To be concrete, the page list works as you expect if and only if

    ((*iova_start) & ((1ULL << shift) - 1)) ==
        (buffer_list[0].addr & ((1ULL << shift) - 1))

With the example above, the existing algorithm works fine if the virtual
address is, say, 0x80001000, but it does not work if the va is 0x80000000.
In the latter case, incoming RDMA bytes destined for va 0x80000000 wind up
at physical address 0x10000000, not at 0x10001000 where
they belong. To answer your first question, yes, I've seen exactly this
behavior in practice. I don't have a hardware spec for mthca, but from
observation it seems to me that the offset from the base of the physical
region is being derived from the low-order bits of the virtual address.
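
To spell that out, here is a throwaway userspace sketch of the model I'm
inferring (the "offset comes from the low-order bits of the va" rule is my
observation, not anything from a spec, and this is not driver code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* 0x2000-byte pages, as in the example above */
            const int shift = 13;
            const uint64_t page_list[] = { 0x10000000ULL, 0x10002000ULL };
            const uint64_t iovas[] = { 0x80001000ULL, 0x80000000ULL };
            int i;

            for (i = 0; i < 2; i++) {
                    /* first byte of the region; it lives in page_list[0] */
                    uint64_t va = iovas[i];
                    /* assumed model: offset within the large page comes
                     * straight from the low-order bits of the va */
                    uint64_t offset = va & ((1ULL << shift) - 1);

                    printf("iova_start 0x%llx -> pa 0x%llx "
                           "(data belongs at 0x10001000)\n",
                           (unsigned long long)va,
                           (unsigned long long)(page_list[0] + offset));
            }
            return 0;
    }

With iova_start 0x80001000 the first byte lands at 0x10001000, as intended;
with iova_start 0x80000000 it lands at 0x10000000.
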
So Steve's fix is overly conservative, but it's straightforward and it has
the benefit that it removes the need for the complex special case for
single-element buffer lists. Before I saw the patch, I was toying with
ways to incorporate the va alignment into the shift calculation, but I'm
not sure the complexity is worth it. Also, I see that in cxgb3 the code
is factored in such a way that the va isn't even available during the
shift calculation. That's not a problem for mthca, but there are
advantages to keeping things consistent.
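
For what it's worth, the variant I was toying with looks roughly like this
(a sketch only, with made-up names and userspace types standing in for the
kernel's ib_phys_buf; it is not the actual mthca or cxgb3 code): fold the
iova's misalignment relative to the first buffer into the same mask that
the internal buffer boundaries already contribute to.

    #include <stdint.h>

    struct phys_buf {           /* stand-in for struct ib_phys_buf */
            uint64_t addr;
            uint64_t size;
    };

    /* Pick the largest page shift such that every internal buffer
     * boundary *and* the iova's offset relative to the first buffer
     * are aligned to the page size. */
    int phys_page_shift(const struct phys_buf *buffer_list,
                        int num_phys_buf, uint64_t iova_start)
    {
            uint64_t mask = 0;
            int i, shift;

            for (i = 0; i < num_phys_buf; ++i) {
                    if (i != 0)
                            mask |= buffer_list[i].addr;
                    if (i != num_phys_buf - 1)
                            mask |= buffer_list[i].addr + buffer_list[i].size;
            }

            /* the extra step: the va must be misaligned by exactly the
             * same amount as the first physical buffer, so let their
             * difference in low-order bits limit the page size too */
            mask |= iova_start ^ buffer_list[0].addr;

            /* real code would reject buffers that aren't page-aligned */
            if (mask & ((1ULL << 12) - 1))
                    return -1;

            /* largest power-of-two page size consistent with 'mask' */
            shift = 12;
            while (shift < 31 && !(mask & ((1ULL << (shift + 1)) - 1)))
                    ++shift;

            return shift;
    }

With the two-buffer example above this still picks 0x2000-byte pages when
the va is 0x80001000, but it falls back to 0x1000-byte pages when the va is
0x80000000, which is exactly the case that trips up the current code.
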
For your amusement, the first time I hit this problem, I happened to have
a buffer list in which the pages were adjacent but out of order:

    0: addr 0x10001000, size 0x1000
    1: addr 0x10000000, size 0x1000

resulting in the following list of 0x2000-byte pages:

    0: 0x10000000
    1: 0x10000000  !!!

To my surprise, even this bizarre page list works correctly as long as the
virtual address is suitably aligned.

- Bryan