[ofa-general] [PATCH 2/3] RDMA/cxgb3: fix page shift calculation in build_phys_page_list()

Bryan S Rosenburg rosnbrg at us.ibm.com
Sat Jan 26 10:05:15 PST 2008


Roland Dreier <rdreier at cisco.com> wrote on 01/26/2008 12:04:21 AM:
>  > Incidentally, I've been tracking down exactly the bug that Steve 
fixed, 
>  > but in mthca_reg_phys_mr() rather than in the cxgb3 
>  > build_phys_page_list().  I'll submit a patch for mthca, unless 
someone 
>  > else applies Steve's fix there soon.
> 
> Strange... is this causing problems in practice?  I do agree that the
> mthca code, when passed a list like
> 
> > 0: addr 0x10001000, size 0x1000
> > 1: addr 0x10002000, size 0x1000
> 
> will end up with a page list
> 
> > 0: 0x10000000
> > 1: 0x10002000
> 
> and a page size of 0x2000, but it seems to me that should work fine --
> the memory region will end up starting at an offset of 0x1000 and
> having size 0x2000, which gives the region precisely as intended (just
> not in the most obvious way).

Roland, you're quite right that the non-obvious page list is not 
necessarily a problem.  It causes a failure only if the virtual address 
that is eventually mapped to this region has an alignment with respect to 
large-page boundaries that is different from the alignment of the physical 
address.  To be concrete, the page list works as you expect if and only if

    ((*iova_start) & ((1ULL << shift) - 1)) ==
        (buffer_list[0].addr & ((1ULL << shift) - 1)).

With the example above, the existing algorithm works fine if, for example, 
the virtual address is 0x80001000, but it does not work if the va is 
0x80000000.  In the latter case, incoming rdma bytes destined for va 
0x80000000 wind up at physical address 0x10000000, not at 0x10001000 where 
they belong.  To answer your first question, yes, I've seen exactly this 
behavior in practice.  I don't have a hardware spec for mthca, but from 
observation it seems to me that the offset from the base of the physical 
region is being derived from the low-order bits of the virtual address.

So Steve's fix is overly conservative, but it's straightforward and it has 
the benefit that it removes the need for the complex special case for 
single-element buffer lists.  Before I saw the patch, I was toying with 
ways to incorporate the va alignment into the shift calculation, but I'm 
not sure the complexity is worth it.  Also, I see that in cxgb3 the code 
is factored in such a way that the va isn't even available during the 
shift calculation.  That's not a problem for mthca, but there are 
advantages to keeping things consistent.

For your amusement, the first time I hit this problem, I happened to have 
a buffer list in which the pages were adjacent but out of order:

0: addr 0x10001000, size 0x1000
1: addr 0x10000000, size 0x1000

resulting in the following list of 0x2000-byte pages:

0: 0x10000000
1: 0x10000000  !!!

To my surprise, even this bizarre page list works properly if the virtual 
address is aligned properly.

- Bryan


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20080126/ea6df85e/attachment.html>


More information about the general mailing list