[ofa-general] Re: movnt
Roland Dreier
rdreier at cisco.com
Wed May 16 13:56:46 PDT 2007
> So we can map the device memory with WB or WT semantics, and movnt will enable
> WC. And the nice thing about this trick, is that both WB and WT *are already
> programmed into PAT after reset*, which means that we can use them for pages we
> map for userspace, without stepping on anyone's toes or waiting for
> the generic in-kernel support for WC to materialize.
I'm not sure whether this is much of an advantage. There's no generic
way to map memory with WB that I know of. I don't think that setting
a PAT entry for WC is the hold-up -- the problem is more in getting the
right infrastructure in place for pgprot_xxx(). I don't think it's
very nice to have #ifdef __x86_64__ in a driver.
> I attach a header file that implements WC memcpy with these
> instructions for lengths from 16 to 128 bytes (and one can,
> naturally, just call xmm_copy64 in a loop), that I wrote for fun
> at some point. Feel free to read/flame/reuse in any way you like.
Using movntdq means we have to save off the xmm registers, and it's a
hassle to get a properly aligned block to be able to use movdqa to save
them (you can't rely on the stack being 16-byte aligned). I'd be curious
to see whether it's even worth it for a 64-byte copy (which is
probably the most common case for BF), since you need 8 extra movdqa
to save/restore the xmms on top of 4 movdqa to load the WQE and 4
movntdq to write it. Just plain movnti might be the simplest thing to
do, since 16 movnti is all you would need, and I think that comes out
to be smaller code than 12 movdqa + 4 movntdq.
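
For illustration only, here is a rough sketch of the 16-movnti variant
of a 64-byte copy, written with the SSE2 intrinsic _mm_stream_si32
(which compiles to a 32-bit movnti) rather than hand-written assembly.
The function name wqe_copy64_movnti is made up for this example, the
trailing sfence is an assumption about what ordering the caller wants,
and the memcpy fallback is only there so the code builds on non-SSE2
targets. It is not tested against real BF hardware.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32; also pulls in _mm_sfence */
#endif

/*
 * Hypothetical helper: copy one 64-byte WQE from src to dst using
 * 16 non-temporal 32-bit stores (movnti). No xmm registers are
 * touched, so nothing has to be saved or restored.
 */
static void wqe_copy64_movnti(void *dst, const void *src)
{
#if defined(__SSE2__)
    int *d = dst;
    int i;

    for (i = 0; i < 16; i++) {
        int v;

        /* Load through memcpy to stay clear of aliasing problems. */
        memcpy(&v, (const char *)src + 4 * i, sizeof(v));
        _mm_stream_si32(d + i, v);      /* movnti */
    }
    /* Assumed: caller wants the stores ordered before what follows. */
    _mm_sfence();
#else
    /* Fallback for targets without SSE2: an ordinary copy. */
    memcpy(dst, src, 64);
#endif
}
```

A compiler will normally unroll the loop into 16 straight movnti
instructions at -O2, which is the code-size case being compared
against 12 movdqa + 4 movntdq above.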
(Optimizing the WQE copy in assembly might be worth it independent of
how we map the BF page for WC, since obviously posting BF sends is a
super-hot path. And it's fun to write SSE code anyway)
- R.