[ofa-general] Re: movnt

Roland Dreier rdreier at cisco.com
Wed May 16 13:56:46 PDT 2007


 > So we can map the device memory with WB or WT semantics, and movnt will enable
 > WC. And the nice thing about this trick is that both WB and WT *are already
 > programmed into PAT after reset*, which means that we can use them for pages we
 > map for userspace, without stepping on anyone's toes or waiting for
 > the generic in-kernel support for WC to materialize.

I'm not sure whether this is much of an advantage.  There's no generic
way to map memory with WB that I know of.  I don't think that setting
a PAT entry for WC is the hold-up; the problem is more the lack of the right
infrastructure for pgprot_xxx().  I don't think it's very nice to have
#ifdef __x86_64__ in a driver.

 > I attach a header file that implements WC memcpy with these
 > instructions for lengths from 16 to 128 bytes (and one can,
 > naturally, just call xmm_copy64 in a loop), that I wrote for fun
 > at some point. Feel free to read/flame/reuse in any way you like.
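The attached header is not part of this text, so what follows is only a minimal sketch of the same idea. It assumes SSE2 and a GCC/Clang-style `<emmintrin.h>`. `wc_copy64_sse2` and `wc_copy_sse2` are hypothetical names, not the `xmm_copy64` from the attachment. A 64-byte copy is done as 4 aligned 16-byte loads (movdqa) and 4 non-temporal 16-byte stores (movntdq). That step is called in a loop for longer lengths, and the whole copy ends with sfence. The sketch is written as userspace code, where the x86-64 ABI treats xmm registers as caller-saved, so the compiler manages them. In kernel code, the register save/restore discussed in the reply would still be needed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy one 64-byte block. Both pointers must be
 * 16-byte aligned. */
static inline void wc_copy64_sse2(void *dst, const void *src)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* 4 aligned loads (typically compiled to movdqa) */
    __m128i x0 = _mm_load_si128(s + 0);
    __m128i x1 = _mm_load_si128(s + 1);
    __m128i x2 = _mm_load_si128(s + 2);
    __m128i x3 = _mm_load_si128(s + 3);

    /* 4 non-temporal stores (movntdq): these go through the
     * write-combining buffers instead of being cached */
    _mm_stream_si128(d + 0, x0);
    _mm_stream_si128(d + 1, x1);
    _mm_stream_si128(d + 2, x2);
    _mm_stream_si128(d + 3, x3);
}

/* Hypothetical helper: copy len bytes (a multiple of 64) by calling the
 * 64-byte step in a loop, then sfence so the streaming stores become
 * globally visible before any later store (e.g. a doorbell write). */
static void wc_copy_sse2(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(len % 64 == 0);
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);

    for (size_t off = 0; off < len; off += 64)
        wc_copy64_sse2(d + off, s + off);
    _mm_sfence();
}
```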

Using movntdq means we have to save off the xmms, and it's a hassle to
get a properly aligned block to be able to use movdqa to save them
(you can't rely on the stack being 16-byte aligned).  I'd be curious
to see whether it's even worth it for a 64-byte copy (which is
probably the most common case for BF, i.e. BlueFlame), since you need 8 extra movdqa
to save/restore the xmms on top of 4 movdqa to load the WQE and 4
movntdq to write it.  Plain movnti might be the simplest thing to
do, since all you would need is 16 movnti with 32-bit registers (8 with
64-bit registers on x86-64) plus the ordinary loads that feed them, and
I think that comes out to be smaller code than 12 movdqa + 4 movntdq.
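For comparison, here is a sketch of the plain-movnti alternative for the 64-byte case. It uses the SSE2 intrinsic `_mm_stream_si32`, which compiles to movnti. The function name `wqe_copy64_movnti` is hypothetical.

Counting as in the paragraph above, the movntdq route needs:

- 8 movdqa to save and restore 4 xmm registers
- 4 movdqa to load the WQE
- 4 movntdq to store it

That is 16 SSE instructions in total. This route instead uses 16 movnti plus the ordinary loads that feed them. Because it touches no xmm registers, no 16-byte-aligned save area is needed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (movnti), _mm_sfence */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one 64-byte WQE with 16 movnti stores of
 * 32 bits each, using only general-purpose registers. dst is assumed
 * 4-byte aligned (natural alignment for 32-bit stores); src may have
 * any alignment because it is read with memcpy. */
static inline void wqe_copy64_movnti(void *dst, const void *src)
{
    int *d = (int *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(((uintptr_t)dst & 3) == 0);
    for (int i = 0; i < 16; i++) {
        int v;
        memcpy(&v, s + 4 * i, sizeof v);  /* ordinary 32-bit load into a GPR */
        _mm_stream_si32(d + i, v);        /* movnti m32, r32 */
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
}
```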

(Optimizing the WQE copy in assembly might be worth it independent of
how we map the BF page for WC, since obviously posting BF sends is a
super-hot path.  And it's fun to write SSE code anyway.)
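The paragraph above suggests hand-optimizing the WQE copy, so here is a sketch of the same copy written with GCC/Clang extended inline assembly. It assumes x86-64, where movnti can store a 64-bit register, so 8 stores cover a 64-byte WQE instead of 16. `wqe_copy64_asm` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if !defined(__x86_64__)
#error "this sketch assumes x86-64 (64-bit movnti)"
#endif

/* Hypothetical helper: copy a 64-byte WQE with 8 movnti stores from
 * 64-bit registers, then sfence. dst is assumed 8-byte aligned. */
static inline void wqe_copy64_asm(void *dst, const void *src)
{
    uint64_t *d = (uint64_t *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(((uintptr_t)dst & 7) == 0);
    for (int i = 0; i < 8; i++) {
        uint64_t v;
        memcpy(&v, s + 8 * i, sizeof v);  /* load 8 bytes into a GPR */
        /* AT&T syntax: movnti %reg64, mem64 */
        __asm__ volatile("movnti %1, %0" : "=m"(d[i]) : "r"(v));
    }
    __asm__ volatile("sfence" ::: "memory");
}
```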

 - R.


