[ofa-general] Re: When is the next planned release of libmlx4?

Fri Jun 12 23:00:42 PDT 2009

On Fri, Jun 12, 2009 at 09:24:42PM -0700, Roland Dreier wrote:
> 
>  > I'm not sure what the chip's expectation is for the actual bus
>  > transfers in this area, but I think you are right to be concerned
>  > about atomicity, even when transfering based on longs.
> 
> The chip docs seem to suggest that we're OK as long as we do 4-byte
> writes aligned to 4 bytes.

OK, that would certainly explain why there were problems with non-long
writes when using valgrind: write combining hides the problem until
you take too long.
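
A minimal sketch of the "4-byte writes aligned to 4 bytes" rule mentioned
above, assuming a copy loop built from 32-bit stores. The helper name
copy_dwords and its parameters are hypothetical and are not taken from
libmlx4; in a real driver the destination would be the mapped device page
rather than ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from libmlx4): copy `bytes` bytes using
 * only 4-byte stores to a 4-byte-aligned destination. */
void copy_dwords(volatile uint32_t *dst, const uint32_t *src, size_t bytes)
{
    assert(((uintptr_t)dst & 3) == 0);  /* destination is 4-byte aligned */
    assert((bytes & 3) == 0);           /* whole 32-bit words only */

    /* The volatile destination keeps the compiler from replacing the
     * loop with a memcpy call or with narrower stores. */
    for (size_t i = 0; i < bytes / 4; i++)
        dst[i] = src[i];
}
```

This only constrains what the compiler emits; how the CPU's write-combining
buffer later flushes those stores to the bus is a separate question.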

>  > It is worth looking at using SSE instructions to burst transfer the
>  > entire message in one atomic go.
> 
> I'm not aware of any SSE instructions that work on chunks bigger than 16
> bytes at a time.

Right, I didn't notice it was larger :) Using four 16-byte stores and
an SFENCE would narrow the window considerably, though alignment of the
source WQE becomes important to get good speed on the load into the
xmm registers.
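
As a rough sketch of the four-stores-plus-SFENCE idea, assuming a 64-byte
message and SSE2 intrinsics: the name copy_wqe64 and the WQE_BYTES constant
are hypothetical, the destination here is ordinary memory standing in for
the write-combining page, and the memcpy fallback exists only so the
example builds on non-x86 machines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define WQE_BYTES 64  /* assumed size: four 16-byte chunks */

/* Hypothetical helper (not from libmlx4): copy one 64-byte message
 * to a 16-byte-aligned destination. */
void copy_wqe64(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 15) == 0);  /* aligned stores need this */
#if defined(__SSE2__)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Load all four chunks first so the stores are issued back to
     * back. Unaligned loads tolerate any source alignment; an
     * aligned source may load faster. */
    __m128i r0 = _mm_loadu_si128(s + 0);
    __m128i r1 = _mm_loadu_si128(s + 1);
    __m128i r2 = _mm_loadu_si128(s + 2);
    __m128i r3 = _mm_loadu_si128(s + 3);

    _mm_store_si128(d + 0, r0);
    _mm_store_si128(d + 1, r1);
    _mm_store_si128(d + 2, r2);
    _mm_store_si128(d + 3, r3);

    _mm_sfence();  /* order these stores before any later stores */
#else
    memcpy(dst, src, WQE_BYTES);  /* portable fallback, no SSE2 */
#endif
}
```

This narrows the window between the first and last store but does not by
itself make the 64-byte transfer a single atomic bus transaction.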

> In fact the latest mlx4 kernel driver maps the blueflame page to
> userspace with write-combining enabled, and this improves performance

Yes, I bet. WC should get you 64-byte write transactions at the PCI-E
level, which surely makes everything better.

> quite a bit.  The HCA doesn't care what order that the CPU drains the WC
> buffer in (according to docs at least)

I would see the risk not so much as ordering within a single process, but
what happens when there are a lot of processes/cores doing the
write. The chip must have a limit on the number of parallel writes it
can reassemble; I guess the only question is whether the mlx4 limit is
less than the number of pages it provides address space for. If so,
then it is worth minimizing the CPU instructions to do this transfer.

Jason