[ofa-general] Re: When is the next planned release of libmlx4?

Fri Jun 12 12:56:32 PDT 2009

On Fri, Jun 12, 2009 at 09:52:15AM -0700, John Gyllenhaal wrote:

> Valgrind replaces the libc memcpy call with a simple version that
> copies a byte at a time (in order).  If libmlx4 is not built with
> --with-valgrind, valgrind considers each write an invalid write and
> spends a very long time after each write updating its error
> database.  We experimented with replacing the Valgrind error
> database update with a configurable spin loop and found that if we
> put a delay of around 100,000 cycles between writes in the 'byte
> memcpy' when writing to the blueflame page, that a sent message gets
> lost/misplaced in a simple testcase with two MPI_barriers back to
> back (resulting in a hang because not all processes exit the first
> barrier).  Our theory is the card sees 'byte' writes to the
> blueflame page and due to the long delay, uses the information
> before it is all written out (and thus getting wrong info).

There are lots of ways adding a timing delay here can cause problems.
x86 CPUs have write combining buffers that can be enabled and will
aggregate byte writes into larger transfers; these buffers flush based
on a timer in some cases. Some devices that do this also have internal
aggregation buffers that will flush in certain cases, often on
non-sequential writes or, again, on timers.

I'm not sure what the chip's expectation is for the actual bus
transfers in this area, but I think you are right to be concerned
about atomicity, even when transferring based on longs.
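
To make the two copy styles concrete, here is a rough sketch in C. The
names copy_bytes and copy_longs are made up for illustration; this is
not code from libmlx4 or Valgrind, and the destination here is just a
volatile pointer standing in for a device page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n bytes one byte at a time, in order,
 * which is roughly how the quoted message describes Valgrind's
 * replacement memcpy. Each byte is a separate store. */
static void copy_bytes(volatile uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: copies nbytes one unsigned long at a time.
 * This issues fewer, wider stores, but it is still a sequence of
 * separate stores, not one indivisible transfer. */
static void copy_longs(volatile unsigned long *dst,
                       const unsigned long *src, size_t nbytes)
{
    /* Assumption of this sketch: the length is a whole number of longs. */
    assert(nbytes % sizeof(unsigned long) == 0);

    size_t n_longs = nbytes / sizeof(unsigned long);
    for (size_t i = 0; i < n_longs; i++)
        dst[i] = src[i];
}
```

Neither loop controls how the stores are grouped once they leave the
CPU, which is why the long-based copy does not remove the atomicity
concern by itself.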

For instance, you do not want to rely upon write combining to create a
single PCI-E transaction out of the message to ensure atomicity in a
multi-process environment. Write combining will not produce a single
transaction every time.

It is worth looking at using SSE instructions to burst transfer the
entire message in one atomic go.
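
A minimal sketch of what that could look like with SSE2 intrinsics,
assuming a 64-byte message and a 16-byte-aligned destination. The name
copy64_sse2 and the 64-byte size are assumptions for illustration, not
taken from libmlx4. Note that this only cuts the copy down to four
16-byte stores; whether the device then sees a single transaction still
depends on the platform, so treat it as a starting point for
experiments rather than a guarantee.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stdint.h>

/* Hypothetical helper: copies one 64-byte block using four 16-byte
 * SSE2 loads followed by four 16-byte SSE2 stores. */
static void copy64_sse2(void *dst, const void *src)
{
    /* _mm_store_si128 requires a 16-byte-aligned destination. */
    assert(((uintptr_t)dst & 15) == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Load the whole block first (unaligned loads are allowed) so the
     * stores below are issued back to back. */
    __m128i r0 = _mm_loadu_si128(s + 0);
    __m128i r1 = _mm_loadu_si128(s + 1);
    __m128i r2 = _mm_loadu_si128(s + 2);
    __m128i r3 = _mm_loadu_si128(s + 3);

    _mm_store_si128(d + 0, r0);
    _mm_store_si128(d + 1, r1);
    _mm_store_si128(d + 2, r2);
    _mm_store_si128(d + 3, r3);

    /* Store fence: orders the stores above before any later stores. */
    _mm_sfence();
}
```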

Jason