[ofa-general] Re: When is the next planned release of libmlx4?

Fri Jun 12 09:52:15 PDT 2009

Howdy Roland;

Here are more details on why avoiding memcpy appears to be a good idea under Valgrind.

Valgrind replaces the libc memcpy call with a simple version that copies a byte at a time (in order).  If libmlx4 is not built with --with-valgrind, Valgrind considers each write an invalid write and spends a very long time after each write updating its error database.  We experimented with replacing the Valgrind error database update with a configurable spin loop and found that if we put a delay of around 100,000 cycles between writes in the 'byte memcpy' when writing to the blueflame page, a sent message gets lost or misplaced in a simple testcase with two back-to-back MPI_Barrier calls (resulting in a hang, because not all processes exit the first barrier).  Our theory is that the card sees the 'byte' writes to the blueflame page and, because of the long delay, uses the information before it has all been written out (and thus gets the wrong info).
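
For illustration only, here is a minimal sketch of the kind of delayed byte-at-a-time copy described above. It is not Valgrind's or libmlx4's code; the names spin_delay and delayed_byte_copy are hypothetical, and the delay is counted in loop iterations, which only roughly approximates a cycle count.

```c
#include <stddef.h>

/* Hypothetical busy-wait: burns roughly `iterations` loop iterations.
 * The volatile counter keeps the compiler from deleting the loop. */
static void spin_delay(unsigned long iterations)
{
    for (volatile unsigned long i = 0; i < iterations; i++)
        ;
}

/* Hypothetical stand-in for the 'byte memcpy': one 1-byte store at a
 * time, in order, with a configurable pause after each store.  The
 * volatile destination asks the compiler to emit each store as written
 * instead of merging the stores or calling memcpy. */
static void delayed_byte_copy(volatile unsigned char *dst,
                              const unsigned char *src,
                              size_t n, unsigned long delay_iterations)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
        spin_delay(delay_iterations);
    }
}
```

Against ordinary memory this only slows the copy down.  The point of the experiment was that the destination was the blueflame page, where the timing of the individual stores appeared to matter.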

With the patched version, longs are written to the blueflame page and it now happens to work under Valgrind.  Of course, it may be luck.  It could be that writing longs is 4-8 times more efficient, so the delay is no longer big enough to matter.  It could be that it simply fixes our testcase, in that the card is still reading early but happens to get the correct data in this case.  Or it could be that writing longs actually fixes things and that writing bytes is a bad idea (since this is user code, a context switch could come at any time and have the same effect).  In any case, it fixes our testcases, and having control over how data is written to the blueflame page seems like a good idea.
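
As a rough sketch of what "writing longs" means here (this is not the actual libmlx4 patch, and the name copy_longs is hypothetical), a copy loop that issues one long-sized store per iteration instead of calling memcpy might look like this.  It assumes both buffers are long-aligned and the byte count is a multiple of sizeof(long).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical word-at-a-time copy: one long-sized store per iteration.
 * The volatile destination asks the compiler to emit each store as
 * written, rather than splitting it up or turning the loop back into a
 * memcpy call.  Assumes long-aligned buffers and a byte count that is a
 * multiple of sizeof(unsigned long). */
static void copy_longs(volatile unsigned long *dst,
                       const unsigned long *src, size_t bytecnt)
{
    assert(bytecnt % sizeof(unsigned long) == 0);

    for (size_t i = 0; i < bytecnt / sizeof(unsigned long); i++)
        dst[i] = src[i];
}
```

With 4- or 8-byte longs this issues 4 to 8 times fewer stores than a byte copy of the same buffer, which is where the "4-8 times" figure above comes from.  Whether the card treats wider stores any differently is exactly the open question.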

When users use our Valgrind wrapper scripts (they don't always), we LD_PRELOAD a patched version of this library compiled with --with-valgrind, which prevents the delay in the first place (and runs much faster under Valgrind as a result).

I hope this clarifies things a little.
-John G.

P.S. If a context switch happens during a write to the blueflame page or to other memory-mapped NIC addresses, could bad things happen?  This is why I continued the detailed hunt after figuring out that compiling with --with-valgrind resolved our problems: similar delays could happen during context switches.

At 03:51 PM 6/11/2009, Roland Dreier wrote:

> > Our MPI folks detected a hang while using Valgrind with our ConnectX cards.
> > After trying the current master branch in git we solved the problem by applying
> > this patch from the git tree to v1.0.
> > 
> >     Don't use memcpy() to write blueflame sends
>
>Didn't realize this had that implication (I thought it just made
>blueflame not give latency benefit).  Anyway yes it has been a while
>since a libmlx4 release.  I'll make one soon, probably next week.

---------------------------------------------------------------------
John C. Gyllenhaal                             Bldg:  453   Rm: 4151
Computation Department                 Email: gyllen at llnl.gov
Lawrence Livermore National Lab  Voice: (925) 424-5485
7000 East Ave, L-557                         Fax:   (925) 423-6961
Livermore, CA. 94551-0808              URL: http://www.llnl.gov/icc/lc/DEG
---------------------------------------------------------------------