[openib-general] MPI error when using a "system" call in mpi job.

Rimmer, Todd trimmer at silverstorm.com
Wed Jun 14 06:24:10 PDT 2006



> -----Original Message-----
> From: Ira Weiny
> Sent: Tuesday, June 13, 2006 8:12 PM
> A co-worker here was seeing the following MPI error from his job:
> 
> [1] Abort: [ldev2:1] Got completion with error, code=1
>  at line 2148 in file viacheck.c
> 
> After some tracking down he found that apparently if he used a "system"
> call [int system(const char *string)] the next MPI command will fail.
> 
> I have been able to reproduce this with the attached simple "hello"
> program.

I saw this type of problem a couple of years ago with our proprietary
stack, and it took a bit of work to correct.  Here is what it could be:

This sounds like a conflict between fork() and the VMA handling in
Open IB for registered memory.  system() is a fork(), exec(), wait()
sequence.  fork() generally shares the VMAs and marks the pages as copy
on write.
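
To make that sequence concrete, here is a rough sketch of what system()
amounts to.  It is an illustration only, not the libc source: the name
run_shell_command is made up, and a real system() also adjusts signal
handling (SIGCHLD, SIGINT, SIGQUIT) around the fork.  Unlike system(),
this helper returns the decoded exit code rather than the raw wait
status.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not a libc function): approximates system(cmd)
 * as fork() + exec() + wait().  Returns the child's exit code, or -1
 * on failure or abnormal termination. */
int run_shell_command(const char *cmd)
{
    pid_t pid = fork();          /* child gets copy-on-write view of our pages */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image with the shell. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);              /* only reached if exec failed */
    }

    /* Parent: wait for the child, retrying if a signal interrupts us. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point for this thread is the fork() step: between fork() and exec()
the child shares the parent's pages copy on write, including any pages
the parent had registered.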

In your case it sounds like one of the pages written by the child
process includes memory previously registered by the main process, and
the child ended up with the original page.  The result is that the
virtual address in the main process now points to a different physical
page than the one that was registered.
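
The copy-on-write behavior itself can be seen from user space with a
small sketch like the one below.  It only shows the virtual-memory
view: after fork() a write in the child does not show up in the parent.
It does not show which process keeps the original physical page, which
is the part that matters for registered memory.  The function name is
made up for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a write made by the child after
 * fork() stays private to the child, 0 if not, -1 on error. */
int child_write_stays_private(void)
{
    char *buf = malloc(4096);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', 4096);      /* parent's data, shared COW after fork */

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        buf[0] = 'C';            /* write triggers a copy-on-write fault */
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }

    int child_ok  = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_ok = (buf[0] == 'P');   /* parent must still see its own data */
    free(buf);
    return child_ok && parent_ok;
}
```

Both processes see consistent data at their virtual addresses, so
nothing looks wrong to the CPU.  The trouble described above only
appears because the registration was made against a specific physical
page.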

It sounds like you happened on a "magic sequence" which demonstrates the
problem.  Do you have information on the OS version, CPU type, and
server config?

Todd Rimmer



