[openib-general] segfault on openib mvapich

Sacerdoti, Federico Federico.Sacerdoti at deshaw.com
Tue Sep 27 15:53:50 PDT 2005


I had such high hopes for using openib gen2 when I got ibv_uc_pingpong
to pass packets on our infiniband cluster. However, I cannot get mvapich
to work, even with Pete Wyckoff's patches. A simple program run on two
hosts always segfaults.

I might have done something wrong, but I tried to build from plain
source in the openib gen2 svn tree with Pete's patches applied (those
that were not rejected).
 
Adding the -debug flag to mpirun_rsh does not help (the xterms flash on
and then disappear). The ssh connections start fine, but the segfault
happens early on. I was hoping Pete's patch:

+++ mvapich-0.9.5-112/mpid/vapi/process/mpirun_rsh.c	2005-05-26 17:35:58.000000000 -0400
@@ -744,7 +744,8 @@
     int id = getpid();
     int str_len;
 
-    str_len = strlen(command_name) + strlen(env) + strlen(wd) + 512;
+    str_len = strlen(command_name) + strlen(env) + strlen(wd) +
+        strlen(mpirun_processes) + 512;

would solve the segfault, but it still persists.
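
For reference, the kind of fix that patch makes can be sketched as below.
This is only an illustration, not the mvapich code: build_command and its
format string are hypothetical names made up for the example. The idea is
to count every variable-length piece when sizing the buffer, and to let
snprintf enforce the bound in case the estimate is still wrong.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the actual mvapich code: builds a remote
 * command line from its parts. Returns a malloc'd string, or NULL on
 * failure. The caller frees the result. */
char *build_command(const char *wd, const char *env,
                    const char *processes, const char *command_name)
{
    static const char fmt[] =
        "cd %s; /usr/bin/env %s MPIRUN_PROCESSES='%s' %s";

    /* Count every variable-length piece, the fixed text, and the NUL.
     * strlen(fmt) slightly overestimates because it includes the "%s"
     * markers, which is harmless. */
    size_t len = strlen(fmt) + strlen(wd) + strlen(env)
               + strlen(processes) + strlen(command_name) + 1;

    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    /* snprintf never writes more than len bytes; a return value >= len
     * would mean the output was truncated. */
    int n = snprintf(buf, len, fmt, wd, env, processes, command_name);
    if (n < 0 || (size_t)n >= len) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Compared with a fixed "+ 512" of slack, this keeps the buffer size tied to
the actual inputs, so a long host list cannot overflow it.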

-Federico


Output from my mpirun_rsh -show command:


command: /usr/bin/ssh drda1054 cd /u/fds/run/gen2/simple; /usr/bin/env
MPIRUN_MPD=0 MPIRUN_HOST=drda1054.nyc.deshaw.com MPIRUN_PORT=32884
MPIRUN_PROCESSES='drda1054:drda1055:' MPIRUN_RANK=0 MPIRUN_NPROCS=2
MPIRUN_ID=2425 DISPLAY=desrad2.nyc.deshaw.com:8.0
/u/fds/run/gen2/simple/mp
command: /usr/bin/ssh drda1055 cd /u/fds/run/gen2/simple; /usr/bin/env
MPIRUN_MPD=0 MPIRUN_HOST=drda1054.nyc.deshaw.com MPIRUN_PORT=32884
MPIRUN_PROCESSES='drda1054:drda1055:' MPIRUN_RANK=1 MPIRUN_NPROCS=2
MPIRUN_ID=2425 DISPLAY=desrad2.nyc.deshaw.com:8.0
/u/fds/run/gen2/simple/mp

bash: line 1: 31553 Segmentation fault      /usr/bin/env MPIRUN_MPD=0
MPIRUN_HOST=drda1054.nyc.deshaw.com MPIRUN_PORT=32885
MPIRUN_PROCESSES='drda1054:drda1055:' MPIRUN_RANK=1 MPIRUN_NPROCS=2
MPIRUN_ID=2428 DISPLAY=desrad2.nyc.deshaw.com:8.0
/u/fds/run/gen2/simple/mp
bash: line 1:  2565 Segmentation fault      /usr/bin/env MPIRUN_MPD=0
MPIRUN_HOST=drda1054.nyc.deshaw.com MPIRUN_PORT=32885
MPIRUN_PROCESSES='drda1054:drda1055:' MPIRUN_RANK=0 MPIRUN_NPROCS=2
MPIRUN_ID=2428 DISPLAY=desrad2.nyc.deshaw.com:8.0
/u/fds/run/gen2/simple/mp

From dmesg (tried two programs, mpi-ring and a hello-world mp):
mpi-ring[30116]: segfault at 0000000000000000 rip 00000036a2b711c0 rsp 00007fffffdad598 error 6
mpi-ring[30386]: segfault at 0000000000000000 rip 00000036a2b711c0 rsp 00007fffffcb2868 error 6
mp[31283]: segfault at 0000000000000000 rip 00000036a2b711c0 rsp 00007fffffc4a838 error 6
mp[31553]: segfault at 0000000000000000 rip 00000036a2b711c0 rsp 00007fffff869738 error 6
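
The "error 6" field in these lines is the x86 page-fault error code. A
small decoder, sketched below, makes it easier to read; describe_fault is
a hypothetical helper name, and only the three low bits are handled. With
a fault address of 0, code 6 reads as a user-mode write to a page that is
not present, which would be consistent with a write through a NULL
pointer, though the log alone does not say where that pointer came from.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code, as printed in the kernel's
 * "segfault at ... error N" line. */
#define PF_PROT  0x1  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2  /* 0: read access,      1: write access */
#define PF_USER  0x4  /* 0: kernel mode,      1: user mode */

/* Hypothetical helper: writes a short description of `code` into `out`
 * (at most `n` bytes) and returns snprintf's result. */
int describe_fault(unsigned long code, char *out, size_t n)
{
    return snprintf(out, n, "%s %s on a %s page",
                    (code & PF_USER)  ? "user-mode" : "kernel-mode",
                    (code & PF_WRITE) ? "write" : "read",
                    (code & PF_PROT)  ? "protected" : "not-present");
}
```

Calling describe_fault(6, buf, sizeof buf) fills buf with a description
of the code reported above.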


