[openib-general] segfault on openib mvapich

Sacerdoti, Federico Federico.Sacerdoti at deshaw.com
Wed Sep 28 06:40:10 PDT 2005


Thank you for your replies. It is helpful to know that you see no
problems. I will continue playing with my config. 

For what it's worth, the error happens in
process/pmgr_client_mpirun_rsh.c. Here is a traceback from gdb:

# Command:
# mpirun_rsh -ssh -debug -np 2 -hostfile ../../machines.txt
#   /u/fds/run/gen2/simple/mp

This GDB was configured as "x86_64-redhat-linux-gnu"...Using host
libthread_db library "/lib64/tls/libthread_db.so.1".

(gdb) run
Starting program: /u/fds/run/gen2/simple/mp 

Program received signal SIGSEGV, Segmentation fault.
0x0000003347d711c0 in bzero () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003347d711c0 in bzero () from /lib64/tls/libc.so.6
#1  0x0000000000419d6c in pmgr_client_init ()
#2  0x000000000041ff36 in MPID_VIA_Init ()
#3  0x0000000000415962 in MPID_Init ()
#4  0x0000000000402059 in MPIR_Init ()
#5  0x0000000000401ea4 in main (argc=1, argv=0x7fffff819e38) at mp.c:8
(gdb) 
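
The traceback alone does not show why bzero() faults inside
pmgr_client_init(). A fault at that point usually means the call received
a NULL or otherwise invalid pointer, or a length computed from a value
that was never validated. The following is a minimal sketch of a guarded
allocate-and-zero step, under that assumption. alloc_zeroed_table() and
its nprocs parameter are hypothetical names used for illustration only;
they are not taken from the MVAPICH source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>   /* bzero() */

/*
 * Hypothetical helper (not MVAPICH code): allocate a table with one int
 * per process and zero it. Returns NULL rather than handing bzero() a
 * bad pointer or a nonsensical length.
 */
int *alloc_zeroed_table(long nprocs)
{
    int *table;
    size_t bytes;

    /* Reject a count that was never set or was parsed incorrectly. */
    if (nprocs <= 0) {
        fprintf(stderr, "invalid process count: %ld\n", nprocs);
        return NULL;
    }

    bytes = (size_t)nprocs * sizeof *table;
    table = malloc(bytes);
    if (table == NULL) {
        perror("malloc");
        return NULL;
    }

    /* Safe here: table is non-NULL and bytes matches the allocation. */
    bzero(table, bytes);
    return table;
}
```

A caller would check the return value for NULL and abort startup with a
clear message instead of continuing into a segfault.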

I will try to turn on -g on mpirun_rsh to get better debugging info.
-Federico

-----Original Message-----
From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu] 
Sent: Tuesday, September 27, 2005 7:19 PM
To: Roland Dreier
Cc: Sacerdoti, Federico; openib-general at openib.org
Subject: Re: [openib-general] segfault on openib mvapich


Federico, 

>     Federico> I might have done something wrong, but tried to build
>     Federico> using a plain source from the openib gen2 svn tree and
>     Federico> Pete's patches (those that were not rejected).
>  
> For whatever it's worth, basic MVAPICH tests like osu_bw work fine for
> me with two and even four processes on two x86_64 machines.

FYI, we are also running the latest version successfully on multiple
platforms (IA32, Opteron and EM64T) of different sizes.  We are also
able to run applications successfully.

To the best of our knowledge, many other organizations are also
running mvapich-gen2 successfully on their platforms.

>     Federico> Adding the -debug flag to mpirun_rsh does not help (the
>     Federico> xterms flash on then disappear). The ssh connections are
>     Federico> started fine, but the segfault happens early on.
> 
> Without more data like a traceback from a core file or something like
> that, it's going to be very difficult for anyone to debug this.

As Roland indicates, could you please provide more details on the
platform, the OpenIB version (kernel, userlib), and the errors you are
getting? This will help to debug the problem further and faster.

> Also, it might be worth contacting the MVAPICH developers by emailing
> mvapich_request -- they are much more likely to be able to help than
> the openib-general community.

We at OSU are monitoring the OpenIB list for mvapich-gen2 related
questions and are answering them. In addition, if you can send a copy
to mvapich-help at cse.ohio-state.edu (not mvapich_request), we will be
able to respond even faster.

Thanks, 

DK

> - R.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general



