              Open Fabrics Enterprise Distribution (OFED)
                        MPI in OFED 1.2 README

                               May 2007

===============================================================================
Table of Contents
===============================================================================
1. General
2. OSU MVAPICH MPI
3. MVAPICH2
4. Open MPI

===============================================================================
1. General
===============================================================================
Three MPI stacks are included in this release of OFED:
- Ohio State University (OSU) MVAPICH 0.9.7 (modified by Mellanox Technologies)
- MVAPICH2 0.9.8p3
- Open MPI 1.2.1-1

Setup, compilation and run information for OSU MVAPICH, MVAPICH2 and Open MPI
is provided below in sections 2, 3 and 4 respectively.

1.1 Installation Note
---------------------
In Step 2 of the main menu of install.sh, options 2, 3 and 4 can install one
or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt to learn
about the different options.

The installation script allows each MPI stack to be compiled using one or more
compilers. For each installed MPI stack, users need to set PATH and/or
LD_LIBRARY_PATH to point to the desired compiled instance of that stack.

1.2 MPI Tests
-------------
OFED includes four basic tests that can be run against each MPI stack:
bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests are
located under: <prefix>/mpi/<compiler>/<mpi stack>/tests/, where <prefix> is
/usr/local/ofed by default.

===============================================================================
2. OSU MVAPICH MPI
===============================================================================
This package is a modified version of the Ohio State University (OSU) MVAPICH
Rev 0.9.7 MPI software package, and is the officially supported MPI stack for
this release of OFED. Modifications to the original version include:
additional features, bug fixes, and RPM packaging.
See http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ for more details.

2.1 Setting up for OSU MVAPICH MPI
----------------------------------
To launch OSU MPI jobs, the OSU MVAPICH installation directory needs to be
included in PATH and LD_LIBRARY_PATH. To set them, execute one of the
following commands:

  source <prefix>/mpi/<compiler>/<mvapich dir>/etc/mvapich.sh
        -- when using sh for launching MPI jobs
or
  source <prefix>/mpi/<compiler>/<mvapich dir>/etc/mvapich.csh
        -- when using csh for launching MPI jobs

2.2 Compiling OSU MVAPICH MPI Applications:
-------------------------------------------
***Important note***: A valid Fortran compiler must be present in order to
build the MVAPICH MPI stack and tests.

The default gcc-g77 Fortran compiler is provided with all RedHat Linux
releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide this
compiler as part of the default installation.

The following compilers are supported by OFED's OSU MPI package: gcc, intel
and pathscale. The install script prompts the user to choose the compiler
with which to build the OSU MVAPICH MPI RPM. Note that more than one compiler
can be selected simultaneously, if desired.

For details see:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich_user_guide.html

To review the default configuration of the installation, check the default
configuration file: <prefix>/mpi/<compiler>/<mvapich dir>/etc/mvapich.conf

2.3 Running OSU MVAPICH MPI Applications:
-----------------------------------------
Requirements:
  o At least two nodes. Example: mtlm01, mtlm02
  o Machine file: includes the list of machines. Example: /root/cluster
  o Bidirectional rsh or ssh without a password

Note for OSU: ssh will be used unless -rsh is specified. In order to use rsh,
add the -rsh parameter to the mpirun_rsh command.
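Before running the bundled tests below, you can optionally verify the setup
with a small MPI program of your own. The following is a minimal sketch only:
it assumes the gcc-built MVAPICH instance under the default prefix (the same
one used in the test examples below), the machine file /root/cluster, and a
trivial MPI source file hello.c supplied by you (the source and binary names
are hypothetical; the binary must be reachable at the same path on both hosts,
e.g. via a shared filesystem or by copying):

  source /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/etc/mvapich.sh
  mpicc -o /root/hello hello.c
  mpirun_rsh -np 2 -hostfile /root/cluster /root/hello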
*** Running OSU tests ***

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bw

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_latency

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bibw

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/osu_benchmarks-2.2/osu_bcast

*** Running Intel MPI Benchmark test (Full test) ***

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/IMB-2.3/IMB-MPI1

*** Running Presta test ***

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/com -o 100

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/glob -o 100

/usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/local/ofed/mpi/gcc/mvapich-0.9.7-mlx2.2.0/tests/presta-1.4.0/globalop

===============================================================================
3. MVAPICH2 MPI
===============================================================================
MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features. It is
based on MPICH2 and MVICH. MVAPICH2 0.9.8 provides many features, including
fault tolerance with checkpoint-restart, RDMA_CM support, iWARP support,
optimized collectives, on-demand connection management, multi-core optimized
and scalable shared memory support, and memory hook support with the ptmalloc2
library. The ADI-3-level design of MVAPICH2 0.9.8 supports many features,
including MPI-2 functionality (one-sided, collectives and datatypes),
multi-threading and all MPI-1 functionality. It also supports a wide range of
platforms (architectures, operating systems, compilers, InfiniBand adapters
and iWARP adapters).

More information can be found on the MVAPICH2 project site:
http://mvapich.cse.ohio-state.edu/overview/mvapich2/

A valid Fortran compiler must be present in order to build the MVAPICH2 MPI
stack and tests.

The following compilers are supported by OFED's MVAPICH2 MPI package: gcc,
intel, pgi, and pathscale. The install script prompts the user to choose the
compiler with which to build the MVAPICH2 MPI RPM. Note that more than one
compiler can be selected simultaneously, if desired.
The install script prompts for various MVAPICH2 build options, as detailed
below:

    - Implementation (OFA or uDAPL)         [default "OFA"]

    - OFA (IB and iWARP) Options:
        - ROMIO Support                     [default Y]
        - Shared Library Support            [default Y]
        - Multithread Support               [default N]
        - Checkpoint-Restart Support        [default N]
          Note: * only an option if Multithread Support is "N"
                * requires an installation of BLCR and prompts for the
                  BLCR installation directory location

    - uDAPL Options:
        - ROMIO Support                     [default Y]
        - Shared Library Support            [default Y]
        - Multithread Support               [default N]
        - Cluster Size                      [default "Small"]
        - I/O Bus                           [default "PCI-Express"]
        - Link Speed                        [default "SDR"]
        - Default DAPL Provider             [default "ib0"]

For non-interactive builds where no MVAPICH2 build options are stored in the
OFED configuration file, the default settings are:

    Implementation:                 OFA
    ROMIO Support:                  Y
    Shared Library Support:         Y
    Multithread Support:            N
    Checkpoint-Restart Support:     N

3.1 Setting up for MVAPICH2
---------------------------
Selecting MVAPICH2 via the MPI selector tools performs most of the setup
necessary to build and run MPI applications with MVAPICH2. If one does not
wish to use the MPI selector tools, the following setting should be enough:

    - add <mvapich2 install dir>/bin to PATH

<mvapich2 install dir> is the directory where the desired MVAPICH2 instance
was installed ("instance" refers to the path based on the RPM package name,
including the compiler chosen during the install).

It is also possible to source the following files in order to set up the
proper environment:

    source <mvapich2 install dir>/bin/mpivars.sh   [for Bourne based shells]
    source <mvapich2 install dir>/bin/mpivars.csh  [for C based shells]

In addition to the user environment settings handled by the MPI selector
tools, some other system settings might need to be modified. MVAPICH2
requires the memlock resource limit to be modified from the default in
/etc/security/limits.conf:

    * soft memlock unlimited

MVAPICH2 requires bidirectional rsh or ssh without a password to work. The
default is ssh; in this case it is also required to add the following line to
the /etc/init.d/sshd script before sshd is started:

    ulimit -l unlimited

It is also possible to specify a specific size in kilobytes instead of
"unlimited", if desired.

The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the IP
address of an InfiniBand HCA (IPoIB) for RDMA-CM functionality, or the IP
address of an iWARP adapter for iWARP functionality, if either of those is
desired. This file is not required by default; it is needed only when one of
the following runtime environment variables is set while using the OFA
MVAPICH2 build:

    RDMA-CM
    -------
    MV2_USE_RDMA_CM=1

    iWARP
    -----
    MV2_USE_IWARP_MODE=1

Otherwise, the OFA build will work without an /etc/mv2.conf file, using only
the InfiniBand HCA directly.

The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the DAPL
provider information. The default DAPL provider is chosen at build time, with
a default value of "ib0"; however, it can also be specified at runtime by
setting the following environment variable:

    MV2_DEFAULT_DAPL_PROVIDER=<dapl provider>

More information about MVAPICH2 can be found in the MVAPICH2 User Guide:
http://mvapich.cse.ohio-state.edu/support/

3.2 Compiling MVAPICH2 Applications
-----------------------------------
The MVAPICH2 compiler commands for each language are:

    Language        Compiler Command
    --------        ----------------
    C               mpicc
    C++             mpicxx
    Fortran 77      mpif77
    Fortran 90      mpif90

The system compiler commands should not be used directly. The Fortran 90
compiler command only exists if a Fortran 90 compiler was used during the
build process.
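As a minimal sketch of the above (the source and output file names are
hypothetical; the wrapper commands are on PATH once MVAPICH2 has been
selected via the MPI selector tools or mpivars has been sourced as described
in section 3.1):

    $ mpicc -o my_app my_app.c
    $ mpif90 -o my_f90_app my_f90_app.f90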
3.3 Running MVAPICH2 Applications
---------------------------------
Launching processes in MVAPICH2 is a two-step process. First, mpdboot must be
used to launch MPD daemons on the desired hosts. Second, the mpiexec command
is used to launch the processes.

MVAPICH2 requires bidirectional ssh or rsh without a password. The launcher
is specified when the MPD daemons are started, through the --rsh command line
option of mpdboot. The default is ssh.

Once the processes have finished, stop the MPD daemons with the mpdallexit
command.

The following example shows the basic procedure.

4 Processes on 4 Hosts Example:

    $ cat >hostsfile
    node1.example.com
    node2.example.com
    node3.example.com
    node4.example.com

    $ mpdboot -n 4 -f ./hostsfile
    $ mpiexec -n 4 ./my_mpi_application
    $ mpdallexit

It is also possible to use the mpirun command in place of mpiexec. They are
actually the same command in MVAPICH2; however, using mpiexec is preferred.

It is possible to run more processes than hosts. In that case, multiple
processes will run on some or all of the hosts used.

The following examples demonstrate how to run the MPI tests. The default
installation prefix and the gcc build of MVAPICH2 are shown. In each case, it
is assumed that a hosts file listing two hosts has been created in the
specified directory.

OSU Tests Example:

    $ cd /usr/mpi/gcc/mvapich2-0.9.8-11/tests/osu_benchmarks-2.2
    $ mpdboot -n 2 -f ./hosts
    $ mpiexec -n 2 ./osu_bcast
    $ mpiexec -n 2 ./osu_bibw
    $ mpiexec -n 2 ./osu_bw
    $ mpiexec -n 2 ./osu_latency
    $ mpdallexit

Intel MPI Benchmark Example:

    $ cd /usr/mpi/gcc/mvapich2-0.9.8-11/tests/IMB-2.3
    $ mpdboot -n 2 -f ./hosts
    $ mpiexec -n 2 ./IMB-MPI1
    $ mpdallexit

Presta Benchmarks Example:

    $ cd /usr/mpi/gcc/mvapich2-0.9.8-11/tests/presta-1.4.0
    $ mpdboot -n 2 -f ./hosts
    $ mpiexec -n 2 ./com -o 100
    $ mpiexec -n 2 ./glob -o 100
    $ mpiexec -n 2 ./globalop
    $ mpdallexit

===============================================================================
4. Open MPI
===============================================================================
Open MPI is a next-generation MPI implementation from the Open MPI Project
(http://www.open-mpi.org/). Version 1.1.1-1 of Open MPI is included in this
release, and is also available directly from the main Open MPI web site.

This MPI stack is offered in OFED as a "technology preview", meaning that it
is not officially supported yet. It is expected that future releases of OFED
will have fully supported versions of Open MPI.

A working Fortran compiler is not required to build Open MPI, but some of the
included MPI tests are written in Fortran. These tests will not compile/run
if Open MPI is built without Fortran support.

The following compilers are supported by OFED's Open MPI package: GNU,
Pathscale, Intel, and Portland. The install script prompts the user for the
compiler with which to build the Open MPI RPM. Note that more than one
compiler can be selected simultaneously, if desired.

Users should check the main Open MPI web site for additional documentation
and support. (Note: the FAQ covers InfiniBand tuning among other topics.)

4.1 Setting up for Open MPI:
----------------------------
The Open MPI team strongly advises users to put the Open MPI installation
directory in their PATH and LD_LIBRARY_PATH. This can be done at the system
level if all users are going to use Open MPI. Specifically:

    - add <open mpi install dir>/bin to PATH
    - add <open mpi install dir>/lib to LD_LIBRARY_PATH

<open mpi install dir> is the directory where the desired Open MPI instance
was installed ("instance" refers to the compiler used for the Open MPI
compilation at install time).
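For a Bourne-style shell, this could be done with lines like the following in
a startup file (a sketch only; it assumes the bin and lib directories sit
under the gcc instance directory used in the section 4.3 examples, which may
differ on your system):

    export PATH=/usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/lib:$LD_LIBRARY_PATH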
If using rsh or ssh to launch MPI jobs, you *must* set the variables described
above in your shell startup files (e.g., .bashrc, .cshrc, etc.). If you are
using a job scheduler to launch MPI jobs (e.g., SLURM, Torque), setting PATH
and LD_LIBRARY_PATH is still required, but it does not need to be done in your
shell startup files. Procedures describing how to add these values to PATH
and LD_LIBRARY_PATH are described in detail at:
http://www.open-mpi.org/faq/?category=running

4.2 Compiling Open MPI Applications:
------------------------------------
(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see this web
page for more details)

The Open MPI team strongly recommends that you simply use Open MPI's
"wrapper" compilers to compile your MPI applications. That is, instead of
using (for example) gcc to compile your program, use mpicc. Open MPI provides
a wrapper compiler for four languages:

    Language        Wrapper compiler name
    ------------    -------------------------------------------------
    C               mpicc
    C++             mpiCC, mpicxx, or mpic++ (note that mpiCC will not
                    exist on case-insensitive file-systems)
    Fortran 77      mpif77
    Fortran 90      mpif90
    ------------    -------------------------------------------------

Note that if no Fortran 77 or Fortran 90 compilers were found when Open MPI
was built, Fortran 77 and 90 support (respectively) will automatically be
disabled.

If you expect to compile your program as:

    > gcc my_mpi_application.c -lmpi -o my_mpi_application

simply use the following instead:

    > mpicc my_mpi_application.c -o my_mpi_application

Specifically: simply adding "-lmpi" to your normal compile/link command line
*will not work*. See http://www.open-mpi.org/faq/?category=mpi-apps if you
cannot use the Open MPI wrapper compilers.

Note that Open MPI's wrapper compilers do not do any actual compiling or
linking; all they do is manipulate the command line, add in all the relevant
compiler/linker flags, and then invoke the underlying compiler/linker (hence
the name "wrapper" compiler). More specifically, if you run into a compiler
or linker error, check your source code and/or back-end compiler -- it is
usually not the fault of the Open MPI wrapper compiler.

4.3 Running Open MPI Applications:
----------------------------------
Open MPI uses either the "mpirun" or "mpiexec" command to launch
applications. If your cluster uses a resource manager (such as SLURM or
Torque), providing a hostfile is not necessary:

    > mpirun -np 4 my_mpi_application

If you use rsh/ssh to launch applications, rsh/ssh must be set up to NOT
prompt for a password (see http://www.open-mpi.org/faq/?category=rsh for more
details on this topic). Moreover, you need to provide a hostfile containing
the list of hosts to run on. Example:

    > cat hostfile
    node1.example.com
    node2.example.com
    node3.example.com
    node4.example.com

    > mpirun -np 4 -hostfile hostfile my_mpi_application
    (application runs on all 4 nodes)

In the following examples, replace <np> with the number of processes to run,
and <hostfile> with the filename of a valid hostfile listing the nodes to run
on.
Example 1: Running the OSU bandwidth test:

    > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/osu_benchmarks-2.2
    > mpirun -np <np> -hostfile <hostfile> osu_bw

Example 2: Running the Intel MPI Benchmark:

    > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3
    > mpirun -np <np> -hostfile <hostfile> IMB-MPI1

Example 3: Running the Presta benchmarks:

    > cd /usr/local/ofed/mpi/gcc/openmpi-1.1.1-1/tests/presta-1.4.0
    > mpirun -np <np> -hostfile <hostfile> com -o 100
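The remaining Presta tests can be run the same way; for instance (a sketch,
mirroring the OSU MVAPICH examples in section 2.3, which use the same
presta-1.4.0 suite):

    > mpirun -np <np> -hostfile <hostfile> glob -o 100
    > mpirun -np <np> -hostfile <hostfile> globalop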