[ewg] Fwd: OFED drivers or linux stock drivers ? [SOLVED]

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Jun 15 07:26:56 PDT 2012


Glad to know this helped.  If you have any further questions about using
MVAPICH2 please feel free to mail mvapich-discuss at cse.ohio-state.edu.

On Fri, Jun 15, 2012 at 02:57:50PM +0200, Dark Charlot wrote:
>   Dear Jonathan Perkins,
> 
>  You put me on the right track! It was just a locked memory (memlock)
> limit problem, DAMN IT!
> 
>  My /etc/security/limits.conf was set correctly with these lines:
> *               hard    memlock         unlimited
> *               soft    memlock         unlimited
> 
> BUT when I ran "ulimit -l" as a user, I got "64" instead of
> "unlimited".
> 
> To get "unlimited" in all my shells, I had to add the following line to
> /etc/ssh/sshd_config:
> 
> UsePAM yes
> 
> (and restart the sshd daemon:
> systemctl restart sshd.service)
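> 
>  A quick way to verify this on each node, assuming the host names used in
> this thread, is to compare the limit in a local shell with the one a fresh
> ssh session sees, since mpirun_rsh starts the remote ranks through ssh:
> 
> ulimit -l                  # local shell: should now print "unlimited"
> ssh kerkira 'ulimit -l'    # fresh ssh session: should also print "unlimited"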
> 
>  And now my MPI stack over InfiniBand is working as expected :D:D
> 
>   Many many thanks again !
> 
>   Jean-Charles
> 
> ---------- Forwarded message ----------
> From: Dark Charlot <jcldc13 at gmail.com>
> Date: 2012/6/15
> Subject: Re: [ewg] OFED drivers or linux stock drivers ?
> To: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
> 
> 
>  Hi,
> 
>  After recompiling MVAPICH2 with your configure options, I got this:
> 
>  mpirun_rsh -np 2 amos kerkira ./osu_bw
> 
> [cli_0]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......:
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........:
> MPIDI_CH3I_RDMA_init(172)...:
> rdma_setup_startup_ring(431): cannot create cq
> 
> [amos:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [amos:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [amos:mpispawn_0][child_handler] MPI process (rank: 0, pid: 11879) exited with status 1
> 
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......:
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........:
> MPIDI_CH3I_RDMA_init(172)...:
> rdma_setup_startup_ring(431): cannot create cq
> 
> [kerkira:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [kerkira:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [kerkira:mpispawn_1][child_handler] MPI process (rank: 1, pid: 565) exited with status 1
> [kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
> [kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
> 
>  Thanks,  JC
> 
> 2012/6/15 Jonathan Perkins <perkinjo at cse.ohio-state.edu>
> 
> > This could be something as simple as a locked memory limit issue.  Can you
> > rebuild MVAPICH2 by passing `--disable-fast --enable-g=dbg' to
> > configure?  You should get more useful output with these options.
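> >
> > For example, something along these lines (the source path and install
> > prefix below are placeholders, adjust them to your setup):
> >
> > cd /path/to/mvapich2-source
> > ./configure --disable-fast --enable-g=dbg --prefix=$HOME/mvapich2-dbg
> > make -j4 && make install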
> >
> > I'm cc'ing mvapich-discuss as well, since this may be specific to MVAPICH2.
> >
> > On Thu, Jun 14, 2012 at 4:14 PM, Dark Charlot <jcldc13 at gmail.com> wrote:
> > >   Dear experts,
> > >
> > > I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.
> > >
> > > I downloaded the OFED 1.5.4.1 drivers and compiled and installed (** with a
> > > lot of pain and spec file modifications **) some of the RPMs:
> > >
> > > infiniband-diags-1.5.13-1.x86_64.rpm
> > > infiniband-diags-debug-1.5.13-1.x86_64.rpm
> > > libibmad-1.3.8-1.x86_64.rpm
> > > libibmad-debug-1.3.8-1.x86_64.rpm
> > > libibmad-devel-1.3.8-1.x86_64.rpm
> > > libibmad-static-1.3.8-1.x86_64.rpm
> > > libibumad-1.3.7-1.x86_64.rpm
> > > libibumad-debug-1.3.7-1.x86_64.rpm
> > > libibumad-devel-1.3.7-1.x86_64.rpm
> > > libibumad-static-1.3.7-1.x86_64.rpm
> > > libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
> > > libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
> > > libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
> > > mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
> > > mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
> > > opensm-3.3.13-1.x86_64.rpm
> > > opensm-debug-3.3.13-1.x86_64.rpm
> > > opensm-devel-3.3.13-1.x86_64.rpm
> > > opensm-libs-3.3.13-1.x86_64.rpm
> > > opensm-static-3.3.13-1.x86_64.rpm
> > >
> > >  But I was **not** able to compile the OFA kernel package itself.
> > >
> > >  Then I tried instead to use all the corresponding modules which come with
> > > my stock Linux distribution kernel (3.3.6).
> > >
> > >  After correctly (I guess) initializing all the necessary Mellanox pieces
> > > (openibd, opensm, etc.), I can see my Mellanox cards with the command
> > > ibv_devinfo.
> > >
> > > I get the following output on all the computers which have a Mellanox card:
> > >
> > > 1)  ibv_devinfo
> > >
> > > kerkira:% ibv_devinfo
> > >
> > > hca_id: mlx4_0
> > >         transport:                      InfiniBand (0)
> > >         fw_ver:                         2.7.000
> > >         node_guid:                      0002:c903:0009:d1b2
> > >         sys_image_guid:                 0002:c903:0009:d1b5
> > >         vendor_id:                      0x02c9
> > >         vendor_part_id:                 26428
> > >         hw_ver:                         0xA0
> > >         board_id:                       MT_0C40110009
> > >         phys_port_cnt:                  1
> > >                 port:   1
> > >                         state:                  PORT_ACTIVE (4)
> > >                         max_mtu:                2048 (4)
> > >                         active_mtu:             2048 (4)
> > >                         sm_lid:                 8
> > >                         port_lid:               8
> > >                         port_lmc:               0x00
> > >                         link_layer:             IB
> > >
> > >
> > > 2) ibstatus
> > >
> > > kerkira:% /usr/sbin/ibstatus
> > >
> > > Infiniband device 'mlx4_0' port 1 status:
> > >         default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
> > >         base lid:        0x8
> > >         sm lid:          0x8
> > >         state:           4: ACTIVE
> > >         phys state:      5: LinkUp
> > >         rate:            40 Gb/sec (4X QDR)
> > >         link_layer:      InfiniBand
> > >
> > >
> > > QUESTION:
> > >
> > > ==> According to these outputs, can we say that my computers are correctly
> > > using the mlx4 drivers which come with my kernel 3.3.6?
> > >
> > >
> > > Probably not, because I cannot communicate between two machines using
> > > MPI...
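> > >
> > > To double-check which driver stack is actually loaded, one option is to
> > > list the RDMA-related kernel modules (the exact module names may vary):
> > >
> > > lsmod | grep -E 'mlx4|^ib_'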
> > >
> > > Here are the details:
> > > I compiled and installed MVAPICH2, but I couldn't run the "osu_bw" program
> > > between two machines; I get:
> > >
> > > kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
> > >
> > > [cli_0]: aborting job:
> > > Fatal error in MPI_Init:
> > > Other MPI error
> > >
> > > [kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> > > [kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > [kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited with status 1
> > > [cli_1]: aborting job:
> > > Fatal error in MPI_Init:
> > > Other MPI error
> > >
> > > [amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > > [amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > [amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited with status 1
> > > [amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
> > >
> > >
> > > Now if I run on the **same** machine, I get the expected results:
> > >
> > > kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
> > > # OSU MPI Bandwidth Test v3.6
> > > # Size      Bandwidth (MB/s)
> > > 1                       5.47
> > > 2                      11.34
> > > 4                      22.84
> > > 8                      45.89
> > > 16                     91.52
> > > 32                    180.27
> > > 64                    350.68
> > > 128                   661.78
> > > 256                  1274.94
> > > 512                  2283.42
> > > 1024                 3936.39
> > > 2048                 6362.91
> > > 4096                 9159.54
> > > 8192                10737.42
> > > 16384                9246.39
> > > 32768                8869.26
> > > 65536                8707.28
> > > 131072               8942.07
> > > 262144               9009.39
> > > 524288               9060.31
> > > 1048576              9080.17
> > > 2097152              5702.06
> > >
> > > (note: passwordless ssh between the machines kerkira and amos works
> > > correctly)
> > >
> > > QUESTION:
> > >
> > > ==> Why do MPI programs not work between two machines?
> > > ==> Is it because I use the mlx4/umad/etc. modules from my distribution
> > > kernel and not the OFED kernel-ib ones?
> > >
> > >  Thanks in advance for your help.
> > >
> > >   Jean-Charles Lambert.
> > >
> > >
> > >
> > > _______________________________________________
> > > ewg mailing list
> > > ewg at lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> >
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
> >

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


