[ewg] Fwd: OFED drivers or linux stock drivers ? [SOLVED]
Dark Charlot
jcldc13 at gmail.com
Fri Jun 15 05:57:50 PDT 2012
Dear Jonathan Perkins,
you put me on the right track! It was just a locked-memory (memlock)
limit problem, DAMN IT!
My /etc/security/limits.conf was already set correctly with these lines:
* hard memlock unlimited
* soft memlock unlimited
BUT when I ran "ulimit -l" as a user, I got "64" instead of "unlimited".
To get "unlimited" in all my shells, I had to add the following line to
/etc/ssh/sshd_config:
UsePAM yes
(and restart the sshd daemon: systemctl restart sshd.service)
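(For anyone hitting the same problem, a quick sanity check -- just a
sketch, with my two host names as placeholders -- is to make sure that
non-interactive ssh sessions, which is typically how mpirun_rsh starts
the remote ranks, now report an unlimited locked-memory limit on every
node:

  for host in amos kerkira; do
      printf '%s: ' "$host"
      ssh "$host" 'ulimit -l'   # should print "unlimited" everywhere
  done
)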
And now my MPI stack over InfiniBand is working as expected :D :D
Many, many thanks again!
Jean-Charles
---------- Forwarded message ----------
From: Dark Charlot <jcldc13 at gmail.com>
Date: 2012/6/15
Subject: Re: [ewg] OFED drivers or linux stock drivers ?
To: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
Hi,
after recompiling MVAPICH2 with your configure options, I got this:
mpirun_rsh -np 2 amos kerkira ./osu_bw
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
[amos:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[amos:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[amos:mpispawn_0][child_handler] MPI process (rank: 0, pid: 11879) exited with status 1
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(408).......:
MPID_Init(296)..............: channel initialization failed
MPIDI_CH3_Init(283).........:
MPIDI_CH3I_RDMA_init(172)...:
rdma_setup_startup_ring(431): cannot create cq
[kerkira:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[kerkira:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[kerkira:mpispawn_1][child_handler] MPI process (rank: 1, pid: 565) exited with status 1
[kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
[kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
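(A quick way to tell whether this kind of "cannot create cq" failure is
a verbs-level problem rather than something in MVAPICH2 itself is to run
one of the plain ibverbs tests shipped with libibverbs-utils between the
two hosts -- a sketch, using the device name reported by ibv_devinfo:

  # on amos (server side)
  ibv_rc_pingpong -d mlx4_0
  # on kerkira (client side)
  ibv_rc_pingpong -d mlx4_0 amos

If that also fails while "ulimit -l" reports a small value such as 64,
the locked-memory limit is the likely culprit, since creating completion
queues and registering buffers both need pinned memory.)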
Thanks, JC
2012/6/15 Jonathan Perkins <perkinjo at cse.ohio-state.edu>
> This could be something as simple as a locked memory limit issue. Can
> you rebuild mvapich2 by passing `--disable-fast --enable-g=dbg' to
> configure? You should get more useful output with these options.
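> For completeness, the rebuild might look roughly like this (just a
> sketch; the source directory and install prefix are placeholders to
> adapt to your own layout):
>
>   cd /path/to/mvapich2-source
>   ./configure --prefix=$HOME/mvapich2-dbg --disable-fast --enable-g=dbg
>   make && make install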
>
> I'm cc'ing mvapich-discuss as well as this may be specific to MVAPICH2.
>
> On Thu, Jun 14, 2012 at 4:14 PM, Dark Charlot <jcldc13 at gmail.com> wrote:
> > Dear experts,
> >
> > I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.
> >
> > I downloaded the OFED 1.5.4.1 drivers and compiled and installed (** with
> > a lot of pain and spec file modifications **) some of the RPMs:
> >
> > infiniband-diags-1.5.13-1.x86_64.rpm
> > infiniband-diags-debug-1.5.13-1.x86_64.rpm
> > libibmad-1.3.8-1.x86_64.rpm
> > libibmad-debug-1.3.8-1.x86_64.rpm
> > libibmad-devel-1.3.8-1.x86_64.rpm
> > libibmad-static-1.3.8-1.x86_64.rpm
> > libibumad-1.3.7-1.x86_64.rpm
> > libibumad-debug-1.3.7-1.x86_64.rpm
> > libibumad-devel-1.3.7-1.x86_64.rpm
> > libibumad-static-1.3.7-1.x86_64.rpm
> > libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
> > libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
> > libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
> > mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
> > mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
> > opensm-3.3.13-1.x86_64.rpm
> > opensm-debug-3.3.13-1.x86_64.rpm
> > opensm-devel-3.3.13-1.x86_64.rpm
> > opensm-libs-3.3.13-1.x86_64.rpm
> > opensm-static-3.3.13-1.x86_64.rpm
> >
> > But I was **not** able to compile the OFED kernel modules (kernel-ib)
> > themselves.
> >
> > Then I tried instead to use all the corresponding modules that come with
> > my stock Linux kernel (3.3.6).
> >
> > After correctly initializing (I guess) all the necessary Mellanox pieces
> > (openibd, opensm, etc.), I can see my Mellanox cards with the command
> > ibv_devinfo.
> >
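> > (Concretely, the initialization on each node amounts to roughly the
> > following -- a sketch; the module names are the in-tree ones, and the
> > subnet manager only needs to run on one node if no managed switch
> > provides one:
> >
> >   modprobe -a mlx4_ib ib_umad ib_uverbs
> >   opensm -B
> > )
> >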
> > I get the following output on every computer that has a Mellanox card:
> >
> > 1) ibv_devinfo
> >
> > kerkira:% ibv_devinfo
> >
> > hca_id: mlx4_0
> > transport: InfiniBand (0)
> > fw_ver: 2.7.000
> > node_guid: 0002:c903:0009:d1b2
> > sys_image_guid: 0002:c903:0009:d1b5
> > vendor_id: 0x02c9
> > vendor_part_id: 26428
> > hw_ver: 0xA0
> > board_id: MT_0C40110009
> > phys_port_cnt: 1
> > port: 1
> > state: PORT_ACTIVE (4)
> > max_mtu: 2048 (4)
> > active_mtu: 2048 (4)
> > sm_lid: 8
> > port_lid: 8
> > port_lmc: 0x00
> > link_layer: IB
> >
> >
> > 2) ibstatus
> >
> > kerkira:% /usr/sbin/ibstatus
> >
> > Infiniband device 'mlx4_0' port 1 status:
> > default gid: fe80:0000:0000:0000:0002:c903:0009:d1b3
> > base lid: 0x8
> > sm lid: 0x8
> > state: 4: ACTIVE
> > phys state: 5: LinkUp
> > rate: 40 Gb/sec (4X QDR)
> > link_layer: InfiniBand
> >
> >
> > QUESTION:
> >
> > ==> According to these outputs, can we say that my computers are
> > correctly using the mlx4 drivers that come with my kernel 3.3.6?
> >
> > Probably not, because I cannot communicate between two machines using
> > MPI...
> >
> > Here are the details:
> > I compiled and installed MVAPICH2, but I could not run the "osu_bw"
> > program between two machines; I get:
> >
> > kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
> >
> > [cli_0]: aborting job:
> > Fatal error in MPI_Init:
> > Other MPI error
> >
> > [kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> > [kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited with status 1
> > [cli_1]: aborting job:
> > Fatal error in MPI_Init:
> > Other MPI error
> >
> > [amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > [amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > [amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited with status 1
> > [amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
> >
> >
> > Now if I run on the **same** machine, I get the expected results:
> >
> > kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
> > # OSU MPI Bandwidth Test v3.6
> > # Size Bandwidth (MB/s)
> > 1 5.47
> > 2 11.34
> > 4 22.84
> > 8 45.89
> > 16 91.52
> > 32 180.27
> > 64 350.68
> > 128 661.78
> > 256 1274.94
> > 512 2283.42
> > 1024 3936.39
> > 2048 6362.91
> > 4096 9159.54
> > 8192 10737.42
> > 16384 9246.39
> > 32768 8869.26
> > 65536 8707.28
> > 131072 8942.07
> > 262144 9009.39
> > 524288 9060.31
> > 1048576 9080.17
> > 2097152 5702.06
> >
> > (Note: passwordless ssh between the machines kerkira and amos works
> > correctly.)
> >
> > QUESTIONS:
> >
> > ==> Why do MPI programs not work between two machines?
> > ==> Is it because I use the mlx4/umad/etc. modules from my distribution
> > kernel and not the OFED kernel-ib ones?
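> > (One way to check which mlx4 modules are actually in use -- a sketch,
> > exact paths vary by distribution:
> >
> >   lsmod | egrep 'mlx4|^ib_'
> >   modinfo -F filename mlx4_ib
> >   # a path under /lib/modules/$(uname -r)/kernel/... means the stock
> >   # in-tree driver; OFED's kernel-ib installs under .../updates/ instead
> > )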
> >
> > Thanks in advance for your help.
> >
> > Jean-Charles Lambert.
> >
> >
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>