[ewg] OFED drivers or linux stock drivers ?

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Jun 14 19:31:41 PDT 2012


This could be something as simple as a locked-memory limit issue (the
`ulimit -l' setting).  Can you rebuild mvapich2 by passing
`--disable-fast --enable-g=dbg' to configure?  You should get more useful
error output with these options.
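
To check the locked-memory limit on each node, something like the
following may help.  It is only a minimal sketch, assuming Python 3 is
installed on both machines; the helper names are made up for
illustration and are not part of MVAPICH2 or OFED.  It reads
RLIMIT_MEMLOCK through the standard `resource' module and prints a
warning if the soft limit is finite.

```python
import resource


def describe_memlock_limit(soft, hard):
    """Summarize a (soft, hard) RLIMIT_MEMLOCK pair; values are in bytes."""
    def fmt(value):
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"
    return f"soft={fmt(soft)} hard={fmt(hard)}"


def memlock_is_unlimited(soft):
    """True when the soft locked-memory limit is unlimited."""
    return soft == resource.RLIM_INFINITY


if __name__ == "__main__":
    # Query the limit that applies to this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print(describe_memlock_limit(soft, hard))
    if not memlock_is_unlimited(soft):
        print("locked-memory limit is finite; "
              "InfiniBand memory registration may fail")
```

It is worth running this over ssh on the remote node as well as in a
local shell, since the limit seen by a non-interactive ssh session can
differ from the one in a login shell.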

I'm cc'ing mvapich-discuss as well, since this may be specific to MVAPICH2.

On Thu, Jun 14, 2012 at 4:14 PM, Dark Charlot <jcldc13 at gmail.com> wrote:
>   Dear experts,
>
> I am running mageia2 linux distribution which comes with kernel 3.3.6.
>
> I downloaded the ofed 1.5.4.1 drivers, then compiled and installed (** with
> a lot of pain and spec file modifications **) some of the RPMs:
>
> infiniband-diags-1.5.13-1.x86_64.rpm
> infiniband-diags-debug-1.5.13-1.x86_64.rpm
> libibmad-1.3.8-1.x86_64.rpm
> libibmad-debug-1.3.8-1.x86_64.rpm
> libibmad-devel-1.3.8-1.x86_64.rpm
> libibmad-static-1.3.8-1.x86_64.rpm
> libibumad-1.3.7-1.x86_64.rpm
> libibumad-debug-1.3.7-1.x86_64.rpm
> libibumad-devel-1.3.7-1.x86_64.rpm
> libibumad-static-1.3.7-1.x86_64.rpm
> libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
> libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
> libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
> libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
> libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
> libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
> libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
> libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
> mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
> mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
> opensm-3.3.13-1.x86_64.rpm
> opensm-debug-3.3.13-1.x86_64.rpm
> opensm-devel-3.3.13-1.x86_64.rpm
> opensm-libs-3.3.13-1.x86_64.rpm
> opensm-static-3.3.13-1.x86_64.rpm
>
>  But I was **not** able to compile the ofa kernel itself.
>
>  Instead, I tried to use the corresponding modules which come with
> my stock linux kernel distribution (3.3.6).
>
>  After initializing correctly (I guess) all the necessary mellanox stuff
> (openibd, opensm etc...) I can see my Mellanox cards with the command
> ibv_devinfo.
>
> I get the following output on all the computers which have a mellanox card:
>
> 1)  ibv_devinfo
>
> kerkira:% ibv_devinfo
>
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.7.000
>         node_guid:                      0002:c903:0009:d1b2
>         sys_image_guid:                 0002:c903:0009:d1b5
>         vendor_id:                      0x02c9
>         vendor_part_id:                 26428
>         hw_ver:                         0xA0
>         board_id:                       MT_0C40110009
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 8
>                         port_lid:               8
>                         port_lmc:               0x00
>                         link_layer:             IB
>
>
> 2) ibstatus
>
> kerkira:% /usr/sbin/ibstatus
>
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
>         base lid:        0x8
>         sm lid:          0x8
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            40 Gb/sec (4X QDR)
>         link_layer:      InfiniBand
>
>
> QUESTION:
>
> ==> According to these outputs, can we say that my computers are correctly
> using the mlx4 drivers which come with my kernel 3.3.6 ?
>
>
> Probably not because I cannot communicate between two machines using
> mpi.....
>
> Here is the detail:
> I compiled and installed MVAPICH2 but I couldn't run the "osu_bw" program
> between two machines. I get:
>
> kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
>
> [cli_0]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
> [kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6.
> MPI process died?
> [kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited
> with status 1
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
> [amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI
> process died?
> [amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited
> with status 1
> [amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
>
>
> Now if I run on the **same** machine, I get the expected results:
>
> kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
> # OSU MPI Bandwidth Test v3.6
> # Size      Bandwidth (MB/s)
> 1                       5.47
> 2                      11.34
> 4                      22.84
> 8                      45.89
> 16                     91.52
> 32                    180.27
> 64                    350.68
> 128                   661.78
> 256                  1274.94
> 512                  2283.42
> 1024                 3936.39
> 2048                 6362.91
> 4096                 9159.54
> 8192                10737.42
> 16384                9246.39
> 32768                8869.26
> 65536                8707.28
> 131072               8942.07
> 262144               9009.39
> 524288               9060.31
> 1048576              9080.17
> 2097152              5702.06
>
> (note: ssh between the machines kerkira and amos works correctly without
> password)
>
> QUESTION:
>
> ==> Why do MPI programs not work between two machines ?
> ==> Is it because I use the mlx4/umad/etc modules from my distribution
> kernel and not the OFED kernel-ib ?
>
>  Thanks in advance for your help .
>
>   Jean-Charles Lambert.
>
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


