[ewg] OFED drivers or Linux stock drivers?

Dark Charlot jcldc13 at gmail.com
Thu Jun 14 13:14:02 PDT 2012


  Dear experts,

I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.

I downloaded the OFED 1.5.4.1 drivers and compiled and installed (** with a
lot of pain and spec file modifications **) some of the RPMs:

infiniband-diags-1.5.13-1.x86_64.rpm
infiniband-diags-debug-1.5.13-1.x86_64.rpm
libibmad-1.3.8-1.x86_64.rpm
libibmad-debug-1.3.8-1.x86_64.rpm
libibmad-devel-1.3.8-1.x86_64.rpm
libibmad-static-1.3.8-1.x86_64.rpm
libibumad-1.3.7-1.x86_64.rpm
libibumad-debug-1.3.7-1.x86_64.rpm
libibumad-devel-1.3.7-1.x86_64.rpm
libibumad-static-1.3.7-1.x86_64.rpm
libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
opensm-3.3.13-1.x86_64.rpm
opensm-debug-3.3.13-1.x86_64.rpm
opensm-devel-3.3.13-1.x86_64.rpm
opensm-libs-3.3.13-1.x86_64.rpm
opensm-static-3.3.13-1.x86_64.rpm

 But I was **not** able to compile the OFED kernel package (kernel-ib) itself.

 So instead I tried to use the corresponding modules that come with my stock
distribution kernel (3.3.6).
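
(For reference, this is roughly how I checked that the in-tree modules are
the ones actually loaded; the module names below are the usual in-tree
driver names, and the exact set needed is my assumption.)

# list the InfiniBand-related modules currently loaded
lsmod | egrep 'mlx4|ib_'

# load the core pieces if any are missing
modprobe mlx4_ib
modprobe ib_umad
modprobe ib_uverbs

# confirm the loaded mlx4_ib is the stock module shipped with 3.3.6
# (the path should be under /lib/modules/3.3.6*/kernel/drivers/,
# not an OFED "updates" directory)
modinfo -n mlx4_ib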

 After initializing (correctly, I guess) all the necessary Mellanox pieces
(openibd, opensm, etc.), I can see my Mellanox cards with the command
ibv_devinfo.
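
(And roughly this is how I convinced myself that a subnet manager is up and
the port got a LID; sminfo and ibstat come from the infiniband-diags package
listed above, and the opensmd service name is my assumption for this setup.)

# check that opensm is running on the node that should manage the subnet
/etc/init.d/opensmd status     # or simply: pgrep opensm

# query the subnet manager from any node
sminfo

# per-port state, LID, and rate
ibstat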

I get the following output on all the computers that have a Mellanox card:

1)  ibv_devinfo

kerkira:% ibv_devinfo

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      0002:c903:0009:d1b2
        sys_image_guid:                 0002:c903:0009:d1b5
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xA0
        board_id:                       MT_0C40110009
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 8
                        port_lid:               8
                        port_lmc:               0x00
                        link_layer:             IB


2) ibstatus

kerkira:% /usr/sbin/ibstatus

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
        base lid:        0x8
        sm lid:          0x8
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand


QUESTION:

==> According to these outputs, can we say that my computers are correctly
using the mlx4 drivers that come with my kernel 3.3.6?


Probably not, because I cannot communicate between two machines using
MPI...

Here are the details:
I compiled and installed MVAPICH2, but I cannot run the "osu_bw" program
between two machines; I get:

kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw

[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6.
MPI process died?
[kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited
with status 1
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5.
MPI process died?
[amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited
with status 1
[amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
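
(To separate the driver question from the MPI question, I suppose a
verbs-level test between the two machines would help; ibv_rc_pingpong is
part of the libibverbs-utils package I installed above. A sketch with my
hostnames:)

# on amos, start the server side:
ibv_rc_pingpong

# on kerkira, point the client at it:
ibv_rc_pingpong amos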


Now, if I run on the **same** machine, I get the expected results:

kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
# OSU MPI Bandwidth Test v3.6
# Size      Bandwidth (MB/s)
1                       5.47
2                      11.34
4                      22.84
8                      45.89
16                     91.52
32                    180.27
64                    350.68
128                   661.78
256                  1274.94
512                  2283.42
1024                 3936.39
2048                 6362.91
4096                 9159.54
8192                10737.42
16384                9246.39
32768                8869.26
65536                8707.28
131072               8942.07
262144               9009.39
524288               9060.31
1048576              9080.17
2097152              5702.06

(Note: passwordless ssh between the machines kerkira and amos works
correctly.)
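
(One thing I have not ruled out yet: since the intra-node run goes through
shared memory, it does not really exercise the verbs path. A too-small
locked-memory limit is a classic cause of MPI_Init failures over InfiniBand,
so checking it in the environment mpirun_rsh actually sees is on my list;
the limits.conf lines below are the usual way to raise it and are an
assumption about my setup.)

# check the memlock limit in a non-interactive shell on the remote node,
# i.e. the way mpirun_rsh will see it:
ssh amos 'ulimit -l'

# if it is not "unlimited" (or at least very large), raise it on both
# nodes in /etc/security/limits.conf and log in again:
#   * soft memlock unlimited
#   * hard memlock unlimited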

QUESTIONS:

==> Why do MPI programs not work between two machines?
==> Is it because I use the mlx4/umad/etc. modules from my distribution
kernel and not the OFED kernel-ib ones?

 Thanks in advance for your help.

  Jean-Charles Lambert.