[ewg] OFED drivers or linux stock drivers ?
Dark Charlot
jcldc13 at gmail.com
Thu Jun 14 13:14:02 PDT 2012
Dear experts,
I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.
I downloaded the OFED 1.5.4.1 drivers, then compiled and installed (** with a
lot of pain and many spec file modifications **) some of the RPMs:
infiniband-diags-1.5.13-1.x86_64.rpm
infiniband-diags-debug-1.5.13-1.x86_64.rpm
libibmad-1.3.8-1.x86_64.rpm
libibmad-debug-1.3.8-1.x86_64.rpm
libibmad-devel-1.3.8-1.x86_64.rpm
libibmad-static-1.3.8-1.x86_64.rpm
libibumad-1.3.7-1.x86_64.rpm
libibumad-debug-1.3.7-1.x86_64.rpm
libibumad-devel-1.3.7-1.x86_64.rpm
libibumad-static-1.3.7-1.x86_64.rpm
libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
opensm-3.3.13-1.x86_64.rpm
opensm-debug-3.3.13-1.x86_64.rpm
opensm-devel-3.3.13-1.x86_64.rpm
opensm-libs-3.3.13-1.x86_64.rpm
opensm-static-3.3.13-1.x86_64.rpm
But I was **not** able to compile the OFA kernel itself.
Instead, I tried to use the corresponding modules that come with my
distribution's stock Linux kernel (3.3.6).
After initializing (correctly, I guess) all the necessary Mellanox pieces
(openibd, opensm, etc.), I can see my Mellanox cards with the command
ibv_devinfo.
I get the following output on every computer that has a Mellanox card:
1) ibv_devinfo
kerkira:% ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.7.000
        node_guid:              0002:c903:0009:d1b2
        sys_image_guid:         0002:c903:0009:d1b5
        vendor_id:              0x02c9
        vendor_part_id:         26428
        hw_ver:                 0xA0
        board_id:               MT_0C40110009
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         8
                        port_lid:       8
                        port_lmc:       0x00
                        link_layer:     IB
2) ibstatus
kerkira:% /usr/sbin/ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
        base lid:        0x8
        sm lid:          0x8
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand
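For anyone who wants to repeat this check on several nodes, here is a minimal Python sketch that parses ibstatus-style text and reports whether each port is ACTIVE with its link up. The helper names parse_ibstatus and port_is_up are hypothetical (they are not part of OFED), and the sample text is simply the output pasted above. A port in this state suggests that the link and the subnet manager are working, but it does not by itself prove that the MPI layer can use the device.

```python
import re

# Sample text copied from the ibstatus output shown in this message.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c903:0009:d1b3
base lid: 0x8
sm lid: 0x8
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
"""

# Matches the per-port header line printed by ibstatus.
HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Hypothetical helper: map (device, port) to a dict of its fields."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # tolerate indented or flattened output
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            current = (match.group(1), int(match.group(2)))
            ports[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as
            # "4: ACTIVE" or a GID keep their own colons.
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def port_is_up(fields):
    """Hypothetical helper: True if the port is ACTIVE and physically LinkUp."""
    return (fields.get("state", "").endswith("ACTIVE")
            and fields.get("phys state", "").endswith("LinkUp"))


if __name__ == "__main__":
    for (device, port), fields in parse_ibstatus(SAMPLE).items():
        status = "up" if port_is_up(fields) else "NOT up"
        print(f"{device} port {port}: {status}, rate {fields.get('rate')}")
```

In practice the text would come from running /usr/sbin/ibstatus on each machine (for example through ssh) instead of the embedded sample.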
QUESTION:
==> According to these outputs, can we say that my computers are correctly
using the mlx4 drivers that come with my kernel 3.3.6?
Probably not, because I cannot communicate between two machines using
MPI.
Here are the details:
I compiled and installed MVAPICH2, but I could not run the "osu_bw" program
between two machines. I get:
kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error
[kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6.
MPI process died?
[kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited
with status 1
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error
[amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5.
MPI process died?
[amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited
with status 1
[amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
Now, if I run it with both processes on the **same** machine, I get the expected results:
kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
# OSU MPI Bandwidth Test v3.6
# Size Bandwidth (MB/s)
1 5.47
2 11.34
4 22.84
8 45.89
16 91.52
32 180.27
64 350.68
128 661.78
256 1274.94
512 2283.42
1024 3936.39
2048 6362.91
4096 9159.54
8192 10737.42
16384 9246.39
32768 8869.26
65536 8707.28
131072 8942.07
262144 9009.39
524288 9060.31
1048576 9080.17
2097152 5702.06
(note: ssh between the machines kerkira and amos works correctly without
password)
QUESTION:
==> Why do MPI programs not work between two machines?
==> Is it because I use the mlx4/umad/etc. modules from my distribution
kernel and not the OFED kernel-ib ones?
Thanks in advance for your help.
Jean-Charles Lambert.