Dear experts,

I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.

I downloaded the OFED 1.5.4.1 drivers and compiled and installed (** with a lot of pain and spec file modifications **) some of the RPMs:
infiniband-diags-1.5.13-1.x86_64.rpm
infiniband-diags-debug-1.5.13-1.x86_64.rpm
libibmad-1.3.8-1.x86_64.rpm
libibmad-debug-1.3.8-1.x86_64.rpm
libibmad-devel-1.3.8-1.x86_64.rpm
libibmad-static-1.3.8-1.x86_64.rpm
libibumad-1.3.7-1.x86_64.rpm
libibumad-debug-1.3.7-1.x86_64.rpm
libibumad-devel-1.3.7-1.x86_64.rpm
libibumad-static-1.3.7-1.x86_64.rpm
libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
opensm-3.3.13-1.x86_64.rpm
opensm-debug-3.3.13-1.x86_64.rpm
opensm-devel-3.3.13-1.x86_64.rpm
opensm-libs-3.3.13-1.x86_64.rpm
opensm-static-3.3.13-1.x86_64.rpm

But I was **not** able to compile the OFA kernel package (kernel-ib) itself.

So I tried to use, instead, all the corresponding modules that come with my distribution's stock kernel (3.3.6).
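
For what it's worth, this is roughly how I check that the in-kernel InfiniBand modules are actually loaded after boot (just a sketch: the module names below are the ones I expect from an upstream 3.3.x kernel, not copied from my machines):

  # list the Mellanox and core InfiniBand modules currently loaded
  lsmod | egrep 'mlx4|ib_uverbs|ib_umad|rdma'

  # load them by hand if any are missing (names assumed, adjust as needed)
  modprobe -a mlx4_core mlx4_ib ib_uverbs ib_umad rdma_ucm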

After correctly initializing (I guess) all the necessary Mellanox services (openibd, opensm, etc.), I can see my Mellanox cards with the ibv_devinfo command.

I get the following output on all the computers that have a Mellanox card:

1) ibv_devinfo

kerkira:% ibv_devinfo

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      0002:c903:0009:d1b2
        sys_image_guid:                 0002:c903:0009:d1b5
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xA0
        board_id:                       MT_0C40110009
        phys_port_cnt:                  1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         8
                        port_lid:       8
                        port_lmc:       0x00
                        link_layer:     IB


2) ibstatus

kerkira:% /usr/sbin/ibstatus

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0009:d1b3
        base lid:        0x8
        sm lid:          0x8
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

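If it is useful, I can also run the fabric-level diagnostics from the infiniband-diags package I installed; a sketch of what I would try (output not included here):

  # list the host channel adapters visible on the fabric
  ibhosts

  # or walk the whole subnet
  ibnetdiscover
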
QUESTION:

==> According to these outputs, can we say that my computers are correctly using the mlx4 drivers that come with my kernel 3.3.6?

Probably not, because I cannot communicate between two machines using MPI...

Here is the detail:
I compiled and installed MVAPICH2, but I cannot run the "osu_bw" program between two machines; I get:

kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw

[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited with status 1
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited with status 1
[amos:mpispawn_1][report_error] connect() failed: Connection refused (111)


Now if I run on the **same** machine, I get the expected results:

kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
# OSU MPI Bandwidth Test v3.6
# Size          Bandwidth (MB/s)
1               5.47
2               11.34
4               22.84
8               45.89
16              91.52
32              180.27
64              350.68
128             661.78
256             1274.94
512             2283.42
1024            3936.39
2048            6362.91
4096            9159.54
8192            10737.42
16384           9246.39
32768           8869.26
65536           8707.28
131072          8942.07
262144          9009.39
524288          9060.31
1048576         9080.17
2097152         5702.06

(note: ssh between the machines kerkira and amos works correctly, without a password)

QUESTION:

==> Why do MPI programs not work between two machines?
==> Is it because I am using the mlx4/umad/etc. modules from my distribution kernel and not the OFED kernel-ib ones?
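
To try to separate the MVAPICH2 layer from the low-level stack, I also plan to run the raw verbs ping-pong test that ships with the libibverbs-utils package I installed. A sketch of what I intend to do (device and port taken from the ibv_devinfo output above; the example programs also need their default TCP port, 18515 I believe, reachable between the hosts for the connection setup):

  # on amos: wait for a client, using device mlx4_0, port 1
  ibv_rc_pingpong -d mlx4_0 -i 1

  # on kerkira: connect to amos and run the RC ping-pong over InfiniBand
  ibv_rc_pingpong -d mlx4_0 -i 1 amos

If that works, I would guess the kernel modules and verbs stack are fine and the problem is on the MVAPICH2/PMI side; if it fails, it would point at the in-kernel drivers.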

Thanks in advance for your help.

Jean-Charles Lambert.