Re: [ofa-general] troubleshooting with infinband

Dotan Barak dotanba at gmail.com
Fri Feb 13 23:23:40 PST 2009


Vittorio wrote:
> Hello!
> This is my first message on the list, so I hope I'm not going to 
> ask a silly or already-answered question.
>
> I'm a student and I'm porting an electromagnetic field simulator to a 
> parallel and distributed Linux cluster for my final thesis; I'm using 
> both OpenMP and MPI over InfiniBand to achieve speed improvements.
>
> The OpenMP part is done, and now I'm having problems setting up MPI 
> over InfiniBand. So far I have:
> - set up the kernel modules correctly
> - installed the right drivers for the board (Mellanox HCA) and the 
>   userspace programs
> - installed the MVAPICH2 MPI implementation
>
> However, I can't get all of this to work together.
> For example, ibhosts correctly finds the two connected nodes:
>
> Ca    : 0x0002c90300018b8e ports 2 " HCA-1"
> Ca    : 0x0002c90300018b12 ports 2 "localhost HCA-1"
>
> but ibping doesn't receive any responses:
>
> ibwarn: [32052] ibping: Ping..
> ibwarn: [32052] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 2)
> ibwarn: [32052] main: ibping to Lid 2 failed
>
> After that, any other operation with MPI fails.
> Strangely enough, however, IPoIB works very well and I can ping and 
> connect with no problems.
>
> The two machines are identical and are connected with a crossover cable 
> (point to point).
> lspci identifies the boards as:
> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, 
> PCIe 2.0 2.5GT/s] (rev a0)
>
> What could be the cause of all this? Am I forgetting something?
> Any help is greatly appreciated.
> Thank you,
> Vittorio
I suggest that you run ibv_rc_pingpong to check that the IB 
connectivity is OK.
Then run rping to check that ib_cma is OK.

Those are a good starting point for finding the problem
(run them on every active port that you have).
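
As a rough sketch of those two checks, the script below only prints the
commands to run on each node; it does not run anything itself. The device
name mlx4_0, the port numbers 1 and 2, and the address 192.168.0.1 are
assumptions and not values from this thread: take the real device and port
from ibv_devinfo, and use the IPoIB address configured on the server node.

```shell
#!/bin/sh
# Prints (does not run) the suggested checks for each HCA port.
# ASSUMPTIONS: mlx4_0, ports "1 2" and 192.168.0.1 are placeholders.
DEV="${DEV:-mlx4_0}"
PORTS="${PORTS:-1 2}"
SERVER_IP="${SERVER_IP:-192.168.0.1}"

print_checks() {
    for port in $PORTS; do
        echo "# port $port: verbs-level RC ping-pong"
        # Start the server side first, then the client on the other node.
        echo "server: ibv_rc_pingpong -d $DEV -i $port"
        echo "client: ibv_rc_pingpong -d $DEV -i $port $SERVER_IP"
    done
    echo "# RDMA CM check, addressed through the server's IPoIB address"
    echo "server: rping -s -a $SERVER_IP -v -C 10"
    echo "client: rping -c -a $SERVER_IP -v -C 10"
}

print_checks
```

Since rping selects the port through the address it is given, checking a
second port with it would mean repeating the rping pair with the IPoIB
address that belongs to that port.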


Dotan


