[ewg] RDMACM Differences

Jeff Squyres jsquyres at cisco.com
Mon Feb 28 08:02:58 PST 2011


This looks like an Open MPI-specific question (I barely monitor this email list; I only saw this post by pure chance). 

Can you ping us over on the Open MPI mailing list with this question?  There are more people who can help you there.

    http://www.open-mpi.org/community/lists/ompi.php

Thanks!



On Feb 26, 2011, at 4:05 PM, Jagga Soorma wrote:

> Hello,
> 
> I am running into the following issue while trying to run osu_latency:
> 
> --
> -bash-3.2$ mpiexec --mca btl openib,self -mca btl_openib_warn_default_gid_prefix 0 -np 2 --hostfile mpihosts /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency
> # OSU MPI Latency Test v3.3
> # Size            Latency (us)
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] error modifing QP to RTR errno says Invalid argument
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 6781 on
> node amber04 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> --
> 
> I can get around this by adding the "--mca btl_openib_cpc_include rdmacm" option.  However, I have another host with a different HCA, but with all the same drivers and software versions, on which this same command runs successfully without the rdmacm option.  What could be causing one of my environments to fail while the other works fine (without the rdmacm option)?
> 
> --
> [root@amber03 ~]# ofed_info | grep OFED
> MLNX_OFED_LINUX-1.5.2-1.0.0 (OFED-1.5.2-20101020-1520):
> MLNX_OFED_LINUX-1.5.2-1.0.0 (/mswg/release/ofed-1.5.2-rpms/rnfs-utils/rnfs-utils-1.1.5-10.OFED.src.rpm):
> 
> [root@amber03 ~]# ibv_devinfo
> hca_id:    mlx4_0
>     transport:            InfiniBand (0)
>     fw_ver:                2.7.9294
>     node_guid:            78e7:d103:0021:8884
>     sys_image_guid:            78e7:d103:0021:8887
>     vendor_id:            0x02c9
>     vendor_part_id:            26438
>     hw_ver:                0xB0
>     board_id:            HP_0200000003
>     phys_port_cnt:            2
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        2048 (4)
>             sm_lid:            1
>             port_lid:        20
>             port_lmc:        0x00
>             link_layer:        IB
> 
>         port:    2
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        1024 (3)
>             sm_lid:            0
>             port_lid:        0
>             port_lmc:        0x00
>             link_layer:        Ethernet
> --
> 
> Any help would be greatly appreciated.
> 
> Thanks,
> -J
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


-- 
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



