[ewg] bug 1918 - openmpi broken due to rdma-cm changes

Jeff Squyres jsquyres at cisco.com
Fri Feb 5 14:20:03 PST 2010


On Feb 5, 2010, at 4:53 PM, Steve Wise wrote:

> There is still some inconsistency here.   Sean, you claimed binds to
> 127.0.0.1 succeed in ofed-1.4 for IB devices.  If so, then folks running
> IB/openmpi/rdmacm should be seeing issues.  We need to dig a little more...

FWIW, I can run Open MPI v1.4.2beta on my OFED 1.4.1 cluster over IB devices using RDMA CM with no problems.  

I added some debug statements in OMPI showing which addresses it attempts to bind via RDMA CM (rdma_bind_addr), just to be sure.  Here's a run across 2 nodes, each with a single 2-port mthca HCA (each port is connected to a different IB subnet, not that it matters):

$ mpirun -np 2 --bynode --mca btl_openib_cpc_include rdmacm ring
[svbu-mpi025:05592] FAILED to bind to 127.0.0.1
[svbu-mpi025:05592] FAILED to bind to 172.29.218.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.30.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.20.165
[svbu-mpi026:05529] FAILED to bind to 127.0.0.1
[svbu-mpi026:05529] FAILED to bind to 172.29.218.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.30.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.20.166
...

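The 172.x address is my gigE device (eth0).  So on both nodes the binds to loopback and to the Ethernet address fail, and only the two 10.10.x addresses bind.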
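
In case it helps anyone compare runs across OFED versions, here is a rough sketch of a script that tallies the debug output above per host.  It is only an illustration: the function name summarize_binds and the regular expression are my own assumptions, and the line format is taken from the ad-hoc debug statements shown above, not from anything Open MPI prints by default.

```python
import re
from collections import defaultdict

# Sample input: the debug lines from the run above, copied verbatim.
LOG = """\
[svbu-mpi025:05592] FAILED to bind to 127.0.0.1
[svbu-mpi025:05592] FAILED to bind to 172.29.218.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.30.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.20.165
[svbu-mpi026:05529] FAILED to bind to 127.0.0.1
[svbu-mpi026:05529] FAILED to bind to 172.29.218.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.30.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.20.166
"""

# Assumed line format: "[host:pid] FAILED|SUCCEEDED to bind to <address>".
LINE_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"(?P<status>FAILED|SUCCEEDED) to bind to (?P<addr>\S+)$"
)


def summarize_binds(text):
    """Group bind attempts by host, keeping addresses in log order."""
    result = defaultdict(lambda: {"SUCCEEDED": [], "FAILED": []})
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip anything that is not a bind line
            result[match["host"]][match["status"]].append(match["addr"])
    return dict(result)


summary = summarize_binds(LOG)
for host in sorted(summary):
    binds = summary[host]
    print(host, "bound:", ", ".join(binds["SUCCEEDED"]),
          "| rejected:", ", ".join(binds["FAILED"]))
```

Feeding it the mpirun output from a different OFED release should make it easy to spot whether 127.0.0.1 moves from the rejected column to the bound one.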

-- 
Jeff Squyres
jsquyres at cisco.com

