[ewg] bug 1918 - openmpi broken due to rdma-cm changes
Jeff Squyres
jsquyres at cisco.com
Fri Feb 5 14:20:03 PST 2010
On Feb 5, 2010, at 4:53 PM, Steve Wise wrote:
> There is still some inconsistency here. Sean, you claimed binds to
> 127.0.0.1 succeed in ofed-1.4 for IB devices. If so, then folks running
> IB/openmpi/rdmacm should be seeing issues. We need to dig a little more...
FWIW, I can run Open MPI v1.4.2beta on my OFED 1.4.1 cluster over IB devices using RDMA CM with no problems.
I added some debug statements to OMPI showing which RDMA CM binds (rdma_bind_addr() calls) it attempts, just to be sure. Here's a run across 2 nodes, each with a single 2-port mthca (each port is connected to a different IB subnet, not that that matters):
$ mpirun -np 2 --bynode --mca btl_openib_cpc_include rdmacm ring
[svbu-mpi025:05592] FAILED to bind to 127.0.0.1
[svbu-mpi025:05592] FAILED to bind to 172.29.218.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.30.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.20.165
[svbu-mpi026:05529] FAILED to bind to 127.0.0.1
[svbu-mpi026:05529] FAILED to bind to 172.29.218.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.30.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.20.166
...
The 172.x address is my gigE device (eth0).
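For longer runs, output of this form can be tallied per host with a short script. This is only a minimal sketch: it assumes the "[host:pid] SUCCEEDED|FAILED to bind to addr" line format shown above, which comes from ad hoc debug statements and is not stock Open MPI output. The name summarize_binds is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "[svbu-mpi025:05592] FAILED to bind to 127.0.0.1".
# The format is an assumption taken from the debug output in this message.
LINE_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"(?P<status>SUCCEEDED|FAILED) to bind to (?P<addr>\S+)$"
)


def summarize_binds(text):
    """Return {host: {"SUCCEEDED": [addrs], "FAILED": [addrs]}}."""
    summary = defaultdict(lambda: {"SUCCEEDED": [], "FAILED": []})
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip anything that is not a bind report
            summary[match["host"]][match["status"]].append(match["addr"])
    return dict(summary)


# The eight bind reports from the run above.
SAMPLE = """\
[svbu-mpi025:05592] FAILED to bind to 127.0.0.1
[svbu-mpi025:05592] FAILED to bind to 172.29.218.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.30.165
[svbu-mpi025:05592] SUCCEEDED to bind to 10.10.20.165
[svbu-mpi026:05529] FAILED to bind to 127.0.0.1
[svbu-mpi026:05529] FAILED to bind to 172.29.218.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.30.166
[svbu-mpi026:05529] SUCCEEDED to bind to 10.10.20.166
"""

if __name__ == "__main__":
    for host, result in summarize_binds(SAMPLE).items():
        print(host, "ok:", result["SUCCEEDED"], "failed:", result["FAILED"])
```

Grouping by host makes it easy to check that loopback and the gigE address fail on every node while both IB-side addresses succeed.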
--
Jeff Squyres
jsquyres at cisco.com