[ewg] bug 1918 - openmpi broken due to rdma-cm changes
Steve Wise
swise at opengridcomputing.com
Fri Feb 5 08:16:32 PST 2010
Jeff Squyres (jsquyres) wrote:
>
> Note that it is highly unlikely that we will release open mpi 1.4.2 in
> time for ofed 1.5.1.
>
Jeff, is there no way to handle high-priority bug fixes in the current
release stream?
> Also note that trying to bind rdma cm to all interface ip addresses
> was the way that we were advised by openfabrics to figure out which
> devices are rdma-capable.
>
> As such, it is highly desirable to get the fix transparently in rdmacm
> and preserve the old semantic. More specifically, it seems undesirable
> to change this semantic in a minor ofed point release.
>
I agree that we should probably not allow 127.0.0.1 binds in ofed-1.5.1
at all because it regresses OpenMPI. Even with IB systems, if the bind
to 127.0.0.1 succeeds, then OpenMPI assumes 127.0.0.1 is bound to that
rdma interface and advertises this address to its peer as an address
to which that peer can rdma connect! This will break IB clusters too,
not just T3/iWARP clusters. While I think OpenMPI needs to skip
127.0.0.1 in its logic, I think we should probably defer allowing
127.0.0.1 binds until ofed-1.6.
But Jeff, note that if someone uses the upstream kernel and OpenMPI, it's
busted...
So I recommend:
1) Don't allow 127.0.0.1 binds in ofed-1.5.1
2) Fix OpenMPI ASAP to never advertise 127.0.0.1 as a valid rdma-cm
connect address (get it in ofed-1.5.2 or ofed-1.6).
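
To illustrate item 2, here is a rough sketch of the filtering step, written
in Python for brevity rather than as actual OpenMPI code. The function name
advertisable_addresses and the idea that the caller already has a list of
addresses whose rdma-cm bind succeeded are assumptions for illustration
only; nothing here is taken from the OpenMPI or librdmacm sources.

```python
import ipaddress


def advertisable_addresses(bound_addrs):
    """Return the addresses that may be advertised to a peer.

    bound_addrs is a hypothetical list of IP address strings for which
    the rdma-cm bind succeeded. Loopback addresses (127.0.0.0/8 and ::1)
    are dropped, since a peer on another host cannot connect to them.
    """
    keep = []
    for addr in bound_addrs:
        if ipaddress.ip_address(addr).is_loopback:
            # A successful bind to loopback says nothing about rdma
            # reachability from a remote peer, so never advertise it.
            continue
        keep.append(addr)
    return keep
```

With a filter like this in place, a successful bind to 127.0.0.1 would no
longer matter to the peer, whichever rdma-cm behavior the kernel has.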
Steve.