[ewg] bug 1918 - openmpi broken due to rdma-cm changes

Steve Wise swise at opengridcomputing.com
Fri Feb 5 08:16:32 PST 2010


Jeff Squyres (jsquyres) wrote:
>
> Note that it is highly unlikely that we will release open mpi 1.4.2 in 
> time for ofed 1.5.1.
>

Jeff, is there no way to handle high-priority bug fixes in the current 
release stream?

> Also note that trying to bind rdma cm to all interface ip addresses 
> was the way that we were advised by openfabrics to figure out which 
> devices are rdma-capable.
>
> As such, it is highly desirable to get the fix transparently in rdmacm 
> and preserve the old semantic. More specifically, it seems undesirable 
> to change this semantic in a minor ofed point release.
>

I agree that we should probably not allow 127.0.0.1 binds in ofed-1.5.1 
at all because it regresses OpenMPI.  Even on IB systems, if the bind 
to 127.0.0.1 succeeds, then OpenMPI assumes 127.0.0.1 is bound to that 
rdma interface and advertises this address to its peer as an address 
to which that peer can rdma connect!  This will break IB clusters too, 
not just T3/iWARP clusters.  While I think OpenMPI needs to skip 
127.0.0.1 in its logic, I think we should probably defer allowing 
127.0.0.1 binds until ofed-1.6.

But Jeff, note that if someone uses the upstream kernel and OpenMPI, it's 
busted...

So I recommend:

1) Don't allow 127.0.0.1 binds in ofed-1.5.1

2) Fix OpenMPI ASAP to never advertise 127.0.0.1 as a valid rdma-cm 
connect address (get it in ofed-1.5.2 or ofed-1.6).
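
To illustrate (2), here is a minimal sketch of the filtering step, written 
in Python for brevity rather than in OpenMPI's C. It is not OpenMPI's 
actual code: the function name advertisable_addrs and the assumption that 
candidate addresses arrive as a list of strings are hypothetical. The idea 
is only that a successful local bind is not enough; any loopback address 
is dropped before the list is advertised to a peer.

```python
import ipaddress


def advertisable_addrs(candidates):
    """Return the candidate addresses that are safe to advertise to a peer.

    `candidates` is assumed to be a list of IP address strings for which a
    local bind already succeeded (hypothetical input shape).
    """
    keep = []
    for text in candidates:
        addr = ipaddress.ip_address(text)
        # Skip loopback (127.0.0.0/8 and ::1): a remote peer cannot reach
        # us at that address, even though binding to it may succeed locally.
        if addr.is_loopback:
            continue
        keep.append(text)
    return keep


if __name__ == "__main__":
    print(advertisable_addrs(["127.0.0.1", "192.168.1.10", "::1"]))
```

Checking is_loopback rather than comparing against the literal string 
"127.0.0.1" also covers the rest of 127.0.0.0/8 and the IPv6 loopback.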



Steve.


