[ewg] bug 1918 - openmpi broken due to rdma-cm changes

Steve Wise swise at opengridcomputing.com
Fri Feb 5 11:01:55 PST 2010


Sean Hefty wrote:
>> Is the issue 6f8372b6 ("RDMA/cm: fix loopback address support")?  This
>> just went in for 2.6.33, which is still at -rc6, so if we can quickly
>> reach a consensus, there is still time to get a fix in for 2.6.33.
>>     
>
> That should be the patch in question.  I'm not sure about reaching consensus. :)
> If the other changes to the rdma_cm aren't closely tied to that change, we may
> be able to back that one patch out until we can get whatever other fix may be
> needed.
>   

I'd like to take this approach.  Then we can re-submit once we come to consensus...

> In my view, openmpi has a bug in that it can pass a loopback address to a remote
> peer and expect it to be used to establish a connection.  Steve seems to agree
> with this.
>
> My original intent was to allow the use of the loopback address with the
> rdma_cm.  I.e. 127.0.0.1 meant 'this host', and not 'software loopback'.  I just
> had Arlin run a quick test with OFED 1.4 over IB, and it allows binding to
> 127.0.0.1, but never forms connections.  I.e. ucmatose -b 127.0.0.1 succeeds in
> listening, but ucmatose -s 127.0.0.1 fails to connect because of a route error.
> (Hmm... I'm still confused about what openmpi is doing then.)
>   

But it must fail in OFED-1.4 if binding to an iwarp interface.   Maybe 
there was IB-only logic allowing 127.0.0.1 binds in OFED-1.4?   

The reason openmpi might still work on IB is that it's not typical to use 
the rdma-cm for IB setups.  It's required for iwarp though.

Jeff, what's the default CPC for IB devices?

> Even if an application were to use non-loopback IP addresses, there's no
> guarantee of forming a connection if those addresses map to an iwarp device.
> So, even if the rdma_cm fails binding to 127.0.0.1 unless there's some RDMA
> device (software or hardware - not sure why we care) capable of supporting it,
> an application would need to also deal with failures from rdma_resolve_addr.
>
> Indicating loopback through a device capability flag seems like the right
> approach, and the rdma_cm can use this to fail rdma_bind_addr/rdma_resolve_addr
> calls.  That's probably not a trivial patch however.
>
> - Sean
>   
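
To make the loopback point in the quoted text concrete: before an application
hands its addresses to a remote peer, it could drop anything in 127.0.0.0/8,
since such an address only means 'this host' locally.  A minimal sketch in C
using only the standard socket headers.  is_advertisable() and
filter_advertisable() are hypothetical names made up for illustration; they
are not Open MPI or librdmacm functions, and only IPv4 is handled.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true only if `text` parses as an IPv4
 * address and is NOT in the loopback block 127.0.0.0/8.
 * Unparseable strings are rejected rather than passed along.
 */
static bool is_advertisable(const char *text)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return false;                       /* not a valid IPv4 address */

    /* s_addr is in network byte order; the top octet identifies 127/8. */
    return (ntohl(addr.s_addr) >> 24) != 127;
}

/*
 * Hypothetical helper: copies the advertisable entries of `in` into `out`
 * (which must have room for `n` pointers) and returns how many were kept.
 */
static size_t filter_advertisable(const char *const *in, size_t n,
                                  const char **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_advertisable(in[i]))
            out[kept++] = in[i];
    }
    return kept;
}
```

Given the list 127.0.0.1, 10.0.0.5, 172.16.3.4, the filter keeps only the last
two.  As noted in the quoted text, this does not guarantee a connection: a
non-loopback address can still map to a device that cannot reach the peer, so
the caller would still need to handle rdma_resolve_addr failures.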



