[ewg] bug 1918 - openmpi broken due to rdma-cm changes
Steve Wise
swise at opengridcomputing.com
Fri Feb 5 11:01:55 PST 2010
Sean Hefty wrote:
>> Is the issue 6f8372b6 ("RDMA/cm: fix loopback address support")? This
>> just went in for 2.6.33, which is still at -rc6, so if we can quickly
>> reach a consensus, there is still time to get a fix in for 2.6.33.
>>
>
> That should be the patch in question. I'm not sure about reaching consensus. :)
> If the other changes to the rdma_cm aren't closely tied to that change, we may
> be able to back that one patch out until we can get whatever other fix may be
> needed.
>
I'd like to take this approach, then re-submit once we come to consensus...
> In my view, openmpi has a bug in that it can pass a loopback address to a remote
> peer and expect it to be used to establish a connection. Steve seems to agree
> with this.
>
> My original intent was to allow the use of the loopback address with the
> rdma_cm. I.e. 127.0.0.1 meant 'this host', and not 'software loopback'. I just
> had Arlin run a quick test with OFED 1.4 over IB, and it allows binding to
> 127.0.0.1, but never forms connections. I.e. ucmatose -b 127.0.0.1 succeeds in
> listening, but ucmatose -s 127.0.0.1 fails to connect because of a route error.
> (Hmm... I'm still confused about what openmpi is doing then.)
>
But it must fail in OFED-1.4 if binding to an iwarp interface. Maybe
there was IB-only logic allowing 127.0.0.1 binds in OFED-1.4?
The reason openmpi might still work on IB is that it's not typical to use
the rdma-cm for IB setups. It's required for iwarp though.
Jeff, what's the default CPC for IB devices?
> Even if an application were to use non-loopback IP addresses, there's no
> guarantee of forming a connection if those addresses map to an iwarp device.
> So, even if the rdma_cm fails binding to 127.0.0.1 unless there's some RDMA
> device (software or hardware - not sure why we care) capable of supporting it,
> an application would need to also deal with failures from rdma_resolve_addr.
>
> Indicating loopback through a device capability flag seems like the right
> approach, and the rdma_cm can use this to fail rdma_bind_addr/rdma_resolve_addr
> calls. That's probably not a trivial patch however.
>
> - Sean
>
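For illustration only, here is a minimal sketch of the application-side fix discussed above: drop loopback addresses from the list an application advertises to remote peers, since 127.0.0.1 on the remote side refers to the remote host itself. This is not Open MPI's actual code; the function name `advertisable_addresses` and the sample addresses are hypothetical, and it uses only Python's standard `ipaddress` module.

```python
import ipaddress


def advertisable_addresses(candidates):
    """Return the candidate addresses that are meaningful to a remote peer.

    Hypothetical helper: loopback (127.0.0.0/8, ::1) and unspecified
    (0.0.0.0, ::) addresses only make sense on the local host, so they
    are filtered out before the list is sent to another node.
    """
    usable = []
    for text in candidates:
        addr = ipaddress.ip_address(text)
        if addr.is_loopback or addr.is_unspecified:
            continue  # never hand these to a remote peer
        usable.append(text)
    return usable


if __name__ == "__main__":
    # Sample addresses are made up for the example.
    local = ["127.0.0.1", "192.168.1.10", "::1", "10.0.0.5"]
    print(advertisable_addresses(local))
```

Even with such a filter, the remaining addresses may not map to an RDMA-capable device, so the application would still need to handle failures when resolving or connecting.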