[ewg] bug 1918 - openmpi broken due to rdma-cm changes
Steve Wise
swise at opengridcomputing.com
Thu Feb 4 15:04:23 PST 2010
Sean Hefty wrote:
>> Well then the rdma-cm needs to know which devices support hw loopback.
>> Cuz on a T3-only system, no hwloop...
>>
>
> The problem sounds like it's more than just whether 127.0.0.1 is usable. That
> check may fix openmpi, but it sounds more like the app needs to know whether the
> device can actually support loopback, regardless of what addresses are used. Is
> this correct?
>
> What would openmpi do if there were two addresses assigned to the T3 device?
>
It would use them and might even create two connections.
> Does openmpi simply bypass RDMA for all connections on the local machine?
>
>
OpenMPI can be run to use hw loopback if it's available. For T3
clusters, OMPI is run in a mode that uses shared memory for intra-node
communication.
> Basically, I'm not sure that this is *just* an rdma_cm issue. Although it
> definitely appears that some sort of change needs to be made to the rdma_cm.
>
>
I think the OpenMPI rdmacm code needs to skip 127.0.0.1 in this
particular case. Prior to ofed-1.5.1, however, the bind would fail and
thus OpenMPI would not advertise 127.0.0.1 to its peer. I will work to
get that OpenMPI change done.
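A rough sketch of that skip, for illustration only. It is Python rather than the actual OpenMPI C code, and `advertisable_addresses` and the sample addresses are made up here, not OpenMPI names. It also drops every loopback address, which is an assumption and is broader than just 127.0.0.1:

```python
import ipaddress


def advertisable_addresses(candidates):
    """Return the candidate IP addresses with loopback ones removed.

    Hypothetical helper: 'candidates' is a list of address strings that a
    process might otherwise advertise to its peer.
    """
    kept = []
    for text in candidates:
        addr = ipaddress.ip_address(text)
        # Covers 127.0.0.0/8 and ::1, not only 127.0.0.1 (an assumption).
        if addr.is_loopback:
            continue
        kept.append(text)
    return kept


if __name__ == "__main__":
    print(advertisable_addresses(["127.0.0.1", "192.168.1.10"]))
```

With a filter like this, the peer never sees the loopback address, which matches what happened before ofed-1.5.1 when the bind simply failed.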
But let's also add a device attribute so the rdmacm can know if a device
supports loopback. Clearly, if the rdma-cm allows loopback binds to T3,
loopback connections will fail at connect time.
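To make the idea concrete, here is a toy model of the check. It is purely hypothetical: the `hw_loopback` attribute, the `Device` class, and `may_bind` are invented for illustration and are not existing rdma-cm or verbs names, and it is written in Python rather than kernel C:

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    hw_loopback: bool  # hypothetical attribute: device can loop back in hw


def may_bind(device, address):
    """Allow a bind unless the address is loopback and the device cannot
    do hw loopback. Sketch of the proposed policy only."""
    if ipaddress.ip_address(address).is_loopback:
        return device.hw_loopback
    return True


if __name__ == "__main__":
    t3 = Device("t3_dev", hw_loopback=False)  # T3 has no hw loopback
    print(may_bind(t3, "127.0.0.1"))
    print(may_bind(t3, "192.168.1.10"))
```

Failing at bind time this way would surface the problem before any loopback connect is attempted.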
Hey Roland, are you ok with a device attribute to indicate hw-loopback
support?
Steve.