[ewg] bug 1918 - openmpi broken due to rdma-cm changes

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Fri Feb 5 13:14:55 PST 2010


On Fri, Feb 05, 2010 at 03:08:10PM -0500, Jeff Squyres wrote:
> On Feb 5, 2010, at 1:56 PM, Jason Gunthorpe wrote:
> 
> > > I think we should remove the feature of allowing binds to 127.0.0.1 
> > > altogether based on Jeff's arguments and my assertion that 127.0.0.1 is 
> > > a sw-loopback mechanism anyway...
> > 
> > I don't agree, the kernel should be free to provide a loop back
> > service any way it likes, and if that means using one of the HW
> 
> Ok, fine.  Should we push back OFED 1.5.1 until Open MPI can get 1.4.2 out?  I don't know when that will be.
 
> In short: you're breaking backward compatibility with zero warning.
> There is real software out there that will break if people upgrade
> their kernel/OFED/RDMA CM/whatever (e.g., Open MPI).  Isn't this
> supposed to be the Enterprise distribution (meaning: stability)?
> (trying to keep the frustration out of my voice...)

Well, I think you are right. This kind of change seems appropriate to
me for mainline, but OFED/RHEL should carry the responsibility of managing
an identified incompatibility: either patch their kernel, patch their
OMPI, or publish an erratum. That is the role of a distribution.

> How about this: back out the change for now.  Give everyone time to
> upgrade.  If nothing else, ***give those of us who are involved in
> this community*** time to upgrade.  Then put the feature back in
> after adequate time has passed.

I've seen this approach go badly too :( If it isn't actually in a
mainline kernel, userspace devs tend to ignore it.

Sounds like this is taken care of for now anyhow. Sean's patch to remove
it for iwarp, since it doesn't work today with any iwarp drivers, does
obscure the problem. But it does seem like rdma_cm mode for IB
networks will still be broken in OMPI with the new kernels.

Jason


