[ewg] Re: [PATCH] link-local address fix for rdma_resolve_addr
Jason Gunthorpe
jgunthorpe at obsidianresearch.com
Wed Oct 21 18:19:39 PDT 2009
On Wed, Oct 21, 2009 at 05:40:30PM -0700, Sean Hefty wrote:
> >Even so, it still seems OK to me:
> >
> >Path:
> > addr4_resolve_remote
> > $ ip route get 10.0.0.11 from 192.168.122.1
> > local 10.0.0.11 from 192.168.122.1 dev lo
> > srcIP = 192.168.122.1
> > rdma_translate_ip(dst_ip = 10.0.0.11)
> > rdma_copy_addr("eth0");
> > src_dev_addr = eth0.dev_addr (ie GID of 10.0.0.11)
> > memcpy(dst_dev_addr = src_dev_addr) (ie GID of 10.0.0.11)
> >
> >So everything is bound to the GID of 10.0.0.11 which matches the listen
> >of 10.0.0.11, which seems OK.
>
> The source could have called rdma_bind_addr(192.168.122.1) prior to calling
> rdma_resolve_addr(). (DAPL does this.) This would have returned a different
> RDMA device than binding to 10.0.0.11. The client app could have allocated
> resources on that device, but the CM REQ will carry the gid/lid of the other
> device. The endpoints won't be able to communicate.
That is very difficult to fit into the semantics the IP routing
model uses :( And it looks like an API problem in DAPL :(
So, I see now, you are proposing that in this case the connection
attempt be routed through the network and not looped back. I
actually have a big problem with that: ignoring a 'lo' entry in a
routing table is very much not IP-like and not a good idea. That
should be respected.
I guess I'd much rather see that one situation return EHOSTUNREACH or
something.
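To make the EHOSTUNREACH idea concrete, here is a toy Python model (not
kernel code) of that one failure case. The function name and the device
strings are hypothetical; `routed_dev` stands for whatever device the
route lookup maps the destination to, and `bound_dev` for the device an
earlier explicit bind selected, if any.

```python
import errno
import os
from typing import Optional


def resolve_or_fail(bound_dev: Optional[str], routed_dev: str) -> str:
    """Toy model: refuse to resolve when the bind and the route disagree.

    bound_dev  -- device chosen by an earlier explicit bind, or None
    routed_dev -- device the routing lookup yields for the destination
    Both are hypothetical inputs; no real RDMA CM state is consulted.
    """
    if bound_dev is not None and bound_dev != routed_dev:
        # The id is pinned to one device but the route points at another:
        # report the destination as unreachable instead of guessing.
        raise OSError(errno.EHOSTUNREACH, os.strerror(errno.EHOSTUNREACH))
    # Unbound, or bound to the same device the route selected.
    return routed_dev
```

Under this sketch, an id bound to the device owning 192.168.122.1 would
fail to resolve 10.0.0.11 when that address lives on a different device,
while an unbound id would simply follow the route.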
But, I suppose you are going to tell me that Intel MPI uses DAPL to
loopback connect to other processes on the same node, and relies on
this? :( :( :(
Sigh. Anyhow, let's not get sidetracked. It seems to me the easy way
out for David's approach is to simply check whether the cm_id is already
bound via rdma_bind_addr() and, if so, force it to that device no matter
what the routing table lookup returns. Can you suggest a reliable way to
make that check?
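As a rough illustration of the rule being proposed, here is a toy Python
model (not kernel code, and not David's patch). `CmId`, `pick_device`,
and the device strings are all hypothetical names; how the real code
would detect "already bound" is exactly the open question above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CmId:
    """Toy stand-in for an rdma_cm_id; not the kernel structure."""
    bound_dev: Optional[str] = None  # set by an earlier explicit bind, else None


def pick_device(cm_id: CmId, routed_dev: str) -> str:
    """Return the device that address resolution should use.

    routed_dev is whatever device the routing-table lookup mapped the
    destination to (a hypothetical input, e.g. "ib1").
    """
    if cm_id.bound_dev is not None:
        # Already bound: honor that device no matter what the route says.
        return cm_id.bound_dev
    # Not bound yet: follow the routing table and bind to its answer.
    cm_id.bound_dev = routed_dev
    return routed_dev
```

In this sketch, a client that never calls bind gets plain routing-table
behavior, and a client that bound first keeps its device even when the
lookup points elsewhere.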
[What happens now if I do this:
  rdma_bind_addr(10.0.0.11)
  rdma_resolve_addr(src = 192.168.122.1, dst = 10.0.0.11)
Does the cma_bind path check that it is already bound and return an
error? It is too late for me to check.]
Once the cma_bind for rdma_resolve_addr is moved into the
addr_resolve_remote function then people using the API without calling
bind on the client path will get sane IP-like behavior.
> Yes, it's weird, and may not be optimal, but if a source address is
> explicitly given, then its mapping to a specific RDMA device should
> be honored.
Remember, on Linux the IP is *not* attached to a device, it is part of
the host itself. So the idea that a source address somehow specifies an
RDMA device does not fit into the Linux IP networking model.
Unfortunately the definition of rdma_bind_addr kinda bakes this mismatched
model into the API :(
Truth be told, to fit the Linux IP model, the RDMA CM should have
provided exactly two ways to bind a cm_id to a specific device:
rdma_accept and rdma_resolve_addr.
Jason