[ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Steve Wise
swise at opengridcomputing.com
Wed Aug 1 06:58:15 PDT 2007
Sean Hefty wrote:
>> The correct solution in my mind is to use the host stack's TCP port
>> space for _all_ RDMA_PS_TCP port allocations. The patch below is a
>> minimal delta to unify the port spaces by using the kernel stack to
>> bind ports. This is done by allocating a kernel socket and binding to
>> the appropriate local addr/port. It also allows the kernel stack to
>> pick ephemeral ports by virtue of just passing in port 0 on the kernel
>> bind operation.
>
> I'm not thrilled with the idea of overlapping port spaces, and I can't come up
> with a solution that works for all situations. I understand the overlapping
> port space problem, but I consider the ability to use the same port number for
> both RDMA and sockets a feature.
>
> What if MPI used a similar mechanism as SDP? That is, if it gets a port number
> from sockets, it reserves that same RDMA port number, or vice-versa. The
> rdma_cm advertises separate port spaces from TCP/UDP, so IMO any assumption
> otherwise, at this point, is a bug in the user's code.
This would require EVERY rdma application to do this, or it could
potentially steal listening endpoints from sockets applications already
running. This isn't just an MPI issue. Consider the iperf daemon running
on port 5001. Then you run mvapich2/rdma and it happens to bind/listen to
port 5001. Then an innocent incoming iperf connection gets passed to the
rdma stack in error...
>
> Before merging the port spaces, I'd like a way for an application to use a
> single well-known port number that works over both RDMA and sockets.
>
So you want the ability for an application to support a standard
TCP/Sockets service concurrently with an RDMA service? To do this
effectively over iWARP, you need the ability to take a streaming native
host connection and migrate it to rdma mode and vice-versa. And that
model isn't very popular.
Consider NFS and NFS-RDMA. The NFS gurus struggled with this very issue
and concluded that the RDMA service needs to be on a separate port.
Thus they are proposing a new netid/port number for doing RDMA mounts vs
TCP/UDP mounts. IMO that is the correct way to go: RDMA services are
different than TCP services. They use a different protocol on top of
TCP and thus shouldn't be handled on the same TCP port. So,
applications that want to service Sockets and RDMA services concurrently
would do so by listening on different ports...
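As a rough illustration of what I mean (not part of this patch; the port
numbers and error handling are just placeholders), a server could do
something like the following with librdmacm, listening on one port for
the plain Sockets service and on a different port for the RDMA service:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct sockaddr_in tcp_addr = { .sin_family = AF_INET,
					.sin_port = htons(5001) };
	struct sockaddr_in rdma_addr = { .sin_family = AF_INET,
					 .sin_port = htons(5002) };
	struct rdma_event_channel *ch;
	struct rdma_cm_id *listen_id;
	int fd;

	/* Plain TCP/Sockets listener on port 5001. */
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0 ||
	    bind(fd, (struct sockaddr *)&tcp_addr, sizeof(tcp_addr)) ||
	    listen(fd, 16))
		return 1;

	/* Separate rdma_cm listener on port 5002 (RDMA_PS_TCP port space). */
	ch = rdma_create_event_channel();
	if (!ch ||
	    rdma_create_id(ch, &listen_id, NULL, RDMA_PS_TCP) ||
	    rdma_bind_addr(listen_id, (struct sockaddr *)&rdma_addr) ||
	    rdma_listen(listen_id, 16))
		return 1;

	printf("sockets service on 5001, rdma service on 5002\n");
	/* ... accept() and rdma_get_cm_event() loops would go here ... */
	return 0;
}

With the port spaces unified as in the patch below, both binds come out
of the same host TCP port space, so neither service can silently steal
the other's port.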
>> RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
>
> Is there any reason to limit this behavior to TCP only, or would we also include
> UDP?
>
The iWARP protocols don't include a UDP-based service, so it is not
needed. But if you're calling it a UDP port space, maybe it should be
the host's UDP port space?
>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>> index 9e0ab04..e4d2d7f 100644
>> --- a/drivers/infiniband/core/cma.c
>> +++ b/drivers/infiniband/core/cma.c
>> @@ -111,6 +111,7 @@ struct rdma_id_private {
>> struct rdma_cm_id id;
>>
>> struct rdma_bind_list *bind_list;
>> + struct socket *sock;
>
> This points off to a rather largish structure...
>
Yes. The only exported interfaces into the host port allocation stuff
require a socket struct. I didn't want to try and tackle exporting the
port allocation services at a lower level. Even at the bottom level, I
think it still assumes a socket struct...
>> struct hlist_node node;
>> struct list_head list;
>> struct list_head listen_list;
>> @@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
>> kfree(bind_list);
>> }
>> mutex_unlock(&lock);
>> + if (id_priv->sock)
>> + sock_release(id_priv->sock);
>> }
>>
>> void rdma_destroy_id(struct rdma_cm_id *id)
>> @@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
>> return 0;
>> }
>>
>> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>> +{
>> + int ret;
>> + struct socket *sock;
>> +
>> + ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>> + if (ret)
>> + return ret;
>> + ret = sock->ops->bind(sock,
>> + (struct sockaddr *)&id_priv->id.route.addr.src_addr,
>> + ip_addr_size(&id_priv->id.route.addr.src_addr));
>> + if (ret) {
>> + sock_release(sock);
>> + return ret;
>> + }
>> + id_priv->sock = sock;
>> + return 0;
>> +}
>> +
>> static int cma_get_port(struct rdma_id_private *id_priv)
>> {
>> struct idr *ps;
>> @@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
>> break;
>> case RDMA_PS_TCP:
>> ps = &tcp_ps;
>> + ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
>> + if (ret)
>> + goto out;
>
> Would we need tcp_ps (and udp_ps) anymore?
I left it all in to show the minimal changes needed to implement the
functionality and to keep the patch simple for initial consumption. But
yes, the rdma-cm really doesn't need to track the port space for TCP
since the host stack does.
> Also, I think SDP maps into the TCP
> port space already, so changes to SDP will be needed as well, which may
> eliminate its port space.
I haven't looked in detail at the SDP code, but I would think it should
want the TCP port space and not its own anyway; I'm not sure, though.
What is the point of the SDP port space anyway?
BTW: I believe for SDP to work over iWARP, a "shared by all SDP peers"
port mapper service is required to map TCP port A to a different SDP
port so that the iwarp SDP connections don't collide with real TCP
connections for the same application service. Thus 2 ports are really
used. Regardless, both ports would need to be reserved from the host
stack port space, or the same issue will happen: some sockets app might
also bind to the SDP port and be denied service. I think it's all a
non-issue for IB, since IPoIB and RDMA/IB use different
QPs/connections/contexts/etc...
One more issue for discussion:
This patch reserves ports exclusively even if they are just ephemeral
ports used on the active side. This isn't really required, and it limits
the number of connections you can have. The current rdma-cm, however,
does the same thing. We could consider using lower-level services like
inet_hash_connect().
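For example (just a sketch to show the idea, not something in this
patch; it assumes IPv4, that the destination address is already known,
and the id_priv->sock field added by the patch), the active side could
pick a shareable ephemeral port the same way tcp_v4_connect() does:

#include <linux/in.h>
#include <net/tcp.h>
#include <net/inet_sock.h>
#include <net/inet_hashtables.h>

static int cma_pick_ephemeral_port(struct rdma_id_private *id_priv,
				   struct sockaddr_in *dst)
{
	struct sock *sk = id_priv->sock->sk;
	struct inet_sock *inet = inet_sk(sk);

	/*
	 * inet_hash_connect() keys the port choice off the 4-tuple, so
	 * the destination has to be filled in first.
	 */
	inet->daddr = dst->sin_addr.s_addr;
	inet->dport = dst->sin_port;

	/*
	 * Pick an ephemeral source port and hash the socket into the
	 * established table without reserving the port exclusively,
	 * just like the normal TCP connect path.
	 */
	return inet_hash_connect(&tcp_death_row, sk);
}

That would let many active-side connections share local ports the same
way ordinary TCP connects do, instead of burning one exclusively bound
port per connection.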
Steve.