[ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Steve Wise swise at opengridcomputing.com
Wed Aug 1 06:58:15 PDT 2007


Sean Hefty wrote:
>> The correct solution in my mind is to use the host stack's TCP port
>> space for _all_ RDMA_PS_TCP port allocations.   The patch below is a
>> minimal delta to unify the port spaces by using the kernel stack to
>> bind ports.  This is done by allocating a kernel socket and binding to
>> the appropriate local addr/port.  It also allows the kernel stack to
>> pick ephemeral ports by virtue of just passing in port 0 on the kernel
>> bind operation.
> 
> I'm not thrilled with the idea of overlapping port spaces, and I can't come up
> with a solution that works for all situations.  I understand the overlapping
> port space problem, but I consider the ability to use the same port number for
> both RDMA and sockets a feature.
> 
> What if MPI used a similar mechanism as SDP?  That is, if it gets a port number
> from sockets, it reserves that same RDMA port number, or vice-versa.  The
> rdma_cm advertises separate port spaces from TCP/UDP, so IMO any assumption
> otherwise, at this point, is a bug in the user's code.

This would require EVERY rdma application to do this, or risk stealing 
listening endpoints from sockets applications that are already running. 
This isn't just an MPI issue.  Consider the iperf daemon running on 
port 5001.  Then you run mvapich2/rdma and it happens to bind/listen to 
port 5001.  Then an incoming, innocent iperf connection gets passed to 
the rdma stack in error...

> 
> Before merging the port spaces, I'd like a way for an application to use a
> single well-known port number that works over both RDMA and sockets.
> 

So you want the ability for an application to support a standard 
TCP/Sockets service concurrently with an RDMA service?  To do this 
effectively over iWARP, you need the ability to take a streaming native 
host connection and migrate it to rdma mode, and vice versa.  And that 
model isn't very popular.

Consider NFS and NFS-RDMA.  The NFS gurus struggled with this very issue 
and concluded that the RDMA service needs to be on a separate port. 
Thus they are proposing a new netid/port number for doing RDMA mounts vs 
TCP/UDP mounts.  IMO that is the correct way to go:  RDMA services are 
different from TCP services.  They use a different protocol on top of 
TCP and thus shouldn't be handled on the same TCP port.  So, 
applications that want to offer Sockets and RDMA services concurrently 
would do so by listening on different ports, as in the sketch below...
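
For illustration, here's a rough userspace sketch of that model using 
librdmacm: one well-known port for the plain Sockets service and a 
second one for the RDMA service.  The port numbers (5001/5002) are 
made up and all error handling is omitted:

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

static int listen_both(void)
{
	struct sockaddr_in tcp_addr, rdma_addr;
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	int fd;

	/* Plain TCP/Sockets service on its well-known port. */
	memset(&tcp_addr, 0, sizeof tcp_addr);
	tcp_addr.sin_family = AF_INET;
	tcp_addr.sin_port = htons(5001);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	bind(fd, (struct sockaddr *)&tcp_addr, sizeof tcp_addr);
	listen(fd, 16);

	/* RDMA service on a *different* well-known port. */
	memset(&rdma_addr, 0, sizeof rdma_addr);
	rdma_addr.sin_family = AF_INET;
	rdma_addr.sin_port = htons(5002);
	ch = rdma_create_event_channel();
	rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
	rdma_bind_addr(id, (struct sockaddr *)&rdma_addr);
	rdma_listen(id, 16);

	/* ...poll fd and ch->fd to accept on either service... */
	return 0;
}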


>> RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
> 
> Is there any reason to limit this behavior to TCP only, or would we also include
> UDP?
> 

The iWARP protocols don't include a UDP-based service, so it is not 
needed.  But if you're calling it a UDP port space, maybe it should be 
the host's UDP port space?  Presumably the same trick would apply; see 
the sketch below.
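
A hypothetical UDP analogue of cma_get_tcp_port() from the patch below, 
just swapping in a datagram socket (untested):

/* Hypothetical: bind a kernel UDP socket so the host stack owns
 * the port, mirroring cma_get_tcp_port() below.
 */
static int cma_get_udp_port(struct rdma_id_private *id_priv)
{
	struct socket *sock;
	int ret;

	ret = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
	if (ret)
		return ret;
	ret = sock->ops->bind(sock,
			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
			  ip_addr_size(&id_priv->id.route.addr.src_addr));
	if (ret) {
		sock_release(sock);
		return ret;
	}
	id_priv->sock = sock;
	return 0;
}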

>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>> index 9e0ab04..e4d2d7f 100644
>> --- a/drivers/infiniband/core/cma.c
>> +++ b/drivers/infiniband/core/cma.c
>> @@ -111,6 +111,7 @@ struct rdma_id_private {
>>  	struct rdma_cm_id	id;
>>
>>  	struct rdma_bind_list	*bind_list;
>> +	struct socket		*sock;
> 
> This points off to a rather largish structure...
> 

Yes.  The only exported interfaces into the host port allocation stuff 
require a socket struct.  I didn't want to try and tackle exporting the 
port allocation services at a lower level.  Even at the bottom level, I 
think it still assumes a socket struct...

>>  	struct hlist_node	node;
>>  	struct list_head	list;
>>  	struct list_head	listen_list;
>> @@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
>>  		kfree(bind_list);
>>  	}
>>  	mutex_unlock(&lock);
>> +	if (id_priv->sock)
>> +		sock_release(id_priv->sock);
>>  }
>>
>>  void rdma_destroy_id(struct rdma_cm_id *id)
>> @@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
>>  	return 0;
>>  }
>>
>> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>> +{
>> +	int ret;
>> +	struct socket *sock;
>> +
>> +	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>> +	if (ret)
>> +		return ret;
>> +	ret = sock->ops->bind(sock,
>> +			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
>> +			  ip_addr_size(&id_priv->id.route.addr.src_addr));
>> +	if (ret) {
>> +		sock_release(sock);
>> +		return ret;
>> +	}
>> +	id_priv->sock = sock;
>> +	return 0;
>> +}
>> +
>>  static int cma_get_port(struct rdma_id_private *id_priv)
>>  {
>>  	struct idr *ps;
>> @@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
>>  		break;
>>  	case RDMA_PS_TCP:
>>  		ps = &tcp_ps;
>> +		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
>> +		if (ret)
>> +			goto out;
> 
> Would we need tcp_ps (and udp_ps) anymore?  

I left it all in to show the minimal changes needed to implement the 
functionality, and to keep the patch simple for initial consumption. 
But yes, the rdma-cm really doesn't need to track the port stuff for 
TCP since the host stack does; cma_get_port() could shrink to something 
like the sketch below.
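
A hypothetical, untested sketch of that, with the tcp_ps idr dropped 
for RDMA_PS_TCP (the other port spaces elided for brevity):

static int cma_get_port(struct rdma_id_private *id_priv)
{
	struct idr *ps;
	int ret;

	switch (id_priv->id.ps) {
	case RDMA_PS_SDP:
		ps = &sdp_ps;
		break;
	case RDMA_PS_TCP:
		/* The host stack reserves the port (or picks an
		 * ephemeral one if sin_port == 0); no idr needed. */
		return cma_get_tcp_port(id_priv);
	default:
		return -EPROTONOSUPPORT;
	}

	mutex_lock(&lock);
	if (cma_any_port(&id_priv->id.route.addr.src_addr))
		ret = cma_alloc_port(ps, id_priv, 0);
	else
		ret = cma_use_port(ps, id_priv);
	mutex_unlock(&lock);

	return ret;
}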

> Also, I think SDP maps into the TCP
> port space already, so changes to SDP will be needed as well, which may
> eliminate its port space.

I haven't looked in detail at the SDP code, but I would think it should 
want the TCP port space and not its own; I'm not sure, though.  What is 
the point of the SDP port space anyway?

BTW:  I believe for SDP to work over iWARP, a "shared by all SDP peers" 
port mapper service is required to map TCP port A to a different SDP 
port so that the iWARP SDP connections don't collide with real TCP 
connections for the same application service.  Thus two ports are 
really used.  Regardless, both ports would need to be reserved from the 
host stack port space, or the same issue will happen: some sockets app 
might also bind to the SDP port and be denied service.  I think it's 
all a non-issue for IB since IPoIB and RDMA/IB use different 
QPs/connections/contexts/etc...

One more issue for discussion:

This patch reserves ports exclusively even if they are just 
ephemeral ports used on the active side.  This isn't really required, 
and it limits the number of connections you can have.  The current 
rdma-cm, however, does the same thing.  We could consider using lower 
level services like inet_hash_connect().
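
Related detail: when the kernel bind is done with port 0, the rdma-cm 
still has to learn which ephemeral port the host stack picked.  A 
hypothetical, untested helper (assuming IPv4 and the current inet_sock 
field layout) might look like:

/* After cma_get_tcp_port() binds with sin_port == 0, copy the
 * ephemeral port the host stack chose back into the cm_id's
 * source address.
 */
static void cma_save_tcp_port(struct rdma_id_private *id_priv)
{
	struct sockaddr_in *sin =
		(struct sockaddr_in *)&id_priv->id.route.addr.src_addr;

	/* inet_sk()->sport is already in network byte order. */
	sin->sin_port = inet_sk(id_priv->sock->sk)->sport;
}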



Steve.



