[openib-general] RE: [RFC] support transports whose native endpoint is not a socket

Caitlin Bestler caitlinb at broadcom.com
Mon Mar 6 09:20:49 PST 2006


 
> ---------- Forwarded message ----------
> From: Or Gerlitz <ogerlitz at voltaire.com>
> Date: Mar 6, 2006 5:40 AM
> Subject: [RFC] support transports whose native endpoint is 
> not a socket
> To: open-iscsi at googlegroups.com
> 
> 
> The patch below is an initial drop (which compiles and works with the 
> open-iscsi-1.0-485 release with TCP as the transport) which should
> depict my understanding of the direction suggested in the 
> related thread
> last week.
> 
> It only implements the TCP case for now; before implementing the 
> user/kernel 
> part I'd like to confirm that I understand the following correctly:
> 
> +1 the discovery code eventually also calls iscsi_tcp_connect and
>    iscsi_io_disconnect, since discovery is always carried out
>    over TCP/IP. So I left these calls from discovery.c untouched -
>    am I right here? Also, can I assume that only calls from 
> discovery.c
>    must always go over TCP/IP, and any other call should go via
>    ipc->transport_connect/poll/disconnect?
> 
> +2 some places in the code call close(conn->socket_fd) 
> directly, so they
>    need to be changed to either iscsi_io_disconnect() or
>    ipc->transport_disconnect()?
> 
> +3 some places in the code read/write directly from the 
> socket; this is 
>    done when (conn->kernel_io == 0). Again, as in discovery, do we
>    read/write directly from user space and later always via 
> the kernel?
> 
>    I was somewhat confused here, since I see that iscsi_login sets 
>    conn->kernel_io = 0, but on the other hand I know that login
>    requests/responses are --not-- sent directly from user space.
> 
> 

Doing discovery session connections through the host stack is
fine, because nobody really needs hardware acceleration of
discovery sessions.

But properly enabling the non-discovery sessions is trickier.

Attempting to make non-socket devices look like socket devices
is just going down the wrong path. There are too many variations
in the underlying mechanisms and wire protocols. Making phony
sockets is especially problematic when it requires creating a
pseudo-socket with a pseudo address type even though the actual
wire connection is plain TCP using an IPv4 or IPv6 address.

The strategy I would suggest instead is one that recognizes
that there are numerous types of iSCSI devices (software over
TCP, hardware using TCP, software doing iSER over an iWARP
device, software doing iSER over an IB device, and probably
some more that I don't know about yet). They all vary in how
connections are established and how packets are generated.

The strategy I suggest is as follows:

1) As a starting point, work from the CMA interface that is already
   part of openib/gen2. This existing interface already abstracts
   connection establishment over either IP or IB.

   The only real extension needed to it is to allow a given iSCSI
   device to use the "qp" field as a socket. That way iscsi_tcp
   can use a host stack socket as it currently does, but the various
   offload devices can use QPs.

2) Once the normal iSCSI connection is established using the CMA
   (or a CMA-like) interface, it can be handed off to the kernel
   iSCSI code just as is done now with the simple socket handle.
   That would be unchanged for iscsi_tcp.

3) The socket/qp handle would then be used to exchange startup-phase
   PDUs.

	a) for iscsi_tcp these are write()s and read()s through
	   the socket and the host TCP/IP stack.

	b) for iSCSI offload and iSER/iWARP these are work requests
	   for the qp to post/receive startup-phase PDUs over the
	   established TCP/IP connection in streaming mode.

	c) for iSER/IB these are RDMA Send/Recv work requests
	   for the QP that has an established IB RC connection.

4) As a result of these exchanges the socket/qp handle is either
   disconnected and closed, or transitioned to its operational mode
   (iSCSI or iSER):

	a) for iscsi_tcp this is a nop, or it enables handling of
	   ffp-PDUs.
	b) for iSCSI/offload it enables ffp-task-work requests.
	c) for iSER/iWARP it enables RDMA mode on the QP's connection.
	d) for iSER/IB it would enable remote RDMA access. If going to
	   "streaming mode" is supported it would require shifting to a
	   pseudo-TCP connection over IPoIB. The cost-to-benefit ratio
	   on such a project is truly staggering.

Basically, the two options are to hide the underlying connection (the
CMA approach, and what is proposed above), or to create a
proxy connection when it cannot be done honestly. The latter is mildly
ugly for iSER/IB -- but it is extremely ugly for iSCSI offload or
iSCSI/iWARP. A "pseudo-socket" for iSCSI/iWARP would require
user-mode code to specify, under an *alternate* socket family, a remote
IP address for a target that *is* reachable by a plain TCP connection.

Plus the pseudo-socket would have to proxy many socket features,
not just sending and receiving the MPA requests. These include
socket inheritance, dealing with aborted clients, setting MTU
sizes and TCP keepalive.

