[ofw] port RDS to OFW - request for input...

Richard Frank richard.frank at oracle.com
Wed Jul 22 14:17:55 PDT 2009


Sean Hefty wrote:
> Can you explain how Oracle will interface to RDS?  Does it expect to use normal
> socket calls? 
RDS clients access RDS via socket interfaces, including using
CMSGs to asynchronously initiate RDMA ops and atomics
and to retrieve completion notifications.
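
To make that concrete, here's a minimal sketch of what kicking off an
RDMA op through a CMSG looks like on the initiator side, assuming the
Linux <linux/rds.h> definitions (SOL_RDS, RDS_CMSG_RDMA_ARGS, struct
rds_rdma_args); the exact structs and flags may differ across ports:

/* Post an async RDMA write to a peer's registered buffer. The
 * completion comes back later as an RDS_CMSG_RDMA_STATUS ancillary
 * message on recvmsg(), carrying user_token. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/rds.h>

#ifndef PF_RDS
#define PF_RDS 21                 /* AF_RDS on Linux; older headers lack it */
#endif

int rdma_write_to_peer(int rds_sock, struct sockaddr_in *peer,
                       rds_rdma_cookie_t cookie,    /* rkey handed to us by the peer */
                       struct rds_iovec *local_vec) /* local buffer to push */
{
    struct rds_rdma_args args;
    struct iovec iov = { .iov_base = "go", .iov_len = 2 }; /* immediate payload */
    struct msghdr msg;
    char cbuf[CMSG_SPACE(sizeof(args))];
    struct cmsghdr *cmsg;

    memset(&args, 0, sizeof(args));
    args.cookie = cookie;               /* identifies the peer's registered region */
    args.remote_vec.addr = 0;           /* offset into that region */
    args.remote_vec.bytes = local_vec->bytes;
    args.local_vec_addr = (uint64_t)(unsigned long)local_vec;
    args.nr_local = 1;
    args.flags = RDS_RDMA_READWRITE | RDS_RDMA_NOTIFY_ME;
    args.user_token = 42;               /* echoed back in the status CMSG */

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = peer;
    msg.msg_namelen = sizeof(*peer);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_ARGS;   /* async initiation via CMSG */
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));

    return sendmsg(rds_sock, &msg, 0);
}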

>  Does it access it through an existing kernel driver? 
No.
>  If the
> interface were through a user space library, would that work?
>
>   
We've been talking about this... we have a few requirements wrt the
port.
Ideally we would continue with our kernel-mode RDS implementation.

1) The same RDS code base runs on multiple platforms today (3),
with two remaining ports in progress (Windows + one other). We
would like to retain as much of the common core RDS code as possible,
to gain interoperability (wire protocol) with RDS running on other
platforms (a requirement),
while leveraging the extensive testing completed for the common RDS code base.

2) We would like to retain the same RDS interface semantics / behavior
across all
RDS ports. We need to minimize the differences between ports to reduce /
remove any coding changes for the RDS clients that interoperate across
platforms,
as well as to facilitate long-term maintenance.

For example, the same RDS client (Oracle) runs on all platforms today -
we cannot
change the client.

Our current thinking is that we would create a thin shim driver on Windows
which wraps the core RDS driver, in conjunction with a thin Winsock
provider DLL
to redirect socket ops to our Windows shim driver / RDS. The shim driver
will interact directly with the RDS core driver, providing interfaces to
local
Windows services for memory management, threads, etc., along with
verbs + CMA support, etc., all from kernel mode.
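
As a rough illustration of that layering (purely a sketch - the device
name, IOCTL code, and startup path are hypothetical; only WSPSend and
DeviceIoControl are real Windows interfaces), the provider DLL would do
something like:

#include <winsock2.h>
#include <ws2spi.h>
#include <winioctl.h>

/* Hypothetical IOCTL understood by the kernel shim driver. */
#define IOCTL_RDS_SEND \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

static HANDLE g_rds_shim;  /* shim device handle, opened during WSPStartup */

/* Winsock SPI entry point (reached via the proc table a real provider
 * registers in WSPStartup): redirect the send straight into the kernel
 * shim, which hands it to the common RDS core - verbs + CM all in
 * kernel mode, as described above. */
int WSPAPI WSPSend(SOCKET s, LPWSABUF lpBuffers, DWORD dwBufferCount,
                   LPDWORD lpNumberOfBytesSent, DWORD dwFlags,
                   LPWSAOVERLAPPED lpOverlapped,
                   LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine,
                   LPWSATHREADID lpThreadId, LPINT lpErrno)
{
    (void)s; (void)dwFlags; (void)lpCompletionRoutine; (void)lpThreadId;

    if (!DeviceIoControl(g_rds_shim, IOCTL_RDS_SEND,
                         lpBuffers, dwBufferCount * sizeof(WSABUF),
                         NULL, 0, lpNumberOfBytesSent,
                         (LPOVERLAPPED)lpOverlapped)) {
        *lpErrno = WSASYSCALLFAILURE;
        return SOCKET_ERROR;
    }
    return 0;
}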

Another possibility would be to leverage the RDS modular
framework that allows transports to be added: we could add
a Windows-specific RDS / IB transport to interact with IB
on Windows, for example.

3) For RDMA operations, RDS clients tend to register a buffer,
hand the buffer key to another node and ask it to initiate an RDMA,
and when the RDMA completes, unregister the buffer.
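
On Linux today that pattern maps onto a pair of socket options; a
minimal sketch, assuming the <linux/rds.h> RDS_GET_MR / RDS_FREE_MR
definitions:

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/rds.h>

/* Register (pin) a buffer; the cookie written back is the key we hand
 * to the other node so it can initiate the RDMA. */
static int rds_register(int sock, void *buf, size_t len,
                        rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args mr;
    mr.vec.addr = (uint64_t)(unsigned long)buf;
    mr.vec.bytes = len;
    mr.cookie_addr = (uint64_t)(unsigned long)cookie;
    mr.flags = RDS_RDMA_USE_ONCE;  /* drop the mapping after one RDMA */
    return setsockopt(sock, SOL_RDS, RDS_GET_MR, &mr, sizeof(mr));
}

/* Explicit unregister once the remote RDMA has completed. */
static int rds_unregister(int sock, rds_rdma_cookie_t cookie)
{
    struct rds_free_mr_args fr = { .cookie = cookie, .flags = 0 };
    return setsockopt(sock, SOL_RDS, RDS_FREE_MR, &fr, sizeof(fr));
}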

On the RDMA initiator, the RDS core driver leverages kernel mode
to pin local buffers and uses a device-level DMA key to initiate the
operations
- very straightforward.

If we move the RDMA initiator side to user mode, then we'd need to
register
/ pin local buffers on the fly, which would require a kernel-mode call to
pin and register memory and another call to unregister?

We've looked at caching registrations, etc., with the attendant issues of
dealing with memory dealloc / realloc underneath cached registrations
(the same issues the MPI folks describe), and it's not practical to track, given the
dynamic nature of RDS clients... plus caches have their own performance
costs.

4) RDS clients expect a reliable datagram model: a single endpoint can be
used to reliably send messages to any number of destinations. It's
common for
a large set of RDS clients to dynamically (during a query) form and
establish
communication where the total endpoints are in the 200k-plus range. From a
performance perspective, these dynamic associations between endpoints
must be very lightweight.
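
In socket terms, the model in (4) looks like this - one bound socket,
no per-destination connect(); a sketch assuming PF_RDS as above:

#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef PF_RDS
#define PF_RDS 21
#endif

/* One endpoint reliably sends datagrams to any number of peers. */
int send_to_peers(struct sockaddr_in *local,
                  struct sockaddr_in *peers, int npeers,
                  const void *buf, size_t len)
{
    int i, s = socket(PF_RDS, SOCK_SEQPACKET, 0);
    if (s < 0 || bind(s, (struct sockaddr *)local, sizeof(*local)) < 0)
        return -1;

    for (i = 0; i < npeers; i++) {
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &peers[i];       /* destination chosen per message */
        msg.msg_namelen = sizeof(peers[i]);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        if (sendmsg(s, &msg, 0) < 0)    /* reliable, in-order per peer */
            return -1;
    }
    return s;
}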

When running RDS over IB or iWARP or TCP, for example,
RDS forms one or more node-to-node connections and multiplexes all user-mode
datagrams over these connections. The effect is that there
is virtually no run-time cost to establish communication between
RDS endpoints at the scale we are creating.
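
Conceptually (this is a sketch of the idea, not actual RDS source),
the per-endpoint "association" reduces to a lookup in a node-to-node
connection table:

#include <stdint.h>
#include <stddef.h>

/* One entry per node pair; all RDS sockets talking to that peer
 * multiplex their datagrams over transport_conn. */
struct rds_connection {
    uint32_t peer_addr;        /* remote node's IP */
    void    *transport_conn;   /* e.g. an IB RC QP, iWARP QP, or TCP socket */
    struct rds_connection *next;
};

static struct rds_connection *conn_hash[1024];

static struct rds_connection *rds_conn_lookup(uint32_t peer_addr)
{
    struct rds_connection *c;
    for (c = conn_hash[peer_addr % 1024]; c; c = c->next)
        if (c->peer_addr == peer_addr)
            return c;          /* established once, reused by every socket */
    return NULL;               /* first contact: create the connection */
}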

> - Sean
>
>   


