[ofiwg] OFIWG usage requirements - Oracle RDBMS

Avneesh Pant avneesh.pant at oracle.com
Wed Jun 18 17:48:35 PDT 2014


> Avneesh, thanks for going over this.  After your presentation and
> looking back over the slides, I did come up with a few questions.
> 
> slide 6
> You mentioned using a memory mapped interface with ibacm.  Although
> there is work progressing to support this, I thought that the shared
> memory interface requires allocating an fd per process.  Is the problem
> allocating an fd per process, or is the problem on the ibacm daemon
> side trying to support 30k connections (e.g. an fd set size
> limitation)?
> 
[Avneesh Pant] It's primarily the fan-in issue on the ACM daemon end. A single fd per process to mmap the cache should be OK. A shared memory implementation should also have a latency benefit over sockets.

Another issue I forgot to mention wrt RDMA CM is that with a large number of connections the mechanism for delivering RDMA_CM_EVENT_ADDR_CHANGE is very inefficient. We have seen a significant amount of time spent delivering and processing these events (we can have 50K+ connections on a node). A more optimized (bulk?) notification mechanism would be nice. The client usually already maintains a mapping of RDMA CM ID -> IP (at least we do), so if an IP has migrated there is no need to indicate the address change for each CM ID individually. Maybe a way to just get a notification of the IP moving would be sufficient for these clients?
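
To put it concretely, here is a rough sketch of the per-ID event loop a client runs today with librdmacm (error handling omitted); every CM ID bound to a migrated IP gets its own event, so with 50K+ IDs this loop has to wake up and ack 50K+ events for what is logically a single change:

    #include <rdma/rdma_cma.h>

    /* Rough sketch of the current per-ID delivery model. */
    static void drain_cm_events(struct rdma_event_channel *ch)
    {
        struct rdma_cm_event *ev;

        while (rdma_get_cm_event(ch, &ev) == 0) {
            if (ev->event == RDMA_CM_EVENT_ADDR_CHANGE) {
                /* ev->id maps back to an IP we already track internally,
                 * so this per-ID event carries no new information. */
            }
            rdma_ack_cm_event(ev);
        }
    }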

> slide 11
> This mentions a shared PD concept and wanting this in mainline code.
> Is this some vendor specific feature available through some other
> interface?
> 
[Avneesh Pant] I assume you mean slide 10. We would like to see this available as part of the standard API if possible, and if not, then at least as an extension. Shared PDs may have more generic use cases (PGAS/SHMEM implementations may find them useful?). We are also looking at them as a way to implement IO server architectures that mitigate some of the QP scaling issues. With shared PDs the IO server is the only process required to allocate QPs/CQs, etc. Registrations of process-private memory can use the shared PD, allowing the IO server to provide RDMA capability to the private memory of any number of processes.
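
To make the IO server use case concrete, here is a purely hypothetical sketch; ibv_alloc_shared_pd(), ibv_open_shared_pd() and the key argument are invented names for illustration only, not an existing or proposed verbs signature:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch only -- ibv_alloc_shared_pd(), ibv_open_shared_pd()
     * and 'key' are invented names, not a current verbs API.  Error handling
     * omitted. */

    /* IO server: owns the shared PD and creates all QPs/CQs against it. */
    static struct ibv_pd *server_setup(struct ibv_context *ctx, uint64_t key)
    {
        struct ibv_pd *spd = ibv_alloc_shared_pd(ctx, key);   /* hypothetical */
        /* ... ibv_create_cq()/ibv_create_qp() on spd as usual ... */
        return spd;
    }

    /* Client process: opens the same PD by key and registers its private
     * memory against it; it never allocates QPs/CQs of its own. */
    static struct ibv_mr *client_register(struct ibv_context *ctx, uint64_t key,
                                          void *buf, size_t len)
    {
        struct ibv_pd *cpd = ibv_open_shared_pd(ctx, key);    /* hypothetical */
        return ibv_reg_mr(cpd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }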

> slide 13
> Do you have a specific idea for how to expose NUMA through an API?
> 
[Avneesh Pant] Not particularly. My only request is that this not be tied to a specific OS concept or implementation. For example, we want the API to be portable across Linux and Solaris, which represent NUMA concepts somewhat differently. Looking at a library like hwloc, which provides an abstracted, topology-aware view of the machine, may be helpful. I am of course not proposing a full-blown hwloc mechanism, but much simpler primitives along the lines of:
 - get_current/all_locality() -> returns a handle to the current "locality" L the process is running on (or to all localities on the node).
 - enumerate_devices(L) -> returns an ordered list of devices in increasing distance from L; a hop count/distance metric could be associated with each entry as well.
If the locality concept is generic enough, then even allocation of memory for Event Collectors and interrupt affinity for Event Groups can be tied to a locality. The default could be "local", but some clients (threaded servers?) might want to specify a distinct locality for processing events.
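
For illustration, a rough sketch of what such primitives might look like; every name below is hypothetical and only meant to convey the shape of the interface:

    #include <stddef.h>

    /* Hypothetical locality primitives -- illustration only, not proposed
     * signatures. */
    struct fabric_device;                        /* hypothetical device handle */
    struct event_group;                          /* hypothetical Event Group */
    typedef struct locality locality_t;          /* opaque "locality" handle */

    struct device_distance {
        struct fabric_device *dev;               /* a fabric device */
        unsigned int          hops;              /* distance metric from L */
    };

    /* Locality the calling process/thread is currently running on. */
    locality_t *get_current_locality(void);

    /* All localities visible on the node. */
    int get_all_localities(locality_t **out, size_t *count);

    /* Devices ordered by increasing distance from locality L. */
    int enumerate_devices(locality_t *L, struct device_distance *out,
                          size_t *count);

    /* Tie Event Collector memory and Event Group interrupt affinity to a
     * locality (default would be "local"). */
    int set_event_group_locality(struct event_group *grp, locality_t *L);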

Avneesh

> - Sean


