[ofw] [RFC] 0/5: assistant to the IB communication manager
Sean Hefty
sean.hefty at intel.com
Wed Sep 16 23:09:16 PDT 2009
The following collection of pseudo-patches implements a new user-space package
(IB ACM) designed to assist with connection establishment. A description is
given below, copied from the acm_notes.txt file included with the package. The
complete package is available on git.openfabrics.org/~shefty/ibacm.git and also
in svn under branches/winverbs/ulp/ibacm.
This is a request for both general and detailed feedback. The IB ACM has had
very limited testing. Testing has been restricted to using the provided test
utility, and invoking it from the Windows version of the librdmacm on a single,
small cluster. Calling it from the Linux librdmacm is more involved and still
under development.
Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
Assistant for InfiniBand Communication Management (IB ACM)
Note: The IB ACM should be considered experimental.
Overview
--------
The IB ACM package implements and provides a framework for experimental name,
address, and route resolution services over InfiniBand. It is intended to
address connection setup scalability issues running MPI applications on large
clusters. The IB ACM provides information needed to establish a connection, but
does not implement the CM protocol. Long term, the IB ACM may support multiple
resolution mechanisms.
The IB ACM is focused on being scalable and efficient. The current
implementation limits network traffic, SA interactions, and centralized
services. As a trade-off, it is not expected to support all cluster routing
configurations. However, it is anticipated that additional functionality, such
as path record caching, can be incorporated into the IB ACM to support a wider
range of configurations.
The IB ACM package consists of three components: the ib_acm service, the
libibacm library, and a test/configuration utility, ib_acme. All are user-space
components and are available for Linux and Windows. Additional details are
given below.
Quick Start Guide
-----------------
1. Prerequisites: libibverbs and libibumad must be installed.
The IB stack should be running with IPoIB configured.
2. Install the IB ACM package
This installs libibacm, ib_acm, and ib_acme.
3. Run ib_acme -A -O
This will generate IB ACM address and options configuration files.
(acm_addr.cfg and acm_opts.cfg)
4. Run ib_acm and leave running
5. Optionally, run ib_acme -s <source_ip> -d <dest_ip> -v
This will verify that the ib_acm service is running.
It also verifies the path is usable on the given cluster.
6. Install librdmacm.
7. Define the following environment variable: RDMA_CM_USE_IB_ACM=1
The librdmacm will automatically use the ib_acm service.
On failures, the librdmacm will fall back to normal resolution.
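The fallback behavior described in the last step can be pictured with a
minimal Python sketch. This is not librdmacm code: resolve_via_acm and
resolve_normally are hypothetical stand-ins for the two resolution paths, and
using OSError as the failure signal is an assumption. Only the
RDMA_CM_USE_IB_ACM variable name comes from the notes above.

```python
import os


def resolve_path(src, dst, resolve_via_acm, resolve_normally, env=None):
    """Try the ib_acm service first when enabled, else use normal resolution.

    resolve_via_acm and resolve_normally are hypothetical callables that take
    (src, dst) and return path data.
    """
    env = os.environ if env is None else env
    if env.get("RDMA_CM_USE_IB_ACM") == "1":
        try:
            return resolve_via_acm(src, dst)
        except OSError:
            # Assumed failure signal: service not running or no usable path.
            pass
    # Either the variable is unset or the ib_acm attempt failed.
    return resolve_normally(src, dst)
```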
Details
-------
libibacm:
The libibacm is an end-user library with simple interfaces for communicating
with the ib_acm service. The libibacm implements the ib_acm client protocol.
Although the interfaces to the libibacm are considered experimental, it's
expected that existing calls will be supported going forward.
For simplicity, all calls operate synchronously and are serialized. Possible
future changes to the libibacm would be to process calls in parallel and add
asynchronous interfaces.
ib_acme:
The ib_acme program serves a dual role. It acts as a utility to test ib_acm
operation and help verify if the ib_acm is usable for a given cluster
configuration. Additionally, it automatically generates ib_acm configuration
files to assist with or eliminate manual setup.
acm configuration files:
The ib_acm service relies on two configuration files. The acm_addr.cfg file
contains name and address mappings for each IB <device, port, pkey> endpoint.
Although the names in the acm_addr.cfg file can be anything, ib_acme maps the
host name and IP addresses to the IB endpoints.
The acm_opts.cfg file provides a set of configurable options for the ib_acm
service, such as timeout, number of retries, logging level, etc. ib_acme
generates the acm_opts.cfg file using static information. A future enhancement
would adjust options based on the current system and cluster size.
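As a rough illustration of what acm_addr.cfg expresses, the sketch below
models the name/address to <device, port, pkey> mapping as an in-memory table.
It does not parse the real file, whose on-disk format is not described here;
the Endpoint type, the function names, and the sample device and pkey values
are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    """One IB endpoint, identified by <device, port, pkey>."""
    device: str
    port: int
    pkey: str


def build_addr_map(entries):
    """Build a name/address -> Endpoint table.

    entries is an iterable of (name_or_address, device, port, pkey) tuples.
    Several names or addresses may map to the same endpoint.
    """
    addr_map = {}
    for name, device, port, pkey in entries:
        addr_map[name] = Endpoint(device, port, pkey)
    return addr_map


def lookup(addr_map, name) -> Optional[Endpoint]:
    """Return the endpoint assigned to a name or address, or None."""
    return addr_map.get(name)


# Illustrative values only: a host name and an IPoIB address on one endpoint.
example = build_addr_map([
    ("node1", "mlx4_0", 1, "0xffff"),
    ("192.168.0.1", "mlx4_0", 1, "0xffff"),
])
```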
ib_acm:
The ib_acm service is responsible for resolving names and addresses to
InfiniBand path information and caching such data. It is currently implemented
as an executable application, but is conceptually a service or daemon that
should execute with administrative privileges.
The ib_acm implements a client interface over TCP sockets, which is abstracted
by the libibacm library. One or more back-end protocols are used by the ib_acm
service to satisfy user requests. Although the ib_acm supports standard SA path
record queries on the back-end, it provides an experimental resolution protocol
in hope of achieving greater scalability.
Conceptually, the ib_acm service implements an ARP-like protocol and uses IB
multicast records to construct path record data. It makes the assumption that a
unicast path between two endpoints is realizable if those endpoints can
communicate over a multicast group with similar properties (rate, mtu, etc.).
Specifically, all IB endpoints join a number of multicast groups. Multicast
groups differ based on rates, mtu, sl, etc., and are prioritized. All
participating endpoints must be able to communicate on the lowest priority
multicast group. The ib_acm assigns one or more names/addresses to each IB
endpoint using the acm_addr.cfg file. Clients provide source and destination
names or addresses as input to the service, and receive as output path record
data.
The service maps a client's source name/address to a local IB endpoint. If the
destination name/address is not cached locally, it sends a multicast request out
on the lowest priority multicast group on the local endpoint. The request
carries a list of multicast groups that the sender can use. The recipient of
the request selects the highest priority multicast group that it can use as well
and returns that information directly to the sender. The request data is cached
by all endpoints that receive the multicast request message. The source
endpoint also caches the response and uses the multicast group that was selected
to construct path record data, which is returned to the client.
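The group negotiation and caching described above can be sketched in a few
lines of Python. This is a simplified model under stated assumptions, not the
ib_acm implementation: McGroup, select_group, resolve, and the
send_multicast_request callable are hypothetical names, priorities are modeled
as integers where a larger value is preferred, and the selected group stands
in for the path record data built from it.

```python
from typing import NamedTuple, Optional


class McGroup(NamedTuple):
    """A multicast group; larger priority is preferred (an assumption)."""
    priority: int
    rate_gbps: int
    mtu: int


def select_group(offered, local) -> Optional[McGroup]:
    """Recipient side: pick the highest priority group both sides can use."""
    common = set(offered) & set(local)
    return max(common, key=lambda g: g.priority, default=None)


def resolve(dest, cache, local_groups, send_multicast_request):
    """Sender side: answer from the cache, else ask over multicast.

    send_multicast_request is a hypothetical transport hook. It carries the
    sender's usable groups and returns the group chosen by the recipient.
    """
    if dest in cache:
        return cache[dest]
    selected = send_multicast_request(dest, local_groups)
    cache[dest] = selected  # the response is cached by the source endpoint
    return selected
```

In this model the request always travels on the lowest priority group, which
every participating endpoint must be able to use, so a common group exists
whenever both endpoints take part in the protocol.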
The current implementation of the IB ACM has several additional restrictions.
The ib_acm is limited in its handling of dynamic changes; the ib_acm must be
stopped and restarted if a cluster is reconfigured. Cached data does not time
out and is only updated if a new resolution request is received from a different
QPN than a cached request. Support for IPv6 has not been verified. The number
of addresses that can be assigned to a single endpoint is limited to 4, and the
number of multicast groups that an endpoint can support is limited to 2.
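The cache update rule and the two limits can be expressed as a short Python
sketch. The constant names, the dictionary layout, and both function names are
assumptions made for illustration; only the rule itself (replace an entry only
when the request comes from a different QPN, never expire it) and the limits
of 4 and 2 come from the text above.

```python
MAX_ADDRS_PER_ENDPOINT = 4      # addresses assignable to one endpoint
MAX_MC_GROUPS_PER_ENDPOINT = 2  # multicast groups one endpoint can support


def update_cache(cache, name, qpn, data):
    """Apply the cache rule; return True if an entry was added or replaced."""
    entry = cache.get(name)
    if entry is not None and entry["qpn"] == qpn:
        # Same QPN as the cached request: keep the old data. Entries never
        # time out, so stale data survives until the service is restarted.
        return False
    cache[name] = {"qpn": qpn, "data": data}
    return True


def check_limits(num_addrs, num_groups):
    """Raise ValueError if an endpoint configuration exceeds the limits."""
    if num_addrs > MAX_ADDRS_PER_ENDPOINT:
        raise ValueError("too many addresses for one endpoint")
    if num_groups > MAX_MC_GROUPS_PER_ENDPOINT:
        raise ValueError("too many multicast groups for one endpoint")
```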