[ofiwg] Multicast issues
Christoph Lameter
christoph at graphe.net
Mon Jun 9 10:03:01 PDT 2014
In the financial industry we depend on rapid information dissemination. With multicast, the replication of the information is done within the fabric rather than by the endpoints. In the case of Ethernet, long-established semantics and protocols have existed since the 1980s, mostly defined by IGMP (RFC 3376) and various system calls at the socket layer in the operating system (see the ip(7) man page).
OFED has so far supported joining and leaving multicast groups. There are a number of issues around the fringes where we have repeatedly encountered problems and incompatibilities in the implementations of various vendors.
1. Multicast loop back suppression
------------------------------------------
The default behavior is to send multicast packets to all endpoints that have subscribed to a multicast group. InfiniBand semantics require a subscription to a multicast group before a message can be sent to that group. Therefore InfiniBand will reflect any multicast packet back to the QP that sent it, and additional logic becomes necessary to ignore the message.
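To illustrate that extra logic: on a UD QP the receiver ends up having to recognize its own reflected datagrams and drop them, for example by comparing the source QPN/LID in the work completion against its own. A minimal sketch, assuming my_qpn and my_lid were recorded at setup time:

#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if this completion is one of our own multicast sends
 * reflected back to us by the fabric. */
static bool is_own_multicast(const struct ibv_wc *wc,
                             uint32_t my_qpn, uint16_t my_lid)
{
    /* On a UD QP the completion carries the sender's QPN and source LID. */
    return wc->src_qp == my_qpn && wc->slid == my_lid;
}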
The socket layer has a setting for each socket that specifies if a multicast packet will be looped back locally (IP_MULTICAST_LOOP).
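For reference, this is roughly what it looks like on the socket side; the group address and port below are made up for illustration:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* 0 = do not loop our own datagrams back to receivers on this host */
    int loop = 0;
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop)) < 0)
        perror("IP_MULTICAST_LOOP");

    struct sockaddr_in grp;
    memset(&grp, 0, sizeof(grp));
    grp.sin_family = AF_INET;
    grp.sin_port = htons(5000);
    inet_pton(AF_INET, "239.1.1.1", &grp.sin_addr);

    const char msg[] = "tick";
    sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&grp, sizeof(grp));
    close(fd);
    return 0;
}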
For the OFED stack some vendors have implemented driver options that specify multicast loopback suppression. However, the semantics are not the same. IP_MULTICAST_LOOP prevents any other socket on the local host from receiving the multicast message. The option supported by OFED 1.X and MLNX_OFED for mlx4, for example, only suppresses loopback to the sending QP but still delivers the packets to other QPs listening on the same multicast group on the same host.
The situation gets a bit worse because in the most recent version of MLNX_OFED 2.2 the semantics change to those of IP_MULTICAST_LOOP when flow steering is activated in the driver. This is likely a bug that will be fixed soon.
There was a recent discussion on linux-rdma about how to specify loopback suppression and we agreed that this would be done on a per QP basis with an extended QP creation flag.
I would prefer the implementation where the loopback suppression is limited to the sending QP only. Otherwise other apps running on the same system may not receive multicast packets that they expect. The fine points of IP_MULTICAST_LOOP cause some gotchas that I would like to see avoided.
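A rough sketch of what the per-QP creation flag could look like with extended QP creation. The spelling IBV_QP_CREATE_BLOCK_SELF_MCAST_LB is the one used by libibverbs/rdma-core and may differ from what a given OFED ships, so treat the name as an assumption:

#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_mcast_send_qp(struct ibv_context *ctx,
                                    struct ibv_pd *pd,
                                    struct ibv_cq *cq)
{
    struct ibv_qp_init_attr_ex attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_type = IBV_QPT_UD;          /* multicast requires a UD QP */
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.cap.max_send_wr = 64;
    attr.cap.max_recv_wr = 64;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    attr.pd = pd;
    attr.comp_mask = IBV_QP_INIT_ATTR_PD | IBV_QP_INIT_ATTR_CREATE_FLAGS;
    /* Ask the provider not to deliver this QP's own multicast sends back
     * to it; other QPs on the host still receive the packets. */
    attr.create_flags = IBV_QP_CREATE_BLOCK_SELF_MCAST_LB;

    return ibv_create_qp_ex(ctx, &attr);
}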
2. Requirement to subscribe to multicast groups before being able to send a multicast packet on a group.
-------------------------------------------------------------------------------------------------------------------------------
This differs between the Ethernet and InfiniBand protocols supported by current OFED. On Ethernet no subscription is necessary to send a multicast message (and therefore the loopback suppression problem does not exist). InfiniBand requires a subscription to send multicast.
This problem would be avoided if the fabric could be subscribed to a multicast group in a send-only mode. OFED does not allow that.
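A send-only join could look roughly like the sketch below. It uses rdma_join_multicast_ex() and RDMA_MC_JOIN_FLAG_SENDONLY_FULLMEMBER as found in newer librdmacm; since current OFED does not offer this mode, treat the names as an assumption rather than an available API:

#include <rdma/rdma_cma.h>
#include <string.h>

int join_send_only(struct rdma_cm_id *id, struct sockaddr *mc_addr)
{
    struct rdma_cm_join_mc_attr_ex attr;
    memset(&attr, 0, sizeof(attr));

    attr.comp_mask  = RDMA_CM_JOIN_MC_ATTR_ADDRESS |
                      RDMA_CM_JOIN_MC_ATTR_JOIN_FLAGS;
    attr.addr       = mc_addr;
    /* Request membership for sending only; the group's traffic is not
     * routed back to this port. */
    attr.join_flags = RDMA_MC_JOIN_FLAG_SENDONLY_FULLMEMBER;

    return rdma_join_multicast_ex(id, &attr, NULL);
}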
3. Transition of multicast traffic between fabrics.
----------------------------------------------------------
The conversion of multicast traffic to and from Ethernet has been problematic in the past, since a mapping needs to occur between the different fabrics. IPoIB requires a subscription at the Ethernet level and at the InfiniBand level, which caused race conditions that took a long time to get under control. Semantics for transitioning between fabrics could be useful; preferably a single transaction should cause a multicast group subscription.
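Ideally a single socket-level join on the IPoIB interface would drive both memberships (the IGMP report and the InfiniBand SA join). Something like the following, where the interface name "ib0" and the group address are illustrative only:

#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int join_on_ipoib(int fd)
{
    struct ip_mreqn mreq;
    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET, "239.1.1.1", &mreq.imr_multiaddr);
    mreq.imr_ifindex = if_nametoindex("ib0");   /* join via the IPoIB port */

    /* One call at the socket layer; everything below it should follow. */
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP");
        return -1;
    }
    return 0;
}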
4. Control of multicast routing and backpressure
----------------------------------------------------------
Most InfiniBand fabrics can only apply static routing of multicast packets. This can cause bottlenecks for traffic in the fabric, which create backpressure that slows down the sender, which in turn delays multicast reception by all receivers.
On Ethernet there are numerous solutions that allow dynamic routing of multicast packets if a certain node within the fabric gets overloaded.
Maybe this is more of an issue for the particular implementation of a fabric but I think it would be useful if the sender and/or receiver could establish that congestion exists and take appropriate measures.
One way to avoid this backpressure is to disable the congestion mechanism that slows down the sender. It would be best if this could be configured at the QP level, so that multicast streams can be set up to drop packets rather than slow down all receivers.
___________________________________
christoph at lameter.com (202)596-5598