[ofiwg] Multicast issues

Christoph Lameter christoph at graphe.net
Mon Jun 9 11:19:42 PDT 2014

I forgot three more issues that go a bit beyond the earlier list in scope:

A. The socket API also supports IP_MULTICAST_ALL, which controls whether a socket receives traffic from all multicast groups joined anywhere on the system rather than only the groups it has joined itself. Similar functionality is available via the ibverbs flow steering API.

B. Raw Ethernet QPs can listen to all traffic arriving on a link. This is useful for diagnostics and for special applications that need to observe all traffic on a link, similar to promiscuous mode. There is a sniffer mode in the flow steering APIs that allows similar things with ibverbs.
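The verbs sniffer mode referred to here is expressed as a flow steering rule of type IBV_FLOW_ATTR_SNIFFER. The fragment below is only a sketch: it assumes an already-created raw packet QP (`qp`) and cannot run without an RDMA device and the needed capabilities.

```
/* Sketch (not runnable without an RDMA device): attach a sniffer
 * flow rule to a raw Ethernet QP via the verbs flow steering API. */
struct ibv_flow_attr attr = {
    .comp_mask    = 0,
    .type         = IBV_FLOW_ATTR_SNIFFER, /* deliver all port traffic */
    .size         = sizeof(attr),
    .priority     = 0,
    .num_of_specs = 0,                     /* no match specs: everything */
    .port         = 1,
    .flags        = 0,
};

struct ibv_flow *flow = ibv_create_flow(qp, &attr);
if (!flow)
    perror("ibv_create_flow");
/* ... receive on qp ... */
ibv_destroy_flow(flow);
```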

C. And then there is the need to direct flows of data to the specific processors where the data is to be processed. The flow steering API can do that now, but it is a bit awkward.
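One way the flow steering API expresses this today is sketched below, with illustrative values: `ctx`, `channel`, and `qp` are assumed to already exist, and the completion vector number is an assumption about which MSI-X vector (and hence, via its IRQ affinity, which CPU) should service the flow.

```
/* Sketch (assumes an open device context): steer one flow to a QP
 * whose CQ uses a chosen completion vector, approximating CPU-local
 * delivery. All field values are illustrative. */
struct ibv_cq *cq = ibv_create_cq(ctx, 1024, NULL, channel,
                                  3 /* vector whose IRQ is pinned
                                       near the target CPU */);

struct {
    struct ibv_flow_attr     attr;
    struct ibv_flow_spec_eth eth;
} rule = {
    .attr = {
        .type         = IBV_FLOW_ATTR_NORMAL,
        .size         = sizeof(rule),
        .num_of_specs = 1,
        .port         = 1,
    },
    .eth = {
        .type = IBV_FLOW_SPEC_ETH,
        .size = sizeof(struct ibv_flow_spec_eth),
        /* match one multicast MAC exactly */
        .val.dst_mac  = {0x01, 0x00, 0x5e, 0x00, 0x01, 0x01},
        .mask.dst_mac = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff},
    },
};

struct ibv_flow *flow = ibv_create_flow(qp, &rule.attr);
```

Note the awkwardness: the CPU placement is not part of the rule itself but falls out of the CQ's completion vector and the separate IRQ affinity configuration.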

christoph at lameter.com (202)596-5598

> On Jun 9, 2014, at 12:49 PM, "Hefty, Sean" <sean.hefty at intel.com> wrote:
> Thanks, Christoph.  I will capture this in the requirements document, so we can track the best way to ensure that the interfaces enable the capability that you need.
> - Sean
>> In the financial industry we depend on rapid information dissemination. The
>> replication of the information is done through multicast within the fabric,
>> not by the endpoints. In the case of Ethernet, long-established
>> semantics and protocols have existed since the 80s, mostly defined by IGMP
>> (RFC 3376) and by various socket-layer system calls in the operating
>> system (see the ip(7) manpage).
>> OFED has so far supported joining and leaving multicast groups. There are a
>> number of issues around the fringes where we have repeatedly encountered
>> problems and incompatibilities in various vendors' implementations.
>> 1. Multicast loop back suppression
>> ------------------------------------------
>> The default behavior is to send multicast packets to all endpoints that
>> have subscribed to a multicast group. InfiniBand semantics require a
>> subscription to a multicast group before a message can be sent to it.
>> Therefore InfiniBand will reflect any multicast packet back to the QP that
>> sent it, requiring additional logic to ignore the echoed message.
>> The socket layer has a setting for each socket that specifies if a
>> multicast packet will be looped back locally (IP_MULTICAST_LOOP).
>> For the OFED stack some vendors have implemented options to the driver that
>> specify multicast loopback suppression. However, the semantics are not the
>> same. IP_MULTICAST_LOOP prevents any other socket on the local host from
>> receiving the multicast message. The option supported by OFED 1.X and
>> MLNX_OFED for mlx4, for example, only suppresses loopback to the sending QP
>> but still delivers the packets to other QPs listening on the same
>> multicast group on the same host.
>> The situation gets a bit worse: in the most recent version of
>> MLNX_OFED, 2.2, the semantics change to match IP_MULTICAST_LOOP when
>> flow steering is activated in the driver. This is likely a bug that will
>> be fixed soon.
>> There was a recent discussion on linux-rdma about how to specify loopback
>> suppression and we agreed that this would be done on a per QP basis with an
>> extended QP creation flag.
>> I would prefer the implementation where the loopback suppression is limited
>> to the sending QP only. Otherwise other apps running on the same system may
>> not receive multicast packets that they expect. The fine points of
>> IP_MULTICAST_LOOP cause some gotchas that I would like to see avoided.
>> 2. Requirement to subscribe to multicast groups before being able to send a
>> multicast packet on a group.
>> ---------------------------------------------------------------------------
>> This differs between the Ethernet and Infiniband protocols supported by
>> current OFED. On Ethernet no subscription is necessary to send a multicast
>> message (and therefore the loopback suppression problem does not exist).
>> InfiniBand requires a subscription to send multicast.
>> This problem would be avoided if an endpoint could subscribe to a multicast
>> group in a send-only mode. OFED does not allow that.
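A send-only join might look like the sketch below. rdma_join_multicast() is the existing librdmacm call, which always creates a full membership; the send-only variant shown after it is hypothetical and does not exist in current OFED.

```
/* Sketch with librdmacm (assumes a bound UD rdma_cm id and a
 * resolved multicast address). */
struct rdma_cm_id *id;        /* assumed: bound UD id */
struct sockaddr *mcast_addr;  /* assumed: resolved multicast address */

/* Existing API: full membership join (receives as well as sends). */
rdma_join_multicast(id, mcast_addr, NULL);

/* HYPOTHETICAL send-only variant, illustrating the missing feature;
 * no such call exists in current OFED. */
rdma_join_multicast_sendonly(id, mcast_addr, NULL);
```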
>> 3. Transition of multicast traffic between fabrics.
>> ----------------------------------------------------------
>> The conversion of multicast traffic to and from Ethernet has been
>> problematic in the past since a mapping needs to occur between different
>> fabrics. IPoIB requires a subscription at the Ethernet level and at the
>> InfiniBand level, which caused race conditions that took a long time to get
>> under control. Semantics for transitioning between fabrics could be useful.
>> Preferably a single transaction should suffice to subscribe to a multicast
>> group.
>> 4. Control of multicast routing and backpressure
>> ----------------------------------------------------------
>> Most InfiniBand fabrics can only apply static routing of multicast packets.
>> This can cause bottlenecks in the fabric that create backpressure, which
>> slows down the sender and therefore delays multicast reception for all
>> receivers.
>> On Ethernet there are numerous solutions that allow dynamic routing of
>> multicast packets if a certain node within the fabric gets overloaded.
>> Maybe this is more of an issue for the particular implementation of a
>> fabric but I think it would be useful if the sender and/or receiver could
>> establish that congestion exists and take appropriate measures.
>> One way to avoid this backpressure is to disable the congestion mechanism
>> that slows down the sender. It would be best if this could be set at the
>> QP level, so that multicast streams drop packets rather than slow down
>> all receivers.
>> ___________________________________
>> christoph at lameter.com (202)596-5598
>> _______________________________________________
>> ofiwg mailing list
>> ofiwg at lists.openfabrics.org
>> http://lists.openfabrics.org/mailman/listinfo/ofiwg
