[ofiwg] Multicast issues
Hefty, Sean
sean.hefty at intel.com
Mon Jun 9 10:49:35 PDT 2014
Thanks, Christoph. I will capture this in the requirements document, so we can track the best way to ensure that the interfaces enable the capability that you need.
- Sean
> In the financial industry we depend on rapid information dissemination. The
> replication of the information is done within the fabric through multicast,
> not by the endpoints. In the case of Ethernet, long-established semantics
> and protocols have existed since the 1980s. These are mostly defined by IGMP
> (RFC 3376) and various socket-layer calls in the operating system (see the
> ip(7) manpage).
>
> OFED has so far supported joining and leaving multicast groups. There are a
> number of issues around the fringes where we have repeatedly encountered
> problems and incompatibilities in the implementations of various vendors.
>
>
> 1. Multicast loopback suppression
> ------------------------------------------
>
> The default behavior is to send multicast packets to all endpoints that
> have subscribed to a multicast group. InfiniBand semantics require a
> subscription to a multicast group before a message can be sent to that
> group. Therefore InfiniBand will reflect any multicast packet back to the
> QP that sent it, making additional logic necessary to ignore the reflected
> message.
>
> The socket layer has a per-socket setting that specifies whether a
> multicast packet will be looped back locally (IP_MULTICAST_LOOP).
>
> For the OFED stack some vendors have implemented driver options that
> control multicast loopback suppression. However, the semantics are not the
> same. IP_MULTICAST_LOOP prevents any other socket on the local host from
> receiving the multicast message. The option supported by OFED 1.X and
> MLNX_OFED for mlx4, for example, only suppresses loopback to the sending QP
> but still delivers the packets to other QPs listening on the same multicast
> group on the same host.
>
> The situation gets a bit worse because in the most recent version of
> MLNX_OFED, 2.2, the semantics change to match what IP_MULTICAST_LOOP does
> when flow steering is activated in the driver. This is likely a bug that
> will be fixed soon.
>
> There was a recent discussion on linux-rdma about how to specify loopback
> suppression, and we agreed that this would be done on a per-QP basis with
> an extended QP creation flag (see the sketch below).
>
> I would prefer the implementation where the loopback suppression is limited
> only to the sending QP. Otherwise other apps running on the same system may
> not receive multicast packets that they expect. The fine points of
> IP_MULTICAST_LOOP cause some gotchas that I would like to see avoided.
>
>
> 2. Requirement to subscribe to multicast groups before being able to send a
> multicast packet on a group.
> ---------------------------------------------------------------------------
>
> This differs between the Ethernet and InfiniBand protocols supported by
> current OFED. On Ethernet no subscription is necessary to send a multicast
> message (and therefore the loopback suppression problem does not exist).
> InfiniBand requires a subscription in order to send multicast.
>
> This problem would be avoided if the fabric could subscribe to a multicast
> group in a send-only mode. OFED does not allow that.
>
>
> 3. Transition of multicast traffic between fabrics.
> ----------------------------------------------------------
>
> The conversion of multicast traffic to and from Ethernet has been
> problematic in the past since a mapping needs to occur between different
> fabrics. IPoIB requires a subscription on the Ethernet level and on the
> InfiniBand level, which caused race conditions that took a long time to get
> under control. Semantics for transitioning between fabrics could be useful.
> Preferably a single transaction should cause a multicast group
> subscription.
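>
> For comparison, on the IP side a single socket-level call expresses the
> join; over IPoIB that one call has to be turned into both an IGMP
> membership report and an InfiniBand MCG subscription underneath. A minimal
> IPv4 sketch ('fd' is an existing UDP socket, the group address is
> illustrative):
>
>     #include <netinet/in.h>
>     #include <arpa/inet.h>
>     #include <sys/socket.h>
>     #include <stdio.h>
>
>     struct ip_mreq mreq = {
>         .imr_multiaddr.s_addr = inet_addr("239.1.1.1"), /* example group */
>         .imr_interface.s_addr = htonl(INADDR_ANY),
>     };
>     if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
>                    &mreq, sizeof(mreq)) < 0)
>         perror("setsockopt(IP_ADD_MEMBERSHIP)");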
>
>
> 4. Control of multicast routing and backpressure
> ----------------------------------------------------------
>
> Most InfiniBand fabrics can only apply static routing of multicast packets.
> This can cause bottlenecks in the fabric that create backpressure, which
> then slows down the sender and in turn delays multicast reception at all
> receivers.
>
> On Ethernet there are numerous solutions that allow dynamic routing of
> multicast packets if a certain node within the fabric gets overloaded.
>
> Maybe this is more of an issue for the particular implementation of a
> fabric, but I think it would be useful if the sender and/or receiver could
> detect that congestion exists and take appropriate measures.
>
> One way to avoid this backpressure is to disable the congestion mechanism
> that slows down the sender. It would be best if this could be configured at
> the QP level, so that for multicast streams packets are dropped rather than
> slowing down all receivers.
>
> ___________________________________
> christoph at lameter.com (202)596-5598
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg