[ofa-general] Re: multicast join failed for...

Fri Apr 13 04:36:53 PDT 2007

On Fri, 2007-04-13 at 00:41, Michael S. Tsirkin wrote:
> > Quoting Sean Hefty <sean.hefty at intel.com>:
> > Subject: RE: [ofa-general] Re: multicast join failed for...
> > 
> > >> > The job will continue running though, and when you diagnose the problem
> > >> > and disconnect the bad node, rate will be back to high.
> > >> > So what's the problem?
> > 
> > What would bring the rate back up?
> 
> When the node is diagnosed and disconnected, SM will bring the rate back up.

I would say that the SM could (rather than will) bring the rate back up.
Doing so increases the implementation complexity, but it would be
warranted if/when a dynamic rate option is supported.

> > Halting all multicast traffic across the subnet to handle a flaky node
> 
> Not halting, that would be broken. We are slowing the traffic down to avoid
> congestion at this link.
> 
> And you don't know it's "flaky" - it's just a heterogeneous network. Policy can
> be forced by SM option but I don't think we should assume homogeneous networks
> by default.

Homogeneous subnets are not assumed. What is assumed is the most common
use case (4x SDR or greater equipment). The issue occurs when a node
slower than that attempts to join.
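
To make the two policies under discussion concrete, here is a minimal
sketch, not OpenSM code. The class, its fields, and the dynamic_rate flag
are hypothetical names invented for illustration, and rates are given
directly in Gb/s (4x SDR = 10, 1x SDR = 2.5) rather than in the encoded
form used on the wire. With a fixed group rate, a slower port's join is
refused; with a dynamic rate, the group rate drops to the slowest member
and is recomputed when that member leaves.

```python
class McastGroup:
    """Hypothetical model of the state an SM might keep per multicast group."""

    def __init__(self, configured_rate_gbps, dynamic_rate=False):
        self.configured_rate_gbps = configured_rate_gbps  # rate the group was created with
        self.rate_gbps = configured_rate_gbps             # rate currently in effect
        self.dynamic_rate = dynamic_rate                  # assumed option, see lead-in
        self.members = {}                                 # port name -> port rate in Gb/s

    def join(self, port, port_rate_gbps):
        """Return True if the join is accepted, False if it is refused."""
        if port_rate_gbps < self.rate_gbps and not self.dynamic_rate:
            # Fixed-rate policy: a port slower than the group cannot join.
            return False
        self.members[port] = port_rate_gbps
        self._recompute_rate()
        return True

    def leave(self, port):
        """Remove a member and let the group rate recover if it can."""
        self.members.pop(port, None)
        self._recompute_rate()

    def _recompute_rate(self):
        if not self.dynamic_rate:
            return
        # Dynamic policy: run at the slowest member's rate, but never
        # above the rate the group was configured with.
        slowest = min(self.members.values(), default=self.configured_rate_gbps)
        self.rate_gbps = min(self.configured_rate_gbps, slowest)


if __name__ == "__main__":
    group = McastGroup(10.0, dynamic_rate=True)
    group.join("fast-node", 20.0)
    group.join("slow-node", 2.5)
    print(group.rate_gbps)   # group slowed to the 1x SDR member
    group.leave("slow-node")
    print(group.rate_gbps)   # back to the configured 4x SDR rate
```

The sketch leaves out the part that makes the dynamic case more complex
in practice: existing members would have to learn about each rate change.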

-- Hal

> > wanting
> > to join some multicast group doesn't seem like a good solution.
> 
> As I said, there are tens of ways a bad node can hurt performance,
> and we don't/can't handle them. Why focus on ipoib? It's
> the only way to connect to a node on some fabrics, so it
> really must be up at all times.
> 
> > Plus it looks
> > like we'd have to repeat this later to bring the rate back up.
> 
> So? It should all be automatic.
> You see a problem in the network, diagnose it, replace the bad node,
> performance comes back up. That's the way to do it.