[ofa-general] Re: multicast join failed for...
Michael S. Tsirkin
mst at dev.mellanox.co.il
Thu Apr 12 21:41:29 PDT 2007
> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: [ofa-general] Re: multicast join failed for...
>
> >> > The job will continue running though, and when you diagnose the problem
> >> > and disconnect the bad node, rate will be back to high.
> >> > So what's the problem?
>
> What would bring the rate back up?
When the node is diagnosed and disconnected, the SM will bring the rate back up.
> Halting all multicast traffic across the subnet to handle a flaky node
Not halting, that would be broken. We are slowing the traffic down to avoid
congestion at this link.
And you don't know it's "flaky" - it's just a heterogeneous network. Policy can
be forced by an SM option, but I don't think we should assume homogeneous networks
by default.
> wanting
> to join some multicast group doesn't seem like a good solution.
As I said, there are tens of ways a bad node can hurt performance,
and we don't/can't handle them. Why focus on IPoIB? It's
the only way to connect to a node on some fabrics, so it
really must be up at all times.
> Plus it looks
> like we'd have to repeat this later to bring the rate back up.
So? It should all be automatic.
You see a problem in the network, diagnose it, replace the bad node,
performance comes back up. That's the way to do it.
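
To make the behaviour argued for above concrete, here is a minimal sketch of the policy: the group rate follows the slowest current member and recovers once that member is gone. This is a toy model under stated assumptions, not OpenSM code. The class name, method names, and the use of plain Gb/s numbers for rates are all hypothetical.

```python
# Toy model (hypothetical, not OpenSM): a multicast group whose rate is
# lowered to the slowest member's link rate and restored when it leaves.
class McastGroup:
    def __init__(self, preferred_rate_gbps):
        # Rate the group would run at if every member could sustain it.
        self.preferred = preferred_rate_gbps
        # Member name -> link rate in Gb/s (assumed representation).
        self.members = {}

    def rate(self):
        # Effective rate: never above the preferred rate, never above
        # what the slowest current member can handle.
        return min([self.preferred, *self.members.values()])

    def join(self, node, link_rate_gbps):
        # A slow node is admitted; traffic slows down but does not halt.
        self.members[node] = link_rate_gbps
        return self.rate()

    def leave(self, node):
        # Removing the slow node automatically brings the rate back up.
        self.members.pop(node, None)
        return self.rate()


group = McastGroup(preferred_rate_gbps=20)
group.join("fast-node", 20)
slowed = group.join("slow-node", 10)
restored = group.leave("slow-node")
```

In this sketch the join of the slow node never fails; the cost is a lower rate for the whole group until the node is removed.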
--
MST