[ofa-general] Re: multicast join failed for...

Hal Rosenstock halr at voltaire.com
Fri Apr 13 04:36:43 PDT 2007


On Fri, 2007-04-13 at 02:17, Michael S. Tsirkin wrote:
> > Quoting Sean Hefty <sean.hefty at intel.com>:
> > Subject: RE: [ofa-general] Re: multicast join failed for...
> > 
> > >When the node is diagnosed and disconnected, SM will bring the rate
> > >back up.
> > 
> > But how?  Doesn't it require re-registration of all multicast groups and
> > clients registered for SA events?
> 
> SA detects that rate can be increased and sends another reregister MAD.

Nit: it would be the SM rather than the SA that detects this and requests
reregistration (which is an SM PortInfo change). That causes the SA clients
to redo their SA registrations, including multicast joins and event
subscriptions.
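
To illustrate why a reregistration request costs more than a single MAD,
here is a toy Python sketch. It is not OpenSM code: Port, reregister and
request_reregistration are made-up names, and it assumes that every port is
asked to reregister and that each client repeats one join per group it holds.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    groups: set = field(default_factory=set)  # multicast groups this port has joined
    sa_requests: int = 0                      # SA requests this port has issued so far

    def reregister(self):
        # Toy stand-in for a client reacting to a reregistration request:
        # it repeats every join it holds, not only the one whose rate changed.
        self.sa_requests += len(self.groups)


def request_reregistration(ports):
    """Hypothetical SM action: ask every port to reregister.

    Returns the number of SA requests triggered by this one action.
    """
    before = sum(p.sa_requests for p in ports)
    for p in ports:
        p.reregister()
    return sum(p.sa_requests for p in ports) - before


ports = [
    Port("node1", {"ipoib-broadcast", "mpi-group-a"}),
    Port("node2", {"ipoib-broadcast", "mpi-group-a", "mpi-group-b"}),
    Port("node3", {"ipoib-broadcast"}),
]
total = request_reregistration(ports)
print(total)
```

In this model a rate change on the ipoib group alone still makes the MPI
groups rejoin, which is the side effect being debated in this thread.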

> > >As I said, there are tens of ways a bad node can hurt performance,
> > >and we don't/can't handle them. Why focus on ipoib? It's
> > >the only way to connect to node on some fabrics, it
> > >really must be up at all times.
> > 
> > But the solution is affecting all multicast traffic, not just that
> > related to ipoib.  If you want all nodes to be able to join the ipoib
> > multicast group, why not just create the group at the lower rate?
> 
> If the group is created at a lower rate, there would be no problem.
> But the default configuration should be "plug and play".

So you are arguing for 1x SDR as the default. We've discussed and
disagreed on this before: I think it masks performance issues, and
those are then harder to find. I could be wrong about this.

> > ipoib multicast performance doesn't seem that critical.
> 
> This is a policy that can be made optional, but should not
> be forced on users by default.
> 
> > Whereas disrupting
> > other multicast groups, which could actively be in use by MPI, may be. 
> 
> The disruption would be very minor - this would happen at most once when rate changes
> from DDR to SDR and once when it changes back.

In frequency it may be minor, but it affects other things that should not
be affected. Perhaps that is just a shortcoming of the mechanism underneath,
one that can and should be improved.

-- Hal