[openib-general] Scalable Monitoring - RFC
Michael Krause
krause at cup.hp.com
Tue Nov 21 09:08:34 PST 2006
At 01:14 PM 11/20/2006, Bernard King-Smith wrote:
> > ----- Message from "Eitan Zahavi" <eitan at mellanox.co.il> on Mon, 20
> > Nov 2006 14:24:36 +0200 -----
> > To: openib-general at openib.org
> > Subject: [openib-general] Scalable Monitoring - RFC
> > Hi All,
> > Following the path forward requirements review by Matt L. at the
> > last OFA Dev Summit, I have started thinking about what would make a
> > monitoring system scale to tens of thousands of nodes.
> > This RFC provides both what I propose as the requirements list and
> > a draft implementation proposal - just to start the discussion.
> > I apologize for the long mail, but I think this issue deserves a
> > careful design (been there, done that).
> > Scalable fabric monitoring requirements:
> > * scale up to 48k nodes * 16 ports, which gets to about 1,000,000 ports
> >   (16 ports per device is the average of 32 ports for a switch and 1 for an HCA).
>
>What is the problem you are trying to address? 48K nodes or a single
>fabric of 1,000,000 endpoints? With the number of cores per node going up,
>you are looking at a multiple-petaflop machine with this many nodes. When
>do you expect this to happen, 5-10 years from now? Most very large systems
>generally limit the fabric port count to the number of nodes, not the
>number of endpoints. A single fabric runs into the problem of too many
>stages in the Fat Tree and a single point of failure. Even if each node
>has multiple cores, each node can have multiple IB ports, each connecting
>to a different plane of IB fabric. This means that you only need to
>address a configuration of 48K ports, and for greater bandwidth you can
>use multiple parallel IB fabrics.
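As a back-of-the-envelope check on the two sizing views above, here is a
minimal Python sketch (not from the original thread); the plane count of 4
is purely illustrative, and only the 48k-node and 32/1-port figures come
from the RFC:

    # Port arithmetic for the single-fabric vs. multi-plane views above.
    NODES = 48_000                # nodes in the proposed large system
    SWITCH_PORTS = 32             # ports on a typical switch of the day
    HCA_PORTS = 1                 # ports per HCA in Eitan's average
    AVG_PORTS_PER_DEVICE = (SWITCH_PORTS + HCA_PORTS) // 2   # ~16, as in the RFC

    # Single-fabric view: every device port is a monitored endpoint.
    single_fabric_ports = NODES * AVG_PORTS_PER_DEVICE       # ~768K, "about 1,000,000"

    # Multi-plane view: each plane tracks only its node-facing ports, and
    # bandwidth scales by adding parallel planes rather than endpoints.
    PLANES = 4                    # hypothetical number of parallel IB fabrics
    ports_per_plane = NODES

    print(f"single fabric monitors ~{single_fabric_ports:,} ports")
    print(f"each of {PLANES} planes monitors ~{ports_per_plane:,} node ports")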
I listened to a government labs guy talk a year or so back about a 256K
processor configuration needed by 2015 or so. However, I doubt he was aware
of all of the advancements in multi-core technology (which in turn is going
to be gated by the lack of, or slow, advancements in memory technology). One
can likely safely assume that 8 and 16 core processors will ship in volume
in the coming years, and that aggressive multi-core as Intel demonstrated at
the last IDF (if I recall correctly, 80 cores) is certainly a strong
possibility and could occur given advancements in process technology.
Nonetheless, the question of what people are trying to solve is a very
valid one to ask.
The LID space is slightly less than 48K (some reserved / special values),
and most people will want to enable at least 2-4 LIDs per port in order to
support multi-path / APM within a multi-switch configuration. That would
lead to 12K-24K ports being linked per fabric instance. 12K * 8 cores gets
one halfway to that government lab's need, which is likely the extreme
niche. Also note that most of these platforms will still be 2-4 socket
solutions, so there could be multiple HCAs per endnode, and given the advent
of 10 GT/s signaling and improvements in optics / copper as well as blade
technology, one can see constructing this in a fairly compact physical
environment (the mix of servers and storage on a consolidated fabric isn't
really an issue, as some people may want to segregate, as you note, into
different fabric instances so the flows do not interfere with one another
from a QoS perspective).
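The LID arithmetic above works out roughly as follows; a short Python sketch
(not part of the original mail), using the unicast LID range 0x0001..0xBFFF
and the 2**LMC LIDs consumed per port when LMC-based multipathing is enabled:

    # Rough math behind the 12K-24K ports-per-fabric-instance figure.
    UNICAST_LIDS = 0xBFFF          # unicast LIDs 0x0001..0xBFFF, just under 48K

    def ports_per_fabric(lmc: int) -> int:
        """Each port consumes 2**LMC LIDs when multi-path / APM is enabled."""
        return UNICAST_LIDS // (2 ** lmc)

    print(ports_per_fabric(1))     # LMC=1 -> 2 LIDs/port -> ~24K ports
    print(ports_per_fabric(2))     # LMC=2 -> 4 LIDs/port -> ~12K ports

    # 12K endnodes * 8 cores each is ~96K cores, roughly halfway to the
    # 256K-processor figure quoted above.
    print(12_000 * 8)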
As for monitoring, well, the IB management approach was never what I wanted
(I was the sole vote against the architecture), but it is what it
is. Ideally, the switches should just take a more active role. Set the
thresholds for when to raise an alarm and let the SM react. Given that at
these speeds one needs to have a sustained effect for some reasonably long
period of time in order to justify change, it does not seem like the alarm
rate would be that high or that the SM would have trouble servicing the
changes. The goal should be to minimize oscillation, which means that at
these signaling rates one shouldn't muck with the parameters until the
effect has lasted N seconds, where N is potentially large (it will be
somewhat a function of fabric diameter).
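A minimal sketch of that switch-side behavior, assuming a per-port rate
metric is sampled periodically and an alarm should reach the SM only after
the threshold has been exceeded for a sustained window (the class and
parameter names are illustrative, not an existing OpenSM or firmware
interface):

    import time

    class DampenedThreshold:
        """Raise an alarm only when a rate stays above a threshold for a
        sustained window, so transient spikes do not reach the SM."""

        def __init__(self, threshold: float, sustain_secs: float):
            self.threshold = threshold        # rate that counts as "interesting"
            self.sustain_secs = sustain_secs  # the "N seconds" dampening window
            self._above_since = None          # when the threshold was first crossed
            self._fired = False               # alarm already sent for this excursion

        def sample(self, rate: float, now=None) -> bool:
            """Feed one sample; return True exactly when an alarm should be sent."""
            now = time.monotonic() if now is None else now
            if rate < self.threshold:
                self._above_since = None      # condition cleared, re-arm
                self._fired = False
                return False
            if self._above_since is None:
                self._above_since = now       # start timing the excursion
                return False
            if not self._fired and (now - self._above_since) >= self.sustain_secs:
                self._fired = True
                return True
            return False

    # Example: 90% utilization sustained for 30 seconds triggers one alarm.
    port_util = DampenedThreshold(threshold=0.9, sustain_secs=30.0)
    for t, util in [(0, 0.95), (10, 0.96), (31, 0.97), (40, 0.97)]:
        if port_util.sample(util, now=float(t)):
            print(f"t={t}s: notify the SM about a sustained excursion")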
Mike
>You get higher reliability with multiple planes of IB fabric because a
>failure point in the fabric doesn't take the entire network down.
>Handling each plane of the fabric as a separate network cuts down on the
>number of elements that each fabric manager has to track. You can always
>aggregate summary information across multiple planes of fabric after
>collecting from the individual fabric managers.
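A small sketch of that aggregation step, assuming each plane's fabric
manager can be made to report a per-plane summary (the summary fields and
the merge helper are assumptions for illustration, not an existing OFED
tool):

    # Merge per-plane summaries reported by the individual fabric managers.
    from collections import Counter

    def merge_plane_summaries(summaries):
        """summaries: iterable of per-plane dicts of additive counters,
        e.g. {"ports": 48000, "ports_down": 3, "symbol_errors": 120}."""
        total = Counter()
        for plane in summaries:
            total.update(plane)   # Counter adds values key by key
        return dict(total)

    fabric_wide = merge_plane_summaries([
        {"ports": 48_000, "ports_down": 2, "symbol_errors": 75},
        {"ports": 48_000, "ports_down": 1, "symbol_errors": 45},
    ])
    print(fabric_wide)   # {'ports': 96000, 'ports_down': 3, 'symbol_errors': 120}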
>
> > * provide alerts for ports crossing some rate-of-change threshold
> > * support profiling of data flow through the fabric
> > * be able to handle changes in topology due to MTBF.
> > Basic design considerations:
>
> [SNIP]
>
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>Bernie King-Smith
>IBM Corporation
>Server Group
>Cluster System Performance
>wombat2 at us.ibm.com (845)433-8483
>Tie. 293-8483 or wombat2 on NOTES
>
>"We are not responsible for the world we are born into, only for the world
>we leave when we die.
>So we have to accept what has gone before us and work to change the only
>thing we can,
>-- The Future." William Shatner