[openib-general] Scalable Monitoring - RFC

Michael Krause krause at cup.hp.com
Tue Nov 21 09:08:34 PST 2006


At 01:14 PM 11/20/2006, Bernard King-Smith wrote:

> > ----- Message from "Eitan Zahavi" <eitan at mellanox.co.il> on Mon, 20
> > Nov 2006 14:24:36 +0200 -----
> >
> > To: openib-general at openib.org
> >
> > Subject: [openib-general] Scalable Monitoring - RFC
> >
> > Hi All,
> > Following the path-forward requirements review by Matt L. at the
> > last OFA Dev Summit, I have started thinking about what would make
> > a monitoring system scale to tens of thousands of nodes.
> > This RFC provides both what I propose as the requirements list and
> > a draft implementation proposal - just to start the discussion.
> > I apologize for the long mail, but I think this issue deserves a
> > careful design (been there, done that).
> > Scalable fabric monitoring requirements:
> > * scale up to 48K nodes x 16 ports, which gets to about 1,000,000 ports
> >   (16 ports per device is an average over 32 ports for a switch and 1 for an HCA)
>
>What is the problem you are trying to address? 48K nodes, or a single 
>fabric of 1,000,000 endpoints? With the number of cores per node going up, 
>you are looking at a multiple-petaflop machine with this many nodes. When 
>do you expect this to happen, 5-10 years from now? Most very large systems 
>generally limit the fabric to the number of nodes, not the number of 
>endpoints. A single fabric runs into the problem of too many stages in 
>the Fat Tree and a single point of failure. Even if each node has multiple 
>cores, each node can have multiple IB ports, each connecting to a different 
>plane of IB fabric. This means that you only need to address a 
>configuration of 48K ports, and for greater bandwidth use multiple 
>parallel IB fabrics.

I listened to a government labs guy talk a year or so back about a 256K 
processor configuration needed by 2015 or so.  However, I doubt he was aware 
of all of the advancements in multi-core technology (which in turn will be 
gated by the slow pace of advancement in memory technology).  One can likely 
safely assume that 8 and 16 core processors will ship in volume in the 
coming years, and that aggressive multi-core designs like the one Intel 
demonstrated at the last IDF (80 cores, if I recall) are certainly a strong 
possibility given continued advances in process technology.  Nonetheless, 
the question of what people are actually trying to solve is a very valid one 
to ask.

The LID space is slightly less than 48K (some reserved / special values), 
and most people will want to enable at least 2-4 LIDs per port in order to 
support multi-path / APM within a multi-switch configuration.  That would 
lead to 12K-24K ports being linked per fabric instance.  12K ports * 8 cores 
per node gets one roughly half way to that government lab's need, which is 
likely the extreme niche.  Also note that most of these platforms will still 
be 2-4 socket solutions, so there could be multiple HCAs per endnode, and 
given the advent of 10 GT/s signaling and improvements in optics / copper as 
well as blade technology, one can see constructing this in a fairly compact 
physical environment.  (The mix of servers and storage on a consolidated 
fabric isn't really an issue, as some people may want, as you note, to 
segregate them into different fabric instances so the flows do not interfere 
with one another from a QoS perspective.)
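
To make that arithmetic concrete, here is a back-of-the-envelope sketch in 
Python.  The constants come from the paragraph above (the ~48K unicast LID 
limit, 2-4 LIDs per port, 8 cores per node); the script itself is only 
illustrative, not part of any existing tool:

# Back-of-the-envelope sketch of the LID budget argument above.
# Unicast LIDs run 0x0001-0xBFFF, so the usable space is just under 48K.
UNICAST_LIDS = 0xBFFF            # 49151
CORES_PER_NODE = 8               # assumption carried over from the text

for lids_per_port in (2, 4):     # 2-4 LIDs per port for multi-path / APM
    ports = UNICAST_LIDS // lids_per_port
    cores = ports * CORES_PER_NODE
    print("%d LIDs/port -> %5d ports per fabric instance, ~%dK cores"
          % (lids_per_port, ports, cores // 1000))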

As for monitoring, well, the IB management approach was never what I wanted 
(I was the sole vote against the architecture), but it is what it 
is.   Ideally, the switches should just take a more active role: set the 
thresholds for when to raise an alarm and let the SM react.   Given that at 
these speeds one needs a sustained effect over some reasonably long period 
of time in order to justify a change, it does not seem like the alarm rate 
would be that high, or that the SM would have trouble servicing the 
changes.   The goal should be to minimize oscillation, which means at these 
signaling rates, don't muck with the parameters until the effect has lasted 
N seconds, where N is potentially large (it will be somewhat a function of 
fabric diameter).
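
Just to illustrate the kind of switch-side logic I mean, here is a minimal 
sketch (Python, all names hypothetical - this is not an existing OpenSM or 
switch API): a counter has to stay above its threshold for a sustained N 
seconds before an alarm is raised, which is what keeps the oscillation down.

import time

class SustainedThresholdAlarm:
    """Raise an alarm only after a value has stayed above its threshold
    for at least hold_secs ("N seconds"), so the SM is not chasing noise."""

    def __init__(self, threshold, hold_secs):
        self.threshold = threshold
        self.hold_secs = hold_secs   # scale with fabric diameter
        self.first_seen = None       # when the threshold was first crossed

    def update(self, value, now=None):
        """Feed one sample; returns True when the alarm should fire."""
        now = time.time() if now is None else now
        if value < self.threshold:
            self.first_seen = None   # condition cleared, reset the timer
            return False
        if self.first_seen is None:
            self.first_seen = now    # start timing the sustained condition
            return False
        return now - self.first_seen >= self.hold_secs

The idea would be that a switch keeps one of these per counter it cares 
about and only sends a trap to the SM when update() returns True; the SM 
then decides whether any change is justified.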

Mike


>You get higher reliability with multiple planes of IB fabric because a 
>failure at one point in the fabric doesn't take the entire network down. 
>Handling each plane of the fabric as a separate network cuts down on the 
>number of elements that each fabric manager has to track. You can always 
>aggregate summary information across multiple planes of fabric after 
>collecting from the individual fabric managers.
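
For what it's worth, a minimal sketch of that kind of per-plane roll-up 
(Python, with hypothetical field names; nothing here corresponds to an 
existing fabric manager interface):

def aggregate_planes(plane_summaries):
    """Merge per-plane summaries produced by the individual fabric managers.
    Each summary is assumed to look like:
      {"ports": 24000, "ports_in_error": 3, "peak_utilization": 0.81}"""
    total = {"ports": 0, "ports_in_error": 0, "peak_utilization": 0.0}
    for summary in plane_summaries:
        total["ports"] += summary["ports"]
        total["ports_in_error"] += summary["ports_in_error"]
        total["peak_utilization"] = max(total["peak_utilization"],
                                        summary["peak_utilization"])
    return total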
>
> > * provide alerts for ports crossing some rate of change
> > * support profiling of data flow through the fabric
> > * be able to handle changes in topology due to MTBF.
> > Basic design considerations:
>
>  [SNIP]
>
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
>
>
>Bernie King-Smith
>IBM Corporation
>Server Group
>Cluster System Performance
>wombat2 at us.ibm.com    (845)433-8483
>Tie. 293-8483 or wombat2 on NOTES
>
>"We are not responsible for the world we are born into, only for the world 
>we leave when we die.
>So we have to accept what has gone before us and work to change the only 
>thing we can,
>-- The Future." William Shatner
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit 
>http://openib.org/mailman/listinfo/openib-general