[openib-general] Scalable Monitoring - RFC

Eitan Zahavi eitan at mellanox.co.il
Mon Nov 20 04:24:36 PST 2006


Hi All,

Following the path-forward requirements review by Matt L. in the last
OFA Dev Summit I have
started thinking about what would make a monitoring system scale to
tens of thousands of nodes.

This RFC provides both my proposed requirements list and a draft
implementation proposal - just to start the discussion.
I apologize for the long mail, but I think this issue deserves a careful
design (been there, done that...)

Scalable fabric monitoring requirements:
* scale up to 48k nodes of about 16 ports each, which gets to roughly
1,000,000 ports
    (16 ports per device is approximately the average of 32 ports for a
switch and 1 for an HCA)
* provide alerts for ports whose counters cross some rate-of-change
threshold
* support profiling of data flow through the fabric
* be able to handle changes in topology due to failures (MTBF).

Basic design considerations:
* a distributed data collection scheme is required. No single manager
can go through all the ports in a reasonable time.

* the system should push as much work as possible to the agents.
Examples are rate calculation and counter resets.

* features like data compression or aggregation are important for
reducing the data reported up the tree to the console or data storage.
   To support that, the agents should be able to:
   1. Report an error-counter increase only if it crosses a
programmable threshold
   2. Aggregate data and packet counters, providing data upstream only
when the change is larger than a given threshold
   3. Create and provide a histogram of the rate of change for each
counter (merging data for all ports it collects counters for).
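As an illustration of points 1-3, here is a minimal sketch of per-agent
thresholding plus a rate-of-change histogram. The class and field names
are mine, purely for illustration - they are not part of any existing
agent or OFED API:

```python
from collections import Counter

class CounterAggregator:
    """Per-agent aggregation: report a counter only when its change
    since the last report crosses a programmable threshold, and keep a
    histogram of change rates merged over all monitored ports."""

    def __init__(self, threshold, bucket_width):
        self.threshold = threshold        # minimum delta worth reporting
        self.bucket_width = bucket_width  # histogram bucket size (counts/epoch)
        self.last_reported = {}           # (port, counter) -> last value sent up
        self.histogram = Counter()        # bucket index -> number of samples

    def sample(self, port, counter, value):
        """Record one epoch sample; return the delta if it should be
        reported upstream, else None (suppressed)."""
        key = (port, counter)
        prev = self.last_reported.get(key, 0)
        delta = value - prev
        self.histogram[delta // self.bucket_width] += 1
        if abs(delta) >= self.threshold:
            self.last_reported[key] = value
            return delta
        return None

agg = CounterAggregator(threshold=100, bucket_width=50)
assert agg.sample(1, "xmit_data", 40) is None   # below threshold: suppressed
assert agg.sample(1, "xmit_data", 150) == 150   # crossed threshold: reported
```

The point of the sketch is that suppression and histogram bookkeeping
happen entirely inside the agent; only the deltas that cross the
threshold ever travel up the tree.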

* To support data profiling we will need to store xmit/recv data for
each port, which boils down to a huge amount of data.
   Instead of rolling that data up the tree, each agent should write
its own file; the files can be merged offline.
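A minimal sketch of the local-file-plus-offline-merge idea. The CSV
layout, file naming, and merge-by-epoch key are my assumptions - any
line-oriented, sortable log format would do:

```python
import csv
import glob
import heapq

def write_epoch(path, epoch, rows):
    """Append one epoch of (node_guid, port, counter, value) rows to
    the agent's local file instead of shipping them up the tree."""
    with open(path, "a", newline="") as f:
        w = csv.writer(f)
        for node, port, counter, value in rows:
            w.writerow([epoch, node, port, counter, value])

def merge_offline(pattern):
    """Offline merge: read every agent's file and yield rows in epoch
    order (each file is already sorted by epoch, so a k-way merge
    suffices)."""
    streams = []
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            streams.append(list(csv.reader(f)))
    return heapq.merge(*streams, key=lambda r: int(r[0]))
```

Since each agent appends in epoch order, the offline merge is a cheap
k-way merge rather than a global sort.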

* handling topology changes is a challenge for a distributed set of
agents. Two problems arise from topology dynamics:
   1. An agent responsible for querying some device might lose its
connection to it, so dynamic responsibility assignment is required to
support a dynamic topology.
   2. Agents are arranged in a reporting hierarchy which can itself
become disconnected.

Implementation proposal:
This section describes a specific agent behavior. 
The actual communication implementation can be based on IB verbs or over
plain sockets.

Definitions:
Responsibility Subnet: the set of nodes the agent is monitoring
Responsibility Subnet Perimeter: the nodes connected to the
responsibility subnet that are the responsibility of other agents
Peer Agents: the agents responsible for the nodes on the perimeter of
the current agent's responsibility subnet
Master Agent: the agent the current agent reports to. This is
configured on the command line.

Agent data structures:
1.	List of node GUIDs it is handling
2.	Last read values for all counters for each port it is handling
3.	Change-rate histogram for each port it is handling
4.	List of other agents that are responsible for nodes on the
perimeter of its responsibility subnet
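The four structures above might be sketched as follows (all names are
illustrative, not a proposed API):

```python
from dataclasses import dataclass, field

@dataclass
class PeerAgent:
    """A peer agent owning nodes on our perimeter."""
    ip: str
    num_nodes_owned: int

@dataclass
class MonitoringAgent:
    # 1. node GUIDs this agent is responsible for
    owned_node_guids: set = field(default_factory=set)
    # 2. last read counter values: (node_guid, port, counter) -> value
    last_values: dict = field(default_factory=dict)
    # 3. change-rate histogram per port: (node_guid, port) -> {bucket: count}
    rate_histograms: dict = field(default_factory=dict)
    # 4. perimeter ownership: node_guid -> PeerAgent
    peers: dict = field(default_factory=dict)
```

Keeping the perimeter map (4) separate from the owned set (1) makes the
re-discovery step cheap: only entries in `peers` need to be re-queried
when a peer stops answering.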
Agents Communication:
A broadcast group is used to broadcast messages regarding agents' node
ownership.
Point-to-point communication is also used as much as possible, as
described below.

The messages involved are (message data is listed in <>):
1.	Query: Who monitors <NodeGUID>?
Response: <Agent IP>, <port num>, <hops> (the distance of the agent
from the monitored node), <num nodes owned>
2.	Trap: <time>, <NodeGUID>, <port num>, <counter>, <value>,
<threshold crossed>
3.	Data: <time>, set of port counters to report. The message is
variable length.
4.	Get Histograms: an agent is requested to provide back histograms
for all its managed ports
5.	Clear all: instruct an agent to clear all counters and start
fresh
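To make the field lists concrete, here is one possible wire encoding of
the Trap message. The field sizes, the counter-ID numbering, and the
use of plain byte packing are my assumptions - nothing here is a
defined protocol:

```python
import struct

# Trap: <time>, <NodeGUID>, <port num>, <counter>, <value>,
#       <threshold crossed>
# Network byte order: u32 time, u64 GUID, u8 port, u16 counter id,
# u64 value, u64 threshold.
TRAP_FMT = "!IQBHQQ"

def pack_trap(time_s, node_guid, port, counter_id, value, threshold):
    """Encode a Trap message for sending up the tree."""
    return struct.pack(TRAP_FMT, time_s, node_guid, port, counter_id,
                       value, threshold)

def unpack_trap(data):
    """Decode a Trap message received from a child agent."""
    return struct.unpack(TRAP_FMT, data)

msg = pack_trap(1164025476, 0x0002C9000100D050, 3, 12, 5000, 4096)
assert unpack_trap(msg) == (1164025476, 0x0002C9000100D050, 3, 12,
                            5000, 4096)
```

A fixed binary layout like this keeps Trap messages at a few tens of
bytes, which matters when a million ports can generate them; the
variable-length Data message would need a small header giving the
counter count.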

Agent behavior is:
1.	Register with the monitor agents broadcast group - i.e. have a
UDP socket listening on it
2.	Discover the fabric connected to it (in a BFS manner, while
tracking the number of hops):
		For each node discovered:
			If its responsible master is already known - skip
			Query for the responsible master (broadcast "Who
owns?")
			If a response arrives
				If the responder is at the same or a lower
hop count than the current one, add the node to the list of nodes owned
by other agents
			Else
				it is this agent's responsibility - notify
every other agent by broadcasting this info (the response for "Who
owns?")
				add the node to the BFS next-step list
3.	For each node in its responsibility list:
		read all counters every "epoch" (i.e. at a constant scan
interval)
		calculate the rate of change for every counter
		report traps if an error counter increased, or if a data
counter changed over the epoch by more than the change threshold
		clear counters before they overflow
4.	Once per timeout:
		For each peer in the peers list
			If the peer does not answer, run step 2 again
5.	Respond to incoming messages:
a.	"Who owns?"  - use point to point
b.	"Trap" - propagate up the tree
c.	"Data" - propagate up the tree
d.	"Histogram" - merge with other child histograms.
e.	"Get Histogram" - send downwards and, after merging, send
"Histogram" up.
f.	"Clear all" - clear self and send downwards
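The discovery step (2) could be sketched roughly as below. `who_owns`
stands in for the broadcast query plus its response, and `claim` for
the ownership broadcast - both are placeholders for whatever transport
(IB verbs or sockets) is chosen, so this is an illustrative sketch, not
a proposed implementation:

```python
from collections import deque

def discover(start, neighbors, who_owns, claim):
    """BFS over the fabric from this agent's position (step 2 above).

    neighbors(node) -> iterable of adjacent nodes
    who_owns(node)  -> (agent, hops) if another agent answers the
                       broadcast query, else None
    claim(node)     -> broadcast that this agent now owns the node
    """
    owned, perimeter = set(), {}
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        answer = who_owns(node)
        if answer is not None and answer[1] <= hops:
            # a peer is as close or closer: the node stays with it
            perimeter[node] = answer[0]
            continue                 # do not BFS past foreign nodes
        claim(node)                  # our response to "Who owns?"
        owned.add(node)
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                queue.append((n, hops + 1))
    return owned, perimeter
```

Note that stopping the BFS at foreign nodes is what partitions the
fabric between agents: each agent ends up with a connected
responsibility subnet, and `perimeter` is exactly the peer list used by
the keep-alive check in step 4.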


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

