[ofa-general] Running OpenSM on large clusters

Wed Oct 17 04:30:49 PDT 2007

On 16:35 Tue 16 Oct, Edward Mascarenhas wrote:
> 
> Has anyone seen issues with running OpenSM on large (1500+ nodes) 
> clusters?
> 
> We are seeing 1000s of the following message in the system log
> 
> __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is 
> already overloaded with 6736 messages and queue time of:10006[msec]

I guess you see this during fabric bringup, when the SA processor is
not yet available. Which version of OpenSM are you using? We made some
improvements in this area in recent versions (partially in OFED-1.2).
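
To get a feel for how bad the overload is, the drop messages in the
system log could be summarized with something like the script below.
This is only a minimal sketch: the pattern is modeled on the single
message quoted above and may not match other OpenSM versions, and the
function name summarize_drops and the log path in the usage comment are
hypothetical.

```python
import re
import sys

# Pattern modeled on the one log line quoted in this thread (assumption:
# other OpenSM versions may word the message differently).
DROP_RE = re.compile(
    r"Dropping MAD since the dispatcher is\s+already overloaded with "
    r"(\d+) messages and queue time of:\s*(\d+)\[msec\]"
)


def summarize_drops(text):
    """Return (drop_count, max_queue_depth, max_queue_time_msec)."""
    matches = DROP_RE.findall(text)
    if not matches:
        return (0, 0, 0)
    depths = [int(depth) for depth, _ in matches]
    times = [int(msec) for _, msec in matches]
    return (len(matches), max(depths), max(times))


if __name__ == "__main__":
    # Hypothetical usage: python count_drops.py /var/log/messages
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log_file:
            print(summarize_drops(log_file.read()))
```

Comparing the counts from before and after an OpenSM upgrade would
show whether the newer version actually reduces the number of drops.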

> It seems like a huge number of datagrams are being generated resulting 
> in increased time to bring up the fabric. 
> 
> Is there a threshold of cluster size beyond which we are likely to see 
> these messages?
> 
> How many MADs are generated during bring up?

A lot :). The exact number depends on the topology and the requested
configuration. Could you send us the output of ibnetdiscover?

> What is the largest cluster size for which OpenSM has been tried by 
> others?

I hope others will answer. The largest cluster known to me was
Thunderbird (4480 nodes); some details are here:
http://openfabrics.org/archives/nov2006sc/ofa_devel_111606.pdf

Sasha