[ofa-general] Running OpenSM on large clusters

Wed Oct 17 11:38:53 PDT 2007

On Tue, 16 Oct 2007 16:35:38 -0700
Edward Mascarenhas <eddiem at sgi.com> wrote:

> 
> Has anyone seen issues with running OpenSM on large (1500+ nodes) 
> clusters?
> 
> We are seeing 1000s of the following message in the system log
> 
> __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is 
> already overloaded with 6736 messages and queue time of:10006[msec]
> 
> It seems like a huge number of datagrams are being generated resulting 
> in increased time to bring up the fabric. 
> 
> Is there a threshold of cluster size beyond which we are likely to see 
> these messages?
> 
> How many MADs are generated during bring up?
> 
> What is the largest cluster size for which OpenSM has been tried by 
> others?
> 

We have Atlas running with 1152 nodes.  OpenSM is able to route it with up/down
routing in ~2 min.

We don't see messages like the one you quote above, but we have been using the
OpenSM from OFED 1.2.
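
To check quickly whether a given log contains these messages, and how bad the
backlog gets, something like the following may help. It is only a sketch: the
pattern is built from the single message quoted above, other OpenSM versions
may word it differently, and summarize_drops is a made-up helper name, not part
of OpenSM.

```python
import re

# Pattern derived from the one message quoted in this thread (an assumption:
# other OpenSM versions may format the line differently). \s+ tolerates the
# line wrap introduced by the mail client.
DROP_RE = re.compile(
    r"__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is\s+"
    r"already overloaded with (\d+) messages and queue time of:(\d+)\[msec\]"
)


def summarize_drops(log_text):
    """Return (drop_count, max_queued_messages, max_queue_time_msec)."""
    hits = [(int(msgs), int(msec)) for msgs, msec in DROP_RE.findall(log_text)]
    if not hits:
        return (0, 0, 0)
    return (
        len(hits),
        max(msgs for msgs, _ in hits),
        max(msec for _, msec in hits),
    )


# The message from the original post, joined back onto one line.
sample = (
    "__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is "
    "already overloaded with 6736 messages and queue time of:10006[msec]\n"
)
print(summarize_drops(sample))
```

Feeding it the contents of the system log gives a rough count of dropped MADs
and the worst queue depth and queue time seen.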

Hope this helps,
Ira