[ofa-general] Running OpenSM on large clusters
    Ira Weiny 
    weiny2 at llnl.gov
       
    Wed Oct 17 11:38:53 PDT 2007
    
    
  
On Tue, 16 Oct 2007 16:35:38 -0700
Edward Mascarenhas <eddiem at sgi.com> wrote:
> 
> Has anyone seen issues with running OpenSM on large (1500+ nodes) 
> clusters?
> 
> We are seeing 1000s of the following message in the system log
> 
> __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is 
> already overloaded with 6736 messages and queue time of:10006[msec]
> 
> It seems like a huge number of datagrams are being generated resulting 
> in increased time to bring up the fabric. 
> 
> Is there a threshold of cluster size beyond which we are likely to see 
> these messages?
> 
> How many MADs are generated during bring up?
> 
> What is the largest cluster size for which OpenSM has been tried by 
> others?
> 
We have Atlas running with 1152 nodes.  OpenSM is able to route it with up/down
routing in about 2 minutes.
We don't see messages like the ones you quote above, but we have been using the
OpenSM from OFED 1.2.
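
To compare setups it may help to quantify the drops rather than eyeball the
log.  Below is a minimal sketch, assuming each drop message appears on a single
log line in exactly the format quoted above.  The function name
summarize_drops is hypothetical and not part of OpenSM.

```python
import re

# Pattern copied from the message quoted above. Assumption: the real log
# keeps each message on one line (the mail wrapped it across two).
DROP_RE = re.compile(
    r"Dropping MAD since the dispatcher is already overloaded "
    r"with (\d+) messages and queue time of:(\d+)\[msec\]"
)


def summarize_drops(lines):
    """Return (count, max queued messages, max queue time in msec)."""
    count = max_msgs = max_msec = 0
    for line in lines:
        m = DROP_RE.search(line)
        if m is None:
            continue  # not a drop message
        count += 1
        max_msgs = max(max_msgs, int(m.group(1)))
        max_msec = max(max_msec, int(m.group(2)))
    return count, max_msgs, max_msec
```

Passing an open log file (any iterable of lines) to summarize_drops gives the
number of dropped MADs plus the worst queue depth and queue time seen, which
is easier to compare across cluster sizes or OpenSM versions.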
Hope this helps,
Ira
    
    