[ofa-general] Running OpenSM on large clusters

Wed Oct 17 12:03:51 PDT 2007

On 11:38 Wed 17 Oct, Ira Weiny wrote:
> On Tue, 16 Oct 2007 16:35:38 -0700
> Edward Mascarenhas <eddiem at sgi.com> wrote:
> 
> > 
> > Has anyone seen issues with running OpenSM on large (1500+ nodes) 
> > clusters?
> > 
> > We are seeing thousands of the following message in the system log:
> > 
> > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is 
> > already overloaded with 6736 messages and queue time of:10006[msec]
> > 
> > It seems like a huge number of datagrams are being generated, resulting
> > in a longer time to bring up the fabric.
> > 
> > Is there a threshold of cluster size beyond which we are likely to see
> > these messages?
> > 
> > How many MADs are generated during bring up?
> > 
> > What is the largest cluster size for which OpenSM has been tried by 
> > others?
> > 
> 
> We have atlas running with 1152 nodes.  OpenSM is able to route it with up/down
> routing in ~2min.

2 min is a lot for OpenSM with up/down. Is that pure OpenSM time, or is it
measured from power-on through bring-up?
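
To put numbers on the dropped-MAD messages quoted above, something like the
following could summarize them from a saved log. This is only a rough sketch:
the regular expression is inferred from the single sample line in this thread
(other OpenSM versions may word the message differently), and the function
name summarize_drops is made up for illustration.

```python
import re
from statistics import mean

# Pattern inferred from the one sample message quoted in this thread.
# This is an assumption, not a documented OpenSM log format.
DROP_RE = re.compile(
    r"Dropping MAD since the dispatcher is\s+already overloaded with "
    r"(\d+) messages and queue time of:\s*(\d+)\[msec\]"
)

def summarize_drops(log_text):
    """Count drop messages; report max queue depth and mean queue time (ms)."""
    hits = [(int(depth), int(ms)) for depth, ms in DROP_RE.findall(log_text)]
    if not hits:
        return {"count": 0, "max_depth": 0, "mean_queue_ms": 0.0}
    return {
        "count": len(hits),
        "max_depth": max(depth for depth, _ in hits),
        "mean_queue_ms": mean(ms for _, ms in hits),
    }

# The sample line from this thread, joined back onto one line.
sample = (
    "__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is "
    "already overloaded with 6736 messages and queue time of:10006[msec]\n"
)
print(summarize_drops(sample))
```

Feeding it the whole system log instead of the sample string would show how
deep the dispatcher queue gets and whether the drops cluster around bring-up.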

Sasha

> We don't see messages like the ones you describe above, but we have been
> using the OpenSM from OFED 1.2.
> 
> Hope this helps,
> Ira