[ofa-general] Running OpenSM on large clusters
Sasha Khapyorsky
sashak at voltaire.com
Wed Oct 17 12:03:51 PDT 2007
On 11:38 Wed 17 Oct , Ira Weiny wrote:
> On Tue, 16 Oct 2007 16:35:38 -0700
> Edward Mascarenhas <eddiem at sgi.com> wrote:
>
> >
> > Has anyone seen issues with running OpenSM on large (1500+ nodes)
> > clusters?
> >
> > We are seeing 1000s of the following message in the system log:
> >
> > __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is
> > already overloaded with 6736 messages and queue time of:10006[msec]
> >
> > It seems like a huge number of datagrams are being generated resulting
> > in increased time to bring up the fabric.
> >
> > Is there a threshold of cluster size beyond which we are likely to see
> > these messages?
> >
> > How many MADs are generated during bring up?
> >
> > What is the largest cluster size for which OpenSM has been tried by
> > others?
> >
>
> We have atlas running with 1152 nodes. OpenSM is able to route it with up/down
> routing in ~2min.

2min is a lot for OpenSM with up/down. Is that pure OpenSM time, or is it
measured from power-on at bring-up?

Sasha

> We don't see messages like the ones you describe above, but we have been
> using the OpenSM from OFED 1.2.
>
> Hope this helps,
> Ira