[ofa-general] Running OpenSM on large clusters
Ira Weiny
weiny2 at llnl.gov
Wed Oct 17 11:38:53 PDT 2007
On Tue, 16 Oct 2007 16:35:38 -0700
Edward Mascarenhas <eddiem at sgi.com> wrote:
>
> Has anyone seen issues with running OpenSM on large (1500+ nodes)
> clusters?
>
> We are seeing 1000s of the following message in the system log
>
> __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is
> already overloaded with 6736 messages and queue time of:10006[msec]
>
> It seems like a huge number of datagrams are being generated resulting
> in increased time to bring up the fabric.
>
> Is there a threshold of cluster size beyond which we are likely to see
> these messages?
>
> How many MADs are generated during bring up?
>
> What is the largest cluster size for which OpenSM has been tried by
> others?
>
We have Atlas running with 1152 nodes. OpenSM is able to route it with up/down
routing in about 2 minutes.
We don't see messages like the one you quote above, but we have been using the
OpenSM from OFED 1.2.
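
If it helps to quantify how bad the overload is, here is a minimal sketch (plain
Python, not part of OpenSM) that counts the "Dropping MAD" lines and reports the
worst queue depth and queue time. The regex is modeled only on the single
message quoted above, and it assumes each message sits on one line in the log;
the function name summarize_drops is hypothetical.

```python
import re
from typing import Dict, Iterable

# Pattern modeled on the one message quoted in this thread; the exact
# spacing in a real syslog line is an assumption.
DROP_RE = re.compile(
    r"Dropping MAD since the dispatcher is\s+already overloaded with\s+"
    r"(\d+) messages and queue time of:\s*(\d+)\[msec\]"
)


def summarize_drops(lines: Iterable[str]) -> Dict[str, int]:
    """Count drop messages and track the largest queue depth and queue time."""
    count = 0
    max_depth = 0
    max_queue_msec = 0
    for line in lines:
        match = DROP_RE.search(line)
        if match is None:
            continue  # not a dispatcher-overload message
        count += 1
        max_depth = max(max_depth, int(match.group(1)))
        max_queue_msec = max(max_queue_msec, int(match.group(2)))
    return {
        "count": count,
        "max_depth": max_depth,
        "max_queue_msec": max_queue_msec,
    }
```

Any iterable of strings works as input, so an open log file can be passed
directly; where your system log actually lives depends on your syslog setup.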
Hope this helps,
Ira