[ofa-general] Running OpenSM on large clusters
Sasha Khapyorsky
sashak at voltaire.com
Wed Oct 17 04:30:49 PDT 2007
On 16:35 Tue 16 Oct , Edward Mascarenhas wrote:
>
> Has anyone seen issues with running OpenSM on large (1500+ nodes)
> clusters?
>
> We are seeing 1000s of the following message in the system log
>
> __osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is
> already overloaded with 6736 messages and queue time of:10006[msec]
I guess you see this during fabric bring-up, when the SA processor is not
available yet. Which version of OpenSM are you using? We made some
improvements in this area in recent versions (partially in OFED-1.2).
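To see how severe the overload is, it may help to count these messages and look at the peak queue depth and queue time they report. Below is a minimal sketch in Python, not part of OpenSM. The pattern is built only from the message quoted above, and it assumes each message sits on a single line in the real log (the line break above comes from mail wrapping). The name summarize_drops is hypothetical.

```python
import re

# Pattern derived from the quoted message; \s+ tolerates the point where
# the mail client wrapped the line. Assumes one message per log line.
DROP_RE = re.compile(
    r"__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is\s+"
    r"already overloaded with (\d+) messages and queue time of:(\d+)\[msec\]"
)


def summarize_drops(lines):
    """Return (drop_count, max_queued_messages, max_queue_time_msec)."""
    count = 0
    max_msgs = 0
    max_time = 0
    for line in lines:
        m = DROP_RE.search(line)
        if m is None:
            continue  # unrelated log line
        count += 1
        max_msgs = max(max_msgs, int(m.group(1)))
        max_time = max(max_time, int(m.group(2)))
    return count, max_msgs, max_time


# The only sample data here is the message from the original report.
sample = [
    "__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is "
    "already overloaded with 6736 messages and queue time of:10006[msec]",
    "some unrelated log line",
]
print(summarize_drops(sample))
```

On a real system the same function can be given an open file object for the system log; the log location is site-specific and is not assumed here.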
> It seems like a huge number of datagrams are being generated resulting
> in increased time to bring up the fabric.
>
> Is there a threshold of cluster size beyond which we are likely to see
> these messages.
>
> How many MADs are generated during bring up?
A lot :). The exact number depends on the topology and the requested
configuration. Could you send us the output of ibnetdiscover?
> What is the largest cluster size for which OpenSM has been tried by
> others?
I hope others will answer. The largest cluster I know of was Thunderbird
(4480 nodes); some details are here:
http://openfabrics.org/archives/nov2006sc/ofa_devel_111606.pdf
Sasha