[ofa-general] Running OpenSM on large clusters

Tue Oct 16 16:35:38 PDT 2007

Has anyone seen issues with running OpenSM on large (1500+ nodes) 
clusters?

We are seeing thousands of the following message in the system log:

__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is 
already overloaded with 6736 messages and queue time of:10006[msec]

It seems that a huge number of datagrams are being generated, which 
increases the time needed to bring up the fabric.
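
For anyone who wants to compare numbers from their own logs, here is a 
minimal Python sketch that tallies these messages and pulls out the 
queue depth and queue time. The pattern is based only on the message 
text quoted above and assumes syslog writes it on a single line; other 
OpenSM versions may word it differently. The name summarize_drops is 
hypothetical, not part of OpenSM.

```python
import re
from typing import Iterable, Tuple

# Assumption: pattern derived only from the message quoted in this post.
DROP_RE = re.compile(
    r"__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is\s+"
    r"already overloaded with (\d+) messages and queue time of:\s*(\d+)\[msec\]"
)


def summarize_drops(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (matching message count, max queue depth, max queue time in msec)."""
    count = max_depth = max_time = 0
    for line in lines:
        m = DROP_RE.search(line)
        if m is None:
            continue  # not a dropped-MAD message
        count += 1
        max_depth = max(max_depth, int(m.group(1)))
        max_time = max(max_time, int(m.group(2)))
    return count, max_depth, max_time


# Sample input: the message from this post plus one unrelated line.
sample = [
    "__osm_sa_mad_ctrl_process: Dropping MAD since the dispatcher is "
    "already overloaded with 6736 messages and queue time of:10006[msec]",
    "some unrelated log line",
]
print(summarize_drops(sample))  # prints (1, 6736, 10006)
```

Since the function only iterates over lines, an open log file object 
can be passed in place of the sample list.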

Is there a threshold of cluster size beyond which we are likely to see 
these messages?

How many MADs are generated during fabric bring-up?

What is the largest cluster size for which OpenSM has been tried by 
others?

Thanks,
Edward