[ofw] OpenSM with HPC

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Oct 2 12:54:23 PDT 2008


Hi Anatoly,

I need more details:

Anatoly Greenblatt wrote:
> Hi,
> 
> Our client reported problems running over 192 concurrent jobs with 
> OpenSM.

What kind of cluster does your client have?
How many hosts? How many switches?

What do these jobs do?
Are these MPI jobs? Do they use/create multicast groups? Something else?
How many processes does each job have?

> The jobs are executed several times. After a while the memory 
> usage of OpenSM goes to ~30MB, cpu usage to 100% and eventually the node 
> freezes and needs to be reset.

Is the problem reproducible?
Can you send me the SM log?

-- Yevgeny

> 
> Configuration:
> Winof rev 1596 (~rc1)
> ConnectX HCA
> Windows 2008 x64 with HPC pack rc2
> NetworkDirect is installed
> OpenSM is running as a service on the head node.
> About a hundred nodes are used (maybe more, I don’t have the exact number yet)
> 
> Has anyone any thoughts about this?
> 
> Thanks,
> Anatoly.
> 



