[ofw] OpenSM with HPC

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Oct 2 14:18:11 PDT 2008


Hi Anatoly,

Missed Hal's response when I wrote mine.
As Hal says, the Windows SM version is old. IMHO, fixing problems like
the one you've reported is not even an option, because it would involve
serious changes to existing features or implementing additional ones,
which has already been done in the Linux OpenSM.
So I can only reiterate what Hal has already said - OpenSM in Windows
needs to be updated to the latest and greatest.

-- Yevgeny


Yevgeny Kliteynik wrote:
> Hi Anatoly,
> 
> I need more details:
> 
> Anatoly Greenblatt wrote:
>> Hi,
>>
>> Our client reported problems running over 192 concurrent jobs with 
>> OpenSM.
> 
> What kind of cluster does your client have?
> How many hosts? How many switches?
> 
> What do these jobs do?
> Are these MPI jobs? Do they use/create multicast groups? Something else?
> How many processes does each job have?
> 
>> The jobs are executed several times. After a while the memory usage of
>> OpenSM goes to ~30MB, CPU usage to 100% and eventually the node
>> freezes and needs to be reset.
> 
> Is the problem reproducible?
> Can you send me the SM log?
> 
> -- Yevgeny
> 
>>  
>>
>> Configuration:
>>
>> WinOF rev 1596 (~rc1)
>>
>> ConnectX HCA
>>
>> Windows 2008 x64 with HPC pack rc2
>>
>> NetworkDirect is installed
>>
>> OpenSM is running as a service on the head node.
>>
>> About a hundred nodes are used (maybe more, I don’t have an exact number
>> yet)
>>
>>  
>>
>> Has anyone any thoughts about this?
>>
>>  
>>
>> Thanks,
>>
>> Anatoly.
>>
>>  
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> ofw mailing list
>> ofw at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
> 
