[ofa-general] Running OpenSM on large clusters

Ira Weiny weiny2 at llnl.gov
Wed Oct 17 14:19:37 PDT 2007


On Wed, 17 Oct 2007 14:04:49 -0500
Chris Elmquist <chrise at sgi.com> wrote:

> On Wednesday (10/17/2007 at 11:38AM -0700), Ira Weiny wrote:
> > On Tue, 16 Oct 2007 16:35:38 -0700
> > Edward Mascarenhas <eddiem at sgi.com> wrote:
> > 
> > > 
> > > Has anyone seen issues with running OpenSM on large (1500+ nodes) 
> > > clusters?
> > > 
> [...]
> 
> > 
> > We have Atlas running with 1152 nodes.  OpenSM is able to route it with
> > up/down routing in ~2 minutes.
> > 
> > We don't see messages like those you describe above, but we have been
> > using the OpenSM from OFED 1.2.
> > 
> > Hope this helps,
> > Ira
> 
> Ira,
> 
> Thank you for the information.  Can you describe the configuration of
> the machine on which you run that OpenSM?  How much horsepower and the
> type of HCA used?
> 
> I suspect that the machine on which we run OpenSM may be underpowered for
> what we are asking of it...
> 
> Chris
> 

The node is a 4-socket motherboard with 2.4 GHz dual-core Opterons (8 cores
total).  OpenSM is the biggest thing running on that node, but I don't recall
it taking all 8 cores for any length of time...  The HCAs are Mellanox on a
PCIe bus.  ibstat output is included below.

Ira

14:12:48 > ibstat
CA 'mthca0'
        CA type: MT25208
        Number of ports: 2
        Firmware version: 5.2.916
        Hardware version: 20
        Node GUID: 0x0002c9020021a5ec
        System image GUID: 0x0002c9020021a5ef
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1388
                LMC: 0
                SM lid: 1388
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c9020021a5ed
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a68
                Port GUID: 0x0002c9020021a5ee
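
[Editor's sketch, not part of the original thread: port states in output like
the above are easy to scrape.  The snippet below parses a saved copy of the
text rather than querying live hardware, so the awk patterns assume the fixed
ibstat layout shown here; on a live node you would pipe `ibstat mthca0` in
directly.]

```shell
# Parse each port's state out of ibstat-style output.
# Sample text copied from the output above.
ibstat_output='CA '\''mthca0'\''
        Port 1:
                State: Active
        Port 2:
                State: Down'

# Remember the most recent "Port N:" header, then print it alongside
# each "State:" line that follows.
printf '%s\n' "$ibstat_output" |
  awk '/Port [0-9]+:/ { port = $2; sub(":", "", port) }
       /State:/      { print "Port " port ": " $2 }'
# Prints:
#   Port 1: Active
#   Port 2: Down
```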
