[ofa-general] Running OpenSM on large clusters
Ira Weiny
weiny2 at llnl.gov
Wed Oct 17 14:19:37 PDT 2007
On Wed, 17 Oct 2007 14:04:49 -0500
Chris Elmquist <chrise at sgi.com> wrote:
> On Wednesday (10/17/2007 at 11:38AM -0700), Ira Weiny wrote:
> > On Tue, 16 Oct 2007 16:35:38 -0700
> > Edward Mascarenhas <eddiem at sgi.com> wrote:
> >
> > >
> > > Has anyone seen issues with running OpenSM on large (1500+ nodes)
> > > clusters?
> > >
> [...]
>
> >
> > We have atlas running with 1152 nodes. OpenSM is able to route it with up/down
> > routing in ~2min.
> >
> > We don't see messages like you state above. But we have been using the OpenSM
> > from OFED 1.2.
> >
> > Hope this helps,
> > Ira
>
> Ira,
>
> Thank you for the information. Can you describe the configuration of
> the machine on which you run that OpenSM? How much horsepower and the
> type of HCA used?
>
> I suspect that the machine on which we run OpenSM may be underpowered for
> what we are asking of it...
>
> Chris
>
The node is a 4-socket motherboard with 2.4 GHz dual-core Opterons (8 cores
total). OpenSM is the biggest thing running on that node, but I don't recall it
using all 8 cores for any length of time... The HCAs are Mellanox on a PCIe bus.
The ibstat output is included below.
Ira
14:12:48 > ibstat
CA 'mthca0'
    CA type: MT25208
    Number of ports: 2
    Firmware version: 5.2.916
    Hardware version: 20
    Node GUID: 0x0002c9020021a5ec
    System image GUID: 0x0002c9020021a5ef
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 20
        Base lid: 1388
        LMC: 0
        SM lid: 1388
        Capability mask: 0x02510a6a
        Port GUID: 0x0002c9020021a5ed
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510a68
        Port GUID: 0x0002c9020021a5ee
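
For anyone who wants to check the same thing on their own SM node, here is a
minimal sketch that reads ibstat-style text and reports which port, if any,
hosts the master SM. It relies on the fact that a port whose Base lid equals
its SM lid is the port the master SM is running on (Port 1 above, lid 1388).
The function names parse_ports and hosts_sm are made up for illustration and
are not part of any OFED tool, and the SAMPLE text is a trimmed copy of the
output above rather than a live query.

```python
import re

# Trimmed copy of the ibstat output above; a real script might instead
# capture the output of the ibstat command (assumption, not shown here).
SAMPLE = """\
CA 'mthca0'
CA type: MT25208
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1388
LMC: 0
SM lid: 1388
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
"""


def parse_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text.

    Leading whitespace is ignored, so indented and flattened output
    are handled the same way. CA-level fields before the first
    "Port N:" line are skipped.
    """
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Port (\d+):", line)
        if header:
            current = ports.setdefault(int(header.group(1)), {})
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return ports


def hosts_sm(port):
    """True when the port is Active and its own LID is the SM LID."""
    return (
        port.get("State") == "Active"
        and port.get("Base lid") == port.get("SM lid")
    )


if __name__ == "__main__":
    for number, fields in sorted(parse_ports(SAMPLE).items()):
        status = "hosts the SM" if hosts_sm(fields) else "does not host the SM"
        print(f"Port {number} ({fields['State']}): {status}")
```

The State check is there because a Down port reports lid 0 for both fields,
which would otherwise look like a match.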