[ofa-general] OpenSM?

Ira Weiny weiny2 at llnl.gov
Tue May 27 10:08:59 PDT 2008


Charles,

Here at LLNL we have been running OpenSM for some time.  Thus far we are
very happy with its performance.  Our largest cluster is 1152 nodes, and
OpenSM can bring it up (not counting node boot time) in less than a minute.

Here are some details.

We are running v3.1.10 of OpenSM with some minor modifications (mostly
patches which have been submitted upstream and accepted by Sasha but are
not yet in a release).

Our clusters are all Fat-tree topologies.

We have a node which is more or less dedicated to running OpenSM.  We have some
other monitoring software running on it, but OpenSM can utilize the CPU/Memory
if it needs to.

   A) On our large clusters this node is a 4-socket, dual-core (8 cores
   total) Opteron running at 2.4 GHz with 16 GB of memory.  I don't
   believe OpenSM needs this much, but the nodes were all built the same,
   so this is what it got.

   B) On one of our smaller clusters (128 nodes) OpenSM is running on a
   dual-socket, single-core (2 cores total) 2.4 GHz Opteron node with
   2 GB of memory.  We have not seen any issues with OpenSM on this
   cluster.

We run with the up/down routing algorithm; ftree has not panned out for
us yet.  I can't say how either compares to the Cisco embedded SM's
algorithms.
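
(In case it is useful: the routing engine is picked when OpenSM is
started.  The lines below are only a sketch of the stock OpenSM options
as I remember them; check "opensm --help" and your init scripts for the
exact spelling on your version.)

    # run OpenSM with up/down routing, bound to a specific local port GUID
    opensm -R updn -g <port GUID>

    # or try the fat-tree engine instead
    opensm -R ftree -g <port GUID>

The same choice can usually be made persistent with a
"routing_engine updn" line in the OpenSM options/config file.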

In short, OpenSM should work just fine on your cluster.

Hope this helps,
Ira


On Tue, 27 May 2008 11:15:14 -0400
Charles Taylor <taylor at hpc.ufl.edu> wrote:

> 
> We have a 400 node IB cluster.  We are running an embedded SM in
> failover mode on our TS270/Cisco 7008 core switches.  Lately we have
> been seeing problems with LID assignment when rebooting nodes (see log
> messages below).  It is also taking far too long for LIDs to be
> assigned: on the order of minutes for the ports to transition to
> "ACTIVE".
> 
> This seems like a bug to us, and we are considering switching to
> OpenSM on a host.  I'm wondering about experience with running OpenSM
> for medium to large (fat-tree) clusters and what resources (memory/CPU)
> we should plan on for the host node.
> 
> Thanks,
> 
> Charlie Taylor
> UF HPC Center
> 
> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An existing IB node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed
> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering removed ports
> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
> May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194
> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering new ports
> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
> May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: Failed to negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad status 0x1c
> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new IB node 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0
> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures
> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: Reassigning LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr LID=0
> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: Clean up SA resources for port forced down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: cleaning DB for guid 00:02:c9:02:00:21:4b:59, lid 194
> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4
> May 27 14:18:47 topspin-270sc last message repeated 23 times
> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network
> May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: Active port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, state=2, neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2
> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node 00:06:6a:00:d9:00:04:5d port 16 is INIT state
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by some ports in INIT state
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4
> May 27 14:21:05 topspin-270sc last message repeated 23 times
> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network
> May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
> May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
> May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM
> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a backup session with Standby SM guid 00:05:ad:00:00:02:3c:60
> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM
> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with Standby SM guid 00:05:ad:00:00:02:3c:60
> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with all designated backup SMs
> May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: **********************  NEW SWEEP ********************
> May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
> 
> On May 23, 2008, at 2:20 PM, Steve Wise wrote:
> 
> > Or Gerlitz wrote:
> >> Steve Wise wrote:
> >>> Are we sure we need to expose this to the user?
> >> I believe this is the way to go if we want to let smart ULPs  
> >> generate new rkey/stag per mapping. Simpler ULPs could then just  
> >> put the same value for each map associated with the same mr.
> >>
> >> Or.
> >>
> >
> > How should I add this to the API?
> >
> > Perhaps we just document the format of an rkey in the struct ib_mr.   
> > Thus the app would do this to change the key before posting the  
> > fast_reg_mr wr (coded to be explicit, not efficient):
> >
> > u8 newkey;
> > u32 newrkey;
> >
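> > /* keep the upper 24 bits of the rkey, replace only the low key byte */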
> > newkey = 0xaa;
> > newrkey = (mr->rkey & 0xffffff00) | newkey;
> > mr->rkey = newrkey;
> > wr.wr.fast_reg.mr = mr;
> > ...
> >
> >
> > Note: this assumes mr->rkey is in host byte order (I think the Linux
> > RDMA code assumes this in other places too).
> >
> >
> > Steve.
> >
> 
> 


