[ofa-general] OpenSM?

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Tue May 27 14:31:58 PDT 2008


Charles,

Ira Weiny wrote:
> Charles,
> 
> Here at LLNL we have been running OpenSM for some time.  Thus far we are very
> happy with its performance.  Our largest cluster is 1152 nodes and OpenSM can
> bring it up (not counting boot time) in less than a minute.

OpenSM is successfully running on some large clusters with 4-5K nodes.
It takes about 2-3 minutes to bring up such clusters.

> Here are some details.
> 
> We are running v3.1.10 of OpenSM with some minor modifications (mostly patches
> which have been submitted upstream and been accepted by Sasha but are not yet
> in a release.)
> 
> Our clusters are all Fat-tree topologies.
> 
> We have a node which is more or less dedicated to running OpenSM.  We have some
> other monitoring software running on it, but OpenSM can utilize the CPU/Memory
> if it needs to.
> 
>    A) On our large clusters this node is a 4 socket, dual core (8 cores
>    total) Opteron running at 2.4Gig with 16Gig of memory.  I don't believe
>    OpenSM needs this much but the nodes were built all the same so this is
>    what it got.
> 
>    B) On one of our smaller clusters (128 nodes) OpenSM is running on a
>    dual socket, single core (2 core) 2.4Gig Opteron nodes with 2Gig of
>    memory.  We have not seen any issues with this cluster and OpenSM.
> 
> We run with the up/down algorithm; ftree has not panned out for us yet.  I
> can't say how that would compare to the Cisco algorithms.

If the cluster topology is a fat-tree, then both the ftree and up/down routing
engines are available. Ftree is a good choice if you need LMC=0 (and if the
topology complies with certain fat-tree rules). For any other tree, or for
LMC>0, up/down should work.
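For reference, the routing engine can be selected with opensm's -R flag
(e.g. "opensm -R ftree") or in the options file. A minimal sketch of the
relevant options-file lines (the file location varies by OFED install, so
check your distribution; option names as in recent OpenSM releases):

```
# opensm options file (location varies by install)
routing_engine ftree   # or: updn
lmc 0                  # ftree requires LMC=0; use updn if you need LMC>0
```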

-- Yevgeny

> In short OpenSM should work just fine on your cluster.
> 
> Hope this helps,
> Ira
> 
> 
> On Tue, 27 May 2008 11:15:14 -0400
> Charles Taylor <taylor at hpc.ufl.edu> wrote:
> 
>> We have a 400 node IB cluster.    We are running an embedded SM in  
>> failover mode on our TS270/Cisco7008 core switches.    Lately we have  
>> been seeing problems with LID assignment when rebooting nodes (see log  
>> messages below).   LID assignment is also taking far too long: on the
>> order of minutes before the ports transition to "ACTIVE".
>>
>> This seems like a bug to us and we are considering switching to  
>> OpenSM  on a host.   I'm wondering about experience with running  
>> OpenSM for medium to large clusters (Fat Tree) and what resources  
>> (memory/cpu) we should plan on for the host node.
>>
>> Thanks,
>>
>> Charlie Taylor
>> UF HPC Center
>>
>> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
>> Rediscover
>> the subnet
>> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
>> OUT_OF_SERVICE
>> trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
>> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An  
>> existing IB
>> node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
>> DELETE_MC_GROUP
>> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]:  
>> Topology
>> changed
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> discovering removed ports
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
>> async events
>> require sweep
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
>> Rediscover
>> the subnet
>> May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no
>> routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]:  
>> Topology
>> changed
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> discovering new ports
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> multicast membership change
>> May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]:  
>> Force port to
>> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]:  
>> Program port
>> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor
>> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
>> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]:  
>> Failed to
>> negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad  
>> status 0x1c
>> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM  
>> IN_SERVICE trap
>> for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
>> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new  
>> IB node
>> 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
>> async events
>> require sweep
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
>> Rediscover
>> the subnet
>> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
>> topology
>> change
>> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> previous GET/SET operation failures
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]:  
>> Reassigning
>> LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr  
>> LID=0
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]:  
>> Force port to
>> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]:  
>> Clean up SA
>> resources for port forced down due to LID conflict, node -
>> GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]:  
>> cleaning DB
>> for guid 00:02:c9:02:00:21:4b:59, lid 194
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]:
>> _ib_smAllocSubnet: initRate= 4
>> May 27 14:18:47 topspin-270sc last message repeated 23 times
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity  
>> links
>> detected in the network
>> May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]:  
>> Active
>> port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16,  
>> state=2,
>> neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
>> async events
>> require sweep
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]:  
>> Rediscover
>> the subnet
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
>> topology
>> change
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node
>> 00:06:6a:00:d9:00:04:5d port 16 is INIT state
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> some ports in INIT state
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> previous GET/SET operation failures
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]:
>> _ib_smAllocSubnet: initRate= 4
>> May 27 14:21:05 topspin-270sc last message repeated 23 times
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity  
>> links
>> detected in the network
>> May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]:  
>> Program port
>> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor
>> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
>> May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM  
>> CREATE_MC_GROUP
>> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
>> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
>> async events
>> require sweep
>> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
>> topology
>> change
>> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> multicast membership change
>> May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid
>> 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM
>> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a  
>> backup session
>> with Standby SM guid 00:05:ad:00:00:02:3c:60
>> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]:  
>> async events
>> require sweep
>> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid
>> 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM
>> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
>> topology
>> change
>> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration  
>> caused by
>> multicast membership change
>> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB  
>> synchronized
>> with Standby SM guid 00:05:ad:00:00:02:3c:60
>> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB  
>> synchronized
>> with all designated backup SMs
>> May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]:
>> **********************  NEW SWEEP ********************
>> May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No  
>> topology
>> change
>>
>> On May 23, 2008, at 2:20 PM, Steve Wise wrote:
>>
>>> Or Gerlitz wrote:
>>>> Steve Wise wrote:
>>>>> Are we sure we need to expose this to the user?
>>>> I believe this is the way to go if we want to let smart ULPs  
>>>> generate new rkey/stag per mapping. Simpler ULPs could then just  
>>>> put the same value for each map associated with the same mr.
>>>>
>>>> Or.
>>>>
>>> How should I add this to the API?
>>>
>>> Perhaps we just document the format of an rkey in the struct ib_mr.   
>>> Thus the app would do this to change the key before posting the  
>>> fast_reg_mr wr (coded to be explicit, not efficient):
>>>
>>> u8 newkey;
>>> u32 newrkey;
>>>
>>> newkey = 0xaa;
>>> newrkey = (mr->rkey & 0xffffff00) | newkey;
>>> mr->rkey = newrkey;
>>> wr.wr.fast_reg.mr = mr;
>>> ...
>>>
>>>
>>> Note, this assumes mr->rkey is in host byte order (I think the linux  
>>> rdma code assumes this in other places too).
>>>
>>>
>>> Steve.
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>



