[ofa-general] OpenSM?
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Tue May 27 14:31:58 PDT 2008
Charles,
Ira Weiny wrote:
> Charles,
>
> Here at LLNL we have been running OpenSM for some time. Thus far we are very
> happy with its performance. Our largest cluster is 1152 nodes, and OpenSM can
> bring it up (not counting boot time) in less than a minute.
OpenSM is successfully running on some large clusters with 4-5K nodes.
It takes about 2-3 minutes to bring up such clusters.
> Here are some details.
>
> We are running v3.1.10 of OpenSM with some minor modifications (mostly patches
> that have been submitted upstream and accepted by Sasha, but are not yet in a
> release).
>
> Our clusters are all Fat-tree topologies.
>
> We have a node which is more or less dedicated to running OpenSM. We have some
> other monitoring software running on it, but OpenSM can utilize the CPU/Memory
> if it needs to.
>
> A) On our large clusters this node is a four-socket, dual-core (8 cores
> total) Opteron running at 2.4 GHz with 16 GB of memory. I don't believe
> OpenSM needs this much, but the nodes were all built the same, so this is
> what it got.
>
> B) On one of our smaller clusters (128 nodes) OpenSM is running on a
> dual-socket, single-core (2 cores total) 2.4 GHz Opteron node with 2 GB of
> memory. We have not seen any issues with this cluster and OpenSM.
>
> We run with the up/down algorithm; ftree has not panned out for us yet. I
> can't say how that would compare to the Cisco algorithms.
If the cluster topology is a fat-tree, then both the ftree and up/down routing
engines apply. Ftree is a good choice if you need LMC=0 (and the topology
complies with certain fat-tree rules). For any other tree, or for LMC>0,
up/down should work.
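For reference, the routing engine and LMC are selected on the OpenSM command
line (a sketch; check your opensm(8) man page for the exact options in your
OpenSM version):

```shell
# Run OpenSM with the fat-tree routing engine; if the topology does not
# comply with the fat-tree rules, OpenSM falls back to default routing.
opensm -R ftree

# Up/down routing with LMC=2 (4 LIDs per port):
opensm -R updn -l 2
```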
-- Yevgeny
> In short OpenSM should work just fine on your cluster.
>
> Hope this helps,
> Ira
>
>
> On Tue, 27 May 2008 11:15:14 -0400
> Charles Taylor <taylor at hpc.ufl.edu> wrote:
>
>> We have a 400 node IB cluster. We are running an embedded SM in
>> failover mode on our TS270/Cisco7008 core switches. Lately we have
>> been seeing problems with LID assignment when rebooting nodes (see log
>> messages below). It is also taking far too long for LIDs to be
>> assigned: on the order of minutes for the ports to transition to
>> "ACTIVE".
>>
>> This seems like a bug to us, and we are considering switching to
>> OpenSM on a host. I'm wondering about experience with running
>> OpenSM for medium to large clusters (fat-tree) and what resources
>> (memory/CPU) we should plan on for the host node.
>>
>> Thanks,
>>
>> Charlie Taylor
>> UF HPC Center
>>
>> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
>> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
>> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An existing IB node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
>> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering removed ports
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
>> May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering new ports
>> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
>> May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
>> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: Failed to negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad status 0x1c
>> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59
>> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new IB node 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
>> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
>> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: Reassigning LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr LID=0
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: Clean up SA resources for port forced down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: cleaning DB for guid 00:02:c9:02:00:21:4b:59, lid 194
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4
>> May 27 14:18:47 topspin-270sc last message repeated 23 times
>> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network
>> May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: Active port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, state=2, neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node 00:06:6a:00:d9:00:04:5d port 16 is INIT state
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by some ports in INIT state
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4
>> May 27 14:21:05 topspin-270sc last message repeated 23 times
>> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network
>> May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2
>> May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59
>> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
>> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
>> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
>> May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM
>> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a backup session with Standby SM guid 00:05:ad:00:00:02:3c:60
>> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep
>> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM
>> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
>> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change
>> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with Standby SM guid 00:05:ad:00:00:02:3c:60
>> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with all designated backup SMs
>> May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ********************
>> May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change
>>
>> On May 23, 2008, at 2:20 PM, Steve Wise wrote:
>>
>>> Or Gerlitz wrote:
>>>> Steve Wise wrote:
>>>>> Are we sure we need to expose this to the user?
>>>> I believe this is the way to go if we want to let smart ULPs
>>>> generate new rkey/stag per mapping. Simpler ULPs could then just
>>>> put the same value for each map associated with the same mr.
>>>>
>>>> Or.
>>>>
>>> How should I add this to the API?
>>>
>>> Perhaps we just document the format of an rkey in the struct ib_mr.
>>> Thus the app would do this to change the key before posting the
>>> fast_reg_mr wr (coded to be explicit, not efficient):
>>>
>>> u8 newkey;
>>> u32 newrkey;
>>>
>>> newkey = 0xaa;
>>> newrkey = (mr->rkey & 0xffffff00) | newkey;
>>> mr->rkey = newrkey;
>>> wr.wr.fast_reg.mr = mr;
>>> ...
>>>
>>>
>>> Note: this assumes mr->rkey is in host byte order (I think the Linux
>>> RDMA code assumes this elsewhere too).
>>>
>>>
>>> Steve.
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>
>