Re: [ofa-general] OpenSM Problems/Questions

Ira Weiny weiny2 at llnl.gov
Tue Sep 9 12:11:40 PDT 2008


On Tue, 9 Sep 2008 14:35:48 -0400
"Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:

> Hi,
> 
> On Tue, Sep 9, 2008 at 9:01 AM, Matthew Trzyna <trzyna at us.ibm.com> wrote:
> > Hello
> >
> >
> > A "Basic Fabric Diagram" at the end.
> >
> >
> > I am working with a customer who is implementing a large IB fabric and is
> > encountering problems with OpenSM (OFED 1.3) after adding a new 264 node
> > cluster (with its own 288 port IB switch) to their existing cluster. Two
> > more 264 node clusters are planned to be added in the near future. They
> > recently moved to SLES 10 SP1 and OFED 1.3 (before adding the new cluster)
> > and had not been experiencing these problems before.

Are there routing issues?

> >
> > Could you help provide answers to the questions listed below? Additional
> > information about the configuration including a basic fabric diagram are
> > provided after the questions.
> >
> > What parameters should be set on the non-SM nodes that affect how the Subnet
> > Administrator functions?
> > What parameters should be set on the SM node(s) that affect how the Subnet
> > Administrator functions? And, what parameters should be removed from the SM
> > node(s)? (i.e., ib_sa paths_per_dest=0x7f)
> > How should SM failover be set up? How many failover SMs should be
> > configured? (Failover must happen quickly and transparently, or GPFS will
> > die everywhere due to timeouts if it takes too long.)
> 
> What is quickly enough?

What does GPFS do that requires the SM/SA to be constantly available?  Lustre
is pretty stable (IB-wise) once connected.  Our SysAdmins can restart the SM
almost at will without issues.  As an aside, we do not run with a standby SM.
We have not had many instances where OpenSM crashes (probably about 3 times in
3 years).  So I think it is important to find out why GPFS needs the SM/SA and
then make sure that is available.
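
If it helps, a quick way to see how the fabric rides out an SM restart is to
poll the SMInfo record while the restart happens.  A throwaway sketch (sminfo
ships with infiniband-diags; the 5 second interval is arbitrary):

    #!/bin/sh
    # Hedged sketch: watch which SM is master (LID, GUID, priority, state)
    # while OpenSM is restarted, to see how quickly mastership settles.
    while true; do
        date
        sminfo        # queries the SMInfo attribute of the current master SM
        sleep 5       # arbitrary polling interval
    done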

> 
> > Are there SA (Subnet Administrator) commands that should not be executed on
> > a large "live" fabric? (i.e., "saquery -p")
> > Should GPFS be configured "off" on the SM node(s)?
> > Do you know of any other OpenSM implementations that have 5 (or more) 288
> > port IB switches that might have already encountered/resolved some of these
> > issues?
> 
> There are some deployments with multiple large switches deployed.

We have 2 clusters which currently have 4x288 port switches in them.  Plus many
more 24 port "leaf" switches off of those cores.  OpenSM, while not perfect,
does work quite well for us.

> 
> Not sure what you mean by issues; I see questions above.

I am not sure what the questions are either.  Are you having problems with any
particular diag or with OpenSM not running (routing?) correctly?

> 
> > The following problem that is being encountered may also be SA/SM related. A
> > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > A ping from those nodes (NodesA-G) to NodeX returns "Destination Host
> > Unreachable". A ping from NodeX to NodesA-G works.
> 
> Sounds like those nodes were perhaps unable to join the broadcast
> group due to a rate issue.

Hal is correct, and saquery is your friend here.  If you use "genders" and
"whatsup" (https://computing.llnl.gov/linux/downloads.html), I have a series of
tools, "Pragmatic InfiniBand Utilities (PIU)"
(https://computing.llnl.gov/linux/piu.html), which includes a tool called
"ibnodeinmcast" that can help debug this.  It uses saquery [-g|-m] to find the
nodes in the multicast groups.  With the addition of other LLNL tools this can
be boiled down to which nodes "should" be in the group but are not.  You are
welcome to download that package and adapt it to your environment.
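
For anyone without the PIU package, the idea boils down to something like the
sketch below.  This is only a sketch of the approach, not ibnodeinmcast
itself: the grep pattern depends on how your infiniband-diags version formats
GIDs, and the node's port GID is whatever ibaddr/ibstat reports for it.

    #!/bin/sh
    # Hedged sketch: check whether a node's port GID appears in the SA's
    # multicast member records, and dump the group list so the group MTU/rate
    # can be compared against the node's link parameters (Hal's point above).
    NODE_GID="$1"    # port GID of the suspect node, e.g. from "ibaddr" on it

    echo "== multicast groups known to the SA (check MTU/Rate here) =="
    saquery -g

    echo "== member records mentioning ${NODE_GID} =="
    saquery -m | grep -i "${NODE_GID}" \
        || echo "${NODE_GID} is not in any group -- its join probably failed"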

Another cause could be that OpenSM is not routing something correctly.  That
will require some more debugging with dump_lfts.sh and dump_mfts.sh.
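
For what it's worth, the first pass at that usually looks something like the
following.  Again only a sketch: dump_lfts.sh and dump_mfts.sh ship with
infiniband-diags, but the grep pattern depends on the exact dump format of
your version, and the LID is assumed to come from running ibaddr against the
unreachable node.

    #!/bin/sh
    # Hedged sketch: capture the unicast and multicast forwarding tables from
    # every switch and check whether the unreachable node's LID is covered.
    NODE_LID="$1"                    # LID of NodeX, e.g. from "ibaddr"

    dump_lfts.sh > /tmp/lfts.out     # unicast linear forwarding tables
    dump_mfts.sh > /tmp/mfts.out     # multicast forwarding tables

    # Adapt this pattern to the dump format your diags version produces.
    grep -i "${NODE_LID}" /tmp/lfts.out \
        || echo "LID ${NODE_LID} does not appear in any switch LFT"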

Ira

> 
> -- Hal
> 
> > --------------------------------------------------------------------------------------------------
> >
> > System Information
> >
> > Here is the current opensm.conf file: (See attached file: opensm.conf)
> >
> > It is the default configuration from the OFED 1.3 build with "priority"
> > added at the bottom. Note that /etc/init.d/opensmd sources
> > /etc/sysconfig/opensm, not /etc/sysconfig/opensm.conf (opensm.conf was just
> > copied to opensm). There are a couple of "proposed" settings, commented
> > out, that were found on the web.
> >
> > Following are the present settings that may affect the Fabric:
> >
> > /etc/infiniband/openib.conf
> > SET_IPOIB_CM=no
> >
> > /etc/modprobe.conf.local
> > options ib_ipoib send_queue_size=512 recv_queue_size=512
> > options ib_sa paths_per_dest=0x7f
> >
> > /etc/sysctl.conf
> > net.ipv4.neigh.ib0.base_reachable_time = 1200
> > net.ipv4.neigh.default.gc_thresh3 = 3072
> > net.ipv4.neigh.default.gc_thresh2 = 2500
> > net.ipv4.neigh.default.gc_thresh1 = 2048
> >
> > /etc/sysconfig/opensm
> > All defaults as supplied with OFED 1.3 OpenSM
> >
> >
> > -------------------------------------------------------
> >
> >
> >                    Basic Fabric Diagram
> >
> >                     +----------+
> >                     |Top Level |-------------------+ 20 IO nodes
> >   +-----------------| 288 port |----------------+    16 Virtual nodes
> >   |                 |  IB Sw   |------------+   |     2 Admin nodes
> >   |          +------|          |---+        |   |       (SM nodes)
> >   |          |      +----------+   |        |   |     4 Support nodes
> >   |          |          |          |        |   |
> >   |          |          |          |        |   |
> >  24         24         24         24       24  24 <--uplinks
> >   |          |          |          |        |   |
> >   |          |          |          |        |   +------+
> >   |          |          |          |        |          |
> >   |(BASE)    |(SCU1)    |(SCU2)    |(SCU3)  |(SCU4)    |(SCU5)
> > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
> > |288-port| |288-port| |288-port| |288-port| |288-port| |288-port|
> > | IB Sw  | | IB Sw  | | IB Sw  | |  IB Sw | |  IB Sw | |  IB Sw |
> > +--------+ +--------+ +--------+ +--------+ +--------+ +--------+
> > 140-nodes 264-nodes  264-nodes  264-nodes  264-nodes  264-nodes
> > WhiteBox    Dell       Dell       IBM        IBM      IBM (future)
> >
> > NOTE: SCU4 is not currently connected to the Top Level Switch.
> >      We'd like to address these issues before making that connection.
> >
> >      Subnet Managers are configured on nodes connected to the
> >      Top Level Switch.
> >
> > Let me know if you need any more information.
> >
> > Any help you could provide would be most appreciated.
> >
> > Thanks.
> >
> > Matt Trzyna
> > IBM Linux Cluster Enablement
> > 3039 Cornwallis Rd.
> > RTP, NC 27709
> > e-mail: trzyna at us.ibm.com
> > Office: (919) 254-9917 Tie Line: 444
> >
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


