[ofa-general] opensm routing

Al Chu chu11 at llnl.gov
Thu Jun 12 10:49:44 PDT 2008


Hey Jeff,

> That works. The compute nodes need to talk to other compute nodes for 
> MPI over one set of links, and they need to talk to the Lustre nodes for 
> I/O, but over a different (disjoint) set of links. Thanks.

Is there a strong belief that a different/disjoint set of links would be
beneficial?  Some time ago, Sasha and I iterated on a patch after I
found that sometimes not all switch ports were being used.  In that
particular case, a chunk of leaf switches were sometimes using only 11
out of 12 uplinks.  After the fix, mpigraph showed about a 20% improvement
in MPI bandwidth.

It obviously depends on your cluster/environment/apps/user usage
patterns/etc.  Livermore Lab's usage patterns will probably be different
from yours.

Al

On Thu, 2008-06-12 at 10:11 -0700, Jeff Becker wrote:
> Hi Al
> 
> Al Chu wrote:
> > Hey Jeff,
> >
> > On Wed, 2008-06-11 at 09:43 -0700, Jeff Becker wrote:
> >   
> >> Basically, we have an Altix ICE cluster connected by a pair of hypercube 
> >> Infiniband fabrics. External to that, we have some Lustre nodes 
> >> connected into the cluster with Infiniband. Our goal is to keep Lustre 
> >> traffic separate from compute (MPI) traffic. Ideally, we'd have 2 
> >> subnets and an IB router between the Lustre fabric and the compute 
> >> fabric to accomplish this.
> >>     
> >
> > I see.  In your environment, the lustre storage servers are on the same
> > fabric as your compute nodes?
> >   
> Right.
> >   
> >> Barring that, I thought we could use partitions as follows: compute 
> >> HCAs and switch ports are in both partitions, with full membership in 
> >> the compute partition and limited membership in the I/O partition.  The 
> >> Lustre nodes and switches would be in the I/O partition only (full 
> >> membership). That way, inter-compute-node (MPI) traffic would be 
> >> disallowed from using routes through the I/O fabric (by partition 
> >> membership), and I/O traffic could not interfere with compute traffic 
> >> (via separate partitions). Is this scheme feasible?
> >>
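For illustration only: a rough sketch of how that scheme might be
expressed in OpenSM's partitions.conf.  The pkeys and port GUIDs below
are made up -- you would list your real compute HCA and Lustre server
port GUIDs -- and the default 0x7fff partition still has to exist:

    # default partition, required by OpenSM
    Default=0x7fff, ipoib : ALL=full;

    # compute partition: compute HCAs as full members
    Compute=0x0001 : 0x0002c90300001111=full, 0x0002c90300002222=full;

    # I/O partition: Lustre servers full, compute HCAs limited
    IO=0x0002 : 0x0002c9030000aaaa=full, 0x0002c9030000bbbb=full, 0x0002c90300001111=limited, 0x0002c90300002222=limited;

As the OpenSM_PKey_Mgr.txt text quoted further down notes, this only
sets end-port membership; whether the routing also honors it is exactly
the Phase 2 question.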
> >> If that's not possible, the next idea is to modify OpenSM to assign 
> >> large weights to the links between the compute and I/O fabrics, so that 
> >> the MinHop algorithm would never consider using these links for 
> >> inter-compute node traffic.
> >>     
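As an aside, the large-weight idea is essentially weighted shortest-path
routing.  Below is a toy sketch in plain C -- not OpenSM code; the
five-node graph, the compute/I/O split, and the weights are all
invented -- showing how a big weight on the inter-fabric links keeps
compute-to-compute paths off them while still leaving a (costly) path
to the I/O side:

    /* Toy weighted min-hop illustration -- hypothetical 5-node fabric:
     * nodes 0-2 are "compute", nodes 3-4 are "I/O".  Links that cross
     * from the compute side to the I/O side get a huge weight, so
     * shortest paths between compute nodes never use them.
     * Build: cc -std=c99 -o minhop minhop.c */
    #include <stdio.h>
    #include <limits.h>

    #define N     5
    #define INF   INT_MAX
    #define HEAVY 1000          /* weight on compute<->I/O links */

    static int w[N][N] = {
        /*        0      1      2      3      4   */
        /*0*/ {   0,     1,     1,   HEAVY,  INF  },
        /*1*/ {   1,     0,     1,    INF,  HEAVY },
        /*2*/ {   1,     1,     0,    INF,   INF  },
        /*3*/ { HEAVY,  INF,   INF,    0,     1   },
        /*4*/ {  INF,  HEAVY,  INF,    1,     0   },
    };

    /* Plain Dijkstra over the weighted adjacency matrix. */
    static void shortest_from(int src, int dist[N])
    {
        int done[N] = {0};
        for (int i = 0; i < N; i++)
            dist[i] = (i == src) ? 0 : INF;

        for (int iter = 0; iter < N; iter++) {
            int u = -1;
            for (int i = 0; i < N; i++)
                if (!done[i] && dist[i] != INF && (u < 0 || dist[i] < dist[u]))
                    u = i;
            if (u < 0)
                break;
            done[u] = 1;
            for (int v = 0; v < N; v++)
                if (w[u][v] != INF && dist[u] + w[u][v] < dist[v])
                    dist[v] = dist[u] + w[u][v];
        }
    }

    int main(void)
    {
        int dist[N];
        shortest_from(0, dist);     /* paths from compute node 0 */
        for (int i = 1; i < N; i++)
            printf("node 0 -> node %d: cost %d\n", i, dist[i]);
        /* compute->compute paths (nodes 1, 2) cost 1; only the paths to
         * the I/O nodes (3, 4) pay the HEAVY inter-fabric link. */
        return 0;
    }

Where such a weight would actually be applied inside opensm's MinHop
calculation is the part that would need patching, as Jeff says.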
> >
> > So dedicating (for example) X out of Y uplinks for MPI only and the
> > remaining uplinks for lustre only?
> >   
> That works. The compute nodes need to talk to other compute nodes for 
> MPI over one set of links, and they need to talk to the Lustre nodes for 
> I/O, but over a different (disjoint) set of links. Thanks.
> 
> -jeff
> > Al
> >
> >   
> >> Thoughts? Thanks.
> >>
> >> -jeff
> >>
> >> Al Chu wrote:
> >>     
> >>> Hey Jeff,
> >>>
> >>> Out of my curiosity, are you just trying to change the routing to
> >>> improve job performance?  i.e. lustre nodes get special routing vs.
> >>> compute nodes?
> >>>
> >>> Al
> >>>
> >>> On Tue, 2008-06-10 at 15:08 -0700, Jeff Becker wrote:
> >>>   
> >>>       
> >>>> Hi all. I was looking into doing some subnet partitioning to separate 
> >>>> compute nodes from Lustre nodes, and I saw the following in 
> >>>> ~sashak/management.git on the OFA server, in opensm/doc/OpenSM_PKey_Mgr.txt
> >>>>
> >>>> OpenSM Partition Management
> >>>> ---------------------------
> >>>>
> >>>> Roadmap:
> >>>> Phase 1 - provide partition management at the EndPort (HCA, Router and Switch
> >>>>           Port 0) level with no routing effects.
> >>>> Phase 2 - routing engine should take partitions into account.
> >>>> ...
> >>>> Phase 2 functionality:
> >>>>
> >>>> The partition policy should be considered during the routing such that
> >>>> links are associated with a particular partition or a set of
> >>>> partitions. The policy should be enhanced to provide hints for how to do
> >>>> that (also correlating with QoS). The exact algorithm is TBD.
> >>>>
> >>>>
> >>>> What is the status of Pkey-aware routing? Thanks.
> >>>>
> >>>> -jeff
> >>>>
> 
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



