[ofa-general] opensm routing

Jeff Becker Jeffrey.C.Becker at nasa.gov
Thu Jun 12 11:14:43 PDT 2008


Hi Al

Al Chu wrote:
> Hey Jeff,
>
>   
>> That works. The compute nodes need to talk to other compute nodes for 
>> MPI over one set of links, and they need to talk to the Lustre nodes for 
>> I/O, but over a different (disjoint) set of links. Thanks.
>>     
>
> Is there a strong belief that a different/disjoint set of links would be
> beneficial?  Some time ago, Sasha and I iterated on a patch in which I
> found that sometimes not all switch ports would be used.  In this
> particular case, a chunk of leaf switches were sometimes using only 11
> out of 12 uplinks.  After the fix, mpigraph showed about 20% improvement
> in MPI bandwidth.
>   
Basically, we want to avoid situations where I/O and MPI contend for the 
same links, and get in each other's way.

-jeff
> It obviously depends on your cluster/environment/apps/user usage
> pattern/etc.  Livermore Lab's usage patterns will probably be different.
>
> Al
>
> On Thu, 2008-06-12 at 10:11 -0700, Jeff Becker wrote:
>   
>> Hi Al
>>
>> Al Chu wrote:
>>     
>>> Hey Jeff,
>>>
>>> On Wed, 2008-06-11 at 09:43 -0700, Jeff Becker wrote:
>>>   
>>>       
>>>> Basically, we have an Altix ICE cluster connected by a pair of hypercube 
>>>> InfiniBand fabrics. External to that, we have some Lustre nodes 
>>>> connected into the cluster with InfiniBand. Our goal is to keep Lustre 
>>>> traffic separate from compute (MPI) traffic. Ideally, we'd have 2 
>>>> subnets and an IB router between the Lustre fabric and the compute 
>>>> fabric to accomplish this.
>>>>     
>>>>         
>>> I see.  In your environment, the Lustre storage servers are on the same
>>> fabric as your compute nodes?
>>>   
>>>       
>> Right.
>>     
>>>   
>>>       
>>>> Barring that, I thought we could use partitions as follows: compute 
>>>> HCAs and switch ports are in both partitions, with full membership in 
>>>> the compute partition and limited membership in the I/O partition. The 
>>>> Lustre nodes and switches would only be in the I/O partition (full 
>>>> membership). That way, inter compute node (MPI) traffic would be 
>>>> disallowed from using routes through the I/O fabric (by partition 
>>>> membership), and I/O traffic could not interfere with compute (via 
>>>> separate partitions). Is this scheme feasible?
>>>>
>>>> If that's not possible, the next idea is to modify OpenSM to assign 
>>>> large weights to the links between the compute and I/O fabrics, so that 
>>>> the MinHop algorithm would never consider using these links for 
>>>> inter-compute node traffic.
>>>>     
>>>>         
>>> So dedicating (for example) X out of Y uplinks for MPI only and the
>>> remaining uplinks for Lustre only?
>>>   
>>>       
>> That works. The compute nodes need to talk to other compute nodes for 
>> MPI over one set of links, and they need to talk to the Lustre nodes for 
>> I/O, but over a different (disjoint) set of links. Thanks.
>>
>> -jeff
>>     
>>> Al
>>>
>>>   
>>>       
>>>> Thoughts? Thanks.
>>>>
>>>> -jeff
>>>>
>>>> Al Chu wrote:
>>>>     
>>>>         
>>>>> Hey Jeff,
>>>>>
>>>>> Out of curiosity, are you just trying to change the routing to
>>>>> improve job performance?  i.e. Lustre nodes get special routing vs.
>>>>> compute nodes?
>>>>>
>>>>> Al
>>>>>
>>>>> On Tue, 2008-06-10 at 15:08 -0700, Jeff Becker wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Hi all. I was looking into doing some subnet partitioning to separate 
>>>>>> compute nodes from Lustre nodes, and I saw the following in 
>>>>>> ~sashak/management.git on the OFA server, in opensm/doc/OpenSM_PKey_Mgr.txt
>>>>>>
>>>>>> OpenSM Partition Management
>>>>>> ---------------------------
>>>>>>
>>>>>> Roadmap:
>>>>>> Phase 1 - provide partition management at the EndPort (HCA, Router and Switch
>>>>>>           Port 0) level with no routing effects.
>>>>>> Phase 2 - routing engine should take partitions into account.
>>>>>> ...
>>>>>> Phase 2 functionality:
>>>>>>
>>>>>> The partition policy should be considered during the routing such that
>>>>>> links are associated with a particular partition or a set of
>>>>>> partitions. Policy should be enhanced to provide hints for how to do
>>>>>> that (correlating to QoS too). The exact algorithm is TBD.
>>>>>>
>>>>>>
>>>>>> What is the status of Pkey-aware routing? Thanks.
>>>>>>
>>>>>> -jeff
>>>>>>

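The partition scheme quoted above (compute HCAs as full members of a compute partition and limited members of an I/O partition, Lustre nodes as full members of the I/O partition) relies on the InfiniBand rule that two limited members of the same partition cannot talk to each other. The sketch below models only that membership check. The partition numbers and node tables are hypothetical, and the helper names are made up for illustration; this is not OpenSM code. It also says nothing about which links traffic takes, since, per the roadmap quoted above, Phase 1 partition management has no routing effects.

```python
FULL_BIT = 0x8000   # top bit of the 16-bit P_Key marks full membership
BASE_MASK = 0x7FFF  # low 15 bits identify the partition


def pkey(base, full):
    """Build a 16-bit P_Key from a 15-bit base and a membership type."""
    return (base & BASE_MASK) | (FULL_BIT if full else 0)


def usable_partitions(table_a, table_b):
    """Return the partition bases over which two ports may communicate.

    A partition is usable when both ports carry its base key and at
    least one of the two is a full member (limited-limited is refused).
    """
    usable = set()
    for a in table_a:
        for b in table_b:
            same_partition = (a & BASE_MASK) == (b & BASE_MASK)
            one_is_full = bool((a | b) & FULL_BIT)
            if same_partition and one_is_full:
                usable.add(a & BASE_MASK)
    return usable


# Hypothetical partition numbers, chosen only for this example.
COMPUTE, IO = 0x0001, 0x0002

# P_Key tables following the scheme described in the thread.
compute_node = [pkey(COMPUTE, full=True), pkey(IO, full=False)]
lustre_node = [pkey(IO, full=True)]

print(usable_partitions(compute_node, compute_node))  # compute partition only
print(usable_partitions(compute_node, lustre_node))   # I/O partition only
```

Under these assumptions, compute-to-compute traffic is only admitted in the compute partition, and compute-to-Lustre traffic only in the I/O partition, which is the separation the scheme is after at the membership level.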

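The link-weighting idea quoted above (large weights on the links between the compute and I/O fabrics so that inter-compute traffic never chooses them) can be illustrated with a generic weighted shortest-path search. This is a minimal sketch using Dijkstra's algorithm on a made-up five-switch topology; the switch names, the weight of 100, and the function names are all assumptions, and it is not how OpenSM's MinHop routing is actually implemented.

```python
import heapq


def build_links(cross_weight):
    """Hypothetical topology: a chain of compute switches c1..c4 plus one
    I/O switch io1 that is cabled to both ends of the chain."""
    return {
        ("c1", "c2"): 1,
        ("c2", "c3"): 1,
        ("c3", "c4"): 1,
        ("c1", "io1"): cross_weight,  # compute <-> I/O fabric link
        ("c4", "io1"): cross_weight,  # compute <-> I/O fabric link
    }


def shortest_path(links, src, dst):
    """Dijkstra over undirected weighted links; returns the node list or None."""
    graph = {}
    for (a, b), weight in links.items():
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))

    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))

    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# With equal weights the two-hop shortcut through io1 wins.
print(shortest_path(build_links(1), "c1", "c4"))
# With heavy cross-fabric links, compute traffic stays on the compute chain.
print(shortest_path(build_links(100), "c1", "c4"))
# The I/O switch is still reachable for Lustre traffic.
print(shortest_path(build_links(100), "c1", "io1"))
```

The heavy weight only steers compute-to-compute paths away from the cross-fabric links; paths whose destination sits behind those links still use them, which matches the intent of keeping the two kinds of traffic on disjoint links.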