[ofiwg] DS/DA Runtime Model Discussion

Mon Feb 15 11:31:31 PST 2016

In a way, I view the Lustre LND layer as a provider layer (specific code for a specific fabric API) and the LNet layer which is above the LNDs as the network services layer.  Guess it comes down to perspective :^).

As a former user space developer, my example of a network services layer is something like ZeroMQ, which provides a complete end-to-end communications system and handles such things as the threading model when running asynchronously.  If the network stack being used requires a different approach to the runtime model, the ZeroMQ developers deal with that, thereby protecting the applications from having to change.

I guess MPI is the replacement for ZeroMQ in the HPC world.  However, kernel space has nothing like ZeroMQ or MPI that file systems like Lustre or GPFS can use so we have to have layers like Lustre’s LND to do that work for us.  Using OFED/verbs from one of our LNDs was supposed to help protect us from changes in vendor hardware/firmware.  It doesn’t.  Recently, Mellanox changed their firmware from mlx4 to mlx5.  In theory, Lustre should never have cared about that as OFED should be a standard which shields us from such changes (i.e. if a change to the usage model is needed, that should be made to the OFED code base and not what lies above).  I have just spent the last two months firefighting the effects on customers who upgraded one or more IB cards in a cluster from mlx4 to mlx5.

In my perfect dream world, the work our LNDs do would be absorbed by kFabric, and all Lustre would have to do is change LNet to use kFabric directly.  We could then toss away all the LNDs and run on current and future fabrics equally well.
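
To make that layering concrete, here is a minimal sketch, written in Python only for brevity.  Every name in it (Provider, ProviderA, ProviderB, consumer_send_all) is hypothetical and is not part of the libfabric, kFabric, or Lustre APIs.  It assumes the provider layer owns serialization of access to the hardware, so the consumer (the role LNet would play) contains no locking and no per-provider conditions.

```python
import threading
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical provider interface (NOT the libfabric/kFabric API)."""

    def __init__(self):
        self._lock = threading.Lock()  # serialization lives in the provider
        self.sent = []                 # stand-in for a hardware send queue

    def send(self, peer, payload):
        # Common entry point: the provider serializes access to the "hardware".
        with self._lock:
            self._post(peer, payload)

    @abstractmethod
    def _post(self, peer, payload):
        """Fabric-specific work; the only part that differs per provider."""


class ProviderA(Provider):
    def _post(self, peer, payload):
        self.sent.append(("A", peer, payload))


class ProviderB(Provider):
    def _post(self, peer, payload):
        self.sent.append(("B", peer, payload))


def consumer_send_all(provider, peer, messages, threads=4):
    """Consumer code: no locking and no 'if this provider' branches."""
    def worker(chunk):
        for msg in chunk:
            provider.send(peer, msg)

    # Split the messages across several threads to mimic a high thread count
    # firing concurrent messages at a single endpoint.
    chunks = [messages[i::threads] for i in range(threads)]
    workers = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return len(provider.sent)
```

Calling consumer_send_all(ProviderA(), "node1", msgs) and consumer_send_all(ProviderB(), "node1", msgs) runs the same consumer code against both providers; only the _post method differs between them.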

Doug

> On Feb 12, 2016, at 5:08 PM, Paul Grun <grun at cray.com> wrote:
> 
> In general, I agree with your basic assertion...one of the expected values of the OFI project is 'application transportability', meaning that a given consumer of the services offered via the API should be easily ported from one provider to another (assuming that both providers offer equivalent functionality).
> 
> That being said, one of the expectations of the OFI project is that a given provider vendor may target his provider at a particular market and thus may optimize his implementation for that market resulting in a higher quality/higher performing provider, but potentially at higher cost.  None of which negates your basic point.
> 
> One point I do want to raise is the expression 'middleware'.  The convention we've adopted in OFI is to refer to everything above the API as a consumer of network services, and everything below the API as comprising the network stack.  Thus MPI, which is referred to as communications middleware, is a consumer of network services.
> 
> I am looking (in vain, I'm afraid) for my canonical LNET stack diagram, but if memory serves I think of the LND layer, which is written to a particular network API (e.g. o2iblnd), as a consumer and thus roughly equivalent to MPI as middleware.  But I would not think of the provider as being middleware.
> 
> All that aside, to help me better visualize your point, can you give an example of a specific way that an LNET consumer (LND?) would behave that might differ between providers in order to maximize performance?
> 
> Thanks,
> -Paul
> 
> -----Original Message-----
> From: Oucharek, Doug S [mailto:doug.s.oucharek at intel.com] 
> Sent: Friday, February 12, 2016 11:25 AM
> To: Smith, Stan <stan.smith at intel.com>
> Cc: Paul Grun <grun at cray.com>; ofiwg at lists.openfabrics.org
> Subject: Re: DS/DA Runtime Model Discussion
> 
> You can see where I am coming from.  As an application writer using this middleware, if I write my code one way and am able to get good performance from fabric A (provider A), I am expecting to get a consistent performance profile when I start to support fabric B (provider B).  If I have to put a bunch of “if this provider, do this, if that provider, do something different” conditions in my application to get consistent performance out of the fabric, I consider that a failure of the middleware.  The middleware should minimize the changes applications must make to adopt new fabrics, and that needs to include, as much as possible, what is required for best performance.
> 
> I appreciate that the application may need to provide hints, message profiles, etc. to make the job easier.  But good middleware should be a negotiator between the application and the provider so I don’t have to learn all the gritty details of how the provider works just to use it reasonably well.  
> 
> Doug
> 
>> On Feb 12, 2016, at 10:52 AM, Smith, Stan <stan.smith at intel.com> wrote:
>> 
>> [Doug writes] 
>> So, if Lustre creates only one endpoint (QP) to another node and fires a high rate of concurrent messages (high thread count) over that endpoint, will libfabric/kFabric intelligently use CPU cores, IRQ balancing, NUMA, etc.?  Or will it be the responsibility of the application writers to find a way to manipulate the use of endpoints to get the best performance?
>> 
>> 
>> OK - I grok where you are coming from...
>> 
>> Thread & core allocation/scheduling/binding w.r.t. endpoints are all aspects outside the current scope of libfabric/kFabric.
>> 
>> From a libfabric/kFabric provider POV, what would 'intelligently use CPU cores, IRQ balancing, NUMA' actually imply?
>> 
>> The transport layer (aka the libfabric/kFabric provider), existing at a layer below the client, could have a difficult time guessing the thread/core behavior a higher-level client layer would expect.
>> 
>> That said, perhaps the client could provide hints as to the desired/expected behavior which the provider could choose to implement if possible.
>> 
>> Getting this design discussion on the OFIWG things-to-think-about list would be a good 1st step.
>> 
>> Stan.
>> 
>> 
>> 
>>> On Feb 12, 2016, at 8:52 AM, Smith, Stan <stan.smith at intel.com> wrote:
>>> 
>>> Hi Doug,
>>> I may have misled you into believing that clients of libfabric and/or kFabric are responsible for transport locking issues; they are 'not'.
>>> 
>>> Libfabric/kFabric providers 'are' responsible for access serialization to hardware.
>>> 
>>> s.
>>> 
>>> -----Original Message-----
>>> From: ofiwg-bounces at lists.openfabrics.org [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Oucharek, Doug S
>>> Sent: Wednesday, February 10, 2016 3:37 PM
>>> To: Paul Grun <grun at cray.com>
>>> Cc: ofiwg at lists.openfabrics.org
>>> Subject: [ofiwg] DS/DA Runtime Model Discussion
>>> 
>>> This email is a follow-up to my comment in a previous DS/DA call about the runtime model being an important part of the DS/DA definition.
>>> 
>>> MPI seems to be the dominant user of fabrics in HPC.  As such, it has a huge impact on the design of the runtime model being followed by fabric developers and the corresponding middleware (what I consider OFED/verbs, libfabric, and DS/DA).  Currently, the MPI community seems to be pushing for bare metal access from the providers, leaving the work of serialization/locking to the middleware or the applications themselves.
>>> 
>>> If DS/DA follows libfabric in its development, I am concerned that the bare metal mindset will dominate here as well, and that will leave “application anarchy” with regard to how serialization/locking is being done.  Mitigating the strategy of fabric users is something I would expect from the providers (the one common access point regardless of middleware).  The MPI push was to get this common point to back off and leave serialization/locking to the upper layers, but we now do not have a common point to coordinate competing access to the fabric.
>>> 
>>> Should it not be a part of the middleware (libfabric and DS/DA) to, at the very least, put demands upon the providers so that a common strategy for serialization/locking can be enforced for a specific fabric, so the apps, like Lustre, don’t have to make significant code changes to get reasonable performance out of the fabric?  If we have to make significant changes for each new fabric released, the value of the middleware (be it OFED, libfabric, or DS/DA) is severely diminished and we might as well just access the fabric drivers directly.
>>> 
>>> Discussion?  
>>> 
>>> Doug