[ofiwg] DS/DA Runtime Model Discussion

Tue Feb 23 07:42:06 PST 2016

Doug,

Nice read.

Do I understand correctly that, in a nutshell, the proposal is
for kfabrics to become semantically richer than the current
kernel verbs or any other kernel network interface, so that it
can abstract (efficiently, we hope) from the semantics of any
underlying fabric? Wouldn't that just bloat the interface
implementation far beyond the point the current kernel verbs
implementation is already at ... or is the idea that code can be
pushed 'further down' below kfabrics and become vendor specific?
Your second figure hints at the second approach.

I am not yet sold on the idea that kernel programmers would
talk to kernel sockets via kfabrics. The kernel socket interface
is very stable and efficient.

GPFS deliberately chose to only use user level RDMA interfaces for
block transfers and metadata operations. So, with that, it does not
depend on kernel RDMA interfaces, and switching from verbs
to libfabrics would just mean the pain of some coding effort w/o any
functional benefit btw.

Best regards,
Bernard.

ofiwg-bounces at lists.openfabrics.org wrote on 02/23/2016 01:55:36 AM:

> From: "Oucharek, Doug S" <doug.s.oucharek at intel.com>
> To: Paul Grun <grun at cray.com>
> Cc: "ofiwg at lists.openfabrics.org" <ofiwg at lists.openfabrics.org>
> Date: 02/23/2016 01:55 AM
> Subject: Re: [ofiwg] DS/DA Runtime Model Discussion
> Sent by: ofiwg-bounces at lists.openfabrics.org
>
> Not really sure I had a question, per se, but rather pointing out an
> opportunity for kFabrics to make life easier for the guys, like me,
> to support applications like Lustre.  And, in doing so, benefit the
> vendors.
>
> Attached are two diagrams I whipped up.  The first one,
> lnetToday.png, shows the current state of affairs for LNet/LNDs.
> The second, lnetFuture.png, is the dream world where kFabrics takes
> on more of what the apps in kernel space have to do today.  I’ll explain…
>
> As you can see from the colours in the first diagram, a vendor of
> new fabrics has to provide the low-level driver and a provider (if
> wanting to use OFED as API).  If they don’t want to use OFED, they
> also have to create an LND for Lustre like Cray did for GNI (see
> diagram).  The Lustre community will NOT do this for you.  It is
> expected that a vendor hire/train staff to ensure Lustre runs well
> on their hardware (both new and updated).  Or, they can pay a Lustre
> support group like mine to do it for them.  Either way, it is an
> expensive proposition.
>
> Now, the assumption that just writing a provider for OFED will save
> them from having to do any Lustre-specific work is not necessarily
> true.  Depending on how well the OFED interface works for their
> specific design, some adaptations in the LND may still need to be
> done.  That, again, is not something the Lustre community is going
> to do.  The vendor is responsible for that.
>
> To the point: if we just swap kFabrics for OFED in the first
> diagram, that will not change the responsibility of who does what.
> There will be a new LND that sits on top of kFabrics and the Lustre
> community will “eventually” come to support it.  But any custom
> changes to that LND for different vendor hardware are the
> responsibility of the vendor.  The best thing kFabrics can do to
> protect all the vendors from having to spend time and money on LND
> optimizations is to isolate, as much as possible, the different
> providers and their performance characteristics from the LND layer.
> If this is done well, we can evolve to the second diagram where the
> LND layer disappears (or is adopted by kFabrics) and all the Lustre
> community has to do is maintain LNet and its use of kFabrics.  That
> should also make the vendors’ lives easier if/when they choose to
> support Lustre.
>
> I suspect that GPFS and NVMe will find themselves in a similar boat.
> Solving this in kFabrics solves it for all of us rather than having to
> tackle it on a per application basis.  User space is not having the
> same issues with libfabrics because there are a bunch of other
> layers in the networking onion taking a role in smoothing the road
> over.  We don’t have those layers in kernel space.
>
> Just a point about LNDs: the o2iblnd (interface to OFED) is over
> 6,000 lines of some very complex code.  The gnilnd is significantly
> bigger.  Writing and testing a new LND is a very significant effort
> and is not something any vendor should expect to do in under 6
> months.  The more of this kFabrics takes on, the more it will save
> vendors and app developers alike.
>
Those 6000 lines of code plus gnilnd code would have
to be moved to kfabrics to abstract from different fabrics,
which would make the kernel interface implementation rather bulky...?

> Doug
>
>
> On Feb 18, 2016, at 4:47 PM, Paul Grun <grun at cray.com> wrote:
>
> Meanwhile, are we anywhere near to addressing your original
> question??  I think we may have wandered afield...
>
> -----Original Message-----
> From: Paul Grun
> Sent: Thursday, February 18, 2016 4:47 PM
> To: 'Oucharek, Doug S' <doug.s.oucharek at intel.com>
> Cc: Smith, Stan <stan.smith at intel.com>; ofiwg at lists.openfabrics.org
> Subject: RE: DS/DA Runtime Model Discussion
>
> Let's keep this useful discussion alive...
>
> I can see your point that LND looks to the LNET layer like a
> provider, insomuch as it insulates the network layer (LNET) from the
> vagaries of a specific network.  My understanding of the details of
> the layers is rusty, at best, but as far as I know there are two
> main LND layers available today - one for RDMA networks, and one for
> non-RDMA networks, like TCP/IP.  The former is o2iblnd, and I can't
> remember the name for the latter.  My assumption is (please correct
> me otherwise) that o2iblnd only runs over a single network, that
> being IB (including its RoCE variant).
>
> So I have a couple of thoughts:
> 1. Does it make sense to write the existing o2iblnd layer to the
> (proposed) kfabric API?  Keep in mind that IB is one of the networks
> supported by kfabric via a verbs provider layer.  Doing this would
> truly insulate the LNET layer from substitutions of the underlying
> network.  It would also place LND squarely in the realm of being a
> consumer of network services.  Or...
> 2. Does it make sense to write a new LND which is natively coded to
> kfabric, leaving us with (at least) three possible LND layers?
> I'd love to hear a discussion about this.
>
> As far as equating LND with MPI, I would describe MPI as
> communications middleware - it provides a communication service
> which I would equate with LNET, not with LND.  Obviously the
> analogies are far from perfect.  As you point out, in today's world,
> the kernel treats LND as a network service provider.  I guess my
> suggestion is that we try to push it up the stack slightly and lump
> it together with LNET as the communications service.
>
> Your thoughts?
> -Paul
>
> -----Original Message-----
> From: Oucharek, Doug S [mailto:doug.s.oucharek at intel.com]
> Sent: Monday, February 15, 2016 11:32 AM
> To: Paul Grun <grun at cray.com>
> Cc: Smith, Stan <stan.smith at intel.com>; ofiwg at lists.openfabrics.org
> Subject: Re: DS/DA Runtime Model Discussion
>
> In a way, I view the Lustre LND layer as a provider layer (specific
> code for a specific fabric API) and the LNet layer which is above
> the LNDs as the network services layer.  Guess it comes down to
> perspective :^).
>
> As a former user space developer, I view an example of a network
> services layer something like ZeroMQ which provides a complete end-
> to-end communications system which handles such things as the
> threading model when running asynchronously.  If the network stack
> being used requires a different approach to the runtime model, the
> ZeroMQ developers deal with that thereby protecting the applications
> from having to change.
>
> I guess MPI is the replacement for ZeroMQ in the HPC world.
> However, kernel space has nothing like ZeroMQ or MPI that file
> systems like Lustre or GPFS can use so we have to have layers like
> Lustre’s LND to do that work for us.  Using OFED/verbs from one of
> our LNDs was supposed to help protect us from changes in vendor
> hardware/firmware.  It doesn’t.  Recently, Mellanox changed their
> firmware from mlx4 to mlx5.  In theory, Lustre should never have
> cared about that as OFED should be a standard which shields us from
> such changes (i.e. if a change to the usage model is needed, that
> should be made to the OFED code base and not what lies above).  I
> have just spent the last two months firefighting the effects on
> customers who upgraded one or more IB cards in a cluster from mlx4 to
> mlx5.
>
> In a perfect dream world I have, the work our LNDs do would be
> absorbed by kFabrics and all Lustre will have to do is change LNet
> to directly use kFabrics and we can toss away all the LNDs and be
> good to run on current and future fabrics equally well.
>
> Doug
>
> On Feb 12, 2016, at 5:08 PM, Paul Grun <grun at cray.com> wrote:
>
> In general, I agree with your basic assertion...one of the expected
> values of the OFI project is 'application transportability', meaning
> that a given consumer of the services offered via the API should be
> easily ported from one provider to another (assuming that both
> providers offer equivalent functionality).
>
> That being said, one of the expectations of the OFI project is that
> a given provider vendor may target his provider at a particular
> market and thus may optimize his implementation for that market
> resulting in a higher quality/higher performing provider, but
> potentially at higher cost.  None of which negates your basic point.
>
> One point I do want to raise is the expression 'middleware'.  The
> convention we've adopted in OFI is to refer to everything above the
> API as a consumer of network services, and everything below the API
> as comprising the network stack.  Thus MPI, which is referred to as
> communications middleware, is a consumer of network services.
>
> I am looking (in vain, I'm afraid) for my canonical LNET stack
> diagram, but if memory serves I think of the LND layer, which is
> written to a particular network API (e.g. o2iblnd), as a consumer
> and thus roughly equivalent to MPI as middleware.  But I would not
> think of the provider as being middleware.
>
> All that aside, to help me better visualize your point, can you give
> an example of a specific way that an LNET consumer (LND?) would
> behave that might differ between providers in order to maximize
> performance?
>
> Thanks,
> -Paul
>
> -----Original Message-----
> From: Oucharek, Doug S [mailto:doug.s.oucharek at intel.com]
> Sent: Friday, February 12, 2016 11:25 AM
> To: Smith, Stan <stan.smith at intel.com>
> Cc: Paul Grun <grun at cray.com>; ofiwg at lists.openfabrics.org
> Subject: Re: DS/DA Runtime Model Discussion
>
> You can see where I am coming from.  As an application writer using
> this middleware, if I write my code one way and am able to get good
> performance from fabric A (provider A), I am expecting to get a
> consistent performance profile when I start to support fabric B
> (provider B).  If I have to put a bunch of “if this provider, do
> this, if that provider, do something different” conditions in my
> application to get consistent performance out of the fabric, I
> consider that a fail of the middleware.  The middleware should
> minimize the changes the applications do to adopt new fabrics and
> that needs to include, as much as possible, what is required for
> best performance.
>
> I appreciate that the application may need to provide hints, message
> profiles, etc. to make the job easier.  But good middleware should
> be a negotiator between the application and the provider so I don’t
> have to learn all the gritty details of how the provider works just
> to use it reasonably well.
>
> Doug
>
> On Feb 12, 2016, at 10:52 AM, Smith, Stan <stan.smith at intel.com> wrote:
>
> [Doug writes]
> So, if Lustre creates only one endpoint (QP) to another node and
> fires a high rate of concurrent messages (high thread count) over
> that endpoint, will libfabrics/kFabrics intelligently use CPU cores,
> IRQ balancing, NUMA, etc?  Or will it be the responsibility of the
> application writers to find a way to manipulate the use of endpoints
> to get the best performance?
>
>
> OK - I grok where you are coming from...
>
> Thread & core allocation/scheduling/binding w.r.t. endpoints are all
> aspects outside the current scope of libfabric/kFabric today.
>
> From a libfabric/kFabric provider POV what would 'intelligently use
> CPU cores, IRQ balancing, NUMA'  actually imply?
>
> The transport layer (aka libfabric/kFabric provider) existing at a
> layer below the client, could have a difficult time guessing at the
> expected thread/core behavior a higher level client layer would expect.
>
> That said, perhaps the client could provide hints as to the desired/
> expected behavior which the provider could choose to implement if
> possible.
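
The hint idea in the paragraph above could be sketched roughly as
follows. This is only an illustration, not libfabric or kFabric code:
EndpointHints, HintedProvider, open_endpoint and the fields
expected_threads and numa_node are all hypothetical names invented for
the example. The point it tries to show is that hints stay advisory:
the provider honours what its hardware supports and quietly ignores
the rest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndpointHints:
    """Hypothetical, advisory hints a client might pass to a provider."""
    expected_threads: int = 1          # concurrent posting threads the client expects
    numa_node: Optional[int] = None    # preferred NUMA node, if the client has one


@dataclass(frozen=True)
class HintedProvider:
    """Toy provider that honours only the hints its 'hardware' supports."""
    max_queues: int                          # assumed hardware queue limit
    numa_nodes: frozenset = frozenset({0})   # assumed NUMA nodes the device can serve

    def open_endpoint(self, hints: EndpointHints) -> dict:
        # Hints never cause a failure: clamp the queue count to what exists.
        queues = max(1, min(hints.expected_threads, self.max_queues))
        # An unknown or absent NUMA preference is simply ignored.
        node = hints.numa_node if hints.numa_node in self.numa_nodes else None
        return {"queues": queues, "numa_node": node}
```

A client asking for 16 threads on a 4-queue device would get 4 queues
back and could carry on unchanged, which is the "choose to implement
if possible" behaviour described above.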
>
> Getting this design discussion on the OFIWG things-to-think-about
> list would be a good 1st step.
>
> Stan.
>
>
>
> On Feb 12, 2016, at 8:52 AM, Smith, Stan <stan.smith at intel.com> wrote:
>
> Hi Doug,
> I may have misled you into believing that clients of libfabric and/or
> kFabric are responsible for transport locking issues; they are 'not'.
>
> Libfabric/kFabric providers 'are' responsible for access
> serialization to hardware.
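
As a rough illustration of that division of responsibility (a sketch
under stated assumptions, not how any real provider is written):
SerializingProvider, post_send, posted and run_clients below are
hypothetical names, and a plain Python list stands in for a hardware
work queue. The lock lives inside the provider, so the client threads
post concurrently without any locking of their own.

```python
import threading


class SerializingProvider:
    """Toy provider that serializes access to its (simulated) hardware queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hw_queue = []   # stands in for a hardware work queue

    def post_send(self, msg):
        # The provider, not the caller, guards the queue. CPython's GIL would
        # hide a race on a bare list append; the lock marks where a real
        # provider would protect its queue state.
        with self._lock:
            self._hw_queue.append(msg)

    def posted(self):
        # Return a snapshot so callers never touch the live queue.
        with self._lock:
            return list(self._hw_queue)


def run_clients(provider, n_threads, per_thread):
    """Start n_threads clients that each post per_thread messages, lock-free."""
    def client(tid):
        for seq in range(per_thread):
            provider.post_send((tid, seq))

    threads = [threading.Thread(target=client, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return provider.posted()
```

Calling run_clients(SerializingProvider(), 8, 500) posts every message
exactly once, with each thread's messages in the order that thread
sent them.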
>
> s.
>
> -----Original Message-----
> From: ofiwg-bounces at lists.openfabrics.org
> [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Oucharek, Doug S
> Sent: Wednesday, February 10, 2016 3:37 PM
> To: Paul Grun <grun at cray.com>
> Cc: ofiwg at lists.openfabrics.org
> Subject: [ofiwg] DS/DA Runtime Model Discussion
>
> This email is a followup to my comment in a previous DS/DA call
> about the runtime model being an important part of the DS/DA definition.
>
> MPI seems to be the dominant user of fabrics in HPC.  As such, they
> have a huge impact on the design of the runtime model being followed
> by fabric developers and corresponding middleware (what I consider
> OFED/verbs, libfabrics, and DS/DA).  Currently, they seem to be
> pushing for bare metal access from the providers leaving the work of
> serialization/locking to the middleware or the applications themselves.
>
> If DS/DA follows libfabrics in its development, I am concerned that
> the bare metal mindset will dominate here as well and that will
> leave “application anarchy” with regard to how serialization/
> locking is being done.  Mitigating the strategy of fabric users is
> something I would expect from the providers (the one common access
> point regardless of middleware).  The MPI push was to get this
> common point to back off and leave serialization/locking to the
> upper layers but we now do not have a common point to coordinate
> competing access to the fabric.
>
> Should it not be a part of the middleware (libfabrics and DS/DA) to,
> at the very least, put demands upon the providers so a common
> strategy for serialization/locking can be enforced for a specific
> fabric so the apps, like Lustre, don’t have to make significant code
> changes to get reasonable performance out of the fabric?  If we have
> to make significant changes for each new fabric released, the value
> of the middleware (be it OFED, libfabrics, or DS/DA) is severely
> diminished and we might as well just access the fabric drivers directly.
>
> Discussion?
>
> Doug
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg
>
>
>
>
> [attachment "lnetToday.png" deleted by Bernard Metzler/Zurich/IBM]
> [attachment "lnetFuture.png" deleted by Bernard Metzler/Zurich/IBM]