[OFIWG-MPI] notes from MPI community feedback meeting today

Jeff Squyres (jsquyres) jsquyres at cisco.com
Fri Aug 8 09:09:27 PDT 2014


Thanks to all who attended.  I'd like to reiterate the closing thought: if you haven't looked at the libfabric APIs, please do so.  *Now is the time to make comments / provide feedback!*

Here are the notes I took from today:

MPI community feedback on libfabric
8 Aug 2014

Attendees

Jeff Squyres
Jeff Hammond
Sean Hefty
Paul Grun
Ryan Grant
Ron Brightwell
Alan Jea
Antonio Pena
Howard Pritchard
Ken Raffenetti
Michael Blocksome
Nysal Jan
Rolf vandeVaart
Sameh Sharkawi
Su Huang

Howard: there's enough in the prototype to do some playing around.
Almost enough there for shmem (need some atomics) for verbs.
Providers need to be aware that the other provider code in there is
not sufficient -- e.g., PSM.  E.g., from work at LANL, when you sit
PSM inside libfabric, you won't necessarily have "smooth sailing".
Calls to mind the want-everything vs. want-minimal tension in the MPI
community.

Jeff S: Can you explain?

Howard: e.g., fault tolerance.  Smooth sailing using verbs because
they're low level.  With PSM, you have no capability other than reset
the whole job.

JeffH: are you saying verbs is connection, PSM is connectionless.

Howard: no.

JeffH: are you saying PSM is just like sfi/libfabric?  Are you saying
PSM is bad?

Howard: no.  Just raising a warning signal -- the danger of vendors
not providing some functionality from the low level (e.g., PSM hides
a lot).  Although I do understand that this is a reference
implementation, is not complete, etc.

I don't see any gaping holes in the API, I just worry about what's
going on under the covers.

JeffH: PSM is just a reference implementation.  The goal here is not
to implement a new API on top of an old API, but rather to make a new
API to replace the old API.  So what we have today with PSM is just
temporary.

Howard: fair enough, but I guess with the SFI PSM provider, all you
have is matched sending/receiving.

Sean: SFI is an attempt to define these APIs, but the implementation
is coming along slowly.  The verbs provider doesn't support many of
the SFI operations either.  There is a plan for a reference
implementation that can expose any of the SFI APIs over sockets.  It
won't give the perf, but the functionality will be there.

Howard: the verbs one seems quite complete; I wish I had time to
complete it.  It all seems fine.  My concern with the PSM provider
is: where is that one going?  I think the SFI API is converging well.
I think we'll soon have something we can kick the tires on, and verbs
seems closest so far.

Paul: is your concern a problem with PSM a fundamental problem with
PSM, or is it something that can be fixed?

Howard: Hard to answer.  If the PSM provider is just showing RDM and
tag matching, great.  But we know that that isn't sufficient for MPI
-- so what's the point?

Sean: I think this is just a gap -- are these APIs good?  Do we bother
to implement them?  There are plans to add more APIs (for PSM, I think
Sean means); we're just adding implementation as we go along.

Howard: this actually makes me happy.

Sean: we're just trying to figure out the target that we're shooting
for.

JeffS: do you mean the PSM provider / PSM API?

Sean: we're working on providing more of the SFI APIs for PSM.  And
we're still looking at whether we have the right SFI APIs.  We're not
trying to expand PSM as an open source standard for these interfaces.
We're trying to run over Intel's IB hardware that can fit under SFI
APIs.

Jeff: ok, enough PSM.  What about from others?

MPICH guys: we haven't had a chance to look at this stuff deeply yet.

RonB: don't have an opinion yet.

RyanG: so far, this looks quite reasonable.  We could make it
compatible with Portals.  We could make a portals provider.  The
benefit is that you don't write directly to portals.  There are enough
APIs in SFI that you could get good portals support.  Is the goal to
make a uniform libfabric MPI?  But we need to investigate more.

Jeff: would you consider making a hardware provider that would be a
peer to portals?

RyanG: we are not developing hardware that will expose SFI
interfaces.  In that context, portals is a provider.  E.g., you could
make portals-compatible hw.

Jeff: so why bother?  If you're always targeting portals.  We don't
have any goal of making One MPI that sits on libfabric.

RyanG: we just want to be compatible.

JeffH: question -- I use RMA, but I don't use shmem much, just
accumulate.  Accumulate will be an active message.  I haven't figured
out how I would implement accumulate on SFI.

Sean: probably couldn't easily answer.  Can you describe the problem
better?

JeffH: accumulate must ship over a buffer and then do an accumulate
onto an existing buffer (e.g., floating point).  The best
implementation on IBM systems/Blue Gene does a rendezvous with a
header that throws an interrupt, which drives an active message,
which then fires an active message read, does the accumulate,
...etc. (loop over for pipelining).

Sean: atomics?

JeffH: no, basically an RPC, but I would like it to not require
polling.

Sean: there are no active message APIs at this point.  We've kept that
open as something to look at, but nothing there yet.

JeffH: ok, any thought on how one does an AM API on top of SFI?

Sean: SFI does define atomics from the MPI spec -- are these not
sufficient?

JeffH: but if I want to do an 8MB accumulate of doubles, I can do that
with the atomics API?

Sean: i.e., over 2M doubles?  Yes.

JeffH: 2M function calls, or 1 function call?

Sean: depends on what the provider reports upwards.  E.g., provider
can report that it can do groups of 100.
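
(Aside, not from the meeting: a minimal sketch of what Sean's "groups
of N" answer could look like from the MPI side -- query how many
doubles the provider can take per atomic call, then chunk a large
accumulate.  The endpoint ep, target address dest_addr, remote
offset/key, and completion handling are assumed to exist elsewhere;
call names follow the libfabric headers and may differ from the draft
being discussed.)

  #include <rdma/fabric.h>
  #include <rdma/fi_atomic.h>

  /* Sum `count` doubles into a remote buffer, chunked to the
   * provider's per-call limit for FI_DOUBLE / FI_SUM. */
  static int accumulate_doubles(struct fid_ep *ep, const double *buf,
                                size_t count, fi_addr_t dest_addr,
                                uint64_t raddr, uint64_t rkey)
  {
      size_t max_per_call = 0;

      /* Provider reports how many elements one atomic call may carry. */
      if (fi_atomicvalid(ep, FI_DOUBLE, FI_SUM, &max_per_call) ||
          max_per_call == 0)
          return -1;            /* not supported natively; fall back */

      for (size_t done = 0; done < count; done += max_per_call) {
          size_t n = count - done;
          if (n > max_per_call)
              n = max_per_call;

          ssize_t rc = fi_atomic(ep, buf + done, n, NULL /* desc */,
                                 dest_addr,
                                 raddr + done * sizeof(double), rkey,
                                 FI_DOUBLE, FI_SUM, NULL /* context */);
          if (rc)
              return (int) rc;
          /* Completion / flow control handling omitted for brevity. */
      }
      return 0;
  }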

JeffH: is long double complex supported?

Sean: through the API, yes, but depends on the provider.

Howard: ...missed.

Sean: shows atomics APIs and enums.  Types, operations, and function
pointers.
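
(For reference, the kinds of definitions being shown look roughly
like this in the libfabric headers -- abbreviated here; exact names
and placement may have shifted since this draft:)

  /* Atomic datatypes and operations (element counts per call are
   * negotiated separately, e.g. via fi_atomicvalid()). */
  enum fi_datatype {
      FI_INT8, FI_UINT8, FI_INT16, FI_UINT16,
      FI_INT32, FI_UINT32, FI_INT64, FI_UINT64,
      FI_FLOAT, FI_DOUBLE, FI_FLOAT_COMPLEX, FI_DOUBLE_COMPLEX,
      FI_LONG_DOUBLE, FI_LONG_DOUBLE_COMPLEX,
      /* ... */
  };

  enum fi_op {
      FI_MIN, FI_MAX, FI_SUM, FI_PROD,
      FI_LOR, FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR,
      FI_ATOMIC_READ, FI_ATOMIC_WRITE,
      FI_CSWAP, FI_CSWAP_NE, FI_CSWAP_LE, FI_CSWAP_LT,
      FI_CSWAP_GE, FI_CSWAP_GT, FI_MSWAP,
      /* ... */
  };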

Howard: large one-sided / atomics are one use case, but there's also
another use case of lots of small atomics -- it might be more
efficient to use the host processor (vs. the network/NIC).  It might
be quite difficult for a lot of hardware to do native atomics.  It
might be better to do active message support so that more hardware
can support it, and also provide the other model (e.g., lots of small
atomics).
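
(Aside, not from the meeting: since there is no active message API
yet, the host-processor path Howard describes would have to be
emulated today, e.g. over tagged messages.  A rough sketch of the
target side follows; ACC_TAG, the chunk size, and the surrounding
setup are all made up for illustration, and note that it polls --
exactly the property JeffH wants to avoid.)

  #include <rdma/fabric.h>
  #include <rdma/fi_tagged.h>
  #include <rdma/fi_eq.h>

  #define ACC_TAG   0x42ULL       /* hypothetical protocol tag */
  #define ACC_CHUNK 4096          /* doubles per incoming chunk */

  /* Target-side loop: receive a chunk of doubles and add it into the
   * local window -- a software "accumulate" in place of a hardware
   * atomic or a real active message. */
  static void accumulate_target_loop(struct fid_ep *ep,
                                     struct fid_cq *cq, double *window)
  {
      static double chunk[ACC_CHUNK];
      struct fi_cq_tagged_entry comp;

      fi_trecv(ep, chunk, sizeof(chunk), NULL, FI_ADDR_UNSPEC,
               ACC_TAG, 0 /* ignore bits */, NULL);

      for (;;) {
          /* Polling here is the drawback raised above; an AM/interrupt
           * mechanism would let the provider drive this instead. */
          ssize_t n = fi_cq_read(cq, &comp, 1);
          if (n < 1)
              continue;

          size_t ndoubles = comp.len / sizeof(double);
          for (size_t i = 0; i < ndoubles; i++)
              window[i] += chunk[i];   /* offset handling omitted */

          /* Repost for the next chunk (reply/ack omitted). */
          fi_trecv(ep, chunk, sizeof(chunk), NULL, FI_ADDR_UNSPEC,
                   ACC_TAG, 0, NULL);
      }
  }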

JeffH: agree.  Want to understand how these things compose with each
other.  E.g., if accumulate is 1 double, I use the atomics API.  But
if I accumulate N doubles, I should use the active message API.  How
do I find the crossover?

JeffS: absolute crossover is the route to madness.

JeffH: no, I'm asking about ordering.

Sean: shows ordering enums

JeffH: what's the scope of those enums?

Howard: if you queue up a send and then an AMO and then another send
and then another AMO, is there a way to enforce ordering?

Sean: from the perspective of the API, relax the API and let the app
specify what the ordering constraints are.  So if you mix sends and
atomics, send-after-read / send-after-write (explaining enums on
screen).  E.g., a ...missed.  All this is subject to the provider
providing this functionality, of course.  Also, there are additional
fields in fi_ep_attr -- ordering limitations based on size.  E.g.,
write-after-read and read-after-write ordering may be based on the
size.
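
(Aside, not from the meeting: a small sketch of what "relax the API
and let the app specify the ordering it needs" looks like from the
MPI side, using the FI_ORDER_* bits and the size-limited ordering
fields.  Field placement may have moved relative to the draft being
shown, e.g. msg_order sits in the tx/rx attributes in later headers.)

  #include <rdma/fabric.h>

  /* Before fi_getinfo(): request the inter-operation ordering MPI
   * wants (send-after-write, send-after-read, write-after-write). */
  static void request_ordering(struct fi_info *hints)
  {
      hints->tx_attr->msg_order = FI_ORDER_SAW | FI_ORDER_SAR |
                                  FI_ORDER_WAW;
  }

  /* After fi_getinfo(): the provider reports what it actually
   * guarantees, including size-limited ordering such as
   * write-after-write only up to max_order_waw_size bytes. */
  static int waw_ordered(const struct fi_info *info, size_t msg_size)
  {
      if (!(info->tx_attr->msg_order & FI_ORDER_WAW))
          return 0;              /* app must insert its own fencing */
      return msg_size <= info->ep_attr->max_order_waw_size;
  }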

Howard: is there some way in the API to express that the provider
operates in the cache-coherency domain of the host processor?  E.g.,
if atomics are coherent w.r.t. the host processor, then ordering
concerns go away.  I don't remember what fi_write_coherent is.

JeffH: good point.  For ARMCI and MPI, strict ordering consistency is
critical.

Sean: let the provider signal what it can do and let the app decide
what it can do with it.  The question is: does the API provide enough
information to let the app decide what to do.

JeffH: on fi_ep_attr: there are 8 categories represented in the
ordering enums, but only some of those appear in fi_ep_attr.  What
does this mean for a send ordering max size?

Sean: I added these later -- I should probably reconcile these
two. Thanks.

JeffH: concrete example -- Cray XE6 -- Gemini interconnect connected
via HyperTransport.  There were some queries about the ordering of
writes between CPU loads/stores hitting HT and the Gemini NIC hitting
HT, and what the actual consistency model is.  This would determine
whether MPI does X or Y optimization (e.g., how many memory barriers
you need).

Howard: you're referring to HT ordering options in memory
registration, right?  ...aside: fi_write_coherent; this is probably
what JeffH would need.

JeffH: knowing cache coherence isn't enough.

Howard: agreed.  But there's at least 1 capability bit that's somewhat
useful.

JeffH: might be good to have similar SFI provider queries to what I
described above (e.g., for other processors?), and may also determine
where the network bolts into the memory controller / processor / etc.

JeffS: Cisco's perspective: we don't like verbs, and are looking for
an alternative.  <gives Cisco's perspective>  What about other
vendors here?

JeffS: Nvidia?

Rolf: haven't kept up on this... :-\  Was kinda hoping this would be
an overview.  :-\

JeffS: IBM?

MikeB: looks good.  Had some questions about dynamic testing.  Howard
has talked about this a bit, though.  But shared memory as a fabric --
is a shared memory provider expected in libfabric?  Or is that
outside of libfabric?  Also, what about collectives?  I know the
collective API is expected -- and it's important to MPI -- but what
if vendors want their own accelerators exposed through libfabric?

Sean: yes, the plan is to have a shared memory provider.  Not there
yet.  For vendor-specific additions, we want to have some kind of way
to query the provider and get vendor-specific interfaces from it.

Alan: on the shared memory provider: what if we want to use both
shared memory and a network fabric?  Do you have to use 2 different
providers?

Sean: probably yes.

JeffS: what about shared memory provider and PSM in the same process
-- can I figure out how to do the Right Thing?

Sean: the intent is yes; every provider will need to implement
loopback.  It depends on how the providers implement loopback.
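
(Aside, not from the meeting: the way an MPI could discover multiple
providers today is to enumerate fi_getinfo() results and filter on
the provider name -- e.g. a future shared memory provider for on-node
peers plus a network provider for everything else.  The version
number and the idea of filtering by name are illustrative only.)

  #include <stdio.h>
  #include <rdma/fabric.h>

  /* List every provider/fabric combination libfabric can offer; an
   * MPI could open one endpoint per provider it decides to use. */
  static void list_providers(void)
  {
      struct fi_info *info = NULL, *cur;

      if (fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, NULL, &info))
          return;

      for (cur = info; cur; cur = cur->next)
          printf("provider: %s, fabric: %s\n",
                 cur->fabric_attr->prov_name, cur->fabric_attr->name);

      fi_freeinfo(info);
  }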

Sean: Jeff, can I see usnic direct before you publish it?  Might be
useful to see what features you have defined, exposed, etc.

JeffS: We're not ready to share yet, but it might be possible to share
for the purpose of sharing ideas, etc.

JeffS: thanks everyone -- please be sure to take time to look into
this stuff if you have not already.

EOM

-- 
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/



