[Openframeworkwg] source code for proof of concept framework available

Mon Dec 9 15:49:05 PST 2013

> It is a tricky bit, but, eg rkeys are integers they should always be
> in host order in the API calls, immediate data is accessed as a
> uint32, not a void * so it should be in host order, etc.

Declaring an rkey as be32_t, means that it's big Endian everywhere.  The app simply passes it to the remote side, where it's used.  Host order forces the app be big/little endian aware and handle the byte swapping appropriately.

Having immediate data be host order can force byte swapping in the provider on both the sending and receiving side.  This may be a wash in terms of performance, unless the app is using immediate data to exchange one of several constant values.

> Byte swaps for that kind of stuff is just more places where people can
> make an error and never detect it until they test on a BE/LE mix.

The rkey and immediate data are intended to go across the network.  Declaring them in host order would seem to increase the likelihood of errors, since it forces byte swapping.  This seems especially true for the rkey.

For the tag matching interface, I followed the same principal as the rkey, though tags aren't usually exchanged like rkeys are.  I'd need to look into an implementation to see if an app using host order or network order would be more efficient, or if it would matter at all.

> I like the rule if a library presents something as a uintxx_t then it
> is host order.

agreed

> If it needs to present something as network order then
> it is binary data and it is a void *.

This would require an extra memory dereference.  For fixed-size items, I'd rather see beXX_t.

I should note that I extended the rkey and immediate data as seen by the API to 64 bits.

> > > - fe_errno.h ... either have sane library
> > >   specific unique error constants and xlate the kernel syscall errnos
> > >   to the correct context specific error code,
> > >   or stick with POSIX errno codes everwhere..
> >
> > My main issue with errno values are that practically everything gets
> > mapped to EINVAL, and ends up being useless for debugging.  So, I
> > went with an 'extended' errno scheme.  I use errno values where
> > reasonable, but allow for additional values beyond those defined in
> > errno.h.  I also added places for vendor error codes.
> 
> Yes, that is what I guessed you were doing. I personally would xlate.
> 
> So call the kernel's ibv_foo_bar and then if it fails translate the
> errno into something sane and context specific..
> 
> And yes, the kernel side has problems here lacking well defined
> errnos, but that can be incrementally fixed up, IMHO.
> 
> Or said another way - either the error code is only to be passed to
> fi_strerror() and may as well be opaque to application, or it is well
> defined and every single call has a well defined list of errors it is
> allowed to return and what they mean - consider POSIX does this for
> every single POSIX system call.
> 
> A reasonable path might be to have them be opaque and then define and
> specify as a need is discovered.

It is currently expected that all error codes _may_ to be passed to fi_strerror(), and that we will eventually define what errors can be returned from every call.

> > > - fi_atomic and fl_arch can probably use gcc intrinsics - introducing
> the
> > >   multi-threaded memory model in C++11 caused gcc to gain a full set
> > >   of memory barrier and atomic builtins for C code.
> >
> > fi_arch was simply copied from libibverbs -- I haven't done anything
> > with it.  It should be removed or moved internal for provider use
> > only.
> 
> Right
> 
> > fi_atomic is a placeholder for defining atomic operations
> > over the network.
> 
> K
> 
> > > The downside of having 'ops' pointers neatly organized into
> > > functionally group'd structures is now you pay a double dereference
> > > cost at every call site, verbs had only a single dereference.
> >
> > I agree.
> >
> > The trade-off is that the ops pointers can reference static
> > structures, which should always be a cache hit.  Verbs places the
> > ops in dynamically allocated memory.  So, we reduce the memory
> > footprint, and can leave ops NULL where entire sets of operations
> > are not supported.
> 
> If you keep them as static structures you are going to have even more
> overhead:
> 
> function(obj):
>   if (includes_function(obj->size,function))  // 1 deref and if
>     struct ops *ops = obhs->ops;        // 1 deref
>     if (ops->function) != null          // 1 deref and if
>         ops->function(objs,..);

We have asserts in place of if statements.  I don't see a need to add checks to every call to see if it exists.  (What's an app really going to do with ENOSYS outside of initialization code?)  The app just does:

obj->ops->function(obj...);

ops doesn't _have_ to point to a static struct, but it can.  And it's trivial to implement a function that returns ENOSYS, including an ops structure where everything returns ENOSYS, if we want to state that a provider must implement all functions. 

The problem with obj->function is that obj either ends up quite large, or we end up with branches in the underlying code.  If we have 8 calls to send/recv messages, 8 for RDMA, 10 tagged ops, say 8 atomics, 8 collectives, 8 triggered, 8 active messages, etc., then we're looking at 50-60+ function calls hanging off every endpoint object.  That's an additional 500 bytes or so of data per QP, which is significant at scale.  The alternative is to squish everything down into 2 calls, like we have today, which removes any chance of optimizing the calls. 

I guess we could define something like this:

fi_create_endpoint(... &ep);
ep->open_msg_interface(ep, &msg_ep)
ep->open_rdma_interface(ep, &rdma_ep)

such that the user would call:

msg_ep->send(), msg_ep->recv()
rdma_ep->read(), rdma_ep->write()

That requires a little more work on the user's part, but would avoid the double indirection, not result in a bunch of NULL function pointers, and could add more type checking.  The drawback is that we still have more overhead.

> If you make them dynamic and are very clever then the library itself
> can guarentee that every array entry is callable, and the above is
> just
> 
> function(obj):
>   return obj->function(objs,..);
>
> This would require telling the library what version of the API the
> application expects so it can allocate dummy 'return ENOSYS' slots to
> unsupported entries, which is doable with the right inlines.

I think this can be handled separately.  We could define a non-assert version of FI_ASSERT_OP (fi_includes_function like you state above) that an app could use to determine if a call exists.  We can add a version field to fi_info, but I don't know how useful that would actually be, since fi_includes_function would still seem to be needed.

> The standardized calling convention of 'return int' makes that
> possible.
>
> > I guess an app could always declare their own function pointer and
> > use it to avoid the double deref.
> 
> If you can make it so every function pointer is callable then this is
> fairly doable in an app:
> 
>    auto fn = obj->function;
>    while(1)
>        fn(obj,...);
> 
> But if the inline has other stuff going on then it doesn't work so
> well..

The inlines just add asserts for debug purposes.  They're not intended to do anything beyond that.

> And the above isn't going to be optimizable along the lines of what
> Christoph was talking abotu.
> 
> > > Also, the complete loss of static type safety by having 'fid_t' be the
> > > only argument type seems like a big negative to me..
> >
> > I have a note to look at changing this, including possibly removing
> > the fid_t data type completely, but that's a significant change, so
> > I've been putting it off.
> 
> Gets harder the longer you wait :)

yeah, yeah -- I'm more trying to avoid making unnecessary changes because things head in a slightly different direction.

- Sean