[ofiwg] wait sets

Tue Jan 26 16:58:11 PST 2016

This is a continuation of the discussion from today's ofiwg and github issue 1645.

An attempt to describe the desired application behavior is:

    1. Wait for one or more events to occur
    2. Get a list of queues that are ready for action
    3. Process each queue until empty

Assuming this behavior, this could roughly be broken into 3 cases:

a.)  Single libfabric call - for one queue
fi_cq_sread/fi_eq_sread/fi_cntr_wait each encapsulate the above steps into one call.  FWIW, some providers implement these calls by allocating a wait set internally.

b.)  libfabric only calls - for multiple queues
The 'natural' match for this would be to use:

    1. fi_wait
    2. fi_poll
    3. fi_cq_read/fi_eq_read/fi_cntr_read

Step 2 is optional.  Also, EQs cannot be assigned to poll sets, so all EQs (likely 0 or 1) would need to be checked at step 3.

c.)  OS + libfabric calls - for one or multiple queues
This modifies the above sequence to:

    1. poll/select
    2. fi_poll
    3. fi_cq_read/fi_eq_read/fi_cntr_read

In case b) we can require that providers implement fi_wait such that it avoids infinite waiting and application spinning, assuming correct application behavior.  It would be up to the provider to guarantee this.  E.g. fi_wait could clear any signals and check for ready queues before sleeping. 
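
To make the "clear any signals and check for ready queues before sleeping" ordering concrete, here is a rough sketch.  It is a toy model built on a plain pipe and poll(), not provider code; toy_wait, toy_post and toy_clear are made-up names, and a real provider's wait object may behave differently.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Toy model, NOT libfabric internals: one queue counter plus a pipe
 * standing in for the wait object. All names here are hypothetical. */
struct toy_wait {
    int pending;   /* entries currently queued */
    int fds[2];    /* fds[0] is polled, fds[1] is written to signal */
};

static int toy_open(struct toy_wait *w)
{
    w->pending = 0;
    return pipe(w->fds);
}

/* Queue an entry and signal the wait object. */
static void toy_post(struct toy_wait *w)
{
    char c = 1;
    ssize_t n;

    w->pending++;
    n = write(w->fds[1], &c, 1);
    assert(n == 1);
    (void) n;
}

/* Consume every signal byte currently sitting in the pipe. */
static void toy_clear(struct toy_wait *w)
{
    struct pollfd p = { .fd = w->fds[0], .events = POLLIN };
    char buf[64];

    while (poll(&p, 1, 0) == 1 && (p.revents & POLLIN)) {
        if (read(w->fds[0], buf, sizeof buf) <= 0)
            break;
    }
}

/* Returns 1 if the queue has work, 0 if the timeout expired first. */
static int toy_wait(struct toy_wait *w, int timeout_ms)
{
    struct pollfd p = { .fd = w->fds[0], .events = POLLIN };

    toy_clear(w);            /* 1. clear stale signals first */
    if (w->pending > 0)      /* 2. then check for ready queues */
        return 1;
    /* 3. only now sleep; anything posted after step 1 re-signals the fd */
    (void) poll(&p, 1, timeout_ms);
    return w->pending > 0;
}
```

Clearing before checking is what matters here: a stale signal cannot make the caller spin, and work that is already queued cannot be slept on, because the check happens after the clear.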

If this works, then there are issues only in case c).  The poll/select fd's (or wait objects) could come from a wait set or directly from the CQs/EQs/counters.  Even in the case of a wait set, the returned fd could be from epoll, which does not guarantee that a single underlying wait object is in use.  For example, verbs devices cannot share an fd between an EQ and a CQ.  I'm going to claim that this means the app must act on each CQ etc. to guarantee the wait object is reset.  I believe this is true regardless of what fi_poll returns.

I'm not quite sure what all this means yet.  :)  In case c) the use of a poll set does not seem to help, and could lead to lost events.  E.g. an entry is added to an empty CQ after fi_poll returns, while the fd is still readable.  If the app doesn't check that CQ, because it isn't in the fi_poll output, it could miss the completion.
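
Here is a small single-threaded model of that race, in case it helps the discussion.  Everything in it is an assumption made for illustration: two mock queues share one pipe as their wait object, the signal is written only when the object is not already signaled, and ws_snapshot stands in for fi_poll.  None of the names are libfabric calls, and real providers may reset their wait objects differently.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define NQ 2

/* Toy model, NOT libfabric: NQ mock queues share one pipe as wait object. */
struct mock_waitset {
    int pending[NQ];  /* entries queued on each mock CQ */
    int fds[2];       /* fds[0] is polled, fds[1] is written to signal */
    int signaled;     /* 1 while an unread byte sits in the pipe */
};

static int ws_open(struct mock_waitset *ws)
{
    for (int i = 0; i < NQ; i++)
        ws->pending[i] = 0;
    ws->signaled = 0;
    return pipe(ws->fds);
}

/* Add an entry; signal only if the wait object is not already signaled. */
static void ws_post(struct mock_waitset *ws, int q)
{
    char c = 1;

    ws->pending[q]++;
    if (!ws->signaled && write(ws->fds[1], &c, 1) == 1)
        ws->signaled = 1;
}

/* Stand-in for fi_poll: which queues are non-empty right now. */
static int ws_snapshot(const struct mock_waitset *ws, int ready[NQ])
{
    int n = 0;

    for (int i = 0; i < NQ; i++)
        if (ws->pending[i] > 0)
            ready[n++] = i;
    return n;
}

/* Reset the wait object by consuming the signal byte. */
static void ws_reset(struct mock_waitset *ws)
{
    char c;

    if (ws->signaled && read(ws->fds[0], &c, 1) == 1)
        ws->signaled = 0;
}

static int ws_readable(const struct mock_waitset *ws)
{
    struct pollfd p = { .fd = ws->fds[0], .events = POLLIN };

    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN);
}

/* One pass of the app loop. With check_all == 0 only the queues in the
 * snapshot are drained; otherwise every queue is. Returns the number of
 * entries left queued while the fd is no longer readable, or -1 on error. */
static int run_pass(int check_all)
{
    struct mock_waitset ws;
    int ready[NQ], nready, stranded = 0;

    if (ws_open(&ws) != 0)
        return -1;
    ws_post(&ws, 0);                   /* queue 0 has work, fd is readable */
    assert(ws_readable(&ws));          /* poll()/select() would return here */
    nready = ws_snapshot(&ws, ready);  /* "fi_poll" sees only queue 0 */
    ws_post(&ws, 1);                   /* late entry on queue 1, no new signal */

    if (check_all) {
        for (int i = 0; i < NQ; i++)
            ws.pending[i] = 0;         /* drain every queue */
    } else {
        for (int i = 0; i < nready; i++)
            ws.pending[ready[i]] = 0;  /* drain only the snapshot */
    }
    ws_reset(&ws);

    if (!ws_readable(&ws))
        for (int i = 0; i < NQ; i++)
            stranded += ws.pending[i];
    close(ws.fds[0]);
    close(ws.fds[1]);
    return stranded;
}
```

In this model run_pass(0) leaves one entry stranded on queue 1 with the fd unreadable, so the next poll/select would block on it, while run_pass(1) leaves none.  That is only as strong as the assumptions above, but it matches the claim that the app has to act on every queue rather than trust the poll output.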

As for the API, it's unfortunate, but I believe that fi_cq_sread/fi_eq_sread should be used to both read events from a queue and reset the wait object.  The alternative is a separate 'reset/rearm' call, which I would rather avoid, but others can chime in.  The sread calls are only needed in case c).  Even more unfortunate is that there is no fi_cntr_sread, only fi_cntr_wait.

<insert brilliant idea here>

- Sean