[ofiwg] send/recv "credits"

Sur, Sayantan sayantan.sur at intel.com
Thu Oct 9 08:20:45 PDT 2014


> So far, support for this has been mostly Jason and me talking with app-writer hats
> on, claiming that this is an app-driven request, and Sean telling us to learn to
> write better apps :).  Anyone else care to weigh in from an app-writer
> perspective?  "No, I always eagerly try to send until EAGAIN, then queue it,
> works fine."  or "Yes, I really need to know before I attempt the send
> because XXXX"?
> 

I cut my teeth writing MPI over InfiniBand in MVAPICH. We used to maintain credits to ensure that ibv_post_send would not fail due to a lack of available send WQEs. Looking back, we probably only did that because verbs at the time didn't have a clear EAGAIN semantic (I don't know if that has changed since).
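For what it's worth, the bookkeeping looked roughly like the sketch below. This is not MVAPICH's actual code; MAX_SEND_WQE and the function names are made up for illustration. The idea is simply to mirror the number of free send WQEs so that ibv_post_send is never called on a full send queue.

    /* Minimal sketch of sender-side credit tracking (illustrative only). */
    #include <infiniband/verbs.h>

    #define MAX_SEND_WQE 64            /* illustrative; matches the QP's max_send_wr */

    static int send_credits = MAX_SEND_WQE;

    static int post_send(struct ibv_qp *qp, struct ibv_send_wr *wr)
    {
        struct ibv_send_wr *bad_wr;

        if (send_credits == 0)
            return -1;                 /* no free send WQE: caller must queue */

        if (ibv_post_send(qp, wr, &bad_wr))
            return -1;                 /* post failed for some other reason */

        send_credits--;                /* one send WQE consumed */
        return 0;
    }

    /* Called for every send completion reaped from the CQ. */
    static void on_send_completion(void)
    {
        send_credits++;                /* that WQE is free again */
    }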

I also learned that unavailability of send slots is not the only reason one needs a queue in MPI. There can be any number of reasons, depending on the MPI feature set you're trying to implement. Two come to mind right now (sketched after the list below), but they aren't the only ones.

1. Preventing the receiver from being overwhelmed - there was a protocol within MVAPICH that broke a large send up into small packets when memory registration failed. In this mode, it was quite easy for a sender executing sends in a tight loop to quickly overwhelm the receiver. Another situation was when we hadn't received a credit update from the remote side in a while, i.e. remote_credits < some_threshold.

2. Handling transient network faults - if the QP is down for whatever reason, queue the message and return. The network fault handling logic would keep MPI alive up to a certain pre-defined limit, and abort the process if it really concluded that the network was partitioned.
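In both cases the shape of the logic is the same: check a condition, and if the message cannot go out right now, park it on a queue and flush it later. A rough sketch is below; all of the names (remote_credits, CREDIT_THRESHOLD, conn_is_up, do_post) are invented for illustration, and MVAPICH's real code was more involved.

    #include <stdlib.h>

    struct pending_send {
        struct pending_send *next;
        void *buf;
        size_t len;
    };

    static struct pending_send *pending_head, *pending_tail;
    static int remote_credits;          /* refreshed by credit updates from the peer */
    static int conn_is_up = 1;          /* cleared by the fault-handling logic */

    #define CREDIT_THRESHOLD 4          /* illustrative low-water mark */

    static int do_post(void *buf, size_t len);   /* the actual post, assumed elsewhere */

    static int try_send(void *buf, size_t len)
    {
        if (!conn_is_up || remote_credits < CREDIT_THRESHOLD) {
            struct pending_send *p = malloc(sizeof(*p));
            if (!p)
                return -1;
            p->next = NULL;
            p->buf = buf;
            p->len = len;
            if (pending_tail)
                pending_tail->next = p;
            else
                pending_head = p;
            pending_tail = p;
            return 0;                   /* queued; flushed when credits/link return */
        }
        remote_credits--;
        return do_post(buf, len);
    }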

So - I found that queues were required anyway.

Additionally, we always have to guard against a post failing, since a post can fail for any number of reasons (aside from not having slots available). So there is at least one branch there to handle the return code in any case.

My current thinking is that running out of slots should be a relatively rare situation - it only occurs when you've saturated the network (and are probably getting very good network efficiency :)). In that case the CPU is outrunning the fabric interface, and it is arguably a good point at which to slow the sender down by pushing the work back up to the application. One can see this as a kind of back-pressure, although the analogy is strained.

Therefore, I think EAGAIN is the semantic that ends up being more efficient for the situations I have had experience with.
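With libfabric, the unavoidable return-code branch and the EAGAIN case collapse into the same check, something like the sketch below. This is just how I would picture it, not a proposal for the API; queue_for_later() and progress_cq() are assumed application helpers, not libfabric calls.

    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_errno.h>

    /* Assumed application helpers, not part of the libfabric API. */
    void queue_for_later(const void *buf, size_t len, fi_addr_t dest, void *ctx);
    void progress_cq(void);

    static int send_or_queue(struct fid_ep *ep, const void *buf, size_t len,
                             fi_addr_t dest, void *ctx)
    {
        ssize_t ret = fi_send(ep, buf, len, NULL, dest, ctx);

        if (ret == 0)
            return 0;                   /* posted */

        if (ret == -FI_EAGAIN) {
            /* Transmit queue is full: push the work back up and let the
             * application apply back-pressure. */
            queue_for_later(buf, len, dest, ctx);
            progress_cq();              /* reap completions to free up slots */
            return 0;
        }

        return (int)ret;                /* some other failure */
    }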

Thanks,
Sayantan.


