[ofiwg] fault-tolerance

Tue Sep 8 10:03:43 PDT 2015

One thing I can think of is having the provider attempt some link-level resilience, failing over to other paths where possible when a failure is detected. That’s somewhat low-hanging fruit and probably not the responsibility of OFI itself.
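
A rough sketch of that kind of failover, to make the idea concrete. This is only a toy model and not libfabric code: `PathDown`, `send_with_failover`, and the two example paths are hypothetical names invented for illustration, and a real provider would do this below the API.

```python
class PathDown(Exception):
    """Hypothetical error raised when a path cannot deliver a message."""


def send_with_failover(paths, payload):
    """Try each path in order and return the first successful result.

    Raises PathDown only if every path fails, i.e. the failure could not
    be resolved automatically and has to be reported upward.
    """
    last_error = None
    for path in paths:
        try:
            return path(payload)
        except PathDown as exc:
            last_error = exc  # remember the failure and try the next path
    raise PathDown("all paths failed") from last_error


def broken_path(payload):
    # Stand-in for a link that is currently down.
    raise PathDown("link 0 is down")


def healthy_path(payload):
    # Stand-in for a working link; reports which link carried the payload.
    return ("link 1", payload)


result = send_with_failover([broken_path, healthy_path], b"hello")
print(result)
```

The case where every path fails is the interesting one for the application, since that is the failure that has to be reported upward.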

What would be more helpful is for OFI to provide a well-specified mechanism for reporting communication failures that it can’t resolve automatically. Some sort of error reporting from OFI calls saying that a specific send failed would be nice. From that error, we can infer which target failed; because OFI doesn’t have any collectives, a failed send points at a single target, which keeps that inference simple.
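
To illustrate the bookkeeping I have in mind on the application side, here is a minimal sketch. It assumes each send carries an application-chosen context that comes back with the error report. None of this is the OFI API: `SendError`, `PeerTracker`, and the error code string are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SendError:
    """Hypothetical error report for one failed send."""
    context: int  # opaque handle the application attached to the send
    code: str     # error code reported for that send


class PeerTracker:
    """Maps outstanding sends to their targets so a failed send names a peer."""

    def __init__(self):
        self._pending = {}   # context -> destination peer
        self.failed = set()  # peers inferred to have failed
        self._next_context = 0

    def post_send(self, peer):
        # Record which peer this send targets and return its context.
        context = self._next_context
        self._next_context += 1
        self._pending[context] = peer
        return context

    def on_success(self, context):
        # The send completed, so it no longer needs tracking.
        self._pending.pop(context, None)

    def on_error(self, error):
        # Point-to-point send: the failed context identifies exactly one peer.
        peer = self._pending.pop(error.context, None)
        if peer is not None:
            self.failed.add(peer)
        return peer


tracker = PeerTracker()
first = tracker.post_send("rank1")
second = tracker.post_send("rank2")
tracker.on_success(first)
failed_peer = tracker.on_error(SendError(context=second, code="UNREACHABLE"))
print(failed_peer)
```

With a collective, one failed operation would involve many peers, and this one-to-one lookup would no longer be enough.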

Thanks,
Wesley

On 9/8/15, 11:57 AM, "ofiwg-bounces at lists.openfabrics.org on behalf of Hefty, Sean" <ofiwg-bounces at lists.openfabrics.org on behalf of sean.hefty at intel.com> wrote:

>> What's the state of fault-tolerance in OFI?  Would it be prudent for
>> someone to write OFI code that aspired to survive process failures?  Are
>> any implementations known to support this robustly right now?
>
>This would be provider specific.  I'm not aware of anything that's coded to handle failures.
>
>Having an example of this over libfabric would be great, though I'm not sure who's going to volunteer to write this.
>
>It's not clear to me how fault tolerance relates to a networking API.  For example, what specific lower-level features does an app need to make this happen?  Are there restrictions that providers need to report to apps regarding their level of support?  Is this something that even belongs to this level of API?
>_______________________________________________
>ofiwg mailing list
>ofiwg at lists.openfabrics.org
>http://lists.openfabrics.org/mailman/listinfo/ofiwg