[ofiwg] fault-tolerance

Hefty, Sean sean.hefty at intel.com
Tue Sep 8 09:57:32 PDT 2015


> What's the state of fault-tolerance in OFI?  Would it be prudent for
> someone to write OFI code that aspired to survive process failures?  Are
> any implementations known to support this robustly right now?

This would be provider-specific.  I'm not aware of any implementation that's coded to handle failures.

An example of this over libfabric would be great, though I'm not sure who would volunteer to write it.

It's not clear to me how fault tolerance relates to a networking API.  For example, what specific lower-level features does an app need to make this happen?  Are there restrictions that providers need to report to apps regarding their level of support?  Is this something that even belongs at this level of the API?

