[ofiwg] fault-tolerance

Dave Goodell (dgoodell) dgoodell at cisco.com
Tue Sep 8 12:16:14 PDT 2015


One fault tolerance issue that we know we need some clarity on is being tracked here: https://github.com/ofiwg/libfabric/issues/826

-Dave

On Sep 8, 2015, at 12:40 PM, Jeff Hammond <jeff.science at gmail.com> wrote:

> Some of the requirements for FT include:
> - precise error code reporting on failures; a remote process failure must never cause deadlock.
> - containment of side effects of endpoint failures, especially no byzantine behavior.
> - easy to deregister failed endpoints.
> - easy to register new endpoints on the fly.  (think MPI_Comm_spawn_multiple here)
> 
> Thanks,
> 
> Jeff
> 
> On Tue, Sep 8, 2015 at 10:28 AM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
> 
> >
> >What would be more helpful would be to have OFI provide a well-specified
> >mechanism for reporting communication failures that it can’t
> >automatically resolve. Some sort of error reporting from OFI calls to say
> >that a specific send failed would be nice. From that error code, we can
> >infer which target failed, since OFI doesn’t have any collectives, which
> >would make this more difficult.
> 
> 
> Errors should be reported through the CQ and read with fi_cq_readerr. That’s what you want, right?
> 
> Thanks,
> Sayantan.
> 
> 
> >
> >Thanks,
> >Wesley
> >
> >
> >
> >On 9/8/15, 11:57 AM, "Hefty, Sean" <sean.hefty at intel.com> wrote:
> >
> >>> What's the state of fault-tolerance in OFI?  Would it be prudent for
> >>> someone to write OFI code that aspired to survive process failures?  Are
> >>> any implementations known to support this robustly right now?
> >>
> >>This would be provider specific.  I'm not aware of anything that's coded
> >>to handle failures.
> >>
> >>Having an example of this over libfabric would be great, though I'm not
> >>sure who's going to volunteer to write this.
> >>
> >>It's not clear to me how fault tolerance relates to a networking API.
> >>For example, what specific lower-level features does an app need to make
> >>this happen?  Are there restrictions that providers need to report to
> >>apps regarding their level of support?  Is this something that even
> >>belongs to this level of API?
> >>_______________________________________________
> >>ofiwg mailing list
> >>ofiwg at lists.openfabrics.org
> >>http://lists.openfabrics.org/mailman/listinfo/ofiwg
> 
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/



