[ofiwg] fault-tolerance
Sur, Sayantan
sayantan.sur at intel.com
Tue Sep 8 10:28:22 PDT 2015
>
>What would be more helpful would be to have OFI provide a well-specified
>mechanism for reporting communication failures that it can’t
>automatically resolve. Some sort of error reporting from OFI calls to say
>that a specific send failed would be nice. From that error code, we can
>infer which target failed, since OFI doesn’t have any collectives, which
>would make this more difficult.
Errors should be reported through the CQ and retrieved with fi_cq_readerr. That’s what you want, right?
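
As a rough illustration of how an application might infer the failed target from an error completion, here is a minimal sketch. It assumes the application attaches its own per-operation context to each send; `struct send_ctx`, `note_failure`, `peer_failed`, and `MAX_PEERS` are hypothetical names, and `struct err_entry` is only a simplified stand-in for libfabric's `struct fi_cq_err_entry` (which carries `op_context` and `err` fields, among others), so the sketch builds without libfabric installed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PEERS 4   /* hypothetical job size, for illustration only */

/* Hypothetical per-operation context: the application allocates one per
 * posted send and passes its address as the operation's context pointer. */
struct send_ctx {
    int peer;   /* application-defined index of the destination process */
};

/* Simplified stand-in for libfabric's struct fi_cq_err_entry.
 * Only the two fields used by this sketch are reproduced. */
struct err_entry {
    void *op_context;   /* context pointer given when the send was posted */
    int   err;          /* error code reported for the failed operation */
};

/* Application-side record of which peers have had a send fail. */
static int peer_failed[MAX_PEERS];

/* Record the target of a failed send.
 * Returns the peer index, or -1 if the entry carries no usable context. */
static int note_failure(const struct err_entry *e)
{
    assert(e != NULL);

    const struct send_ctx *ctx = e->op_context;
    if (ctx == NULL || ctx->peer < 0 || ctx->peer >= MAX_PEERS)
        return -1;

    peer_failed[ctx->peer] = 1;
    return ctx->peer;
}
```

In a real application the error entry would presumably be filled in by fi_cq_readerr after fi_cq_read returns -FI_EAVAIL, and the result passed to something like note_failure. Whether a given provider actually reports a dead peer this way is provider specific.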
Thanks,
Sayantan.
>
>Thanks,
>Wesley
>
>
>
>On 9/8/15, 11:57 AM, "ofiwg-bounces at lists.openfabrics.org on behalf of
>Hefty, Sean" <ofiwg-bounces at lists.openfabrics.org on behalf of
>sean.hefty at intel.com> wrote:
>
>>> What's the state of fault-tolerance in OFI? Would it be prudent for
>>> someone to write OFI code that aspired to survive process failures? Are
>>> any implementations known to support this robustly right now?
>>
>>This would be provider specific. I'm not aware of anything that's coded
>>to handle failures.
>>
>>Having an example of this over libfabric would be great, though I'm not
>>sure who's going to volunteer to write this.
>>
>>It's not clear to me how fault tolerance relates to a networking API.
>>For example, what specific lower-level features does an app need to make
>>this happen? Are there restrictions that providers need to report to
>>apps regarding their level of support? Is this something that even
>>belongs to this level of API?
>>_______________________________________________
>>ofiwg mailing list
>>ofiwg at lists.openfabrics.org
>>http://lists.openfabrics.org/mailman/listinfo/ofiwg