[ofiwg] RFC on error handling in fi_getinfo call

Jeff Squyres (jsquyres) jsquyres at cisco.com
Fri Jan 16 08:12:00 PST 2015


- For 10+ years, Open MPI has fallen back to sockets if no other networks are available (e.g., if there was some kind of error with some other supposed-to-be-available high speed network).  Users have found it pretty obvious when this happens, for two reasons:

1. OMPI usually complains (to stderr) because it can usually tell when a high-speed network looks like it is *supposed* to be available, but is not. Per Jason's note, perhaps such libfabric stderr warnings could be emitted only if a magic env var is present (e.g., FI_DEBUG).
2. Depending on the app, the delivered performance can be quite different with TCP sockets than with a high-speed network.

I know that libfabric is in a different situation than Open MPI here, but I'm raising the point that even when OMPI was the upstart/disruptive MPI that had something to prove (early 2000s), it had this fallback-to-TCP behavior.

- Sean already heard me +1 the idea of run-time selection of providers, but I'll do it again publicly.  :-)

- Part of the fear is that applications will simply use the first result from fi_getinfo and ignore all the others (because that's what at least some do in verbs with the result of ibv_get_device_list).  Perhaps part of the solution here is to encourage better behavior in libfabric from the very beginning -- our test programs and examples should iterate through all the results of fi_getinfo, not just blindly use the first one.

- As part of the "Carry error information as part of fi_info" idea, it might be useful to allow providers to attach *strings* to the error info (vs. just a single integer error value).
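
To make the last two points concrete, here is a minimal sketch of the "walk every result" pattern. It deliberately does not use the real libfabric types: `struct info_entry`, its `err` and `err_str` fields, and `pick_usable()` are all hypothetical stand-ins, and only the `next` chaining mirrors how the fi_getinfo results are linked. Treat it as an illustration of the idea under those assumptions, not as a proposed API.

```c
#include <stdio.h>

/* Hypothetical stand-in for one entry in the list returned by fi_getinfo.
 * This is NOT the real struct fi_info; only the "next" chaining mirrors it.
 * err and err_str illustrate the "carry error info, including a string"
 * idea and are assumptions made for this sketch. */
struct info_entry {
    struct info_entry *next;
    const char *prov_name;  /* provider name, for messages only */
    int err;                /* 0 if the entry is usable */
    const char *err_str;    /* optional provider-supplied detail, may be NULL */
};

/* Walk the whole list instead of blindly taking the head.
 * Entries with an error are reported to "log" (if non-NULL) and skipped.
 * Returns the first usable entry, or NULL if none is usable. */
static const struct info_entry *
pick_usable(const struct info_entry *head, FILE *log)
{
    const struct info_entry *chosen = NULL;
    const struct info_entry *cur;

    for (cur = head; cur != NULL; cur = cur->next) {
        if (cur->err != 0) {
            if (log != NULL)
                fprintf(log, "%s: unavailable (err %d): %s\n",
                        cur->prov_name, cur->err,
                        cur->err_str ? cur->err_str : "no detail");
            continue;
        }
        if (chosen == NULL)
            chosen = cur;   /* keep going so later failures are still reported */
    }
    return chosen;
}
```

A caller could pass stderr as the log only when something like FI_DEBUG is set, and NULL otherwise, which would keep the warnings opt-in as suggested in point 1 above.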

> On Jan 16, 2015, at 10:31 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
> On 01/15/2015 05:46 PM, Hefty, Sean wrote:
>> We've also discussed adding runtime options to disable built-in providers as a work-around for buggy providers.
> I would +1 this feature. Not only for working around buggy providers, but for ease of comparing results from things like MPI benchmarks.
> Ken

Jeff Squyres
jsquyres at cisco.com
