[ofiwg] RFC on error handling in fi_getinfo call

Jeff Squyres (jsquyres) jsquyres at cisco.com
Fri Jan 16 16:46:31 PST 2015


It matches Open MPI behavior, yes.

That's part of the #1 I mentioned.

It's not foolproof, but like I said, you can usually tell when a high-speed network is *supposed* to be there but isn't.

E.g., if you can allocate some resources, but fail when allocating your Nth QP/CQ/receive buffer, etc.

You can't tell, however, if ibv_get_device_list() returns no devices, of course.
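
A minimal sketch of that distinction, assuming the caller already knows how many devices enumeration reported (for instance from ibv_get_device_list()) and whether the later resource setup succeeded. The enum, the function name, and both parameters are hypothetical and exist only for illustration; neither Open MPI nor libfabric exposes them.

```c
/* Hypothetical classification of a fall-back to sockets; illustrative only. */
enum fallback_kind {
    NO_FALLBACK,      /* high-speed network is usable */
    FALLBACK_SILENT,  /* no devices found: nothing suggests a fast network exists */
    FALLBACK_WARN     /* devices found but setup failed: worth a warning */
};

/*
 * num_devices: device count reported by enumeration (assumed input).
 * setup_ok:    nonzero if every QP/CQ/receive buffer allocation succeeded
 *              (assumed input).
 */
static enum fallback_kind classify_fallback(int num_devices, int setup_ok)
{
    if (num_devices <= 0)
        return FALLBACK_SILENT;  /* cannot tell a network was expected */
    if (!setup_ok)
        return FALLBACK_WARN;    /* network looked present, then failed */
    return NO_FALLBACK;
}
```

Only the FALLBACK_WARN case gives the library enough evidence to complain; the empty-device-list case looks the same as a machine that never had a high-speed network.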



> On Jan 16, 2015, at 7:43 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
> 
> I was poking around the OpenMPI FAQ today, and found this:
> 
> "2. But wait -- I'm using a high-speed network. Do I have to disable the TCP BTL?
> No. Following the so-called "Law of Least Astonishment", Open MPI assumes that if you have both a TCP network and at least one high-speed network (such as Myrinet or InfiniBand), you will likely only want to use the high-speed network(s) for MPI message passing. Hence, the tcp BTL component will sense this and automatically deactivate itself."
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 
> Does this match the current Open MPI implementation? It seems nifty, and it avoids the issue of the sockets provider being used by mistake.
> 
> Could the sockets provider be modified to behave this way?
> 
>> -----Original Message-----
>> From: ofiwg-bounces at lists.openfabrics.org [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres (jsquyres)
>> Sent: Friday, January 16, 2015 8:12 AM
>> To: Ken Raffenetti
>> Cc: ofiwg at lists.openfabrics.org
>> Subject: Re: [ofiwg] RFC on error handling in fi_getinfo call
>> 
>> FWIW:
>> 
>> - For 10+ years, Open MPI has fallen back to sockets if no other networks are
>> available (e.g., if there was some kind of error with some other supposed-to-
>> be-available high speed network).  Users have found it pretty obvious when
>> this happens, for two reasons:
>> 
>> 1. OMPI usually complained (to stderr) because it can usually tell when a high
>> speed network looks like it is *supposed* to be available, but is not. Per
>> Jason's note, perhaps such libfabric stderr warnings can only be emitted if a
>> magic env var is present (e.g., FI_DEBUG).
>> 2. Depending on the app, the delivered performance can be quite different
>> with TCP sockets than a high-speed network.
>> 
>> I know that libfabric is in a different situation than Open MPI here, but I'm
>> raising the point that even when OMPI was the upstart/disruptive MPI that
>> had something to prove (early 2000s), it had this fallback-to-TCP behavior.
>> 
>> - Sean already heard me +1 the idea of run-time selection of providers, but
>> I'll do it again publicly.  :-)
>> 
>> - Part of the fear is that applications will simply use the first result from
>> fi_getinfo and ignore all the others (because that's what at least some do in
>> verbs with the result of ibv_get_device_list).  Perhaps part of the solution
>> here is to encourage better behavior in libfabric from the very beginning --
>> our test programs and examples should iterate through all the results of
>> fi_getinfo, not just blindly use the first one.
>> 
>> - As part of the "Carry error information as part of fi_info" idea, it might be
>> useful to allow providers to attach *strings* to the error info (vs. just a
>> single integer error value).
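
As an illustration of the "iterate through all the results" point above, here is a minimal sketch of walking a whole result list instead of taking its head. It uses a stand-in struct, not the libfabric definition: in libfabric the entries returned by fi_getinfo are chained through the `next` field of struct fi_info, and the provider name lives in `fabric_attr->prov_name`. The struct `prov_entry`, the function `pick_provider`, and the policy of preferring any provider other than a named fallback are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two pieces of struct fi_info used here. */
struct prov_entry {
    struct prov_entry *next;   /* next result in the list, NULL at the end */
    const char *prov_name;     /* provider name, e.g. "sockets" */
};

/*
 * Walk the whole list. Return the first entry whose provider is not
 * `fallback`; if every entry is the fallback provider, return the first
 * of those; return NULL for an empty list.
 */
static const struct prov_entry *
pick_provider(const struct prov_entry *list, const char *fallback)
{
    const struct prov_entry *fallback_hit = NULL;

    for (const struct prov_entry *cur = list; cur != NULL; cur = cur->next) {
        if (strcmp(cur->prov_name, fallback) != 0)
            return cur;              /* a non-fallback provider wins */
        if (fallback_hit == NULL)
            fallback_hit = cur;      /* remember the first fallback entry */
    }
    return fallback_hit;
}
```

With a list ordered "sockets" then "verbs", calling pick_provider(list, "sockets") returns the "verbs" entry, whereas code that blindly used the first result would have ended up on sockets.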
>> 
>> 
>> 
>>> On Jan 16, 2015, at 10:31 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>>> 
>>> On 01/15/2015 05:46 PM, Hefty, Sean wrote:
>>>> We've also discussed adding runtime options to disable built-in providers as a work-around for buggy providers.
>>> 
>>> I would +1 this feature. Not only for working around buggy providers, but for ease of comparing results from things like MPI benchmarks.
>>> 
>>> Ken
>>> _______________________________________________
>>> ofiwg mailing list
>>> ofiwg at lists.openfabrics.org
>>> http://lists.openfabrics.org/mailman/listinfo/ofiwg
>> 
>> 
>> --
>> Jeff Squyres
>> jsquyres at cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 


-- 
Jeff Squyres
jsquyres at cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/



