[ofiwg] noob questions
Hefty, Sean
sean.hefty at intel.com
Wed Nov 13 14:42:58 PST 2019
Can you provide the output of fi_info -v for your provider?
From the output below, it looks like your provider will rely on the ofi_rxd utility provider for its functionality. I.e. your provider supports DGRAM endpoints. Can you confirm that?
- Sean
> -----Original Message-----
> From: James Swaro <jswaro at cray.com>
> Sent: Wednesday, November 13, 2019 2:39 PM
> To: Don Fry <DFry at lightfleet.com>; Barrett, Brian <bbarrett at amazon.com>; Hefty, Sean
> <sean.hefty at intel.com>; Byrne, John (Labs) <john.l.byrne at hpe.com>;
> ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] noob questions
>
> Just pulling from your debug here, it looks like you have some requirements that your
> provider cannot satisfy for OpenMPI.
>
> checking info in util_getinfo
> lf
> libfabric:20561:lf:core:ofi_check_info():998<info> Unsupported capabilities
> libfabric:20561:lf:core:ofi_check_info():999<info> Supported: FI_MSG, FI_MULTICAST,
> FI_RECV, FI_SEND
> libfabric:20561:lf:core:ofi_check_info():999<info> Requested: FI_MSG, FI_RMA, FI_READ,
> FI_RECV, FI_SEND, FI_REMOTE_READ
> checking info in util_getinfo
> lf
> libfabric:20561:lf:core:ofi_check_ep_type():629<info> Unsupported endpoint type
> libfabric:20561:lf:core:ofi_check_ep_type():630<info> Supported: FI_EP_DGRAM
> libfabric:20561:lf:core:ofi_check_ep_type():630<info> Requested: FI_EP_MSG
> libfabric:20561:core:core:fi_getinfo_():891<warn> fi_getinfo: provider lf returned -61
> (No data available)
> libfabric:20561:core:core:fi_getinfo_():891<warn> fi_getinfo: provider ofi_rxm returned
> -61 (No data available)
> libfabric:20561:core:core:ofi_layering_ok():796<info> Need core provider, skipping
> ofi_rxm
> libfabric:20561:core:core:ofi_layering_ok():796<info> Need core provider, skipping
> ofi_rxd
> libfabric:20561:core:core:ofi_layering_ok():796<info> Need core provider, skipping
> ofi_mrail
> checking info in util_getinfo
> lf
> libfabric:20561:lf:core:ofi_check_ep_type():629<info> Unsupported endpoint type
> libfabric:20561:lf:core:ofi_check_ep_type():630<info> Supported: FI_EP_MSG
> libfabric:20561:lf:core:ofi_check_ep_type():630<info> Requested: FI_EP_DGRAM
> checking info in util_getinfo
> lf
> libfabric:20561:lf:core:ofi_check_mr_mode():510<info> Invalid memory registration mode
> libfabric:20561:lf:core:ofi_check_mr_mode():511<info> Expected:
> libfabric:20561:lf:core:ofi_check_mr_mode():511<info> Given:
> libfabric:20561:core:core:fi_getinfo_():891<warn> fi_getinfo: provider lf returned -61
> (No data available)
>
> -- Jim
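
The capability mismatch in the first ofi_check_info() block above can be worked out mechanically. Below is a minimal sketch, not part of libfabric, that diffs the "Requested" set against the "Supported" set. The helper names (parse_caps, missing_caps) are made up for illustration, and it assumes the wrapped log lines have been rejoined into single lines first.

```python
import re

# Matches the "Supported:" / "Requested:" lines that appear in the debug output.
_CAPS_LINE = re.compile(r"<info>\s*(Supported|Requested):\s*(.*)$")


def parse_caps(log_lines):
    """Return {'Supported': set, 'Requested': set} from the first such pair found."""
    caps = {}
    for line in log_lines:
        m = _CAPS_LINE.search(line)
        # Keep only the first occurrence of each label, so later
        # endpoint-type lines do not overwrite the capability lines.
        if m and m.group(1) not in caps:
            caps[m.group(1)] = {c.strip() for c in m.group(2).split(",") if c.strip()}
    return caps


def missing_caps(log_lines):
    """Capabilities that were requested but are not in the supported set."""
    caps = parse_caps(log_lines)
    return sorted(caps.get("Requested", set()) - caps.get("Supported", set()))


# The two lines from the log above, with the mail client's line wrapping undone.
LOG = [
    "libfabric:20561:lf:core:ofi_check_info():999<info> Supported: FI_MSG, FI_MULTICAST, FI_RECV, FI_SEND",
    "libfabric:20561:lf:core:ofi_check_info():999<info> Requested: FI_MSG, FI_RMA, FI_READ, FI_RECV, FI_SEND, FI_REMOTE_READ",
]

if __name__ == "__main__":
    print(missing_caps(LOG))
```

For this log the difference is FI_RMA, FI_READ and FI_REMOTE_READ, i.e. the request asks for RMA read support that the lf provider does not advertise.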
>
>
> On 11/13/19, 4:36 PM, "ofiwg on behalf of Don Fry" <ofiwg-bounces at lists.openfabrics.org
> on behalf of DFry at lightfleet.com> wrote:
>
> Here is another run with the output suggested by James Swaro
>
> Don
> ________________________________________
> From: Don Fry
> Sent: Wednesday, November 13, 2019 2:26 PM
> To: Barrett, Brian; Hefty, Sean; Byrne, John (Labs); ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] noob questions
>
> attached is the output of mpirun with some of my debugging printf's
>
> Don
> ________________________________________
> From: Barrett, Brian <bbarrett at amazon.com>
> Sent: Wednesday, November 13, 2019 2:05 PM
> To: Don Fry; Hefty, Sean; Byrne, John (Labs); ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] noob questions
>
> That likely means that something failed in initializing the OFI provider. Without
> seeing the debugging output John mentioned, it's really hard to say *why* it failed to
> initialize. There are many reasons, including not being able to conform to a bunch of
> provider assumptions that Open MPI has on its providers.
>
> Brian
>
> -----Original Message-----
> From: Don Fry <DFry at lightfleet.com>
> Date: Wednesday, November 13, 2019 at 2:01 PM
> To: "Barrett, Brian" <bbarrett at amazon.com>, "Hefty, Sean" <sean.hefty at intel.com>,
> "Byrne, John (Labs)" <john.l.byrne at hpe.com>, "ofiwg at lists.openfabrics.org"
> <ofiwg at lists.openfabrics.org>
> Subject: Re: [ofiwg] noob questions
>
> When I tried --mca pml cm it complains that "PML cm cannot be selected". Maybe
> I needed to enable cm when I configured openmpi? I didn't specifically enable or
> disable it. It could also be that my getinfo routine doesn't have a capability set
> properly.
>
> my latest command line was:
> mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include "lf;ofi_rxm"
> ./mpi_latency (where lf is my provider)
>
> Thanks for the pointers, I will do some more debugging on my end.
>
> Don
> ________________________________________
> From: Barrett, Brian <bbarrett at amazon.com>
> Sent: Wednesday, November 13, 2019 12:53 PM
> To: Hefty, Sean; Byrne, John (Labs); Don Fry; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] noob questions
>
> You can force Open MPI to use libfabric as its transport by adding "-mca pml cm
> -mca mtl ofi" to the mpirun command line.
>
> Brian
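
For anyone scripting these runs, here is a rough sketch that assembles the mpirun argument list from the flags mentioned in this thread (pml cm, mtl ofi, mtl_ofi_provider_include, and the *_base_verbose settings). The function name build_mpirun_cmd is hypothetical, and "lf" and ./mpi_latency are simply the provider and test program named in this thread.

```python
import shlex


def build_mpirun_cmd(provider, program, verbose_level=None):
    """Build an mpirun argv that forces the cm PML and the ofi MTL."""
    argv = [
        "mpirun",
        "--mca", "pml", "cm",
        "--mca", "mtl", "ofi",
        "--mca", "mtl_ofi_provider_include", provider,
    ]
    if verbose_level is not None:
        # Same three frameworks whose verbosity is raised elsewhere in the thread.
        for framework in ("mtl", "btl", "pml"):
            argv += ["--mca", f"{framework}_base_verbose", str(verbose_level)]
    argv.append(program)
    return argv


cmd = build_mpirun_cmd("lf;ofi_rxm", "./mpi_latency", verbose_level=100)

# shlex.join quotes the provider string so a shell does not treat ';' as a
# command separator.
print(shlex.join(cmd))
```

Passing the list straight to subprocess.run(cmd) avoids the shell entirely. If the command is pasted into a shell instead, the provider string has to stay quoted because of the semicolon.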
>
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> on behalf of "Hefty, Sean"
> <sean.hefty at intel.com>
> Date: Wednesday, November 13, 2019 at 12:52 PM
> To: "Byrne, John (Labs)" <john.l.byrne at hpe.com>, Don Fry <DFry at lightfleet.com>,
> "ofiwg at lists.openfabrics.org" <ofiwg at lists.openfabrics.org>
> Subject: Re: [ofiwg] noob questions
>
> My guess is that OpenMPI has an internal socket transport that it is using. You
> likely need to force MPI to use libfabric, but I don't know enough about OMPI to do that.
>
> Jeff (copied) likely knows the answer here, but you may need to create him a new
> meme for his assistance.
>
> - Sean
>
> > -----Original Message-----
> > From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Byrne, John (Labs)
> > Sent: Wednesday, November 13, 2019 11:26 AM
> > To: Don Fry <DFry at lightfleet.com>; ofiwg at lists.openfabrics.org
> > Subject: Re: [ofiwg] noob questions
> >
> > You only mention the dgram and msg types and the mtl_ofi component wants rdm. If you
> > don’t support rdm, I would have expected your getinfo routine to return error -61. You
> > can try using the ofi_rxm provider with your provider to add rdm support, replacing
> > verbs in “--mca mtl_ofi_provider_include verbs;ofi_rxm” with your provider.
> >
> >
> > openmpi transport selection is complex. Adding insane levels of verbosity can help you
> > understand what is happening. I tend to use: --mca mtl_base_verbose 100
> > --mca btl_base_verbose 100 --mca pml_base_verbose 100
> >
> >
> >
> > John Byrne
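
A side note on the "-61" mentioned above: as far as I know, libfabric returns negated errno-style values, and 61 is ENODATA on Linux. Here is a small sketch for decoding such return codes. The helper describe_fi_error is made up for illustration, and the numeric value of ENODATA differs on other platforms.

```python
import errno
import os


def describe_fi_error(ret):
    """Map a negative return value such as -61 to (errno name, message)."""
    code = -ret
    # errno.errorcode maps numeric codes to names for the current platform.
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    # Assumes Linux numbering, where ENODATA is 61.
    print(describe_fi_error(-61))
```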
> >
> >
> >
> > From: ofiwg [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Don Fry
> > Sent: Wednesday, November 13, 2019 10:54 AM
> > To: ofiwg at lists.openfabrics.org
> > Subject: [ofiwg] noob questions
> >
> >
> >
> > I have written a libfabric provider for our hardware and it passes all the fabtests I
> > expect it to (dgram and msg). I am trying to run some MPI tests using libfabric under
> > openmpi (4.0.2). When I run a simple ping-pong test using mpirun it sends and receives
> > the messages using the tcp/ip protocol. It does call my fi_getinfo routine, but
> > doesn't use my provider's send/receive routines. I have rebuilt the libfabric library
> > with sockets disabled, then again with --disable-tcp, then --disable-udp, and fi_info
> > reports fewer and fewer providers until it only lists my provider, but each time I run
> > the mpi test, it still uses the ip protocol to exchange messages.
> >
> >
> >
> > When I configured openmpi I specified --with-libfabric=/usr/local/ and the libfabric
> > library is being loaded and executed.
> >
> >
> >
> > I am probably doing something obviously wrong, but I don't know enough about MPI or
> > maybe libfabric, so need some help. If this is the wrong list, redirect me.
> >
> >
> >
> > Any suggestions?
> >
> > Don
>
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/ofiwg