[ofiwg] inserting duplicate addresses into an AV
michael.blocksome at intel.com
Tue Mar 20 12:05:54 PDT 2018
ew .. MPI spawn.
Again, how can a rank - even if it is yet to be attached to an MPI job - get the same fabric endpoint address from its OFI provider as some other rank in the system? Is this spawn test doing something crazy like attach-detach-attach-detach-etc and a previous address is not being removed properly before the next (same) address is inserted again?
I guess I don't understand the intricacies of this MPI spawn problem, and it's difficult for me to believe the statement "It is apparently non-trivial for the apps to avoid duplicate insertions" without this understanding. But, to me, this seems like applications/middleware just shouldn't be inserting a fabric endpoint address twice ... at least for HPC/MPI anyway. But maybe this duplicate insert scenario can still happen in a data center environment?
From: Hefty, Sean
Sent: Tuesday, March 20, 2018 1:38 PM
To: Blocksome, Michael <michael.blocksome at intel.com>; ofiwg at lists.openfabrics.org
Subject: RE: inserting duplicate addresses into an AV
The failures are related to MPI spawn tests. This happens with Intel MPI, but I suspect MPICH or other MPIs may have similar problems with this test.
> -----Original Message-----
> From: Blocksome, Michael
> Sent: Tuesday, March 20, 2018 11:29 AM
> To: Hefty, Sean <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
> Subject: RE: inserting duplicate addresses into an AV
> Which application, or which MPI, is inserting duplicate addresses? I
> don't see how MPI could be doing this. At least the MPI
> implementations I'm familiar with use PMI1, PMI2, or PMIx to exchange
> addresses at job startup into a distributed key-value store, and then
> after a barrier each MPI rank initializes its av with all these unique
> addresses. For a duplicate address to happen multiple MPI ranks would
> have to get the *same* local address from the OFI provider - how would
> that happen?
> Some providers, like bgq, can stuff all the fabric address information
> within the 64 bits of fi_addr_t, which basically makes the
> fi_av_insert() call a noop in FI_AV_MAP mode. So if this duplicate
> address problem happened on bgq it would still "just work" from the
> provider's perspective. Now MPI (or whatever is using the provider)
> might get messed up because of it, but the fabric communication
> operations would still work.
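
To illustrate the FI_AV_MAP point above, here is a minimal sketch of a provider-style insert that packs the whole address into a 64-bit handle. Everything in it is hypothetical: the `toy_addr` layout, the `toy_map_insert`/`toy_map_lookup` names, and the 32/32 bit split are assumptions for illustration, not the bgq provider's actual encoding. The point is only that no table is touched, so inserting the same address twice just yields the same handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire address: everything needed to reach a peer fits in 64 bits. */
struct toy_addr {
    uint32_t node;      /* hypothetical node identifier */
    uint32_t endpoint;  /* hypothetical endpoint index on that node */
};

/* Stand-in for fi_addr_t, treated here as a plain 64-bit integer. */
typedef uint64_t toy_fi_addr_t;

/* "Insert" in map style: pack the address into the handle itself.
 * No state is stored, so a repeated insert is neither detected nor harmful
 * from the provider's point of view. */
static toy_fi_addr_t toy_map_insert(const struct toy_addr *addr)
{
    assert(addr != NULL);
    return ((toy_fi_addr_t)addr->node << 32) | addr->endpoint;
}

/* Recover the wire address from the handle, e.g. when posting an operation. */
static struct toy_addr toy_map_lookup(toy_fi_addr_t handle)
{
    struct toy_addr addr = { (uint32_t)(handle >> 32), (uint32_t)handle };
    return addr;
}
```

With this kind of encoding the caller gets an identical handle for a duplicate address, so any confusion would be in the middleware's own bookkeeping rather than in the fabric operations.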
> -----Original Message-----
> From: ofiwg [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of
> Hefty, Sean
> Sent: Tuesday, March 20, 2018 11:54 AM
> To: ofiwg at lists.openfabrics.org
> Subject: [ofiwg] inserting duplicate addresses into an AV
> MPI is running into an issue that is the result of inserting the same
> address into an AV more than once. There is no defined behavior for
> what a provider should do in this case. At least one provider allows
> the duplicate insertion, and at least one fails the call... and
> neither works with MPI when this occurs. :/
> There are a couple of problems trying to define this. In the case of
> the provider that fails the call, the failure is detected when
> attempting to insert the same address into a hash table. However, not
> all providers are easily able to detect duplicates. Forcing them to
> do so _may_ require the provider to perform a linear search over the
> AV looking for a duplicate for every address that is inserted. At
> scale, this is a significant overhead.
> Even if the decision is made to force detecting duplicates (maybe even
> making this an AV option), there's the question of how a provider
> should respond. Should it insert the address twice, creating a new
> fi_addr for it; discard the duplicate and return the existing
> fi_addr; or generate an error? And does it matter whether AV_TABLE or
> MAP is used?
> We need to know what applications need here, and how difficult it will
> be for providers to detect duplicates. It is apparently non-trivial
> for the apps to avoid duplicate insertions.
> - Sean
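
The three possible responses discussed above (insert twice, return the existing fi_addr, or fail) can be sketched with a toy table-style AV that detects duplicates through a small hash table instead of a linear search. This is only an illustration under stated assumptions: the `toy_av` structure, the `toy_dup_policy` option, the fixed capacity, and the reduction of an address to a single 64-bit value are all hypothetical and are not part of libfabric or any provider.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_AV_CAPACITY 64          /* max entries in the toy AV */
#define TOY_HASH_SLOTS  128         /* power of two, larger than capacity */
#define TOY_EMPTY       UINT32_MAX  /* marks an unused hash slot */

/* Hypothetical per-AV option selecting how duplicates are handled. */
enum toy_dup_policy {
    TOY_DUP_ALLOW,   /* insert again and hand back a new index */
    TOY_DUP_REUSE,   /* discard the duplicate, hand back the existing index */
    TOY_DUP_ERROR    /* fail the insert */
};

struct toy_av {
    uint64_t addrs[TOY_AV_CAPACITY];  /* index -> raw address */
    uint32_t slots[TOY_HASH_SLOTS];   /* hash slot -> index of first insert */
    size_t count;
    enum toy_dup_policy policy;
};

static void toy_av_init(struct toy_av *av, enum toy_dup_policy policy)
{
    assert(av != NULL);
    av->count = 0;
    av->policy = policy;
    for (size_t i = 0; i < TOY_HASH_SLOTS; i++)
        av->slots[i] = TOY_EMPTY;
}

/* Linear probing: returns the slot that holds addr, or the first empty slot.
 * Always terminates because at most TOY_AV_CAPACITY slots are ever used. */
static size_t toy_find_slot(const struct toy_av *av, uint64_t addr)
{
    /* Multiplicative hash; the top 7 bits select one of 128 slots. */
    size_t s = (size_t)((addr * 0x9E3779B97F4A7C15ull) >> 57);
    while (av->slots[s] != TOY_EMPTY && av->addrs[av->slots[s]] != addr)
        s = (s + 1) & (TOY_HASH_SLOTS - 1);
    return s;
}

/* Returns the index assigned to addr, or -1 on error
 * (AV full, or duplicate under TOY_DUP_ERROR). */
static int toy_av_insert(struct toy_av *av, uint64_t addr)
{
    size_t s = toy_find_slot(av, addr);
    int is_dup = av->slots[s] != TOY_EMPTY;

    if (is_dup && av->policy == TOY_DUP_REUSE)
        return (int)av->slots[s];
    if (is_dup && av->policy == TOY_DUP_ERROR)
        return -1;
    if (av->count == TOY_AV_CAPACITY)
        return -1;

    av->addrs[av->count] = addr;
    if (!is_dup)
        av->slots[s] = (uint32_t)av->count;  /* remember the first insert only */
    return (int)av->count++;
}
```

The hash lookup keeps detection at roughly constant cost per insert, which is the alternative to the linear search mentioned above; a provider that has no such table would have to add one (and pay its memory cost) to offer any of the three policies.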