[ofiwg] inserting duplicate addresses into an AV
Hefty, Sean
sean.hefty at intel.com
Tue Mar 20 12:12:30 PDT 2018
> Again, how can a rank - even if it is yet to be attached to an MPI job
> - get the same fabric endpoint address from its OFI provider as some
> other rank in the system? Is this spawn test doing something crazy
> like attach-detach-attach-detach-etc and a previous address is not
> being removed properly before the next (same) address is inserted
> again?
I have no idea what MPI spawn is doing other than inserting the same address more than once. I was hoping to live quite happily with that ignorance. :)
> I guess I don't understand the intricacies of this MPI spawn problem,
> and it's difficult for me to believe the statement "It is apparently
> non-trivial for the apps to avoid duplicate insertions" without this
> understanding. But, to me, this seems like applications/middleware
> just shouldn't be inserting a fabric endpoint address twice ... at
> least for HPC/MPI anyway. But maybe this duplicate insert scenario can
> still happen in a data center environment?
I asked what it would take for MPI to avoid the duplicate insertion. The response was that MPI would have to store a list of the addresses it has inserted, each mapped to its fi_addr, and look up every address in that list before inserting it into an AV. This spawned (ha) my non-trivial comment.
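
For illustration only, the bookkeeping described above might look roughly like the C sketch below. It is not taken from any MPI implementation or from libfabric. To keep it self-contained, fi_addr_t is redefined locally, RAW_ADDR_LEN is an assumed fixed address size, and stub_av_insert() is a hypothetical stand-in for the real AV insertion call. The names addr_cache and cache_insert are likewise made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for libfabric's fi_addr_t so the sketch builds on its own. */
typedef uint64_t fi_addr_t;

/* Assumed fixed raw-address size; real providers report their own length. */
#define RAW_ADDR_LEN 16

struct addr_entry {
    unsigned char raw[RAW_ADDR_LEN]; /* raw endpoint address bytes */
    fi_addr_t fi_addr;               /* value returned when it was inserted */
};

struct addr_cache {
    struct addr_entry *entries;
    size_t count;
    size_t capacity;
};

/* Hypothetical stub in place of the real AV insert: hands out sequential values. */
static int stub_av_insert(const void *raw, fi_addr_t *out)
{
    static fi_addr_t next_index;
    (void)raw;
    *out = next_index++;
    return 0;
}

/*
 * Insert raw into the AV unless it was inserted before.
 * On success returns 0 and stores the (new or cached) fi_addr in *out.
 * Returns -1 on allocation or insertion failure.
 */
int cache_insert(struct addr_cache *c, const void *raw, fi_addr_t *out)
{
    assert(c && raw && out);

    /* Lookup step: linear scan over every address inserted so far. */
    for (size_t i = 0; i < c->count; i++) {
        if (memcmp(c->entries[i].raw, raw, RAW_ADDR_LEN) == 0) {
            *out = c->entries[i].fi_addr; /* duplicate: reuse the mapping */
            return 0;
        }
    }

    /* Grow the list if it is full. */
    if (c->count == c->capacity) {
        size_t cap = c->capacity ? c->capacity * 2 : 8;
        struct addr_entry *p = realloc(c->entries, cap * sizeof(*p));
        if (!p)
            return -1;
        c->entries = p;
        c->capacity = cap;
    }

    /* First time this address is seen: insert it and remember the mapping. */
    if (stub_av_insert(raw, out) != 0)
        return -1;
    memcpy(c->entries[c->count].raw, raw, RAW_ADDR_LEN);
    c->entries[c->count].fi_addr = *out;
    c->count++;
    return 0;
}
```

A caller would zero-initialize a struct addr_cache, route every insertion through cache_insert(), and free(cache.entries) when done. The linear scan makes each insert cost time proportional to the number of addresses already stored, so a large job would likely want a hash table keyed on the raw address instead. The application also has to carry this extra state for every AV.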
See:
https://github.com/ofiwg/libfabric/pull/3931
Dmitry (copied) may be able to provide more detail on the problem.
- Sean