[ofiwg] [libfabric-users] feature requests

Jeff Squyres (jsquyres) jsquyres at cisco.com
Tue Jun 6 17:25:18 PDT 2017


On Jun 6, 2017, at 7:36 PM, Hefty, Sean <sean.hefty at intel.com> wrote:
> 
> I guess the first thing to figure out is how addressing works when multi-rail is in use.  Would we need some sort of super-address that's a union of the underlying fabric addresses?

The Open MPI answer to this kind of question is (for each network type):

- discover all local network addresses
- use one network address -- any network address -- to identify the remote machine
- use that address to connect to the peer and discover *all* of its network addresses
- then run those discovered addresses through user-supplied filters (in an "inclusionary" or "exclusionary" way) to factor in user preferences (e.g., "don't use network X" or "only use networks A and B")
- of the network addresses (i.e., endpoints) that remain, figure out which is reachable from which (preferably at the lowest cost)
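
To make the last two steps concrete, here is a minimal Python sketch. It is not Open MPI's actual code: it assumes the addresses are IPv4 strings, and the function names (apply_filters, pair_endpoints, same_subnet_cost) and the "same /24 means reachable" rule are hypothetical stand-ins for whatever the underlying network really offers.

```python
from ipaddress import ip_address, ip_network

def apply_filters(addrs, include=None, exclude=None):
    """Keep addresses inside any `include` network (if given) and outside every `exclude` network."""
    kept = []
    for a in addrs:
        ip = ip_address(a)
        if include is not None and not any(ip in ip_network(n) for n in include):
            continue
        if exclude and any(ip in ip_network(n) for n in exclude):
            continue
        kept.append(a)
    return kept

def pair_endpoints(local, remote, cost):
    """Return (local, remote, cost) triples for reachable pairs, cheapest first.

    cost(l, r) returns a number, or None if r is not reachable from l.
    """
    pairs = []
    for l in local:
        for r in remote:
            c = cost(l, r)
            if c is not None:
                pairs.append((l, r, c))
    return sorted(pairs, key=lambda p: p[2])

def same_subnet_cost(l, r, prefix=24):
    # Hypothetical heuristic: reachable (cost 1) only when both sit in the same subnet.
    net = ip_network(f"{l}/{prefix}", strict=False)
    return 1 if ip_address(r) in net else None

local = ["10.0.0.1", "192.168.1.1"]
remote = ["10.0.0.2", "192.168.1.2", "172.16.0.2"]
usable = apply_filters(remote, exclude=["172.16.0.0/12"])  # "don't use network X"
pairs = pair_endpoints(local, usable, same_subnet_cost)
```

The cost function is the part that varies per network type; a real one might probe with a timeout instead of comparing subnets.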

Note: this is not rocket science, but it's not easy, either.  It also depends on the underlying network's capability to determine reachability.  Some offer heuristics, some offer timeouts, and so on.

The key here is: don't try to amalgamate and have "super" network addresses.  Just use any given network address to identify the remote *machine*, and then start going through the mechanics of dealing with multiple network addresses.
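
As a rough illustration of that separation, here is a Python sketch in which a peer is named by whichever single address was seen first, and its full address list is kept alongside rather than merged into a combined address. The Peer and PeerTable names are hypothetical and not taken from Open MPI or libfabric.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    ident: str                                 # any one address; only names the machine
    addrs: list = field(default_factory=list)  # every address discovered later

class PeerTable:
    def __init__(self):
        self._by_addr = {}

    def add(self, ident):
        """Register a peer under the one address we happen to know."""
        peer = Peer(ident)
        self._by_addr[ident] = peer
        return peer

    def learn(self, peer, addrs):
        """Record addresses discovered from the peer; each one now resolves to it."""
        for a in addrs:
            if a not in peer.addrs:
                peer.addrs.append(a)
            self._by_addr[a] = peer

    def lookup(self, addr):
        return self._by_addr.get(addr)

table = PeerTable()
peer = table.add("10.0.0.2")
table.learn(peer, ["10.0.0.2", "192.168.1.2"])
```

Any of the learned addresses then resolves to the same machine record, while the per-address mechanics (filtering, reachability) still operate on the individual entries in addrs.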

-- 
Jeff Squyres
jsquyres at cisco.com



