[openib-general] FW: [PATCH] [RFC] librdmacm: expose device list to users

Sean Hefty sean.hefty at intel.com
Tue Jul 25 14:54:02 PDT 2006


Just returning this thread to the list... I accidentally took it off-list
by replying directly.

- Sean

>Sean Hefty wrote:
>>>>Are you wanting to dynamically determine that address?
>>>
>>>Yes - for one, I don't want to concern the MPI user with that sort of
>>>detail.  Also, I imagine a single MPI job using many different multicast
>>>groups simultaneously.
>>>
>>>What I figured I would do is have a range of multicast addresses I could
>>>select from (which WOULD be configurable by users, but have good
>>>defaults), then successively select multicast addresses until a free one
>>>is found.
>>
>>
>> There is an interface in the kernel that could be used to determine if a
>> port has already joined a multicast group.  It may be possible to expose
>> something like this to userspace, but how will you ensure that all
>> related MPI jobs join the correct group?  Maybe it would help me to
>> understand how MPI uses IP multicast groups today.
>
>Apologies for the long email...
>
>Determining whether a port has already joined a multicast group... I'm
>not sure that's what I'm after.  What I'm thinking is that one
>predefined group of processes is communicating via a multicast group.
>Another group of processes starts up and wants to communicate via
>multicast as well - what is to stop this second group from using the
>same multicast address, and erroneously sending messages to the first
>group (and vice versa)?  Note that these groups of processes may or may
>not be running on the same ports, though they may still be on the same
>network.
>
>If we don't have some way to prevent different MPI jobs on a network
>from using the same multicast address, multicast is not going to be very
>useful.
>
>
>MPI has used very little multicast so far - the MVAPICH group has a
>couple of papers on using multicast for collective operations, but I
>don't know of any other MPIs that have used it.  I'm looking to do
>similar work in Open MPI, and hopefully take it further and do it
>better.  In other words, I don't know for sure how MPI will use
>multicast yet :)
>
>It's not difficult to know who should/shouldn't be part of a particular
>multicast group.  For example, all of the members of a particular
>communicator might be in a single multicast group.  We always know which
>peers are in a particular communicator, so we know which peers should be
>in a multicast group together.  Coming up with an address for a
>particular group is a little more difficult - the address would have to
>be chosen by one peer (I'd just pick rank 0 or something), then
>communicated OOB to the other peers in the communicator.
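>
>As a rough sketch of that OOB step (the 16-byte address format here is
>an arbitrary assumption, and the address-selection call is left
>abstract):
>
>    #include <mpi.h>
>
>    /* Whatever form the address takes, rank 0 picks it and the
>       communicator itself serves as the OOB channel. */
>    int distribute_mcast_address(MPI_Comm comm, unsigned char addr[16])
>    {
>        int rank;
>
>        MPI_Comm_rank(comm, &rank);
>        if (rank == 0) {
>            /* ... choose an unused multicast address into addr ... */
>        }
>        return MPI_Bcast(addr, 16, MPI_BYTE, 0, comm);
>    }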
>
>An example of where this mapping of communicator to multicast group
>might not hold would be at scale.  I have no idea how far IB multicast
>scales yet - but if it doesn't scale as far as MPI does, multiple
>groups would probably be needed to span the communicator.  It may also
>be beneficial for performance to use many smaller groups in a tree-like
>fashion instead of one large group.
>
>
>One solution (suggested by Matt Leininger) would be for the IB stack
>(CM?) to hand out multicast addresses.  I'm thinking it would be useful
>to come up with a header file for a solution before implementing
>anything.  An 'ideal' API might look something like this:
>
>ObtainMulticastAddress
>  Returns a multicast address guaranteed not to be in use on the network
>
>JoinMulticastGroup (aka connect)
>  Requires a multicast address specifying which multicast group to join
>
>LeaveMulticastGroup (aka disconnect)
>  Requires a multicast address specifying which multicast group to leave
>
>ReturnMulticastAddress (inform IB that an address may be reused)
>  Requires a multicast address
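>
>In C, a first cut at such a header might look like this (every name and
>type below is a strawman, not an existing interface):
>
>    /* strawman multicast address allocation API */
>
>    struct mcast_addr {
>        unsigned char raw[16];  /* opaque - need not be an IP address */
>    };
>
>    /* reserve an address guaranteed not to be in use on the network */
>    int obtain_multicast_address(struct mcast_addr *addr);
>
>    /* join/leave the group identified by addr */
>    int join_multicast_group(struct mcast_addr *addr);
>    int leave_multicast_group(struct mcast_addr *addr);
>
>    /* tell the stack that addr may be reused */
>    int return_multicast_address(struct mcast_addr *addr);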
>
>I would call ObtainMulticastAddress, then pass the returned address
>over the OOB channel to all the peers I want in that multicast group.
>All the peers would then call JoinMulticastGroup.
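>
>With the strawman declarations above, each peer's setup would boil down
>to something like:
>
>    struct mcast_addr addr;
>
>    if (rank == 0) {
>        obtain_multicast_address(&addr);
>        /* ... send addr OOB to the other peers ... */
>    } else {
>        /* ... receive addr OOB from rank 0 ... */
>    }
>    join_multicast_group(&addr);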
>
>ReturnMulticastAddress could require that all calls to
>JoinMulticastGroup on the provided address have been paired with
>matching LeaveMulticastGroup calls.  What would be better, though, is
>if the group were released asynchronously when the last peer in the
>group calls LeaveMulticastGroup.  That would avoid the need to fence
>explicitly between all the LeaveMulticastGroup calls and one peer
>calling ReturnMulticastAddress.
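>
>To make the fencing point concrete: under the strict pairing rule,
>teardown would need a barrier so that no peer releases the address
>while others are still joined; release-on-last-leave would drop the
>barrier:
>
>    leave_multicast_group(&addr);
>    MPI_Barrier(comm);                /* fence: every peer has left */
>    if (rank == 0)
>        return_multicast_address(&addr);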
>
>The idea behind ObtainMulticastAddress is that I don't care what the
>particular address is - just that I have an address and nobody else is
>using it.  The address can take any form - it does not have to be an
>IP address.
>
>It's important that everything be non-blocking.  If needed, operations
>can be done asynchronously (i.e., completion signalled via an event
>queue).
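>
>For example, a join might return immediately and signal completion
>later through an event queue (again, strawman names):
>
>    struct mcast_event ev;
>
>    join_multicast_group(&addr);       /* initiates the join, returns */
>    do {
>        get_multicast_event(&ev);      /* pull completions off the queue */
>    } while (ev.type != MCAST_EVENT_JOIN_DONE);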
>
>Is this sort of approach even feasible?
>
>Andrew



