[openib-general] multicast

Andrew Friedley afriedle at open-mpi.org
Thu Jul 13 09:21:34 PDT 2006


Sean Hefty wrote:
>>I'm concerned about how rdma_cm abstracts HCAs.  It looks like I can use
>>the src_addr argument to rdma_resolve_addr() to select which IP
>>address/HCA (assuming one IP per HCA), but how can I enumerate the
>>available HCAs?
> 
> 
> The HCA / RDMA device abstraction is there for device hotplug, but the verb call
> to enumerate HCAs is still usable if you want to get a list of all HCAs in the
> system.
> 
> You will likely have one IP address per port, rather than per HCA.  You probably
> want to distinguish between locally assigned IP addresses (those given to ipoib
> devices - ib0, etc.) and multicast IP addresses, and verify that your
> multicast routing tables direct traffic out of ipoib IP addresses, rather than
> Ethernet IP addresses.  The IB multicast groups will base their local routing
> the same as the true IP multicast groups.

Yes - I'm actually talking about a separate issue here.  It looks like 
using the RDMA CM for multicast is going to require using it for all of 
my connection management, so I'm looking at what that entails. 
Currently I'm using only ibverbs and Open MPI's runtime environment layer.
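
For reference, enumerating HCAs with plain ibverbs is simple enough; a 
rough sketch (not our actual code, error handling mostly omitted) looks 
like:

/* List every HCA libibverbs knows about, with its name and GUID
 * (GUID printed in network byte order). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devices;
    int i, num_devices = 0;

    devices = ibv_get_device_list(&num_devices);
    if (devices == NULL)
        return 1;

    for (i = 0; i < num_devices; i++)
        printf("HCA %d: %s (GUID 0x%016llx)\n", i,
               ibv_get_device_name(devices[i]),
               (unsigned long long) ibv_get_device_guid(devices[i]));

    ibv_free_device_list(devices);
    return 0;
}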

> 
>>This is important for a number of reasons - one, so that I can pass on
>>the available IP addresses to MPI peers out of band.  It's also
>>important to know which HCAs are available in the system, and to be
>>able to select which HCA to use when connecting to a peer.  This allows
>>us to implement things like load balancing and failover.
> 
> 
> HCA / port selection can be controlled by selecting a specific IP address, and
> you can configure your multicast routing tables to direct traffic out any
> desired port.  You should have the same control over using a specific HCA /
> port; only the type of address used to identify the port changes.
> 
> I might be able to make things a little easier by adding some sort of call that
> identifies all RDMA IP addresses in the system.  You could test for this today
> by calling rdma_bind_addr() on all IP addresses assigned to the system.  This
> doesn't really help with multicast addresses though, since you don't bind to
> them...

That would be very nice - Open MPI already supports portable enumeration 
of IP interfaces (which I could call rdma_bind_addr() on, as you 
suggested), but I think being able to get this directly from the RDMA CM 
is a better general solution.

Right about the multicast addresses - I should have made it clear that I 
was talking about unicast IP addresses.
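
To make sure I follow, the test you're describing is roughly this 
(a hypothetical sketch, error handling omitted): walk the locally 
assigned unicast addresses and see which ones rdma_bind_addr() accepts:

/* Probe each local unicast IP address; the ones that bind and give us
 * a verbs context should be the RDMA-capable interfaces. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <rdma/rdma_cma.h>

static void probe_rdma_addrs(void)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct ifaddrs *ifa_list, *ifa;

    getifaddrs(&ifa_list);
    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        struct rdma_cm_id *id;

        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
            continue;
        if (rdma_bind_addr(id, ifa->ifa_addr) == 0 && id->verbs != NULL)
            printf("%s is RDMA-capable (device %s)\n", ifa->ifa_name,
                   ibv_get_device_name(id->verbs->device));
        rdma_destroy_id(id);
    }
    freeifaddrs(ifa_list);
    rdma_destroy_event_channel(ch);
}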

I understand the RDMA CM is a generic CM intended for other types of 
devices (e.g., iWARP), not just InfiniBand.  Will all of these devices be 
supported under the ibverbs interface?  I'm thinking it would be a 
problem if we pick up interfaces that don't support ibverbs and then try 
to use ibverbs to communicate over them.
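
If non-IB devices do end up visible through ibverbs, I'm assuming we 
could at least tell them apart by transport; I believe the iWARP work 
adds a transport type to struct ibv_device, so a filter would be as 
simple as (hypothetical):

/* Skip devices whose transport we don't handle.  Assumes the
 * transport_type field added to struct ibv_device for iWARP support. */
#include <infiniband/verbs.h>

static int device_is_ib(struct ibv_device *dev)
{
    return dev->transport_type == IBV_TRANSPORT_IB;
}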

> I'm not clear on what you mean about passing available IP addresses to MPI
> peers, or why it's done out of band.  Are you talking about IP addresses of the
> local ipoib devices?  Multicast IP addresses?  By out of band, do you mean over
> a socket, as opposed to an IB connection?

Sorry - I'm talking about IP addresses of the local ipoib devices, or 
whatever sort of addressing structure a particular network uses.  Yes, 
we currently send this information out of band, over TCP.

Our network initialization works like this - we have modules written for 
each type of network (TCP, InfiniBand, GM, etc.).  In the first 
initialization stage for each module, available interfaces are 
enumerated, initialized, and addressing information for each interface 
is made available to our runtime environment layer.  This addressing 
information is exchanged among all peers in the MPI job via TCP (I 
believe we have a framework for supporting other methods, but only TCP 
is currently implemented).  Finally, each network module takes all the 
peer addresses for its network and sets up any necessary data structures 
for communicating with each of those peers.
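
In (very) rough terms, the per-module interface looks something like 
the sketch below - all of the names here are made up for illustration, 
not taken from our actual component interface:

#include <stddef.h>

/* Hypothetical sketch of the three stages described above. */
struct net_module {
    /* Stage 1: find and initialize local interfaces, and publish the
     * addressing info (IP, GID, whatever the network uses) to the
     * runtime environment layer. */
    int (*enumerate_and_init)(void);
    int (*get_local_addresses)(void *buf, size_t *len);

    /* Stage 2 happens in the runtime layer: the addressing blobs are
     * exchanged among all peers in the job (currently over TCP). */

    /* Stage 3: hand the module the addresses its peers published so it
     * can set up per-peer data structures. */
    int (*set_peer_addresses)(int npeers, void **addr_blobs,
                              size_t *lens);
};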

> 
>>Matt Leininger suggested looking at the IB CM as an alternative, as it
>>gives more low-level control.  Am I missing something, or does the IB CM
>>not handle multicast like the RDMA CM?
> 
> 
> IB multicast groups require SA interaction, and are not associated with the IB
> CM.  What control do you feel that the RDMA CM is missing?

At the moment, I'm more concerned about how the RDMA CM API fits with 
Open MPI (I think it will; I just need to re-think our connection 
management).  Looking further ahead, one thing that comes to mind is 
control over dynamic/multipath routing.
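
For completeness, here's my rough understanding of what a multicast 
join through the RDMA CM would look like - a sketch based on this 
discussion rather than working code, assuming rdma_join_multicast() 
and a multicast join event:

/* Bind to a local ipoib address to pick the HCA/port, then ask the CM
 * to join the group; the SA interaction happens underneath.  Error
 * handling omitted. */
#include <rdma/rdma_cma.h>

static int join_group(struct rdma_event_channel *ch,
                      struct sockaddr *local_ipoib_addr,
                      struct sockaddr *mcast_addr)
{
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;
    int joined;

    rdma_create_id(ch, &id, NULL, RDMA_PS_UDP);
    rdma_bind_addr(id, local_ipoib_addr);
    rdma_join_multicast(id, mcast_addr, NULL);

    /* The join-completed event carries the UD parameters (QPN, QKey,
     * AH attributes) needed to send to the group. */
    rdma_get_cm_event(ch, &event);
    joined = (event->event == RDMA_CM_EVENT_MULTICAST_JOIN);
    rdma_ack_cm_event(event);
    return joined ? 0 : -1;
}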

Andrew



