[openib-general] RDMA CM multicast

Andrew Friedley afriedle at open-mpi.org
Fri Jan 26 19:47:00 PST 2007


> Once this routing is in place, the only thing they need is to enhance
> the MPI job starter/etc to allocate to each job (say) two unique
> multicast --IP-- addresses on the relevant subnet and provide these IP
> addresses to each rank. Now the rank can use the RDMA CM without any
> hack.

I don't think this is as easy as you've made it sound.  I see two 
approaches to preventing address collision -- both require voluntary 
participation.  First is a centralized authority approach (this has been 
used for IP multicast-based protocols).  This means running some sort of 
daemon in a location all peers can communicate with.  I'm not really 
keen on the idea of requiring a separate daemon just to support 
multicast in Open MPI.  Second is a peer-to-peer approach.  This is 
doable, but difficult due to numerous race conditions.  It's also 
highly desirable to minimize the time cost of joining a multicast 
group; this is especially difficult with a peer-to-peer solution.
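
To make the centralized-authority approach concrete, here is a minimal 
sketch in Python.  It is not part of Open MPI or the RDMA CM: the class 
name MulticastAllocator, the 239.1.0.0 base address (taken from the 
administratively scoped IPv4 multicast range) and the pool size are all 
assumptions chosen for illustration.  A real daemon would also need a 
network protocol and some way to reclaim addresses from crashed jobs.

```python
import ipaddress
import threading


class MulticastAllocator:
    """Hypothetical central authority that hands out unique multicast addresses."""

    def __init__(self, first="239.1.0.0", count=1024):
        # Base address and pool size are illustrative assumptions.
        self._base = int(ipaddress.IPv4Address(first))
        # Reverse order so that pop() yields the lowest free offset first.
        self._free = list(range(count - 1, -1, -1))
        self._owner = {}  # offset -> job id currently holding the address
        self._lock = threading.Lock()

    def allocate(self, job_id):
        """Return an address no other job currently holds."""
        with self._lock:
            if not self._free:
                raise RuntimeError("multicast address pool exhausted")
            offset = self._free.pop()
            self._owner[offset] = job_id
            return str(ipaddress.IPv4Address(self._base + offset))

    def release(self, address):
        """Return an address to the pool; unknown addresses are ignored."""
        offset = int(ipaddress.IPv4Address(address)) - self._base
        with self._lock:
            if self._owner.pop(offset, None) is not None:
                self._free.append(offset)
```

Because every allocation goes through one lock in one process, 
collisions cannot happen, but every peer must be able to reach that 
process, which is exactly the deployment cost described above.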

Also, I'd rather not assume a single MPI job requires a constant (small) 
number of multicast groups/addresses.  The obvious mapping is to 
use one multicast group per MPI communicator.  Most applications will 
use only a few, though some may use hundreds, and may even vary the 
number in use as the app executes.  I've also been considering 
approaches utilizing many groups per communicator, so again we could be 
looking at hundreds of multicast groups per MPI job.
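
As a rough illustration of why a fixed, small allocation per job does 
not fit, the sketch below tracks multicast groups per communicator as 
communicators are created and freed.  The names CommunicatorGroups, 
create, free and in_use are hypothetical, and the allocate/release 
callables stand in for whatever mechanism actually supplies unique 
addresses.

```python
class CommunicatorGroups:
    """Hypothetical bookkeeping: multicast groups held by each communicator."""

    def __init__(self, allocate, release):
        # allocate() returns a fresh group address; release(addr) gives it back.
        self._allocate = allocate
        self._release = release
        self._groups = {}  # communicator id -> list of group addresses

    def create(self, comm_id, ngroups=1):
        """Acquire ngroups addresses for a newly created communicator."""
        if comm_id in self._groups:
            raise KeyError(f"communicator {comm_id!r} already has groups")
        self._groups[comm_id] = [self._allocate() for _ in range(ngroups)]
        return list(self._groups[comm_id])

    def free(self, comm_id):
        """Release every address held by a communicator that is being freed."""
        for addr in self._groups.pop(comm_id):
            self._release(addr)

    def in_use(self):
        """Total number of groups the job holds right now."""
        return sum(len(addrs) for addrs in self._groups.values())
```

The number returned by in_use() rises and falls over the life of the 
job, so the job starter cannot know the required count up front.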

As I've said, implementing solutions at the MPI level is doable but 
difficult.  I knew from earlier discussions that IB is able to allocate 
new, unused multicast addresses and was hoping to expose that 
functionality and avoid the multicast address allocation problem.  
However, I hadn't thought of the fact that other networks supported by 
the RDMA CM might not have similar functionality, so this might not be 
appropriate there.  But maybe it is worth considering how hard it is 
for those other networks to provide the functionality?

Andrew



