[openib-general] Re: RFC userspace / MPI multicast support

Hal Rosenstock halr at voltaire.com
Thu Apr 20 07:20:55 PDT 2006


Hi Amith,

On Wed, 2006-04-19 at 19:39, amith rajith mamidala wrote:
> Hi Sean,
> 
> I have a few basic questions:
> 
> 1. Does the API which waits for the join to complete ensure that the
> multicast forwarding tables in the switches have been updated?

This is not an API issue. The IB spec (architecture) allows for lazy
joining; I can cite the compliance statement if needed. Beyond that, any
multicast sending (not just IB) is unreliable, and an application that
cares needs to deal with lost transmissions itself. Isn't this just
another case of that?
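
For example, a sender that cares can simply retransmit on the group until
every peer has confirmed receipt over a reliable path, which also absorbs
any lag in the switch forwarding-table update after the join. A rough
sketch only, using the rdma_sendto() call proposed below; collect_acks(),
MAX_PROBE_ATTEMPTS, and PROBE_TIMEOUT_MS are hypothetical application-side
names, not part of any library:

/*
 * Sketch: retransmit a multicast probe until all peers ack it over a
 * reliable (unicast) channel.  rdma_sendto() is the proposed API;
 * collect_acks() is a hypothetical helper returning how many peers
 * have acked so far within the given timeout.
 */
static int mcast_send_until_acked(struct rdma_cm_id *id,
				  struct ibv_send_wr *wr,
				  struct sockaddr *mcast_addr,
				  int npeers)
{
	int i;

	for (i = 0; i < MAX_PROBE_ATTEMPTS; i++) {
		if (rdma_sendto(id, wr, mcast_addr))
			return -1;
		/* Peers that missed the datagram simply wait for the
		 * next retransmission. */
		if (collect_acks(PROBE_TIMEOUT_MS) >= npeers)
			return 0;
	}
	return -1;	/* give up and fall back to a unicast algorithm */
}
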

> This is one of the main problems that we studied (please refer to the
> following EURO PVM/MPI paper for details):
> http://www.cse.ohio-state.edu/~mamidala/europvm.pdf

Can you summarize the issue that this causes? I will look at the paper
but this may take a little while.

-- Hal

> > /* Wait for join to complete. */
> > rdma_get_cm_event(&event);
> > if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE)
> > 	/* join worked - we could call rdma_get_option() here */
> > 	/* The rdma_cm attached the QP to the multicast group for us. */
> >
> > rdma_ack_cm_event(event);
> 
> 2. I am not clear on how to access the QP associated with the cm_id for
> multicast. This includes posting receive descriptors, etc.
> 
> 3. If a multicast address is already in use by an application running on
> the cluster and another request is made by a different application for
> the same multicast address, does this generate an error? From the API, it
> looks like the application has to manage this aspect itself.
> 
> 
> Thanks,
> Amith
> 
> 
> On Wed, 19 Apr 2006, Sean Hefty wrote:
> 
> > I'd like to get some feedback regarding the following approach to supporting
> > multicast groups in userspace, and in particular for MPI.  Based on side
> > conversations, I need to know if this approach would meet the needs of MPI
> > developers.
> >
> > To join / leave a multicast group, my proposal is to add the following APIs to
> > the rdma_cm.  (Note I haven't implemented this yet, so I'm just assuming that
> > it's possible at this point.)
> >
> > /* Asynchronously join a multicast group. */
> > int rdma_set_option(struct rdma_cm_id *id, int level, int optname,
> > 			  void *optval, size_t optlen);
> >
> > /* Retrieve multicast group information - not usually called. */
> > int rdma_get_option(struct rdma_cm_id *id, int level, int optname,
> > 			  void *optval, size_t optlen);
> >
> > /*
> >  * Post a message on the QP associated with the cm_id for the
> >  * specified multicast address.
> > */
> > int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr,
> > 		    struct sockaddr *to);
> >
> > ---
> >
> > As an example of how these APIs would be used:
> >
> > /* The cm_id provides event handling and context. */
> > rdma_create_id(&id, context);
> >
> > /* Bind to a local interface to attach to a local device. */
> > rdma_bind_addr(id, local_addr);
> >
> > /* Allocate a PD, CQs, etc. */
> > pd = ibv_alloc_pd(id->verbs);
> > ..
> >
> > /*
> >  * Create a UD QP associated with the cm_id.
> >  * TBD: automatically transition the QP to RTS for UD QP types?
> >  */
> > rdma_create_qp(id, pd, init_attr);
> >
> > /* Bind to multicast group. */
> > mcast_ip = 224.0.74.71; /* some fine mcast addr */
> > ip_mreq.imr_multiaddr = mcast_ip.in_addr;
> > rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq,
> > 		    sizeof(ip_mreq));
> >
> > /*
> >  * Format a send wr.  The ah, remote_qpn, and remote_qkey are
> >  * filled out by the rdma_cm based on the provided destination
> >  * address.
> >  */
> > rdma_sendto(id, send_wr, &mcast_ip);
> >
> > ---
> >
> > The multicast group information is created / managed by the rdma_cm.  The
> > rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate.
> > Except for mgid, these would most likely match the values used by the ipoib
> > broadcast group.  The mgid mapping would be similar to that used by ipoib.  The
> > actual MCMember record would be available to the user by calling
> > rdma_get_option.
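
To make the "available by calling rdma_get_option" part concrete, a
hypothetical sketch only: IB_GET_MEMBERSHIP below is an invented optname,
and setup_ud_dest() is an invented application helper; the proposal itself
only says the record would be retrievable this way.

struct ib_sa_mcmember_rec rec;

if (!rdma_get_option(id, RDMA_PROTO_IB, IB_GET_MEMBERSHIP,
		     &rec, sizeof(rec))) {
	/* e.g. reuse the SA-assigned qkey, mlid, and mtu at the sender */
	setup_ud_dest(&rec);	/* hypothetical application helper */
}
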
> >
> > I don't believe that there would be any restriction on the use of the QP that is
> > attached to the multicast group, but it would take more work to support more
> > than one multicast group per QP.  The purpose of the rdma_sendto() routine is to
> > map a given IP address to an allocated address handle and Qkey.  At this point,
> > rdma_sendto would only work for multicast addresses that have been joined by the
> > user.
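
On the receive side, assuming rdma_create_qp() leaves the UD QP reachable
as id->qp the way the existing rdma_cm does for connected QPs, posting
receives is plain verbs; the only multicast-specific detail is leaving 40
bytes of headroom for the GRH carried by every UD datagram. A sketch, where
recv_buf, MSG_SIZE, and mr come from the application's own registration
code:

struct ibv_sge sge = {
	.addr	= (uintptr_t) recv_buf,			/* registered memory */
	.length	= MSG_SIZE + sizeof(struct ibv_grh),	/* 40-byte GRH headroom */
	.lkey	= mr->lkey,
};
struct ibv_recv_wr wr = {
	.wr_id	 = (uintptr_t) recv_buf,
	.sg_list = &sge,
	.num_sge = 1,
};
struct ibv_recv_wr *bad_wr;

if (ibv_post_recv(id->qp, &wr, &bad_wr))	/* assumes id->qp is exposed */
	fprintf(stderr, "ibv_post_recv failed\n");
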
> >
> > If a user wanted more control over the multicast group, we could support a call
> > such as:
> >
> > struct ib_mreq {
> > 	struct ib_sa_mcmember_rec	rec;
> > 	ib_sa_comp_mask			comp_mask;
> > };
> >
> > rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq,
> > 		    sizeof(ib_mreq));
> >
> > Thoughts?
> >
> > - Sean
> >
> 
> 



