[openib-general] RFC userspace / MPI multicast support

Sean Hefty sean.hefty at intel.com
Wed Apr 19 12:05:50 PDT 2006


I'd like to get some feedback regarding the following approach to supporting
multicast groups in userspace, and in particular for MPI.  Based on side
conversations, I need to know if this approach would meet the needs of MPI
developers.

To join / leave a multicast group, my proposal is to add the following APIs to
the rdma_cm.  (Note I haven't implemented this yet, so I'm just assuming that
it's possible at this point.)

/* Asynchronously join a multicast group. */
int rdma_set_option(struct rdma_cm_id *id, int level, int optname,
			  void *optval, size_t optlen);

/* Retrieve multicast group information - not usually called. */
int rdma_get_option(struct rdma_cm_id *id, int level, int optname,
			  void *optval, size_t optlen);

/*
 * Post a message on the QP associated with the cm_id for the
 * specified multicast address.
 */
int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr,
		    struct sockaddr *to);

---

As an example of how these APIs would be used:

/* The cm_id provides event handling and context. */
rdma_create_id(&id, context);

/* Bind to a local interface to attach to a local device. */
rdma_bind_addr(id, local_addr);

/* Allocate a PD, CQs, etc. */
pd = ibv_alloc_pd(id->verbs);
...

/*
 * Create a UD QP associated with the cm_id.
 * TBD: automatically transition the QP to RTS for UD QP types?
 */
rdma_create_qp(id, pd, init_attr);

/* Bind to multicast group. */
inet_pton(AF_INET, "224.0.0.74", &mcast_ip.sin_addr); /* some fine mcast addr */
ip_mreq.imr_multiaddr = mcast_ip.sin_addr;
rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq,
		    sizeof(ip_mreq));

/* Wait for join to complete. */
rdma_get_cm_event(&event);
if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE)
	/* join worked - we could call rdma_get_option() here */
	/* The rdma_cm attached the QP to the multicast group for us. */
...
rdma_ack_cm_event(event);

/*
 * Format a send wr.  The ah, remote_qpn, and remote_qkey are
 * filled out by the rdma_cm based on the provided destination
 * address.
 */
rdma_sendto(id, send_wr, &mcast_ip);

---

The multicast group information is created / managed by the rdma_cm.  The
rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate.
Except for mgid, these would most likely match the values used by the ipoib
broadcast group.  The mgid mapping would be similar to that used by ipoib.  The
actual MCMember record would be available to the user by calling
rdma_get_option.
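As a sketch of what that mapping could look like, here is the IPoIB-style
IPv4-to-MGID layout (per RFC 4391): the low 28 bits of the group address go
into the low bits of the MGID, with the 0x401b IPv4 signature and the P_Key
in the upper bytes.  Whether the rdma_cm would use exactly this layout is an
assumption on my part:

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch of an IPoIB-style IPv4-to-MGID mapping (RFC 4391).
 * Layout: ff12:401b:<pkey>:0000:0000:0000:<low 28 bits of group>
 */
static void map_ipv4_to_mgid(uint32_t group, uint16_t pkey, uint8_t mgid[16])
{
	memset(mgid, 0, 16);
	mgid[0] = 0xff;				/* multicast prefix */
	mgid[1] = 0x12;				/* transient, link-local scope */
	mgid[2] = 0x40;				/* IPv4 signature: 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = pkey >> 8;
	mgid[5] = pkey & 0xff;
	mgid[12] = (group >> 24) & 0x0f;	/* only low 28 bits map */
	mgid[13] = (group >> 16) & 0xff;
	mgid[14] = (group >> 8) & 0xff;
	mgid[15] = group & 0xff;
}
```

For 224.0.0.74 with the default partition, this yields the same MGID that
ipoib itself would join for that group.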

I don't believe that there would be any restriction on the use of the QP that is
attached to the multicast group, but it would take more work to support more
than one multicast group per QP.  The purpose of the rdma_sendto() routine is to
map a given IP address to an allocated address handle and Qkey.  At this point,
rdma_sendto would only work for multicast addresses that have been joined by the
user.
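Internally, rdma_sendto() would amount to looking up the destination address
in a per-id table of joined groups to recover the {ah, remote_qpn, remote_qkey}
triple cached at join time, then posting the wr normally.  A sketch with
stand-in types (the real code would hold a struct ibv_ah and fill the
ibv_send_wr UD fields; the names below are illustration only):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ibv_ah; illustration only. */
struct ah { int dummy; };

/* One entry per joined group, cached by the rdma_cm at join time. */
struct mcast_dest {
	uint32_t	group;		/* IPv4 group address, host order */
	struct ah      *ah;		/* AH built from the MCMember record */
	uint32_t	remote_qpn;	/* 0xffffff for multicast */
	uint32_t	remote_qkey;	/* Q_Key from the MCMember record */
};

/* Resolve a destination group to its cached UD addressing info. */
static struct mcast_dest *resolve_dest(struct mcast_dest *tbl, int n,
				       uint32_t group)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].group == group)
			return &tbl[i];
	return NULL;	/* rdma_sendto would fail: group not joined */
}
```

Note that remote_qpn for a multicast send is always the multicast QPN
(0xffffff), so the lookup really only needs to produce the AH and Q_Key.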

If a user wanted more control over the multicast group, we could support a call
such as:

struct ib_mreq {
	struct ib_sa_mcmember_rec	rec;
	ib_sa_comp_mask			comp_mask;
};

rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq,
		    sizeof(ib_mreq));
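For example, a caller could request a specific Q_Key and a full-member join
while leaving every other field for the rdma_cm / SA to fill in.  The
definitions below are stand-ins so the fragment is self-contained, and the
comp_mask bit names and positions are assumptions modeled on the kernel's SA
component-mask definitions:

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the SA types; bit names/positions are assumptions. */
typedef uint64_t ib_sa_comp_mask;
#define IB_SA_MCMEMBER_REC_QKEY		(1ULL << 2)
#define IB_SA_MCMEMBER_REC_JOIN_STATE	(1ULL << 16)

struct ib_sa_mcmember_rec {
	uint32_t	qkey;
	uint8_t		join_state;
	/* ... remaining MCMemberRecord fields ... */
};

struct ib_mreq {
	struct ib_sa_mcmember_rec	rec;
	ib_sa_comp_mask			comp_mask;
};

/*
 * Request a full-member join with a caller-chosen Q_Key; anything not
 * named in comp_mask is left to the rdma_cm / SA defaults.
 */
static void init_full_member_join(struct ib_mreq *mreq, uint32_t qkey)
{
	memset(mreq, 0, sizeof(*mreq));
	mreq->rec.qkey = qkey;
	mreq->rec.join_state = 1;	/* full member */
	mreq->comp_mask = IB_SA_MCMEMBER_REC_QKEY |
			  IB_SA_MCMEMBER_REC_JOIN_STATE;
}
```

The filled-out ib_mreq would then be passed to rdma_set_option() as shown
above.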

Thoughts?

- Sean


