[openib-general] Re: RFC userspace / MPI multicast support
amith rajith mamidala
mamidala at cse.ohio-state.edu
Wed Apr 19 16:39:27 PDT 2006
Hi Sean,
I have a few basic questions:
1. Does the API that waits for the join to complete also
ensure that the multicast forwarding tables in the switches have been
updated? This is one of the main problems that we studied
(please refer to the following EURO PVM/MPI paper for details):
http://www.cse.ohio-state.edu/~mamidala/europvm.pdf
> /* Wait for join to complete. */
> rdma_get_cm_event(&event);
> if (event->event == RDMA_CM_EVENT_JOIN_COMPLETE)
> /* join worked - we could call rdma_get_option() here */
> /* The rdma_cm attached the QP to the multicast group for us. */
>
> rdma_ack_cm_event(event);
2. I am not clear on how to access the QP associated with the cm_id for
multicast, for example in order to post receive descriptors on it.
3. If a multicast address is already used by an application running on
the cluster and another request is made by a different application with
the same multicast address, does this generate an error? From the API, it
looks like the application has to manage this aspect.
Thanks,
Amith
On Wed, 19 Apr 2006, Sean Hefty wrote:
> I'd like to get some feedback regarding the following approach to supporting
> multicast groups in userspace, and in particular for MPI. Based on side
> conversations, I need to know if this approach would meet the needs of MPI
> developers.
>
> To join / leave a multicast group, my proposal is to add the following APIs to
> the rdma_cm. (Note I haven't implemented this yet, so I'm just assuming that
> it's possible at this point.)
>
> /* Asynchronously join a multicast group. */
> int rdma_set_option(struct rdma_cm_id *id, int level, int optname,
> void *optval, size_t optlen);
>
> /* Retrieve multicast group information - not usually called. */
> int rdma_get_option(struct rdma_cm_id *id, int level, int optname,
> void *optval, size_t optlen);
>
> /*
> * Post a message on the QP associated with the cm_id for the
> * specified multicast address.
> */
> int rdma_sendto(struct rdma_cm_id *id, struct ibv_send_wr *send_wr,
> struct sockaddr *to);
>
> ---
>
> As an example of how these APIs would be used:
>
> /* The cm_id provides event handling and context. */
> rdma_create_id(&id, context);
>
> /* Bind to a local interface to attach to a local device. */
> rdma_bind_addr(id, local_addr);
>
> /* Allocate a PD, CQs, etc. */
> pd = ibv_alloc_pd(id->verbs);
> ..
>
> /*
> * Create a UD QP associated with the cm_id.
> * TBD: automatically transition the QP to RTS for UD QP types?
> */
> rdma_create_qp(id, pd, init_attr);
>
> /* Bind to multicast group. */
> mcast_ip = 224.0.0.74.71; /* some fine mcast addr */
> ip_mreq.imr_multiaddr = mcast_ip.in_addr;
> rdma_set_option(id, RDMA_PROTO_IP, IP_ADD_MEMBERSHIP, &ip_mreq,
> sizeof(ip_mreq));
>
> /*
> * Format a send wr. The ah, remote_qpn, and remote_qkey are
> * filled out by the rdma_cm based on the provided destination
> * address.
> */
> rdma_sendto(id, send_wr, &mcast_ip);
>
> ---
>
> The multicast group information is created / managed by the rdma_cm. The
> rdma_cm defines the mgid, q_key, p_key, sl, flowlabel, tclass, and joinstate.
> Except for mgid, these would most likely match the values used by the ipoib
> broadcast group. The mgid mapping would be similar to that used by ipoib. The
> actual MCMember record would be available to the user by calling
> rdma_get_option.
>
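To make the "similar to that used by ipoib" remark concrete, here is a minimal sketch of how IPoIB derives an MGID from an IPv4 multicast address, as I understand RFC 4391. The function name ipv4_mcast_to_mgid and its parameters are hypothetical; the mapping the rdma_cm would actually use is not defined in the proposal and may differ.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build an IPoIB-style MGID for an IPv4 group.
 * Layout assumed from RFC 4391:
 *   ff | flags=1, scope | 0x401b (IPv4 signature) | P_Key | zeros | low 28 bits of IP
 * ipv4 is the group address in host byte order, e.g. 0xE0000001 for 224.0.0.1.
 * scope is the 4-bit multicast scope (0x2 is link-local).
 */
void ipv4_mcast_to_mgid(uint32_t ipv4, uint16_t pkey, uint8_t scope,
                        uint8_t mgid[16])
{
    memset(mgid, 0, 16);
    mgid[0]  = 0xff;                            /* multicast GID prefix */
    mgid[1]  = (uint8_t)(0x10 | (scope & 0x0f)); /* flags = 1, then scope */
    mgid[2]  = 0x40;                            /* IPv4 signature, high byte */
    mgid[3]  = 0x1b;                            /* IPv4 signature, low byte */
    mgid[4]  = (uint8_t)(pkey >> 8);            /* partition key */
    mgid[5]  = (uint8_t)(pkey & 0xff);
    mgid[12] = (uint8_t)((ipv4 >> 24) & 0x0f);  /* only 28 address bits are kept */
    mgid[13] = (uint8_t)((ipv4 >> 16) & 0xff);
    mgid[14] = (uint8_t)((ipv4 >> 8) & 0xff);
    mgid[15] = (uint8_t)(ipv4 & 0xff);
}
```

Because only the low 28 bits of the address enter the MGID and the P_Key is part of it, two applications that pick the same IP multicast address on the same partition would end up in the same IB group under this scheme.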
> I don't believe that there would be any restriction on the use of the QP that is
> attached to the multicast group, but it would take more work to support more
> than one multicast group per QP. The purpose of the rdma_sendto() routine is to
> map a given IP address to an allocated address handle and Qkey. At this point,
> rdma_sendto would only work for multicast addresses that have been joined by the
> user.
>
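The lookup that rdma_sendto() is described as doing could be sketched roughly as below. This is only an illustration under my own assumptions: struct joined_group, struct mcast_table, and both functions are hypothetical names, the address handle is held as an opaque pointer standing in for a struct ibv_ah pointer, and the table size is arbitrary. None of this is part of the proposed rdma_cm API.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_JOINED 8  /* arbitrary limit for this sketch */

/* Hypothetical record kept per joined multicast group. */
struct joined_group {
    uint32_t addr;  /* IPv4 multicast address, host byte order */
    uint32_t qkey;  /* Q_Key obtained from the join */
    void    *ah;    /* stands in for the allocated address handle */
};

/* Hypothetical per-cm_id table of joined groups. */
struct mcast_table {
    struct joined_group groups[MAX_JOINED];
    size_t count;
};

/* Record a completed join. Returns 0 on success, -1 if the table is full. */
int mcast_table_add(struct mcast_table *t, uint32_t addr, uint32_t qkey,
                    void *ah)
{
    if (t->count == MAX_JOINED)
        return -1;
    t->groups[t->count++] = (struct joined_group){ addr, qkey, ah };
    return 0;
}

/*
 * Map a destination address to its joined-group record.
 * Returns NULL when the address has not been joined, which is the case
 * where a sendto-style call would have to fail.
 */
const struct joined_group *mcast_table_find(const struct mcast_table *t,
                                            uint32_t addr)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->groups[i].addr == addr)
            return &t->groups[i];
    return NULL;
}
```

A send path built on this would copy the ah and qkey from the returned record into the send work request before posting it, and return an error when the lookup yields NULL.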
> If a user wanted more control over the multicast group, we could support a call
> such as:
>
> struct ib_mreq {
> struct ib_sa_mcmember_rec rec;
> ib_sa_comp_mask comp_mask;
> };
>
> rdma_set_option(id, RDMA_PROTO_IB, IB_ADD_MEMBERSHIP, &ib_mreq,
> sizeof(ib_mreq));
>
> Thoughts?
>
> - Sean
>