[ofa-general] peer to peer connections support

Or Gerlitz ogerlitz at voltaire.com
Thu Dec 20 07:08:45 PST 2007


Sean Hefty wrote:
...
> I didn't follow this.
...
> Peer to peer SIDs are in a different domain than client/server SIDs, and
> the peer_to_peer field is used to indicate which domain a SID is in.

Sorry if I wasn't clear, let me check that I understand you: with this
different-domain implementation, under client/server the passive side
calls cm listen and the active side calls cm connect, whereas under
peer-to-peer both sides call cm listen and later either both sides or
only one side may call cm connect, correct?

> To add to my comments on the CM API, struct ib_cm_req_param, which is
> used to send the REQ, includes service_id and peer_to_peer fields.  The
> latter is a boolean used by the CM to distinguish if incoming REQs can
> be matched with the outgoing REQ.

OK, this makes things clearer.

>> Why should there be a difference between the rdma-cm and the cm? If the
>> cm has a model that needs no API change, wouldn't it also apply to
>> the rdma-cm?

> The rdma_cm does not know how to set the peer_to_peer field in the
> ib_cm_req_param.  It sets this field to 0 today.

But it could set it to one as well... assuming my understanding above of
the suggested implementation is correct, we can change the RDMA-CM API
to let users specify on rdma_connect that they want peer-to-peer
support, so such apps can issue an rdma_listen call and later call
rdma_connect with this bit set, and they are done (or almost done... I
guess there is some more devil in the details here, isn't there?)

>> I think that in the MPI world each rank gets a SID from the local CM and
>> they exchange the SIDs out-of-band, then connections are opened. If it's
>> a connection-on-demand scheme, then whenever the rank process calls
>> mpi_send() to a peer for which the local MPI library does not have a
>> connection, it tries to connect. So if this happens "at once" between
>> some pair of ranks, there should be a way to form one connection out of
>> these two connecting requests. My thinking/motivation is that support of
>> this scheme should be in the IB stack (cm and rdma-cm) level and not in
>> the specific MPI implementation level.
> 
> Are the out of band connections used by MPI formed using client/server
> or peer to peer?  I believe that Intel MPI has each rank listen for
> connections from the ranks below it using client/server.

Yes, MPIs that do all-to-all connect on job start typically use
client/server, where all the ranks > 0 issue a listen call and then all
lower ranks connect to higher ranks, or use some other symmetry-breaking
scheme. I am trying to see what needs to be supported by the IB stack to
let MPIs that do connect-on-demand use the RDMA-CM.
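
The rank-based symmetry breaking can be sketched as below. This is only
an illustration, not code from any MPI implementation: rank_is_active()
and active_connect_count() are hypothetical helpers, and I assume the
"lower rank connects, higher rank listens" rule described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: for a pair of distinct ranks, the lower rank is
 * the active side (calls connect) and the higher rank is passive. */
static bool rank_is_active(int my_rank, int peer_rank)
{
    assert(my_rank >= 0 && peer_rank >= 0 && my_rank != peer_rank);
    return my_rank < peer_rank;
}

/* Count how many connections my_rank initiates in a job of nranks ranks
 * when every pair of ranks is connected exactly once. */
static int active_connect_count(int my_rank, int nranks)
{
    int count = 0;

    for (int peer = 0; peer < nranks; peer++) {
        if (peer != my_rank && rank_is_active(my_rank, peer))
            count++;
    }
    return count;
}
```

Because exactly one side of every pair is active, two REQs never cross
between the same pair of ranks. That guarantee is what connect-on-demand
loses, since either rank may be the first to call mpi_send().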

> There are a couple of problems with the peer to peer model.  First,
> unless the connections occur at exactly the same time, they miss
> connecting (rejected with invalid SID).  

This makes the whole peer-to-peer model useless, since an app cannot make
sure that connections occur at exactly the same time! My understanding of
the spec is that the peer-to-peer model can handle connections that occur
at exactly the same time, but not only those.

> Second, if multiple peer to
> peer connections need to form between the same pair of nodes, things can
> go screwy (that's the technical term) trying to match up the peer requests.

Under MPI each rank uses a different SID, so I think we are safe from 
this problem.

Or