[ofa-general] peer to peer connections support

Wed Dec 19 03:33:48 PST 2007

Sean Hefty wrote:
> Peer to peer connection was never fully implemented in the ib_cm.  I 
> don't think it would be that hard to implement at that level, and it 
> shouldn't require API changes.

Your comment below, "CM needs to know the connection model selected by 
the app", leaves me somewhat confused. Reading your other comments, I 
see two options here, depending on whether the implementation 
differentiates between peer-to-peer SIDs and client/server SIDs:

If there's no difference, then in the peer-to-peer model too, the 
application must first tell the CM to listen on a SID, and it's up to 
the CM to break the symmetry and decide who sends the REP and who 
ignores the REQ.
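
To make this first option concrete, here is a minimal sketch of how two 
crossed REQs could be resolved. The comparison key (a CA GUID plus QPN 
pair) and the rule that the numerically larger key stays active are 
assumptions for illustration only, not a description of what the ib_cm 
does; the names PeerId and resolve_crossed_req are hypothetical.

```python
from typing import NamedTuple


class PeerId(NamedTuple):
    """Hypothetical identity used only to order the two peers."""
    ca_guid: int
    qpn: int


def resolve_crossed_req(local: PeerId, remote: PeerId) -> str:
    """Decide the local role when both sides sent a REQ on the same SID.

    Assumed rule: the larger (ca_guid, qpn) tuple stays active, so it
    ignores the incoming REQ and waits for a REP; the smaller one turns
    passive and answers the incoming REQ with a REP.
    """
    if local == remote:
        raise ValueError("peers must have distinct identities")
    return "active" if local > remote else "passive"


a = PeerId(ca_guid=0x0002C90300001234, qpn=0x48)
b = PeerId(ca_guid=0x0002C90300005678, qpn=0x11)

# Each side evaluates the same rule locally and they reach opposite roles.
print(resolve_crossed_req(a, b))  # passive
print(resolve_crossed_req(b, a))  # active
```

The point is only that the rule is deterministic and evaluated 
independently on both sides, so no extra message is needed to agree on 
the roles.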

If there is a difference, then peer-to-peer SIDs are in a different 
domain than client/server SIDs.

> Support at the rdma_cm level may require an API change.  There's no easy 
> way for the rdma_cm to know if it should invoke the IB peer-to-peer 
> connection model.  I'm not even sure how one peer would know the other 
> peer's port number, unless well known ports are used on both sides.

Why should there be a difference between the rdma-cm and the cm? If the 
cm has a model that needs no API change, wouldn't it also apply to the 
rdma-cm?

>> Such support would be useful in symmetric schemes such as MPIs that 
>> open connections on demand and more applications where each party can 
>> both accept and initiate connections. For example, I understand that 
>> some work is done now at the open mpi community to use the rdma-cm as 
>> a possible channel for connection establishment.

> I would need to better understand the expected usage model, like how the 
> peers find each other, but this is something that could be added if needed.

I think that in the MPI world each rank gets a SID from the local CM and 
they exchange the SIDs out-of-band, then connections are opened. If it's 
a connection-on-demand scheme, then whenever the rank process calls 
mpi_send() to a peer for which the local MPI library does not have a 
connection, it tries to connect. So if this happens "at once" between 
some pair of ranks, there should be a way to form one connection out of 
these two connection requests. My thinking/motivation is that support 
for this scheme should be at the IB stack (cm and rdma-cm) level and not 
at the level of a specific MPI implementation.
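
A toy simulation of the on-demand case, to show what "one connection out 
of two requests" would mean in practice. Everything here is hypothetical: 
the Rank class, the use of the rank id as the tie-breaker, and the 
message handling, which skips the RTU and any real CM state machine.

```python
class Rank:
    """Toy MPI rank that opens connections on demand (illustration only)."""

    def __init__(self, rank_id: int):
        self.rank_id = rank_id
        self.pending = set()     # peers we sent a REQ to, no REP yet
        self.connected = set()   # peers with an established connection
        self.reps_sent = 0

    def send(self, peer: "Rank") -> None:
        """Stand-in for mpi_send(): issue a REQ if no connection exists."""
        if peer.rank_id in self.connected or peer.rank_id in self.pending:
            return
        self.pending.add(peer.rank_id)

    def on_req(self, peer: "Rank") -> None:
        """Handle an incoming REQ, merging it with our own pending REQ."""
        if peer.rank_id in self.connected:
            return  # the crossed REQ was already merged; drop it
        if peer.rank_id in self.pending and self.rank_id > peer.rank_id:
            return  # assumed rule: higher rank ignores the REQ, awaits REP
        self.pending.discard(peer.rank_id)
        self.connected.add(peer.rank_id)
        self.reps_sent += 1
        peer.on_rep(self)

    def on_rep(self, peer: "Rank") -> None:
        """Our REQ was answered: the connection is up."""
        self.pending.discard(peer.rank_id)
        self.connected.add(peer.rank_id)


def simulate(req_arrives_first_at: int):
    """Both ranks connect "at once"; REQ delivery order is a parameter."""
    r0, r1 = Rank(0), Rank(1)
    r0.send(r1)
    r1.send(r0)
    if req_arrives_first_at == 0:
        r0.on_req(r1)
        r1.on_req(r0)
    else:
        r1.on_req(r0)
        r0.on_req(r1)
    return r0, r1


for first in (0, 1):
    r0, r1 = simulate(first)
    # Exactly one REP is sent regardless of which REQ is delivered first.
    print(first, r0.reps_sent + r1.reps_sent, r0.connected, r1.connected)
```

In both delivery orders only one REP goes out and each rank ends up with 
a single connection to the other, which is the behaviour I would like 
the stack to provide instead of each MPI implementing it separately.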

Jeff, Jon, any comments?

Or.