[ofa-general] Question about RDMA CM

Wed Sep 17 10:00:23 PDT 2008

On Wed, 2008-09-17 at 08:15 -0400, Jeff Squyres wrote:
> On Sep 17, 2008, at 1:45 AM, Doug Ledford wrote:
> 
> >> No; we are using ibv_create_qp, and then assigning id->qp afterwards.
> >
> > Don't do that.  Assume if rdmacm provides an interface for doing
> > something, then there is likely a reason.  In this case, when you call
> > rdma_create_qp(), it does more than just call ibv_create_qp() and
> > ibv_modify_qp() on your behalf.  It also pipes information about the
> > state changes in the qp to the kernel rdma_cm module (by writing the
> > commands in rdma_cm format to id->channel->fd, which is the rdma_cm fd
> > not the qp fd, in places like rdma_init_qp_attr()).
> 
> In general, I agree with you (use rdmacm_create_qp instead of making  
> it manually).  But I'm looking at the head of the librdmacm git and I  
> don't see what you're talking about.  All it does is call  
> ibv_create_qp and ibv_modify_qp.

Does head of git not have this snippet in rdma_create_qp():

        if (ucma_is_ud_ps(id->ps))
                ret = ucma_init_ud_qp(id_priv, qp);
        else
                ret = ucma_init_conn_qp(id_priv, qp);

That's in the librdmacm code I have here, and it is called *before* the
code sets id->qp = qp.  Both ucma_init_*_qp() variants end up calling
rdma_init_qp_attr(), and that is where we write what we are doing to
id->channel->fd (i.e., to the kernel rdmacm module rather than the verbs
module).  So unless things have changed, it's not just a simple wrapper.
And even if they have changed, if you want to play it safe with older
librdmacm versions, you have to assume it isn't a simple wrapper.

> The person who initially wrote this code chose to create the qp  
> manually for two reasons:
> 
> - rdmacm_create_qp is just a wrapper around ibv_create_qp and  
> ibv_modify_qp
> - other parts of OMPI (that don't use RDMA CM for wireup) call  
> ibv_create_qp and ibv_modify_qp
> 
> That being said, I actually spent a little time yesterday trying to  
> convert to use rdmacm_create_qp and was having problems with the  
> comparison at the top of rdmacm_create_qp against the protection  
> domain -- somehow it was failing for me.  It was not immediately  
> obvious to me where rdmacm was getting that pd from, nor why that  
> comparison would fail.

Here are some comments and code of mine that work just fine for getting
the right id/pd pair when creating a new cm_id:

        // Before we can create a queue pair (QP), we have to have a protection
        // domain (PD) and it has to exist on the controller we are going
        // to create the QP on.  Since on the server we want to be able to
        // share buffers between connections, and the buffer's PD must match
        // the QP's PD, and all connections have their own QP, we
        // have to share the same PD across all QPs on a single controller.
        // However, we don't know what controller rdma_resolve_addr bound
        // us to.  But, inside the cm_id, there is a pointer to an ibv_context,
        // and our device list is actually a list of ibv_context pointers, so
        // try to match one of our known device pointers to that pointer and
        // if we hit a match, we know what our PD needs to be since we
        // already allocated it.  If we don't find a match, we are screwed
        // and we bail.
        for (devnum = 0; devnum < num_devices; devnum++)
                if (t_data->rdma->cm_id->verbs == devlist[devnum].device)
                        break;
        if (devnum == num_devices || devlist[devnum].domain == NULL) {
                printf("_rdma_connect: couldn't find matching context\n");
                goto out;
        }
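
The lookup above can also be written as a small standalone helper.  This
is only a minimal sketch under stated assumptions: struct dev_entry and
find_pd_for_context() are hypothetical names that do not appear in
librdmacm, and the verbs types are merely forward-declared so the snippet
compiles without the libibverbs headers.  In real code you would include
<rdma/rdma_cma.h>, fill the table from rdma_get_devices() and
ibv_alloc_pd(), and pass cm_id->verbs as the last argument.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations only (assumption for this sketch); the real
 * definitions come from <infiniband/verbs.h>. */
struct ibv_context;
struct ibv_pd;

/* Hypothetical table entry: one per device context we know about. */
struct dev_entry {
        struct ibv_context *device;  /* device context pointer */
        struct ibv_pd *domain;       /* PD allocated on that context, or NULL */
};

/* Return the PD stored for the entry whose context pointer equals `verbs`.
 * Returns NULL if no entry matches or the matching entry has no PD. */
struct ibv_pd *find_pd_for_context(const struct dev_entry *devlist,
                                   int num_devices,
                                   const struct ibv_context *verbs)
{
        int devnum;

        assert(devlist != NULL || num_devices == 0);
        for (devnum = 0; devnum < num_devices; devnum++)
                if (devlist[devnum].device == verbs)
                        return devlist[devnum].domain;
        return NULL;
}
```

A NULL return covers both failure cases handled above (no matching
context, or a context whose PD allocation failed).  A non-NULL result is
the PD you would then hand to rdma_create_qp() together with the cm_id.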

In the rdma init code I do this to set up the devlist array in the first
place:

        devices = rdma_get_devices(&num_devices);
...
        for(i=0; i < num_devices; i++) {
                devlist[i].device = devices[i];
                // ibv_alloc_pd() returns NULL on failure; the lookup
                // code above checks for a NULL domain.
                devlist[i].domain = ibv_alloc_pd(devlist[i].device);
        }

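The important point is that you must call rdma_get_devices() and
allocate your protection domains on the ibv_context pointers it returns.
The comparison at the top of rdma_create_qp() checks the PD's context
against id->verbs, so a PD allocated on a context you opened yourself
will not match.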

> >> As far as I can tell, I am not sending to the wrong QP.  But it is
> >> complex code, so there certainly can be a bug in this area.
> >>
> >> The thing that is weird for me is that setting rnr_retry to 7 makes it
> >> work.
> >
> > I didn't look into the kernel code so I couldn't venture a guess as to
> > whether or not the above is actually a hard requirement, and whether or
> > not it would explain the rnr_retry of 7 getting around the race
> > condition, but I would think it's plausible.
> 
> 
> If all is working properly, a rnr_retry_count of 0 should be  
> sufficient because there should be no race conditions.  This is what  
> OMPI has had for years; it's only this new RDMA CM wireup problem that  
> has forced me to set it at 7.
> 
> However, as I mentioned before, this is complex code, so it's quite  
> possible (likely?) that I have a bug in the code somewhere.  I was  
> posting here looking for any possible insights into why this could  
> happen.
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
