[ofa-general] Question about RDMA CM
Doug Ledford
dledford at redhat.com
Wed Sep 17 10:00:23 PDT 2008
On Wed, 2008-09-17 at 08:15 -0400, Jeff Squyres wrote:
> On Sep 17, 2008, at 1:45 AM, Doug Ledford wrote:
>
> >> No; we are using ibv_create_qp, and then assigning id->qp afterwards.
> >
> > Don't do that. Assume if rdmacm provides an interface for doing
> > something, then there is likely a reason. In this case, when you call
> > rdma_create_qp(), it does more than just call ibv_create_qp() and
> > ibv_modify_qp() on your behalf. It also pipes information about the
> > state changes in the qp to the kernel rdma_cm module (by writing the
> > commands in rdma_cm format to id->channel->fd, which is the rdma_cm fd
> > not the qp fd, in places like rdma_init_qp_attr()).
>
> In general, I agree with you (use rdma_create_qp instead of making
> it manually). But I'm looking at the head of the librdmacm git and I
> don't see what you're talking about. All it does is call
> ibv_create_qp and ibv_modify_qp.

Does head of git not have this snippet in rdma_create_qp():

    if (ucma_is_ud_ps(id->ps))
        ret = ucma_init_ud_qp(id_priv, qp);
    else
        ret = ucma_init_conn_qp(id_priv, qp);

That's in the librdmacm code I have here, and it happens to be called
*before* the code sets id->qp = qp. In that call chain, both
ucma_init_*_qp() variants end up calling rdma_init_qp_attr(), and that
is where we write what we are doing to id->channel->fd (i.e., to the
kernel rdmacm module rather than the verbs module). So unless things
have changed, it's not just a simple wrapper. And even if they have
changed, if you want to play it safe with older librdmacm versions, you
have to assume it isn't a simple wrapper.

> The person who initially wrote this code chose to create the qp
> manually for two reasons:
>
> - rdma_create_qp is just a wrapper around ibv_create_qp and
> ibv_modify_qp
> - other parts of OMPI (that don't use RDMA CM for wireup) call
> ibv_create_qp and ibv_modify_qp
>
> That being said, I actually spent a little time yesterday trying to
> convert to use rdma_create_qp and was having problems with the
> comparison at the top of rdma_create_qp against the protection
> domain -- somehow it was failing for me. It was not immediately
> obvious to me where rdmacm was getting that pd from, nor why that
> comparison would fail.

Here are some comments and code I have that work just fine for getting
at the right id/pd pair when creating a new cm_id:

    // Before we can create a queue pair (QP), we have to have a protection
    // domain (PD) and it has to exist on the controller we are going
    // to create the QP on. Since on the server we want to be able to
    // share buffers between connections, and the buffer's PD must match
    // the QP's PD, and all connections have their own QP, we
    // have to share the same PD across all QPs on a single controller.
    // However, we don't know what controller rdma_resolve_addr bound
    // us to. But, inside the cm_id, there is a pointer to an ibv_context,
    // and our device list is actually a list of ibv_context pointers, so
    // try to match one of our known device pointers to that pointer and
    // if we hit a match, we know what our PD needs to be since we
    // already allocated it. If we don't find a match, we are screwed
    // and we bail.
    for (devnum = 0; devnum < num_devices; devnum++)
        if (t_data->rdma->cm_id->verbs == devlist[devnum].device)
            break;
    if (devnum == num_devices || devlist[devnum].domain == NULL) {
        printf("_rdma_connect: couldn't find matching context\n");
        goto out;
    }

In the rdma init code I do this to set up the devlist array in the first
place:

    devices = rdma_get_devices(&num_devices);
    ...
    for (i = 0; i < num_devices; i++) {
        devlist[i].device = devices[i];
        devlist[i].domain = ibv_alloc_pd(devlist[i].device);
    }

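The important point is that you must call rdma_get_devices() and use the
ibv_context pointers it returns when you allocate your protection
domains. A PD allocated on a context you opened yourself has a different
context pointer than cm_id->verbs, which is a likely reason for the PD
comparison at the top of rdma_create_qp() failing.
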
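For anyone who wants the whole pattern in one place, the table setup and
lookup above can be sketched roughly as follows. This is only an
illustration under stated assumptions: struct ctx and struct pd are
stand-ins for struct ibv_context and struct ibv_pd so that it builds
without libibverbs, and build_devlist(), find_pd(), alloc_pd_fn and
demo_alloc_pd() are names made up for the sketch, not librdmacm or verbs
calls.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for struct ibv_context and struct ibv_pd (assumption: real
 * code would use the verbs types directly). */
struct ctx { int unused; };
struct pd  { struct ctx *owner; };

/* One slot per device, mirroring devlist[] in the snippets above. */
struct dev_entry {
    struct ctx *device;   /* context pointer from the device list */
    struct pd  *domain;   /* PD allocated once on that context */
};

/* Hypothetical allocator hook standing in for ibv_alloc_pd(). */
typedef struct pd *(*alloc_pd_fn)(struct ctx *);

/* Demo allocator used only to exercise the sketch. */
struct pd *demo_alloc_pd(struct ctx *c)
{
    struct pd *p = malloc(sizeof(*p));
    if (p != NULL)
        p->owner = c;
    return p;
}

/* Build the table: one PD per context. A slot whose allocation failed
 * keeps domain == NULL, which callers must treat as "no usable PD". */
struct dev_entry *build_devlist(struct ctx **devices, int num_devices,
                                alloc_pd_fn alloc_pd)
{
    struct dev_entry *list;
    int i;

    if (num_devices <= 0)
        return NULL;
    list = calloc((size_t)num_devices, sizeof(*list));
    if (list == NULL)
        return NULL;
    for (i = 0; i < num_devices; i++) {
        list[i].device = devices[i];
        list[i].domain = alloc_pd(devices[i]);
    }
    return list;
}

/* Look up the shared PD for a context by pointer identity. Returns NULL
 * when the context is unknown or its PD allocation failed earlier. */
struct pd *find_pd(const struct dev_entry *list, int num_devices,
                   const struct ctx *verbs)
{
    int i;

    for (i = 0; i < num_devices; i++)
        if (list[i].device == verbs)
            return list[i].domain;
    return NULL;
}
```

In real code the contexts would come from rdma_get_devices(), the
allocator would be ibv_alloc_pd(), the lookup key would be cm_id->verbs,
and the PD found would be the one passed to rdma_create_qp().
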
> >> As far as I can tell, I am not sending to the wrong QP. But it is
> >> complex code, so there certainly can be a bug in this area.
> >>
> >> The thing that is weird for me is that setting rnr_retry to 7 makes
> >> it work.
> >
> > I didn't look into the kernel code so I couldn't venture a guess as to
> > whether or not the above is actually a hard requirement, and whether
> > or not it would explain the rnr_retry of 7 getting around the race
> > condition, but I would think it's plausible.
>
>
> If all is working properly, a rnr_retry_count of 0 should be
> sufficient because there should be no race conditions. This is what
> OMPI has had for years; it's only this new RDMA CM wireup problem that
> has forced me to set it at 7.
>
> However, as I mentioned before, this is complex code, so it's quite
> possible (likely?) that I have a bug in the code somewhere. I was
> posting here looking for any possible insights into why this could
> happen.
>
--
Doug Ledford <dledford at redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband