[libfabric-users] Verbs provider not permitting FI_EP_MSG
sean.hefty at intel.com
Thu Jan 16 08:11:41 PST 2020
> I am working with a user that is running on an older InfiniBand cluster. Using libfabric
> with the following hints:
> hints->caps = FI_MSG | FI_SEND | FI_RECV | FI_REMOTE_READ |
> FI_REMOTE_WRITE | FI_RMA | FI_READ | FI_WRITE;
> hints->mode = FI_CONTEXT | FI_LOCAL_MR | FI_CONTEXT2 | FI_MSG_PREFIX |
> FI_ASYNC_IOV | FI_RX_CQ_DATA;
> hints->domain_attr->mr_mode = FI_MR_BASIC;
You may want to consider updating to the newer mr_mode bits. This field was changed starting in the 1.5 release.
> hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
> hints->domain_attr->data_progress = FI_PROGRESS_AUTO;
> hints->ep_attr->type = FI_EP_RDM;
This requests RDM endpoints, not MSG. Is that the intent for your app, with the issue being that it can't find verbs support underneath?
> No verbs providers are found. Looking through the debug output, I suspect this is the
> crucial line:
If you run fi_info, do you see the verbs provider there?
> libfabric:verbs:fabric:fi_ibv_get_matching_info():1213<info> hints->ep_attr->rx_ctx_cnt
> != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
This is checking for XRC QP support. Skipping these if the hardware doesn't support it should be fine.
> I take it that the underlying hardware is only compatible with FI_PROTO_RDMA_CM_IB_XRC
> protocol for MSG endpoints, and it looks like I need to have FI_SHARED_CONTEXT enabled
> for these endpoints to be supported. I’m having some trouble understanding the
> implications of using FI_SHARED_CONTEXT. If I only ever use one endpoint, is there any
> functional or performance impact to setting this? I’d rather not change to using shared
> contexts unconditionally, so is there a good way for me to detect this situation other
> than to do a maximally permissive fi_getinfo and iterate through the verbs results?
You don't need to use shared contexts or XRC. When you mention only using one endpoint, do you mean one MSG endpoint or one RDM endpoint?
What version of libfabric are you using? Attaching the full debug output from the startup checks might help isolate the problem.