[libfabric-users] Verbs provider not permitting FI_EP_MSG

Steve Welch swelch at systemfabricworks.com
Thu Jan 16 10:13:25 PST 2020


Since RDMA CM will be used, I would verify that IPoIB has been configured for the port being used and that the network interface has been brought up. Otherwise, as Sean suggested, I would capture the libfabric debug output and provide that.
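
As a rough illustration of the "interface has been brought up" check, here is a minimal C sketch that reads a Linux sysfs operstate file. The interface name ib0 in the usage note and the helper names are assumptions for illustration only; they do not come from the system discussed in this thread. The sketch only looks at link state and does not confirm that an IP address has been assigned to the IPoIB interface.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a small text file into buf and strip the newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Hypothetical helper: returns 1 if the operstate file reports "up",
 * 0 for any other state, and -1 if the file cannot be read
 * (for example because the interface does not exist). */
int iface_is_up(const char *operstate_path)
{
    char state[32];

    if (read_first_line(operstate_path, state, sizeof state) != 0)
        return -1;
    return strcmp(state, "up") == 0;
}
```

On a typical Linux host this could be called as iface_is_up("/sys/class/net/ib0/operstate"), where ib0 stands in for whatever name the IPoIB interface actually has.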

Steve

> On Jan 16, 2020, at 11:48 AM, Philip Davis <philip.e.davis at rutgers.edu> wrote:
> 
> Hi Steve,
> 
> Here is the requested output:
> 
> []$ ibv_devinfo -v 
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.36.5120
>         node_guid:                      248a:0703:0077:9520
>         sys_image_guid:                 248a:0703:0077:9523
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4103
>         hw_ver:                         0x0
>         board_id:                       MT_1090111019
>         phys_port_cnt:                  2
>         max_mr_size:                    0xffffffffffffffff
>         page_size_cap:                  0xfffffe00
>         max_qp:                         131000
>         max_qp_wr:                      16351
>         device_cap_flags:               0x057e9c76
>                                         BAD_PKEY_CNTR
>                                         BAD_QKEY_CNTR
>                                         AUTO_PATH_MIG
>                                         CHANGE_PHY_PORT
>                                         UD_AV_PORT_ENFORCE
>                                         PORT_ACTIVE_EVENT
>                                         SYS_IMAGE_GUID
>                                         RC_RNR_NAK_GEN
>                                         MEM_WINDOW
>                                         UD_IP_CSUM
>                                         XRC
>                                         MEM_MGT_EXTENSIONS
>                                         MEM_WINDOW_TYPE_2B
>                                         RAW_IP_CSUM
>                                         Unknown flags: 0x488000
>         max_sge:                        32
>         max_sge_rd:                     30
>         max_cq:                         65408
>         max_cqe:                        4194303
>         max_mr:                         524032
>         max_pd:                         32764
>         max_qp_rd_atom:                 16
>         max_ee_rd_atom:                 0
>         max_res_rd_atom:                2096000
>         max_qp_init_rd_atom:            128
>         max_ee_init_rd_atom:            0
>         atomic_cap:                     ATOMIC_HCA (1)
>         max_ee:                         0
>         max_rdd:                        0
>         max_mw:                         0
>         max_raw_ipv6_qp:                0
>         max_raw_ethy_qp:                0
>         max_mcast_grp:                  8192
>         max_mcast_qp_attach:            248
>         max_total_mcast_qp_attach:      2031616
>         max_ah:                         2147483647
>         max_fmr:                        0
>         max_srq:                        65472
>         max_srq_wr:                     16383
>         max_srq_sge:                    31
>         max_pkeys:                      128
>         local_ca_ack_delay:             15
>         general_odp_caps:
>         rc_odp_caps:
>                                         NO SUPPORT
>         uc_odp_caps:
>                                         NO SUPPORT
>         ud_odp_caps:
>                                         NO SUPPORT
>         completion timestamp_mask:                      0x0000ffffffffffff
>         hca_core_clock:                 427000kHZ
>         device_cap_flags_ex:            0x57E9C76
>         tso_caps:
>         max_tso:                        0
>         rss_caps:
>                 max_rwq_indirection_tables:                     0
>                 max_rwq_indirection_table_size:                 0
>                 rx_hash_function:                               0x0
>                 rx_hash_fields_mask:                            0x0
>         max_wq_type_rq:                 0
>         packet_pacing_caps:
>                 qp_rate_limit_min:      0kbps
>                 qp_rate_limit_max:      0kbps
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>                         max_msg_sz:             0x40000000
>                         port_cap_flags:         0x0259486a
>                         max_vl_num:             8 (4)
>                         bad_pkey_cntr:          0x0
>                         qkey_viol_cntr:         0x0
>                         sm_sl:                  0
>                         pkey_tbl_len:           128
>                         gid_tbl_len:            128
>                         subnet_timeout:         18
>                         init_type_reply:        0
>                         active_width:           4X (2)
>                         active_speed:           14.0 Gbps (16)
>                         phys_state:             LINK_UP (5)
>                         GID[  0]:               fe80:0000:0000:0000:248a:0703:0077:9521
> 
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>                         max_msg_sz:             0x40000000
>                         port_cap_flags:         0x02594868
>                         max_vl_num:             8 (4)
>                         bad_pkey_cntr:          0x0
>                         qkey_viol_cntr:         0x0
>                         sm_sl:                  0
>                         pkey_tbl_len:           128
>                         gid_tbl_len:            128
>                         subnet_timeout:         0
>                         init_type_reply:        0
>                         active_width:           4X (2)
>                         active_speed:           2.5 Gbps (1)
>                         phys_state:             POLLING (2)
>                         GID[  0]:               fe80:0000:0000:0000:248a:0703:0077:9522
> 
> 
> []$ lsmod | grep ib
> ib_isert               50770  0 
> iscsi_target_mod      302966  1 ib_isert
> ib_iser                47813  0 
> libiscsi               57233  1 ib_iser
> scsi_transport_iscsi    99909  2 ib_iser,libiscsi
> ib_srpt                48170  0 
> target_core_mod       367918  3 iscsi_target_mod,ib_srpt,ib_isert
> ib_srp                 48454  0 
> scsi_transport_srp     20993  1 ib_srp
> ib_ipoib              110142  0 
> ib_ucm                 22589  0 
> ib_uverbs              64636  2 ib_ucm,rdma_ucm
> ib_umad                22080  4 
> rdma_cm                54426  4 rpcrdma,ib_iser,rdma_ucm,ib_isert
> ib_cm                  47287  5 rdma_cm,ib_srp,ib_ucm,ib_srpt,ib_ipoib
> mlx4_ib               159474  0 
> ib_core               211874  14 rdma_cm,ib_cm,iw_cm,rpcrdma,mlx4_ib,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert
> libcrc32c              12644  4 xfs,raid456,nf_nat,nf_conntrack
> mlx4_core             309354  2 mlx4_en,mlx4_ib
> libahci                31992  1 ahci
> libata                238896  2 ahci,libahci
> devlink                30193  3 mlx4_en,mlx4_ib,mlx4_core
> 
>> On Jan 16, 2020, at 12:28 PM, Steve Welch <swelch at systemfabricworks.com> wrote:
>> 
>> 
>> 
>>> On Jan 16, 2020, at 11:03 AM, Philip Davis <philip.e.davis at rutgers.edu> wrote:
>>> 
>>> Hi Steve,
>>> 
>>> Thanks for the quick response.
>>> 
>>> I am expecting to use the rxm provider for verbs, but in fi_info I do not see an FI_EP_MSG-type verbs provider.
>> 
>> Could you provide the output for “ibv_devinfo -v” and “lsmod | grep ib”?
>> 
>> Steve
>> 
>>> 
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: tcp;ofi_rxm
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXM
>>> provider: verbs;ofi_rxd
>>>     fabric: IB-0xfe80000000000000
>>>     domain: mlx4_0-dgram
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: UDP;ofi_rxd
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_RXD
>>> provider: verbs
>>>     fabric: IB-0xfe80000000000000
>>>     domain: mlx4_0-dgram
>>>     version: 1.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_IB_UD
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: UDP
>>>     fabric: UDP-IP
>>>     domain: udp
>>>     version: 1.1
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_UDP
>>> provider: sockets
>>>     fabric: 10.1.0.0/16
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 10.1.0.0/16
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 10.1.0.0/16
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 10.157.14.0/24
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 10.157.14.0/24
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 10.157.14.0/24
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em1
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: fe80::/64
>>>     domain: em2
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 127.0.0.0/8
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 127.0.0.0/8
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: 127.0.0.0/8
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: ::1/128
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: ::1/128
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_DGRAM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: sockets
>>>     fabric: ::1/128
>>>     domain: lo
>>>     version: 2.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: tcp
>>>     fabric: TCP-IP
>>>     domain: tcp
>>>     version: 0.1
>>>     type: FI_EP_MSG
>>>     protocol: FI_PROTO_SOCK_TCP
>>> provider: shm
>>>     fabric: shm
>>>     domain: shm
>>>     version: 1.0
>>>     type: FI_EP_RDM
>>>     protocol: FI_PROTO_SHM
>>> 
>>> Thanks,
>>> Philip
>>> 
>>>> On Jan 16, 2020, at 11:03 AM, Steve Welch <swelch at systemfabricworks.com> wrote:
>>>> 
>>>> Hi Philip,
>>>> 
>>>> Since you are specifying FI_EP_RDM in your hints, I assume you want to use the RXM provider on top of the Verbs core provider (i.e. ofi_rxm;verbs). The Verbs provider does not offer native FI_EP_RDM support. To use either XRC or an FI_EP_RDM endpoint you will have to use RXM, but I am unaware of any IB provider that supports XRC but does not support RC.
>>>> 
>>>> If you issue a 'fi_info -p verbs -v’ it will list all the verbs domains supported and the underlying protocol and you could verify if RC should be supported (via RXM for FI_EP_RDM). If you issue 'fi_info -p “ofi_rxm;verbs”', you should see multiple domains for the “ofi_rxm;verbs” provider combination. XRC domains have the “-xrc” suffix.
>>>> 
>>>> If you must use XRC and the RXM/Verbs combination then you will need to set the environment variable FI_OFI_RXM_USE_SRX=1 and RXM will handle the shared RX details.
>>>> 
>>>> Steve
>>>> 
>>>> 
>>>>> On Jan 16, 2020, at 8:56 AM, Philip Davis <philip.e.davis at rutgers.edu> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am working with a user who is running on an older InfiniBand cluster, using libfabric with the following hints:
>>>>> 
>>>>>    hints->caps = FI_MSG | FI_SEND | FI_RECV | FI_REMOTE_READ |
>>>>>                  FI_REMOTE_WRITE | FI_RMA | FI_READ | FI_WRITE;
>>>>>    hints->mode = FI_CONTEXT | FI_LOCAL_MR | FI_CONTEXT2 | FI_MSG_PREFIX |
>>>>>                  FI_ASYNC_IOV | FI_RX_CQ_DATA;
>>>>>    hints->domain_attr->mr_mode = FI_MR_BASIC;
>>>>>    hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
>>>>>    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;
>>>>>    hints->ep_attr->type = FI_EP_RDM;
>>>>> 
>>>>> 
>>>>> No verbs providers are found. Looking through the debug output, I suspect this is the crucial line:
>>>>> 
>>>>> libfabric:verbs:fabric:fi_ibv_get_matching_info():1213<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
>>>>> 
>>>>> I take it that the underlying hardware is only compatible with the FI_PROTO_RDMA_CM_IB_XRC protocol for MSG endpoints, and it looks like I need to have FI_SHARED_CONTEXT enabled for these endpoints to be supported. I’m having some trouble understanding the implications of using FI_SHARED_CONTEXT. If I only ever use one endpoint, is there any functional or performance impact to setting this? I’d rather not change to using shared contexts unconditionally, so is there a good way for me to detect this situation other than to do a maximally permissive fi_getinfo and iterate through the verbs results?
>>>>> 
>>>>> Thanks,
>>>>> Philip
>>>>> _______________________________________________
>>>>> Libfabric-users mailing list
>>>>> Libfabric-users at lists.openfabrics.org
>>>>> https://lists.openfabrics.org/mailman/listinfo/libfabric-users
>> 
> 
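
For the FI_OFI_RXM_USE_SRX=1 setting mentioned earlier in the thread, an application that cannot rely on the user's shell environment could set the variable itself. The following is a minimal sketch, assuming the variable has to be in place before the first libfabric call in the process (for example before fi_getinfo); that timing is an assumption here, not something stated in the thread. The function name is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: request RXM's shared RX handling by setting
 * FI_OFI_RXM_USE_SRX=1 in the process environment.
 * The last argument (overwrite = 0) leaves any value the user has
 * already exported untouched. Returns 0 on success, -1 on failure. */
int enable_rxm_shared_rx(void)
{
    return setenv("FI_OFI_RXM_USE_SRX", "1", 0);
}
```

Passing overwrite = 0 is a design choice: a value exported by the user takes precedence over the application default.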

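
The observation earlier in the thread, that the fi_info listing contains no verbs entry of type FI_EP_MSG, can be checked mechanically. Below is a minimal C sketch that scans text in the "provider: / type:" layout shown in the quoted fi_info output. It is plain string matching over captured output, not a use of the libfabric API, and the function names are hypothetical. It assumes lines carry no trailing whitespace or carriage returns.

```c
#include <string.h>

/* Skip leading spaces and tabs. */
static const char *skip_ws(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Hypothetical helper: returns 1 if the fi_info-style listing in `text`
 * contains an entry whose provider name is exactly `prov` and whose
 * endpoint type is exactly `ep_type`, and 0 otherwise. */
int listing_has_endpoint(const char *text, const char *prov,
                         const char *ep_type)
{
    char line[256];
    int in_prov = 0;   /* 1 while inside an entry for `prov` */

    while (*text) {
        /* Copy the current line (truncated if overlong) into `line`. */
        size_t n = strcspn(text, "\n");
        size_t c = n < sizeof line - 1 ? n : sizeof line - 1;
        memcpy(line, text, c);
        line[c] = '\0';
        text += n;
        if (*text == '\n')
            text++;

        const char *p = skip_ws(line);
        if (strncmp(p, "provider:", 9) == 0) {
            /* A new entry starts; remember whether it is the one we want. */
            in_prov = strcmp(skip_ws(p + 9), prov) == 0;
        } else if (in_prov && strncmp(p, "type:", 5) == 0) {
            if (strcmp(skip_ws(p + 5), ep_type) == 0)
                return 1;
        }
    }
    return 0;
}
```

Called with the quoted listing, "verbs", and "FI_EP_MSG", this would report no match, since the only plain verbs entry in that output is of type FI_EP_DGRAM. The provider comparison is exact, so layered names such as "verbs;ofi_rxd" are treated as distinct from "verbs".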