[libfabric-users] verbs questions, problem with fi_connect
Arne
arnestruck at astruck.de
Wed Oct 28 09:41:43 PDT 2020
Hello,
Here is the respective debug output:
libfabric:46782:verbs:fabric:vrb_open_ep():972<debug> open_ep src addr:
fi_sockaddr_in://10.0.10.3:0
libfabric:46782:verbs:fabric:vrb_open_ep():975<debug> open_ep dest addr:
fi_sockaddr_in://10.0.10.3:4711
libfabric:46782:verbs:core:ofi_check_tx_attr():883<info> Rx only caps
ignored in Tx caps
libfabric:46782:verbs:core:ofi_check_rx_attr():785<info> Tx only caps
ignored in Rx caps
libfabric:46782:verbs:core:ofi_check_rx_attr():785<info> Tx only caps
ignored in Rx caps
libfabric:46782:verbs:core:ofi_check_tx_attr():883<info> Rx only caps
ignored in Tx caps
libfabric:46782:verbs:core:ofi_check_ep_attr():679<info> Unsupported
protocol
libfabric:46782:verbs:core:ofi_check_ep_attr():680<info> Supported:
FI_PROTO_RDMA_CM_IB_XRC
libfabric:46782:verbs:core:ofi_check_ep_attr():680<info> Requested:
FI_PROTO_RDMA_CM_IB_RC
libfabric:46782:verbs:core:ofi_check_ep_type():657<info> unsupported
endpoint type
libfabric:46782:verbs:core:ofi_check_ep_type():658<info> Supported:
FI_EP_DGRAM
libfabric:46782:verbs:core:ofi_check_ep_type():658<info> Requested:
FI_EP_MSG
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap
start (nil) len 131072
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap
start (nil) len 131072
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap
start (nil) len 294912
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap
start (nil) len 4096
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap
start (nil) len 294912
Error occurred reading Eq.
Details:
Unknown error -8
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x603000008000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x60300000b000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x60300000e000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x60300002b000 len 20480
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x61d000039000 len 24576
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x61d000081000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x61d0000b1000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x61d000151000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x61d000171000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x62400000e000 len 8192
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000020000 len 401408
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000084000 len 311296
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x6240000d2000 len 589824
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000170000 len 11993088
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000d9c000 len 393216
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000e64000 len 458752
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624000f3c000 len 393216
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x624001004000 len 458752
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x6240010dc000 len 327680
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x2a6a94e9000 len 258048
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted
madvise addr 0x2a6a94e9000 len 258048
I am unsure whether any of it is useful. I started with the ep and
copied down from there.
Any ideas?
Greetings, Arne.
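
The text "Unknown error -8" in the output above has the shape that a C library such as glibc produces when strerror is handed a number it does not recognize, which would be consistent with a negative code being passed on without negation. That is an assumption; the log does not confirm where the string comes from. A minimal sketch in plain C, using two hypothetical helpers (errno_magnitude and describe_errno, neither part of libfabric), of normalizing the sign before looking up the message:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper (not a libfabric call): accept an errno-style code
 * in either sign convention and return the positive errno value. */
int errno_magnitude(int code)
{
    assert(code != INT_MIN);          /* -INT_MIN would overflow */
    return code < 0 ? -code : code;
}

/* Hypothetical helper: describe an errno-style code via the C library,
 * regardless of whether the caller passed 8 or -8. */
const char *describe_errno(int code)
{
    return strerror(errno_magnitude(code));
}
```

On Linux, errno 8 is ENOEXEC, so describe_errno(-8) would yield the "Exec format error" text guessed at further down in this thread. Whether the -8 reported here really is a negated errno is not established by the output.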
On 27.10.20 at 21:46, Arne wrote:
> Hello, it is me again.
>
> Maybe some code can help a more experienced person spot my mistake
> (I mostly skipped irrelevant error handling and combined definition and
> initialization for brevity):
>
>
> fi_getinfo(FI_VERSION(1, 11),
>
> "10.0.10.4",
> "4711",
> FI_NUMERICHOST,
> msg_hints*,
> &info);
>
> That info is used for creating the fabric, domain, and endpoint (ep) on
> the client side; the msg_hints entries are listed in the prior message.
>
>
> From the connection function (connection_data is a 38-byte buffer for a
> UUID; ep and event_queue are associated):
>
> uint32_t* event;
>
> size_t event_entry_size = sizeof(struct fi_eq_cm_entry) + 128;
>
> void* event_entry = malloc(event_entry_size);
>
> struct fi_eq_err_entry* event_queue_err_entry;
>
> struct sockaddr_in* address = malloc(sizeof(struct sockaddr_in));
>
>
> address = malloc(sizeof(struct sockaddr_in));
>
> address->sin_family = AF_INET;
>
> address->sin_port = htons(4711);
>
> inet_aton("10.0.10.3", &address->sin_addr);
>
>
> fi_connect(ep, address, (void*)con_data, sizeof(struct JConData));
>
> error = fi_eq_sread(event_queue, event, event_entry, event_entry_size, -1, 0);
>
> if (error < 0)
> {
>     if (error == -FI_EAVAIL)
>     {
>         error = fi_eq_readerr(event_queue, event_queue_err_entry, 0);
>         if (error < 0)
>         {
>             g_critical("\nError occurred reading Eq.\nDetails:\n%s", fi_strerror(abs(error)));
>         }
>         else
>         {
>             g_critical("\nError Message on Event Queue.\nDetails:\n%s",
>                        fi_eq_strerror(event_queue, event_queue_err_entry->prov_errno,
>                                       event_queue_err_entry->err_data, NULL, 0));
>         }
>     }
> }
>
> The last g_critical results in the Unknown error -8.
>
>
> 'fi_info -p verbs -P 4711 -n 10.0.10.3 -t FI_EP_MSG -a FI_SOCKADDR_IN
> -v'* results in:
>
> fi_info:
> caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV,
> FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM ]
> mode: [ FI_RX_CQ_DATA ]
> addr_format: FI_SOCKADDR_IN
> src_addrlen: 16
> dest_addrlen: 16
> src_addr: fi_sockaddr_in://10.0.10.4:0
> dest_addr: fi_sockaddr_in://10.0.10.3:4711
> handle: (nil)
> fi_tx_attr:
> caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
> mode: [ ]
> op_flags: [ ]
> msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS,
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS,
> FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW,
> FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
> comp_order: [ FI_ORDER_STRICT ]
> inject_size: 256
> size: 384
> iov_limit: 4
> rma_iov_limit: 1
> fi_rx_attr:
> caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_RECV, FI_REMOTE_READ,
> FI_REMOTE_WRITE ]
> mode: [ FI_RX_CQ_DATA ]
> op_flags: [ ]
> msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS,
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS,
> FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW,
> FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
> comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
> total_buffered_recv: 0
> size: 384
> iov_limit: 4
> fi_ep_attr:
> type: FI_EP_MSG
> protocol: FI_PROTO_RDMA_CM_IB_RC
> protocol_version: 1
> max_msg_size: 1073741824
> msg_prefix_size: 0
> max_order_raw_size: 1073741824
> max_order_war_size: 0
> max_order_waw_size: 1073741824
> mem_tag_format: 0x0000000000000000
> tx_ctx_cnt: 1
> rx_ctx_cnt: 1
> auth_key_size: 0
> fi_domain_attr:
> domain: 0x0
> name: mlx4_0
> threading: FI_THREAD_SAFE
> control_progress: FI_PROGRESS_AUTO
> data_progress: FI_PROGRESS_AUTO
> resource_mgmt: FI_RM_ENABLED
> av_type: FI_AV_UNSPEC
> mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED,
> FI_MR_PROV_KEY ]
> mr_key_size: 4
> cq_data_size: 4
> cq_cnt: 65408
> ep_cnt: 163768
> tx_ctx_cnt: 1024
> rx_ctx_cnt: 1024
> max_ep_tx_ctx: 1024
> max_ep_rx_ctx: 1024
> max_ep_stx_ctx: 0
> max_ep_srx_ctx: 65472
> cntr_cnt: 0
> mr_iov_limit: 1
> caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
> mode: [ ]
> auth_key_size: 0
> max_err_data: 255
> mr_cnt: 524032
> fi_fabric_attr:
> name: IB-0xfe80000000000000
> prov_name: verbs
> prov_version: 111.0
> api_version: 1.11
> fid_nic:
> fi_device_attr:
> name: mlx4_0
> device_id: 0x1003
> device_version: 1
> vendor_id: 0x02c9
> driver: (null)
> firmware: 2.31.5050
> fi_bus_attr:
> fi_bus_type: FI_BUS_UNKNOWN
> fi_link_attr:
> address: (null)
> mtu: 4096
> speed: 32000000000
> state: FI_LINK_UP
> network_type: InfiniBand
>
>
> *10.0.10.3 is the IP address of the server.
>
>
> Regarding environment variables: I haven't changed the default values.
>
> I will see tomorrow whether FI_LOG_LEVEL="debug" gives some useful information.
>
>
> Greetings, Arne.
>
>
> On 26.10.20 at 21:24, Arne wrote:
>> Sorry to press the issue, but the due date on which I have to present
>> some results is coming closer.
>>
>>
>> By now I have a setup that should result in a libfabric
>> installation with debug mode enabled.
>>
>> Could debug mode output help us troubleshoot the problem with the
>> fi_connect call?
>>
>>
>> Are there any ideas why fi_connect returns an error entry containing
>> Unknown error -8 to the respective eq of the ep?
>>
>>
>> Greetings,
>>
>> Arne.
>>
>>
>> On 25.10.20 at 16:50, Arne wrote:
>>> Hey, I have additional questions regarding the usage of the
>>> verbs provider.
>>>
>>>
>>> Are there any details/configs to be aware of when sending a fi_connect
>>> request to connect an endpoint on the verbs provider to
>>> its peer, or when reading the respective eq?
>>>
>>>
>>> Reading the eq afterwards gives Unknown Error -8 from fi_eq_strerror
>>> (so maybe Exec format error?).
>>>
>>>
>>> Debug Level info did not provide any useful info regarding the
>>> Problem (last info is verbs checking for suitable endpoint types and
>>> since the fi_endpoint did not return an error I assume it found the
>>> requested FI_EP_MSG).
>>>
>>> eq-attributes used:
>>>
>>> eq_attr.size = eq_size;
>>> eq_attr.flags = 0;
>>> eq_attr.wait_obj = FI_WAIT_FD;
>>> eq_attr.signaling_vector = 0;
>>> eq_attr.wait_set = NULL;
>>>
>>>
>>> hints used are:
>>>
>>> msg_hints = fi_allocinfo();
>>> msg_hints->caps = FI_RMA | FI_MSG;
>>> msg_hints->mode = FI_RX_CQ_DATA;
>>> msg_hints->ep_attr->type = FI_EP_MSG;
>>> msg_hints->addr_format = FI_SOCKADDR_IN;
>>> msg_hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR
>>> | FI_MR_ALLOCATED | FI_MR_PROV_KEY;
>>> msg_hints->fabric_attr->prov_name = g_strdup("verbs");
>>>
>>>
>>> The address used as the target of the fi_connect is the one
>>> corresponding to the InfiniBand IP of the server node, and it is passed
>>> as FI_SOCKADDR_IN, if that is of relevance.
>>>
>>> The call adds some additional connection data to the request.
>>>
>>>
>>> On another note: I saw that the node parameter can't be NULL when
>>> creating fi_infos for active endpoints (I got an ENODATA, and a warning
>>> revealed that verbs couldn't resolve the address, so I replaced it
>>> with localhost).
>>>
>>> Is that true or do I have another error?
>>>
>>>
>>> Greetings, Arne.
>>>
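
The client-side address setup quoted in this thread allocates the sockaddr_in twice and does not check the result of the address conversion. As a minimal sketch in plain C (no libfabric calls), the same fields can be filled into a caller-provided struct with the conversion checked. The helper name make_ipv4_addr is hypothetical, and inet_pton is used here in place of the inet_aton from the quoted code:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of libfabric): fill *out with an IPv4
 * address and port. Returns 0 on success, -1 if ip is not a valid
 * dotted-quad string. */
int make_ipv4_addr(const char *ip, unsigned short port,
                   struct sockaddr_in *out)
{
    assert(ip != NULL && out != NULL);

    memset(out, 0, sizeof(*out));     /* also clears the padding bytes */
    out->sin_family = AF_INET;
    out->sin_port = htons(port);      /* port in network byte order */

    /* inet_pton returns 1 on success and 0 for a malformed string */
    if (inet_pton(AF_INET, ip, &out->sin_addr) != 1)
        return -1;
    return 0;
}
```

With a struct sockaddr_in on the stack, a call such as make_ipv4_addr("10.0.10.3", 4711, &address) yields the same family, port, and address as the quoted setup, and nothing needs to be freed afterwards. This only covers building the address; it makes no claim about the cause of the fi_connect error.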