[libfabric-users] verbs questions, problem with fi_connect

Arne arnestruck at astruck.de
Wed Oct 28 09:41:43 PDT 2020


Hello,

Here is the respective debug output:

libfabric:46782:verbs:fabric:vrb_open_ep():972<debug> open_ep src addr: 
fi_sockaddr_in://10.0.10.3:0
libfabric:46782:verbs:fabric:vrb_open_ep():975<debug> open_ep dest addr: 
fi_sockaddr_in://10.0.10.3:4711
libfabric:46782:verbs:core:ofi_check_tx_attr():883<info> Rx only caps 
ignored in Tx caps
libfabric:46782:verbs:core:ofi_check_rx_attr():785<info> Tx only caps 
ignored in Rx caps
libfabric:46782:verbs:core:ofi_check_rx_attr():785<info> Tx only caps 
ignored in Rx caps
libfabric:46782:verbs:core:ofi_check_tx_attr():883<info> Rx only caps 
ignored in Tx caps
libfabric:46782:verbs:core:ofi_check_ep_attr():679<info> Unsupported 
protocol
libfabric:46782:verbs:core:ofi_check_ep_attr():680<info> Supported: 
FI_PROTO_RDMA_CM_IB_XRC
libfabric:46782:verbs:core:ofi_check_ep_attr():680<info> Requested: 
FI_PROTO_RDMA_CM_IB_RC
libfabric:46782:verbs:core:ofi_check_ep_type():657<info> unsupported 
endpoint type
libfabric:46782:verbs:core:ofi_check_ep_type():658<info> Supported: 
FI_EP_DGRAM
libfabric:46782:verbs:core:ofi_check_ep_type():658<info> Requested: 
FI_EP_MSG
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap 
start (nil) len 131072
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap 
start (nil) len 131072
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap 
start (nil) len 294912
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap 
start (nil) len 4096
libfabric:46782:core:mr:ofi_intercept_mmap():382<debug> intercepted mmap 
start (nil) len 294912
Error occurred reading Eq.
Details:
Unknown error -8

libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x603000008000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x60300000b000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x60300000e000 len 4096
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x60300002b000 len 20480
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x61d000039000 len 24576
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x61d000081000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x61d0000b1000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x61d000151000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x61d000171000 len 57344
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x62400000e000 len 8192
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000020000 len 401408
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000084000 len 311296
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x6240000d2000 len 589824
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000170000 len 11993088
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000d9c000 len 393216
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000e64000 len 458752
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624000f3c000 len 393216
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x624001004000 len 458752
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x6240010dc000 len 327680
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x2a6a94e9000 len 258048
libfabric:46782:core:mr:ofi_intercept_madvise():412<debug> intercepted 
madvise addr 0x2a6a94e9000 len 258048


I am unsure whether any of it is useful. I started with the ep and 
copied down from there.

Any ideas?


Greetings, Arne.



On 27.10.20 at 21:46, Arne wrote:
> Hello, it is me again.
>
> Maybe some code can help a more experienced person to spot my fault
> (I mostly skipped irrelevant error handling and combined definition and
> init for brevity):
>
>
> fi_getinfo(FI_VERSION(1, 11),
>            "10.0.10.4",
>            "4711",
>            FI_NUMERICHOST,
>            msg_hints,
>            &info);
>
> That info is used for creating the fabric, domain, and endpoint (ep) on
> the client side; the msg_hints entries are listed in my prior message.
>
>
> From the connection function (con_data is a 38-byte buffer for a
> UUID; ep and event_queue are associated):
>
> uint32_t* event;
> size_t event_entry_size = sizeof(struct fi_eq_cm_entry) + 128;
> void* event_entry = malloc(event_entry_size);
> struct fi_eq_err_entry* event_queue_err_entry;
> struct sockaddr_in* address = malloc(sizeof(struct sockaddr_in));
>
> address = malloc(sizeof(struct sockaddr_in));
> address->sin_family = AF_INET;
> address->sin_port = htons(4711);
> inet_aton("10.0.10.3", &address->sin_addr);
>
> fi_connect(ep, address, (void*)con_data, sizeof(struct JConData));
> error = fi_eq_sread(event_queue, event, event_entry, event_entry_size,
>                     -1, 0);
> if (error < 0)
> {
>     if (error == -FI_EAVAIL)
>     {
>         error = fi_eq_readerr(event_queue, event_queue_err_entry, 0);
>         if (error < 0)
>         {
>             g_critical("\nError occurred reading Eq.\nDetails:\n%s",
>                        fi_strerror(abs(error)));
>         }
>         else
>         {
>             g_critical("\nError Message on Event Queue.\nDetails:\n%s",
>                        fi_eq_strerror(event_queue,
>                                       event_queue_err_entry->prov_errno,
>                                       event_queue_err_entry->err_data,
>                                       NULL, 0));
>         }
>     }
> }
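For reference, the address setup in the snippet above can be written so that the structure is zeroed before use and the address string is validated. This is a minimal sketch using only POSIX socket headers; make_ipv4_addr is a hypothetical helper name (not a libfabric call), and the IP and port are simply the values from this thread:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *out with an IPv4 address and port.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad string. */
static int make_ipv4_addr(struct sockaddr_in *out, const char *ip, uint16_t port)
{
    assert(out != NULL && ip != NULL);
    memset(out, 0, sizeof(*out));   /* clear padding such as sin_zero */
    out->sin_family = AF_INET;
    out->sin_port = htons(port);    /* port in network byte order */
    /* inet_pton returns 1 on success and 0 for an unparsable string */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

With a `struct sockaddr_in addr;` on the stack, `make_ipv4_addr(&addr, "10.0.10.3", 4711)` fills it in and `&addr` can be passed wherever the malloc'ed pointer was used before, which also avoids allocating the structure twice.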
>
> The last g_critical results in the Unknown Error -8.
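One thing worth ruling out in the snippet above: event and event_queue_err_entry are declared as pointers, and if they are really passed on without pointing at any storage (the definitions were shortened for brevity, so this may not be the case), the read calls would write through an indeterminate pointer. The usual pattern is to declare the objects themselves and pass their addresses. A libfabric-free sketch of that pattern follows; struct err_entry, read_err and read_one are stand-ins invented for illustration and are not libfabric types or calls:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an error record; NOT the real struct fi_eq_err_entry. */
struct err_entry {
    int err;
    int prov_errno;
};

/* Stand-in for a "read into caller-provided buffer" call.
 * The callee writes through both pointers, so they must point at
 * valid storage owned by the caller. Returns 0 on success. */
static int read_err(uint32_t *event, struct err_entry *entry)
{
    assert(event != NULL && entry != NULL);
    *event = 1;               /* pretend event code */
    entry->err = 8;           /* pretend error values */
    entry->prov_errno = -8;
    return 0;
}

/* Caller side: the objects live on the stack, their addresses are passed. */
static int read_one(struct err_entry *out)
{
    uint32_t event = 0;                 /* an object, not a bare pointer */
    struct err_entry entry;
    memset(&entry, 0, sizeof(entry));   /* start from a known state */
    if (read_err(&event, &entry) != 0)
        return -1;
    *out = entry;
    return 0;
}
```

Applied to the code above, this would mean declaring `uint32_t event;` and `struct fi_eq_err_entry event_queue_err_entry;`, then passing `&event` to fi_eq_sread and `&event_queue_err_entry` to fi_eq_readerr.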
>
>
> 'fi_info -p verbs -P 4711 -n 10.0.10.3 -t FI_EP_MSG -a FI_SOCKADDR_IN 
> -v'* results in:
>
> fi_info:
>     caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, 
> FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM ]
>     mode: [ FI_RX_CQ_DATA ]
>     addr_format: FI_SOCKADDR_IN
>     src_addrlen: 16
>     dest_addrlen: 16
>     src_addr: fi_sockaddr_in://10.0.10.4:0
>     dest_addr: fi_sockaddr_in://10.0.10.3:4711
>     handle: (nil)
>     fi_tx_attr:
>         caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
>         mode: [  ]
>         op_flags: [  ]
>         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, 
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, 
> FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, 
> FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
>         comp_order: [ FI_ORDER_STRICT ]
>         inject_size: 256
>         size: 384
>         iov_limit: 4
>         rma_iov_limit: 1
>     fi_rx_attr:
>         caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_RECV, FI_REMOTE_READ, 
> FI_REMOTE_WRITE ]
>         mode: [ FI_RX_CQ_DATA ]
>         op_flags: [  ]
>         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, 
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, 
> FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, 
> FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
>         comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
>         total_buffered_recv: 0
>         size: 384
>         iov_limit: 4
>     fi_ep_attr:
>         type: FI_EP_MSG
>         protocol: FI_PROTO_RDMA_CM_IB_RC
>         protocol_version: 1
>         max_msg_size: 1073741824
>         msg_prefix_size: 0
>         max_order_raw_size: 1073741824
>         max_order_war_size: 0
>         max_order_waw_size: 1073741824
>         mem_tag_format: 0x0000000000000000
>         tx_ctx_cnt: 1
>         rx_ctx_cnt: 1
>         auth_key_size: 0
>     fi_domain_attr:
>         domain: 0x0
>         name: mlx4_0
>         threading: FI_THREAD_SAFE
>         control_progress: FI_PROGRESS_AUTO
>         data_progress: FI_PROGRESS_AUTO
>         resource_mgmt: FI_RM_ENABLED
>         av_type: FI_AV_UNSPEC
>         mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, 
> FI_MR_PROV_KEY ]
>         mr_key_size: 4
>         cq_data_size: 4
>         cq_cnt: 65408
>         ep_cnt: 163768
>         tx_ctx_cnt: 1024
>         rx_ctx_cnt: 1024
>         max_ep_tx_ctx: 1024
>         max_ep_rx_ctx: 1024
>         max_ep_stx_ctx: 0
>         max_ep_srx_ctx: 65472
>         cntr_cnt: 0
>         mr_iov_limit: 1
>     caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
>     mode: [  ]
>         auth_key_size: 0
>         max_err_data: 255
>         mr_cnt: 524032
>     fi_fabric_attr:
>         name: IB-0xfe80000000000000
>         prov_name: verbs
>         prov_version: 111.0
>         api_version: 1.11
>     fid_nic:
>         fi_device_attr:
>             name: mlx4_0
>             device_id: 0x1003
>             device_version: 1
>             vendor_id: 0x02c9
>             driver: (null)
>             firmware: 2.31.5050
>         fi_bus_attr:
>             fi_bus_type: FI_BUS_UNKNOWN
>         fi_link_attr:
>             address: (null)
>             mtu: 4096
>             speed: 32000000000
>             state: FI_LINK_UP
>             network_type: InfiniBand
>
>
> *10.0.10.3 is the IP address of the server.
>
>
> Regarding environment variables: I haven't changed the default values.
>
> I will see tomorrow whether FI_LOG_LEVEL="debug" gives any useful information.
>
>
> Greetings, Arne.
>
>
> On 26.10.20 at 21:24, Arne wrote:
>> Sorry to press the issue, but the due date on which I have to present
>> some results is coming closer.
>>
>>
>> By now I have a setup which should result in a libfabric
>> installation with debug mode enabled.
>>
>> Could the debug mode output help us troubleshoot the problem with the
>> fi_connect call?
>>
>>
>> Are there any ideas why fi_connect leads to an error entry containing
>> Unknown Error -8 on the respective eq of the ep?
>>
>>
>> Greetings,
>>
>> Arne.
>>
>>
>> On 25.10.20 at 16:50, Arne wrote:
>>> Hey, got additional questions regarding the usage of the 
>>> verbs-provider.
>>>
>>>
>>> Are there any details or configuration settings to be aware of when
>>> sending a connection request with fi_connect from an endpoint on the
>>> verbs provider to its peer, or when reading the respective eq?
>>>
>>>
>>> Reading the eq afterwards gives Unknown Error -8 from fi_eq_strerror 
>>> (so maybe Exec format error?).
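A side note on the wording "Unknown error -8": this is the form the C library's strerror() typically produces when it is handed a negative number, whereas the positive value 8 is ENOEXEC ("Exec format error") on Linux. A minimal sketch of folding the sign before the lookup, assuming glibc-style behaviour; describe_code is a hypothetical helper and not part of libfabric, and whether the provider really hands a negative value to strerror() is an assumption:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a libfabric call): fold the sign so that
 * both -8 and 8 map to the same errno-style message. */
static const char *describe_code(int code)
{
    assert(code != INT_MIN);      /* abs(INT_MIN) would overflow */
    return strerror(abs(code));
}
```

Here `describe_code(-8)` and `describe_code(8)` return the same text, while a direct `strerror(-8)` on glibc typically returns a string of the "Unknown error" form quoted above.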
>>>
>>>
>>> Log level info did not provide any useful information regarding the
>>> problem (the last message is verbs checking for suitable endpoint
>>> types, and since fi_endpoint did not return an error I assume it found
>>> the requested FI_EP_MSG).
>>>
>>> eq-attributes used:
>>>
>>>     eq_attr.size = eq_size;
>>>     eq_attr.flags = 0;
>>>     eq_attr.wait_obj = FI_WAIT_FD;
>>>     eq_attr.signaling_vector = 0;
>>>     eq_attr.wait_set = NULL;
>>>
>>>
>>> hints used are:
>>>
>>>     msg_hints = fi_allocinfo();
>>>     msg_hints->caps = FI_RMA | FI_MSG;
>>>     msg_hints->mode = FI_RX_CQ_DATA;
>>>     msg_hints->ep_attr->type = FI_EP_MSG;
>>>     msg_hints->addr_format = FI_SOCKADDR_IN;
>>>     msg_hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR 
>>> | FI_MR_ALLOCATED | FI_MR_PROV_KEY;
>>>     msg_hints->fabric_attr->prov_name = g_strdup("verbs");
>>>
>>>
>>> The address used as the target of fi_connect is the one
>>> corresponding to the InfiniBand IP of the server node, passed as
>>> FI_SOCKADDR_IN, if that is of relevance.
>>>
>>> The call adds some additional connection data to the request.
>>>
>>>
>>> On another note: I saw that the node parameter can't be NULL when
>>> creating fi_infos for active endpoints (I got an ENODATA, and a warning
>>> revealed that verbs couldn't resolve the address, so I replaced it
>>> with localhost).
>>>
>>> Is that true or do I have another error?
>>>
>>>
>>> Greetings, Arne.
>>>

