[libfabric-users] verbs questions, problem with fi_connect

Arne arnestruck at astruck.de
Tue Oct 27 13:46:13 PDT 2020


Hello, it is me again.

Maybe some code will help a more experienced person spot my mistake
(I mostly skipped irrelevant error handling and combined definition and
initialization for brevity):


fi_getinfo(FI_VERSION(1, 11),
                "10.0.10.4",
                "4711",
                FI_NUMERICHOST,
                msg_hints,
                &info);

That info is used to create the fabric, domain, and endpoint (ep) on the
client side; the msg_hints entries are listed in my earlier message
(quoted below).


From the connection function (connection_data is a 38-byte buffer for a
UUID; ep and event_queue are associated):

uint32_t* event;
size_t event_entry_size = sizeof(struct fi_eq_cm_entry) + 128;
void* event_entry = malloc(event_entry_size);
struct fi_eq_err_entry* event_queue_err_entry;
struct sockaddr_in* address = malloc(sizeof(struct sockaddr_in));

address->sin_family = AF_INET;
address->sin_port = htons(4711);
inet_aton("10.0.10.3", &address->sin_addr);

fi_connect(ep, address, (void*)con_data, sizeof(struct JConData));
error = fi_eq_sread(event_queue, event, event_entry, event_entry_size, -1, 0);
if (error < 0)
{
    if (error == -FI_EAVAIL)
    {
        error = fi_eq_readerr(event_queue, event_queue_err_entry, 0);
        if (error < 0)
        {
            g_critical("\nError occurred reading Eq.\nDetails:\n%s",
                       fi_strerror(abs(error)));
        }
        else
        {
            g_critical("\nError Message on Event Queue.\nDetails:\n%s",
                       fi_eq_strerror(event_queue,
                                      event_queue_err_entry->prov_errno,
                                      event_queue_err_entry->err_data,
                                      NULL, 0));
        }
    }
}

The last g_critical prints "Unknown Error -8".
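
For reference, a minimal sketch of the address setup from the snippet
above, done without a heap allocation and with the result of the text-to-
address conversion checked (the snippet above does not check it).
make_ipv4_addr is a hypothetical helper name, not a libfabric function,
and inet_pton (POSIX) is swapped in for inet_aton here as an assumption
about what is acceptable on the target system:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): fill `out` with an IPv4
 * address and port. Returns 0 on success, -1 if `ip` is not a valid
 * dotted-quad string. */
int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    assert(ip != NULL && out != NULL);

    memset(out, 0, sizeof(*out));   /* also clears the sin_zero padding */
    out->sin_family = AF_INET;
    out->sin_port = htons(port);    /* port in network byte order */

    /* inet_pton returns 1 on success and 0 for a malformed string */
    if (inet_pton(AF_INET, ip, &out->sin_addr) != 1)
        return -1;

    return 0;
}
```

A caller would declare "struct sockaddr_in addr;", call
make_ipv4_addr("10.0.10.3", 4711, &addr), and only on a return value of 0
pass &addr as the address argument of fi_connect. This assumes the
endpoint was opened with addr_format FI_SOCKADDR_IN, as in the hints.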


'fi_info -p verbs -P 4711 -n 10.0.10.3 -t FI_EP_MSG -a FI_SOCKADDR_IN -v'*
results in:

fi_info:
     caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, 
FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM ]
     mode: [ FI_RX_CQ_DATA ]
     addr_format: FI_SOCKADDR_IN
     src_addrlen: 16
     dest_addrlen: 16
     src_addr: fi_sockaddr_in://10.0.10.4:0
     dest_addr: fi_sockaddr_in://10.0.10.3:4711
     handle: (nil)
     fi_tx_attr:
         caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
         mode: [  ]
         op_flags: [  ]
         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, 
FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, 
FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, 
FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
         comp_order: [ FI_ORDER_STRICT ]
         inject_size: 256
         size: 384
         iov_limit: 4
         rma_iov_limit: 1
     fi_rx_attr:
         caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_RECV, FI_REMOTE_READ, 
FI_REMOTE_WRITE ]
         mode: [ FI_RX_CQ_DATA ]
         op_flags: [  ]
         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, 
FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS, 
FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW, 
FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
         comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
         total_buffered_recv: 0
         size: 384
         iov_limit: 4
     fi_ep_attr:
         type: FI_EP_MSG
         protocol: FI_PROTO_RDMA_CM_IB_RC
         protocol_version: 1
         max_msg_size: 1073741824
         msg_prefix_size: 0
         max_order_raw_size: 1073741824
         max_order_war_size: 0
         max_order_waw_size: 1073741824
         mem_tag_format: 0x0000000000000000
         tx_ctx_cnt: 1
         rx_ctx_cnt: 1
         auth_key_size: 0
     fi_domain_attr:
         domain: 0x0
         name: mlx4_0
         threading: FI_THREAD_SAFE
         control_progress: FI_PROGRESS_AUTO
         data_progress: FI_PROGRESS_AUTO
         resource_mgmt: FI_RM_ENABLED
         av_type: FI_AV_UNSPEC
         mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, 
FI_MR_PROV_KEY ]
         mr_key_size: 4
         cq_data_size: 4
         cq_cnt: 65408
         ep_cnt: 163768
         tx_ctx_cnt: 1024
         rx_ctx_cnt: 1024
         max_ep_tx_ctx: 1024
         max_ep_rx_ctx: 1024
         max_ep_stx_ctx: 0
         max_ep_srx_ctx: 65472
         cntr_cnt: 0
         mr_iov_limit: 1
     caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
     mode: [  ]
         auth_key_size: 0
         max_err_data: 255
         mr_cnt: 524032
     fi_fabric_attr:
         name: IB-0xfe80000000000000
         prov_name: verbs
         prov_version: 111.0
         api_version: 1.11
     fid_nic:
         fi_device_attr:
             name: mlx4_0
             device_id: 0x1003
             device_version: 1
             vendor_id: 0x02c9
             driver: (null)
             firmware: 2.31.5050
         fi_bus_attr:
             fi_bus_type: FI_BUS_UNKNOWN
         fi_link_attr:
             address: (null)
             mtu: 4096
             speed: 32000000000
             state: FI_LINK_UP
             network_type: InfiniBand


*10.0.10.3 is the IP address of the server.


Regarding environment variables: I haven't changed the default values.

Tomorrow I will see whether FI_LOG_LEVEL="debug" gives any useful information.


Greetings, Arne.


Am 26.10.20 um 21:24 schrieb Arne:
> Sorry to press the issue, but the due date on which I am to present
> some results is coming closer.
>
>
> By now I have a setup which should result in a libfabric installation
> with debug mode enabled.
>
> Could debug mode output help us troubleshoot the problem with the
> fi_connect call?
>
>
> Are there any ideas why fi_connect leads to an error entry containing
> Unknown Error -8 on the respective eq of the ep?
>
>
> Greetings,
>
> Arne.
>
>
> Am 25.10.20 um 16:50 schrieb Arne:
>> Hey, I have some additional questions regarding the usage of the
>> verbs provider.
>>
>>
>> Are there any details or settings I need to be aware of when sending
>> an fi_connect request to connect an endpoint on the verbs provider to
>> its peer, or when reading the respective eq?
>>
>>
>> Reading the eq afterwards gives Unknown Error -8 from fi_eq_strerror 
>> (so maybe Exec format error?).
>>
>>
>> Log level info did not provide any useful information regarding the
>> problem (the last entry is verbs checking for suitable endpoint types,
>> and since fi_endpoint did not return an error I assume it found the
>> requested FI_EP_MSG).
>>
>> eq-attributes used:
>>
>>     eq_attr.size = eq_size;
>>     eq_attr.flags = 0;
>>     eq_attr.wait_obj = FI_WAIT_FD;
>>     eq_attr.signaling_vector = 0;
>>     eq_attr.wait_set = NULL;
>>
>>
>> hints used are:
>>
>>     msg_hints = fi_allocinfo();
>>     msg_hints->caps = FI_RMA | FI_MSG;
>>     msg_hints->mode = FI_RX_CQ_DATA;
>>     msg_hints->ep_attr->type = FI_EP_MSG;
>>     msg_hints->addr_format = FI_SOCKADDR_IN;
>>     msg_hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR | 
>> FI_MR_ALLOCATED | FI_MR_PROV_KEY;
>>     msg_hints->fabric_attr->prov_name = g_strdup("verbs");
>>
>>
>> The address used as the target of fi_connect is the one corresponding
>> to the InfiniBand IP of the server node, given as FI_SOCKADDR_IN, if
>> that is of relevance.
>>
>> The call adds some additional connection data to the request.
>>
>>
>> On another note: I saw that the node parameter can't be NULL when
>> creating fi_infos for active endpoints (I got an ENODATA, and a warning
>> revealed that verbs couldn't resolve the address, so I replaced it
>> with localhost).
>>
>> Is that true or do I have another error?
>>
>>
>> Greetings, Arne.
>>

