[libfabric-users] verbs questions, problem with fi_connect
Arne
arnestruck at astruck.de
Tue Oct 27 13:46:13 PDT 2020
Hello, it is me again.
Maybe some code can help a more experienced person spot my mistake
(I have mostly skipped irrelevant error handling and combined definition
and initialization for brevity):
fi_getinfo(FI_VERSION(1, 11),
"10.0.10.4",
"4711",
FI_NUMERICHOST,
msg_hints*,
&info);
That info is used to create the fabric, domain, and endpoint (ep) on the
client side; the msg_hints entries are listed in my previous message.
From the connection function (connection_data is a 38-byte buffer for a
UUID; ep and event_queue are associated):
uint32_t* event;
size_t event_entry_size = sizeof(struct fi_eq_cm_entry) + 128;
void* event_entry = malloc(event_entry_size);
struct fi_eq_err_entry* event_queue_err_entry;
struct sockaddr_in* address = malloc(sizeof(struct sockaddr_in));
address = malloc(sizeof(struct sockaddr_in));
address->sin_family = AF_INET;
address->sin_port = htons(4711);
inet_aton("10.0.10.3", &address->sin_addr);
fi_connect(ep, address, (void*)con_data, sizeof(struct JConData));
error = fi_eq_sread(event_queue, event, event_entry, event_entry_size, -1, 0);
if (error < 0)
{
    if (error == -FI_EAVAIL)
    {
        error = fi_eq_readerr(event_queue, event_queue_err_entry, 0);
        if (error < 0)
        {
            g_critical("\nError occurred reading Eq.\nDetails:\n%s",
                       fi_strerror(abs(error)));
        }
        else
        {
            g_critical("\nError Message on Event Queue.\nDetails:\n%s",
                       fi_eq_strerror(event_queue,
                                      event_queue_err_entry->prov_errno,
                                      event_queue_err_entry->err_data,
                                      NULL, 0));
        }
    }
}
The last g_critical call results in "Unknown Error -8".
'fi_info -p verbs -P 4711 -n 10.0.10.3 -t FI_EP_MSG -a FI_SOCKADDR_IN
-v'* results in:
fi_info:
    caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV,
        FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_LOCAL_COMM, FI_REMOTE_COMM ]
    mode: [ FI_RX_CQ_DATA ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 16
    src_addr: fi_sockaddr_in://10.0.10.4:0
    dest_addr: fi_sockaddr_in://10.0.10.3:4711
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
        mode: [ ]
        op_flags: [ ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS,
            FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS,
            FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW,
            FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_STRICT ]
        inject_size: 256
        size: 384
        iov_limit: 4
        rma_iov_limit: 1
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_ATOMIC, FI_RECV, FI_REMOTE_READ,
            FI_REMOTE_WRITE ]
        mode: [ FI_RX_CQ_DATA ]
        op_flags: [ ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS,
            FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAW, FI_ORDER_SAS,
            FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAW,
            FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAW ]
        comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
        total_buffered_recv: 0
        size: 384
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_MSG
        protocol: FI_PROTO_RDMA_CM_IB_RC
        protocol_version: 1
        max_msg_size: 1073741824
        msg_prefix_size: 0
        max_order_raw_size: 1073741824
        max_order_war_size: 0
        max_order_waw_size: 1073741824
        mem_tag_format: 0x0000000000000000
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: mlx4_0
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_AUTO
        data_progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_LOCAL, FI_MR_VIRT_ADDR, FI_MR_ALLOCATED,
            FI_MR_PROV_KEY ]
        mr_key_size: 4
        cq_data_size: 4
        cq_cnt: 65408
        ep_cnt: 163768
        tx_ctx_cnt: 1024
        rx_ctx_cnt: 1024
        max_ep_tx_ctx: 1024
        max_ep_rx_ctx: 1024
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 65472
        cntr_cnt: 0
        mr_iov_limit: 1
        caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
        mode: [ ]
        auth_key_size: 0
        max_err_data: 255
        mr_cnt: 524032
    fi_fabric_attr:
        name: IB-0xfe80000000000000
        prov_name: verbs
        prov_version: 111.0
        api_version: 1.11
    fid_nic:
        fi_device_attr:
            name: mlx4_0
            device_id: 0x1003
            device_version: 1
            vendor_id: 0x02c9
            driver: (null)
            firmware: 2.31.5050
        fi_bus_attr:
            fi_bus_type: FI_BUS_UNKNOWN
        fi_link_attr:
            address: (null)
            mtu: 4096
            speed: 32000000000
            state: FI_LINK_UP
            network_type: InfiniBand
*10.0.10.3 is the IP address of the server.
Regarding environment variables: I haven't changed the default values.
I will see tomorrow whether FI_LOG_LEVEL="debug" gives some useful
information.
Greetings, Arne.
On 26.10.20 at 21:24, Arne wrote:
> Sorry to press the issue, but the due date when I have to present some
> results is coming closer.
>
>
> By now I have a setup which should result in a libfabric installation
> with debug mode enabled.
>
> Could debug mode output help us troubleshoot the problem with the
> fi_connect call?
>
>
> Are there any ideas why fi_connect returns an error entry containing
> Unknown Error -8 to the respective eq of the ep?
>
>
> Greetings,
>
> Arne.
>
>
> On 25.10.20 at 16:50, Arne wrote:
>> Hey, I have additional questions regarding the usage of the verbs provider.
>>
>>
>> Are there any details/configs to take into account when sending a
>> fi_connect request to connect an endpoint on the verbs provider to
>> its peer, or when reading the respective eq?
>>
>>
>> Reading the eq afterwards gives Unknown Error -8 from fi_eq_strerror
>> (so maybe Exec format error?).
>>
>>
>> Debug-level logging did not provide any useful info regarding the
>> problem (the last entry is verbs checking for suitable endpoint types,
>> and since fi_endpoint did not return an error I assume it found the
>> requested FI_EP_MSG).
>>
>> eq-attributes used:
>>
>> eq_attr.size = eq_size;
>> eq_attr.flags = 0;
>> eq_attr.wait_obj = FI_WAIT_FD;
>> eq_attr.signaling_vector = 0;
>> eq_attr.wait_set = NULL;
>>
>>
>> hints used are:
>>
>> msg_hints = fi_allocinfo();
>> msg_hints->caps = FI_RMA | FI_MSG;
>> msg_hints->mode = FI_RX_CQ_DATA;
>> msg_hints->ep_attr->type = FI_EP_MSG;
>> msg_hints->addr_format = FI_SOCKADDR_IN;
>> msg_hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR |
>> FI_MR_ALLOCATED | FI_MR_PROV_KEY;
>> msg_hints->fabric_attr->prov_name = g_strdup("verbs");
>>
>>
>> The address used as the target of fi_connect is the one corresponding
>> to the InfiniBand IP of the server node, in FI_SOCKADDR_IN format, if
>> that is of relevance.
>>
>> The call adds some additional connection data to the request.
>>
>>
>> On another note: I saw that the node parameter can't be NULL when
>> creating fi_infos for active endpoints (I got an ENODATA, and a warning
>> revealed that verbs couldn't resolve the address, so I replaced it
>> with localhost).
>>
>> Is that true, or do I have another error?
>>
>>
>> Greetings, Arne.
>>