[libfabric-users] verbs questions, problem with fi_connect
sean.hefty at intel.com
Mon Nov 2 14:34:07 PST 2020
I was on vacation last week, so I did not see this until today.
> Maybe some code can help a more experienced person spot my mistake
> (I mostly skipped irrelevant error handling and combined definition and
> init for brevity):
> fi_getinfo(FI_VERSION(1, 11),
> That info is used for creating fabric, domain, and endpoint (ep) of the
> client side, msg_hints entries in prior message.
> from connection function (connection_data is 38 byte Buffer for a UUID,
> ep and event_queue are associated):
> uint32_t* event;
> size_t event_entry_size = sizeof(struct fi_eq_cm_entry) + 128;
> void* event_entry = malloc(event_entry_size);
> struct fi_eq_err_entry* event_queue_err_entry;
> struct sockaddr_in* address = malloc(sizeof(struct sockaddr_in));
> address = malloc(sizeof(struct sockaddr_in));
> address->sin_family = AF_INET;
> address->sin_port = htons(4711);
> inet_aton("10.0.10.3", &address->sin_addr);
Btw, you should be able to use fi_info->dest_addr from the output of fi_getinfo() instead of building the address by hand.
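If you do build the address by hand, a minimal sketch could look like the one below. It assumes plain IPv4, and make_ipv4_addr is a hypothetical helper name, not part of libfabric. It fills a caller-provided struct instead of calling malloc() twice as the quoted snippet does (the first allocation there is leaked), and it checks the result of inet_pton(), whereas the quoted code ignores the result of inet_aton().

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *out with an IPv4 address and port.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad string. */
int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    assert(ip != NULL && out != NULL);

    memset(out, 0, sizeof(*out));      /* clear padding such as sin_zero */
    out->sin_family = AF_INET;
    out->sin_port = htons(port);       /* network byte order */

    /* inet_pton() returns 1 only when the string was parsed successfully */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

With that, the address can live on the stack: declare a struct sockaddr_in, call make_ipv4_addr("10.0.10.3", 4711, &addr), and pass &addr to fi_connect() only if the helper returned 0.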
> fi_connect(ep, address, (void*)con_data, sizeof(struct JConData));
You can also pass in fi_info->dest_addrlen.
> error = fi_eq_sread(event_queue, event, event_entry, event_entry_size,
> -1, 0);
> if (error < 0)
> if (error == -FI_EAVAIL)
> error = fi_eq_readerr(event_queue, event_queue_err_entry, 0);
Be sure event_queue_err_entry points to allocated memory, and that its err_data and err_data_size fields are set correctly before the call.
Before calling fi_connect(), ensure that the server is listening.
> >> The address used as target of the fi_connect is the one corresponding
> >> to the infiniband IP of the server node and as FI_SOCKADDR_IN if that
> >> is of relevance.
Yes, this is relevant, and it is actually required in order to use the verbs provider.
> >> The call adds some additional connection data to the request.
> >> On another note: I saw that the node parameter can't be NULL for
> >> creating fi_infos for active endpoints (got an ENODATA and a warning
> >> revealed that verbs couldn't resolve the address, so I replaced it
> >> with localhost).
I didn't follow this. Once you call fi_getinfo(), you should use the returned fi_info structure to create the endpoint.