[libfabric-users] troubled by FI_SOURCE use
Biddiscombe, John A.
biddisco at cscs.ch
Tue Mar 12 15:55:20 PDT 2019
Sean
>Can you confirm the local EP address? The send below strongly indicates that its port 11111, but can you verify?
OK. Process 0: I used 11111 for simplicity, but when I print the fi_info I get the following:
caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
mode: [ ]
addr_format: FI_SOCKADDR_IN
src_addrlen: 16
dest_addrlen: 0
src_addr: fi_sockaddr_in://127.0.0.1:58910
dest_addr: (null)
(The only odd thing here is that I ask for port 7910 but it comes out as 58910, and the same happens for other port numbers. There must be a byte swap going on somewhere, but that doesn't appear to be a problem.)
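The 7910/58910 pair is exactly what a 16-bit byte swap produces: 7910 is 0x1EE6 and 58910 is 0xE61E. A minimal sketch of that arithmetic, assuming the port was written into (or read back from) `sin_port` somewhere without `htons`/`ntohs`. The helper names `swap16` and `make_loopback_addr` are hypothetical and are not part of libfabric:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: exchange the two bytes of a 16-bit value.
 * On a little-endian host this is the effect of htons()/ntohs(). */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)(((v & 0xffu) << 8) | (v >> 8));
}

/* Hypothetical helper: build 127.0.0.1:<host_port> with the port stored
 * in network byte order, as struct sockaddr_in expects. */
static struct sockaddr_in make_loopback_addr(uint16_t host_port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(host_port);              /* host -> network order */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1 */

    /* Round trip must give back the port that was asked for. */
    assert(ntohs(addr.sin_port) == host_port);
    return addr;
}
```

Here `swap16(7910)` evaluates to 58910, while `ntohs(make_loopback_addr(7910).sin_port)` gives 7910 regardless of host byte order.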
Process 1 : I know that process 0 will always be on 127.0.0.1:7910 and I get
caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
mode: [ ]
addr_format: FI_SOCKADDR_IN
src_addrlen: 16
dest_addrlen: 16
src_addr: fi_sockaddr_in://127.0.0.1:0
dest_addr: fi_sockaddr_in://127.0.0.1:58910
and then when I call fi_getname after enabling the endpoint I get
127.0.0.1:43163
and this is the address that I send to process 0.
Process 0 receives this correctly. It already has address vector entry 0, 127.0.0.1:58910, and then I add entry 1, 127.0.0.1:43163, so it has itself as entry 0 and the new client as entry 1.
(note that if this 43163 is byte munged and should really be something else, maybe this could be a problem now?)
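One way to rule out byte munging in a printed port is to format the address buffer with an explicit `ntohs`. A minimal sketch, assuming the buffer filled in by fi_getname holds a `struct sockaddr_in` (the addr_format reported above is FI_SOCKADDR_IN). `format_sockaddr_in` is a hypothetical helper, not a libfabric call:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render an IPv4 socket address as "a.b.c.d:port",
 * converting the port from network to host byte order first.
 * Returns 0 on success, -1 if conversion fails or the buffer is too small. */
static int format_sockaddr_in(const struct sockaddr_in *addr,
                              char *out, size_t len)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    assert(addr != NULL && out != NULL);

    if (inet_ntop(AF_INET, &addr->sin_addr, ip, sizeof(ip)) == NULL)
        return -1;

    n = snprintf(out, len, "%s:%u", ip, (unsigned)ntohs(addr->sin_port));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

If the string produced this way matches the port the peer actually connects to, the value is in host order and no swap is involved.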
> Are the addresses inserted as 22222 then 11111, or in the other order? Is this AV map or table?
It is an AV map, and both nodes add 127.0.0.1:58910 (process 0) as entry 0 and then process 1 as entry 1.
> Is 22222 the only entry in this AV?
No. Now 11111 is 0, and 22222 is 1 (or 58910 and 43163 in the real example)
>Process 1 is getting a receive completion, not send?
Process 1 sends the address to process 0:
Process 1 gets a send completion
Process 0 sends "hello" back to process 1
Process 0 gets a send completion
Process 0 gets a receive with "hello" in it
>Are you using rxm+tcp for this?
Here is the full fi_getinfo output for process 0:
caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
mode: [ ]
addr_format: FI_SOCKADDR_IN
src_addrlen: 16
dest_addrlen: 0
src_addr: fi_sockaddr_in://127.0.0.1:58910
dest_addr: (null)
handle: (nil)
fi_tx_attr:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [ ]
    op_flags: [ FI_COMPLETION, FI_TRANSMIT_COMPLETE ]
    msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
    comp_order: [ FI_ORDER_NONE ]
    inject_size: 255
    size: 376
    iov_limit: 8
    rma_iov_limit: 8
fi_rx_attr:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [ ]
    op_flags: [ FI_COMPLETION ]
    msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
    comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
    total_buffered_recv: 67108864
    size: 376
    iov_limit: 8
fi_ep_attr:
    type: FI_EP_RDM
    protocol: FI_PROTO_SOCK_TCP
    protocol_version: 2
    max_msg_size: 18446744073709547519
    msg_prefix_size: 0
    max_order_raw_size: 18446744073709547519
    max_order_war_size: 18446744073709547519
    max_order_waw_size: 18446744073709547519
    mem_tag_format: 0xaaaaaaaaaaaaaaaa
    tx_ctx_cnt: 16
    rx_ctx_cnt: 16
    auth_key_size: 0
fi_domain_attr:
    domain: 0x0
    name: lo
    threading: FI_THREAD_SAFE
    control_progress: FI_PROGRESS_MANUAL
    data_progress: FI_PROGRESS_MANUAL
    resource_mgmt: FI_RM_ENABLED
    av_type: FI_AV_UNSPEC
    mr_mode: [ FI_MR_BASIC ]
    mr_key_size: 8
    cq_data_size: 8
    cq_cnt: 32
    ep_cnt: 128
    tx_ctx_cnt: 16
    rx_ctx_cnt: 16
    max_ep_tx_ctx: 16
    max_ep_rx_ctx: 16
    max_ep_stx_ctx: 128
    max_ep_srx_ctx: 128
    cntr_cnt: 128
    mr_iov_limit: 8
    caps: [ ]
    mode: [ ]
    auth_key_size: 0
    max_err_data: 0
    mr_cnt: 0
fi_fabric_attr:
    name: 127.0.0.0/8
    prov_name: sockets
    prov_version: 2.0
    api_version: 1.4
nic_fid: (nil)
>This sounds like some error in the setup. Do you have a pointer to the test code that we could run/look at?
I could push it to GitHub, but it is a shocking mess. I wanted to throw away all the connection-based code (left over from several years ago when I last looked at this) and use the connectionless code instead, so there are ifdefs everywhere, and PMI bootstrap functions (which work lovely on the Cray, btw) are in there as well. It might be tricky for you to understand.
Thanks for taking time to respond
JB