[libfabric-users] troubled by FI_SOURCE use

Hefty, Sean sean.hefty at intel.com
Tue Mar 12 16:18:26 PDT 2019


> > Can you confirm the local EP address?  The send below strongly indicates that
> > it's port 11111, but can you verify?
> 
> ok. Process 0 : I used 11111 for simplicity, but when I print the fi_info I
> get the
>     caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ,
> FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM,
> FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
>     mode: [  ]
>     addr_format: FI_SOCKADDR_IN
>     src_addrlen: 16
>     dest_addrlen: 0
>     src_addr: fi_sockaddr_in://127.0.0.1:58910
>     dest_addr: (null)
> (the only odd thing here is that I ask for port 7910, but it comes out as
> 58910, same for other port numbers, must be a byte swap thing going on
> somewhere, but that doesn't appear to be a problem)

Sockaddr fields are stored in network byte order.  The printing functions convert from network order to host order, so they display the value that is actually in use.  If you want port 7910, you need to assign it as htons(7910).
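
To make the byte-order point concrete, here is a minimal sketch in plain C.  The helper names (make_loopback_addr, displayed_port, swap16) are made up for illustration and are not libfabric calls; how the address is then handed to libfabric is left out.

```c
#include <arpa/inet.h>   /* htons, ntohs, htonl */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in, INADDR_LOOPBACK */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>  /* AF_INET */

/* Fill in 127.0.0.1:<host_port>, storing the port in network byte order. */
void make_loopback_addr(struct sockaddr_in *sin, uint16_t host_port)
{
    assert(sin != NULL);
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(host_port);              /* host -> network order */
    sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
}

/* The port as a printing routine would show it: network -> host order. */
uint16_t displayed_port(const struct sockaddr_in *sin)
{
    assert(sin != NULL);
    return ntohs(sin->sin_port);
}

/* Unconditional 16-bit byte swap.  On a little-endian host this is what
 * ntohs() does, so it shows what a port stored without htons() reads as. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

With this, make_loopback_addr(&sin, 7910) reads back as 7910, while swap16(7910) is 58910, which matches the port reported in the fi_info output on a little-endian machine.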

> Process 1 : I know that process 0 will always be on 127.0.0.1:7910 and I get
>     caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ,
> FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM,
> FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
>     mode: [  ]
>     addr_format: FI_SOCKADDR_IN
>     src_addrlen: 16
>     dest_addrlen: 16
>     src_addr: fi_sockaddr_in://127.0.0.1:0
>     dest_addr: fi_sockaddr_in://127.0.0.1:58910
> and then when I call fi_getname after enabling the endpoint I get
>       127.0.0.1:43163
> and this is the address that I send to process 0:
> 
> process 0 receives this correctly and it has address vector entry 0
> 127.0.0.1:58910 and then I add entry 1 127.0.0.1:43163  so it has itself as
> entry 0, and the new client as entry 1
> (note that if this 43163 is byte munged and should really be something else,
> maybe this could be a problem now?)

If you are exchanging the addresses by calling fi_getname() and sending the raw data to the peer, it should work fine.  That is what fabtests does as well.
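
The key point is that the address is an opaque blob: whatever fi_getname() writes into the buffer is sent unchanged, and the receiver passes the same bytes to fi_av_insert().  A minimal sketch of one way to frame that blob in a message follows.  The pack_addr/unpack_addr helpers, the ADDR_MAX limit and the one-byte length prefix are assumptions for illustration; they are not part of libfabric, and this is not necessarily how fabtests frames it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ADDR_MAX 64   /* hypothetical upper bound on the address size */

/* Pack an opaque address (for example, the bytes filled in by fi_getname())
 * as [1-byte length][raw bytes].  Returns bytes written, or 0 on error. */
size_t pack_addr(uint8_t *msg, size_t msg_cap, const void *addr, size_t addrlen)
{
    assert(msg != NULL && addr != NULL);
    if (addrlen == 0 || addrlen > ADDR_MAX || msg_cap < addrlen + 1)
        return 0;
    msg[0] = (uint8_t)addrlen;
    memcpy(msg + 1, addr, addrlen);   /* copied verbatim, no byte swapping */
    return addrlen + 1;
}

/* Unpack a message produced by pack_addr().  Returns the address length,
 * or 0 if the message is malformed or the output buffer is too small.
 * The resulting bytes are what the receiver would hand to fi_av_insert(). */
size_t unpack_addr(const uint8_t *msg, size_t msg_len, void *addr, size_t addr_cap)
{
    size_t addrlen;

    assert(msg != NULL && addr != NULL);
    if (msg_len < 1)
        return 0;
    addrlen = msg[0];
    if (addrlen == 0 || addrlen > addr_cap || msg_len < addrlen + 1)
        return 0;
    memcpy(addr, msg + 1, addrlen);
    return addrlen;
}
```

Because nothing in this path interprets the bytes, the port inside the blob stays in network order end to end, so a value such as 43163 printed from fi_getname() output needs no conversion before it is sent.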

> > Are the addresses inserted as 22222 then 11111, or in the other order?  Is
> > this AV map or table?
> it is av MAP and both nodes add 127.0.0.1:58910 (process 0) as the zero entry
> and then process 1 as the 1 entry.

AV map returns an fi_addr_t for each address that is inserted.  Please make sure that you're using the value returned from the correct insertion call for the transfer.  If you switch to AV table, you can use a simple index.

(Note that the provider may always return an index for the fi_addr_t value, but that’s an implementation detail and not an API requirement).
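
One way to avoid mixing up the values is to keep a small table, indexed by your own rank numbering, that stores whatever each insertion returned.  The sketch below is plain C with stand-ins: peer_handle_t plays the role of fi_addr_t, PEER_UNSET is a made-up sentinel (libfabric has its own FI_ADDR_UNSPEC), and the peer_table_* helpers are hypothetical, not libfabric API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t peer_handle_t;      /* stand-in for fi_addr_t */
#define PEER_UNSET UINT64_MAX        /* hypothetical "no address yet" marker */
#define MAX_PEERS  8                 /* hypothetical table size */

struct peer_table {
    peer_handle_t handle[MAX_PEERS]; /* indexed by application rank */
};

/* Mark every slot as empty. */
void peer_table_init(struct peer_table *t)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < MAX_PEERS; i++)
        t->handle[i] = PEER_UNSET;
}

/* Record the value the AV returned when this rank's address was inserted.
 * Returns 0 on success, -1 if the rank is out of range. */
int peer_table_set(struct peer_table *t, size_t rank, peer_handle_t h)
{
    assert(t != NULL);
    if (rank >= MAX_PEERS)
        return -1;
    t->handle[rank] = h;
    return 0;
}

/* Look up the value to pass as the destination of a send to this rank. */
peer_handle_t peer_table_get(const struct peer_table *t, size_t rank)
{
    assert(t != NULL);
    return rank < MAX_PEERS ? t->handle[rank] : PEER_UNSET;
}
```

The send path then uses peer_table_get(&peers, 1) rather than the literal 1, so it keeps working even if an AV map returns values that are not small indices.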

> > Is 22222 the only entry in this AV?
> No. Now 11111 is 0, and 22222 is 1 (or 58910 and 43163 in the real example)
> 
> >Process 1 is getting a receive completion, not send?
> Process 1 sends the address to process 0:
> Process 1 gets a send completion
> Process 0 sends "hello" back to process 1
> Process 0 gets a send completion
> process 0 gets a receive with "hello" in it
> 
> >Are you using rxm+tcp for this?
> here is the full get_info for process 0
> 
>     caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ,
> FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM,
> FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
>     mode: [  ]
>     addr_format: FI_SOCKADDR_IN
>     src_addrlen: 16
>     dest_addrlen: 0
>     src_addr: fi_sockaddr_in://127.0.0.1:58910
>     dest_addr: (null)
>     handle: (nil)
>     fi_tx_attr:
>         caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ,
> FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM,
> FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
>         mode: [  ]
>         op_flags: [ FI_COMPLETION, FI_TRANSMIT_COMPLETE ]
>         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR,
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
>         comp_order: [ FI_ORDER_NONE ]
>         inject_size: 255
>         size: 376
>         iov_limit: 8
>         rma_iov_limit: 8
>     fi_rx_attr:
>         caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ,
> FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM,
> FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
>         mode: [  ]
>         op_flags: [ FI_COMPLETION ]
>         msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR,
> FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
>         comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
>         total_buffered_recv: 67108864
>         size: 376
>         iov_limit: 8
>     fi_ep_attr:
>         type: FI_EP_RDM
>         protocol: FI_PROTO_SOCK_TCP
>         protocol_version: 2
>         max_msg_size: 18446744073709547519
>         msg_prefix_size: 0
>         max_order_raw_size: 18446744073709547519
>         max_order_war_size: 18446744073709547519
>         max_order_waw_size: 18446744073709547519
>         mem_tag_format: 0xaaaaaaaaaaaaaaaa
>         tx_ctx_cnt: 16
>         rx_ctx_cnt: 16
>         auth_key_size: 0
>     fi_domain_attr:
>         domain: 0x0
>         name: lo
>         threading: FI_THREAD_SAFE
>         control_progress: FI_PROGRESS_MANUAL
>         data_progress: FI_PROGRESS_MANUAL
>         resource_mgmt: FI_RM_ENABLED
>         av_type: FI_AV_UNSPEC
>         mr_mode: [ FI_MR_BASIC ]
>         mr_key_size: 8
>         cq_data_size: 8
>         cq_cnt: 32
>         ep_cnt: 128
>         tx_ctx_cnt: 16
>         rx_ctx_cnt: 16
>         max_ep_tx_ctx: 16
>         max_ep_rx_ctx: 16
>         max_ep_stx_ctx: 128
>         max_ep_srx_ctx: 128
>         cntr_cnt: 128
>         mr_iov_limit: 8
>     caps: [  ]
>     mode: [  ]
>         auth_key_size: 0
>         max_err_data: 0
>         mr_cnt: 0
>     fi_fabric_attr:
>         name: 127.0.0.0/8
>         prov_name: sockets
>         prov_version: 2.0
>         api_version: 1.4
>     nic_fid: (nil)

This is using the sockets provider.  Can you update to a newer version of libfabric?

- Sean


