[libfabric-users] troubled by FI_SOURCE use

Biddiscombe, John A. biddisco at cscs.ch
Tue Mar 12 15:55:20 PDT 2019


Sean

>Can you confirm the local EP address?  The send below strongly indicates that its port 11111, but can you verify?

OK. Process 0: I used 11111 for simplicity, but when I print the fi_info I get:
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [  ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 0
    src_addr: fi_sockaddr_in://127.0.0.1:58910
    dest_addr: (null)
(The only odd thing here is that I ask for port 7910 but it comes out as 58910, and the same happens for other port numbers. There must be a byte swap going on somewhere, but that doesn't appear to be a problem.)
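
The two port numbers are consistent with a 16-bit byte swap: 7910 is 0x1EE6 and 58910 is 0xE61E. One possible explanation, which is an assumption and not something confirmed in this thread, is that the port was stored in sin_port without htons(). A minimal sketch of filling the address in network byte order follows; make_loopback_addr and swap16 are hypothetical helpers, not libfabric functions.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): fill a loopback sockaddr_in
 * suitable for an FI_SOCKADDR_IN src_addr or dest_addr. */
static void make_loopback_addr(struct sockaddr_in *sin, uint16_t host_port)
{
    assert(sin != NULL);
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(host_port);              /* host -> network order */
    sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1 */
}

/* Swap the two bytes of a 16-bit value, to show what a port looks like
 * when the byte-order conversion is skipped on a little-endian host. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Here swap16(7910) evaluates to 58910, which matches the value printed in the fi_info dump above.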

Process 1: I know that process 0 will always be on 127.0.0.1:7910, and I get
    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [  ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 16
    src_addr: fi_sockaddr_in://127.0.0.1:0
    dest_addr: fi_sockaddr_in://127.0.0.1:58910
and then when I call fi_getname after enabling the endpoint I get
      127.0.0.1:43163
and this is the address that I send to process 0.

Process 0 receives this correctly. It already has address vector entry 0, 127.0.0.1:58910, and then I add entry 1, 127.0.0.1:43163, so it has itself as entry 0 and the new client as entry 1.
(Note that if this 43163 is byte swapped and should really be something else, maybe that could be the problem now?)
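
One way to rule out a display-side byte swap is to print the address with an explicit ntohs() on the port. The sketch below assumes the buffer returned by fi_getname is a struct sockaddr_in, as the FI_SOCKADDR_IN addr_format above suggests; format_sockaddr_in is a hypothetical helper, not a libfabric function.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>

/* Hypothetical helper (not part of libfabric): format an IPv4 socket
 * address as "a.b.c.d:port", converting the port from network to host
 * byte order. Returns 0 on success, -1 on failure. */
static int format_sockaddr_in(const struct sockaddr_in *sin,
                              char *out, size_t len)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    assert(sin != NULL && out != NULL);
    if (sin->sin_family != AF_INET)
        return -1;
    if (inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip)) == NULL)
        return -1;
    n = snprintf(out, len, "%s:%u", ip, (unsigned)ntohs(sin->sin_port));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

If the string produced this way for the fi_getname result differs from the one printed by the application's own logging, the difference points at the logging code and not at the address that is sent over the wire.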

> Are the addresses inserted as 22222 then 11111, or in the other order?  Is this AV map or table?
It is an AV map, and both nodes add 127.0.0.1:58910 (process 0) as entry 0 and then process 1 as entry 1.


> Is 22222 the only entry in this AV?
No. Now 11111 is 0, and 22222 is 1 (or 58910 and 43163 in the real example)

>Process 1 is getting a receive completion, not send?
Process 1 sends the address to process 0:
Process 1 gets a send completion
Process 0 sends "hello" back to process 1
Process 0 gets a send completion
Process 0 gets a receive with "hello" in it

>Are you using rxm+tcp for this?
Here is the full get_info output for process 0:

    caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
    mode: [  ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 0
    src_addr: fi_sockaddr_in://127.0.0.1:58910
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
        mode: [  ]
        op_flags: [ FI_COMPLETION, FI_TRANSMIT_COMPLETE ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
        comp_order: [ FI_ORDER_NONE ]
        inject_size: 255
        size: 376
        iov_limit: 8
        rma_iov_limit: 8
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_RMA_PMEM, FI_SHARED_AV, FI_RMA_EVENT, FI_SOURCE ]
        mode: [  ]
        op_flags: [ FI_COMPLETION ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS ]
        comp_order: [ FI_ORDER_STRICT, FI_ORDER_DATA ]
        total_buffered_recv: 67108864
        size: 376
        iov_limit: 8
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_SOCK_TCP
        protocol_version: 2
        max_msg_size: 18446744073709547519
        msg_prefix_size: 0
        max_order_raw_size: 18446744073709547519
        max_order_war_size: 18446744073709547519
        max_order_waw_size: 18446744073709547519
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 16
        rx_ctx_cnt: 16
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: lo
        threading: FI_THREAD_SAFE
        control_progress: FI_PROGRESS_MANUAL
        data_progress: FI_PROGRESS_MANUAL
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_BASIC ]
        mr_key_size: 8
        cq_data_size: 8
        cq_cnt: 32
        ep_cnt: 128
        tx_ctx_cnt: 16
        rx_ctx_cnt: 16
        max_ep_tx_ctx: 16
        max_ep_rx_ctx: 16
        max_ep_stx_ctx: 128
        max_ep_srx_ctx: 128
        cntr_cnt: 128
        mr_iov_limit: 8
    caps: [  ]
    mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 0
    fi_fabric_attr:
        name: 127.0.0.0/8
        prov_name: sockets
        prov_version: 2.0
        api_version: 1.4
    nic_fid: (nil)



>This sounds like some error in the setup.  Do you have a pointer to the test code that we could run/look at?

I could push it to GitHub, but it is a shocking mess. I wanted to throw away all the connection-based code (left over from several years ago, when I last looked at this) and use the connectionless code instead, so there are ifdefs everywhere, and there are PMI bootstrap functions in there as well (those work lovely on the Cray, btw). It might be tricky for you to understand.

Thanks for taking time to respond

JB



