[libfabric-users] Can only use one NIC port in libfabric 1.6.1

Jörn Schumacher jorn.schumacher at cern.ch
Wed Sep 5 05:39:42 PDT 2018


Hi all,

sorry for the third message in a row, but writing everything down seemed 
to have had a therapeutic effect and the issue has been found :)

It looks like this call used to work:

fi_getinfo(FI_VERSION(1, 1), "0.0.0.0", "12345", FI_SOURCE, hints, &fi);

whereas now you need to pass NULL as the node:

fi_getinfo(FI_VERSION(1, 1), NULL, "12345", FI_SOURCE, hints, &fi);
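
For comparison, here is a minimal sketch using plain POSIX sockets rather than libfabric. It resolves a passive (listening) IPv4 address with getaddrinfo() and shows what "wildcard" means at the socket level: a NULL node with AI_PASSIVE and the literal "0.0.0.0" both resolve to the same address. The helper name passive_addr() is made up for illustration, and the sketch makes no claim about how the verbs provider resolves the node string internally.

```c
#define _POSIX_C_SOURCE 200112L

#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/*
 * Hypothetical helper (not part of libfabric): resolve a passive IPv4
 * address for the given node/service and write it to buf in dotted-quad
 * form. node may be NULL, which together with AI_PASSIVE asks for the
 * wildcard address. Returns 0 on success, non-zero on failure.
 */
static int passive_addr(const char *node, const char *service,
                        char *buf, size_t len)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    const struct sockaddr_in *sin;
    const char *text;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, as in the thread */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE       /* address is meant for bind()/listen */
                   | AI_NUMERICSERV;  /* service is a port number, no lookup */

    rc = getaddrinfo(node, service, &hints, &res);
    if (rc != 0)
        return rc;
    if (res == NULL || res->ai_addr == NULL) {
        if (res != NULL)
            freeaddrinfo(res);
        return -1;
    }

    /* Only the first result is inspected; that is enough for this sketch. */
    sin = (const struct sockaddr_in *)res->ai_addr;
    text = inet_ntop(AF_INET, &sin->sin_addr, buf, (socklen_t)len);
    freeaddrinfo(res);

    return text != NULL ? 0 : -1;
}
```

Calling passive_addr(NULL, "12345", buf, sizeof buf) and passive_addr("0.0.0.0", "12345", buf, sizeof buf) both leave "0.0.0.0" in buf. So the two spellings are equivalent for ordinary sockets, and the difference described above seems to be specific to how fi_getinfo() handles the node string together with FI_SOURCE in this libfabric version.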

Sorry for all the noise and thanks for the help.

Cheers,
Jörn

On 09/05/2018 01:30 PM, Jörn Schumacher wrote:
> Hi,
>
> In addition to my last message:
>
> - I cross-checked with libfabric 1.6.0, since you reported not seeing the 
> issue with this version. Unfortunately we still see the same issue.
>
> - I cross-checked on a different system to exclude issues with the 
> NIC, same result.
>
> Cheers,
> Jörn
>
>
> On 09/05/2018 12:12 PM, Jörn Schumacher wrote:
>> Hi Arun,
>>
>> Sorry for the late reply. Our servers got updated and I was without 
>> PCs to test for a while.
>>
>> I put together a minimal test program that demonstrates the issue: 
>> https://gitlab.cern.ch/joschuma/libfabric-debug (let me know if there 
>> are issues with the access)
>>
>> The issue occurs even on a single host. IP configuration:
>>
>> eth2: 192.168.1.17/24
>> eth3: 192.168.2.17/24
>>
>> In one terminal ./listener will listen on 0.0.0.0:12345 and print if 
>> a CONNREQ occurs.
>>
>> In the other terminal:
>>
>> (1) ./connect 192.168.1.17 12345
>> (2) ./connect 192.168.2.17 12345
>>
>> (1) will generate no event in the listener program. (2) yields a 
>> CONNREQ event in the listener program. This happens with libfabric 
>> 1.6.1 and the verbs provider.
>>
>> With the rdma_server/rdma_client tools I am able to create a 
>> connection using both IP addresses. So I suspect a bug in libfabric.
>>
>> Let me know if you need any more info, I am happy to provide any help 
>> you might need.
>>
>>
>> Thanks a lot.
>>
>>
>> Cheers,
>>
>> Jörn
>>
>>
>>
>> On 08/25/2018 12:00 AM, Ilango, Arun wrote:
>>> Hi Jörn,
>>>
>>> ibv_devinfo shows different NIC ports as different devices as expected.
>>>
>>> To listen on multiple NIC ports, you just need one fabric and a 
>>> passive endpoint listening on the wildcard address (0.0.0.0). That 
>>> should work. I tried the same on a multi-port iwarp NIC and it was 
>>> working for me. This is on v1.6.0 and master.
>>>
>>> You can try initiating a connection request only from the second 
>>> port to check if that works.
>>>
>>> Thanks,
>>> Arun.
>>>
>>> -----Original Message-----
>>> From: Jörn Schumacher [mailto:jorn.schumacher at cern.ch]
>>> Sent: Thursday, August 23, 2018 2:27 AM
>>> To: Ilango, Arun <arun.ilango at intel.com>; 
>>> libfabric-users at lists.openfabrics.org
>>> Subject: Re: [libfabric-users] Can only use one NIC port in 
>>> libfabric 1.6.1
>>>
>>> Hi Arun,
>>>
>>> Thanks for your reply.
>>>
>>> ibv_devinfo: 
>>> https://gist.github.com/joerns/cb7d216b0c3a71b5ea327d0292459211
>>>
>>> Looking at my code, I realize the issue actually occurs before even 
>>> setting up the fi_domain object. I posted my (stripped-down) 
>>> initialization procedure in the other file in the gist.
>>>
>>> In case I want to listen on multiple ports, do I need multiple 
>>> fi_fabric objects? Or multiple endpoints? Or should I be able to 
>>> listen on multiple interfaces with "0.0.0.0" like I am doing?
>>>
>>> Thanks,
>>> Jörn
>>>
>>>
>>>
>>> On 08/22/2018 07:46 PM, Ilango, Arun wrote:
>>>> Hi Jörn,
>>>>
>>>> The verbs provider assigns a separate domain to each device returned 
>>>> by rdma_get_devices(). So if the NIC ports show up as separate 
>>>> devices, they belong to separate domains. This has been the 
>>>> case even in 1.4.
>>>>
>>>> Can you check the output of ibv_devinfo? How do the ports show up 
>>>> there?
>>>>
>>>> Thanks,
>>>> Arun.
>>>>
>>>> -----Original Message-----
>>>> From: Libfabric-users
>>>> [mailto:libfabric-users-bounces at lists.openfabrics.org] On Behalf Of
>>>> Jörn Schumacher
>>>> Sent: Tuesday, August 21, 2018 2:17 AM
>>>> To: libfabric-users at lists.openfabrics.org
>>>> Subject: [libfabric-users] Can only use one NIC port in libfabric
>>>> 1.6.1
>>>>
>>>> Dear libfabric developers,
>>>>
>>>> I recently updated to libfabric 1.6.1 (from 1.4). It looks like in 
>>>> this release we can only use one port of our NIC (Mellanox 
>>>> ConnectX-5 with RoCE).
>>>>
>>>> On the receiving side we listen for an RC. We monitor the event queue
>>>> with a file descriptor + epoll. On one port of the NIC this works
>>>> fine, but if the request comes in on the second port (on a different
>>>> IP subnet) this fails: we get an epoll notification, but then the 
>>>> subsequent fi_eq_sread(...) call yields FI_EAGAIN.
>>>>
>>>> I open a single domain. This worked fine in the earlier libfabric.
>>>> Reading the documentation a bit I understand that a domain is tied 
>>>> to a port. Does this mean I need to open multiple domains?
>>>>
>>>>
>>>> Thanks and best regards,
>>>> Jörn
>>>> _______________________________________________
>>>> Libfabric-users mailing list
>>>> Libfabric-users at lists.openfabrics.org
>>>> https://lists.openfabrics.org/mailman/listinfo/libfabric-users


