From sean.hefty at intel.com Fri Sep 3 12:30:17 2021
From: sean.hefty at intel.com (Hefty, Sean)
Date: Fri, 3 Sep 2021 19:30:17 +0000
Subject: [libfabric-users] Problem with the ofi+tcp;ofi_rxm provider
In-Reply-To: <1690940179.5465500.1630313873158.JavaMail.zimbra@inria.fr>
References: <1690940179.5465500.1630313873158.JavaMail.zimbra@inria.fr>
Message-ID:

> I'm using libfabric with the Mercury RPC framework and ran into a strange problem:
> using the ofi+tcp;ofi_rxm provider, I can make one RPC but after that they all hang
> indefinitely.
> libfabric 1.13.0 & 1.13.1 have this problem, libfabric 1.12.1 does not, so I'm assuming
> this is a bug on the libfabric side.
>
> Is there a libfabric tool I can use to get a more precise bug report?

Your only option is to set the log level and capture that output. Running with a debug version may help.

- Sean

From john.biddiscombe at cscs.ch Mon Sep 20 10:33:06 2021
From: john.biddiscombe at cscs.ch (Biddiscombe, John A.)
Date: Mon, 20 Sep 2021 17:33:06 +0000
Subject: [libfabric-users] Trouble with scalable endpoint
Message-ID:

Dear list,

I have been using one endpoint for Rx and another scalable one for Tx, so that each thread uses one context on send, but receives are shared, it works very well.

I decided to benchmark using a scalable endpoint for Rx as well and so created a single scalable endpoint with N Tx contexts/CQs and N Rx contexts/CQs, bound an address vector to it - but when I insert an address into the address vector I get an invalid argument error.

NB. I have set the AV to use the correct number of bits for Rx context count.

I have no idea why it is unhappy. Are there any other differences I need to look out for that might trigger an invalid argument error?

Thanks

JB

From sean.hefty at intel.com Mon Sep 27 11:04:23 2021
From: sean.hefty at intel.com (Hefty, Sean)
Date: Mon, 27 Sep 2021 18:04:23 +0000
Subject: [libfabric-users] Trouble with scalable endpoint
In-Reply-To:
References:
Message-ID:

> I have been using one endpoint for Rx and another scalable one for Tx, so that each
> thread uses one context on send, but receives are shared, it works very well.
>
> I decided to benchmark using a scalable endpoint for Rx as well and so created a single
> scalable endpoint with N Tx contexts/CQs and N Rx contexts/CQs, bound an address vector
> to it - but when I insert an address into the address vector I get an invalid argument
> error.
>
> NB. I have set the AV to use the correct number of bits for Rx context count.
>
> I have no idea why it is unhappy. Are there any other differences I need to look out
> for that might trigger an invalid argument error?

No clue. What provider is this?

- Sean

From john.biddiscombe at cscs.ch Tue Sep 28 03:07:49 2021
From: john.biddiscombe at cscs.ch (Biddiscombe, John A.)
Date: Tue, 28 Sep 2021 10:07:49 +0000
Subject: [libfabric-users] Trouble with scalable endpoint
In-Reply-To:
References: ,
Message-ID: <5fbf9d1345eb44efa23ed01f3db488e4@cscs.ch>

Sean

I don't remember exactly what I did wrong, but I think when I created the scalable endpoint, I had the number of contexts requested 1 more than was supported - instead of getting an error when creating the endpoint (or binding the queues), I got an error when adding an address to the AV.

When I switched the code to using the number of contexts equal to the number of threads I was actually using (not how many were supported) - the problem went away and I realized I had an error during creation (at least I think I remember that's what I did wrong).

The error appearing at the AV insertion fooled me into looking in the wrong places for the problem.

Anyway, it works now, thanks.

JB

From sean.hefty at intel.com Tue Sep 28 07:48:42 2021
From: sean.hefty at intel.com (Hefty, Sean)
Date: Tue, 28 Sep 2021 14:48:42 +0000
Subject: [libfabric-users] Trouble with scalable endpoint
In-Reply-To: <5fbf9d1345eb44efa23ed01f3db488e4@cscs.ch>
References: , <5fbf9d1345eb44efa23ed01f3db488e4@cscs.ch>
Message-ID:

> I don't remember exactly what I did wrong, but I think when I created the scalable
> endpoint, I had the number of contexts requested 1 more than was supported - instead of
> getting an error when creating the endpoint (or binding the queues), I got an error

Can you tell me which provider you were using? I can look at the code path to see why the error wasn't caught earlier.

- Sean

From john.biddiscombe at cscs.ch Tue Sep 28 23:37:37 2021
From: john.biddiscombe at cscs.ch (Biddiscombe, John A.)
Date: Wed, 29 Sep 2021 06:37:37 +0000
Subject: [libfabric-users] Trouble with scalable endpoint
In-Reply-To:
References: , <5fbf9d1345eb44efa23ed01f3db488e4@cscs.ch>,
Message-ID: <479b4cfaac564a1cac47be4a51b401c4@cscs.ch>

Sean,

This was using the gni provider. It's the only one I use that supports scalable endpoints, I think.

>Can you tell me which provider you were using? I can look at the code path to see why the error wasn't caught earlier.

(I think it's probably about time I actually started debugging the LF code myself. It's been years I've been using it and I've still never really had a look inside the box.)

JB
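
The failure mode described in this thread (requesting one more context than the provider supports, with the error only surfacing at AV insertion) can be guarded against with an explicit check before the scalable endpoint is created. What follows is a minimal sketch in plain C, not libfabric code: `check_ctx_count` and `ctx_bits_for` are hypothetical helpers, and `max_supported` is assumed to be filled in by the caller from whatever limit the provider reports (for example the `max_ep_tx_ctx` and `max_ep_rx_ctx` fields of `struct fi_domain_attr`).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: number of address bits needed to select one of
 * `count` receive contexts. Returns 0 when there is at most one context. */
static int ctx_bits_for(size_t count)
{
    const int max_bits = (int)(sizeof(size_t) * 8) - 1;
    int bits = 0;

    /* Grow the bit width until 2^bits covers every context index. */
    while (bits < max_bits && ((size_t)1 << bits) < count)
        bits++;
    return bits;
}

/* Hypothetical helper: reject a context count the provider cannot supply.
 * `max_supported` is an assumption: the caller is expected to pass the
 * limit reported by the provider for the Tx or Rx side. */
static int check_ctx_count(size_t requested, size_t max_supported)
{
    if (requested == 0 || requested > max_supported)
        return -EINVAL;
    return 0;
}
```

A caller would run `check_ctx_count` once for the Tx count and once for the Rx count, and only then derive the AV's Rx context bits with `ctx_bits_for` and create the endpoint. Failing here puts the error next to the bad request instead of at the later address insertion.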