From vlad at lists.openfabrics.org Fri May 1 03:22:30 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 1 May 2009 03:22:30 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090501-0200 daily build status
Message-ID: <20090501102230.BFDB5E61313@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From arkady.kanevsky at gmail.com Fri May 1 04:24:36 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 07:24:36 -0400
Subject: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <49FA7C21.1050400@cs.anu.edu.au>
References: <49D2BD00.5010002@cs.anu.edu.au>
 <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com>
 <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com>
 <49D30C7F.1050201@cs.anu.edu.au>
 <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com>
 <49FA7C21.1050400@cs.anu.edu.au>
Message-ID: <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com>

Jie,
it sounds to me that either the variable is not declared volatile or a compiler optimization is causing the problem. I would check for these first.
Arkady

On Fri, May 1, 2009 at 12:35 AM, Jie Cai wrote:

> Thanks for the help along my understanding with IB and uDAPL.
>
> Is that possible to spin remote memory of a rdma atomic compare and swap
> (dat_ib_post_cmp_and_swap())?
>
> I have written a program in which the initiator atomically cmp_swaps a
> value to a remote memory.
>
> Instead of sending a message to notify the remoter about the completion of
> cmp_swap, the remoter is actually doing a memory spin to test the update
> on the memory (e.g. while(target == 0);).
>
> However, at the remote side, this while loop goes on infinitely, and the
> initiator has already received DAT_IB_DTO_EVENT.
>
> I don't really understand what's going on, and what would be a correct way
> to spin memory for checking remote write updates.
>
> Any suggestions?
>
> Regards,
> Jie
>
> --
> Jie Cai
>
>
> Caitlin Bestler wrote:
>
>> On Tue, Mar 31, 2009 at 11:41 PM, Jie Cai wrote:
>>
>>> Understood now.
>>> A further question is here again.
>>>
>>> To implement software level acknowledgment to inform initiator that data
>>> has been available for remoter, is that possible to use a busy loop at
>>> remote side to detect the last element of transferring has appeared in
>>> the memory.
>>>
>>> Or remoter has to wait for the event of recv matching initiator's send,
>>> then send a message back to initiator as an acknowledgment?
>>>
>>
>> There are two issues when spinning on a remote memory update.
>>
>> The first is that packets may be received and processed out of order,
>> especially for iWARP. Therefore the fact that the last byte has been
>> received and placed does not guarantee that the prior packets have
>> been received and placed.
>>
>> More importantly, the order in which updates become visible to a
>> specific software thread can make the order of updates unpredictable
>> to the application.
>>
>> When delivering a completion the Provider is responsible for dealing
>> with both of these problems. So when you reap a completion from the
>> CQ, the operation it represents (and all prior operations) are complete.
>> There are no gaps in received packets, nothing is still sitting on an
>> Adapter buffer waiting to be placed in host memory.
>>
>> If your application does not want to block you can consider polling
>> the cq rather than enabling notifications. But polling memory locations
>> directly should only be done when you're willing to have bus/adapter
>> specific dependencies. Your working code might stop working when
>> your network changes, or you install a new Adapter that has a different
>> strategy for optimizing its writes over the PCIe bus.
>>
>

--
Cheers,
Arkady Kanevsky
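As an illustration of the compiler-side issue Arkady raises (and only that issue, not the ordering problems Caitlin describes), here is a minimal C11 sketch of a bounded spin on a memory word. The name `spin_until_nonzero` and the `max_spins` bound are hypothetical and nothing here is uDAPL API. The atomic load forces a fresh read on every iteration, so the compiler cannot hoist the read out of the loop the way it may for a plain `while(target == 0);`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: poll a 64-bit word until it becomes nonzero or
 * max_spins iterations have passed. Returns the observed value, or 0 on
 * timeout. Assumption: the word is only ever set to a nonzero value by
 * the writer, so 0 can double as the "nothing yet" result.
 */
uint64_t spin_until_nonzero(_Atomic uint64_t *target, unsigned long max_spins)
{
    assert(target != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        /* Re-read memory on every pass; acquire orders later reads after it. */
        uint64_t seen = atomic_load_explicit(target, memory_order_acquire);
        if (seen != 0)
            return seen;
    }
    return 0; /* timed out: the caller should fall back to the completion event */
}
```

Even with the read done correctly, this only says the word changed as seen by this CPU. Whether earlier data from the same transfer is already visible still depends on the bus and adapter, which is why reaping the completion remains the portable approach.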
From tziporet at mellanox.co.il Fri May 1 04:33:32 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Fri, 1 May 2009 14:33:32 +0300
Subject: [ofa-general] OFED 1.4.1-rc4 is available
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>

Hi,

OFED-1.4.1-rc4 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc4.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.1

Vladimir & Tziporet

========================================================================

Release information:
------------------------------
Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- OEL 4.5: 2.6.9-55.ELsmp
- OEL 5.2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11 GA: 2.6.27.13-1-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4.1-rc2
==========================
- 22 bugs fixed (see attachment)
- Attached kernel git tree changes for details

Tasks that should be completed for RC4 (Apr 20):
====================================
1. High priority bug fixes - see list below
2. Documentation update

Open bugs:
========
bug_id  bug_severity  op_sys  assigned_to                   short_short_desc
1607    blo           SLES    Jeffrey.C.Becker at nasa.gov  kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
1616    cri           RHEL    jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
1571    cri           RHEL    vu at mellanox.com            nfsrdma server crash @test5 connectathon basic test,

Note: I saw some mails that some of these bugs are fixed but since they are still open in bugzilla I report them here. Please update bugzilla with any fixed bug.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc3_rc4.log
Type: application/octet-stream
Size: 23602 bytes
Desc: ofed-1.4.1-rc3_rc4.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc4-fixed-bugs.csv
Type: application/octet-stream
Size: 2454 bytes
Desc: ofed-1.4.1-rc4-fixed-bugs.csv

From jsquyres at cisco.com Fri May 1 04:56:48 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 07:56:48 -0400
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <20090430222230.GF32114@obsidianresearch.com>
References: <20090429215508.GW4431@obsidianresearch.com>
 <20090429222125.GX4431@obsidianresearch.com>
 <1241044080.3403.374.camel@chromite.mv.qlogic.com>
 <20090429224411.GC32114@obsidianresearch.com>
 <23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com>
 <20090430222230.GF32114@obsidianresearch.com>
Message-ID: <4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>

On Apr 30, 2009, at 6:22 PM, Jason Gunthorpe wrote:

> After reading all the postings, I think my idea to fix the verbs API
> to not, essentially, corrupt an existing registration when the virtual
> address space changes is the best bet. This slightly changes the
> semantics of the verbs MR to refer to virtual address space within the
> process, not the underlying object(s) that happen to be mapped there
> when the registration is made.
>

I'm not sure how this helps MPI -- our registration caches will still become invalid if the MPI app free()'s registered memory...? MPI maintains a registration cache because registration is so expensive.
Even if the registration cache becomes "safely" invalid (e.g., you'll never get a scenario where one virtual address could have previously pointed to a different hardware address within the span of one process), it doesn't help.

> MPIs can choose to continue to hook malloc/free/etc or not, it doesn't
>

No no no! We don't want to continue hooking this stuff. The hooks are horrible, horrible, horrible -- there are real-world apps that break them.

>> While MPI is currently the biggest victim, this broken memory management
>> model is also an enormous roadblock for any other application or ULP to
>> write to verbs.
>
> I'm not sure this is true.. The purpose built verbs apps I've worked
> on don't have a problem like MPI, and managing the memory registration
> was not hard.

Ok, I'll back off slightly: if you want verbs to go mainstream, there will be many other ULPs / middleware libraries that have memory models like MPI's (that the upper layer is responsible for allocating/freeing message buffers).

Put differently: the TCP/sockets stack doesn't have this restriction; it will be extremely difficult to convert legions of sockets programmers to verbs if you effectively restrict large messages to only be allocated/freed by the network layer (kinda defeats the point of RDMA if you have to copy large messages, right?).
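To make the hazard under discussion concrete, here is a toy sketch in C of an address-keyed registration cache. Everything in it is hypothetical: `reg_cache_t`, `cache_lookup_or_register` and `cache_invalidate` are invented names, the `backing` id merely stands in for the physical pages pinned at registration time, and no verbs calls are made. It only illustrates why a cache that is not told about a free can hand back a stale registration for a reused virtual address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

/* One cached "registration": a virtual range plus an id standing in for
 * the physical pages that were pinned when the range was registered. */
typedef struct {
    uintptr_t addr;
    size_t    len;
    int       backing;   /* hypothetical id of the pinned pages */
    int       valid;
} reg_entry_t;

typedef struct {
    reg_entry_t slot[CACHE_SLOTS];
    int registrations;   /* count of (expensive) registrations performed */
} reg_cache_t;

/* Return the backing id the adapter would use for [addr, addr+len).
 * current_backing is what the address really maps to right now. */
int cache_lookup_or_register(reg_cache_t *c, uintptr_t addr, size_t len,
                             int current_backing)
{
    assert(c != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        reg_entry_t *e = &c->slot[i];
        if (e->valid && e->addr == addr && e->len >= len)
            return e->backing;              /* cache hit: possibly stale */
    }
    for (int i = 0; i < CACHE_SLOTS; i++) {
        reg_entry_t *e = &c->slot[i];
        if (!e->valid) {
            *e = (reg_entry_t){ addr, len, current_backing, 1 };
            c->registrations++;             /* the slow path a cache avoids */
            return current_backing;
        }
    }
    return -1;                              /* cache full; eviction omitted */
}

/* What a free()/munmap() hook would have to call to keep the cache safe. */
void cache_invalidate(reg_cache_t *c, uintptr_t addr)
{
    assert(c != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].addr == addr)
            c->slot[i].valid = 0;
}
```

In this sketch, registering address 0x1000 while it is backed by pages 1, then looking it up again after the application has freed and reallocated it onto pages 2, still returns 1 unless `cache_invalidate` was called in between. That missed notification is the gap the malloc/free hooks try to close.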
--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Fri May 1 05:11:42 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:11:42 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
 <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
 <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
Message-ID: <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>

On Apr 30, 2009, at 5:25 PM, arkady kanevsky wrote:

> are the MPI applications that are broken are the ones which is
> malloc/free instead of MPI_ALLOC calls?

Yes.

--
Jeff Squyres
Cisco Systems

From arkady.kanevsky at gmail.com Fri May 1 05:25:29 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 08:25:29 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
 <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
 <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
 <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
Message-ID: <517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>

Jeff,
What if we provide a script which converts all malloc/free calls into MPI ones and moves MPI_INIT before any memory allocation? Will these application users be willing to do the conversion? Will it fix all the problems or are there some loose ends?
Arkady

On Fri, May 1, 2009 at 8:11 AM, Jeff Squyres wrote:

> On Apr 30, 2009, at 5:25 PM, arkady kanevsky wrote:
>
>> are the MPI applications that are broken are the ones which is malloc/free
>> instead of MPI_ALLOC calls?
>>
>
> Yes.
>
> --
> Jeff Squyres
> Cisco Systems
>

--
Cheers,
Arkady Kanevsky
From jsquyres at cisco.com Fri May 1 05:48:39 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:48:39 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
 <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com>
Message-ID: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>

On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:

> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.

Absolutely not. As Brian stated, we have cited some real-world problems that we cannot fix (and we have tried many, many different workarounds over the past few years to fix them). It sounds like your main objection to fixing them is "it's too much work." :-(

> There's an application at Sandia and at Los Alamos which both of
> which cause problems for our linker tricks. This leads to such
> things as (proven) silent data corruption.
>

There are other apps that have also been reported over the years. C++ apps with their own allocators are especially problematic. Abaqus had to change their memory allocation model several years ago to be able to work around these issues. These memory models also break valgrind, purify, and other memory-checking debuggers.

> Have you tried these applications with any MPI other than OpenMPI ?
> i.e., does this corruption happen with Intel MPI and other MPIs as
> well?
>

We have been trying to say that this is a general problem that there currently is no guaranteed fix for. There's always a way to break the MPI workarounds for verbs' broken memory management model because there's no way to guarantee the memory allocation hooks.

There are two main reasons to fix these issues:

1. Business: to attract network programmers to verbs (and therefore to attract applications and therefore increase market share), it has to be simpler and within reach of today's commodity sockets-level programmers. Forcing them to have registration caches and to do memory allocation hooking significantly raises the bar. To date, this has been shunned by all network programmers except HPC and a handful of storage protocols.

2. Technical: if OFED says "to get good performance with verbs, you have to do malloc/mmap/etc. hooks and have a registration cache," this unnecessarily *significantly* raises the education and code complexity barrier to entry for verbs programmers. It's also unscalable -- if this is something you *have* to do for good performance, why doesn't the network stack do it? It seems weird that you would effectively force all ULPs/MPIs/applications to implement the same functionality.

The memory allocation hooking model also fails if more than one verbs-based middleware is used in the same application (because only one will be able to use the memory hooks per process).

Here's a story that encompasses both reasons:

We had Open MPI *not* use the registration cache by default for a long time because of the danger it posed to applications. Users could activate the registration cache with a simple command line parameter. But nobody would do that -- they wanted to run with top performance right out of the box (which is not unreasonable). It also led to OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look, Open MPI's performance is bad! Our MPI's performance is GREAT!" Open MPI therefore was forced to change its defaults in the 1.3 series to activate the [dangerous] memory registration cache by default.
You mentioned that doing this stuff is a choice; the choice that MPI's/ULPs/applications therefore have is:

- don't use registration caches/memory allocation hooking, have terrible performance
- use registration caches/memory allocation hooking, have good performance

Which is no choice at all. If customers pay top dollar for these networks, they want to see benchmarks run out of the box that show that they're getting every flop/byte-per-second that they can. The fact that the programming model is needlessly complicated (and dangerous) to get that performance is something that the MPI's have tolerated because we had to for competition's sake. This is not something that non-HPC customers will accept.

> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.
>

Note that Jason G. said in this thread: "Notifiers are going to be very troublesome, every time any sort of synchronous to user space notifier has been proposed or implemented in the kernel it has been a disaster."

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Fri May 1 05:56:58 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 08:56:58 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>
References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com>
 <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com>
 <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com>
 <0C054C01-6DF8-4B7D-A540-9693BEA58EDD@cisco.com>
 <517c62fb0905010525s5a05cb76w736afe494d67aeca@mail.gmail.com>
Message-ID: <96A68779-ED89-49BB-9C29-B9BB33221FD6@cisco.com>

On May 1, 2009, at 8:25 AM, arkady kanevsky wrote:

> What if we provide a script which converts all malloc/free
> calls into MPI ones and move MPI_INIT before any memory allocation?
> Will these application user be willing to do the conversion?
We've been trying to educate MPI application developers for 10 years. :-) If you think a script will help, go for it. :-)

Sorry; I'm not trying to be snide -- this thread is getting increasingly frustrating.

No, I don't think it will help for a few reasons:

- MPI's already support malloc/etc. buffers; changing that now would be a big change -- based on this one network stack.

- MPI's are competitive. If one MPI forces the use of MPI_ALLOC_MEM, then others will say "you should use my MPI because then you don't have to change your code to use MPI_ALLOC_MEM." Because we're ultimately competing for customers' dollars -- MPI's actively try to make programming/using their product as easy as possible.

- Fortran is always problematic. I haven't thought through the problems there, but I know of many apps that have huge arrays declared statically (which the Fortran compiler gets from the heap, not the stack). Forcing them to change to F90-style pointers would never happen.

- I cited earlier in the thread MPI-based middleware that could do MPI_ALLOC_MEM (potentially plus a copy) for short messages, but likely re-uses application buffers directly for large messages because the copy cost would be too much. Specifically: if MPI is not the top level middleware in an application -- some other middleware is fronting the network stack, like a computational library or somesuch -- they might have to make exactly the same compromises (e.g., application buffers are too large, so let's just use those instead of MPI_ALLOC_MEM+copy).

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Fri May 1 06:07:40 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:07:40 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References:
Message-ID:

On Apr 30, 2009, at 6:10 PM, Barrett, Brian W wrote:

> I'm done now. You don't want to fix your crap, that's fine.
> Just don't be surprised by the continued "why you shouldn't use IB"
> presentations from people who have to write applications to it.
>

Let's not forget that Brian is not only an MPI developer (i.e., a network programmer), he's also a customer.

If OpenFabrics only wants the HPC market, you can probably ignore this entire thread. The OpenFabrics-based MPI's will hobble along like they have been. If you want larger markets, it's probably pretty safe to assume that Brian's reactions are going to be quite similar to enterprise network programmers.

To be clear: it's not just verbs education (books, tutorials, FAQ's, etc.) that is required to win the hearts and minds of enterprise network programmers. You also need a network API that is no more complex than common sockets usage. Verbs -- and the additional baggage that it requires for performance, like registration caches and memory allocation hooking -- does not currently meet this requirement.

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Fri May 1 06:17:44 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:17:44 -0400
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com>
Message-ID: <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>

I'd also like to call the IWG's and MWG's attention to the other thread currently running on the general list: "New proposal for memory management." There are many points in there about attracting non-HPC / enterprise network programmers to write verbs-based applications.

It's not just documentation / education that is missing -- having a series of FAQs and tutorials about verbs programming is not enough. You need a network programming API that is no more complex than common sockets usage.
Specifically: let's not forget that HPC (OF's biggest market right now) tends to attract network programmers with PhD's, and/or who are among the top programming talent in the world (yes, that's being snobbish -- but it's still true). To make OF within reach of the masses, you want to lower the bar so that legions of sockets-based network programmers can hope to learn/use this stuff without requiring them to get a PhD first.

On Apr 30, 2009, at 6:12 PM, Ryan, Jim wrote:

> At the risk of piling on, I think what Lloyd is suggesting is very
> important. The objections I continue to hear about programming using
> RDMA are along the lines of "it's too hard" or "no one knows how to
> do it".
>
> It occurs to me if we could provide some concise instruction, that,
> coupled with the undeniable benefits of RDMA, could provide a
> compelling package for "RDMA for the masses"
>
> thanks, Jim
>
> From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org] On Behalf Of Lloyd Dickman
> Sent: Thursday, April 30, 2009 1:17 PM
> To: arkady kanevsky; bill.boas at openfabrics.org
> Cc: iwg at lists.openfabrics.org; Paul Grun; OFA at lists.openfabrics.org;
> Paul Gray; Working Group; Wayne Augsburger; Andy Grover; Richard
> Frank; Jeff at lists.openfabrics.org; Squyres; Mikkel Hagen;
> Scott at lists.openfabrics.org; general at lists.openfabrics.org;
> Friedman; bobs at voltaire.com; Sumanta Chatterjee; asafs`@voltaire.com;
> Roland Dreier
> Subject: RE: [mwg] Re: RDMA tutorial and OFA
>
> I support the idea of the RDMA tutorial. Beyond the “meat” as
> described below, I would encourage the tutorial to include a “how to
> program RDMA” section.
> While OFA Verbs provides a rich set of
> mechanisms, it is difficult for the average programmer to get a
> solid handle on how to use the capabilities, register memory, …
> Some cookbook examples, or perhaps development of several
> programming “patterns” can go a long way to having RDMA become a
> much more mainstream application programming paradigm.
>
> Lloyd
>
> From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org] On Behalf Of arkady kanevsky
> Sent: Thursday, April 30, 2009 11:27 AM
> To: bill.boas at openfabrics.org
> Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing
> Working Group; Wayne Augsburger; Andy Grover; Richard Frank;
> asafs`@voltaire.com; Jeff Squyres; Mikkel Hagen;
> general at lists.openfabrics.org; Scott Friedman; bobs at voltaire.com;
> Sumanta Chatterjee; Roland Dreier
> Subject: [mwg] Re: RDMA tutorial and OFA
>
> Keep me in the loop.
> I am interested to do it also.
> Thanks,
> Arkady
>
> On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas wrote:
> Richard, Andy,
>
> Thanks for copying me Richard. I had not seen Andy's email on the
> general list.
>
> Figuring out how to get tutorial and other documentation created and
> published is in the list of things to get done in 2009 for me in my
> part-time role as Exec. Dir.
>
> There is no funding set up for this at the moment but I believe
> there will be in about 30 days.
>
> That's because I'm thinking that we can get funding for this by
> making it part of the funding for a new marketing plan for OFA that,
> with Wayne Augsburger and Jim Ryan, we are preparing for the OFA
> Board to vote on at the next con-call meeting which is on May 20 at
> 9.00AM PDT.
>
> Would you be willing to work with me and create a small team from
> others within OFA who have the same interest to prepare a description
> by May 20 of what the tutorial would look like, who would contribute
> to it, how to get it "polished up" for web and/or book style
> publication, what the overall costs would be, etc.
>
> My thoughts, that could be a starting point for the team's work, are
> that we would make the creation a collective effort.
>
> The tutorial would have several sections for example general intro,
> benefits of RDMA, applicability in HPC and Enterprise, networking
> background etc. Members of the Marketing Working Group would be
> responsible for this.
>
> The "meat" would be sections for kernel level things (verbs etc.),
> then user space things (verbs etc.), then APIs like MPI, SDP, EDS
> etc. - each section overseen by the technical leaders/maintainers of
> the code within OFA for that section (for Example Tom Talpey for
> NFSoRDMA, or you Richard for RDS)
>
> Finally the tutorial would have sections about Interoperability
> Testing that OFA/IOL does but also what customers can do on their own
> systems - Arkady and Rupert and IOL have put in an SC09 tutorial
> proposal that we could leverage in this section.
>
> To all readers of this email:-
> If you have read this far, please give us all some feedback. If you
> have material you'd like to contribute please say so. If there's a
> better way, tell us what you think it is!
>
> Thanks,
>
> Bill.
>
> Bill Boas
> Executive Director and Vice Chair
> OpenFabrics Alliance
> 510-375-8840
> Bill.Boas at openfabrics.org
> www.openfabrics.org
>
> -----Original Message-----
> From: Richard Frank [mailto:richard.frank at oracle.com]
> Sent: Wednesday, April 29, 2009 12:58 PM
> To: Andy Grover
> Cc: Bill Boas; Sumanta Chatterjee
> Subject: Re: RDMA tutorial and OFA
>
> Andy, I saw your postings to ofa-general on this and I agree it
> would be great to have this documentation.
>
> As OpenFabrics is really about RDMA... we need to make it simpler
> for folks to pick up and run with RDMA concepts ...vs.. digging thru
> the IB specs and code examples, etc.
>
> Let's see what Bill Boas thinks...perhaps OFA has a writer on board
> that can help us do this..?
>
> I can also help provide input for a new OFA RDMA tutorial doc..
>
> Rick
>
> Andy Grover wrote:
> > Hi Rick,
> >
> > Are you around for a brief chat this afternoon? I have a crazy
> > idea that involves OFA doing something (or putting up $$) and I
> > wanted to see what you thought, since you're Oracle's OFA rep,
> > right?
> >
> > -- Andy
> >
>
> --
> Cheers,
> Arkady Kanevsky

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Fri May 1 06:26:18 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 1 May 2009 09:26:18 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
References: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com>
 <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
Message-ID:

On Apr 30, 2009, at 7:20 PM, Aaron Fabbri (aafabbri) wrote:

>> Yes, MPI_ALLOC_MEM / MPI_FREE_MEM calls have been around for
>> a long time (~10 years?). Using them does avoid many of the
>> problems that have been discussed. Most (all?) MPI's either
>> support ALLOC_MEM / FREE_MEM by registering at allocation
>> time and unregistering at free time, or some variation of that.
>
> Ah.
> Are there any problems that are not addressed by having MPI own
> allocation of network bufs?

Sure, there's lots of them. :-) But this thread is just about the memory allocation management issues.

> (BTW registering for each allocation could be improved, I think.)

Probably so. Since so few MPI applications use these calls, OMPI hasn't really bothered to tune them.

>> But unfortunately, very few MPI apps use these calls; they use
>> malloc() and friends instead. Or they're written in Fortran,
>> where such concepts are not easily mapped (don't
>> underestimate how much Fortran MPI code runs on verbs!).
>> Indeed, in some layered scenarios, it's not easy to use these
>> calls (e.g., if an MPI-enabled computational library may
>> re-use user-provided buffers because they're so large, etc.).
>
> I understand the difficulty. A couple possible counterpoints:
>
> 1. Make the next version of MPI spec *require* using the mpi_alloc
> stuff.

The MPI Forum (the standards body) has been very resistant to this, especially based on the requirements of one not-pervasive network stack. It would effectively break all legacy MPI applications, too. I seriously doubt that the Forum would go for that.

FWIW: the way the MPI spec is worded, it says that you *may* get performance benefit from using MPI_ALLOC_MEM. E.g., an MPI can always support using malloc buffers -- just copy into network-special buffers. The performance would be terrible :-), but it would be correct.

> 2. MPI already requires recompilation of apps, right? I don't know
> fortran, or what it uses for allocation, but worse case, maybe you
> could change the standard libraries or compilers.

We tried that -- interposing our own copies of malloc, free, mmap, ... etc. (e.g., inside libmpi). Ick. Horrible, horrible ick. And it definitely breaks some real-world apps and memory-checking debuggers/tools.

> 3. Rip out your registration cache.
> Make malloc'd buffers go really
> slow (register in fast path) and mpi_alloc_mem() buffers go really
> fast.
> People will migrate. The hard part of this would be getting all
> MPIs to agree on this, I'm guessing.

See http://lists.openfabrics.org/pipermail/general/2009-May/059376.html -- Open MPI effectively tried this and got beat up by a) competing MPI's, and b) the marketing supporting Open MPI. :-\

People won't migrate, nor will main-line MPI benchmarks. Customers want top performance out-of-the-box with their MPI (which is not unreasonable). Users have used malloc() for 10+ years, and other networks don't require the use of MPI_ALLOC_MEM.

--
Jeff Squyres
Cisco Systems

From tmtalpey at gmail.com Fri May 1 06:25:33 2009
From: tmtalpey at gmail.com (Tom Talpey)
Date: Fri, 01 May 2009 09:25:33 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References:
Message-ID: <49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com>

At 09:07 AM 5/1/2009, Jeff Squyres wrote:
>On Apr 30, 2009, at 6:10 PM, Barrett, Brian W wrote:
>
>> I'm done now. You don't want to fix your crap, that's fine. Just
>> don't be surprised by the continued "why you shouldn't use IB"
>> presentations from people who have to write applications to it.
>>
>
>Let's not forget that Brian is not only an MPI developer (i.e., a
>network programmer), he's also a customer.
>
>If OpenFabrics only wants the HPC market, you can probably ignore this
>entire thread. The OpenFabrics-based MPI's will hobble along like
>they have been. If you want larger markets, it's probably pretty safe
>to assume that Brian's reactions are going to be quite similar to
>enterprise network programmers.

Completely agree. I will add that enterprise network programmers are going to reject registration caching as well, because it introduces vulnerabilities into the data path - silent data corruption. For example, storage won't tolerate it, databases won't, etc.
The problem is that userspace memory registration is slow. Let's
address that, not address how to make a hack (registration caching)
go faster.

We've solved this in the kernel with FRMR, why not take a similar
solution up to user verbs? Wouldn't that address it, by allowing the
library to safely and efficiently manage registration on a per-io
basis?

Tom.

From todd.rimmer at qlogic.com Fri May 1 07:27:50 2009
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Fri, 1 May 2009 09:27:50 -0500
Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA
In-Reply-To: <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com> <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com>
Message-ID: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org>

It goes beyond just a tutorial.

In talking to customers, the consensus is that many application
programmers struggle with sockets; RDMA is an order of magnitude
beyond that. It's not a cut on programmers, there are some very
strong ones in the enterprise, but a fair percentage only have
associate degrees or technical school training. Even the extremely
smart ones have 100 things to juggle (and often must write code such
that entry level programmers can support it), so the risk/reward or
ROI of learning RDMA has to be there. The higher the learning cost,
the more difficult it is to justify the effort.

To summarize what is really needed:
- simplified APIs and easy migration of applications
	- SDP with zCopy was supposed to be a start; unfortunately the
	  implementation required relinking applications. Sounds simple
	  to developers, but very tricky in the field, especially with
	  complex apps, 3rd party scripts to start them, etc. A kernel
	  based "socket switch" approach is needed to make this 100%
	  transparent.

- good simple examples of how to do it, sample programs etc
	- write the samples then analyze and improve the API to further
	  simplify them
	- connection establishment is still difficult in OFED. Also
	  many apps are shortcutting the process by avoiding SA queries
	  (hence impacting the ability of the applications to work
	  properly with QOS, LMC, complex fabrics (torus, etc),
	  Partitioning, etc).
	- either the Base API needs to improve or "helper libraries"
	  are needed on top of it.

- effective tools to debug applications. Right now there are very
  limited debug facilities in the ofa kernel (and most require a debug
  build), strace is not applicable to user verbs (due to kernel
  bypass), etc. You need ways to analyze resources (QPs, MRs, etc)
  while the application is running or after it has dumped. You need
  ways to trace the sequence of Verbs calls to analyze program
  behavior and bugs. Also needed are ways to analyze the "on wire"
  behavior (aka tcpdump) of an application while it's running. Right
  now it's impossible in OFED to identify how many QPs are open, let
  alone which applications are using them, etc. Tools like madeye are
  inefficient and lack the proper filtering to be effective for all
  but very simple problems.

- accessibility in scripting languages and other languages (java,
  C#, etc). Many languages have powerful capabilities to manage
  sockets and the TCP layers above them (http, smtp, etc). However
  there is no effective way to use RDMA and IB in languages other
  than C. A start for scripting languages could be the transparent
  SDP approach. For java, C++, C# and other languages there need to
  be effective APIs and libraries that map well into the style of the
  language.
Todd Rimmer Chief Architect QLogic Network Systems Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres > Sent: Friday, May 01, 2009 9:18 AM > To: Ryan, Jim > Cc: iwg at lists.openfabrics.org; Paul Grun; asafs`@voltaire.com; Paul > Gray; Working Group; Wayne Augsburger; Lloyd Dickman; Sumanta > Chatterjee; Mikkel Hagen; Roland Dreier (rdreier); bobs at voltaire.com; > Jeff at lists.openfabrics.org; general at lists.openfabrics.org; Friedman; > bill.boas at openfabrics.org; OFA at lists.openfabrics.org; > Scott at lists.openfabrics.org > Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA > > I'd also like to call the IWG's and MWG's attention to the other > thread currently running on the general list: "New proposal for memory > management." > > There are many points in there about attracting non-HPC / enterprise > network programmers to write verbs-based applications. It's not just > documentation / education that is missing -- having a series of FAQs > and tutorials about verbs programming is not enough. You need a > network programming API that is no more complex than common sockets > usage. > > Specifically: let's not forget that HPC (OF's biggest market right > now) tends to attract network programmers with PhD's, and/or who are > among the top programming talent in the world (yes, that's being > snobbish -- but it's still true). To make OF within reach of the > masses, you want to lower the bar so that legions of sockets-based > network programmers can hope to learn/use this stuff without requiring > them to get a PhD first. > > > > On Apr 30, 2009, at 6:12 PM, Ryan, Jim wrote: > > > At the risk of piling on, I think what Lloyd is suggesting is very > > important. 
The objections I continue to hear about programming using > > RDMA are along the lines of "it's too hard" or "no one knows how to > > do it". > > > > It occurs to me if we could provide some concise instruction, that, > > coupled with the undeniable benefits of RDMA, could provide a > > compelling package for "RDMA for the masses" > > > > thanks, Jim > > > > From: mwg-bounces at lists.openfabrics.org [mailto:mwg- > bounces at lists.openfabrics.org > > ] On Behalf Of Lloyd Dickman > > Sent: Thursday, April 30, 2009 1:17 PM > > To: arkady kanevsky; bill.boas at openfabrics.org > > Cc: iwg at lists.openfabrics.org; Paul Grun; OFA at lists.openfabrics.org; > > Paul Gray; Working Group; Wayne Augsburger; Andy Grover; Richard > > Frank;Jeff at lists.openfabrics.org; Squyres; Mikkel Hagen; > Scott at lists.openfabrics.org > > ; general at lists.openfabrics.org; Friedman; bobs at voltaire.com; > > Sumanta Chatterjee;asafs`@voltaire.com; Roland Dreier > > Subject: RE: [mwg] Re: RDMA tutorial and OFA > > > > I support the idea of the RDMA tutorial. Beyond the "meat" as > > described below, I would encourage the tutorial to include a "how to > > program RDMA" section. While OFA Verbs provides a rich set of > > mechanisms, it is difficult for the average programmer to get a > > solid handle on how to use the capabilities, register memory, ... > > Some cookbook examples, or perhaps development of several > > programming "patterns" can go a long way to having RDMA become a > > much more mainstream application programming paradigm. 
> > > > Lloyd > > > > From: mwg-bounces at lists.openfabrics.org [mailto:mwg- > bounces at lists.openfabrics.org > > ] On Behalf Of arkady kanevsky > > Sent: Thursday, April 30, 2009 11:27 AM > > To: bill.boas at openfabrics.org > > Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing > > Working Group; Wayne Augsburger; Andy Grover; Richard Frank; > asafs`@voltaire.com > > ; Jeff Squyres; Mikkel Hagen;general at lists.openfabrics.org; Scott > > Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier > > Subject: [mwg] Re: RDMA tutorial and OFA > > > > Keep me in the loop. > > I am interested to do it also. > > Thanks, > > Arkady > > On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas > > wrote: > > Richard, Andy, > > > > Thanks for copying me Richard. I had not seen Andy's email on the > > general > > list. > > > > Figuring out how to get tutorial and other documentation created and > > published in the list of things to get done in 2009 for me in my > > part-time > > role as Exec. Dir. > > > > There is no funding set up for this at the moment but I believe > > there will > > be in about 30 days. > > > > That's because I'm thinking that we can get funding for this by > > making it > > part of the funding for a new marketing plan for OFA that, with Wayne > > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote > > on at > > the next con-call meeting which is on May 20 at 9.00AM PDT. > > > > Would you be willing to work with me and create a small team from > > others > > within OFA who have the same interest to prepare a description by > > May 20 of > > what the tutorial would look like, who would contribute to it, how > > to get it > > "polished up" for web and/or book style publication, what the > > overall costs > > would be, etc. > > > > My thoughts, that could be a starting point for the team's work, are > > that we > > would make the creation a collective effort. 
> > > > The tutorial would have several sections for example general intro, > > benefits > > of RDMA, applicability in HPC and Enterprise, networking background > > etc. > > Members of the Marketing Working Group would be responsible for this. > > > > The "meat" would be sections for kernel level things (verbs etc.), > > then user > > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each > > section > > overseen by the technical leaders/maintainers of the code within OFA > > for > > that section (for Example Tom Talpey for NFSoRDMA, or you Richard > > for RDS) > > > > Finally the tutorial would have sections about Interoperability > > Testing that > > OFA/IOL does but also what customers can do on there own systems - > > Arkady > > and Rupert and IOL have put in an SC09 tutorial proposal that we > could > > leverage in this section. > > > > To all readers of this email:- > > If you have read this far, please give us all some feedback. If you > > have > > material you'd like to contribute please say so. If there's a better > > way, > > tell us what you think it is! > > > > Thanks, > > > > Bill. > > > > Bill Boas > > Executive Director and Vice Chair > > OpenFabrics Alliance > > 510-375-8840 > > Bill.Boas at openfabrics.org > > www.openfabrics.org > > > > -----Original Message----- > > From: Richard Frank [mailto:richard.frank at oracle.com] > > Sent: Wednesday, April 29, 2009 12:58 PM > > To: Andy Grover > > Cc: Bill Boas; Sumanta Chatterjee > > Subject: Re: RDMA tutorial and OFA > > > > Andy, I saw your postings to ofa-general on this and I agree it > > would be > > great to have this documentation. > > > > As OpenFabrics is really about RDMA... we need to make it simpler > > for folks to pick up and run with RDMA concepts ...vs.. digging thru > > the IB > > specs and code examples, etc. > > > > Let's see what Bill Boas thinks...perhaps OFA has a writer on board > > that > > can help us do this..? 
> > > > I can also help provide input for a new OFA RDMA tutorial doc.. > > > > Rick > > > > Andy Grover wrote: > > > Hi Rick, > > > > > > Are you around for a brief chat this afternoon? I have a crazy > > idea that > > > involves OFA doing something (or putting up $$) and I wanted to > > see what > > > you thought, since you're Oracle's OFA rep, right? > > > > > > -- Andy > > > > > > > > > > > > > > -- > > Cheers, > > Arkady Kanevsky > > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From jsquyres at cisco.com Fri May 1 08:25:40 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Fri, 1 May 2009 11:25:40 -0400 Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org> References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com> <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com> <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org> Message-ID: <47C854CD-173C-45CC-9DBC-482EACC18921@cisco.com> Hear, hear! FWIW, I think the attached slide shows it pictorially pretty well. A good one-line summary: MPI is so popular [in HPC] because the simple things are simple; with verbs, even the simple things are hard. On May 1, 2009, at 10:27 AM, Todd Rimmer wrote: > It goes beyond just a tutorial. > > In talking to customers, the consensus is that many application > programmers struggle with sockets, RDMA is an order of magnitude > beyond that. It's not a cut on programmers, there are some very > strong ones in the enterprise, but a fair percentage only have > associate degrees or technical school training. 
Even the extremely > smart ones have 100 things to juggle (and often must write code such > that entry level programmers can support it), so the risk/reward or > ROI of learning RDMA has to be there. The higher the learning cost > the more difficult to justify the effort. > > To summarize what is really needed: > - simplified APIs and easy migration of applications > - SDP with zCopy was supposed to be a start, unfortunately > the implementation required relinking applications. Sounds simple > to developers, but very tricky in the field, especially with complex > apps, 3rd party scripts to start them, etc. A kernel based "socket > switch" approach is needed to make this 100% transparent. > > - good simple examples of how to do it, sample programs etc > - write the samples then analyze and improve the API to > further simplify them > - connection establishment is still difficult in OFED. Also > many apps are shortcutting the process by avoiding SA queries (hence > impacting the ability of the applications to work properly with QOS, > LMC, complex fabrics (torus, etc), Partitioning, etc). > - either the Base API needs to improve or "helper libraries" > are needed on top of it. > > - effective tools to debug applications. Right now there are very > limited debug facilities in the ofa kernel (and most require a debug > build), strace is not applicable to user verbs (due to kernel > bypass), etc. You need ways to analyze resources (QPs, MRs, etc) > while the application is running or after it has dumped. You need > ways to trace the sequence of Verbs calls to analyze program > behavior and bugs. Also ways to analyze the "on wire" behavior (aka > tcpdump) of an application while its running is needed. Right now > it's impossible in OFED to identify how many QPs are open, let alone > which applications are using them, etc. Tools like madeye are > inefficient and lack the proper filtering to be effective for all > but very simple problems. 
> > - accessibility in scripting languages and other languages (java, > C#, etc). Many languages have powerful capabilities to manage > sockets and TCP layers above it (http, smtp, etc). However there is > no effective way to use RDMA and IB in languages other than C. A > start for scripting languages could be the transparent SDP > approach. For java, C++, C# and other languages there needs to be > effective APIs and libraries that map well into the style of the > language. > > Todd Rimmer > Chief Architect > QLogic Network Systems Group > Voice: 610-233-4852 Fax: 610-233-4777 > Todd.Rimmer at QLogic.com www.QLogic.com > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general- > > bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres > > Sent: Friday, May 01, 2009 9:18 AM > > To: Ryan, Jim > > Cc: iwg at lists.openfabrics.org; Paul Grun; asafs`@voltaire.com; Paul > > Gray; Working Group; Wayne Augsburger; Lloyd Dickman; Sumanta > > Chatterjee; Mikkel Hagen; Roland Dreier (rdreier); > bobs at voltaire.com; > > Jeff at lists.openfabrics.org; general at lists.openfabrics.org; Friedman; > > bill.boas at openfabrics.org; OFA at lists.openfabrics.org; > > Scott at lists.openfabrics.org > > Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA > > > > I'd also like to call the IWG's and MWG's attention to the other > > thread currently running on the general list: "New proposal for > memory > > management." > > > > There are many points in there about attracting non-HPC / enterprise > > network programmers to write verbs-based applications. It's not > just > > documentation / education that is missing -- having a series of FAQs > > and tutorials about verbs programming is not enough. You need a > > network programming API that is no more complex than common sockets > > usage. 
> > > > Specifically: let's not forget that HPC (OF's biggest market right > > now) tends to attract network programmers with PhD's, and/or who are > > among the top programming talent in the world (yes, that's being > > snobbish -- but it's still true). To make OF within reach of the > > masses, you want to lower the bar so that legions of sockets-based > > network programmers can hope to learn/use this stuff without > requiring > > them to get a PhD first. > -- Jeff Squyres Cisco Systems -------------- next part -------------- A non-text attachment was scrubbed... Name: jsquyres-panel-barriers-to-ofed-adoption-slide-5.pdf Type: application/pdf Size: 345219 bytes Desc: not available URL: From caitlin.bestler at gmail.com Fri May 1 09:08:13 2009 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Fri, 1 May 2009 09:08:13 -0700 Subject: [ofa-general] uDAPL DTO completion question. In-Reply-To: <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com> References: <49D2BD00.5010002@cs.anu.edu.au> <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com> <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com> <49D30C7F.1050201@cs.anu.edu.au> <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> <49FA7C21.1050400@cs.anu.edu.au> <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com> Message-ID: <469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com> On Fri, May 1, 2009 at 4:24 AM, arkady kanevsky wrote: > Jie, > it sounds to me that either the variable is not volatile or compiler > optimization > causes some problem. I would check for these first. > Arkady > Agreed, it is definitely a caching issue. Atomics are InfiniBand specific, and there are some fairly complex rules that govern how much the HCA can do caching. The gotcha is that they basically provide some cache coherency guarantees within the context of a connection, but not much between connections or versus local applications. 
That said, it would be rare for HCA caching to be the cause of anything worse than some unexpected ordering. Adapters cache when they have to, but would really rather not allocate or track a lot of resources. Updating to real physical memory ASAP is much simpler. Compilers, on the other hand, *love* optimizing. The key thing to understand is that the HCA is another processor, one that is at least as distant as any other CPU core. Any and all techniques used when sharing memory with another processor apply. Completions hide all that from the application, just promising that specific things are coherent when the user invokes the verbs to reap a completion. So whenever you do without completions you are dealing with an arbitrary multi-processor memory coherence problem. From sashak at voltaire.com Fri May 1 09:07:17 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 19:07:17 +0300 Subject: [ofa-general] Re: [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library. In-Reply-To: <20090427142533.85f00f4d.weiny2@llnl.gov> References: <20090403154301.f656e7a4.weiny2@llnl.gov> <20090423082535.GD8281@sk> <20090423100206.c2621310.weiny2@llnl.gov> <20090425103216.GB28604@sk> <20090427142533.85f00f4d.weiny2@llnl.gov> Message-ID: <20090501160717.GD14714@sk.iol.unh.edu> On 14:25 Mon 27 Apr , Ira Weiny wrote: > > > > `PROG_LDADD' is inappropriate for passing program-specific linker > > flags (except for `-l', `-L', `-dlopen' and `-dlpreopen'). So, use > > the `PROG_LDFLAGS' variable for this purpose. > > > > So '-L' is exception suitable for LDADD. > > Ah ok, I did not know about the exception. We can change if you prefer. I asked for a "general knowledge". I don't see a reason to change especially. 
Sasha

From pashash at gmail.com Fri May 1 09:27:53 2009
From: pashash at gmail.com (Pavel Shamis (Pasha))
Date: Fri, 01 May 2009 19:27:53 +0300
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
References: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com>
Message-ID: <49FB2309.5090702@dev.mellanox.co.il>

Aaron Fabbri (aafabbri) wrote:
> 3. Rip out your registration cache. Make malloc'd buffers go really
> slow (register in fast path) and mpi_alloc_mem() buffers go really fast.
> People will migrate.

People will migrate to what? (A) A new malloc? Or (B) another
interconnect platform that does not require the user to change his
application in order to get reasonable performance?
I'm not sure that people will choose (A) ;-)

Pasha.

From robert.j.woodruff at intel.com Fri May 1 09:37:41 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 1 May 2009 09:37:41 -0700
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com> <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>

Jeff wrote,

>It sounds like your main objection to fixing them is "it's too much
>work." :-(

Not really, in general I think the kernel folks like to keep stuff
out of the kernel if it is not really needed, i.e., if it can be
implemented in user-space, especially really complicated things like
this. It is probably somewhat tricky code to implement, prone to bugs
that could cause instability in the kernel.
Remember if there is a bug in user-space and an application dies, it
only affects that one application; if the bug is in the kernel and it
crashes the system, it affects everyone.

That said, if you can find someone to implement it, then do it and
send in the patches. Assuming the code is not too ugly, maybe it
would get accepted.

From koop at cse.ohio-state.edu Fri May 1 09:57:13 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri, 1 May 2009 12:57:13 -0400 (EDT)
Subject: [ofa-general] New proposal for memory management
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>
Message-ID:

I'm not sure everyone has gotten this point yet -- It's not a matter
of the MPIs having this complicated memory registration code in
userspace they want to push into the kernel to simplify their lives.

The problem is that this registration cache in userspace is a hack
that can't even be guaranteed. Besides that, this hack ruins many
memory debugging tools.

This alone is a major hassle for many users since the memory
registration caching changes the timings. A user can't be told to
just run the application with the registration cache on for normal
runs and then debug with it off. Many errors end up being timing
dependent.

Matt

On Fri, 1 May 2009, Woodruff, Robert J wrote:

> Jeff wrote,
>
> >It sounds like your main objection to fixing them is "it's too much
> >work." :-(
>
> Not really, in general I think the kernel folks like to keep stuff
> out of the kernel if it is not really needed, i.e., if it can be implemented
> in user-space, especially really complicated
> things like this. It is probably a somewhat tricky code to implement, prone
> to bugs that could cause instability in the kernel. Remember if there is a bug
> in user-space and an application dies, it only affects that one application,
> if the bug is in the kernel and it crashes the system it affects everyone.
> > That said, if you can find someone to implement it, then do it and send in > the patches. Assuming the code is not too ugly, maybe it would get accepted. > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Fri May 1 10:09:47 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 May 2009 10:09:47 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com> (Jeff Squyres's message of "Fri, 1 May 2009 08:48:39 -0400") References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com> <8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com> Message-ID: > You mentioned that doing this stuff is a choice; the choice that > MPI's/ ULPs/applications therefore have is: > > - don't use registration caches/memory allocation hooking, have > terrible performance > - use registration caches/memory allocation hooking, have good > performance I think it's a bit of a stretch to suggest that all or even most userspace RDMA applications have the same need for registration caching as MPI. In fact my feeling is that the fact that MPI must deal with RDMA to arbitrary memory allocated by an application out of MPI's control is the exception. My most recent experience was with Cisco's RAB library, and in that case we simply designed the library so that all RDMA was done to memory allocated by the library -- so no need for a registration cache, and in fact no need for registration in any fast path. I suspect that the majority of code written to use RDMA natively will be designed with similar properties. So this proposal is very much an MPI-specific interface. Which leads to my next point. 
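The library-owned-buffer design described above can be sketched in a few
lines of Python. This is only a toy model under stated assumptions: it makes
no real verbs calls, the name ToyBufferPool is hypothetical, and the
registrations counter merely stands in for the single registration a real
library would perform at start-up. The point it illustrates is that
allocation and release in the fast path never touch registration.

```python
class ToyBufferPool:
    """Hypothetical pool: one region, 'registered' once, carved into fixed slots."""

    def __init__(self, buf_size, count):
        self.buf_size = buf_size
        self.region = bytearray(buf_size * count)
        # Stand-in for the one registration done when the library starts.
        self.registrations = 1
        self.free_slots = list(range(count))

    def alloc(self):
        # Fast path: hand out a slice of the already-registered region.
        slot = self.free_slots.pop()
        start = slot * self.buf_size
        return slot, memoryview(self.region)[start:start + self.buf_size]

    def free(self, slot):
        # Fast path: the slot goes back to the pool, registration is untouched.
        self.free_slots.append(slot)


pool = ToyBufferPool(buf_size=4096, count=4)

for _ in range(10):
    slot, buf = pool.alloc()
    buf[:5] = b"hello"      # the application fills the library-owned buffer
    pool.free(slot)

print("registrations performed:", pool.registrations)
```

The cost of this design is the one raised elsewhere in the thread: the
application has to obtain its message buffers from the library rather than
from malloc().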
I have no doubt that the MPI community has a very good idea of a memory registration interface that would make MPI implementations simpler and more robust. However I don't think there's quite as much expertise about what the best way to implement such an interface is. My initial reaction is that I don't want to extend the kernel ABI with a set of new MPI-specific verbs if there's a way around it. We've been told over and over that the registration cache is complex and fragile code -- but moving complex and fragile code into the kernel doesn't magically make it any simpler or more robust, it just means that bugs now crash the whole system instead of just affecting one process. Now, of course MMU notifiers allow the kernel to know reliably when a process's page tables change, which means that all the complicated malloc hooking etc is not needed. So that complexity is avoided in the kernel. But suppose I give userspace the same MMU notifier capability (eg I add a system call like "if any mappings in the virtual address range X ... Y change, then write a 1 to virtual address Z") -- then what do I gain from having the rest of the registration caching in the kernel? (And avoiding the duplication of caching code between multiple MPI implementations is not an answer -- it's quite feasible to put the caching code into libibverbs if that's the best place for it) - R. From sashak at voltaire.com Fri May 1 10:38:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 20:38:06 +0300 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090427150409.9c10e479.weiny2@llnl.gov> References: <20090422185441.6f8601dc.weiny2@llnl.gov> <20090425175710.GI28604@sk> <20090427150409.9c10e479.weiny2@llnl.gov> Message-ID: <20090501173806.GF14714@sk.iol.unh.edu> On 15:04 Mon 27 Apr , Ira Weiny wrote: > > The port output should be from low to high. > What do you see? 
Yes, the port order is good (I was wrong about it). But switch order
is reversed - first discovered switch is printed last. Right?

Sasha

From arkady.kanevsky at gmail.com Fri May 1 10:57:57 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Fri, 1 May 2009 13:57:57 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References: <382A478CAD40FA4FB46605CF81FE39F42B63D6F1@orsmsx507.amr.corp.intel.com>
Message-ID: <517c62fb0905011057o5dfef3d0pc949fdd083360bf4@mail.gmail.com>

Matt,
What is your feel for an FMR-like model in user space?
If an MPI implementation does FMR under the covers, do you foresee
some issues?
It will not be as performant as preregistering memory once earlier in
the MPI program.

Thanks,
Arkady

On Fri, May 1, 2009 at 12:57 PM, Matthew Koop wrote:

>
> I'm not sure everyone has gotten this point yet -- It's not a matter of
> the MPIs having this complicated memory registration code in userspace
> they want to push into the kernel to simplify their lives.
>
> The problem is that this registration cache in userspace is a hack that
> can't even be guaranteed. Besides that, this hack ruins many memory
> debugging tools.
>
> This alone is a major hassle for many users since the memory registration
> caching changes the timings. A user can't be told to just run the
> application with the registration cache on for normal runs and then debug
> with it off. Many errors end up being timing dependent.
>
> Matt
>
> On Fri, 1 May 2009, Woodruff, Robert J wrote:
>
> > Jeff wrote,
> >
> > >It sounds like your main objection to fixing them is "it's too much
> > >work." :-(
> >
> > Not really, in general I think the kernel folks like to keep stuff
> > out of the kernel if it is not really needed, i.e., if it can be
> > implemented in user-space, especially really complicated
> > things like this. It is probably a somewhat tricky code to implement,
> > prone to bugs that could cause instability in the kernel.
Remember if there is a bug
> > in user-space and an application dies, it only affects that one
> > application, if the bug is in the kernel and it crashes the system
> > it affects everyone.
> >
> > That said, if you can find someone to implement it, then do it and
> > send in the patches. Assuming the code is not too ugly, maybe it
> > would get accepted.

--
Cheers,
Arkady Kanevsky

From jgunthorpe at obsidianresearch.com Fri May 1 11:18:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 May 2009 12:18:31 -0600
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>
References: <20090429215508.GW4431@obsidianresearch.com> <20090429222125.GX4431@obsidianresearch.com> <1241044080.3403.374.camel@chromite.mv.qlogic.com> <20090429224411.GC32114@obsidianresearch.com> <23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com> <20090430222230.GF32114@obsidianresearch.com> <4069D3B1-7208-4808-869A-B3B10E36C59E@cisco.com>
Message-ID: <20090501181830.GB3475@obsidianresearch.com>

On Fri, May 01, 2009 at 07:56:48AM -0400, Jeff Squyres wrote:
> On Apr 30, 2009, at 6:22 PM, Jason Gunthorpe wrote:
>
> >After reading all the postings, I think my idea to fix the verbs API
> >to not, essentially, corrupt an existing registration when the virtual
> >address space changes is the best bet.
This slightly changes the
> >semantics of the verbs MR to refer to virtual address space within the
> >process, not the underlying object(s) that happen to be mapped there
> >when the registration is made.
>
> I'm not sure how this helps MPI -- our registration caches will still
> become invalid if the MPI app free()'s registered memory...?

No, they don't. The only reason you have a problem today is because
the memory registration is tied to the underlying *object*, not the
virtual address. So when the app fiddles with things and changes the
virtual address to object mapping, it wrecks your caching.

If instead the registration is tied to a virtual address, then it
doesn't matter what the app does, that virtual address range will
*always* point to the currently mapped objects.

If the app does free() and then malloc() without an intervening
kernel call then it doesn't matter, your cache of registered VM
addresses still says that it is available.

If the app does free() resulting in munmap() and then malloc()
resulting in mmap() and re-uses the same address then, again, it
doesn't matter to you because the VM address is still registered by
the kernel and is switched to the new mmap().

The only problem is that over time your cache will have registrations
of VM that are not in use by the app, or don't have backing objects
any longer. This is not a correctness problem, but it might be a
performance problem.

> MPI maintains a registration cache because registration is so
> expensive. Even if the registration cache becomes "safely" invalid
> (e.g., you'll never get a scenario where one virtual address could
> have previously pointed to a different hardware address within the
> span of one process), it doesn't help.

How so? That would seem to close the data corruption hole entirely.
Sure, you still have to call registration functions, but one step at a time :) > Ok, I'll back off slightly: if you want verbs to go mainstream, there > will be many other ULPs / middleware libraries that have memory models > like MPI's (that the upper layer is responsible for allocating/freeing > message buffers). Put differently: the TCP/sockets stack doesn't have > this restriction; it will be extremely difficult to convert legions of > sockets programmers to verbs if you effectively restrict large > messages to only be allocated/freed by the network layer (kinda > defeats the point of RDMA if you have to copy large messages, right?). Fair enough - but the registration model is pretty much an inevitable consequence of kernel bypass. If you really want to get rid of it then you need to have an operating mode where the WRs are generated by the kernel through syscalls like all the other network stacks. I've not seen any notion of how to separate the two ideas at least.. Jason From aafabbri at cisco.com Fri May 1 11:40:30 2009 From: aafabbri at cisco.com (Aaron Fabbri) Date: Fri, 1 May 2009 18:40:30 +0000 (UTC) Subject: [ofa-general] New proposal for memory management References: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com> <49FB2309.5090702@dev.mellanox.co.il> Message-ID: Pavel Shamis (Pasha gmail.com> writes: > > Aaron Fabbri (aafabbri) wrote: > > 3. Rip out your registration cache. Make malloc'd buffers go really > > slow (register in fast path) and mpi_alloc_mem() buffers go really fast. > > People will migrate. > People will migrate to what? (A) new malloc? Or (B) other interconnect platform > that does not require the user to change his application in order to > get reasonable performance? > I'm not sure that people will choose (A) > Agreed. As I said, getting all MPIs to agree would be the hard part. Given that, I think a new malloc() is easier than porting to non-MPI middleware.
My point is this: - Verbs works well for a number of applications (Roland and I have each written multiple, for example). - IMHO, there is a problem with your API that should be fixed (the messaging layer needs to manage network buf allocation). If you required mpi_alloc_mem, you would get rid of a whole layer of complicated crap. It may not be feasible for you, but it is the right thing to do from an engineering perspective, right? - "MPI" doesn't want to fix the problem, but instead is asking other people to make kernel changes for them and saying things like "verbs is broken". I totally see your guys' problem and feel for you. Either way it comes down to politics: getting some MPI-specific code into the Linux Kernel (fun?), or getting MPI users to have to change crusty old scientific code (very fun?). Could you use the silent corruption problems as leverage to get MPI to move to mpi_alloc_mem? The final point I want to make is that this is open source, so you can always try submitting some elite patches and get the changes you need. Aaron From sashak at voltaire.com Fri May 1 11:55:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 21:55:08 +0300 Subject: [ofa-general] Re: [PATCH 8/8] Convert ibqueryerrors.pl to C and use new ibnetdisc library. In-Reply-To: <20090427145026.7e074ffc.weiny2@llnl.gov> References: <20090423133120.acf0af63.weiny2@llnl.gov> <20090425155441.GE28604@sk> <20090427145026.7e074ffc.weiny2@llnl.gov> Message-ID: <20090501185508.GH14714@sk.iol.unh.edu> On 14:50 Mon 27 Apr , Ira Weiny wrote: > > The removal of this line causes the '-S' option to segfault. Patch to pq/ibn4 > is below. Thanks. I'm applying this to pq/ibn4 > I will work up a separate patch. Right now you are correct: if the SA is > unresponsive the "-S" option will fail. iblinkinfo does the full scan every > time. But that slows down the query for a single switch to the same O(n) > query that a full system scan requires.
I would rather have that query be > O(1). So I implemented ibqueryerrors in this manner with the intent of going > back and "fixing" iblinkinfo. I think having a fallback on a full system > scan is a good idea. Patch for both tools will follow... :-D Thanks. I'm doing pq/ibn4 merge now. We can apply the rest after this. Sasha From jgunthorpe at obsidianresearch.com Fri May 1 11:59:00 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 May 2009 12:59:00 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com> References: <49faf95d.02c3f10a.1fc8.ffffaff5@mx.google.com> Message-ID: <20090501185900.GJ32114@obsidianresearch.com> On Fri, May 01, 2009 at 09:25:33AM -0400, Tom Talpey wrote: > Completely agree. I will add that enterprise network programmers are > going to reject registration caching as well, because it introduces > vulnerabilities into the data path - silent data corruption. For example, > storage won't tolerate it, databases won't, etc. By the same token those apps that care about data security like you cite *must* manually manage their registration to only expose the memory that needs to be exposed at any time. That is a mandatory step as soon as you have client-initiated RDMA operations, no matter what your protocol is. > The problem is that userspace memory registration is slow. Let's address > that, not address how to make a hack (registration caching) go faster. Indeed, but how? You need to make a syscall to pin and map the pages, which is fine, but how do you communicate the information to the HCA in a manner that is utterly secure and doesn't let userspace 'fiddle' it to point to arbitrary random memory?
You get burned pretty fast by the fact that the HCA is DMA'ing instructions out of user space directly :( Jason From sashak at voltaire.com Fri May 1 12:47:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 22:47:26 +0300 Subject: [ofa-general] [PATCH] libibumad: keep port capmask as 32-bit variable In-Reply-To: <112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com> References: <49F16310.1080902@ext.bull.net> <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com> <20090425210255.GL28604@sk> <112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com> Message-ID: <20090501194726.GJ14714@sk.iol.unh.edu> For an unknown reason the IB port capmask was defined as 64-bit unsigned, which caused some portability problems. Fixing this. Pointed out by Nicolas Morey-Chaisemartin and Sean Hefty. Signed-off-by: Sasha Khapyorsky --- libibumad/include/infiniband/umad.h | 2 +- libibumad/src/umad.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 91ccf1d..78862c8 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -129,7 +129,7 @@ typedef struct umad_port { unsigned state; unsigned phys_state; unsigned rate; - uint64_t capmask; + uint32_t capmask; uint64_t gid_prefix; uint64_t port_guid; unsigned pkeys_size; diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 72ef506..deb3b9d 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -160,7 +160,7 @@ get_port(char *ca_name, char *dir, int portnum, umad_port_t *port) goto clean; if (sys_read_uint(port_dir, SYS_PORT_RATE, &port->rate) < 0) goto clean; - if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0) + if (sys_read_uint(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0) goto clean; port->capmask = htonl(port->capmask); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Fri May 1 12:50:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May
2009 22:50:23 +0300 Subject: [ofa-general] [PATCH] ibstat.c: use htohl() for 32-bit capmask conversion In-Reply-To: <20090501194726.GJ14714@sk.iol.unh.edu> References: <49F16310.1080902@ext.bull.net> <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com> <20090425210255.GL28604@sk> <112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com> <20090501194726.GJ14714@sk.iol.unh.edu> Message-ID: <20090501195022.GK14714@sk.iol.unh.edu> capmask field was changed to be 32-bit, so use ntohl() instead of ntohll(). Casting is also not needed then. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibstat.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index 7985be1..06f39ae 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone) printf("%sBase lid: %d\n", pre, port->base_lid); printf("%sLMC: %d\n", pre, port->lmc); printf("%sSM lid: %d\n", pre, port->sm_lid); - printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask)); + printf("%sCapability mask: 0x%08x\n", pre, ntohl(port->capmask)); printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid)); return 0; } -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Fri May 1 12:52:49 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 22:52:49 +0300 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity In-Reply-To: <20090426123009.GA25119@comcast.net> References: <20090426123009.GA25119@comcast.net> Message-ID: <20090501195249.GL14714@sk.iol.unh.edu> On 08:30 Sun 26 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Fri May 1 12:56:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 22:56:03 +0300 Subject: [ofa-general] Re: [PATCH] ibsim: Fixed custom release in SPEC file In-Reply-To: <49F58E31.3020005@ext.bull.net> References: <49F58E31.3020005@ext.bull.net> Message-ID: <20090501195603.GM14714@sk.iol.unh.edu> On 12:51 Mon 27 Apr , Nicolas Morey-Chaisemartin wrote: > Removed a space which made rpmbuild fail when _dist and CUSTOM_RELEASE are set: > error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11 > > This is due to > Release: %rel%{?dist} > and %rel having trailing whitespace. > > > Signed-off-by: Nicolas Morey-Chaisemartin Applied. Thanks. Sasha From sashak at voltaire.com Fri May 1 12:57:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 22:57:02 +0300 Subject: [ofa-general] Re: [PATCH] management: Fixed custom_release in SPEC files In-Reply-To: <49F58FF9.8070608@ext.bull.net> References: <49F58FF9.8070608@ext.bull.net> Message-ID: <20090501195702.GN14714@sk.iol.unh.edu> On 12:59 Mon 27 Apr , Nicolas Morey-Chaisemartin wrote: > Removed a space which made rpmbuild fail when _dist and CUSTOM_RELEASE are set: > error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11 > > This is due to > Release: %rel%{?dist} > and %rel having trailing whitespace. > > Signed-off-by: Nicolas Morey-Chaisemartin Applied. Thanks. Sasha From sashak at voltaire.com Fri May 1 13:27:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 23:27:23 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Add SuperMicro to list of recognized vendors In-Reply-To: <20090427135330.GA24559@comcast.net> References: <20090427135330.GA24559@comcast.net> Message-ID: <20090501202723.GO14714@sk.iol.unh.edu> On 09:53 Mon 27 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks.
Sasha From sashak at voltaire.com Fri May 1 13:27:43 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 23:27:43 +0300 Subject: [ofa-general] Re: [PATCH] OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no condition by #else In-Reply-To: <49F5A930.1030102@ext.bull.net> References: <49F5A930.1030102@ext.bull.net> Message-ID: <20090501202743.GP14714@sk.iol.unh.edu> On 14:46 Mon 27 Apr , Nicolas Morey-Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin Applied. Thanks. Sasha From sashak at voltaire.com Fri May 1 13:42:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 23:42:37 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.c: Display attribute ID in hex rather than decimal In-Reply-To: <20090427181753.GA20430@comcast.net> References: <20090427181753.GA20430@comcast.net> Message-ID: <20090501204237.GQ14714@sk.iol.unh.edu> On 14:17 Mon 27 Apr , Hal Rosenstock wrote: > > for easier correlation to IBA spec > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Fri May 1 13:44:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 23:44:37 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Changes to spec and make files for updated release notes In-Reply-To: <20090427110832.GA22098@comcast.net> References: <20090427110832.GA22098@comcast.net> Message-ID: <20090501204437.GR14714@sk.iol.unh.edu> On 07:08 Mon 27 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Fri May 1 13:50:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 1 May 2009 23:50:28 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL field In-Reply-To: <20090427110619.GA22089@comcast.net> References: <20090427110619.GA22089@comcast.net> Message-ID: <20090501205028.GS14714@sk.iol.unh.edu> Hi Hal, On 07:06 Mon 27 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock There are different things mixed in this patch (like self NodeInfo resolution and redirection status printouts)? Is it just typo? Sasha > --- > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 0e47ccf..c74cb1d 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -500,6 +500,7 @@ enum MAD_FIELDS { > IB_SA_PR_DLID_F, > IB_SA_PR_SLID_F, > IB_SA_PR_NPATH_F, > + IB_SA_PR_SL_F, > > /* > * MC Member rec > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > index c24bc12..81693a2 100644 > --- a/libibmad/src/fields.c > +++ b/libibmad/src/fields.c > @@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = { > {BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint}, > {BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint}, > {BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint}, > + {BITSOFFS(428, 4), "PathRecSL", mad_dump_uint}, > > /* > * MC Member rec > diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c > index 691bdc3..f17da11 100644 > --- a/libibmad/src/resolve.c > +++ b/libibmad/src/resolve.c > @@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, > return -1; > > mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid); > + mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl); > > return ib_portid_set(sm_id, lid, 0, 0); > } > @@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > { > ib_portid_t sm_portid; > char buf[IB_SA_DATA_SIZE] = { 0 }; > + ib_portid_t self = { 0 }; > + 
uint64_t selfguid; > + ibmad_gid_t selfgid; > + uint8_t nodeinfo[64]; > > if (!sm_id) { > sm_id = &sm_portid; > if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0) > return -1; > } > + > + if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport)) > + return -1; > + mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &selfguid); > + mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX); > + mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid); > + > if (*(uint64_t *) & portid->gid == 0) > mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F, > IB_DEFAULT_SUBN_PREFIX); > @@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, > mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid); > > if ((portid->lid = > - ib_path_query_via(srcport, portid->gid, portid->gid, sm_id, > + ib_path_query_via(srcport, selfgid, portid->gid, sm_id, > buf)) < 0) > return -1; > > + mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl); > return 0; > } > > @@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, > return -1; > > mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid); > + mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl); > mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix); > mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid); > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index 07b623d..21fcc9a 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -187,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, > ib_portid_t * dport, void *payload, void *rcvdata) > { > int status, len; > - uint8_t sndbuf[1024], rcvbuf[1024], *mad; > + uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass; > int timeout, retries; > > len = 0; > @@ -209,7 +209,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, > > mad = umad_get_mad(rcvbuf); > > - if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { > + status = 
mad_get_field(mad, 0, IB_MAD_STATUS_F); > + mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); > + if (mgmtclass == IB_SMI_DIRECT_CLASS) > + status &= 0x7fff; > + else if (mgmtclass != IB_SMI_CLASS) { > + if (status & 2) { > + ERRS("MAD redirection not supported; dport (%s)", > + portid2str(dport)); > + return 0; > + } > + } > + if (status) { > ERRS("MAD completed with error status 0x%x; dport (%s)", > status, portid2str(dport)); > return 0; > @@ -254,8 +265,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, > mad = umad_get_mad(rcvbuf); > > if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { > - ERRS("MAD completed with error status 0x%x; dport (%s)", > - status, portid2str(dport)); > + if (status & 2) > + ERRS("MAD redirection not supported; dport (%s)", > + portid2str(dport)); > + else > + ERRS("MAD completed with error status 0x%x; dport (%s)", > + status, portid2str(dport)); > return 0; > } > > From hal.rosenstock at gmail.com Fri May 1 13:59:04 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 1 May 2009 16:59:04 -0400 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL field In-Reply-To: <20090501205028.GS14714@sk.iol.unh.edu> References: <20090427110619.GA22089@comcast.net> <20090501205028.GS14714@sk.iol.unh.edu> Message-ID: Hi Sasha, On Fri, May 1, 2009 at 4:50 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 07:06 Mon 27 Apr     , Hal Rosenstock wrote: >> >> Signed-off-by: Hal Rosenstock > > There are different things mixed in this patch (like self NodeInfo > resolution and redirection status printouts)? Is it just typo? Yes, it was meant to just be the mad.h and fields.c part. The other files were mistakenly included. Sorry. Let me know if you want me to regenerate this. 
-- Hal > Sasha > >> --- >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h >> index 0e47ccf..c74cb1d 100644 >> --- a/libibmad/include/infiniband/mad.h >> +++ b/libibmad/include/infiniband/mad.h >> @@ -500,6 +500,7 @@ enum MAD_FIELDS { >>       IB_SA_PR_DLID_F, >>       IB_SA_PR_SLID_F, >>       IB_SA_PR_NPATH_F, >> +     IB_SA_PR_SL_F, >> >>       /* >>        * MC Member rec >> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c >> index c24bc12..81693a2 100644 >> --- a/libibmad/src/fields.c >> +++ b/libibmad/src/fields.c >> @@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = { >>       {BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint}, >>       {BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint}, >>       {BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint}, >> +     {BITSOFFS(428, 4), "PathRecSL", mad_dump_uint}, >> >>       /* >>        * MC Member rec >> diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c >> index 691bdc3..f17da11 100644 >> --- a/libibmad/src/resolve.c >> +++ b/libibmad/src/resolve.c >> @@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, >>               return -1; >> >>       mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid); >> +     mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl); >> >>       return ib_portid_set(sm_id, lid, 0, 0); >>  } >> @@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, >>  { >>       ib_portid_t sm_portid; >>       char buf[IB_SA_DATA_SIZE] = { 0 }; >> +     ib_portid_t self = { 0 }; >> +     uint64_t selfguid; >> +     ibmad_gid_t selfgid; >> +     uint8_t nodeinfo[64]; >> >>       if (!sm_id) { >>               sm_id = &sm_portid; >>               if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0) >>                       return -1; >>       } >> + >> +     if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport)) >> +             return -1; >> +     mad_decode_field(nodeinfo, 
IB_NODE_PORT_GUID_F, &selfguid); >> +     mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX); >> +     mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid); >> + >>       if (*(uint64_t *) & portid->gid == 0) >>               mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F, >>                               IB_DEFAULT_SUBN_PREFIX); >> @@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, >>               mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid); >> >>       if ((portid->lid = >> -          ib_path_query_via(srcport, portid->gid, portid->gid, sm_id, >> +          ib_path_query_via(srcport, selfgid, portid->gid, sm_id, >>                              buf)) < 0) >>               return -1; >> >> +     mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl); >>       return 0; >>  } >> >> @@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, >>               return -1; >> >>       mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid); >> +     mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl); >>       mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix); >>       mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid); >> >> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >> index 07b623d..21fcc9a 100644 >> --- a/libibmad/src/rpc.c >> +++ b/libibmad/src/rpc.c >> @@ -187,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, >>             ib_portid_t * dport, void *payload, void *rcvdata) >>  { >>       int status, len; >> -     uint8_t sndbuf[1024], rcvbuf[1024], *mad; >> +     uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass; >>       int timeout, retries; >> >>       len = 0; >> @@ -209,7 +209,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, >> >>       mad = umad_get_mad(rcvbuf); >> >> -     if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { >> +     status = mad_get_field(mad, 0, IB_MAD_STATUS_F); 
>> +     mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); >> +     if (mgmtclass == IB_SMI_DIRECT_CLASS) >> +             status &= 0x7fff; >> +     else if (mgmtclass != IB_SMI_CLASS) { >> +             if (status & 2) { >> +                     ERRS("MAD redirection not supported; dport (%s)", >> +                          portid2str(dport)); >> +                     return 0; >> +             } >> +     } >> +     if (status) { >>               ERRS("MAD completed with error status 0x%x; dport (%s)", >>                    status, portid2str(dport)); >>               return 0; >> @@ -254,8 +265,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, >>       mad = umad_get_mad(rcvbuf); >> >>       if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { >> -             ERRS("MAD completed with error status 0x%x; dport (%s)", >> -                  status, portid2str(dport)); >> +             if (status & 2) >> +                     ERRS("MAD redirection not supported; dport (%s)", >> +                          portid2str(dport)); >> +             else >> +                     ERRS("MAD completed with error status 0x%x; dport (%s)", >> +                          status, portid2str(dport)); >>               return 0; >>       } >> >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Fri May 1 14:25:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 2 May 2009 00:25:32 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL field In-Reply-To: References: <20090427110619.GA22089@comcast.net> <20090501205028.GS14714@sk.iol.unh.edu> Message-ID: <20090501212532.GT14714@sk.iol.unh.edu> On 16:59 Fri 01 May , Hal Rosenstock wrote: > > Yes, it was meant to just be the mad.h 
and fields.c part. The other > files were mistakenly included. Sorry. Let me know if you want me to > regenerate this. And wasn't SL decoding to portid's sl field part of the patch? I think it would be better to regenerate, then it will be clear what was supposed to be there. Sasha From hal.rosenstock at gmail.com Fri May 1 14:30:38 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 1 May 2009 17:30:38 -0400 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL field In-Reply-To: <20090501212532.GT14714@sk.iol.unh.edu> References: <20090427110619.GA22089@comcast.net> <20090501205028.GS14714@sk.iol.unh.edu> <20090501212532.GT14714@sk.iol.unh.edu> Message-ID: On Fri, May 1, 2009 at 5:25 PM, Sasha Khapyorsky wrote: > On 16:59 Fri 01 May     , Hal Rosenstock wrote: >> >> Yes, it was meant to just be the mad.h and fields.c part. The other >> files were mistakenly included. Sorry. Let me know if you want me to >> regenerate this. > > And wasn't SL decoding to portid's sl field part of the patch? Not yet; It's under test. > I think it would be better to regenerate, then it will be clear what was > supposed to be there. OK. 
-- Hal > Sasha > From hnrose at comcast.net Fri May 1 14:33:12 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 1 May 2009 17:33:12 -0400 Subject: [ofa-general] [PATCH] libibmad: Add support for SA PathRecord SL field Message-ID: <20090501213312.GA29913@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 2f5673f..432710a 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -500,6 +500,7 @@ enum MAD_FIELDS { IB_SA_PR_DLID_F, IB_SA_PR_SLID_F, IB_SA_PR_NPATH_F, + IB_SA_PR_SL_F, /* * MC Member rec diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 60b310c..129f7e5 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = { {BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint}, {BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint}, {BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint}, + {BITSOFFS(428, 4), "PathRecSL", mad_dump_uint}, /* * MC Member rec From jgunthorpe at obsidianresearch.com Fri May 1 14:36:52 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 May 2009 15:36:52 -0600 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <49F9A729.3090904@voltaire.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> <20090427162349.GI4431@obsidianresearch.com> <49F9A729.3090904@voltaire.com> Message-ID: <20090501213652.GO32114@obsidianresearch.com> On Thu, Apr 30, 2009 at 04:27:05PM +0300, Or Gerlitz wrote: > Jason Gunthorpe wrote: >> including a PID is not best, you should include enough information to >> figure out the pid(s) from proc/xx/fd, and vice versa. 
> maybe it's not the best solution but it seems to me good enough Well, we have to live with these interfaces literally forever, shortcuts ultimately just cause more problems down the road.. Really the thinking should be 'I want to make lsof work usefully' not 'I want some random and different hack to let me see something'. And yes, that is harder. But the IB stack is now at the point where these small hard things are the sort of work that is needed to get parity with the other stuff in Linux.. Jason From sashak at voltaire.com Fri May 1 14:36:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 2 May 2009 00:36:27 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: Add support for SA PathRecord SL field In-Reply-To: <20090501213312.GA29913@comcast.net> References: <20090501213312.GA29913@comcast.net> Message-ID: <20090501213627.GU14714@sk.iol.unh.edu> On 17:33 Fri 01 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hnrose at comcast.net Fri May 1 14:47:24 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 1 May 2009 17:47:24 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Remove some underbars from internal names Message-ID: <20090501214724.GA30974@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index 16a59ef..e6a1cfe 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -97,15 +97,15 @@ typedef struct redir { } redir_t; /* Node to store information about which nodes we are monitoring */ -typedef struct _monitored_node { +typedef struct monitored_node { cl_map_item_t map_item; - struct _monitored_node *next; + struct monitored_node *next; uint64_t guid; boolean_t esp0; char *name; uint32_t num_ports; redir_t redir_port[1]; /* redirection on a per port basis */ -} __monitored_node_t; +} monitored_node_t; struct osm_opensm; /****s* OpenSM: PerfMgr/osm_perfmgr_t @@ -133,7
+133,7 @@ typedef struct osm_perfmgr { cl_event_t sig_query; /* will throttle our querys */ uint32_t max_outstanding_queries; cl_qmap_t monitored_map; /* map the nodes we are tracking */ - __monitored_node_t *remove_list; + monitored_node_t *remove_list; } osm_perfmgr_t; /* * FIELDS diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 7c24819..93644a0 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -119,7 +119,7 @@ extern int wait_for_pending_transactions(osm_stats_t * stats); /********************************************************************** * Internal helper functions. **********************************************************************/ -static void __init_monitored_nodes(osm_perfmgr_t * pm) +static void init_monitored_nodes(osm_perfmgr_t * pm) { cl_qmap_init(&pm->monitored_map); pm->remove_list = NULL; @@ -127,7 +127,7 @@ static void __init_monitored_nodes(osm_perfmgr_t * pm) cl_event_init(&pm->sig_query, FALSE); } -static void __mark_for_removal(osm_perfmgr_t * pm, __monitored_node_t * node) +static void mark_for_removal(osm_perfmgr_t * pm, monitored_node_t * node) { if (pm->remove_list) { node->next = pm->remove_list; @@ -138,10 +138,10 @@ static void __mark_for_removal(osm_perfmgr_t * pm, __monitored_node_t * node) } } -static void __remove_marked_nodes(osm_perfmgr_t * pm) +static void remove_marked_nodes(osm_perfmgr_t * pm) { while (pm->remove_list) { - __monitored_node_t *next = pm->remove_list->next; + monitored_node_t *next = pm->remove_list->next; cl_qmap_remove_item(&pm->monitored_map, (cl_map_item_t *) (pm->remove_list)); @@ -153,7 +153,7 @@ static void __remove_marked_nodes(osm_perfmgr_t * pm) } } -static inline void __decrement_outstanding_queries(osm_perfmgr_t * pm) +static inline void decrement_outstanding_queries(osm_perfmgr_t * pm) { cl_atomic_dec(&pm->outstanding_queries); cl_event_signal(&pm->sig_query); @@ -173,7 +173,7 @@ static void perfmgr_mad_recv_callback(osm_madw_t * p_madw, 
void *bind_context, osm_madw_copy_context(p_madw, p_req_madw); osm_mad_pool_put(pm->mad_pool, p_req_madw); - __decrement_outstanding_queries(pm); + decrement_outstanding_queries(pm); /* post this message for later processing. */ if (cl_disp_post(pm->pc_disp_h, OSM_MSG_MAD_PORT_COUNTERS, @@ -196,7 +196,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, uint64_t node_guid = context->perfmgr_context.node_guid; uint8_t port = context->perfmgr_context.port; cl_map_item_t *p_node; - __monitored_node_t *p_mon_node; + monitored_node_t *p_mon_node; OSM_LOG_ENTER(pm->log); @@ -209,7 +209,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, PRIx64 " not found in monitored map\n", node_guid); goto Exit; } - p_mon_node = (__monitored_node_t *) p_node; + p_mon_node = (monitored_node_t *) p_node; OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C02: %s (0x%" PRIx64 ") port %u\n", p_mon_node->name, p_mon_node->guid, port); @@ -236,7 +236,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, Exit: osm_mad_pool_put(pm->mad_pool, p_madw); - __decrement_outstanding_queries(pm); + decrement_outstanding_queries(pm); OSM_LOG_EXIT(pm->log); } @@ -305,7 +305,7 @@ Exit: /********************************************************************** * Given a monitored node and a port, return the qp **********************************************************************/ -static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) +static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port) { ib_net32_t qp = cl_ntoh32(1); @@ -322,7 +322,7 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) * return the appropriate lid to query that port **********************************************************************/ static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port, - __monitored_node_t * mon_node) + monitored_node_t * mon_node) { if (mon_node && mon_node->num_ports && port < mon_node->num_ports && 
mon_node->redir_port[port].redir_lid) @@ -414,12 +414,12 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, /********************************************************************** * sweep the node_guid_tbl and collect the node guids to be tracked **********************************************************************/ -static void __collect_guids(cl_map_item_t * p_map_item, void *context) +static void collect_guids(cl_map_item_t * p_map_item, void *context) { osm_node_t *node = (osm_node_t *) p_map_item; uint64_t node_guid = cl_ntoh64(node->node_info.node_guid); osm_perfmgr_t *pm = (osm_perfmgr_t *) context; - __monitored_node_t *mon_node = NULL; + monitored_node_t *mon_node = NULL; uint32_t num_ports; OSM_LOG_ENTER(pm->log); @@ -462,7 +462,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context) ib_api_status_t status = IB_SUCCESS; osm_perfmgr_t *pm = context; osm_node_t *node = NULL; - __monitored_node_t *mon_node = (__monitored_node_t *) p_map_item; + monitored_node_t *mon_node = (monitored_node_t *) p_map_item; osm_madw_context_t mad_context; uint64_t node_guid = 0; ib_net32_t remote_qp; @@ -477,7 +477,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context) "ERR 4C07: Node \"%s\" (guid 0x%" PRIx64 ") no longer exists so removing from PerfMgr monitoring\n", mon_node->name, mon_node->guid); - __mark_for_removal(pm, mon_node); + mark_for_removal(pm, mon_node); goto Exit; } @@ -779,7 +779,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm) */ OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Gathering PerfMgr stats\n"); cl_plock_acquire(pm->lock); - cl_qmap_apply_func(&pm->subn->node_guid_tbl, __collect_guids, pm); + cl_qmap_apply_func(&pm->subn->node_guid_tbl, collect_guids, pm); cl_plock_release(pm->lock); /* then for each node query their counters */ @@ -788,7 +788,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm) /* Clean out any nodes found to be removed during the * sweep */ - __remove_marked_nodes(pm); + 
remove_marked_nodes(pm); #if ENABLE_OSM_PERF_MGR_PROFILE /* spin on outstanding queries */ @@ -854,7 +854,7 @@ void osm_perfmgr_destroy(osm_perfmgr_t * pm) * will be missed. **********************************************************************/ static void perfmgr_check_oob_clear(osm_perfmgr_t * pm, - __monitored_node_t * mon_node, uint8_t port, + monitored_node_t * mon_node, uint8_t port, perfmgr_db_err_reading_t * cr, perfmgr_db_data_cnt_reading_t * dc) { @@ -938,7 +938,7 @@ static int counter_overflow_32(ib_net32_t val) * MAD to the port. **********************************************************************/ static void perfmgr_check_overflow(osm_perfmgr_t * pm, - __monitored_node_t * mon_node, uint8_t port, + monitored_node_t * mon_node, uint8_t port, ib_port_counters_t * pc) { osm_madw_context_t mad_context; @@ -1009,7 +1009,7 @@ Exit: * Check values for logging of errors **********************************************************************/ static void perfmgr_log_events(osm_perfmgr_t * pm, - __monitored_node_t * mon_node, uint8_t port, + monitored_node_t * mon_node, uint8_t port, perfmgr_db_err_reading_t * reading) { perfmgr_db_err_reading_t prev_read; @@ -1066,7 +1066,7 @@ static void pc_rcv_process(void *context, void *data) perfmgr_db_err_reading_t err_reading; perfmgr_db_data_cnt_reading_t data_reading; cl_map_item_t *p_node; - __monitored_node_t *p_mon_node; + monitored_node_t *p_mon_node; OSM_LOG_ENTER(pm->log); @@ -1079,7 +1079,7 @@ static void pc_rcv_process(void *context, void *data) PRIx64 " not found in monitored map\n", node_guid); goto Exit; } - p_mon_node = (__monitored_node_t *) p_node; + p_mon_node = (monitored_node_t *) p_node; OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Processing received MAD status 0x%x context 0x%" @@ -1233,7 +1233,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm, goto Exit; } - __init_monitored_nodes(pm); + init_monitored_nodes(pm); cl_timer_start(&pm->sweep_timer, pm->sweep_time_s * 1000); From 
weiny2 at llnl.gov Fri May 1 16:53:34 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 1 May 2009 16:53:34 -0700 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090501173806.GF14714@sk.iol.unh.edu> References: <20090422185441.6f8601dc.weiny2@llnl.gov> <20090425175710.GI28604@sk> <20090427150409.9c10e479.weiny2@llnl.gov> <20090501173806.GF14714@sk.iol.unh.edu> Message-ID: <20090501165334.59bf72a9.weiny2@llnl.gov> On Fri, 1 May 2009 20:38:06 +0300 Sasha Khapyorsky wrote: > On 15:04 Mon 27 Apr , Ira Weiny wrote: > > > > The port output should be from low to high. > > > What do you see? > > Yes, the port order is good (I was wrong about it). But switch order is > reserved - first discovered switch is printed last. Right? Actually there is no specific order on the switch output at this point. If you choose the "-g" option ibnetdiscover will print differently based on "chassis". I did not attempt to preserve any switch or HCA order printing. I don't know of any utils which require this. Am I wrong? Ira From andy.grover at oracle.com Fri May 1 17:41:49 2009 From: andy.grover at oracle.com (Andy Grover) Date: Fri, 01 May 2009 17:41:49 -0700 Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org> References: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com> <264AF717-9B0C-4DB3-922A-39DCA1940900@cisco.com> <5AEC2602AE03EB46BFC16C6B9B200DA8134A97300F@MNEXMB2.qlogic.org> Message-ID: <49FB96CD.6090100@oracle.com> Todd Rimmer wrote: > It goes beyond just a tutorial. > In talking to customers, the consensus is that many application > programmers struggle with sockets, RDMA is an order of magnitude > beyond that. It's not a cut on programmers, there are some very > strong ones in the enterprise, but a fair percentage only have > associate degrees or technical school training. 
Even the extremely
> smart ones have 100 things to juggle (and often must write code such
> that entry level programmers can support it), so the risk/reward or
> ROI of learning RDMA has to be there. The higher the learning cost
> the more difficult to justify the effort.

Totally agree. Someone emailed me off-list and mentioned he had proposed
an RDMA/IB book to a few publishers and been turned down. (!?!) I don't
know if that would still be the case, but it means there's a lot of work
to do increasing the technology's mindshare and perceived relevance to a
lot of developers, and the OFA and its members need to get the ball
rolling before we can expect the "... for Dummies" people to want to
write a book about RDMA :-)

> simplified APIs and easy
> migration of applications
> accessibility in scripting languages and other languages

Both of these would be great, and I think they go together -- a C# RDMA
API is going to be more accessible to a C# programmer, first just because
it's in the right language, but also because it would handle many
boilerplate sections of code on behalf of the user, presenting a simpler
API than the C API.

> - good simple examples of how to do it, sample programs etc

Yes, I would think this could be in the Tutorial, or a Cookbook section?

> - connection establishment is still difficult in OFED. Also many
> apps are shortcutting the process by avoiding SA queries (hence
> impacting the ability of the applications to work properly with QOS,
> LMC, complex fabrics (torus, etc), Partitioning, etc). - either the
> Base API needs to improve or "helper libraries" are needed on top of
> it.

Could go in the language wrapper libs. A helper lib for the C API itself
might also be nice, yes.

> - effective tools to debug applications.

True!

Regards -- Andy

From Jie.Cai at cs.anu.edu.au Fri May 1 23:36:29 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Sat, 02 May 2009 16:36:29 +1000
Subject: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com>
References: <49D2BD00.5010002@cs.anu.edu.au> <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com> <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com> <49D30C7F.1050201@cs.anu.edu.au> <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> <49FA7C21.1050400@cs.anu.edu.au> <517c62fb0905010424k76da4e59tc9a7a857ba5af727@mail.gmail.com> <469958e00905010908kb6d1361n43b48b7486824bf3@mail.gmail.com>
Message-ID: <49FBE9ED.8090308@cs.anu.edu.au>

Yes, the variable "target" has been declared volatile.
However, it is a pointer that points, via a type cast, to "char *rbuf",
where rbuf has been allocated with malloc.

Could this cause the trouble?

I tried gcc with no optimization, and with -O2 and -O3 as well, but the
program still loops forever.

I still haven't figured out where the problem is.

Do you have any other comments?

Regards,
-- Jie Cai

Caitlin Bestler wrote:
> On Fri, May 1, 2009 at 4:24 AM, arkady kanevsky
> wrote:
>
>> Jie,
>> it sounds to me that either the variable is not volatile or compiler
>> optimization
>> causes some problem. I would check for these first.
>> Arkady
>>
>>
>
> Agreed, it is definitely a caching issue.
>
> Atomics are InfiniBand specific, and there are some fairly complex
> rules that govern
> how much the HCA can do caching. The gotcha is that they basically provide some
> cache coherency guarantees within the context of a connection, but not
> much between
> connections or versus local applications.
>
> That said, it would be rare for HCA caching to be the cause of
> anything worse than
> some unexpected ordering. Adapters cache when they have to, but would
> really rather
> not allocate or track a lot of resources. Updating to real physical
> memory ASAP is much
> simpler.
>
> Compilers, on the other hand, *love* optimizing. The key thing to
> understand is that the
> HCA is another processor, one that is at least as distant as any other
> CPU core.
Any > and all techniques used when sharing memory with another processor apply. > > Completions hide all that from the application, just promising that > specific things are > coherent when the user invokes the verbs to reap a completion. So > whenever you do > without completions you are dealing with an arbitrary multi-processor > memory coherence > problem. > From vlad at lists.openfabrics.org Sat May 2 03:22:09 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 2 May 2009 03:22:09 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090502-0200 daily build status Message-ID: <20090502102209.DFC8AE613C2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with 
linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From bart.vanassche at gmail.com Sat May 2 04:46:24 2009
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Sat, 2 May 2009 13:46:24 +0200
Subject: [ofa-general] OFED, the backported header and sg_init_table()
Message-ID:

Hello,

Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
looking at the backported kernel headers. I found the following in the
header file
/usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:

    #define sg_init_table(a, b)

Or: sg_init_table() is defined to do nothing. I was expecting the following
however:

    #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);

The sg_init_table() function is implemented in e.g. 2.6.29 as follows:

    void sg_init_table(struct scatterlist *sgl, unsigned int nents)
    {
            memset(sgl, 0, sizeof(*sgl) * nents);
    #ifdef CONFIG_DEBUG_SG
            {
                    unsigned int i;
                    for (i = 0; i < nents; i++)
                            sgl[i].sg_magic = SG_MAGIC;
            }
    #endif
            sg_mark_end(&sgl[nents - 1]);
    }

Does anyone know why sg_init_table() is defined such that it does nothing in
the backported OFED headers?

Bart.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vlad at lists.openfabrics.org Sun May 3 03:21:42 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 3 May 2009 03:21:42 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090503-0200 daily build status Message-ID: <20090503102142.B2566E611FE@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 
with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Sun May 3 03:48:32 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 03 May 2009 13:48:32 +0300 Subject: [ofa-general] Build failures on current 1.4.1 dailies In-Reply-To: References: Message-ID: <49FD7680.1060508@mellanox.co.il> Jon/Steve I see the issue is with nfs - please look at this Thanks Tziporet Gennadiy Nerubayev wrote: > Hi all, > > Running on 2.6.27.21 x64. ofa_kernel build error as follows: > > -I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \ > -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing > -fno-common -Werror-implicit-function-declaration -Os -m64 > -mtune=generic -mno-red-zone -mc > model=kernel -funit-at-a-time -maccumulate-outgoing-args > -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe > -Wno-sign-compare -fno-asynchronous-unwind-tables > -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default > -fno-stack-protector -fomit-frame-pointer -g > -Wdeclaration-after-statement -Wno-pointer-sign > -fwrapv -DMODULE -D"KBUILD_STR(s)=#s" > -D"KBUILD_BASENAME=KBUILD_STR(file)" > -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/ > fs/nfs/.tmp_file.o > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function > 'nfs_write_begin': > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error: > implicit declaration of function '__grab_cache_page' > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: > warning: assignment makes pointer from integer without a cast > make[3]: *** > 
[/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1
> make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2
> make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2
> make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21'
> make: *** [kernel] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build)
>
> Assuming we turn off nfs stuff to go further, error number two is from
> infiniband-diags:
>
> checking whether to build shared libraries... yes
> checking whether to build static libraries... yes
> checking for sys_read_string in -libcommon... yes
> checking for umad_init in -libumad... yes
> checking for mad_dump_int in -libmad... no
> configure: error: mad_dump_int() not found. diags require libibmad.
> error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build)
>
> I confirmed that pulling management git and compiling libs and diags
> from there does not have this issue, and that the libibmad.so.1 that
> gets compiled in the daily OFED does not have mad_dump_int().
>
>

From jackm at dev.mellanox.co.il Sun May 3 05:15:44 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 3 May 2009 15:15:44 +0300
Subject: [ofa-general] OFED, the backported header and sg_init_table()
In-Reply-To:
References:
Message-ID: <200905031515.45206.jackm@dev.mellanox.co.il>

On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Does anyone know why sg_init_table() is defined such that it does nothing in
> the backported OFED headers ?
>

My mistake while doing backports. Will be fixed in rc5.
- Jack

From jackm at dev.mellanox.co.il Sun May 3 06:04:05 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 3 May 2009 16:04:05 +0300
Subject: [ofa-general] OFED, the backported header and sg_init_table()
In-Reply-To:
References:
Message-ID: <200905031604.05907.jackm@dev.mellanox.co.il>

On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Hello,
>
> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
> looking at the backported kernel headers. I found the following in the
> header file
> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h:
>
> #define sg_init_table(a, b)
>
> Or: sg_init_table() is defined to do nothing. I was expecting the following
> however:
>
> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents);
>
> The sg_init_table() function is implemented in e.g. 2.6.29 as follows:
>
> void sg_init_table(struct scatterlist *sgl, unsigned int nents)
> {
>         memset(sgl, 0, sizeof(*sgl) * nents);
> #ifdef CONFIG_DEBUG_SG
>         {
>                 unsigned int i;
>                 for (i = 0; i < nents; i++)
>                         sgl[i].sg_magic = SG_MAGIC;
>         }
> #endif
>         sg_mark_end(&sgl[nents - 1]);
> }
>
> Does anyone know why sg_init_table() is defined such that it does nothing in
> the backported OFED headers ?
>
> Bart.

I checked this more carefully.
Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit
45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c)

Before this, no initialization was done on the sg page_list, and we had no
problems. When doing the backport, then, I simply made this a NOP.

I'm not convinced that sg_init_table needs to be implemented in kernels earlier
than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and
the page list was not previously zeroed out before usage.

What do you think?
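The difference between the two definitions under discussion can be seen in a small userspace sketch. This is only an illustration under stated assumptions, not code from OFED or the kernel: `struct mock_sg`, `sg_init_table_nop`, `sg_init_table_zero`, `poison`, `count_nonzero_bytes` and the two `stale_bytes_after_*` helpers are hypothetical names, the struct layout is invented, and the sketch leaves out the `sg_mark_end()` call and the `CONFIG_DEBUG_SG` magic that the 2.6.29 function quoted above also performs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct scatterlist; NOT the kernel layout. */
struct mock_sg {
    unsigned long page_link;
    unsigned int offset;
    unsigned int length;
};

/* Variant 1: models the backport as posted, a macro that expands to nothing. */
#define sg_init_table_nop(sgl, nents)

/* Variant 2: zero the whole table, as the upstream function's memset() does. */
#define sg_init_table_zero(sgl, nents) \
    memset((sgl), 0, sizeof(*(sgl)) * (nents))

/* Fill the table with a marker byte to model memory that was never cleared. */
static void poison(struct mock_sg *sgl, size_t nents)
{
    memset(sgl, 0xA5, sizeof(*sgl) * nents);
}

/* Count how many bytes of the table are still non-zero. */
static size_t count_nonzero_bytes(const struct mock_sg *sgl, size_t nents)
{
    const unsigned char *p = (const unsigned char *)sgl;
    size_t i, n = 0;

    for (i = 0; i < sizeof(*sgl) * nents; i++)
        if (p[i] != 0)
            n++;
    return n;
}

/* With the no-op variant, every stale byte survives "initialization". */
size_t stale_bytes_after_nop(void)
{
    struct mock_sg table[4];

    poison(table, 4);
    sg_init_table_nop(table, 4);
    return count_nonzero_bytes(table, 4);
}

/* With the zeroing variant, nothing stale is left behind. */
size_t stale_bytes_after_zero(void)
{
    struct mock_sg table[4];

    poison(table, 4);
    sg_init_table_zero(table, 4);
    return count_nonzero_bytes(table, 4);
}
```

In this sketch, `stale_bytes_after_nop()` returns the full size of the table in bytes and `stale_bytes_after_zero()` returns 0. Whether leftover contents matter for a given caller depends on whether that caller fills in every field itself before use, which is the point in question in this thread.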
- Jack From bart.vanassche at gmail.com Sun May 3 08:36:53 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sun, 3 May 2009 17:36:53 +0200 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: <200905031604.05907.jackm@dev.mellanox.co.il> References: <200905031604.05907.jackm@dev.mellanox.co.il> Message-ID: On Sun, May 3, 2009 at 3:04 PM, Jack Morgenstein wrote: > On Saturday 02 May 2009 14:46, Bart Van Assche wrote: >> Hello, >> >> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started >> looking at the backported kernel headers. I found the following in the >> header file >> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h: >> >> #define sg_init_table(a, b) >> >> Or: sg_init_table() is defined to do nothing. I was expecting the following >> however: >> >> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents); >> >> The sg_init_table() function is implemented in e.g. 2.6.29 as follows: >> >> void sg_init_table(struct scatterlist *sgl, unsigned int nents) >> { >>         memset(sgl, 0, sizeof(*sgl) * nents); >> #ifdef CONFIG_DEBUG_SG >>         { >>                 unsigned int i; >>                 for (i = 0; i < nents; i++) >>                         sgl[i].sg_magic = SG_MAGIC; >>         } >> #endif >>         sg_mark_end(&sgl[nents - 1]); >> } >> >> Does anyone know why sg_init_table() is defined such that it does nothing in >> the backported OFED headers ? > > I checked this more carefully. > Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit > 45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c) > > Before this, no initialization was done on the sg page_list, and we had no > problems.  When doing the backport, then, I simply made this a NOP. 
> I'm not convinced that sg_init_table needs to be implemented in kernels earlier
> than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and
> the page list was not previously zeroed out before usage.
>
> What do you think?

My opinion is that it is really dangerous and confusing to have one
version of the sg_init_table() macro that performs initialization and
another version that does not.

As an example, the OFED source file net/sunrpc/xdr.c invokes
sg_init_table(). When this code is compiled against e.g. a 2.6.27
kernel, invoking sg_init_table() will initialize the sg-list properly
because in this case the sg_init_table() included with the 2.6.27
kernel is used. When this code is compiled against e.g. an RHEL 5.3
kernel, invoking the sg_init_table() macro will have no effect because
the sg_init_table() macro from OFED's backported header files is used.
Is this effect really desired?

Bart.

From pashash at gmail.com Sun May 3 09:32:18 2009
From: pashash at gmail.com (Pavel Shamis (Pasha))
Date: Sun, 03 May 2009 19:32:18 +0300
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com> <49FB2309.5090702@dev.mellanox.co.il>
Message-ID: <49FDC712.1010208@dev.mellanox.co.il>

> - Verbs works well for a number of applications (Roland and I have each written
> multiple, for example).
>
BTW, I do not know of many user-level native IB applications. Verbs
works perfectly for kernel-level ULPs, which hide all the complexity
from user level.

> - IMHO, there is a problem with your API that should be fixed (the messaging
> layer needs to manage network buf allocation). If you required mpi_alloc_mem,
> you would get rid of a whole layer of complicated crap. It may not be feasible
> for you, but it is the right thing to do from an engineering perspective,
> right?
>
From an engineering point of view, maybe it is correct. From a business
point of view, we have applications that were written and defined before
OFA. These applications worked well and continue to work well today with
other modern interconnects. MPI people want to use IB and are asking OFA
people to help resolve a problem that simply cannot be resolved at the
user/MPI level.

> - "MPI" doesn't want to fix the problem, but instead is asking other people to
> make kernel changes for them and saying things like "verbs is broken".
>
Maybe the best way is to change the spec and push people to use
mpi_alloc_mem, but that is a long-term solution. We want to allow people
to run their applications now.

> I totally see your guys' problem and feel for you. Either way it comes down to
> politics; getting some MPI-specific code into the Linux Kernel (fun?), or
> getting MPI users to have to change crusty old scientific code (very fun?).
>
BTW, the registration cache code may be useful not only for the MPI
model. I definitely see other HPC models where I would like to have a
kernel-level registration cache. It will be very difficult to push users
to change their code, especially when other interconnects do not require
any code changes from them.

>
> Final point I want to make is that this is open source, so you can always try
> submitting some elite patches and get the changes you need.
>
Before anybody puts human resources into this project, it would be good
to know whether the concept of this solution will be accepted by the OFA
community, and that is the reason why we are discussing it here.
Thanks
Pasha

From nicolas.morey-chaisemartin at ext.bull.net Mon May 4 02:02:12 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 04 May 2009 11:02:12 +0200
Subject: [ofa-general] [PATCH] ibutils - Fix cleanup phase
Message-ID: <49FEAF14.1080606@ext.bull.net>

Move deletion of RPM_BUILD_ROOT before RPM_BUILD_DIR to avoid
'rm: cannot get current directory: No such file or directory' errors during
cleanup phase (showed up on old IA64 RHEL).

Signed-off-by: Sebastien Dugue
Signed-off-by: Nicolas Morey-Chaisemartin
---
 ibutils.spec.in | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ibutils.spec.in b/ibutils.spec.in
index 628230a..1770eb1 100644
--- a/ibutils.spec.in
+++ b/ibutils.spec.in
@@ -81,8 +81,8 @@ esac
 
 %clean
 #Remove installed driver after rpm build finished
-rm -rf $RPM_BUILD_DIR/%{name}-%{version}
 rm -rf $RPM_BUILD_ROOT
+rm -rf $RPM_BUILD_DIR/%{name}-%{version}
 
 %post
 /sbin/ldconfig

From nicolas.morey-chaisemartin at ext.bull.net Mon May 4 02:03:21 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 04 May 2009 11:03:21 +0200
Subject: [ofa-general] [PATCH] Fixed dependcies of ibdmsh on libibdmcom.la
Message-ID: <49FEAF59.6090502@ext.bull.net>

Signed-off-by: Nicolas Morey-Chaisemartin
---
Repost as I sent it to sasha and not yevgeny !
ibdm/ibdm/Makefile.am | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am index 1c57b3b..ba5789a 100644 --- a/ibdm/ibdm/Makefile.am +++ b/ibdm/ibdm/Makefile.am @@ -88,6 +88,7 @@ bin_PROGRAMS = ibdmsh ibdmsh_SOURCES = ibdmsh_wrap.cpp ibdmsh_LDADD = -libdmcom $(TCL_LIBS) ibdmsh_LDFLAGS = -static -Wl,-rpath -Wl,$(TCL_PREFIX)/lib +ibdmsh_DEPENDENCIES=$(lib_LTLIBRARIES) $(srcdir)/Fabric.cpp: $(srcdir)/git_version.h -- 1.6.2-rc2.GIT From kliteyn at dev.mellanox.co.il Mon May 4 02:22:49 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 04 May 2009 12:22:49 +0300 Subject: [ofa-general] Re: [PATCH] ibutils - Fix cleanup phase In-Reply-To: <49FEAF14.1080606@ext.bull.net> References: <49FEAF14.1080606@ext.bull.net> Message-ID: <49FEB3E9.10607@dev.mellanox.co.il> Nicolas Morey-Chaisemartin wrote: > Move deletion of RPM_BUILD_ROOT before RPM_BUILD_DIR to avoid > 'rm: cannot get current directory: No such file or directory' errors during > cleanup phase (showed up on old IA64 RHEL). > > Signed-off-by: Sebastien Dugue > Signed-off-by: Nicolas Morey-Chaisemartin Thanks, applied. -- Yevgeny From kliteyn at dev.mellanox.co.il Mon May 4 02:25:34 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 04 May 2009 12:25:34 +0300 Subject: [ofa-general] Re: [PATCH] Fixed dependcies of ibdmsh on libibdmcom.la In-Reply-To: <49FEAF59.6090502@ext.bull.net> References: <49FEAF59.6090502@ext.bull.net> Message-ID: <49FEB48E.2090109@dev.mellanox.co.il> Nicolas Morey-Chaisemartin wrote: > Signed-off-by: Nicolas Morey-Chaisemartin > --- > Repost as I sent it to sasha and not yevgeny ! 
> ibdm/ibdm/Makefile.am | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am > index 1c57b3b..ba5789a 100644 > --- a/ibdm/ibdm/Makefile.am > +++ b/ibdm/ibdm/Makefile.am > @@ -88,6 +88,7 @@ bin_PROGRAMS = ibdmsh > ibdmsh_SOURCES = ibdmsh_wrap.cpp > ibdmsh_LDADD = -libdmcom $(TCL_LIBS) > ibdmsh_LDFLAGS = -static -Wl,-rpath -Wl,$(TCL_PREFIX)/lib > +ibdmsh_DEPENDENCIES=$(lib_LTLIBRARIES) > > $(srcdir)/Fabric.cpp: $(srcdir)/git_version.h > Thanks, applied. Guess it should take care of bugzilla issue 1539 (https://bugs.openfabrics.org/show_bug.cgi?id=1539) -- Yevgeny From vlad at lists.openfabrics.org Mon May 4 03:25:55 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 4 May 2009 03:25:55 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090504-0200 daily build status Message-ID: <20090504102555.82A30E61401@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with 
linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From nicolas.morey-chaisemartin at ext.bull.net Mon May 4 03:58:09 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Mon, 04 May 2009 12:58:09 +0200 Subject: [ofa-general] [PATCH] infiniband-diags: Added libibnetdiscover to .spec file Message-ID: <49FECA41.7060200@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- infiniband-diags/infiniband-diags.spec.in | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in index 4bbd907..07c46c9 100644 --- a/infiniband-diags/infiniband-diags.spec.in +++ b/infiniband-diags/infiniband-diags.spec.in @@ -51,9 +51,13 @@ rm -rf $RPM_BUILD_ROOT %{_sbindir}/check_lft_balance.pl %{_sbindir}/set_nodedesc.sh %{_sbindir}/sm* +%{_libdir}/*.a +%{_libdir}/*.so* +%{_includedir}/infiniband/*.h %define _perldir %(perl -e 'use Config; $T=$Config{installsitearch}; $T=~/(.*)\\/site_perl.*/; print $1;') %{_perldir}/* %{_mandir}/man8/* +%{_mandir}/man3/* %doc README COPYING ChangeLog 
%changelog -- 1.6.2-rc2.GIT From amirv.mellanox at gmail.com Mon May 4 05:52:32 2009 From: amirv.mellanox at gmail.com (Amir Mellanox) Date: Mon, 4 May 2009 15:52:32 +0300 Subject: [ofa-general] [PATCHv2] sdp: Fixed SDP to work on 2.6.29+ In-Reply-To: <49F862C8.6030102@ext.bull.net> References: <49F862C8.6030102@ext.bull.net> Message-ID: <18e64aac0905040552x7d4ebc16oe69383186afd073e@mail.gmail.com> Thanks, I committed the fix to the ofed-1.5 tree - Amir On Wed, Apr 29, 2009 at 5:23 PM, Nicolas Morey-Chaisemartin < nicolas.morey-chaisemartin at ext.bull.net> wrote: > orphan_count and sockets_allocated have been changed from atomic_t to > percpu_counter. > As percpu_counter structures are huge, they cannot be allocated on the stack without > causing the sdp module to crash. > Both variables are now dynamically allocated at module init. > > Signed-off-by: Nicolas Morey-Chaisemartin < > nicolas.morey-chaisemartin at ext.bull.net> > --- > drivers/infiniband/ulp/sdp/sdp_main.c | 29 +++++++++++++++++++---------- > 1 files changed, 19 insertions(+), 10 deletions(-) > > diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c > b/drivers/infiniband/ulp/sdp/sdp_main.c > index 51801e0..7a38c47 100644 > --- a/drivers/infiniband/ulp/sdp/sdp_main.c > +++ b/drivers/infiniband/ulp/sdp/sdp_main.c > @@ -580,7 +580,7 @@ adjudge_to_death: > /* TODO: tcp_fin_time to get timeout */ > sdp_dbg(sk, "%s: entering time wait refcnt %d\n", __func__, > atomic_read(&sk->sk_refcnt)); > - atomic_inc(sk->sk_prot->orphan_count); > + percpu_counter_inc(sk->sk_prot->orphan_count); > } > > /* TODO: limit number of orphaned sockets. 
> @@ -861,7 +861,7 @@ void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk) > sock_put(&ssk->isk.sk, SOCK_REF_DREQ_TO); > } > > - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); > + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); > } > > void sdp_destroy_work(struct work_struct *work) > @@ -902,7 +902,7 @@ void sdp_dreq_wait_timeout_work(struct work_struct > *work) > sdp_sk(sk)->dreq_wait_timeout = 0; > > if (sk->sk_state == TCP_FIN_WAIT1) > - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); > + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); > > sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT); > > @@ -2162,9 +2162,9 @@ void sdp_urg(struct sdp_sock *ssk, struct sk_buff > *skb) > sk->sk_data_ready(sk, 0); > } > > -static atomic_t sockets_allocated; > +static struct percpu_counter *sockets_allocated; > static atomic_t memory_allocated; > -static atomic_t orphan_count; > +static struct percpu_counter *orphan_count; > static int memory_pressure; > struct proto sdp_proto = { > .close = sdp_close, > @@ -2182,10 +2182,8 @@ struct proto sdp_proto = { > .get_port = sdp_get_port, > /* Wish we had this: .listen = sdp_listen */ > .enter_memory_pressure = sdp_enter_memory_pressure, > - .sockets_allocated = &sockets_allocated, > .memory_allocated = &memory_allocated, > .memory_pressure = &memory_pressure, > - .orphan_count = &orphan_count, > .sysctl_mem = sysctl_tcp_mem, > .sysctl_wmem = sysctl_tcp_wmem, > .sysctl_rmem = sysctl_tcp_rmem, > @@ -2540,6 +2538,15 @@ static int __init sdp_init(void) > spin_lock_init(&sock_list_lock); > spin_lock_init(&sdp_large_sockets_lock); > > + sockets_allocated = kmalloc(sizeof(*sockets_allocated), > GFP_KERNEL); > + orphan_count = kmalloc(sizeof(*orphan_count), GFP_KERNEL); > + percpu_counter_init(sockets_allocated, 0); > + percpu_counter_init(orphan_count, 0); > + > + sdp_proto.sockets_allocated = sockets_allocated; > + sdp_proto.orphan_count = orphan_count; > + > + > sdp_workqueue = 
create_singlethread_workqueue("sdp"); > if (!sdp_workqueue) { > return -ENOMEM; > @@ -2574,9 +2581,9 @@ static void __exit sdp_exit(void) > sock_unregister(PF_INET_SDP); > proto_unregister(&sdp_proto); > > - if (atomic_read(&orphan_count)) > - printk(KERN_WARNING "%s: orphan_count %d\n", __func__, > - atomic_read(&orphan_count)); > + if (percpu_counter_read_positive(orphan_count)) > + printk(KERN_WARNING "%s: orphan_count %lld\n", __func__, > + percpu_counter_read_positive(orphan_count)); > destroy_workqueue(sdp_workqueue); > flush_scheduled_work(); > > @@ -2589,6 +2596,8 @@ static void __exit sdp_exit(void) > sdp_proc_unregister(); > > ib_unregister_client(&sdp_client); > + kfree(orphan_count); > + kfree(sockets_allocated); > } > > module_init(sdp_init); > -- > 1.6.2.GIT > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From michael.heinz at qlogic.com Mon May 4 06:19:04 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Mon, 4 May 2009 08:19:04 -0500 Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org> Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> Hey, all - I submitted this patch back in December; there's some question on my end about whether or not it was accepted for the next release of OFED. Can anyone set me straight? 
-- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Mike Heinz Sent: Thursday, December 18, 2008 4:05 PM To: 'Hal Rosenstock' Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] Patch for libvendor incompatibility with QLogic SM No problem. I figured it had to be something like that. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Thursday, December 18, 2008 4:02 PM To: Mike Heinz Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Patch for libvendor incompatibility with QLogic SM Mike, On Thu, Dec 18, 2008 at 3:49 PM, Mike Heinz wrote: > Hal, > > You've got me really confused now - there are only two cases that need changing, OSMV_QUERY_PATH_REC_BY_GIDS and OSMV_QUERY_PATH_REC_BY_PORT_GUIDS; OSMV_QUERY_PATH_REC_BY_LIDS does *not* need to be changed because it uses the GET method. Thus, this should be the correct patch. (I'm re-including it for clarity). The below looks right to me. The previous one with osm_vendor_mlx_sa.c was truncated somehow in my gmail and appeared to only have 1 of the 2 cases and I didn't look at the attachment. Sorry for the confusion. 
-- Hal > > Signed-off-by: Michael Heinz > -------------------------------- > --- osm_vendor_ibumad_sa.c.orig 2008-10-20 01:00:09.000000000 -0400 > +++ osm_vendor_ibumad_sa.c 2008-12-18 14:50:49.000000000 -0500 > @@ -615,7 +615,8 @@ > sa_mad_data.attr_offset = > ib_get_attr_offset(sizeof(ib_path_rec_t)); > sa_mad_data.comp_mask = > - (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID); > + (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH); > + path_rec.num_path = 0x7f; > sa_mad_data.p_attr = &path_rec; > ib_gid_set_default(&path_rec.dgid, > ((osmv_guid_pair_t *) (p_query_req-> > @@ -634,7 +635,8 @@ > sa_mad_data.attr_offset = > ib_get_attr_offset(sizeof(ib_path_rec_t)); > sa_mad_data.comp_mask = > - (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID); > + (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH); > + path_rec.num_path = 0x7f; > sa_mad_data.p_attr = &path_rec; > memcpy(&path_rec.dgid, > &((osmv_gid_pair_t *) (p_query_req->p_query_input))-> > --- osm_vendor_mlx_sa.c.orig 2008-10-20 01:00:09.000000000 -0400 > +++ osm_vendor_mlx_sa.c 2008-12-18 14:51:34.000000000 -0500 > @@ -743,7 +743,8 @@ > sa_mad_data.attr_offset = > ib_get_attr_offset(sizeof(ib_path_rec_t)); > sa_mad_data.comp_mask = > - (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID); > + (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH); > + path_rec.num_path = 0x7f; > sa_mad_data.p_attr = &path_rec; > ib_gid_set_default(&path_rec.dgid, > ((osmv_guid_pair_t *) (p_query_req-> > @@ -763,7 +764,8 @@ > sa_mad_data.attr_offset = > ib_get_attr_offset(sizeof(ib_path_rec_t)); > sa_mad_data.comp_mask = > - (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID); > + (IB_PR_COMPMASK_DGID | IB_PR_COMPMASK_SGID | IB_PR_COMPMASK_NUMBPATH); > + path_rec.num_path = 0x7f; > sa_mad_data.p_attr = &path_rec; > memcpy(&path_rec.dgid, > &((osmv_gid_pair_t *) > (p_query_req->p_query_input))-> > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania > > 
-----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Thursday, December 18, 2008 3:32 PM > To: Mike Heinz > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Patch for libvendor incompatibility with > QLogic SM > > On Thu, Dec 18, 2008 at 3:22 PM, Mike Heinz wrote: >> >>> Right and it wouldn't need num_paths either (as get assumes 1) so I don't think the changes for OSMV_QUERY_PATH_REC_BY_LIDS in both these patches are needed. >> >> Sorry if I was unclear, the last patch submission neither sets the num_path field nor the attribute mask for OSMV_QUERY_PATH_REC_BY_LIDS queries. > > Right; I didn't see the updated patch was for both sa files. In the new patch, one case was missed in terms of the needed change though unless I missed that too... > From hal.rosenstock at gmail.com Mon May 4 06:36:54 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 4 May 2009 09:36:54 -0400 Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> Message-ID: On 5/4/09, Mike Heinz wrote: > Hey, all - > > I submitted this patch back in December; there's some question on my end > about whether or not it was accepted for the next release of OFED. > > Can anyone set me straight? It is commit fa905120f9971bf1601cc3fed4a7900fe9814892 on the master. It depends on what you mean by next release of OFED as to whether it will be there. If you mean OFED 1.4.1, then the answer appears to be not currently. See opensm-3.2 branch. 
-- Hal From michael.heinz at qlogic.com Mon May 4 06:37:51 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Mon, 4 May 2009 08:37:51 -0500 Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM In-Reply-To: References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org> Thanks for the quick response, Hal. Will that branch be folded into 1.5? 
-- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, May 04, 2009 9:37 AM To: Mike Heinz Cc: general at lists.openfabrics.org; Bob Jaworski; Todd Rimmer Subject: Re: [ofa-general] Patch for libvendor incompatibility with QLogic SM On 5/4/09, Mike Heinz wrote: > Hey, all - > > I submitted this patch back in December; there's some question on my end > about whether or not it was accepted for the next release of OFED. > > Can anyone set me straight? It is commit fa905120f9971bf1601cc3fed4a7900fe9814892 on the master. It depends on what you mean by next release of OFED as to whether it will be there. If you mean OFED 1.4.1, then the answer appears to be not currently. See opensm-3.2 branch. -- Hal
From tziporet at mellanox.co.il Mon May 4 06:39:12 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 4 May 2009 16:39:12 +0300 Subject: [ofa-general] EWG/OFED meeting agenda for today (May 4) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>

This is the agenda for today's EWG/OFED meeting

1. OFED 1.4.1 status
   RC4 was done on Thursday, but we still have some open bugs. We must decide which bugs are really critical for this release and decide when we are doing RC5 (should be final release)

   ID    Sev  OS    Assignee                      Summary
   1607  blo  SLES  Jeffrey.C.Becker at nasa.gov   kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
   1616  cri  RHEL  jon at opengridcomputing.com   iommu_alloc error when running connectathon on ppc64 nfs ...
   1620  cri  Othe  jon at opengridcomputing.com   backport definition of struct hash_desc doesn't match the...
   1571  cri  RHEL  vu at mellanox.com             nfsrdma server crash @test5 connectathon basic test,
   1287  maj  RHEL  bugzilla at openib.org         IPoIB datagram mode initial packet loss - decided to hold now
   1596  maj  Othe  Jeffrey.C.Becker at nasa.gov   openibd stop failed when nfs is loaded
   1621  maj  RHEL  vu at mellanox.com             RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ... - not sure if this is a showstopper

2. OFED 1.5
   a. Schedule: Since OFED 1.4.1 is delayed by more than a month I think we need to consider its influence on the 1.5 schedule. BTW: If we delay the release we may want to change kernel base to 2.6.31 too
   b. Status: We opened a git tree that is based on 2.6.30, and for now it is compiled on 2.6.30. Need to start the backports. Mellanox will be able to work on the backports only in a few weeks from now. Is there another company that can start earlier?

3. MPI new memory API
   If Jeff S. will join we can discuss the next steps

4. Open discussion

Tziporet

From hal.rosenstock at gmail.com Mon May 4 06:42:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 4 May 2009 09:42:36 -0400 Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org> References: <4C2744E8AD2982428C5BFE523DF8CDCB3E7462465F@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org> Message-ID: On Mon, May 4, 2009 at 9:37 AM, Mike Heinz wrote: > Thanks for the quick response, Hal. Will that branch be folded into 1.5? I was saying the patch is _not_ on that branch. 
I would expect OFED 1.5 to be based off the current master but this is up to Sasha. The master is currently the 3.3 series whereas OFED 1.4 is the 3.2 series. -- Hal From jon at opengridcomputing.com Mon May 4 07:56:42 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Mon, 4 May 2009 09:56:42 -0500 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: References: <200905031604.05907.jackm@dev.mellanox.co.il> Message-ID: <20090504145641.GA19565@opengridcomputing.com> On Sun, May 03, 2009 at 05:36:53PM +0200, Bart Van Assche wrote: > On Sun, May 3, 2009 at 3:04 PM, Jack Morgenstein > wrote: > > On Saturday 02 May 2009 14:46, Bart Van Assche wrote: > >> Hello, > >> > >> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started > >> looking at the backported kernel headers. I found the following in the > >> header file > >> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/linux/scatterlist.h: > >> > >> #define sg_init_table(a, b) > >> > >> Or: sg_init_table() is defined to do nothing. I was expecting the following > >> however: > >> > >> #define sg_init_table(sgl, nents) memset(sgl, 0, sizeof(*sgl) * nents); > >> > >> The sg_init_table() function is implemented in e.g. 2.6.29 as follows: > >> > >> void sg_init_table(struct scatterlist *sgl, unsigned int nents) > >> { > >> memset(sgl, 0, sizeof(*sgl) * nents); > >> #ifdef CONFIG_DEBUG_SG > >> { > >> unsigned int i; > >> for (i = 0; i < nents; i++) > >> sgl[i].sg_magic = SG_MAGIC; > >> } > >> #endif > >> sg_mark_end(&sgl[nents - 1]); > >> } > >> > >> Does anyone know why sg_init_table() is defined such that it does nothing in > >> the backported OFED headers ? > > > > I checked this more carefully. > > Use of sg_init_table was introduced in 2.6.24 by Jens Axboe, in commit > > 45711f1af6eff1a6d010703b4862e0d2b9afd056. (see chunks for core/umem.c) > > > > Before this, no initialization was done on the sg page_list, and we had no > > problems. When doing the backport, then, I simply made this a NOP. 
> > I'm not convinced that sg_init_table needs to be implemented in kernels earlier > > than 2.6.24, since this call is not replacing anything (e.g., a kzalloc), and > > the page list was not previously zeroed out before usage. > > > > What do you think? > > My opinion is that it is really dangerous and confusing to have one > version of the sg_init_table() macro that performs initialization and > another version that does not. As an example, the OFED source file > net/sunrpc/xdr.c invokes sg_init_table(). When this code is compiled > against e.g. a 2.6.27 kernel, invoking sg_init_table() will > initialize the sg-list properly because in this case the > sg_init_table() included with the 2.6.27 kernel is used. When this > code is compiled against e.g. an RHEL 5.3 kernel, invoking the > sg_init_table() macro will have no effect because the sg_init_table() > macro from OFED's backported header files is used. Is this effect > really desired ? What's even worse is that sg_init_table is already defined in the RHEL5.3 headers. When coding up a header cleanup patch for RHEL5.3, I noticed it was already defined in linux/ncrypto.h. Also, it's there for RHEL5.2 (and a few older kernels). I should have the patch out today for review. Thanks, Jon > > Bart. 
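The difference Bart describes between the two macro variants can be illustrated with a minimal userspace sketch. This is not kernel code: struct toy_sg, toy_sg_init_nop, toy_sg_init_zero and the helper functions are hypothetical names made up for illustration, the struct layout is not the real struct scatterlist, and the sketch leaves out sg_mark_end() and the CONFIG_DEBUG_SG magic. It only shows that a table passed through a no-op "init" keeps whatever stale bytes it had, while the memset-based form zeroes it.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the kernel's struct scatterlist (hypothetical layout). */
struct toy_sg {
    unsigned long page_link;
    unsigned int offset;
    unsigned int length;
};

/* Variant 1: a no-op, like the definition in the backported header. */
#define toy_sg_init_nop(sgl, nents) do { } while (0)

/* Variant 2: a zeroing form; arguments parenthesized, no trailing ';'. */
#define toy_sg_init_zero(sgl, nents) \
    memset((sgl), 0, sizeof(*(sgl)) * (nents))

/* Return 1 if every byte of the table is zero, otherwise 0. */
static int toy_table_is_zero(const struct toy_sg *sgl, size_t nents)
{
    const unsigned char *p = (const unsigned char *)sgl;
    size_t i;

    for (i = 0; i < sizeof(*sgl) * nents; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/*
 * Fill a table with stale bytes, apply one of the two variants, and
 * report whether the table ended up zeroed (1) or not (0).
 */
static int toy_demo(int use_zeroing_variant)
{
    struct toy_sg table[4];

    memset(table, 0xAA, sizeof(table)); /* simulate stale memory */
    if (use_zeroing_variant)
        toy_sg_init_zero(table, 4);
    else
        toy_sg_init_nop(table, 4);
    return toy_table_is_zero(table, 4);
}
```

Calling toy_demo(0) returns 0 (the table still holds the stale bytes) and toy_demo(1) returns 1. Whether the stale contents matter for a given caller on pre-2.6.24 kernels is exactly the question debated in the thread; the sketch does not settle it.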

From jon at opengridcomputing.com Mon May 4 08:20:37 2009
From: jon at opengridcomputing.com (Jon Mason)
Date: Mon, 4 May 2009 10:20:37 -0500
Subject: [ofa-general] Build failures on current 1.4.1 dailies
In-Reply-To: <49FD7680.1060508@mellanox.co.il>
References: <49FD7680.1060508@mellanox.co.il>
Message-ID: <20090504152037.GC19565@opengridcomputing.com>

On Sun, May 03, 2009 at 01:48:32PM +0300, Tziporet Koren wrote:
> Jon/Steve
> I see the issue is with nfs - please look at this

I do not think anyone has backported 2.6.27 (as I do not see a
kernel_addons/backport/2.6.27 backport dir). The fix is a simple
one-liner in pagemap.h consisting of:

#define __grab_cache_page grab_cache_page

Since there is not a backport dir for this kernel, do we really want to
add support for it this late in the OFED 1.4.1 release? I have not done
any NFSRDMA testing for this kernel, so this could end up being
something that delays the 1.4.1 release further.

Thanks,
Jon

>
> Thanks
> Tziporet
>
> Gennadiy Nerubayev wrote:
>> Hi all,
>>
>> Running on 2.6.27.21 x64. ofa_kernel build error as follows:
>>
>> -I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \
>> -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing
>> -fno-common -Werror-implicit-function-declaration -Os -m64
>> -mtune=generic -mno-red-zone -mcmodel=kernel
>> -funit-at-a-time -maccumulate-outgoing-args
>> -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe
>> -Wno-sign-compare -fno-asynchronous-unwind-tables
>> -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default
>> -fno-stack-protector -fomit-frame-pointer -g
>> -Wdeclaration-after-statement -Wno-pointer-sign
>> -fwrapv -DMODULE -D"KBUILD_STR(s)=#s"
>> -D"KBUILD_BASENAME=KBUILD_STR(file)"
>> -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/.tmp_file.o
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function 'nfs_write_begin':
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error: implicit declaration of function '__grab_cache_page'
>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: warning: assignment makes pointer from integer without a cast
>> make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1
>> make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2
>> make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2
>> make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21'
>> make: *** [kernel] Error 2
>> error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build)
>>
>> Assuming we turn off nfs stuff to go further, error number two is from
>> infiniband-diags:
>>
>> checking whether to build shared libraries... yes
>> checking whether to build static libraries... yes
>> checking for sys_read_string in -libcommon... yes
>> checking for umad_init in -libumad... yes
>> checking for mad_dump_int in -libmad... no
>> configure: error: mad_dump_int() not found. diags require libibmad.
>> error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build)
>>
>> I confirmed that pulling management git and compiling libs and diags
>> from there does not have this issue, and that the libibmad.so.1 that
>> gets compiled in the daily OFED does not have mad_dump_int().
>>
>>
>

From monis at Voltaire.COM Mon May 4 08:32:57 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 04 May 2009 18:32:57 +0300
Subject: [ofa-general] EWG/OFED meeting agenda for today (May 4)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD028C3B83@mtlexch01.mtl.com>
Message-ID: <49FF0AA9.6000006@Voltaire.COM>

Tziporet Koren wrote:
> This is the agenda for today's EWG/OFED meeting
>
> 1. OFED 1.4.1 status
> RC4 was done on Thursday, but we still have some open bugs.
> We must decide which bugs are really critical for this release and
> decide when we are doing RC5 (should be final release)
>
> ID   Sev OS   Assignee                     Summary
> 1607 blo SLES Jeffrey.C.Becker at nasa.gov  kernel oops during login on sles10 sp2 with OFED-1.4.1-20...
> 1616 cri RHEL jon at opengridcomputing.com  iommu_alloc error when running connectathon on ppc64 nfs ...
> 1620 cri Othe jon at opengridcomputing.com  backport definition of struct hash_desc doesn't match the...
> 1571 cri RHEL vu at mellanox.com           nfsrdma server crash @test5 connectathon basic test,
> 1287 maj RHEL bugzilla at openib.org        IPoIB datagram mode initial packet loss - decided to hold now
> 1596 maj Othe Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
> 1621 maj RHEL vu at mellanox.com           RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ... - not sure if this is a showstopper
>
>
> 2. OFED 1.5
> a. Schedule: Since OFED 1.4.1 is delayed by more than a month I think we
> need to consider its influence on the 1.5 schedule.
> BTW: If we delay the release we may want to change kernel base to 2.6.31
> too
>
> b. Status: We opened a git tree that is based on 2.6.30, and for now it's
> compiled on 2.6.30. Need to start the backports.
> Mellanox will be able to work on the backports only in a few weeks from
> now.
> Is there another company that can start earlier?
>
> 3. MPI new memory API
> If Jeff S. will join we can discuss the next steps
>
> 4. Open discussion
>
>
> Tziporet
>
Please add 1623, it was opened by mistake on gen2

From Jeffrey.C.Becker at nasa.gov Mon May 4 10:08:03 2009
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Mon, 04 May 2009 10:08:03 -0700
Subject: [ofa-general] Build failures on current 1.4.1 dailies
In-Reply-To: <20090504152037.GC19565@opengridcomputing.com>
References: <49FD7680.1060508@mellanox.co.il> <20090504152037.GC19565@opengridcomputing.com>
Message-ID: <49FF20F3.1090808@nasa.gov>

Hi Jon

Jon Mason wrote:
> On Sun, May 03, 2009 at 01:48:32PM +0300, Tziporet Koren wrote:
>
>> Jon/Steve
>> I see the issue is with nfs - please look at this
>>
>
> I do not think anyone has backported 2.6.27 (as I do not see a
> kernel_addons/backport/2.6.27 backport dir). The fix is a simple
> one-liner in pagemap.h consisting of:
> #define __grab_cache_page grab_cache_page
>
> Since there is not a backport dir for this kernel, do we really want to
> add support for it this late in the OFED 1.4.1 release? I have not done
> any NFSRDMA testing for this kernel, so this could end up being
> something that delays the 1.4.1 release further.
>
I originally verified that NFSRDMA built against 2.6.27 for OFED 1.4.
Since OFED 1.4 was based on the 2.6.27 kernel, there was no reason to have a backport.
I believe this is still the case for 1.4.1, but it's possible that one of the upstream fixes caused this breakage. -jeff > Thanks, > Jon > > >> Thanks >> Tziporet >> >> Gennadiy Nerubayev wrote: >> >>> Hi all, >>> >>> Running on 2.6.27.21 x64. ofa_kernel build error as follows: >>> >>> -I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \ >>> -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing >>> -fno-common -Werror-implicit-function-declaration -Os -m64 >>> -mtune=generic -mno-red-zone -mc >>> model=kernel -funit-at-a-time -maccumulate-outgoing-args >>> -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe >>> -Wno-sign-compare -fno-asynchronous-unwind-tables >>> -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default >>> -fno-stack-protector -fomit-frame-pointer -g >>> -Wdeclaration-after-statement -Wno-pointer-sign >>> -fwrapv -DMODULE -D"KBUILD_STR(s)=#s" >>> -D"KBUILD_BASENAME=KBUILD_STR(file)" >>> -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o >>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/ >>> fs/nfs/.tmp_file.o >>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c >>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function >>> 'nfs_write_begin': >>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error: >>> implicit declaration of function '__grab_cache_page' >>> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: >>> warning: assignment makes pointer from integer without a cast >>> make[3]: *** >>> [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1 >>> make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2 >>> make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2 >>> make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21' >>> make: *** [kernel] Error 2 >>> error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build) >>> >>> Assuming we turn off nfs stuff to go further, error number two is from >>> 
infiniband-diags: >>> >>> checking whether to build shared libraries... yes >>> checking whether to build static libraries... yes >>> checking for sys_read_string in -libcommon... yes >>> checking for umad_init in -libumad... yes >>> checking for mad_dump_int in -libmad... no >>> configure: error: mad_dump_int() not found. diags require libibmad. >>> error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build) >>> >>> I confirmed that pulling management git and compiling libs and diags >>> from there does not have this issue, and that the libibmad.so.1 that >>> gets compiled in the daily OFED does not have mad_dump_int(). >>> >>> >>> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hnrose at comcast.net Mon May 4 12:17:32 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 4 May 2009 15:17:32 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover.c: Cosmetic formatting changes Message-ID: <20090504191732.GA29650@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 810b8db..1799618 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -144,7 +144,7 @@ list_node(ibnd_node_t *node, void *user_data) { char *node_type; char *nodename = remap_node_name(node_name_map, node->guid, - node->nodedesc); + node->nodedesc); switch(node->type) { case IB_NODE_SWITCH: @@ -161,8 +161,7 @@ list_node(ibnd_node_t *node, void *user_data) break; } fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", - node_type, - node->guid, node->numports, + node_type, node->guid, node->numports, mad_get_field(node->info, 0, IB_NODE_DEVID_F), mad_get_field(node->info, 0, IB_NODE_VENDORID_F), nodename); @@ -173,15 
+172,12 @@ list_node(ibnd_node_t *node, void *user_data) void list_nodes(ibnd_fabric_t *fabric, int list) { - if (list & LIST_CA_NODE) { + if (list & LIST_CA_NODE) ibnd_iter_nodes_type(fabric, list_node, IB_NODE_CA, NULL); - } - if (list & LIST_SWITCH_NODE) { + if (list & LIST_SWITCH_NODE) ibnd_iter_nodes_type(fabric, list_node, IB_NODE_SWITCH, NULL); - } - if (list & LIST_ROUTER_NODE) { + if (list & LIST_ROUTER_NODE) ibnd_iter_nodes_type(fabric, list_node, IB_NODE_ROUTER, NULL); - } } void @@ -194,14 +190,12 @@ out_ids(ibnd_node_t *node, int group, char *chname) mad_get_field(node->info, 0, IB_NODE_DEVID_F)); if (sysimgguid) fprintf(f, "sysimgguid=0x%" PRIx64, sysimgguid); - if (group - && node->chassis && node->chassis->chassisnum) { + if (group && node->chassis && node->chassis->chassisnum) { fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); if (chname) fprintf(f, " (%s)", clean_nodedesc(chname)); - if (ibnd_is_xsigo_tca(node->guid) - && node->ports[1] - && node->ports[1]->remoteport) + if (ibnd_is_xsigo_tca(node->guid) && node->ports[1] && + node->ports[1]->remoteport) fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); } fprintf(f, "\n"); @@ -242,8 +236,7 @@ out_switch(ibnd_node_t *node, int group, char *chname) nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", - node->numports, node_name(node), - nodename, + node->numports, node_name(node), nodename, node->smaenhsp0 ? 
"enhanced" : "base", node->smalid, node->smalmc); @@ -314,13 +307,12 @@ out_switch_port(ibnd_port_t *port, int group) fprintf(f, "%s", ext_port_str); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->guid, - port->remoteport->node->nodedesc); + port->remoteport->node->guid, + port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); fprintf(f, "\t%s[%d]%s", - node_name(port->remoteport->node), - port->remoteport->portnum, + node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : ""); if (port->remoteport->node->type != IB_NODE_SWITCH) fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); @@ -355,8 +347,7 @@ out_ca_port(ibnd_port_t *port, int group) if (port->node->type != IB_NODE_SWITCH) fprintf(f, "(%" PRIx64 ") ", port->guid); fprintf(f, "\t%s[%d]", - node_name(port->remoteport->node), - port->remoteport->portnum); + node_name(port->remoteport->node), port->remoteport->portnum); str = out_ext_port(port->remoteport, group); if (str) fprintf(f, "%s", str); @@ -364,8 +355,8 @@ out_ca_port(ibnd_port_t *port, int group) fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->guid, - port->remoteport->node->nodedesc); + port->remoteport->node->guid, + port->remoteport->node->nodedesc); fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", port->base_lid, port->lmc, rem_nodename, @@ -513,7 +504,7 @@ dump_topology(int group, ibnd_fabric_t *fabric) fprintf(f, "\n# Chassis Switches"); for (node = ch->nodes; node; - node = node->next_chassis_node) { + node = node->next_chassis_node) { if (node->type == IB_NODE_SWITCH) { out_switch(node, group, chname); for (p = 1; p <= node->numports; p++) { @@ -527,7 +518,7 @@ dump_topology(int group, ibnd_fabric_t *fabric) fprintf(f, "\n# Chassis CAs"); for (node = ch->nodes; node; - node = node->next_chassis_node) { + node = node->next_chassis_node) { if (node->type == 
IB_NODE_CA) { out_ca(node, group, chname); for (p = 1; p <= node->numports; p++) { @@ -545,7 +536,7 @@ dump_topology(int group, ibnd_fabric_t *fabric) iter_user_data.group = group; iter_user_data.skip_chassis_nodes = 0; ibnd_iter_nodes_type(fabric, switch_iter_func, - IB_NODE_SWITCH, &iter_user_data); + IB_NODE_SWITCH, &iter_user_data); } chname = NULL; @@ -556,18 +547,17 @@ dump_topology(int group, ibnd_fabric_t *fabric) fprintf(f, "\nNon-Chassis Nodes\n"); ibnd_iter_nodes_type(fabric, switch_iter_func, - IB_NODE_SWITCH, &iter_user_data); + IB_NODE_SWITCH, &iter_user_data); } iter_user_data.group = group; iter_user_data.skip_chassis_nodes = 0; /* Make pass on CAs */ - ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA, - &iter_user_data); + ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA, &iter_user_data); - /* make pass on routers */ + /* Make pass on routers */ ibnd_iter_nodes_type(fabric, router_iter_func, IB_NODE_ROUTER, - &iter_user_data); + &iter_user_data); return i; } @@ -578,8 +568,7 @@ void dump_ports_report (ibnd_node_t *node, void *user_data) ibnd_port_t *port = NULL; /* for each port */ - for (p = node->numports, port = node->ports[p]; - p > 0; + for (p = node->numports, port = node->ports[p]; p > 0; port = node->ports[--p]) { uint32_t iwidth, ispeed; if (port == NULL) @@ -591,8 +580,7 @@ void dump_ports_report (ibnd_node_t *node, void *user_data) ports_nt_str_compat(node), node->type == IB_NODE_SWITCH ? 
node->smalid : port->base_lid, - port->portnum, - port->guid, + port->portnum, port->guid, dump_linkwidth_compat(iwidth), dump_linkspeed_compat(ispeed)); if (port->remoteport) @@ -604,12 +592,10 @@ void dump_ports_report (ibnd_node_t *node, void *user_data) port->remoteport->node->smalid : port->remoteport->base_lid, port->remoteport->portnum, - port->remoteport->guid, - port->node->nodedesc, + port->remoteport->guid, port->node->nodedesc, port->remoteport->node->nodedesc); else - fprintf(stdout, "%36s'%s'\n", "", - port->node->nodedesc); + fprintf(stdout, "%36s'%s'\n", "", port->node->nodedesc); } } From hnrose at comcast.net Mon May 4 13:00:18 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 4 May 2009 16:00:18 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr DB: Remove leading underscores from internal names Message-ID: <20090504200018.GA4590@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h index 9598d02..d0eff73 100644 --- a/opensm/include/opensm/osm_perfmgr_db.h +++ b/opensm/include/opensm/osm_perfmgr_db.h @@ -120,32 +120,32 @@ typedef enum { * Port counter object. * Store all the port counters for a single port. 
*/ -typedef struct _db_port { +typedef struct db_port { perfmgr_db_err_reading_t err_total; perfmgr_db_err_reading_t err_previous; perfmgr_db_data_cnt_reading_t dc_total; perfmgr_db_data_cnt_reading_t dc_previous; time_t last_reset; -} _db_port_t; +} db_port_t; /** ========================================================================= * group port counters for ports into the nodes */ #define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE << 1) -typedef struct _db_node { +typedef struct db_node { cl_map_item_t map_item; /* must be first */ uint64_t node_guid; boolean_t esp0; - _db_port_t *ports; + db_port_t *ports; uint8_t num_ports; char node_name[NODE_NAME_SIZE]; -} _db_node_t; +} db_node_t; /** ========================================================================= - * all nodes in the system. + * all nodes in the subnet. */ -typedef struct _db { - cl_qmap_t pc_data; /* stores type (_db_node_t *) */ +typedef struct perfmgr_db { + cl_qmap_t pc_data; /* stores type (db_node_t *) */ cl_plock_t lock; struct osm_perfmgr *perfmgr; } perfmgr_db_t; diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c index 8be0b6f..b0bfd36 100644 --- a/opensm/opensm/osm_perfmgr_db.c +++ b/opensm/opensm/osm_perfmgr_db.c @@ -77,17 +77,17 @@ void perfmgr_db_destroy(perfmgr_db_t * db) /********************************************************************** * Internal call db->lock should be held when calling **********************************************************************/ -static inline _db_node_t *_get(perfmgr_db_t * db, uint64_t guid) +static inline db_node_t *get(perfmgr_db_t * db, uint64_t guid) { cl_map_item_t *rc = cl_qmap_get(&db->pc_data, guid); const cl_map_item_t *end = cl_qmap_end(&db->pc_data); if (rc == end) return (NULL); - return ((_db_node_t *) rc); + return ((db_node_t *) rc); } -static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port) +static inline perfmgr_db_err_t bad_node_port(db_node_t * node, uint8_t port) { if (!node) 
return (PERFMGR_EVENT_DB_GUIDNOTFOUND); @@ -98,16 +98,16 @@ static inline perfmgr_db_err_t bad_node_port(_db_node_t * node, uint8_t port) /** ========================================================================= */ -static _db_node_t *__malloc_node(uint64_t guid, boolean_t esp0, - uint8_t num_ports, char *name) +static db_node_t *malloc_node(uint64_t guid, boolean_t esp0, + uint8_t num_ports, char *name) { int i = 0; time_t cur_time = 0; - _db_node_t *rc = malloc(sizeof(*rc)); + db_node_t *rc = malloc(sizeof(*rc)); if (!rc) return (NULL); - rc->ports = calloc(num_ports, sizeof(_db_port_t)); + rc->ports = calloc(num_ports, sizeof(db_port_t)); if (!rc->ports) goto free_rc; rc->num_ports = num_ports; @@ -131,7 +131,7 @@ free_rc: /** ========================================================================= */ -static void __free_node(_db_node_t * node) +static void free_node(db_node_t * node) { if (!node) return; @@ -141,7 +141,7 @@ static void __free_node(_db_node_t * node) } /* insert nodes to the database */ -static perfmgr_db_err_t __insert(perfmgr_db_t * db, _db_node_t * node) +static perfmgr_db_err_t insert(perfmgr_db_t * db, db_node_t * node) { cl_map_item_t *rc = cl_qmap_insert(&db->pc_data, node->node_guid, (cl_map_item_t *) node); @@ -160,15 +160,15 @@ perfmgr_db_create_entry(perfmgr_db_t * db, uint64_t guid, boolean_t esp0, perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_excl_acquire(&db->lock); - if (!_get(db, guid)) { - _db_node_t *pc_node = __malloc_node(guid, esp0, num_ports, - name); + if (!get(db, guid)) { + db_node_t *pc_node = malloc_node(guid, esp0, num_ports, + name); if (!pc_node) { rc = PERFMGR_EVENT_DB_NOMEM; goto Exit; } - if (__insert(db, pc_node)) { - __free_node(pc_node); + if (insert(db, pc_node)) { + free_node(pc_node); rc = PERFMGR_EVENT_DB_FAIL; goto Exit; } @@ -183,7 +183,7 @@ Exit: **********************************************************************/ static inline void debug_dump_err_reading(perfmgr_db_t * db, uint64_t 
guid, uint8_t port_num, - _db_port_t * port, perfmgr_db_err_reading_t * cur) + db_port_t * port, perfmgr_db_err_reading_t * cur) { osm_log_t *log = db->perfmgr->log; @@ -250,14 +250,14 @@ perfmgr_db_err_t perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port, perfmgr_db_err_reading_t * reading) { - _db_port_t *p_port = NULL; - _db_node_t *node = NULL; + db_port_t *p_port = NULL; + db_node_t *node = NULL; perfmgr_db_err_reading_t *previous = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; osm_epi_pe_event_t epi_pe_data; cl_plock_excl_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -323,12 +323,12 @@ perfmgr_db_err_t perfmgr_db_get_prev_err(perfmgr_db_t * db, uint64_t guid, uint8_t port, perfmgr_db_err_reading_t * reading) { - _db_node_t *node = NULL; + db_node_t *node = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -342,12 +342,12 @@ Exit: perfmgr_db_err_t perfmgr_db_clear_prev_err(perfmgr_db_t * db, uint64_t guid, uint8_t port) { - _db_node_t *node = NULL; + db_node_t *node = NULL; perfmgr_db_err_reading_t *previous = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_excl_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -363,7 +363,7 @@ Exit: static inline void debug_dump_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num, - _db_port_t * port, perfmgr_db_data_cnt_reading_t * cur) + db_port_t * port, perfmgr_db_data_cnt_reading_t * cur) { osm_log_t *log = db->perfmgr->log; if (!osm_log_is_active(log, OSM_LOG_DEBUG)) @@ -392,14 +392,14 @@ perfmgr_db_err_t perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port, perfmgr_db_data_cnt_reading_t 
* reading) { - _db_port_t *p_port = NULL; - _db_node_t *node = NULL; + db_port_t *p_port = NULL; + db_node_t *node = NULL; perfmgr_db_data_cnt_reading_t *previous = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; osm_epi_dc_event_t epi_dc_data; cl_plock_excl_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -448,12 +448,12 @@ perfmgr_db_err_t perfmgr_db_get_prev_dc(perfmgr_db_t * db, uint64_t guid, uint8_t port, perfmgr_db_data_cnt_reading_t * reading) { - _db_node_t *node = NULL; + db_node_t *node = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -467,12 +467,12 @@ Exit: perfmgr_db_err_t perfmgr_db_clear_prev_dc(perfmgr_db_t * db, uint64_t guid, uint8_t port) { - _db_node_t *node = NULL; + db_node_t *node = NULL; perfmgr_db_data_cnt_reading_t *previous = NULL; perfmgr_db_err_t rc = PERFMGR_EVENT_DB_SUCCESS; cl_plock_excl_acquire(&db->lock); - node = _get(db, guid); + node = get(db, guid); if ((rc = bad_node_port(node, port)) != PERFMGR_EVENT_DB_SUCCESS) goto Exit; @@ -486,9 +486,9 @@ Exit: return (rc); } -static void __clear_counters(cl_map_item_t * const p_map_item, void *context) +static void clear_counters(cl_map_item_t * const p_map_item, void *context) { - _db_node_t *node = (_db_node_t *) p_map_item; + db_node_t *node = (db_node_t *) p_map_item; int i = 0; time_t ts = time(NULL); @@ -527,7 +527,7 @@ static void __clear_counters(cl_map_item_t * const p_map_item, void *context) void perfmgr_db_clear_counters(perfmgr_db_t * db) { cl_plock_excl_acquire(&db->lock); - cl_qmap_apply_func(&db->pc_data, __clear_counters, (void *)db); + cl_qmap_apply_func(&db->pc_data, clear_counters, (void *)db); cl_plock_release(&db->lock); #if 0 if (db->db_impl->clear_counters) @@ -538,7 +538,7 @@ void 
perfmgr_db_clear_counters(perfmgr_db_t * db) /********************************************************************** * Output a tab delimited output of the port counters **********************************************************************/ -static void __dump_node_mr(_db_node_t * node, FILE * fp) +static void dump_node_mr(db_node_t * node, FILE * fp) { int i = 0; @@ -605,7 +605,7 @@ static void __dump_node_mr(_db_node_t * node, FILE * fp) /********************************************************************** * Output a human readable output of the port counters **********************************************************************/ -static void __dump_node_hr(_db_node_t * node, FILE * fp) +static void dump_node_hr(db_node_t * node, FILE * fp) { int i = 0; @@ -670,19 +670,19 @@ typedef struct { /********************************************************************** **********************************************************************/ -static void __db_dump(cl_map_item_t * const p_map_item, void *context) +static void db_dump(cl_map_item_t * const p_map_item, void *context) { - _db_node_t *node = (_db_node_t *) p_map_item; + db_node_t *node = (db_node_t *) p_map_item; dump_context_t *c = (dump_context_t *) context; FILE *fp = c->fp; switch (c->dump_type) { case PERFMGR_EVENT_DB_DUMP_MR: - __dump_node_mr(node, fp); + dump_node_mr(node, fp); break; case PERFMGR_EVENT_DB_DUMP_HR: default: - __dump_node_hr(node, fp); + dump_node_hr(node, fp); break; } } @@ -694,16 +694,16 @@ void perfmgr_db_print_by_name(perfmgr_db_t * db, char *nodename, FILE *fp) { cl_map_item_t *item = NULL; - _db_node_t *node = NULL; + db_node_t *node = NULL; cl_plock_acquire(&db->lock); /* find the node */ item = cl_qmap_head(&db->pc_data); while (item != cl_qmap_end(&db->pc_data)) { - node = (_db_node_t *)item; + node = (db_node_t *)item; if (strcmp(node->node_name, nodename) == 0) { - __dump_node_hr(node, fp); + dump_node_hr(node, fp); goto done; } item = cl_qmap_next(item); @@ -726,7 +726,7 
@@ perfmgr_db_print_by_guid(perfmgr_db_t * db, uint64_t nodeguid, FILE *fp) node = cl_qmap_get(&db->pc_data, nodeguid); if (node != cl_qmap_end(&db->pc_data)) - __dump_node_hr((_db_node_t *)node, fp); + dump_node_hr((db_node_t *)node, fp); else fprintf(fp, "Node 0x%" PRIx64 " not found...\n", nodeguid); @@ -747,7 +747,7 @@ perfmgr_db_dump(perfmgr_db_t * db, char *file, perfmgr_db_dump_t dump_type) context.dump_type = dump_type; cl_plock_acquire(&db->lock); - cl_qmap_apply_func(&db->pc_data, __db_dump, (void *)&context); + cl_qmap_apply_func(&db->pc_data, db_dump, (void *)&context); cl_plock_release(&db->lock); fclose(context.fp); return (PERFMGR_EVENT_DB_SUCCESS); From gregkh at suse.de Mon May 4 13:00:22 2009 From: gregkh at suse.de (Greg Kroah-Hartman) Date: Mon, 4 May 2009 13:00:22 -0700 Subject: [ofa-general] [PATCH] infiniband: ehca: remove driver_data direct access of struct device Message-ID: <20090504200022.GA22746@kroah.com> From: Greg Kroah-Hartman In the near future, the driver core is going to not allow direct access to the driver_data pointer in struct device. Instead, the functions dev_get_drvdata() and dev_set_drvdata() should be used. These functions have been around since the beginning, so are backwards compatible with all older kernel versions. 
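
The accessor pattern this description refers to can be sketched in userspace C. This is a minimal illustration, not kernel code: `struct device` here is a hypothetical one-field stand-in, the two accessors are simplified re-implementations under the same names, and `demo_shca`, `demo_probe` and `demo_show_handle` are invented names that loosely mirror the probe and sysfs "show" paths touched by the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct device. */
struct device {
	void *driver_data;	/* callers should not touch this directly */
};

/* Simplified accessors; the real ones live in the driver core headers. */
static inline void *dev_get_drvdata(const struct device *dev)
{
	return dev->driver_data;
}

static inline void dev_set_drvdata(struct device *dev, void *data)
{
	dev->driver_data = data;
}

/* Hypothetical driver-private state, standing in for struct ehca_shca. */
struct demo_shca {
	unsigned long long handle;
};

/* Probe path: stash the private state through the setter. */
static void demo_probe(struct device *dev, struct demo_shca *shca)
{
	dev_set_drvdata(dev, shca);
}

/* Attribute path: fetch the private state through the getter. */
static unsigned long long demo_show_handle(struct device *dev)
{
	struct demo_shca *shca = dev_get_drvdata(dev);

	return shca->handle;
}
```

Because every caller goes through the two functions, the storage behind them can later move or become private without touching the drivers, which is the motivation given in the patch description.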
Cc: Sean Hefty Cc: Roland Dreier Cc: Hal Rosenstock Cc: general at lists.openfabrics.org Cc: Christoph Raisch Cc: Hoang-Nam Nguyen Signed-off-by: Greg Kroah-Hartman --- drivers/infiniband/hw/ehca/ehca_main.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -636,7 +636,7 @@ static ssize_t ehca_show_##name(struct struct hipz_query_hca *rblock; \ int data; \ \ - shca = dev->driver_data; \ + shca = dev_get_drvdata(dev); \ \ rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL); \ if (!rblock) { \ @@ -680,7 +680,7 @@ static ssize_t ehca_show_adapter_handle( struct device_attribute *attr, char *buf) { - struct ehca_shca *shca = dev->driver_data; + struct ehca_shca *shca = dev_get_drvdata(dev); return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle); @@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct o shca->ofdev = dev; shca->ipz_hca_handle.handle = *handle; - dev->dev.driver_data = shca; + dev_set_drvdata(&dev->dev, shca); ret = ehca_sense_attributes(shca); if (ret < 0) { @@ -878,7 +878,7 @@ probe1: static int __devexit ehca_remove(struct of_device *dev) { - struct ehca_shca *shca = dev->dev.driver_data; + struct ehca_shca *shca = dev_get_drvdata(&dev->dev); unsigned long flags; int ret; From sean.hefty at intel.com Mon May 4 15:49:49 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 4 May 2009 15:49:49 -0700 Subject: [ofa-general] [PATCH] ib-mgmt: fixup ibsendtrap for windows Message-ID: Fix some typecast issues. 
Signed-off-by: Sean Hefty --- infiniband-diags/src/ibsendtrap.c | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index 469bc39..7ad588e 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port) static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; - n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port)); + n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port)); n->g_or_v.generic.trap_num = cl_hton16(144); - n->issuer_lid = cl_hton16(port->lid); - n->data_details.ntc_144.lid = cl_hton16(port->lid); + n->issuer_lid = cl_hton16((uint16_t) port->lid); + n->data_details.ntc_144.lid = n->issuer_lid; n->data_details.ntc_144.local_changes = TRAP_144_MASK_OTHER_LOCAL_CHANGES; n->data_details.ntc_144.change_flgs = @@ -79,10 +79,10 @@ static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT; - n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port)); + n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port)); n->g_or_v.generic.trap_num = cl_hton16(129); - n->issuer_lid = cl_hton16(port->lid); - n->data_details.ntc_129_131.lid = cl_hton16(port->lid); + n->issuer_lid = cl_hton16((uint16_t) port->lid); + n->data_details.ntc_129_131.lid = n->issuer_lid; n->data_details.ntc_129_131.pad = 0; n->data_details.ntc_129_131.port_num = (uint8_t) error_port; } From messenger at webex.com Mon May 4 17:24:46 2009 From: messenger at webex.com (Jeff Squyres) Date: Tue, 5 May 2009 00:24:46 GMT Subject: [ofa-general] ***SPAM*** Meeting invitation: Verbs memory registration Message-ID: <94632617.1241483086001.JavaMail.nobody@jsj6wl002.webex.com> Hello , 
Jeff Squyres invites you to attend this online meeting. Topic: Verbs memory registration Date: Monday, May 11, 2009 Time: 12:00 pm, Eastern Daylight Time (GMT -04:00, New York) Meeting Number: 203 642 533 Meeting Password: verbs Please click the link below to see more information, or to join the meeting. ---------------------------------------------------------------- ALERT:Toll-Free Dial Restrictions for (408) and (919) Area Codes ---------------------------------------------------------------- As of April 9th, 2009, you can no longer dial toll free in the 408 or 919 area codes in the United States. The affected toll free numbers are: (866) 432-9903 for the San Jose/Milpitas area and (866) 349-3520 for the RTP area. Please dial the local access number for your area from the list below: - San Jose/Milpitas (408) area: 525-6800 - RTP (919) area: 392-3330 ------------------------------------------------------- To join the online meeting ------------------------------------------------------- 1. Go to https://cisco.webex.com/cisco/j.php?ED=119193612&UID=1123387277&PW=5ef2c01d4e5c171043 2. Enter your name and email address. 3. Enter the meeting password: verbs 4. Click "Join Now". ------------------------------------------------------- To join the teleconference only ------------------------------------------------------- 1. Dial into Cisco WebEx (view all Global Access Numbers at http://cisco.com/en/US/about/doing_business/conferencing/index.html 2. Press 3 to attend the meeting. 3. Follow the prompts to enter the Meeting Number (listed above) or Access Code followed by the # sign. 
San Jose, CA: +1.408.525.6800 RTP: +1.919.392.3330 US/Canada: +1.866.432.9903 United Kingdom: +44.20.8824.0117 India: +91.80.4350.1111 Germany: +49.619.6773.9002 Japan: +81.3.5763.9394 China: +86.10.8515.5666 ------------------------------------------------------- To join the meeting on iPhone ------------------------------------------------------- Go to wbx://cisco.webex.com/ciscosales?MK=203642533&MPW=768e0fa81cb639e8ff44dd29522061f0bb154512eeedc42495588659cb1bf790 Don't have the iPhone WebEx application yet? Go to http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=298844386 ------------------------------------------------------- For assistance ------------------------------------------------------- 1. Go to https://cisco.webex.com/cisco/mc 2. On the left navigation bar, click "Support". You can contact me at: jsquyres at cisco.com 1-408-525 0971 To add this meeting to your calendar program (for example Microsoft Outlook), click this link: https://cisco.webex.com/cisco/j.php?ED=119193612&UID=1123387277&ICS=MI&LD=1&RD=2&ST=1&SHA2=PQM0FncQOp/M461AFXiuPStSZyv8DeZiipMItYw7884= The playback of UCF (Universal Communications Format) rich media files requires appropriate players. To view this type of rich media files in the meeting, please check whether you have the players installed on your computer by going to https://cisco.webex.com/cisco/systemdiagnosis.php Sign up for a free trial of WebEx http://www.webex.com/go/mcemfreetrial http://www.webex.com We've got to start meeting like this(TM) IMPORTANT NOTICE: This WebEx service includes a feature that allows audio and any documents and other materials exchanged or viewed during the session to be recorded. By joining this session, you automatically consent to such recordings. If you do not consent to the recording, do not join the session. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jsquyres at cisco.com Mon May 4 17:25:23 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 4 May 2009 20:25:23 -0400
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com><382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com><8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com>
Message-ID: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com>

I think that this thread has gotten to the point where people are no longer reading each post carefully and are therefore re-hashing points that have already been discussed. It has therefore reached the end of its usefulness.

It was suggested today that a teleconference to discuss these issues might be much more useful (an hour-long teleconference can save a week's worth of emails!). This will be a technical call to discuss memory registration issues; it will not be an EWG call.

I've set up a WebEx call for next Monday at the "normal" time: noon US Eastern, 9am US Pacific, 7pm Israel. The invite will be coming to the ewg and general lists shortly.

*** PLEASE USE THE WEBEX URL TO JOIN THE TELECONFERENCE (vs. just dialing in)
(when you log on, it'll prompt you for a phone number to call you back; yes, non-US phone numbers are supported)

I will make up a small number of slides that attempt to summarize all the arguments (on both sides) so far. Hopefully, they can serve as a starting point for discussion.

Thanks; see you next Monday.
On May 1, 2009, at 1:09 PM, Roland Dreier (rdreier) wrote: > > You mentioned that doing this stuff is a choice; the choice that > > MPI's/ ULPs/applications therefore have is: > > > > - don't use registration caches/memory allocation hooking, have > > terrible performance > > - use registration caches/memory allocation hooking, have good > > performance > > I think it's a bit of a stretch to suggest that all or even most > userspace RDMA applications have the same need for registration > caching > as MPI. In fact my feeling is that the fact that MPI must deal with > RDMA to arbitrary memory allocated by an application out of MPI's > control is the exception. My most recent experience was with Cisco's > RAB library, and in that case we simply designed the library so that > all > RDMA was done to memory allocated by the library -- so no need for a > registration cache, and in fact no need for registration in any fast > path. I suspect that the majority of code written to use RDMA > natively > will be designed with similar properties. > > So this proposal is very much an MPI-specific interface. Which > leads to > my next point. I have no doubt that the MPI community has a very good > idea of a memory registration interface that would make MPI > implementations simpler and more robust. However I don't think > there's > quite as much expertise about what the best way to implement such an > interface is. > > My initial reaction is that I don't want to extend the kernel ABI with > a set of new MPI-specific verbs if there's a way around it. We've > been > told over and over that the registration cache is complex and fragile > code -- but moving complex and fragile code into the kernel doesn't > magically make it any simpler or more robust, it just means that bugs > now crash the whole system instead of just affecting one process. 
> > Now, of course MMU notifiers allow the kernel to know reliably when a > process's page tables change, which means that all the complicated > malloc hooking etc is not needed. So that complexity is avoided in > the > kernel. But suppose I give userspace the same MMU notifier capability > (eg I add a system call like "if any mappings in the virtual address > range X ... Y change, then write a 1 to virtual address Z") -- then > what > do I gain from having the rest of the registration caching in the > kernel? (And avoiding the duplication of caching code between > multiple > MPI implementations is not an answer -- it's quite feasible to put the > caching code into libibverbs if that's the best place for it) > > - R. -- Jeff Squyres Cisco Systems From HNGUYEN at de.ibm.com Mon May 4 22:13:16 2009 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Tue, 5 May 2009 07:13:16 +0200 Subject: [ofa-general] Re: [PATCH] infiniband: ehca: remove driver_data direct access of struct device In-Reply-To: <20090504200022.GA22746@kroah.com> References: <20090504200022.GA22746@kroah.com> Message-ID: Hi, This patch looks fine to me. Thanks! Nam Greg Kroah-Hartman wrote on 04.05.2009 22:00:22: > [image removed] > > [PATCH] infiniband: ehca: remove driver_data direct access of struct device > > Greg Kroah-Hartman > > to: > > Sean Hefty, Roland Dreier, Hal Rosenstock, Christoph Raisch, Hoang-Nam Nguyen > > 04.05.2009 22:05 > > Cc: > > general, Greg KH > > From: Greg Kroah-Hartman > > In the near future, the driver core is going to not allow direct access > to the driver_data pointer in struct device. Instead, the functions > dev_get_drvdata() and dev_set_drvdata() should be used. These functions > have been around since the beginning, so are backwards compatible with > all older kernel versions. 
> > Cc: Sean Hefty > Cc: Roland Dreier > Cc: Hal Rosenstock > Cc: general at lists.openfabrics.org > Cc: Christoph Raisch > Cc: Hoang-Nam Nguyen > Signed-off-by: Greg Kroah-Hartman > > --- > drivers/infiniband/hw/ehca/ehca_main.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > --- a/drivers/infiniband/hw/ehca/ehca_main.c > +++ b/drivers/infiniband/hw/ehca/ehca_main.c > @@ -636,7 +636,7 @@ static ssize_t ehca_show_##name(struct > struct hipz_query_hca *rblock; \ > int data; \ > \ > - shca = dev->driver_data; \ > + shca = dev_get_drvdata(dev); \ > \ > rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL); \ > if (!rblock) { \ > @@ -680,7 +680,7 @@ static ssize_t ehca_show_adapter_handle( > struct device_attribute *attr, > char *buf) > { > - struct ehca_shca *shca = dev->driver_data; > + struct ehca_shca *shca = dev_get_drvdata(dev); > > return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle); > > @@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct o > > shca->ofdev = dev; > shca->ipz_hca_handle.handle = *handle; > - dev->dev.driver_data = shca; > + dev_set_drvdata(&dev->dev, shca); > > ret = ehca_sense_attributes(shca); > if (ret < 0) { > @@ -878,7 +878,7 @@ probe1: > > static int __devexit ehca_remove(struct of_device *dev) > { > - struct ehca_shca *shca = dev->dev.driver_data; > + struct ehca_shca *shca = dev_get_drvdata(&dev->dev); > unsigned long flags; > int ret; > From jackm at dev.mellanox.co.il Tue May 5 00:21:36 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 5 May 2009 10:21:36 +0300 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: <20090504145641.GA19565@opengridcomputing.com> References: <20090504145641.GA19565@opengridcomputing.com> Message-ID: <200905051021.36725.jackm@dev.mellanox.co.il> On Monday 04 May 2009 17:56, Jon Mason wrote: > What's even worse is that sg_init_table is already defined in the > RHEL5.3 headers.  
When coding up a header cleanup patch for RHEL5.3, I
> noticed it was already defined in linux/ncrypto.h.  Also, it's there for
> RHEL5.2 (and a few older kernels).
>

I do not see that as "worse". ncrypto is the cryptographic scatterlist API, which is not used anywhere in OFED.
Do we include this only because of its base scatterlist additions?
ncrypto.h itself has a list of includes.

I guess, though, you could do the following for scatterlist.h in the RHEL5.3 backport:
==============================================================================
#ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__
#define __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__

/* crypto.h includes scatterlist.h */
#include

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
	sg->page = page;
}

#define for_each_sg(sglist, sg, nr, __i) \
	for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)

static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
	if (!sg) {
		BUG();
		return NULL;
	}
	return sg + 1;
}

#endif
==============================================================================

linux/ncrypto.h, though, is not part of, say, kernel 2.6.23. Need to check if the above is RedHat-only solution.
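
For readers who want to see how the helpers above behave, here is a rough userspace mock. It is not kernel code: it assumes a cut-down struct scatterlist with only three fields, a dummy struct page, and a hypothetical helper sg_total_length() that is not part of any backport header. BUG() is replaced by a plain NULL return.

```c
#include <assert.h>
#include <stddef.h>

/* Dummy stand-in for the kernel's struct page (assumption for this sketch). */
struct page { int id; };

/* Cut-down mock of the old struct scatterlist; the real one has more fields. */
struct scatterlist {
	struct page *page;
	unsigned int offset;
	unsigned int length;
};

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
	sg->page = page;
}

/* Same shape as the backport macro: walks a flat array of nr entries. */
#define for_each_sg(sglist, sg, nr, __i) \
	for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++)

static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
	if (!sg)
		return NULL;	/* the kernel version calls BUG() here */
	return sg + 1;
}

/* Hypothetical helper: sum the lengths of the first nr entries. */
static unsigned int sg_total_length(struct scatterlist *sglist, int nr)
{
	struct scatterlist *sg;
	unsigned int total = 0;
	int i;

	for_each_sg(sglist, sg, nr, i)
		total += sg->length;
	return total;
}
```

Because both sg_next() and for_each_sg() simply step through a flat array, this only models unchained scatterlists, which is all the older kernels had.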
- Jack From vlad at lists.openfabrics.org Tue May 5 03:25:02 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 5 May 2009 03:25:02 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090505-0200 daily build status Message-ID: <20090505102502.7E36AE61024@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on 
ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at mellanox.co.il Tue May 5 04:01:51 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 5 May 2009 14:01:51 +0300 Subject: [ofa-general] EWG/OFED meeting meeting minutes for May 4, 09 Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0291246E@mtlexch01.mtl.com> These are the OFED meeting minutes for May 4 09: Summary: ======== 1. OFED 1.4.1 is delayed: RC5 is planed for next Monday May 11. GA for May 14 2. OFED 1.5 schedule: will be delayed to October since 1.4.1 is delayed 3. MPI memory registration API request: We decided to have a special meeting on this subject next week on the same slot. Reminder: OFED roadmap on the web: http://www.openfabrics.org/txt/woody/roadmap.txt EWG meeting minutes: http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/ Details: ====== 1. OFED 1.4.1 bugs status These are the open critical bugs: ID Sev OS Assignee Summary 1616 cri RHEL jon at opengridcomputing.com iommu_alloc error when running connectathon on ppc64 nfs ... - we see similar problem in SDP with bug 1612 in PPC. IBM will try to help in debug. 1571 cri RHEL vu at mellanox.com nfsrdma server crash @test5 connectathon basic test - related to mlx4 implementation, fix is under test Other bugs: 1620 cri Other jon at opengridcomputing.com backport definition of struct hash_desc doesn't match the... 
- There is a fix; wait for ok from Brian 1287 maj RHEL bugzilla at openib.org IPoIB datagram mode initial packet loss - document in RN and say there is a workaround 1596 maj Othe Jeffrey.C.Becker at nasa.gov openibd stop failed when nfs is loaded - we will document it for 1.4.1 and do a better fix for 1.5 1621 maj RHEL vu at mellanox.com RHEL 5.3 + OFED 1.4.1-rc4: loading ib_sprt kernel module ... - was usage issue only 1623 maj Othe shirif at voltaire.com IB Devices not found on SLES11, ia64 (HP Blade) - seems FW configuration issue We decided to wait for another week to fix the PPC issue. New schedule is: * RC5 next Monday - May 11 * GA Thu May 14 2. OFED 1.5 schedule Release is delayed by a month. We do not wish to delay any more since we wish to have one major OFED release each year and we want the new OFED release before SC09 Kernel base will stay 2.6.30 This is the new schedule is: Feature Freeze: Jun 7, 09 Alpha Release: Jun 12, 09 Beta Release: Jun 9, 09 RC1: Jul 25, 09 RC2-RCx: About every 2 weeks as needed We usually have ~6 RCs Release: Oct 15, 09 Note: Jeff S. suggests that we drop MPI from the OFED package. We will discuss this after 1.4.1 release. 3. MPI new memory API We will have next week a special meeting on this subject. Jeff will prepare the meeting. Tziporet From ossrosch at linux.vnet.ibm.com Tue May 5 04:23:51 2009 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Tue, 5 May 2009 13:23:51 +0200 Subject: [ofa-general] Queue pair state for multicast group attachment Message-ID: <200905051323.52192.ossrosch@linux.vnet.ibm.com> Hi, during testing with ib_diag tools (ib_send_lat/bw) we noticed some problems by using multicast option. The tools attach queue pairs to multicast groups while the queue pairs are in RESET state. In the IB standard there is no specific definition when the attach operation should be processed. In our opinion a QP should not be attached until it's in RTR to receive data from the multicast group. 
What is the community's opinion about that? Is it possible to change the diag tools?

regards Stefan

From dotanba at gmail.com Tue May 5 04:36:03 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 5 May 2009 14:36:03 +0300
Subject: [ofa-general] Queue pair state for multicast group attachment
In-Reply-To: <200905051323.52192.ossrosch@linux.vnet.ibm.com>
References: <200905051323.52192.ossrosch@linux.vnet.ibm.com>
Message-ID: <2f3bf9a60905050436m2569fafbj7a2a1d49c806bc3b@mail.gmail.com>

I believe that the right QP state in which to attach it to a multicast group is INIT, since in this state you can post receive requests too. As soon as you modify the QP state to RTR, the multicast messages will be received by this QP.

Dotan

On Tue, May 5, 2009 at 2:23 PM, Stefan Roscher wrote:
> Hi,
>
> during testing with ib_diag tools (ib_send_lat/bw) we noticed some problems by using multicast
> option. The tools attach queue pairs to multicast groups while the queue
> pairs are in RESET state. In the IB standard there is no specific definition
> when the attach operation should be processed. In our opinion a QP should not be
> attached until it's in RTR to receive data from the multicast group.
> What is the community's opinion about that? Is it possible to change the diag tools?
>
> regards Stefan
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From sokar6012 at hotmail.com Tue May 5 05:53:39 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Tue, 5 May 2009 12:53:39 +0000
Subject: [ofa-general] SDP error
Message-ID:

Hello,

I'm running Debian 5.0 with OFED 1.4. RDMA works very well, but when I try to use the SDP protocol with ssh, Netperf or a simple client-server program in C, I get socket errors like these:

NetPIPE: can't open stream socket!
errno=97 (for Netpipe)

Address family not supported by protocol ssh (for ssh)

Address family not supported by protocol (for client-server)

Does anyone know these errors?

From dorons at voltaire.com Tue May 5 06:00:30 2009
From: dorons at voltaire.com (Doron Shoham)
Date: Tue, 05 May 2009 16:00:30 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
Message-ID: <4A00386E.2050300@voltaire.com>

When setting max_op_vls = 0, do not force it to 1.
0 is a valid value which means "No change".

Signed-off-by: Doron Shoham
---
 opensm/opensm/osm_port.c   |    6 ------
 opensm/opensm/osm_subnet.c |    8 ++++++++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index 2e6c642..db0c27e 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log,
 	if (op_vls > p_subn->opt.max_op_vls)
 		op_vls = p_subn->opt.max_op_vls;
 
-	if (op_vls == 0) {
-		OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: "
-			"Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n");
-		op_vls = 1;
-	}
-
 	OSM_LOG_EXIT(p_log);
 	return op_vls;
 }
diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index ec15f8a..71fc7a0 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
 		"# switch port connected to a CA or router port\n"
 		"leaf_head_of_queue_lifetime 0x%02x\n\n"
 		"# Limit the maximal operational VLs\n"
+		"# Virtual Lanes operational on this port\n"
+		"# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
+		"#    0: No change; valid only on Set()\n"
+		"#    1: VL0\n"
+		"#    2: VL0, VL1\n"
+		"#    3: VL0 - VL3\n"
+		"#    4: VL0 - VL7\n"
+		"#    5: VL0 - VL14\n"
 		"max_op_vls %u\n\n"
 		"# Force PortInfo:LinkSpeedEnabled on switch ports\n"
 		"# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
--
1.5.4

From gmpc at sanger.ac.uk Tue May 5 06:13:34 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Tue, 05 May 2009 14:13:34 +0100
Subject: [ofa-general] SDP error
In-Reply-To:
References:
Message-ID: <4A003B7E.3070604@sanger.ac.uk>

anthony garnier wrote:
> Hello,
>
> i`m running a debian 5.0 OS with ofed 1.4, RDMA work very well, but
> when I`m trying to use the SDP protocol with ssh, Netperf or a simple
> Client-Server programming in C, I got socket error like that :
>
> NetPIPE: can't open stream socket! errno=97 (for Netpipe)
>
> Address family not supported by protocol ssh (for ssh)
>
> Address family not supported by protocol (for clent-server)
>
> Someone knows those errors?

Is the ib_sdp module loaded?

Cheers,

Guy

--
Dr.
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From hal.rosenstock at gmail.com Tue May 5 06:14:11 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 5 May 2009 09:14:11 -0400 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <4A00386E.2050300@voltaire.com> References: <4A00386E.2050300@voltaire.com> Message-ID: On Tue, May 5, 2009 at 9:00 AM, Doron Shoham wrote: > when setting max_op_vls = 0 > do not force it to 1. > 0 is valid value which means "No change" > > Signed-off-by: Doron Shoham > --- >  opensm/opensm/osm_port.c   |    6 ------ >  opensm/opensm/osm_subnet.c |    8 ++++++++ >  2 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..db0c27e 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >        if (op_vls > p_subn->opt.max_op_vls) >                op_vls = p_subn->opt.max_op_vls; > > -       if (op_vls == 0) { > -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > -               op_vls = 1; > -       } > - Should that only be done when max_op_vls is 0 ? Something like: if (op_vls > p_subn->opt.max_op_vls) op_vls = p_subn->opt.max_op_vls; else if (op_vls == 0) { OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n");
		op_vls = 1;
	}

-- Hal

>        OSM_LOG_EXIT(p_log);
>        return op_vls;
>  }
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..71fc7a0 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
>                "# switch port connected to a CA or router port\n"
>                "leaf_head_of_queue_lifetime 0x%02x\n\n"
>                "# Limit the maximal operational VLs\n"
> +               "# Virtual Lanes operational on this port\n"
> +               "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n"
> +               "#    0: No change; valid only on Set()\n"
> +               "#    1: VL0\n"
> +               "#    2: VL0, VL1\n"
> +               "#    3: VL0 - VL3\n"
> +               "#    4: VL0 - VL7\n"
> +               "#    5: VL0 - VL14\n"
>                "max_op_vls %u\n\n"
>                "# Force PortInfo:LinkSpeedEnabled on switch ports\n"
>                "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n"
> --
> 1.5.4
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From Line.Holen at Sun.COM Tue May 5 06:25:19 2009
From: Line.Holen at Sun.COM (Line.Holen at Sun.COM)
Date: Tue, 05 May 2009 15:25:19 +0200
Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size of the hop table
Message-ID: <4A003E3F.2010100@Sun.COM>

The hops table of ftree_sw_t is too small to hold the hop count of max_lid.
Changed sw_create() to allocate hops[max_lid+1] not hops[max_lid].
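
To make the off-by-one concrete, here is a minimal standalone sketch, not the OpenSM code itself. It assumes the table is indexed directly by LID, so valid indices run from 0 to max_lid inclusive; hops_alloc() and NO_PATH are hypothetical names used only for illustration (NO_PATH stands in for OSM_NO_PATH, and its value here is an assumption).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_PATH 0xFF	/* stand-in for OSM_NO_PATH; value assumed for this sketch */

/*
 * Hypothetical helper: allocate a hop table that can be indexed by
 * every LID in 0..max_lid inclusive, with all entries set to NO_PATH.
 */
static uint8_t *hops_alloc(uint16_t max_lid)
{
	size_t n = (size_t)max_lid + 1;	/* +1 so hops[max_lid] is in bounds */
	uint8_t *hops = malloc(n * sizeof(*hops));

	if (hops == NULL)
		return NULL;
	memset(hops, NO_PATH, n * sizeof(*hops));
	return hops;
}
```

With only max_lid entries, a write to hops[max_lid] would land one element past the end of the allocation, which is the overflow the patch is meant to avoid.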

Signed-off-by: Line Holen
---
diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 0c4741a..8ed2f74 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1,4 +1,5 @@
 /*
+ * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
@@ -554,10 +555,10 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree,
 	/* initialize lft buffer */
 	memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
-	p_sw->hops = malloc(p_osm_sw->max_lid_ho * sizeof(*(p_sw->hops)));
+	p_sw->hops = malloc((p_osm_sw->max_lid_ho + 1) * sizeof(*(p_sw->hops)));
 	if(p_sw->hops == NULL)
 		return NULL;
-	memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho);
+	memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho + 1);
 	return p_sw;
 }				/* sw_create() */

From dorfman.eli at gmail.com Tue May 5 06:48:32 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 05 May 2009 16:48:32 +0300
Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1
In-Reply-To:
References: <4A00386E.2050300@voltaire.com>
Message-ID: <4A0043B0.3030400@gmail.com>

Hal Rosenstock wrote:
> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham wrote:
>> when setting max_op_vls = 0
>> do not force it to 1.
>> 0 is valid value which means "No change" >> >> Signed-off-by: Doron Shoham >> --- >> opensm/opensm/osm_port.c | 6 ------ >> opensm/opensm/osm_subnet.c | 8 ++++++++ >> 2 files changed, 8 insertions(+), 6 deletions(-) >> >> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >> index 2e6c642..db0c27e 100644 >> --- a/opensm/opensm/osm_port.c >> +++ b/opensm/opensm/osm_port.c >> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >> if (op_vls > p_subn->opt.max_op_vls) >> op_vls = p_subn->opt.max_op_vls; >> >> - if (op_vls == 0) { >> - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >> - "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >> - op_vls = 1; >> - } >> - > > Should that only be done when max_op_vls is 0 ? > > Something like: > if (op_vls > p_subn->opt.max_op_vls) > op_vls = p_subn->opt.max_op_vls; > else if (op_vls == 0) { > OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > op_vls = 1; > } why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? is there a firmware bug that reports op_vls=0? Eli From hal.rosenstock at gmail.com Tue May 5 06:59:17 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 5 May 2009 09:59:17 -0400 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <4A0043B0.3030400@gmail.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> Message-ID: On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire) wrote: > Hal Rosenstock wrote: >> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham wrote: >>> when setting max_op_vls = 0 >>> do not force it to 1. 
>>> 0 is valid value which means "No change" >>> >>> Signed-off-by: Doron Shoham >>> --- >>>  opensm/opensm/osm_port.c   |    6 ------ >>>  opensm/opensm/osm_subnet.c |    8 ++++++++ >>>  2 files changed, 8 insertions(+), 6 deletions(-) >>> >>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >>> index 2e6c642..db0c27e 100644 >>> --- a/opensm/opensm/osm_port.c >>> +++ b/opensm/opensm/osm_port.c >>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >>>        if (op_vls > p_subn->opt.max_op_vls) >>>                op_vls = p_subn->opt.max_op_vls; >>> >>> -       if (op_vls == 0) { >>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>> -               op_vls = 1; >>> -       } >>> - >> >> Should that only be done when max_op_vls is 0 ? >> >> Something like: >>            if (op_vls > p_subn->opt.max_op_vls) >>                 op_vls = p_subn->opt.max_op_vls; >>            else if (op_vls == 0) { >>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>                op_vls = 1; >>           } > > why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? > is there a firmware bug that reports op_vls=0? There were (still are ?) implementations which returned op_vls 0 which is why the words "valid on Set()" were added to the IBA spec and why I don't feel safe removing the code as originally proposed but think my alternative is safe and accomplishes the stated goal. Is there a problem with my alternative proposal ? 
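
The three behaviours under discussion can be compared side by side with a small sketch. This is a simplification, not the real osm_physp_calc_link_op_vls(): it assumes the function reduces to clamping a reported op_vls against the configured max_op_vls, drops the logging, and uses hypothetical function names.

```c
#include <assert.h>
#include <stdint.h>

/* Current code: clamp to max_op_vls, then force any 0 result to 1 (VL0). */
static uint8_t op_vls_current(uint8_t op_vls, uint8_t max_op_vls)
{
	if (op_vls > max_op_vls)
		op_vls = max_op_vls;
	if (op_vls == 0)
		op_vls = 1;
	return op_vls;
}

/* Proposed patch: clamp only, so 0 ("No change") passes through. */
static uint8_t op_vls_patch(uint8_t op_vls, uint8_t max_op_vls)
{
	if (op_vls > max_op_vls)
		op_vls = max_op_vls;
	return op_vls;
}

/*
 * Alternative: a 0 that comes from the max_op_vls clamp is kept,
 * while a 0 that was not produced by the clamp is still forced to 1.
 */
static uint8_t op_vls_alternative(uint8_t op_vls, uint8_t max_op_vls)
{
	if (op_vls > max_op_vls)
		op_vls = max_op_vls;
	else if (op_vls == 0)
		op_vls = 1;
	return op_vls;
}
```

With max_op_vls = 0 and a reported value of 4, the current code yields 1 while both the patch and the alternative yield 0; with a reported value of 0 and max_op_vls = 5, only the patch returns 0.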
-- Hal > Eli > > > From dorfman.eli at gmail.com Tue May 5 07:45:09 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 05 May 2009 17:45:09 +0300 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> Message-ID: <4A0050F5.2010208@gmail.com> Hal Rosenstock wrote: > On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire) > wrote: >> Hal Rosenstock wrote: >>> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham wrote: >>>> when setting max_op_vls = 0 >>>> do not force it to 1. >>>> 0 is valid value which means "No change" >>>> >>>> Signed-off-by: Doron Shoham >>>> --- >>>> opensm/opensm/osm_port.c | 6 ------ >>>> opensm/opensm/osm_subnet.c | 8 ++++++++ >>>> 2 files changed, 8 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >>>> index 2e6c642..db0c27e 100644 >>>> --- a/opensm/opensm/osm_port.c >>>> +++ b/opensm/opensm/osm_port.c >>>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >>>> if (op_vls > p_subn->opt.max_op_vls) >>>> op_vls = p_subn->opt.max_op_vls; >>>> >>>> - if (op_vls == 0) { >>>> - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>>> - "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>>> - op_vls = 1; >>>> - } >>>> - >>> Should that only be done when max_op_vls is 0 ? >>> >>> Something like: >>> if (op_vls > p_subn->opt.max_op_vls) >>> op_vls = p_subn->opt.max_op_vls; >>> else if (op_vls == 0) { >>> OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>> "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>> op_vls = 1; >>> } >> why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? > >> is there a firmware bug that reports op_vls=0? > > There were (still are ?) 
implementations which returned op_vls 0 which > is why the words "valid on Set()" were added to the IBA spec and why I > don't feel safe removing the code as originally proposed but think my > alternative is safe and accomplishes the stated goal. Is there a > problem with my alternative proposal ? no, but there are other fields in portinfo that are not validated. for example link_speed_enabled (which allows 0 value only on Set as well). also if a node returns op_vl=0 how do you know it supports op_vl=1? Eli From jon at opengridcomputing.com Tue May 5 08:06:36 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 5 May 2009 10:06:36 -0500 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: <200905051021.36725.jackm@dev.mellanox.co.il> References: <20090504145641.GA19565@opengridcomputing.com> <200905051021.36725.jackm@dev.mellanox.co.il> Message-ID: <20090505150635.GA30788@opengridcomputing.com> On Tue, May 05, 2009 at 10:21:36AM +0300, Jack Morgenstein wrote: > On Monday 04 May 2009 17:56, Jon Mason wrote: > > What's even worse is that sg_init_table is already defined in the > > RHEL5.3 headers.  When coding up a header cleanup patch for RHEL5.3, I > > noticed it was already defined in linux/ncrypto.h.  Also, it's there for > > RHEL5.2 (and a few older kernels). > > > I do not see that as "worse". ncrypto is the cryptographic scatterlist API, which is not used anywhere in OFED. > Do we include this only because of its base scatterlist additions? No, we currently duplicate all the scatterlist functionality. Including ncrypto.h would greatly simplify the backport headers, but it is a RHEL5.2/5.3 only solution. If this change is needed for all other backports, then a better solution will be needed. > ncrypto.h itself has a list of includes. 
> > I guess, though, you could do the following for scatterlist.h in the RHEL5.3 backport: > ============================================================================== > #ifndef __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__ > #define __BACKPORT_LINUX_SCATTERLIST_H_TO_RHEL5_3__ > > /* crypto.h includes scatterlist.h */ > #include > > static inline void sg_assign_page(struct scatterlist *sg, struct page *page) > { > sg->page = page; > } > > #define for_each_sg(sglist, sg, nr, __i) \ > for (__i = 0, sg = (sglist); __i < (nr); __i++, sg++) > > static inline struct scatterlist *sg_next(struct scatterlist *sg) > { > if (!sg) { > BUG(); > return NULL; > } > return sg + 1; > } > > #endif > ============================================================================== It is more than just this. By including ncrypto.h, crypto.h and scatterlist.h in the RHEL backports are 99% smaller due to the removal of duplicated functionality. Obviously, this will need to be tested heavily. Thanks, Jon > > linux/ncrypto.h, though, is not part of, say, kernel 2.6.23. Need to check if the above is RedHat-only solution. > > - Jack From hal.rosenstock at gmail.com Tue May 5 11:30:45 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 5 May 2009 14:30:45 -0400 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <4A0050F5.2010208@gmail.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <4A0050F5.2010208@gmail.com> Message-ID: On Tue, May 5, 2009 at 10:45 AM, Eli Dorfman (Voltaire) wrote: > Hal Rosenstock wrote: >> On Tue, May 5, 2009 at 9:48 AM, Eli Dorfman (Voltaire) >> wrote: >>> Hal Rosenstock wrote: >>>> On Tue, May 5, 2009 at 9:00 AM, Doron Shoham wrote: >>>>> when setting max_op_vls = 0 >>>>> do not force it to 1. 
>>>>> 0 is valid value which means "No change" >>>>> >>>>> Signed-off-by: Doron Shoham >>>>> --- >>>>>  opensm/opensm/osm_port.c   |    6 ------ >>>>>  opensm/opensm/osm_subnet.c |    8 ++++++++ >>>>>  2 files changed, 8 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >>>>> index 2e6c642..db0c27e 100644 >>>>> --- a/opensm/opensm/osm_port.c >>>>> +++ b/opensm/opensm/osm_port.c >>>>> @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >>>>>        if (op_vls > p_subn->opt.max_op_vls) >>>>>                op_vls = p_subn->opt.max_op_vls; >>>>> >>>>> -       if (op_vls == 0) { >>>>> -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>>>> -                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>>>> -               op_vls = 1; >>>>> -       } >>>>> - >>>> Should that only be done when max_op_vls is 0 ? >>>> >>>> Something like: >>>>            if (op_vls > p_subn->opt.max_op_vls) >>>>                 op_vls = p_subn->opt.max_op_vls; >>>>            else if (op_vls == 0) { >>>>                OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >>>>                        "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >>>>                op_vls = 1; >>>>           } >>> why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? >> >>> is there a firmware bug that reports op_vls=0? >> >> There were (still are ?) implementations which returned op_vls 0 which >> is why the words "valid on Set()" were added to the IBA spec and why I >> don't feel safe removing the code as originally proposed but think my >> alternative is safe and accomplishes the stated goal. Is there a >> problem with my alternative proposal ? > > no, but there are other fields in portinfo that are not validated. Yes, there's some inconsistency here but it's based on field experience. > for example link_speed_enabled (which allows 0 value only on Set as well). 
Yes, but this field had the specific issue I noted and 0 being returned on get was never observed on any of the other fields where 0 is valid on set (added there as well). > also if a node returns op_vl=0 how do you know it supports op_vl=1? op_vls 1 is always safe as at least 1 data VL must be supported. It's just that possibly more op vls could have been supported if things had been compliant. -- Hal > Eli From sashak at voltaire.com Tue May 5 12:05:46 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 22:05:46 +0300 Subject: [ofa-general] Re: Issues with combined routing in smpquery In-Reply-To: <20090429160438.db62cde1.weiny2@llnl.gov> References: <20090428202736.0ff049e5.weiny2@llnl.gov> <20090428205525.4ffdd778.weiny2@llnl.gov> <20090429145355.704fb2f5.weiny2@llnl.gov> <20090429160438.db62cde1.weiny2@llnl.gov> Message-ID: <20090505190546.GA31846@sashak.voltaire.com> Hi Ira, On 16:04 Wed 29 Apr , Ira Weiny wrote: > > I know what changed but there appears to be a discrepancy between ib_mad_f > and the spec. > > Commit 2dbb8b95d9dc27423a6fdb85d88ef385ecee0005 > "libibmad: remove c99 definitions within the ib_mad_f structure" > removed the designated initializers from ib_mad_f. Appling the patch below > aligns the MAD_FIELDS with ib_mad_f. Thanks for looking into this. > However, if you look at the offsets specified in ib_mad_f they are wrong. > According to 14.2.1.2, DrSLID is at offset 32 bytes (256 bits). ib_mad_f > places the offset at 272. I have verified the bytes using a debugger and byte > 32 is the DrSLID. I hesitate to say there is a bug in mad_set_field however > there does appear to be something amiss. :-/ I think everything is ok there. 14.2.1.2 says: at offset 32 bytes (256 bits) DrDLID - bits 0-15, DrSLID - bits 16-31. 
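One way to read the two numbering schemes side by side, as a sketch only: assume the four bytes at byte offset 32 are treated as one big-endian 32-bit word whose bits are numbered from the least significant end. Under that assumption (an editorial one, not stated in the thread), bits 16-31 are the first two bytes on the wire and bits 0-15 are the last two. `be32_load` and `word_bits` are hypothetical helpers, not libibmad functions.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes in big-endian (wire) order. */
static uint32_t be32_load(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Extract `len` bits starting at bit `lo`, where bit 0 is the least
 * significant bit of the word (assumed numbering convention). */
static uint32_t word_bits(uint32_t word, unsigned lo, unsigned len)
{
    uint32_t mask = (len >= 32) ? 0xffffffffu : ((1u << len) - 1u);
    return (word >> lo) & mask;
}
```

With wire bytes 12 34 ab cd standing in for byte offsets 32-35, bits 16-31 of the word hold 0x1234 (bytes 32-33) and bits 0-15 hold 0xabcd (bytes 34-35), so a field described as "bits 16-31" can still begin at byte 32.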
Sasha From sashak at voltaire.com Tue May 5 12:08:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 22:08:10 +0300 Subject: [ofa-general] Re: [PATCH 1/3] Fix reversal of DRSLID and DRDLID in MAD_FIELDS enum In-Reply-To: <20090430142950.85ef6368.weiny2@llnl.gov> References: <20090430142950.85ef6368.weiny2@llnl.gov> Message-ID: <20090505190810.GB31846@sashak.voltaire.com> On 14:29 Thu 30 Apr , Ira Weiny wrote: > From: Ira Weiny > Date: Thu, 30 Apr 2009 11:19:26 -0700 > Subject: [PATCH] Fix reversal of DRSLID and DRDLID in MAD_FIELDS enum. > > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From devel-ofed at morey-chaisemartin.com Tue May 5 12:33:38 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Tue, 05 May 2009 21:33:38 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size of the hop table In-Reply-To: <4A003E3F.2010100@Sun.COM> References: <4A003E3F.2010100@Sun.COM> Message-ID: <4A009492.2060408@morey-chaisemartin.com> Le 05/05/2009 15:25, Line.Holen at Sun.COM a écrit : > The hops table of ftree_sw_t is too small to hold the hop count > of max_lid. Changed sw_create() to allocate hops[max_lid+1] > not hops[max_lid]. > > Signed-off-by: Line Holen This patch seems right to me (at least agrees with other checks). However, I've been using the ftree algorithm without this fix in thousands of tests and never had any seg fault problem and valgrind showed nothing either... Would it be possible that the actual value is always < max_lid_ho ? 
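The sizing point in the quoted patch description can be illustrated with a small sketch. A table indexed directly by LID can only address LID max_lid if it has max_lid + 1 entries. `hops_table_create` and `HOPS_UNSET` are hypothetical names, and using 0xff as the "not set" marker is an assumption for this sketch; this is not the sw_create() code itself.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HOPS_UNSET 0xff /* assumed "no hop count yet" marker */

/* Allocate a hop table that can be indexed by every LID in 0..max_lid.
 * Returns NULL on allocation failure; the caller frees the result. */
static uint8_t *hops_table_create(uint16_t max_lid)
{
    size_t entries = (size_t)max_lid + 1; /* +1 so hops[max_lid] is in bounds */
    uint8_t *hops = malloc(entries);

    if (hops)
        memset(hops, HOPS_UNSET, entries);
    return hops;
}
```

An out-of-bounds read one element past a heap block does not always crash, so a table that is one entry short can behave incorrectly without producing a segmentation fault.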
Nicolas From sashak at voltaire.com Tue May 5 13:15:39 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:15:39 +0300 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090430142958.5811218f.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> Message-ID: <20090505201539.GC31846@sashak.voltaire.com> On 14:29 Thu 30 Apr , Ira Weiny wrote: > > static int > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) > { > - int rc = add_port_to_dpath(path, nextport); > - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > - f->fabric.maxhops_discovered = path->cnt; > + int rc = 0; > + > + if (portid->lid && !portid->drpath.drslid) { > + /* If we were LID routed > + * AND have not done so already > + * we need to set up the drslid > + */ > + ib_portid_t selfportid = { 0 }; > + if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) > + return -1; > + portid->drpath.drslid = selfportid.lid; > + portid->drpath.drdlid = 0xFFFF; How does it work? Shouldn't be portid->drpath.drslid = portid->lid? What am I missing? 
Sasha > + } > + > + rc = add_port_to_dpath(&portid->drpath, nextport); > + > + if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) > + f->fabric.maxhops_discovered = portid->drpath.cnt; > return (rc); > } > > @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ > != IB_PORT_PHYS_STATE_LINKUP) > return -1; > > - if (extend_dpath(fabric, &path->drpath, portnum) < 0) > + if (extend_dpath(fabric, path, portnum) < 0) > return -1; > > if (query_node(fabric, &node_buf, &port_buf, path)) { > @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, > if (!port) > IBPANIC("out of memory"); > > - if (node->node.type != IB_NODE_SWITCH && > - get_remote_node(fabric, node, port, from, > + if(get_remote_node(fabric, node, port, from, > mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), > 0) < 0) > return ((ibnd_fabric_t *)fabric); > -- > 1.5.4.5 > From sashak at voltaire.com Tue May 5 13:16:59 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:16:59 +0300 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Remove some underbars from internal names In-Reply-To: <20090501214724.GA30974@comcast.net> References: <20090501214724.GA30974@comcast.net> Message-ID: <20090505201659.GD31846@sashak.voltaire.com> On 17:47 Fri 01 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue May 5 13:18:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:18:26 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags: Added libibnetdiscover to .spec file In-Reply-To: <49FECA41.7060200@ext.bull.net> References: <49FECA41.7060200@ext.bull.net> Message-ID: <20090505201826.GE31846@sashak.voltaire.com> On 12:58 Mon 04 May , Nicolas Morey-Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin Applied. Thanks. 
Sasha From sashak at voltaire.com Tue May 5 13:19:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:19:32 +0300 Subject: [ofa-general] Patch for libvendor incompatibility with QLogic SM In-Reply-To: References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624662@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624663@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624665@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5B7@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB43E870D5C2@MNEXMB1.qlogic.org> Message-ID: <20090505201932.GF31846@sashak.voltaire.com> On 09:42 Mon 04 May , Hal Rosenstock wrote: > > I would expect OFED 1.5 to be based off the current master Yes, it will be based on the current master. Sasha From sashak at voltaire.com Tue May 5 13:21:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:21:33 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover.c: Cosmetic formatting changes In-Reply-To: <20090504191732.GA29650@comcast.net> References: <20090504191732.GA29650@comcast.net> Message-ID: <20090505202133.GG31846@sashak.voltaire.com> On 15:17 Mon 04 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue May 5 13:24:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 5 May 2009 23:24:21 +0300 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr DB: Remove leading underscores from internal names In-Reply-To: <20090504200018.GA4590@comcast.net> References: <20090504200018.GA4590@comcast.net> Message-ID: <20090505202421.GH31846@sashak.voltaire.com> On 16:00 Mon 04 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From Line.Holen at Sun.COM Tue May 5 13:26:36 2009 From: Line.Holen at Sun.COM (Line.Holen at Sun.COM) Date: Tue, 05 May 2009 22:26:36 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c Increase the size of the hop table In-Reply-To: <4A009492.2060408@morey-chaisemartin.com> References: <4A003E3F.2010100@Sun.COM> <4A009492.2060408@morey-chaisemartin.com> Message-ID: <4A00A0FC.6050606@Sun.COM> On 05/ 5/09 09:33 PM, Nicolas Morey-Chaisemartin wrote: > Le 05/05/2009 15:25, Line.Holen at Sun.COM a écrit : >> The hops table of ftree_sw_t is too small to hold the hop count >> of max_lid. Changed sw_create() to allocate hops[max_lid+1] >> not hops[max_lid]. >> >> Signed-off-by: Line Holen > > > This patch seems right to me (at least agrees with other checks). > However, I've been using the ftree algorithm without this fix in thousands of tests and never had any seg fault problem and valgrind showed nothing either... > Would it be possible that the actual value is always < max_lid_ho ? > > Nicolas > I haven't experienced any seg fault either. But I have seen lack of connectivity to the node having lid = max_lid. This was because hop[max_lid] contained a value of 0 rather than 0xff (for some of the switches) which made the routing stop too early. Line From jsquyres at cisco.com Tue May 5 13:57:09 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 5 May 2009 16:57:09 -0400 Subject: [ofa-general] Memory registration redux Message-ID: Roland and I chatted on the phone today; I think I now understand Roland's counter-proposal (I clearly didn't before). Let me try to summarize: 1. Add a new verb for "set this userspace flag to 1 if mr X ever becomes invalid" 2. Add a new verb for "no longer tell me if mr X ever becomes invalid" (i.e., remove the effects of #1) 3. Add run-time query indicating whether #1 works 4. 
Add [optional] memory registration caching to libibverbs Prior to talking to Roland, I had envisioned *one* flag in userspace that indicated whether any memory registrations had become invalid. Roland's idea is that there is one flag *per registration* -- you can instantly tell whether a specific registration is valid. Given this, let's keep the discussion going here in email -- perhaps the teleconference next Monday may become moot. --------------------------------------------- More detail... Here's a sample scenario: - userspace registers memory buffer A - userspace adds this registration to its cache (note: the cache could be in libibverbs; more on this below) - userspace calls a [new] verb that says "tell me if mr X ever becomes invalid" and passes a pointer to a flag *in this registration's entry in the cache* - userspace leaves the memory buffer A registered/cached Some scenarios after the above has run: 1. Userspace uses buffer A again - userspace looks up and finds A's cached registration - userspace sees that this registration's flag is still 0, and therefore can proceed with communication 2. Application frees buffer A and it is returned to the OS (e.g, munmap) - IOMMU fires - change userspace flag corresponding to this registration to 1 - memory is unregistered - pages are returned 3. Userspace uses buffer A again (after #2) - userspace looks up and finds A's cached registration - userspace sees that this registration's flag is 1 - userspace therefore registers this memory again, and re-calls the verb saying "tell me if mr X ever becomes invalid" (etc.) - userspace proceeds with communication The kernel has to store a little extra state for each registration (the address of the userspace flag to tweak if the registration ever becomes invalid), but it's small and bounded by the number of active registrations. 
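The three scenarios above can be sketched in user space as follows. Nothing here calls a real verb: `entry_register` stands in for ibv_reg_mr() plus the proposed "tell me if mr X ever becomes invalid" call, and `entry_kernel_invalidate` stands in for the kernel flipping the flag. All names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* One cached registration. `invalid` is the per-registration flag that the
 * proposed verb would let the kernel set to 1. */
struct reg_entry {
    void *addr;
    size_t len;
    volatile int invalid;
    int reg_count;          /* number of (re)registrations, for illustration */
};

/* Stand-in for registering memory and arming the notification flag. */
static void entry_register(struct reg_entry *e, void *addr, size_t len)
{
    e->addr = addr;
    e->len = len;
    e->invalid = 0;
    e->reg_count++;
}

/* Stand-in for the kernel side of scenario 2: the buffer went back to
 * the OS, so the flag for this registration becomes 1. */
static void entry_kernel_invalidate(struct reg_entry *e)
{
    e->invalid = 1;
}

/* Called before communication. Returns 0 when the cached registration is
 * reused (scenario 1) and 1 when a new registration was needed (first use,
 * or scenario 3 after an invalidation). */
static int entry_prepare(struct reg_entry *e, void *addr, size_t len)
{
    if (e->addr == addr && e->len == len && !e->invalid)
        return 0;
    entry_register(e, addr, len);
    return 1;
}
```

The check in `entry_prepare` is a plain memory read of the flag, which is the point of having one flag per registration: the fast path needs no system call.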
From MPI's perspective, this feature would be a great step forward -- if we can query verbs at run-time to see if this feature is active, we can stop using the memory allocation hooks (yay!). Obviously, MPIs will need to carry the old memory allocation hooks for backwards compatibility for a while, but if we can effectively deprecate them, that would be great. **Specifically: it's the memory allocation hooks code in MPI implementations that is "fragile", "brittle", etc. Avoiding the issue would be great; the code becomes much more robust because we're not subverting the memory allocator. A secondary feature would be to add memory registration caching to libibverbs. This wouldn't be *required* for MPIs since we all have registration caches already, but it might be nice to deprecate/eventually remove that code in an MPI implementation, too. The use case is similar to what was proposed earlier: add a flag to ibv_reg_mr() indicating whether you want the registration cached or not. If the registration is to be cached, libibverbs would also invoke the "tell me if this mr ever becomes invalid" functionality. The MPI/application then *always* calls ibv_reg_mr() to register memory -- if the cache in libibverbs finds a valid matching mr, it can just return without a syscall. As also described previously, calls to ibv_dereg_mr() do not necessarily need to actually unregister -- they can just mark a cached registration as "able to be evicted if necessary." The other new verbs discussed in my prior mail would also still be useful (ibv_is_reg(), ibv_reg_mr_limits(), ibv_reg_mr_clean()). **Note: the registration caches in MPIs today are not necessarily that complicated. They're essentially balanced trees (e.g., in OMPI, it's a red-black tree). This is not the "fragile", "brittle" code -- it's just data structures and accounting.
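A minimal sketch of the caching behaviour described above, under stated assumptions: a small fixed array replaces the balanced tree, `cached_reg` and `cached_dereg` are hypothetical stand-ins for a caching ibv_reg_mr()/ibv_dereg_mr() pair, and `real_registrations` merely counts the simulated trips into the kernel. No libibverbs API with this behaviour is implied to exist.

```c
#include <stddef.h>

#define CACHE_SLOTS 4       /* arbitrary size for the sketch */

struct cached_mr {
    void *addr;
    size_t len;
    int valid;              /* slot holds a registration */
    int in_use;             /* 0 means "able to be evicted if necessary" */
};

static struct cached_mr cache[CACHE_SLOTS];
static int real_registrations;  /* simulated syscall count */

/* Return a registration for [addr, addr + len). A cache hit costs no
 * simulated syscall; a miss takes a free or evictable slot. Returns NULL
 * when every slot is held by an active user. */
static struct cached_mr *cached_reg(void *addr, size_t len)
{
    struct cached_mr *victim = NULL;
    int i;

    for (i = 0; i < CACHE_SLOTS; i++) {
        struct cached_mr *m = &cache[i];

        if (m->valid && m->addr == addr && m->len == len) {
            m->in_use = 1;
            return m;
        }
        if (!victim && (!m->valid || !m->in_use))
            victim = m;
    }
    if (!victim)
        return NULL;
    victim->addr = addr;
    victim->len = len;
    victim->valid = 1;
    victim->in_use = 1;
    real_registrations++;
    return victim;
}

/* Deregistration only marks the slot as evictable; the entry stays cached. */
static void cached_dereg(struct cached_mr *m)
{
    m->in_use = 0;
}
```

A real cache would also have to honour the per-registration invalid flag before treating a lookup as a hit; that check is left out here to keep the eviction logic visible.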
================================= I refrained from a specific new API proposal; let's argue over these ideas first and see if we can come to consensus. If so, specific API proposals can follow. -- Jeff Squyres Cisco Systems From weiny2 at llnl.gov Tue May 5 14:19:40 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 5 May 2009 14:19:40 -0700 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090505201539.GC31846@sashak.voltaire.com> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090505201539.GC31846@sashak.voltaire.com> Message-ID: <20090505141940.2f2d57e3.weiny2@llnl.gov> On Tue, 5 May 2009 23:15:39 +0300 Sasha Khapyorsky wrote: > On 14:29 Thu 30 Apr , Ira Weiny wrote: > > > > static int > > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) > > { > > - int rc = add_port_to_dpath(path, nextport); > > - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > > - f->fabric.maxhops_discovered = path->cnt; > > + int rc = 0; > > + > > + if (portid->lid && !portid->drpath.drslid) { > > + /* If we were LID routed > > + * AND have not done so already > > + * we need to set up the drslid > > + */ > > + ib_portid_t selfportid = { 0 }; > > + if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) > > + return -1; > > + portid->drpath.drslid = selfportid.lid; > > + portid->drpath.drdlid = 0xFFFF; > > How does it work? Shouldn't be portid->drpath.drslid = portid->lid? What > am I missing? Using a combined route where we are starting at some remote node. We have to use a directed route which does not start at "our" requester node. From the spec. C14-6 "bullet 6" states: "... If the directed route does not start from the requester node, then DrSLID shall be set to the LID of the requester node, which must have been assigned." The requester node is "self" in this case. 
If the DRSLID was set to the portid->lid then the response would not come back to us because portid->lid is the LID of the remote node we are starting the DR Path at. Ira > > Sasha > > > + } > > + > > + rc = add_port_to_dpath(&portid->drpath, nextport); > > + > > + if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) > > + f->fabric.maxhops_discovered = portid->drpath.cnt; > > return (rc); > > } > > > > @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ > > != IB_PORT_PHYS_STATE_LINKUP) > > return -1; > > > > - if (extend_dpath(fabric, &path->drpath, portnum) < 0) > > + if (extend_dpath(fabric, path, portnum) < 0) > > return -1; > > > > if (query_node(fabric, &node_buf, &port_buf, path)) { > > @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, > > if (!port) > > IBPANIC("out of memory"); > > > > - if (node->node.type != IB_NODE_SWITCH && > > - get_remote_node(fabric, node, port, from, > > + if(get_remote_node(fabric, node, port, from, > > mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), > > 0) < 0) > > return ((ibnd_fabric_t *)fabric); > > -- > > 1.5.4.5 > > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Tue May 5 14:21:06 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 5 May 2009 14:21:06 -0700 Subject: [ofa-general] Re: Issues with combined routing in smpquery In-Reply-To: <20090505190546.GA31846@sashak.voltaire.com> References: <20090428202736.0ff049e5.weiny2@llnl.gov> <20090428205525.4ffdd778.weiny2@llnl.gov> <20090429145355.704fb2f5.weiny2@llnl.gov> <20090429160438.db62cde1.weiny2@llnl.gov> <20090505190546.GA31846@sashak.voltaire.com> Message-ID: <20090505142106.99d96c01.weiny2@llnl.gov> On Tue, 5 May 2009 22:05:46 +0300 Sasha Khapyorsky wrote: > Hi Ira, > > On 16:04 Wed 29 Apr , Ira Weiny wrote: > > > > I know what changed but there appears to be a discrepancy between 
ib_mad_f > > and the spec. > > > > Commit 2dbb8b95d9dc27423a6fdb85d88ef385ecee0005 > > "libibmad: remove c99 definitions within the ib_mad_f structure" > > removed the designated initializers from ib_mad_f. Appling the patch below > > aligns the MAD_FIELDS with ib_mad_f. > > Thanks for looking into this. > > > However, if you look at the offsets specified in ib_mad_f they are wrong. > > According to 14.2.1.2, DrSLID is at offset 32 bytes (256 bits). ib_mad_f > > places the offset at 272. I have verified the bytes using a debugger and byte > > 32 is the DrSLID. I hesitate to say there is a bug in mad_set_field however > > there does appear to be something amiss. :-/ > > I think everything is ok there. 14.2.1.2 says: at offset 32 bytes (256 > bits) DrDLID - bits 0-15, DrSLID - bits 16-31. Ah, ok, I see now. I mixed up my bits... ;-) Ira > > Sasha -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From jackm at dev.mellanox.co.il Tue May 5 22:30:55 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 6 May 2009 08:30:55 +0300 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: <20090505150635.GA30788@opengridcomputing.com> References: <200905051021.36725.jackm@dev.mellanox.co.il> <20090505150635.GA30788@opengridcomputing.com> Message-ID: <200905060830.56062.jackm@dev.mellanox.co.il> On Tuesday 05 May 2009 18:06, Jon Mason wrote: > No, we currently duplicate all the scatterlist functionality.  Including > ncrypto.h would greatly simplify the backport headers, but it is a > RHEL5.2/5.3 only solution.  If this change is needed for all other > backports, then a better solution will be needed. > Each backport has its OWN directory. The backports are not identical for all kernels. There is absolutely no problem with handling backports per kernel/per distribution. Therefore, the RHEL 5.2/5.3 solution can be used for those backports alone, without affecting any of the others. 
Other backports will have a different change. For RHEL5.2/5.3, my concern is that if someone will actually write an ncrypto kernel application, and include ncrypto.h along with the infiniband headers, there will be compilation problems because the scatterlist functionality fixes will appear twice. Specifically, OFED 1.4.1 has the following INDIVIDUAL/independent backports, and each one is handled differently: 2.6.16 2.6.16_sles10 2.6.16_sles10_sp1 2.6.16_sles10_sp2 2.6.17 2.6.18 2.6.18-EL5.1 2.6.18-EL5.2 2.6.18-EL5.3 2.6.18_FC6 (also for EL5.0) 2.6.18_suse10_2 2.6.19 2.6.20 2.6.21 2.6.22 2.6.22_suse10_3 2.6.23 2.6.24 2.6.25 2.6.26 2.6.9_U4 2.6.9_U5 2.6.9_U6 2.6.9_U7 - Jack From sashak at voltaire.com Wed May 6 03:07:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 13:07:44 +0300 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090430142958.5811218f.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> Message-ID: <20090506100744.GB10145@sk> On 14:29 Thu 30 Apr , Ira Weiny wrote: > From: Ira Weiny > Date: Wed, 29 Apr 2009 10:15:55 -0700 > Subject: [PATCH] Add combined routing support to libibnetdisc > > Also allow a scan to start at a switch. 
> > Signed-off-by: Ira Weiny > --- > infiniband-diags/libibnetdisc/src/ibnetdisc.c | 28 ++++++++++++++++++------ > 1 files changed, 21 insertions(+), 7 deletions(-) > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index 0ff5134..fc19633 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) > } > > static int > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) > { > - int rc = add_port_to_dpath(path, nextport); > - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > - f->fabric.maxhops_discovered = path->cnt; > + int rc = 0; > + > + if (portid->lid && !portid->drpath.drslid) { > + /* If we were LID routed > + * AND have not done so already > + * we need to set up the drslid > + */ > + ib_portid_t selfportid = { 0 }; > + if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) > + return -1; And wouldn't it be better instead of resolving selfport on each extend_path() call to keep it already resolved somewhere in fabric structure? 
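The suggestion above, resolving the local port once and keeping the result in the fabric structure, might look roughly like this. It is a sketch under assumptions: `fabric_ctx`, `resolve_self_lid` and `get_self_lid` are hypothetical, the resolver is a stub that stands in for ib_resolve_self_via(), and the LID value it returns is made up.

```c
#include <stdint.h>

/* Hypothetical per-fabric state holding the cached requester LID. */
struct fabric_ctx {
    uint16_t self_lid;
    int self_lid_valid;
    int resolve_calls;      /* counts how often the resolver really ran */
};

/* Stub standing in for the real self-resolution query. */
static int resolve_self_lid(struct fabric_ctx *f, uint16_t *lid)
{
    f->resolve_calls++;
    *lid = 0x0012;          /* made-up LID for the sketch */
    return 0;
}

/* Return the requester LID, resolving it only on first use. */
static int get_self_lid(struct fabric_ctx *f, uint16_t *lid)
{
    if (!f->self_lid_valid) {
        if (resolve_self_lid(f, &f->self_lid) < 0)
            return -1;
        f->self_lid_valid = 1;
    }
    *lid = f->self_lid;
    return 0;
}
```

With this shape, each path extension would read the cached LID for DrSLID instead of issuing a new query per call.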
Sasha > + portid->drpath.drslid = selfportid.lid; > + portid->drpath.drdlid = 0xFFFF; > + } > + > + rc = add_port_to_dpath(&portid->drpath, nextport); > + > + if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) > + f->fabric.maxhops_discovered = portid->drpath.cnt; > return (rc); > } > > @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ > != IB_PORT_PHYS_STATE_LINKUP) > return -1; > > - if (extend_dpath(fabric, &path->drpath, portnum) < 0) > + if (extend_dpath(fabric, path, portnum) < 0) > return -1; > > if (query_node(fabric, &node_buf, &port_buf, path)) { > @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, > if (!port) > IBPANIC("out of memory"); > > - if (node->node.type != IB_NODE_SWITCH && > - get_remote_node(fabric, node, port, from, > + if(get_remote_node(fabric, node, port, from, > mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), > 0) < 0) > return ((ibnd_fabric_t *)fabric); > -- > 1.5.4.5 > From sashak at voltaire.com Wed May 6 03:08:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 13:08:44 +0300 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090505141940.2f2d57e3.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090505201539.GC31846@sashak.voltaire.com> <20090505141940.2f2d57e3.weiny2@llnl.gov> Message-ID: <20090506100844.GC10145@sk> On 14:19 Tue 05 May , Ira Weiny wrote: > > > > How does it work? Shouldn't be portid->drpath.drslid = portid->lid? What > > am I missing? > > Using a combined route where we are starting at some remote node. We have to > use a directed route which does not start at "our" requester node. From the > spec. C14-6 "bullet 6" states: > > "... If the directed route does not start from the requester node, then > DrSLID shall be set to the LID of the requester node, which must have been > assigned." 
> > The requester node is "self" in this case. Makes sense. Sasha From vlad at lists.openfabrics.org Wed May 6 03:21:58 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 6 May 2009 03:21:58 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090506-0200 daily build status Message-ID: <20090506102158.E5ADBE6140F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with 
linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From slavas at voltaire.com Wed May 6 03:24:55 2009 From: slavas at voltaire.com (Slava Strebkov) Date: Wed, 6 May 2009 13:24:55 +0300 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal Message-ID: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> In addition to the original proposal we suggest allocating special MLID for the following MGIDs: 1. FF12401bxxxx000000000000FFFFFFFF - All Nodes 2. FF12401bxxxx00000000000000000001 - All hosts 3. FF12401bffff0000000000000000004d - all Gateways 4. FF12401bxxxx00000000000000000002 - all routers 5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM For all other cases we suggest that same MLID will be assigned to different MGIDs if: 1. They share the same P Key 2. Same signature - for IPoIB only 3. Same LSB bits - bitmask configurable by user (default 10 bits) for example, the following are the same: MGID1: FF12401bABCD000000000000xxxxx755 MGID2: FF12401bABCD000000000000yyyyyB55 Implementation. Since there will be many mgroups shared same mlid, mlid-array entry will contain fleximap holding mgroups. Searching of mgroup will be performed by mlid (index in the array) and mgid - key in the fleximap. Slava Strebkov From sashak at voltaire.com Wed May 6 03:24:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 13:24:38 +0300 Subject: [ofa-general] Re: [PATCH] ib-mgmt: fixup ibsendtrap for windows In-Reply-To: References: Message-ID: <20090506102438.GE10145@sk> On 15:49 Mon 04 May , Sean Hefty wrote: > Fix some typecast issues. > > Signed-off-by: Sean Hefty Applied with change noted below. Thanks. 
> --- > > infiniband-diags/src/ibsendtrap.c | 12 ++++++------ > 1 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c > index 469bc39..7ad588e 100644 > --- a/infiniband-diags/src/ibsendtrap.c > +++ b/infiniband-diags/src/ibsendtrap.c > @@ -66,10 +66,10 @@ static int get_node_type(ib_portid_t *port) > static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) > { > n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; > - n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port)); > + n->g_or_v.generic.prod_type_lsb = cl_hton16((uint16_t) get_node_type(port)); Instead of this casting I converted get_node_type() to return uint16_t. Sasha From sashak at voltaire.com Wed May 6 04:07:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 14:07:19 +0300 Subject: [ofa-general] Re: [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <4A00386E.2050300@voltaire.com> References: <4A00386E.2050300@voltaire.com> Message-ID: <20090506110719.GF10145@sk> Hi Doron, On 16:00 Tue 05 May , Doron Shoham wrote: > when setting max_op_vls = 0 > do not force it to 1. > 0 is valid value which means "No change" > > Signed-off-by: Doron Shoham > --- > opensm/opensm/osm_port.c | 6 ------ > opensm/opensm/osm_subnet.c | 8 ++++++++ > 2 files changed, 8 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..db0c27e 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -380,12 +380,6 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > if (op_vls > p_subn->opt.max_op_vls) > op_vls = p_subn->opt.max_op_vls; > > - if (op_vls == 0) { > - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > - "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > - op_vls = 1; > - } > - I think that originally it was done as workaround for some old and buggy device. 
Personally I don't remember such cases in practice, but maybe Mellanox guys could say more. Yevgeny? Basically if this is not needed anymore I'm fine to remove it (but somehow it was not a direct purpose of the patch). > OSM_LOG_EXIT(p_log); > return op_vls; > } > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index ec15f8a..71fc7a0 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > "# switch port connected to a CA or router port\n" > "leaf_head_of_queue_lifetime 0x%02x\n\n" > "# Limit the maximal operational VLs\n" > + "# Virtual Lanes operational on this port\n" > + "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n" > + "# 0: No change; valid only on Set()\n" > + "# 1: VL0\n" > + "# 2: VL0, VL1\n" > + "# 3: VL0 - VL3\n" > + "# 4: VL0 - VL7\n" > + "# 5: VL0 - VL14\n" > "max_op_vls %u\n\n" Using 'max_op_vls = 0' will enforce PortInfo update (see how osm_physp_calc_link_op_vls() is used in osm_link_mgr.c and osm_lid_mgr.c) with "No change" request, which is obviously not desired. So max_op_vls = 0 case should be handled properly or not permitted. Sasha > "# Force PortInfo:LinkSpeedEnabled on switch ports\n" > "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n" > -- > 1.5.4 > From sashak at voltaire.com Wed May 6 04:21:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 14:21:35 +0300 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> Message-ID: <20090506112135.GG10145@sk> On 09:59 Tue 05 May , Hal Rosenstock wrote: > >> > >> Should that only be done when max_op_vls is 0 ? > >> > >> Something like: > >> if (op_vls > p_subn->opt.max_op_vls) > >> op_vls = p_subn->opt.max_op_vls; > >> 
else if (op_vls == 0) { > >> OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > >> "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > >> op_vls = 1; > >> } > > > > why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? > > > is there a firmware bug that reports op_vls=0? > > There were (still are ?) implementations which returned op_vls 0 which > is why the words "valid on Set()" were added to the IBA spec and why I > don't feel safe removing the code as originally proposed but think my > alternative is safe and accomplishes the stated goal. Is there a > problem with my alternative proposal ? Assuming that all this was done as workaround for buggy OperVLs report its relevance shouldn't be a function of max_op_vls configuration value. I see two independent issues here: (1) removing (or keeping) zero OperVLs report workaround and (2) support and proper handling max_op_vls = 0 configuration value. Sasha From hal.rosenstock at gmail.com Wed May 6 04:29:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 6 May 2009 07:29:37 -0400 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <20090506112135.GG10145@sk> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> Message-ID: On Wed, May 6, 2009 at 7:21 AM, Sasha Khapyorsky wrote: > On 09:59 Tue 05 May , Hal Rosenstock wrote: >> >> >> >> Should that only be done when max_op_vls is 0 ? >> >> >> >> Something like: >> >> if (op_vls > p_subn->opt.max_op_vls) >> >> op_vls = p_subn->opt.max_op_vls; >> >> else if (op_vls == 0) { >> >> OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >> >> "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); >> >> op_vls = 1; >> >> 
} >> > >> > why do you suggest a special case for op_vls=0 (and not for other portinfo fields)? >> >> > is there a firmware bug that reports op_vls=0? >> >> There were (still are ?) implementations which returned op_vls 0 which >> is why the words "valid on Set()" were added to the IBA spec and why I >> don't feel safe removing the code as originally proposed but think my >> alternative is safe and accomplishes the stated goal. Is there a >> problem with my alternative proposal ? > > Assuming that all this was done as workaround for buggy OperVLs report > its relevance shouldn't be a function of max_op_vls configuration value. > > I see two independent issues here: (1) removing (or keeping) zero > OperVLs report workaround and (2) support and proper handling > max_op_vls = 0 configuration value. Agreed and IMO (as I stated in previous emails) the workaround should be kept as I don't think there is a way of knowing for sure that those non compliant implementations are not in the field anymore. If the push is to remove this, then maybe another option for this workaround should be added with the default being to have the workaround off. 
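[Editorial sketch, not part of the original message.] The two independent issues named above (the zero-OperVLs workaround, and handling of max_op_vls = 0) could be kept separate roughly as follows. This is only a minimal sketch under stated assumptions: the function name and the fix_zero_op_vls flag are hypothetical and are not existing OpenSM identifiers, and treating max_op_vls = 0 as "no cap configured" is just one possible way to handle that value; rejecting it at configuration time would be another.

```c
#include <stdint.h>

/*
 * Hypothetical helper, NOT the OpenSM osm_physp_calc_link_op_vls().
 *
 * reported_op_vls: OperVLs encoding computed for the link (0..5)
 * max_op_vls:      configured cap; ASSUMPTION: 0 means "no cap"
 * fix_zero_op_vls: hypothetical option; nonzero enables the legacy
 *                  workaround for devices that report OperVLs = 0
 */
uint8_t calc_link_op_vls_sketch(uint8_t reported_op_vls,
				uint8_t max_op_vls,
				int fix_zero_op_vls)
{
	uint8_t op_vls = reported_op_vls;

	/* Issue (1): workaround for a bogus OperVLs = 0 report,
	 * independent of the max_op_vls configuration value. */
	if (op_vls == 0 && fix_zero_op_vls)
		op_vls = 1;	/* VL0 only */

	/* Issue (2): apply the cap only when one is really configured,
	 * so max_op_vls = 0 never turns into a "No change" Set(). */
	if (max_op_vls != 0 && op_vls > max_op_vls)
		op_vls = max_op_vls;

	return op_vls;
}
```

With the flag off (the default suggested above), a reported value of 0 is passed through unchanged; with it on, 0 is corrected to 1 regardless of the configured cap.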
-- Hal > > Sasha > From hnrose at comcast.net Wed May 6 04:35:52 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 6 May 2009 07:35:52 -0400 Subject: [ofa-general] [PATCH] opensm/osm_perfmgr_db.c: Remove unneeded initialization in perfmgr_db_print_by_name Message-ID: <20090506113552.GA32102@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c index b0bfd36..3034894 100644 --- a/opensm/opensm/osm_perfmgr_db.c +++ b/opensm/opensm/osm_perfmgr_db.c @@ -693,8 +693,8 @@ static void db_dump(cl_map_item_t * const p_map_item, void *context) void perfmgr_db_print_by_name(perfmgr_db_t * db, char *nodename, FILE *fp) { - cl_map_item_t *item = NULL; - db_node_t *node = NULL; + cl_map_item_t *item; + db_node_t *node; cl_plock_acquire(&db->lock); From sashak at voltaire.com Wed May 6 04:46:43 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 14:46:43 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c Increase the size of the hop table In-Reply-To: <4A003E3F.2010100@Sun.COM> References: <4A003E3F.2010100@Sun.COM> Message-ID: <20090506114643.GI10145@sk> On 15:25 Tue 05 May , Line.Holen at Sun.COM wrote: > The hops table of ftree_sw_t is too small to hold the hop count > of max_lid. Changed sw_create() to allocate hops[max_lid+1] > not hops[max_lid]. > > Signed-off-by: Line Holen Applied. Thanks. Sasha From sashak at voltaire.com Wed May 6 04:51:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 14:51:06 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr_db.c: Remove unneeded initialization in perfmgr_db_print_by_name In-Reply-To: <20090506113552.GA32102@comcast.net> References: <20090506113552.GA32102@comcast.net> Message-ID: <20090506115106.GJ10145@sk> On 07:35 Wed 06 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Wed May 6 05:29:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 15:29:52 +0300 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090501165334.59bf72a9.weiny2@llnl.gov> References: <20090422185441.6f8601dc.weiny2@llnl.gov> <20090425175710.GI28604@sk> <20090427150409.9c10e479.weiny2@llnl.gov> <20090501173806.GF14714@sk.iol.unh.edu> <20090501165334.59bf72a9.weiny2@llnl.gov> Message-ID: <20090506122952.GA28975@sk> On 16:53 Fri 01 May , Ira Weiny wrote: > > I did not attempt to preserve any switch or HCA order printing. I don't know > of any utils which require this. Am I wrong? I don't think that some utils could need it (unless there are bugs), I just diff-ed old and new outputs. Sasha From tziporet at dev.mellanox.co.il Wed May 6 07:09:00 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 06 May 2009 17:09:00 +0300 Subject: [ofa-general] Memory registration redux In-Reply-To: References: Message-ID: <4A0199FC.4000006@mellanox.co.il> Jeff Squyres wrote: > Roland and I chatted on the phone today; I think I now understand > Roland's counter-proposal (I clearly didn't before). Let me try to > summarize: > > 1. Add a new verb for "set this userspace flag to 1 if mr X ever > becomes invalid" > 2. Add a new verb for "no longer tell me if mr X ever becomes invalid" > (i.e., remove the effects of #1) > 3. Add run-time query indicating whether #1 works > 4. Add [optional] memory registration caching to libibverbs > > Prior to talking to Roland, I had envisioned *one* flag in userspace > that indicated whether any memory registrations had become invalid. > Roland's idea is that there is one flag *per registration* -- you can > instantly tell whether a specific registration is valid. > > Given this, let's keep the discussion going here in email -- perhaps > the teleconference next Monday may become moot. 
I think the new proposal is good (but I am not an MPI expert). If we implement it soon we will be able to enable it in OFED 1.5 too. I think the cache in libibverbs can be delayed, since it can be added after the API with the kernel is available. Tziporet From hnrose at comcast.net Wed May 6 07:13:26 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 6 May 2009 10:13:26 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Cosmetic changes Message-ID: <20090506141326.GA29542@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index e6a1cfe..855a2ff 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -96,7 +96,7 @@ typedef struct redir { ib_net32_t redir_qp; } redir_t; -/* Node to store information about which nodes we are monitoring */ +/* Node to store information about nodes being monitored */ typedef struct monitored_node { cl_map_item_t map_item; struct monitored_node *next; @@ -108,6 +108,7 @@ typedef struct monitored_node { } monitored_node_t; struct osm_opensm; + /****s* OpenSM: PerfMgr/osm_perfmgr_t * This object should be treated as opaque and should * be manipulated only through the provided functions. 
@@ -130,9 +131,9 @@ typedef struct osm_perfmgr { uint16_t sweep_time_s; perfmgr_db_t *db; atomic32_t outstanding_queries; /* this along with sig_query */ - cl_event_t sig_query; /* will throttle our querys */ + cl_event_t sig_query; /* will throttle our queries */ uint32_t max_outstanding_queries; - cl_qmap_t monitored_map; /* map the nodes we are tracking */ + cl_qmap_t monitored_map; /* map the nodes being tracked */ monitored_node_t *remove_list; } osm_perfmgr_t; /* diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 93644a0..ecfdbda 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -47,7 +47,6 @@ #endif /* HAVE_CONFIG_H */ #ifdef ENABLE_OSM_PERF_MGR - #include #include #include @@ -66,7 +65,7 @@ #include #include -#define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe +#define PERFMGR_INITIAL_TID_VALUE 0xcafe #if ENABLE_OSM_PERF_MGR_PROFILE struct { @@ -114,8 +113,6 @@ static inline void diff_time(struct timeval *before, struct timeval *after, } #endif -extern int wait_for_pending_transactions(osm_stats_t * stats); - /********************************************************************** * Internal helper functions. 
**********************************************************************/ @@ -200,8 +197,9 @@ static void perfmgr_mad_send_err_callback(void *bind_context, OSM_LOG_ENTER(pm->log); - /* go ahead and get the monitored node struct to have the printable - * name if needed in messages + /* + * get the monitored node struct to have the printable name + * for log messages */ if ((p_node = cl_qmap_get(&pm->monitored_map, node_guid)) == cl_qmap_end(&pm->monitored_map)) { @@ -290,7 +288,7 @@ Exit: /********************************************************************** * Unbind the PerfMgr from the vendor layer for MAD sends/receives **********************************************************************/ -static void osm_perfmgr_mad_unbind(osm_perfmgr_t * pm) +static void perfmgr_mad_unbind(osm_perfmgr_t * pm) { OSM_LOG_ENTER(pm->log); if (pm->bind_handle == OSM_BIND_INVALID_HANDLE) { @@ -307,7 +305,7 @@ Exit: **********************************************************************/ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port) { - ib_net32_t qp = cl_ntoh32(1); + ib_net32_t qp = IB_QP1; if (mon_node && mon_node->num_ports && port < mon_node->num_ports && mon_node->redir_port[port].redir_lid && @@ -396,7 +394,7 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, status = osm_vendor_send(perfmgr->bind_handle, p_madw, TRUE); if (status == IB_SUCCESS) { - /* pause this thread if we have too many outstanding requests */ + /* pause thread if there are too many outstanding requests */ cl_atomic_inc(&(perfmgr->outstanding_queries)); if (perfmgr->outstanding_queries > perfmgr->max_outstanding_queries) { @@ -426,7 +424,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context) if (cl_qmap_get(&pm->monitored_map, node_guid) == cl_qmap_end(&pm->monitored_map)) { - /* if not already in our map add it */ + /* if not already in map add it */ num_ports = osm_node_get_num_physp(node); mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * 
num_ports); @@ -484,7 +482,7 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context) num_ports = osm_node_get_num_physp(node); node_guid = cl_ntoh64(node->node_info.node_guid); - /* make sure we have a database object ready to store this information */ + /* make sure there is a database object ready to store this info */ if (perfmgr_db_create_entry(pm->db, node_guid, mon_node->esp0, num_ports, node->print_desc) != PERFMGR_EVENT_DB_SUCCESS) { @@ -538,8 +536,9 @@ Exit: /********************************************************************** * Discovery stuff. - * Basically this code should not be here, but merged with main OpenSM + * This code should not be here, but merged with main OpenSM **********************************************************************/ +extern int wait_for_pending_transactions(osm_stats_t * stats); extern void osm_drop_mgr_process(IN osm_sm_t * sm); static int sweep_hop_1(osm_sm_t * sm) @@ -680,7 +679,7 @@ static int sweep_hop_0(osm_sm_t * sm) h_bind = osm_sm_mad_ctrl_get_bind_handle(&sm->mad_ctrl); if (h_bind == OSM_BIND_INVALID_HANDLE) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "No bound ports.\n"); + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, "No bound ports\n"); return -1; } @@ -773,7 +772,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm) gettimeofday(&before, NULL); #endif pm->sweep_state = PERFMGR_SWEEP_ACTIVE; - /* With the global lock held collect the node guids */ + /* With the global lock held, collect the node guids */ /* FIXME we should be able to track SA notices * and not have to sweep the node_guid_tbl each pass */ @@ -785,9 +784,7 @@ void osm_perfmgr_process(osm_perfmgr_t * pm) /* then for each node query their counters */ cl_qmap_apply_func(&pm->monitored_map, perfmgr_query_counters, pm); - /* Clean out any nodes found to be removed during the - * sweep - */ + /* clean out any nodes found to be removed during the sweep */ remove_marked_nodes(pm); #if ENABLE_OSM_PERF_MGR_PROFILE @@ -812,7 +809,7 @@ void 
osm_perfmgr_process(osm_perfmgr_t * pm) /********************************************************************** * PerfMgr timer - loop continuously and signal SM to run PerfMgr - * processor. + * processor if enabled. **********************************************************************/ static void perfmgr_sweep(void *arg) { @@ -830,7 +827,7 @@ void osm_perfmgr_shutdown(osm_perfmgr_t * pm) OSM_LOG_ENTER(pm->log); cl_timer_stop(&pm->sweep_timer); cl_disp_unregister(pm->pc_disp_h); - osm_perfmgr_mad_unbind(pm); + perfmgr_mad_unbind(pm); OSM_LOG_EXIT(pm->log); } @@ -846,12 +843,12 @@ void osm_perfmgr_destroy(osm_perfmgr_t * pm) /********************************************************************** * Detect if someone else on the network could have cleared the counters - * without us knowing. This is easy to detect because the counters never wrap - * but are "sticky" + * without us knowing. This is easy to detect because the counters never + * wrap but are "sticky" * - * The one time this will not work is if the port is getting errors fast enough - * to have the reading overtake the previous reading. In this case counters - * will be missed. + * The one time this will not work is if the port is getting errors fast + * enough to have the reading overtake the previous reading. In this case, + * counters will be missed. **********************************************************************/ static void perfmgr_check_oob_clear(osm_perfmgr_t * pm, monitored_node_t * mon_node, uint8_t port, @@ -1051,9 +1048,9 @@ static void perfmgr_log_events(osm_perfmgr_t * pm, /********************************************************************** * The dispatcher uses a thread pool which will call this function when - * we have a thread available to process our mad received from the wire. + * there is a thread available to process the mad received on the wire. 
**********************************************************************/ -static void pc_rcv_process(void *context, void *data) +static void pc_recv_process(void *context, void *data) { osm_perfmgr_t *pm = context; osm_madw_t *p_madw = data; @@ -1070,8 +1067,9 @@ static void pc_rcv_process(void *context, void *data) OSM_LOG_ENTER(pm->log); - /* go ahead and get the monitored node struct to have the printable - * name if needed in messages + /* + * get the monitored node struct to have the printable name + * for log messages */ if ((p_node = cl_qmap_get(&pm->monitored_map, node_guid)) == cl_qmap_end(&pm->monitored_map)) { @@ -1207,7 +1205,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm, pm->log = &osm->log; pm->mad_pool = &osm->mad_pool; pm->vendor = osm->p_vendor; - pm->trans_id = OSM_PERFMGR_INITIAL_TID_VALUE; + pm->trans_id = PERFMGR_INITIAL_TID_VALUE; pm->lock = &osm->lock; pm->state = p_opt->perfmgr ? PERFMGR_STATE_ENABLED : PERFMGR_STATE_DISABLE; @@ -1227,7 +1225,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm, } pm->pc_disp_h = cl_disp_register(&osm->disp, OSM_MSG_MAD_PORT_COUNTERS, - pc_rcv_process, pm); + pc_recv_process, pm); if (pm->pc_disp_h == CL_DISP_INVALID_HANDLE) { perfmgr_db_destroy(pm->db); goto Exit; @@ -1256,7 +1254,7 @@ void osm_perfmgr_clear_counters(osm_perfmgr_t * pm) } /******************************************************************* - * Have the DB dump its information to the file specified + * Dump the DB information to the file specified *******************************************************************/ void osm_perfmgr_dump_counters(osm_perfmgr_t * pm, perfmgr_db_dump_t dump_type) { @@ -1276,7 +1274,7 @@ void osm_perfmgr_dump_counters(osm_perfmgr_t * pm, perfmgr_db_dump_t dump_type) } /******************************************************************* - * Have the DB print its information to the fp specified + * Print the DB information to the fp specified 
*******************************************************************/ void osm_perfmgr_print_counters(osm_perfmgr_t * pm, char *nodename, FILE * fp) { diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c index 3034894..e5dfc19 100644 --- a/opensm/opensm/osm_perfmgr_db.c +++ b/opensm/opensm/osm_perfmgr_db.c @@ -247,8 +247,8 @@ debug_dump_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num, * perfmgr_db_err_reading_t functions **********************************************************************/ perfmgr_db_err_t -perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid, - uint8_t port, perfmgr_db_err_reading_t * reading) +perfmgr_db_add_err_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port, + perfmgr_db_err_reading_t * reading) { db_port_t *p_port = NULL; db_node_t *node = NULL; @@ -389,8 +389,8 @@ debug_dump_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port_num, * perfmgr_db_data_cnt_reading_t functions **********************************************************************/ perfmgr_db_err_t -perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid, - uint8_t port, perfmgr_db_data_cnt_reading_t * reading) +perfmgr_db_add_dc_reading(perfmgr_db_t * db, uint64_t guid, uint8_t port, + perfmgr_db_data_cnt_reading_t * reading) { db_port_t *p_port = NULL; db_node_t *node = NULL; From sashak at voltaire.com Wed May 6 08:00:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 6 May 2009 18:00:31 +0300 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Cosmetic changes In-Reply-To: <20090506141326.GA29542@comcast.net> References: <20090506141326.GA29542@comcast.net> Message-ID: <20090506150031.GA29470@sk> On 10:13 Wed 06 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From weiny2 at llnl.gov Wed May 6 08:45:33 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 6 May 2009 08:45:33 -0700 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090506122952.GA28975@sk> References: <20090422185441.6f8601dc.weiny2@llnl.gov> <20090425175710.GI28604@sk> <20090427150409.9c10e479.weiny2@llnl.gov> <20090501173806.GF14714@sk.iol.unh.edu> <20090501165334.59bf72a9.weiny2@llnl.gov> <20090506122952.GA28975@sk> Message-ID: <20090506084533.2d182a1d.weiny2@llnl.gov> On Wed, 6 May 2009 15:29:52 +0300 Sasha Khapyorsky wrote: > On 16:53 Fri 01 May , Ira Weiny wrote: > > > > I did not attempt to preserve any switch or HCA order printing. I don't know > > of any utils which require this. Am I wrong? > > I don't think that some utils could need it (unless there are bugs), I > just diff-ed old and new outputs. Ok, we are in agreement then. I have done extensive testing by diffing the output and I agree it is a pain... ;-) Sorry about that. Ira -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From jon at opengridcomputing.com Wed May 6 08:56:42 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 6 May 2009 10:56:42 -0500 Subject: [ofa-general] OFED, the backported header and sg_init_table() In-Reply-To: <200905060830.56062.jackm@dev.mellanox.co.il> References: <200905051021.36725.jackm@dev.mellanox.co.il> <20090505150635.GA30788@opengridcomputing.com> <200905060830.56062.jackm@dev.mellanox.co.il> Message-ID: <20090506155641.GA4935@opengridcomputing.com> On Wed, May 06, 2009 at 08:30:55AM +0300, Jack Morgenstein wrote: > On Tuesday 05 May 2009 18:06, Jon Mason wrote: > > No, we currently duplicate all the scatterlist functionality.  Including > > ncrypto.h would greatly simplify the backport headers, but it is a > > RHEL5.2/5.3 only solution.  
If this change is needed for all other > > backports, then a better solution will be needed. > > > Each backport has its OWN directory. The backports are not identical > for all kernels. There is absolutely no problem with handling backports > per kernel/per distribution. Therefore, the RHEL 5.2/5.3 solution can be > used for those backports alone, without affecting any of the others. > Other backports will have a different change. Yes, the point I was trying to make is that the fix I have will only apply to RHEL5. If a more sweeping change is needed, then something else will need to be done. I believe the root issue with the reported was on RHEL5.3, so this will probably solve their problem unless they need it for all OFED supported releases. > For RHEL5.2/5.3, my concern is that if someone will actually write an ncrypto kernel > application, and include ncrypto.h along with the infiniband headers, there will be > compilation problems because the scatterlist functionality fixes will appear twice. Excellent point. The patch I just sent out should prevent this from happening as well. 
> Specifically, OFED 1.4.1 has the following INDIVIDUAL/independent backports, and > each one is handled differently: > 2.6.16 > 2.6.16_sles10 > 2.6.16_sles10_sp1 > 2.6.16_sles10_sp2 > 2.6.17 > 2.6.18 > 2.6.18-EL5.1 > 2.6.18-EL5.2 > 2.6.18-EL5.3 > 2.6.18_FC6 (also for EL5.0) > 2.6.18_suse10_2 > 2.6.19 > 2.6.20 > 2.6.21 > 2.6.22 > 2.6.22_suse10_3 > 2.6.23 > 2.6.24 > 2.6.25 > 2.6.26 > 2.6.9_U4 > 2.6.9_U5 > 2.6.9_U6 > 2.6.9_U7 > > - Jack From monis at Voltaire.COM Wed May 6 09:06:52 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Wed, 06 May 2009 19:06:52 +0300 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <20090501213652.GO32114@obsidianresearch.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> <20090427162349.GI4431@obsidianresearch.com> <49F9A729.3090904@voltaire.com> <20090501213652.GO32114@obsidianresearch.com> Message-ID: <4A01B59C.5080301@Voltaire.COM> Jason Gunthorpe wrote: > On Thu, Apr 30, 2009 at 04:27:05PM +0300, Or Gerlitz wrote: >> Jason Gunthorpe wrote: >>> including a PID is not best, you should include enough information to >>> figure out the pid(s) from proc/xx/fd, and vice versa. > >> maybe its not the best solution but it seems to me good enough > > Well, we have to live with these interfaces literally forever, > shortcuts ultimately just cause more problems down the road.. > > Really the thinking should be 'I want to make lsof work usefully' not > 'I want some random and different hack to let me see something'. And > yes, that is harder. But the IB stack is now at the point where these > small hard things are the sort of work that is needed to get parity > with the other stuff in linux.. 
> > Jason > debugfs is a common way to export data from kernel to users (IPoIB uses it) and it has its advantages. On the other hand, netlink has its disadvantages, so I don't think that debugfs is the wrong way. It's just another way. Besides, remember that rdmacm is only aware of part of the opened QPs on the host, which may lead to confusion for someone who reads the output of lsof ("I know that there is an open QP but I don't see it in the list"). From weiny2 at llnl.gov Wed May 6 09:33:47 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 6 May 2009 09:33:47 -0700 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090506100744.GB10145@sk> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090506100744.GB10145@sk> Message-ID: <20090506093347.bb1b56be.weiny2@llnl.gov> On Wed, 6 May 2009 13:07:44 +0300 Sasha Khapyorsky wrote: > On 14:29 Thu 30 Apr , Ira Weiny wrote: > > From: Ira Weiny > > Date: Wed, 29 Apr 2009 10:15:55 -0700 > > Subject: [PATCH] Add combined routing support to libibnetdisc > > > > Also allow a scan to start at a switch. 
> > > > Signed-off-by: Ira Weiny > > --- > > infiniband-diags/libibnetdisc/src/ibnetdisc.c | 28 ++++++++++++++++++------ > > 1 files changed, 21 insertions(+), 7 deletions(-) > > > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > index 0ff5134..fc19633 100644 > > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) > > } > > > > static int > > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) > > { > > - int rc = add_port_to_dpath(path, nextport); > > - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > > - f->fabric.maxhops_discovered = path->cnt; > > + int rc = 0; > > + > > + if (portid->lid && !portid->drpath.drslid) { > > + /* If we were LID routed > > + * AND have not done so already > > + * we need to set up the drslid > > + */ > > + ib_portid_t selfportid = { 0 }; > > + if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) > > + return -1; > > And wouldn't it be better instead of resolving selfport on each > extend_path() call to keep it already resolved somewhere in fabric > structure? This will only happen 1 time for each fabric being scan'ed because the path is reused... Oh wait a minute, I just reviewed the code... For the current use case the path is reused since I am only scanning 1 node. However, in the general case this is not true. Sorry about that. A new patch is below. Ira From: Ira Weiny Date: Wed, 29 Apr 2009 10:15:55 -0700 Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed. 
Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 27 ++++++++++++++++++------ infiniband-diags/libibnetdisc/src/internal.h | 1 + 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 0ff5134..1e93ff8 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -177,11 +177,25 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) } static int -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) { - int rc = add_port_to_dpath(path, nextport); - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) - f->fabric.maxhops_discovered = path->cnt; + int rc = 0; + + if (portid->lid) { + /* If we were LID routed we need to set up the drslid */ + if (!f->selfportid.lid) + if (ib_resolve_self_via(&f->selfportid, NULL, NULL, + f->fabric.ibmad_port) < 0) + return -1; + + portid->drpath.drslid = f->selfportid.lid; + portid->drpath.drdlid = 0xFFFF; + } + + rc = add_port_to_dpath(&portid->drpath, nextport); + + if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) + f->fabric.maxhops_discovered = portid->drpath.cnt; return (rc); } @@ -447,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ != IB_PORT_PHYS_STATE_LINKUP) return -1; - if (extend_dpath(fabric, &path->drpath, portnum) < 0) + if (extend_dpath(fabric, path, portnum) < 0) return -1; if (query_node(fabric, &node_buf, &port_buf, path)) { @@ -546,8 +560,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, if (!port) IBPANIC("out of memory"); - if (node->node.type != IB_NODE_SWITCH && - get_remote_node(fabric, node, port, from, + if(get_remote_node(fabric, node, port, from, mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), 0) < 0) return ((ibnd_fabric_t 
*)fabric); diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index 4e6bb18..5785e33 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -88,6 +88,7 @@ struct ibnd_fabric { struct ibnd_node *switches; struct ibnd_node *ch_adapters; struct ibnd_node *routers; + ib_portid_t selfportid; }; #define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) -- 1.5.4.5 From weiny2 at llnl.gov Wed May 6 09:51:14 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 6 May 2009 09:51:14 -0700 Subject: [ofa-general] [PATCH] Fix 2 formatting diff's from old ibqueryerrors. Message-ID: <20090506095114.3893f4aa.weiny2@llnl.gov> 2 changes I noted in the output from ibqueryerrors. "Link Info:" was not being printed when "-r" was used. The "header": Errors for 0x "" Should only be printed when errors are found. The following patch cleans those up. Ira From: Ira Weiny Date: Tue, 28 Apr 2009 14:39:11 -0700 Subject: [PATCH] Fix 2 formatting diff's from old ibqueryerrors. 
Signed-off-by: Ira Weiny --- infiniband-diags/src/ibqueryerrors.c | 29 ++++++++++++++++------------- 1 files changed, 16 insertions(+), 13 deletions(-) diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index 09861be..70c3d48 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -123,7 +123,6 @@ print_port_config(ibnd_node_t *node, int portnum) char speed_msg[256]; char ext_port_str[256]; int iwidth, ispeed, istate, iphystate; - int n = 0; ibnd_port_t *port = node->ports[portnum]; @@ -140,7 +139,7 @@ print_port_config(ibnd_node_t *node, int portnum) width_msg[0] = '\0'; speed_msg[0] = '\0'; - n = snprintf(link_str, 256, "(%3s %s %6s/%8s)", + snprintf(link_str, 256, "(%3s %s %6s/%8s)", mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed), mad_dump_val(IB_PORT_STATE_F, state, 64, &istate), @@ -177,9 +176,9 @@ print_port_config(ibnd_node_t *node, int portnum) ext_port_str[0] = '\0'; if (node->type == IB_NODE_SWITCH) - printf(" %6d", node->smalid); + printf(" Link info: %6d", node->smalid); else - printf(" %6d", port->base_lid); + printf(" Link info: %6d", port->base_lid); printf("%4d[%2s] ==%s==> %s", port->portnum, ext_port_str, link_str, remote_str); @@ -211,7 +210,7 @@ report_suppressed(void) } static void -print_results(ibnd_node_t *node, uint8_t *pc, int portnum) +print_results(ibnd_node_t *node, uint8_t *pc, int portnum, int *header_printed) { char buf[1024]; char *str = buf; @@ -237,7 +236,6 @@ print_results(ibnd_node_t *node, uint8_t *pc, int portnum) /* if we found errors. 
*/ if (n != 0) { - char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); if (data_counters) for (i = IB_PC_XMT_BYTES_F; i <= IB_PC_RCV_PKTS_F; i++) { uint64_t val64 = 0; @@ -247,17 +245,21 @@ print_results(ibnd_node_t *node, uint8_t *pc, int portnum) mad_field_name(i), val64); } - printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename); - printf(" GUID 0x%" PRIx64 " port %d:%s\n", - node->guid, portnum, str); + if (!*header_printed) { + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); + printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename); + *header_printed = 1; + free(nodename); + } + + printf(" GUID 0x%" PRIx64 " port %d:%s\n", node->guid, portnum, str); if (port_config) print_port_config(node, portnum); - free(nodename); } } static void -print_port(ibnd_node_t *node, int portnum) +print_port(ibnd_node_t *node, int portnum, int *header_printed) { uint8_t pc[1024]; uint16_t cap_mask; @@ -291,7 +293,7 @@ print_port(ibnd_node_t *node, int portnum) uint32_t foo = 0; mad_encode_field(pc, IB_PC_XMT_WAIT_F, &foo); } - print_results(node, pc, portnum); + print_results(node, pc, portnum, header_printed); cleanup: free(nodename); @@ -300,6 +302,7 @@ cleanup: void print_node(ibnd_node_t *node, void *user_data) { + int header_printed = 0; int p = 0; int startport = 1; @@ -311,7 +314,7 @@ print_node(ibnd_node_t *node, void *user_data) for (p = startport; p <= node->numports; p++) { if (node->ports[p]) { - print_port(node, p); + print_port(node, p, &header_printed); } } } -- 1.5.4.5 From weiny2 at llnl.gov Wed May 6 09:53:03 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 6 May 2009 09:53:03 -0700 Subject: [ofa-general] [PATCH] Clean up printing of switch heading when printing "down links" only. Message-ID: <20090506095303.f11659f1.weiny2@llnl.gov> Another corner case: If there are no down links on a switch and "-d" is selected then the header for that switch should not be printed. 
Ira From: Ira Weiny Date: Thu, 30 Apr 2009 13:41:38 -0700 Subject: [PATCH] Clean up printing of switch heading when printing "down links" only. Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 14 +++++++------- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 2454bf2..cf38ecb 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -205,13 +205,8 @@ void print_switch(ibnd_node_t *node, void *user_data) { int i = 0; - - if (!line_mode) { - char *remap = remap_node_name(node_name_map, node->guid, - node->nodedesc); - printf("Switch 0x%016"PRIx64" %s:\n", node->guid, remap); - free(remap); - } + int head_print = 0; + char *remap = remap_node_name(node_name_map, node->guid, node->nodedesc); for (i = 1; i <= node->numports; i++) { ibnd_port_t *port = node->ports[i]; @@ -219,9 +214,14 @@ print_switch(ibnd_node_t *node, void *user_data) continue; if (!down_links_only || mad_get_field(port->info, 0, IB_PORT_STATE_F) == IB_LINK_DOWN) { + if (!head_print && !line_mode) { + printf("Switch 0x%016"PRIx64" %s:\n", node->guid, remap); + head_print = 1; + } print_port(node, port); } } + free(remap); } void -- 1.5.4.5 From jgunthorpe at obsidianresearch.com Wed May 6 10:38:50 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 6 May 2009 11:38:50 -0600 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <4A01B59C.5080301@Voltaire.COM> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> <20090427162349.GI4431@obsidianresearch.com> <49F9A729.3090904@voltaire.com> <20090501213652.GO32114@obsidianresearch.com> <4A01B59C.5080301@Voltaire.COM> Message-ID: <20090506173850.GJ2590@obsidianresearch.com> 
On Wed, May 06, 2009 at 07:06:52PM +0300, Moni Shoua wrote:

> > Really the thinking should be 'I want to make lsof work usefully' not
> > 'I want some random and different hack to let me see something'. And
> > yes, that is harder. But the IB stack is now at the point where these
> > small hard things are the sort of work that is needed to get parity
> > with the other stuff in linux..

> debugfs is a common way to export data from kernel to users (IPoIB
> uses it) and it has its advantages. On the other hand, netlink has
> its disadvantages so, I don't think that debugfs is the wrong
> way. It's just another way.

Gah, no! Debugfs is NOT meant to be used for users, it is for kernel
debugging. It is specifically not a stable API and commonly used user
space apps should not rely on it. This is why the distros don't mount
it by default.

Viewing the active QPs and RDMA CM connections is not kernel
debugging, it is necessary data for end user app debugging.

> Besides, remember that rdmacm is only aware of part of the opened
> QPs on the host which may lead to a confusion for one who reads the
> output of lsof ("I know that there is an open QP but I don't see it
> in the list")

Sure, but clearly the end user desire is to make lsof work properly
with all the new objects the IB and verbs APIs introduce to the
kernel. I don't think your patch advances that goal at all.

Jason

From jsquyres at cisco.com Wed May 6 10:42:37 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 6 May 2009 13:42:37 -0400
Subject: [ofa-general] Memory registration redux
In-Reply-To: <4A0199FC.4000006@mellanox.co.il>
References: <4A0199FC.4000006@mellanox.co.il>
Message-ID:

On May 6, 2009, at 10:09 AM, Tziporet Koren wrote:

> I think the new proposal is good (but I am not MPI expert)
> If we implement it soon we will be able to enable it in OFED 1.5 too
>

That sounds good, as long as we don't diverge from upstream (like what
happened with XRC).
> I think the cache in libibverbs can be delayed since it can be added
> after the API will the kernel is available
>

Fair enough.

--
Jeff Squyres
Cisco Systems

From rdreier at cisco.com Wed May 6 13:08:31 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 13:08:31 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: (Jeff Squyres's message of "Tue, 5 May 2009 16:57:09 -0400")
References:
Message-ID:

> Roland and I chatted on the phone today; I think I now understand
> Roland's counter-proposal (I clearly didn't before). Let me try to
> summarize:
>
> 1. Add a new verb for "set this userspace flag to 1 if mr X ever
> becomes invalid"
> 2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
> (i.e., remove the effects of #1)
> 3. Add run-time query indicating whether #1 works
> 4. Add [optional] memory registration caching to libibverbs

Looking closer at how to actually implement this, I see that the MMU
notifiers (cf ) may be called with locks held, so the kernel can't do a
put_user() or the equivalent from the notifier. Therefore I think the
interface we would expose to userspace would be something more like
mmap() on some special file to get some kernel memory mapped into
userspace, and then ioctl() to register/unregister a "set this flag if
address range X...Y is affected."

To be honest I don't really love this idea -- the kernel still needs a
fairly complicated data structure to efficiently track the address
ranges being tracked, the size of the mmap() limits the number of
ranges being tracked based on a static limit set at initialization
time (or handling multiple maps gets still more complex), and there is
some careful thinking required to make sure there are no memory
ordering or cache aliasing issues.

So then I thought some about how to implement the full MR cache in the
kernel.
And that fairly quickly gets into some complex stuff as well -- for
example, since we can't take sleeping locks from MMU notifiers, but we
can't hold non-sleeping locks across MR register operations, we need
to drop our MR cache lock while registering things, which forces us to
deal with rolling back registrations if we miss the cache initially
but then find that another thread has already added a registration to
the cache while we were trying to register the same memory.

Keeping the actual MR caching in userspace does seem to make things
simpler because the locking is much easier without having to worry
about sleeping vs. non-sleeping locks.

Also doing the cache in userspace with my flag idea above has the nice
property that the fast path of hitting the cache on memory
registration has no system call and in fact testing the flag may even
be a CPU cache hit if memory registration is a hot enough path. Doing
it in the kernel means even the best case has a system call -- which
is very cheap with current CPUs but still a non-zero cost.

So I'm really not sure what the right way to go is yet. Further
opinions would be helpful.

 - R.

From rdreier at cisco.com Wed May 6 13:10:47 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 13:10:47 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: (Jeff Squyres's message of "Tue, 5 May 2009 16:57:09 -0400")
References:
Message-ID:

By the way, what's the desired behavior of the cache if a process
registers, say, address range 0x1000 ... 0x3fff, and then the same
process registers address range 0x2000 ... 0x2fff (with all the same
permissions, etc)?

The initial registration creates an MR that is still valid for the
smaller virtual address range, so the second registration is much
cheaper if we used the cached registration; but if we use the cache for
the second registration, and then deregister the first one, we're stuck
with a too-big range pinned in the cache because of the second
registration.
 - R.

From jgunthorpe at obsidianresearch.com Wed May 6 14:46:28 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 15:46:28 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To:
References:
Message-ID: <20090506214628.GM2590@obsidianresearch.com>

On Wed, May 06, 2009 at 01:10:47PM -0700, Roland Dreier wrote:

> By the way, what's the desired behavior of the cache if a process
> registers, say, address range 0x1000 ... 0x3fff, and then the same
> process registers address range 0x2000 ... 0x2fff (with all the same
> permissions, etc)?
>
> The initial registration creates an MR that is still valid for the
> smaller virtual address range, so the second registration is much
> cheaper if we used the cached registration; but if we use the cache for
> the second registration, and then deregister the first one, we're stuck
> with a too-big range pinned in the cache because of the second
> registration.

Yuk, doesn't this problem pretty much doom this method entirely? You
can't tear down the entire registration of 0x1000 ... 0x3fff if the app
does something to change 0x2000 .. 0x2fff because it may have active
RDMAs going on in 0x1000 ... 0x1fff.

The above could happen through strange use of brk.

What about a slightly different twist.. Instead of trying to make
everything synchronous in the mmu_notifier, just have a counter mapped
to user space. Increment the counter whenever the mms change from the
notifier. Pin the user page that contains the single counter upon
starting the process so access is lockless.

In user space, check the counter before every cache lookup and if it
has changed call back into the kernel to resynchronize the MR tables in
the HCA to the current VM.

Avoids the locking and racing problems, keeps the fast path in the
user space and avoids the above question about how to deal with
arbitrary actions?
Jason

From rdreier at cisco.com Wed May 6 14:56:25 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 14:56:25 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090506214628.GM2590@obsidianresearch.com> (Jason Gunthorpe's message of "Wed, 6 May 2009 15:46:28 -0600")
References: <20090506214628.GM2590@obsidianresearch.com>
Message-ID:

> Yuk, doesn't this problem pretty much doom this method entirely? You
> can't tear down the entire registration of 0x1000 ... 0x3fff if the app
> does something to change 0x2000 .. 0x2fff because it may have active
> RDMAs going on in 0x1000 ... 0x1fff.

Yes, I guess if we try to reuse registrations like this then we run into
trouble. I think your example points to a problem if an app registers
0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.

But we can get around this just by not ever reusing registrations that
way -- only treat something as a cache hit if it matches the start and
length exactly.

> What about a slightly different twist.. Instead of trying to make
> everything synchronous in the mmu_notifier, just have a counter mapped
> to user space. Increment the counter whenever the mms change from the
> notifier. Pin the user page that contains the single counter upon
> starting the process so access is lockless.
>
> In user space, check the counter before every cache lookup and if it
> has changed call back into the kernel to resynchronize the MR tables in
> the HCA to the current VM.
>
> Avoids the locking and racing problems, keeps the fast path in the
> user space and avoids the above question about how to deal with
> arbitrary actions?

I like the simplicity of the fast path. But it seems the slow path
would be hard -- how exactly did you envision resynchronizing the MR
tables? (Considering that RDMAs might be in flight for MRs that weren't
changed by the MM operations)

 - R.
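
The "only treat something as a cache hit if it matches the start and
length exactly" rule above can be sketched roughly as follows. This is
only an illustration under stated assumptions: the struct, its fields
and the function name are hypothetical and are not part of libibverbs
or of any patch in this thread, and a flat array stands in for whatever
lookup structure a real cache would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry; none of these names come from libibverbs. */
struct mr_cache_entry {
	uintptr_t start;	/* first byte of the registered range */
	size_t length;		/* length of the registered range in bytes */
	int access;		/* access flags requested at registration */
	void *mr;		/* opaque handle for the cached registration */
	int in_use;		/* nonzero if this slot holds a registration */
};

/*
 * Return a cached entry only when start, length and access all match
 * exactly. A request that merely falls inside a larger cached range is
 * reported as a miss, so a sub-range never keeps a bigger range pinned.
 */
struct mr_cache_entry *
mr_cache_lookup_exact(struct mr_cache_entry *cache, size_t nentries,
		      uintptr_t start, size_t length, int access)
{
	size_t i;

	assert(cache != NULL || nentries == 0);
	for (i = 0; i < nentries; i++) {
		struct mr_cache_entry *e = &cache[i];

		if (e->in_use && e->start == start &&
		    e->length == length && e->access == access)
			return e;
	}
	return NULL;
}
```

With 0x1000...0x3fff cached (start 0x1000, length 0x3000), a lookup for
0x2000...0x2fff misses and would get its own registration, which is the
cost this rule trades for never leaving the larger range pinned on
behalf of the smaller one.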
From landman at scalableinformatics.com Wed May 6 15:26:40 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 06 May 2009 18:26:40 -0400 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F1FCEF.3030305@mellanox.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> Message-ID: <4A020EA0.5030605@scalableinformatics.com> Vu Pham wrote: > Hi Celine, > > What HCA do you have on your system? Is it ConnectX? If yes, what is its > firmware version? I am seeing this also on a server with ConnectX and a client with mthca. My mount hangs: /sbin/mount.nfs 10.1.1.2:/data /data -o rdma,intr,port=2050 ^C Leaving this in the logs: May 6 18:14:03 dv3 kernel: [ 9997.015209] rpcrdma: connection to 10.1.1.2:2050 on mthca0, memreg 6 slots 32 ird 4 May 6 18:14:03 dv3 kernel: [ 9997.015582] rpcrdma: connection to 10.1.1.2:2050 closed (-103) rdma seems to work root at dv3:~# ib_rdma_bw -b -i 2 6222: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=1 | cma=0 | 6222: Local address: ... 6222: Remote address: ... 
6222: Bandwidth peak (#0 to #245): 1765.83 MB/sec 6222: Bandwidth average: 1724.45 MB/sec 6222: Service Demand peak (#0 to #245): 884 cycles/KB 6222: Service Demand Avg : 906 cycles/KB root at dv3:~# showmount -e 10.1.1.2 Export list for 10.1.1.2: /data * On the server side, I see May 6 14:07:53 jr4 mountd[5673]: authenticated mount request from 10.1.1.1:940 for /data (/data) On server for rping [ root at jr4 ~]# rping -s cq completion failed status 4 wait for RDMA_READ_COMPLETE state 10 on the client side for rping root at dv3:~# rping -S 100 -d -v -c -a 10.1.1.2 verbose client created cm_id 0x606690 cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x606690 (parent) cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x606690 (parent) rdma_resolve_addr - rdma_resolve_route successful created pd 0x608be0 created channel 0x6068c0 created cq 0x608c30 created qp 0x608d50 rping_setup_buffers called on cb 0x605010 allocated & registered buffers... cq_thread started. cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x606690 (parent) ESTABLISHED rmda_connect successful RDMA addr 60a8d0 rkey 116003d len 100 send completion cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x606690 (parent) client DISCONNECT EVENT... wait for RDMA_WRITE_ADV state 6 cq completion failed status 5 rping_free_buffers called on cb 0x605010 destroy cm_id 0x606690 Any hints on the 103 error? I have 2.6.000 firmware on the ConnectX. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jgunthorpe at obsidianresearch.com Wed May 6 15:26:38 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 6 May 2009 16:26:38 -0600 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com> Message-ID: <20090506222638.GA16280@obsidianresearch.com> On Wed, May 06, 2009 at 02:56:25PM -0700, Roland Dreier wrote: > > Yuk, doesn't this problem pretty much doom this method entirely? You > > can't tear down the entire registration of 0x1000 ... 0x3fff if the app > > does something to change 0x2000 .. 0x2fff because it may have active > > RDMAs going on in 0x1000 ... 0x1fff. > > Yes, I guess if we try to reuse registrations like this then we run into > trouble. I think your example points to a problem if an app registers > 0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff > and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff. > > But we can get around this just by not ever reusing registrations that > way -- only treat something as a cache hit if it matches the start and > length exactly. I can't comment on that, but it feels to me like a reasonable MPI use model would be to do small IOs randomly from the same allocation, and pre-hint to the library it wants that whole area cached in one shot. > > What about a slightly different twist.. Instead of trying to make > > everything synchronous in the mmu_notifier, just have a counter mapped > > to user space. Increment the counter whenever the mms change from the > > notifier. Pin the user page that contains the single counter upon > > starting the process so access is lockless. 
> >
> > In user space, check the counter before every cache lookup and if it
> > has changed call back into the kernel to resynchronize the MR tables in
> > the HCA to the current VM.
> >
> > Avoids the locking and racing problems, keeps the fast path in the
> > user space and avoids the above question about how to deal with
> > arbitrary actions?
>
> I like the simplicity of the fast path. But it seems the slow path
> would be hard -- how exactly did you envision resynchronizing the MR
> tables? (Considering that RDMAs might be in flight for MRs that weren't
> changed by the MM operations)

Well, this conceptually doesn't seem hard. Go through all the pages in
the MR, if any have changed then pin the new page and replace the
pages physical address in the HCA's page table. Once done, synchronize
with the hardware, then run through again and un-pin and release all
the replaced pages.

Every HCA must have the necessary primitives for this to support
register and unregister...

An RDMA that is in progress to any page that is replaced is a 'use
after free' type programming error. (And this means certain wacky
uses, like using MAP_FIXED on memory that is under active RDMA, would
be unsupported without an additional call)

Doing this on a page by page basis rather than on a registration by
registration basis is granular enough to avoid the problem you
noticed.

The mmu notifiers can optionally make note of the affected pages
during the callback to reduce the workload of the syscall.

Only part I don't immediately see is how to trap creation of new VM
(ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
Jason

From rdreier at cisco.com Wed May 6 15:39:54 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 May 2009 15:39:54 -0700
Subject: [ofa-general] Memory registration redux
In-Reply-To: <20090506222638.GA16280@obsidianresearch.com> (Jason Gunthorpe's message of "Wed, 6 May 2009 16:26:38 -0600")
References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com>
Message-ID:

> Well, this conceptually doesn't seem hard. Go through all the pages in
> the MR, if any have changed then pin the new page and replace the
> pages physical address in the HCA's page table. Once done, synchronize
> with the hardware, then run through again and un-pin and release all
> the replaced pages.
>
> Every HCA must have the necessary primitives for this to support
> register and unregister...

No... every HCA just needs to support register and unregister. It
doesn't have to support changing the mapping without full unregister and
reregister.

Also this requires potentially walking the page tables of the entire
process, checking to see if any mappings have changed. We really want
to keep the information that the MMU notifiers give us, namely which
virtual address range is changing.

> The mmu notifiers can optionally make note of the affected pages
> during the callback to reduce the workload of the syscall.

This requires an unbounded amount of events to be queued up in the
kernel, naively. (If we lose some events then we have to go back to the
full page table scan, which I don't think is feasible)

> Only part I don't immediately see is how to trap creation of new VM
> (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..

Why do we care? The initial faulting in of mappings occurs when an MR
is created.

 - R.
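
To make the point about keeping "which virtual address range is
changing" concrete, here is a rough sketch of using a reported range to
mark only the registrations it touches, instead of rescanning
everything. All names here are hypothetical and assumed for
illustration only; this is plain C over an array, not kernel code, and
it ignores the locking constraints discussed earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one tracked registration. */
struct tracked_range {
	uintptr_t start;	/* first byte of the registered range */
	uintptr_t end;		/* one past the last byte */
	int invalid;		/* set once an overlapping change is seen */
};

/*
 * Mark every tracked range that intersects the changed half-open
 * interval [chg_start, chg_end) and return how many were newly marked.
 * Ranges that do not overlap the change are left untouched.
 */
size_t
mark_overlapping_invalid(struct tracked_range *r, size_t n,
			 uintptr_t chg_start, uintptr_t chg_end)
{
	size_t i, marked = 0;

	assert(r != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (r[i].invalid)
			continue;
		/* Two half-open intervals overlap iff each starts before the other ends. */
		if (r[i].start < chg_end && chg_start < r[i].end) {
			r[i].invalid = 1;
			marked++;
		}
	}
	return marked;
}
```

The state kept per registration is a single flag, so repeated changes
to the same range do not queue up additional events; a real
implementation would presumably want an interval tree or similar rather
than a linear scan.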
From jgunthorpe at obsidianresearch.com Wed May 6 17:02:31 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 6 May 2009 18:02:31 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To:
References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com>
Message-ID: <20090507000231.GB16280@obsidianresearch.com>

On Wed, May 06, 2009 at 03:39:54PM -0700, Roland Dreier wrote:

> > Well, this conceptually doesn't seem hard. Go through all the pages in
> > the MR, if any have changed then pin the new page and replace the
> > pages physical address in the HCA's page table. Once done, synchronize
> > with the hardware, then run through again and un-pin and release all
> > the replaced pages.
> >
> > Every HCA must have the necessary primitives for this to support
> > register and unregister...
>
> No... every HCA just needs to support register and unregister. It
> doesn't have to support changing the mapping without full unregister and
> reregister.

Well, I would imagine this entire process to be a HCA specific
operation, so HW that supports a better method can use it, otherwise
it has to register/unregister. Is this a concern today with existing
HCAs?

Using register/unregister exposes a race for the original case you
brought up - but that race is completely unfixable without hardware
support. At least it now becomes a hw specific race that can be
printk'd and someday fixed in new HW rather than an unsolvable API
problem..

> Also this requires potentially walking the page tables of the entire
> process, checking to see if any mappings have changed. We really want
> to keep the information that the MMU notifiers give us, namely which
> virtual address range is changing.

Walking the page tables of every registration in the process, not the
entire process.

> > The mmu notifiers can optionally make note of the affected pages
> > during the callback to reduce the workload of the syscall.
> This requires an unbounded amount of events to be queued up in the
> kernel, naively. (If we lose some events then we have to go back to the
> full page table scan, which I don't think is feasible)

I was thinking more along the lines of having the mmu notifiers put
affected registrations on a per-process (or PD?) dirty linked list,
with the link pointers as part of the registration structure. Set a
dirty flag in the registration too. An extra pointer per registration
and a minor incremental cost to the existing work the mmu notifier
would have to do.

> > Only part I don't immediately see is how to trap creation of new VM
> > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
>
> Why do we care? The initial faulting in of mappings occurs when an MR
> is created.

Well, exactly, that's the problem. If you can't trap mmap you cannot
do the initial faulting and mapping for a new object that is being
mapped into an existing MR.

Consider:

  void *a = mmap(0,PAGE_SIZE..);
  ibv_register();
  // [..]
  munmap(a);
  ibv_synchronize();
  // At this point we want the HCA mapping to point to oblivion
  mmap(a,PAGE_SIZE,MAP_FIXED);
  ibv_synchronize();
  // And now we want it to point to the new allocation

I use MAP_FIXED to illustrate the point, but Jeff has said the same
address re-use happens randomly in real apps.

This is the main deviation from your original idea, instead of having
a granular notification to userspace to unregister a region, the
kernel just goes and fixes it up so the existing registration still
works.

This method avoids the problem you noticed, but there is extra work to
fixup a registration that may never be used again. I strongly suspect
that in the majority of cases this extra work should be about on the
same order as userspace calling unregister on the MR.

Or, ignore the overlapping problem, and use your original technique,
slightly modified:
 - Userspace registers a counter with the kernel. Kernel pins the
   page, sets up mmu notifiers and increments the counter when
   invalidates intersect with registrations
 - Kernel maintains a linked list of registrations that have been
   invalidated via mmu notifiers using the registration structure
   and a dirty bit
 - Userspace checks the counter at every cache hit, if different it
   calls into the kernel:
       MR_Cookie *mrs[100];
       int rc = ibv_get_invalid_mrs(mrs,100);
       invalidate_cache(mrs,rc);
       // Repeat until drained

   get_invalid_mrs traverses the linked list and returns an
   identifying value to userspace, which looks it up in the cache,
   calls unregister and removes it from the cache.

Jason

From weiny2 at llnl.gov Wed May 6 18:01:40 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 6 May 2009 18:01:40 -0700
Subject: [ofa-general] [RFC][PATCH] ibnetdiscover: remove report of max hops discovered.
In-Reply-To:
References: <20090504151005.9a565bc5.weiny2@llnl.gov> <1241543312.18144.18.camel@auk31.llnl.gov>
Message-ID: <20090506180140.6213971e.weiny2@llnl.gov>

The number reported as "max hops" from ibnetdiscover can change
depending on the algorithm used to discover the fabric. As Hal says in
the message below using this number is therefore dangerous.

If no one is currently using this number I propose the patch below
which removes the "max hops discovered" from the output.

Ira

On Tue, 5 May 2009 14:25:32 -0400 Hal Rosenstock wrote:

> Hi Al,
>
> On Tue, May 5, 2009 at 1:08 PM, Al Chu wrote:

[snip]

> >
> > Ira says that the output of the hops is actually "max hops used to get
> > from my port to another port during my search of the network". So the
> > number could change if (hypothetical example) depth-first-search were
> > used instead of BFS.
>
> Sure; it can depend on how the search is done but isn't it the max
> from the initiated node (which could be different depending on the
> algo used) ? Using that number seems dangerous for that very reason. I
> always thought that number was "nice" to have but nothing more.
It > predated my work on ibnetdiscover. > > -- Hal > From: Ira Weiny Date: Wed, 6 May 2009 17:56:23 -0700 Subject: [PATCH] ibnetdiscover: remove report of max hops discovered. Signed-off-by: Ira Weiny --- infiniband-diags/src/ibnetdiscover.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 1799618..89e4f0f 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -448,7 +448,6 @@ dump_topology(int group, ibnd_fabric_t *fabric) struct iter_user_data iter_user_data; fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); - fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", fabric->from_node->guid, mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F)); -- 1.5.4.5 From IMCEAEX-_O=QLOGIC_OU=SPG_CN=RECIPIENTS_CN=KSHARMA at qlogic.com Wed May 6 22:03:51 2009 From: IMCEAEX-_O=QLOGIC_OU=SPG_CN=RECIPIENTS_CN=KSHARMA at qlogic.com (Karun Sharma (Contractor - GS Labs)) Date: Thu, 7 May 2009 00:03:51 -0500 Subject: [ofa-general] SDP error In-Reply-To: References: Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB43E8736567@MNEXMB1.qlogic.org> 1. Make sure ib_sdp is loaded. 2. Do "export LD_PRELOAD=/libsdp.so". path is /usr/lib64 for 64-bit systems. Thanks Karun ________________________________ From: general-bounces at lists.openfabrics.org [general-bounces at lists.openfabrics.org] On Behalf Of anthony garnier [sokar6012 at hotmail.com] Sent: Tuesday, May 05, 2009 6:23 PM To: general at lists.openfabrics.org Subject: [ofa-general] SDP error Hello, i`m running a debian 5.0 OS with ofed 1.4, RDMA work very well, but when I`m trying to use the SDP protocol with ssh, Netperf or a simple Client-Server programming in C, I got socket error like that : NetPIPE: can't open stream socket! 
errno=97 (for Netpipe) Address family not supported by protocol ssh (for ssh) Address family not supported by protocol (for client-server) Someone knows those errors? From dorons at voltaire.com Thu May 7 00:39:36 2009 From: dorons at voltaire.com (Doron Shoham) Date: Thu, 07 May 2009 10:39:36 +0300 Subject: [ofa-general] [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> Message-ID: <4A029038.2040603@voltaire.com> when setting max_op_vls = 0 do not force it to 1. 0 is valid value which means "No change" Signed-off-by: Doron Shoham --- opensm/opensm/osm_port.c | 4 ++-- opensm/opensm/osm_subnet.c | 8 ++++++++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 2e6c642..3679f29 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -379,8 +379,8 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, /* support user limitation of max_op_vls */ if (op_vls > p_subn->opt.max_op_vls) op_vls = p_subn->opt.max_op_vls; - - if (op_vls == 0) { + else if (op_vls == 0) { + /* for non compliant implementations */ OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n"); op_vls = 1; diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index ec15f8a..71fc7a0 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) "# switch port connected to a CA or router port\n" "leaf_head_of_queue_lifetime 0x%02x\n\n" "# Limit the maximal operational VLs\n" + "# Virtual Lanes operational on this port\n" + "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n" + "# 0: No change; valid only on Set()\n" + "# 1: VL0\n" + "# 2: VL0, VL1\n" + "# 3: VL0 - VL3\n" + "# 4: VL0 - VL7\n" + "# 5: VL0 - VL14\n" "max_op_vls %u\n\n" "# Force PortInfo:LinkSpeedEnabled on switch ports\n" "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n" -- 1.5.4 From ogerlitz at voltaire.com Thu May 7 01:06:56 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 07 May 2009 11:06:56 +0300 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <20090501213652.GO32114@obsidianresearch.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> <20090427162349.GI4431@obsidianresearch.com> <49F9A729.3090904@voltaire.com> <20090501213652.GO32114@obsidianresearch.com> Message-ID: <4A0296A0.3090308@voltaire.com> Jason Gunthorpe wrote: > Well, we have to live with these interfaces literally forever, shortcuts ultimately just cause more problems down the road.. Really the thinking should be 'I want to make lsof work usefully' not > 'I want some random and different hack to let me see something'. And yes, that is harder. But the IB stack is now at the point where these small hard things are the sort of work that is needed to get parity with the other stuff in linux... 
Jason, As Moni stated, this isn't a shortcut, it's one solution to the problem of the user being unable to see their rdma-cm based sessions. For the time being I believe that debugfs can do the job of raising the visibility of rdma connections from zero to something one can work with which isn't random and isn't a hack. A more sophisticated, netlink based solution is possible, we'd love to see patches from other people doing that. Or. From vlad at lists.openfabrics.org Thu May 7 03:23:53 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 7 May 2009 03:23:53 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090507-0200 daily build status Message-ID: <20090507102353.258F0E614EB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 
Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Thu May 7 04:33:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 7 May 2009 14:33:45 +0300 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090506093347.bb1b56be.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090506100744.GB10145@sk> <20090506093347.bb1b56be.weiny2@llnl.gov> Message-ID: <20090507113345.GA19236@sk> On 09:33 Wed 06 May , Ira Weiny wrote: > > This will only happen 1 time for each fabric being scan'ed because the path is > reused... > > Oh wait a minute, I just reviewed the code... For the current use case the > path is reused since I am only scanning 1 node. However, in the general case > this is not true. Sorry about that. A new patch is below. > > Ira > > > From: Ira Weiny > Date: Wed, 29 Apr 2009 10:15:55 -0700 > Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed. > > Signed-off-by: Ira Weiny Applied. Thanks. (Please send v2 patch as separate email - I will not need to edit/merge commit messages, potentially doing wrong interpretations :)). 
Sasha From jackm at dev.mellanox.co.il Thu May 7 05:01:16 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 7 May 2009 15:01:16 +0300 Subject: [ofa-general] [PATCH] mlx4: fix fast registration implementation Message-ID: <200905071501.17670.jackm@dev.mellanox.co.il> The low-level driver modified the page-list addresses for FRWR post send to big-endian, and set a "present" bit. This caused problems later when the ULP attempted to unmap the pages in the page-list (using the list addresses which were assumed to be still in CPU-endian order). The cause of the crash was found by Vu Pham of Mellanox. The fix is along the lines suggested by Steve Wise in comment #21 in Bugzilla 1571. This patch fixes Bugzilla 1571. Signed-off-by: Jack Morgenstein --- Roland, please take this for kernel 2.6.30. diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 9974e88..a8c0bc4 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -86,6 +86,7 @@ struct mlx4_ib_mr { struct mlx4_ib_fast_reg_page_list { struct ib_fast_reg_page_list ibfrpl; + u64 *mapped_page_list; dma_addr_t map; }; diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..fddf583 100644 --- a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -231,16 +231,22 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device if (!mfrpl) return ERR_PTR(-ENOMEM); - mfrpl->ibfrpl.page_list = dma_alloc_coherent(&dev->dev->pdev->dev, + mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL); + if (!mfrpl->ibfrpl.page_list) + goto err_free; + + mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev, size, &mfrpl->map, GFP_KERNEL); if (!mfrpl->ibfrpl.page_list) - goto err_free; + goto err_free_mfrpl; WARN_ON(mfrpl->map & 0x3f); return &mfrpl->ibfrpl; +err_free_mfrpl: + kfree(mfrpl->ibfrpl.page_list); err_free: kfree(mfrpl); return ERR_PTR(-ENOMEM); @@ 
-252,8 +258,9 @@ void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list); int size = page_list->max_page_list_len * sizeof (u64); - dma_free_coherent(&dev->dev->pdev->dev, size, page_list->page_list, + dma_free_coherent(&dev->dev->pdev->dev, size, mfrpl->mapped_page_list, mfrpl->map); + kfree(mfrpl->ibfrpl.page_list); kfree(mfrpl); } diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index f385a24..20724ae 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1365,7 +1365,7 @@ static void set_fmr_seg(struct mlx4_wqe_fmr_seg *fseg, struct ib_send_wr *wr) int i; for (i = 0; i < wr->wr.fast_reg.page_list_len; ++i) - wr->wr.fast_reg.page_list->page_list[i] = + mfrpl->mapped_page_list[i] = cpu_to_be64(wr->wr.fast_reg.page_list->page_list[i] | MLX4_MTT_FLAG_PRESENT); From hnrose at comcast.net Thu May 7 06:00:53 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 7 May 2009 09:00:53 -0400 Subject: [ofa-general] [PATCH 2/2] opensm/osm_console.c: Add dump and clear redir perfmgr command support Message-ID: <20090507130053.GB1093@comcast.net> Signed-off-by: Hal Rosenstock --- Changes since v1: Changes based on changes to PerfMgr redir support in v3 patch diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index d351261..30ddd53 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -211,7 +211,7 @@ static void help_dump_conf(FILE *out, int detail) static void help_perfmgr(FILE * out, int detail) { fprintf(out, - "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n"); + "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n"); if (detail) { fprintf(out, "perfmgr -- print the performance manager state\n"); @@ -225,6 +225,10 @@ static void help_perfmgr(FILE * out, int detail) " 
[dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n"); fprintf(out, " [print_counters ] -- print the counters for the specified node\n"); + fprintf(out, + " [dump_redir []] -- dump the redirection table\n"); + fprintf(out, + " [clear_redir []] -- clear the redirection table\n"); } } #endif /* ENABLE_OSM_PERF_MGR */ @@ -1135,6 +1139,152 @@ static void dump_conf_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } #ifdef ENABLE_OSM_PERF_MGR +static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm, + char *nodename) +{ + cl_map_item_t *item; + monitored_node_t *node; + + item = cl_qmap_head(&p_osm->perfmgr.monitored_map); + while (item != cl_qmap_end(&p_osm->perfmgr.monitored_map)) { + node = (monitored_node_t *)item; + if (strcmp(node->name, nodename) == 0) + return node; + item = cl_qmap_next(item); + } + + return NULL; +} + +static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm, + uint64_t guid) +{ + cl_map_item_t *node; + + node = cl_qmap_get(&p_osm->perfmgr.monitored_map, guid); + if (node != cl_qmap_end(&p_osm->perfmgr.monitored_map)) + return (monitored_node_t *)node; + + return NULL; +} + +static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out) +{ + int port, redir; + + /* only display monitored nodes with redirection info */ + redir = 0; + for (port = (p_mon_node->esp0) ? 
0 : 1; + port < p_mon_node->num_ports; port++) { + if (p_mon_node->port[port].redirection) { + if (!redir) { + fprintf(out, " Node GUID ESP0 Name\n"); + fprintf(out, " --------- ---- ----\n"); + fprintf(out, " 0x%" PRIx64 " %d %s\n", + p_mon_node->guid, p_mon_node->esp0, + p_mon_node->name); + fprintf(out, "\n Port Valid LIDs PKey QP PKey Index\n"); + fprintf(out, " ---- ----- ---- ---- -- ----------\n"); + redir = 1; + } + fprintf(out, " %d %d %u->%u 0x%x 0x%x %d\n", + port, p_mon_node->port[port].valid, + cl_ntoh16(p_mon_node->port[port].orig_lid), + cl_ntoh16(p_mon_node->port[port].lid), + cl_ntoh16(p_mon_node->port[port].pkey), + cl_ntoh32(p_mon_node->port[port].qp), + p_mon_node->port[port].pkey_ix); + } + } + if (redir) + fprintf(out, "\n"); +} + +static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out) +{ + monitored_node_t *p_mon_node; + uint64_t guid; + + if (!p_osm->subn.opt.perfmgr_redir) + fprintf(out, "Perfmgr redirection not enabled\n"); + + fprintf(out, "\nRedirection Table\n"); + fprintf(out, "-----------------\n"); + cl_plock_acquire(p_osm->perfmgr.lock); + if (nodename) { + guid = strtoull(nodename, NULL, 0); + if (guid == 0 && errno) + p_mon_node = find_node_by_name(p_osm, nodename); + else + p_mon_node = find_node_by_guid(p_osm, guid); + if (p_mon_node) + dump_redir_entry(p_mon_node, out); + else { + if (guid == 0 && errno) + fprintf(out, "Node %s not found...\n", nodename); + else + fprintf(out, "Node 0x%" PRIx64 " not found...\n", guid); + } + } else { + p_mon_node = (monitored_node_t *) cl_qmap_head(&p_osm->perfmgr.monitored_map); + while (p_mon_node != (monitored_node_t *) cl_qmap_end(&p_osm->perfmgr.monitored_map)) { + dump_redir_entry(p_mon_node, out); + p_mon_node = (monitored_node_t *) cl_qmap_next((const cl_map_item_t *)p_mon_node); + } + } + cl_plock_release(p_osm->perfmgr.lock); +} + +static void clear_redir_entry(monitored_node_t *p_mon_node) +{ + int port; + ib_net16_t orig_lid; + + for (port = (p_mon_node->esp0) ? 
0 : 1; + port < p_mon_node->num_ports; port++) { + if (p_mon_node->port[port].redirection) { + orig_lid = p_mon_node->port[port].orig_lid; + memset(&p_mon_node->port[port], 0, + sizeof(monitored_port_t)); + p_mon_node->port[port].valid = TRUE; + p_mon_node->port[port].orig_lid = orig_lid; + } + } +} + +static void clear_redir(osm_opensm_t * p_osm, char *nodename, FILE * out) +{ + monitored_node_t *p_mon_node; + uint64_t guid; + + if (!p_osm->subn.opt.perfmgr_redir) + fprintf(out, "Perfmgr redirection not enabled\n"); + + cl_plock_acquire(p_osm->perfmgr.lock); + if (nodename) { + guid = strtoull(nodename, NULL, 0); + if (guid == 0 && errno) + p_mon_node = find_node_by_name(p_osm, nodename); + else + p_mon_node = find_node_by_guid(p_osm, guid); + if (p_mon_node) + clear_redir_entry(p_mon_node); + else { + if (guid == 0 && errno) + fprintf(out, "Node %s not found...\n", nodename); + else + fprintf(out, "Node 0x%" PRIx64 " not found...\n", guid); + } + } else { + p_mon_node = (monitored_node_t *) cl_qmap_head(&p_osm->perfmgr.monitored_map); + while (p_mon_node != (monitored_node_t *) cl_qmap_end(&p_osm->perfmgr.monitored_map)) { + clear_redir_entry(p_mon_node); + p_mon_node = (monitored_node_t *) cl_qmap_next((const cl_map_item_t *)p_mon_node); + } + } + cl_plock_release(p_osm->perfmgr.lock); +} + static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) { char *p_cmd; @@ -1167,6 +1317,12 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) fprintf(out, "print_counters requires a node name or node GUID to be specified\n"); } + } else if (strcmp(p_cmd, "dump_redir") == 0) { + p_cmd = name_token(p_last); + dump_redir(p_osm, p_cmd, out); + } else if (strcmp(p_cmd, "clear_redir") == 0) { + p_cmd = name_token(p_last); + clear_redir(p_osm, p_cmd, out); } else if (strcmp(p_cmd, "sweep_time") == 0) { p_cmd = next_token(p_last); if (p_cmd) { From hnrose at comcast.net Thu May 7 05:59:18 2009 From: hnrose at comcast.net (Hal Rosenstock) 
Date: Thu, 7 May 2009 08:59:18 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Better redirection support Message-ID: <20090507125918.GA1093@comcast.net> Handle PKey and QPN redirection information GID redirection handling remains Signed-off-by: Hal Rosenstock --- Changes since v2: Use OpenSM DB rather than vendor layer for local port number and PKeys Change most log levels from ERROR to VERBOSE Redirection info validity now determined by single flag validate_redir_pkey returns pkey index or -1 rather than boolean Removed redir_ prefixes Changes since v1: Added include of osm_helper.h to osm_perfmgr.c diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index 855a2ff..70d68f0 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -90,11 +90,17 @@ typedef enum { PERFMGR_SWEEP_SUSPENDED } osm_perfmgr_sweep_state_t; -/* Redirection information */ -typedef struct redir { - ib_net16_t redir_lid; - ib_net32_t redir_qp; -} redir_t; +typedef struct monitored_port { + uint16_t pkey_ix; + ib_net16_t orig_lid; + boolean_t redirection; + boolean_t valid; + /* Redirection fields from ClassPortInfo */ + ib_gid_t gid; + ib_net16_t lid; + ib_net16_t pkey; + ib_net32_t qp; +} monitored_port_t; /* Node to store information about nodes being monitored */ typedef struct monitored_node { @@ -104,7 +110,7 @@ typedef struct monitored_node { boolean_t esp0; char *name; uint32_t num_ports; - redir_t redir_port[1]; /* redirection on a per port basis */ + monitored_port_t port[1]; } monitored_node_t; struct osm_opensm; @@ -135,6 +141,8 @@ typedef struct osm_perfmgr { uint32_t max_outstanding_queries; cl_qmap_t monitored_map; /* map the nodes being tracked */ monitored_node_t *remove_list; + ib_net64_t port_guid; + int16_t local_port; } osm_perfmgr_t; /* * FIELDS diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index ecfdbda..9c47a8f 100644 --- a/opensm/opensm/osm_perfmgr.c +++ 
b/opensm/opensm/osm_perfmgr.c @@ -64,6 +64,7 @@ #include #include #include +#include #define PERFMGR_INITIAL_TID_VALUE 0xcafe @@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, uint8_t port = context->perfmgr_context.port; cl_map_item_t *p_node; monitored_node_t *p_mon_node; + ib_net16_t orig_lid; OSM_LOG_ENTER(pm->log); @@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context, p_mon_node->num_ports); goto Exit; } - /* Clear redirection info */ - p_mon_node->redir_port[port].redir_lid = 0; - p_mon_node->redir_port[port].redir_qp = 0; + /* Clear redirection info for this port except orig_lid */ + orig_lid = p_mon_node->port[port].orig_lid; + memset(&p_mon_node->port[port], 0, sizeof(monitored_port_t)); + p_mon_node->port[port].orig_lid = orig_lid; + p_mon_node->port[port].valid = TRUE; cl_plock_release(pm->lock); } @@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, const ib_net64_t port_guid) goto Exit; } - bind_info.port_guid = port_guid; + bind_info.port_guid = pm->port_guid = port_guid; bind_info.mad_class = IB_MCLASS_PERF; bind_info.class_version = 1; bind_info.is_responder = FALSE; @@ -277,7 +281,6 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, const ib_net64_t port_guid) OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C04: Vendor specific bind failed (%s)\n", ib_get_err_str(status)); - goto Exit; } Exit: @@ -308,24 +311,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port) ib_net32_t qp = IB_QP1; if (mon_node && mon_node->num_ports && port < mon_node->num_ports && - mon_node->redir_port[port].redir_lid && - mon_node->redir_port[port].redir_qp) - qp = mon_node->redir_port[port].redir_qp; + mon_node->port[port].redirection && mon_node->port[port].qp) + qp = mon_node->port[port].qp; return qp; } -/********************************************************************** - * Given a node, a port, and an optional monitored node, - * return the appropriate lid to query that 
port - **********************************************************************/ -static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port, - monitored_node_t * mon_node) +static ib_net16_t get_base_lid(osm_node_t * p_node, uint8_t port) { - if (mon_node && mon_node->num_ports && port < mon_node->num_ports && - mon_node->redir_port[port].redir_lid) - return mon_node->redir_port[port].redir_lid; - switch (p_node->node_info.node_type) { case IB_NODE_TYPE_CA: case IB_NODE_TYPE_ROUTER: @@ -338,12 +331,26 @@ static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port, } /********************************************************************** + * Given a node, a port, and an optional monitored node, + * return the lid appropriate to query that port + **********************************************************************/ +static ib_net16_t get_lid(osm_node_t * p_node, uint8_t port, + monitored_node_t * mon_node) +{ + if (mon_node && mon_node->num_ports && port < mon_node->num_ports && + mon_node->port[port].lid) + return mon_node->port[port].lid; + + return get_base_lid(p_node, port); +} + +/********************************************************************** * Form and send the Port Counters MAD for a single port. 
**********************************************************************/ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, ib_net16_t dest_lid, - ib_net32_t dest_qp, uint8_t port, - uint8_t mad_method, + ib_net32_t dest_qp, uint16_t pkey_ix, + uint8_t port, uint8_t mad_method, osm_madw_context_t * p_context) { ib_api_status_t status = IB_SUCCESS; @@ -382,8 +389,7 @@ static ib_api_status_t perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, p_madw->mad_addr.addr_type.gsi.remote_qp = dest_qp; p_madw->mad_addr.addr_type.gsi.remote_qkey = cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY); - /* FIXME what about other partitions */ - p_madw->mad_addr.addr_type.gsi.pkey_ix = 0; + p_madw->mad_addr.addr_type.gsi.pkey_ix = pkey_ix; p_madw->mad_addr.addr_type.gsi.service_level = 0; p_madw->mad_addr.addr_type.gsi.global_route = FALSE; p_madw->resp_expected = TRUE; @@ -419,6 +425,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context) osm_perfmgr_t *pm = (osm_perfmgr_t *) context; monitored_node_t *mon_node = NULL; uint32_t num_ports; + int port; OSM_LOG_ENTER(pm->log); @@ -427,7 +434,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context) /* if not already in map add it */ num_ports = osm_node_get_num_physp(node); mon_node = malloc(sizeof(*mon_node) + - sizeof(redir_t) * num_ports); + sizeof(monitored_port_t) * num_ports); if (!mon_node) { OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: " "malloc failed: not handling node %s" @@ -436,7 +443,7 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context) goto Exit; } memset(mon_node, 0, - sizeof(*mon_node) + sizeof(redir_t) * num_ports); + sizeof(*mon_node) + sizeof(monitored_port_t) * num_ports); mon_node->guid = node_guid; mon_node->name = strdup(node->print_desc); mon_node->num_ports = num_ports; @@ -444,6 +451,11 @@ static void collect_guids(cl_map_item_t * p_map_item, void *context) mon_node->esp0 = (node->sw && ib_switch_info_is_enhanced_port0(&node->sw-> switch_info)); + for 
(port = mon_node->esp0 ? 0 : 1; port < num_ports; port++) { + mon_node->port[port].orig_lid = get_base_lid(node, port); + mon_node->port[port].valid = TRUE; + } + cl_qmap_insert(&pm->monitored_map, node_guid, (cl_map_item_t *) mon_node); } @@ -500,6 +512,9 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context) if (!osm_node_get_physp_ptr(node, port)) continue; + if (!mon_node->port[port].valid) + continue; + lid = get_lid(node, port, mon_node); if (lid == 0) { OSM_LOG(pm->log, OSM_LOG_DEBUG, "WARN: node 0x%" PRIx64 @@ -520,8 +535,10 @@ static void perfmgr_query_counters(cl_map_item_t * p_map_item, void *context) OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Getting stats for node 0x%" PRIx64 " port %d (lid %u) (%s)\n", node_guid, port, cl_ntoh16(lid), node->print_desc); - status = perfmgr_send_pc_mad(pm, lid, remote_qp, port, - IB_MAD_METHOD_GET, &mad_context); + status = perfmgr_send_pc_mad(pm, lid, remote_qp, + mon_node->port[port].pkey_ix, + port, IB_MAD_METHOD_GET, + &mad_context); if (status != IB_SUCCESS) OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C09: " "Failed to issue port counter query for node 0x%" @@ -768,6 +785,24 @@ void osm_perfmgr_process(osm_perfmgr_t * pm) pm->subn->sm_state == IB_SMINFO_STATE_NOTACTIVE) perfmgr_discovery(pm->subn->p_osm); + /* if redirection enabled, determine local port */ + if (pm->subn->opt.perfmgr_redir && pm->local_port == -1) { + osm_node_t *p_node; + osm_port_t *p_port; + + CL_PLOCK_ACQUIRE(pm->sm->p_lock); + p_port = osm_get_port_by_guid(pm->subn, pm->port_guid); + if (p_port) { + p_node = p_port->p_node; + CL_ASSERT(p_node); + pm->local_port = + ib_node_info_get_local_port_num(&p_node->node_info); + } else + OSM_LOG(pm->log, OSM_LOG_ERROR, + "ERR 4C87: No PerfMgr port object\n"); + CL_PLOCK_RELEASE(pm->sm->p_lock); + } + #if ENABLE_OSM_PERF_MGR_PROFILE gettimeofday(&before, NULL); #endif @@ -935,8 +970,8 @@ static int counter_overflow_32(ib_net32_t val) * MAD to the port. 
**********************************************************************/ static void perfmgr_check_overflow(osm_perfmgr_t * pm, - monitored_node_t * mon_node, uint8_t port, - ib_port_counters_t * pc) + monitored_node_t * mon_node, int16_t pkey_ix, + uint8_t port, ib_port_counters_t * pc) { osm_madw_context_t mad_context; ib_api_status_t status; @@ -963,6 +998,9 @@ static void perfmgr_check_overflow(osm_perfmgr_t * pm, osm_node_t *p_node = NULL; ib_net16_t lid = 0; + if (!mon_node->port[port].valid) + goto Exit; + osm_log(pm->log, OSM_LOG_VERBOSE, "PerfMgr: Counter overflow: %s (0x%" PRIx64 ") port %d; clearing counters\n", @@ -987,8 +1025,9 @@ static void perfmgr_check_overflow(osm_perfmgr_t * pm, mad_context.perfmgr_context.port = port; mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET; /* clear port counters */ - status = perfmgr_send_pc_mad(pm, lid, remote_qp, port, - IB_MAD_METHOD_SET, &mad_context); + status = perfmgr_send_pc_mad(pm, lid, remote_qp, pkey_ix, + port, IB_MAD_METHOD_SET, + &mad_context); if (status != IB_SUCCESS) OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C11: " "Failed to send clear counters MAD for %s (0x%" @@ -1046,6 +1085,64 @@ static void perfmgr_log_events(osm_perfmgr_t * pm, time_diff, mon_node->name, mon_node->guid, port); } +static int16_t validate_redir_pkey(osm_perfmgr_t *pm, ib_net16_t pkey) +{ + int16_t pkey_ix = -1; + osm_port_t *p_port; + osm_pkey_tbl_t *p_pkey_tbl; + ib_net16_t *p_orig_pkey; + uint16_t block; + uint8_t index; + + OSM_LOG_ENTER(pm->log); + + CL_PLOCK_ACQUIRE(pm->sm->p_lock); + p_port = osm_get_port_by_guid(pm->subn, pm->port_guid); + if (!p_port) { + CL_PLOCK_RELEASE(pm->sm->p_lock); + OSM_LOG(pm->log, OSM_LOG_ERROR, + "ERR 4C1E: No PerfMgr port object\n"); + goto Exit; + } + if (p_port->p_physp && osm_physp_is_valid(p_port->p_physp)) { + p_pkey_tbl = &p_port->p_physp->pkeys; + if (!p_pkey_tbl) { + CL_PLOCK_RELEASE(pm->sm->p_lock); + OSM_LOG(pm->log, OSM_LOG_VERBOSE, + "No PKey table found for PerfMgr 
port\n"); + goto Exit; + } + p_orig_pkey = cl_map_get(&p_pkey_tbl->keys, + ib_pkey_get_base(pkey)); + if (!p_orig_pkey) { + CL_PLOCK_RELEASE(pm->sm->p_lock); + OSM_LOG(pm->log, OSM_LOG_VERBOSE, + "PKey 0x%x not found for PerfMgr port\n", + cl_ntoh16(pkey)); + goto Exit; + } + if (osm_pkey_tbl_get_block_and_idx(p_pkey_tbl, p_orig_pkey, + &block, &index) == IB_SUCCESS) { + CL_PLOCK_RELEASE(pm->sm->p_lock); + pkey_ix = block * IB_NUM_PKEY_ELEMENTS_IN_BLOCK + index; + } else { + CL_PLOCK_RELEASE(pm->sm->p_lock); + OSM_LOG(pm->log, OSM_LOG_ERROR, + "ERR 0x4C1F: Failed to obtain P_Key 0x%04x " + "block and index for PerfMgr port\n", + cl_ntoh16(pkey)); + } + } else { + CL_PLOCK_RELEASE(pm->sm->p_lock); + OSM_LOG(pm->log, OSM_LOG_ERROR, + "ERR 4C20: Local PerfMgt port physp invalid\n"); + } + +Exit: + OSM_LOG_EXIT(pm->log); + return pkey_ix; +} + /********************************************************************** * The dispatcher uses a thread pool which will call this function when * there is a thread available to process the mad received on the wire. @@ -1064,6 +1161,8 @@ static void pc_recv_process(void *context, void *data) perfmgr_db_data_cnt_reading_t data_reading; cl_map_item_t *p_node; monitored_node_t *p_mon_node; + int16_t pkey_ix = 0; + boolean_t valid = TRUE; OSM_LOG_ENTER(pm->log); @@ -1087,7 +1186,8 @@ static void pc_recv_process(void *context, void *data) p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO); /* Response could also be redirection (IBM eHCA PMA does this) */ - if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { + if (p_mad->status & IB_MAD_STATUS_REDIRECT && + p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { char gid_str[INET6_ADDRSTRLEN]; ib_class_port_info_t *cpi = (ib_class_port_info_t *) & @@ -1100,17 +1200,46 @@ static void pc_recv_process(void *context, void *data) inet_ntop(AF_INET6, cpi->redir_gid.raw, gid_str, sizeof gid_str), cl_ntoh32(cpi->redir_qp)); - /* LID or GID redirection ? 
*/ - /* For GID redirection, need to get PathRecord from SA */ + if (!pm->subn->opt.perfmgr_redir) { + OSM_LOG(pm->log, OSM_LOG_VERBOSE, + "Redirection requested but disabled\n"); + valid = FALSE; + } + + /* valid redirection ? */ if (cpi->redir_lid == 0) { + if (!ib_gid_is_notzero(&cpi->redir_gid)) { + OSM_LOG(pm->log, OSM_LOG_VERBOSE, + "Invalid redirection " + "(both redirect LID and GID are zero)\n"); + valid = FALSE; + } + } + if (cpi->redir_qp == 0) { + OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectQP\n"); + valid = FALSE; + } + if (cpi->redir_pkey == 0) { + OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectP_Key\n"); + valid = FALSE; + } + if (cpi->redir_qkey != IB_QP1_WELL_KNOWN_Q_KEY) { + OSM_LOG(pm->log, OSM_LOG_VERBOSE, "Invalid RedirectQ_Key\n"); + valid = FALSE; + } + + pkey_ix = validate_redir_pkey(pm, cpi->redir_pkey); + if (pkey_ix == -1) { OSM_LOG(pm->log, OSM_LOG_VERBOSE, - "GID redirection not currently implemented!\n"); - goto Exit; + "Index for Pkey 0x%x not found\n", + cl_ntoh16(cpi->redir_pkey)); + valid = FALSE; } - if (!pm->subn->opt.perfmgr_redir) { - OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " - "redirection requested but disabled\n"); + if (cpi->redir_lid == 0) { + /* GID redirection: get PathRecord information */ + OSM_LOG(pm->log, OSM_LOG_VERBOSE, + "GID redirection not currently supported\n"); goto Exit; } @@ -1125,13 +1254,24 @@ static void pc_recv_process(void *context, void *data) p_mon_node->num_ports); goto Exit; } - p_mon_node->redir_port[port].redir_lid = cpi->redir_lid; - p_mon_node->redir_port[port].redir_qp = cpi->redir_qp; + p_mon_node->port[port].redirection = TRUE; + p_mon_node->port[port].valid = valid; + memcpy(&p_mon_node->port[port].gid, &cpi->redir_gid, + sizeof(ib_gid_t)); + p_mon_node->port[port].lid = cpi->redir_lid; + p_mon_node->port[port].qp = cpi->redir_qp; + p_mon_node->port[port].pkey = cpi->redir_pkey; + if (pkey_ix != -1) + p_mon_node->port[port].pkey_ix = pkey_ix; cl_plock_release(pm->lock); + if 
(!valid) + goto Exit; + /* Finally, reissue the query to the redirected location */ status = perfmgr_send_pc_mad(pm, cpi->redir_lid, cpi->redir_qp, - port, mad_context->perfmgr_context. + pkey_ix, port, + mad_context->perfmgr_context. mad_method, mad_context); if (status != IB_SUCCESS) OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C14: " @@ -1166,7 +1306,7 @@ static void pc_recv_process(void *context, void *data) perfmgr_db_clear_prev_dc(pm->db, node_guid, port); } - perfmgr_check_overflow(pm, p_mon_node, port, wire_read); + perfmgr_check_overflow(pm, p_mon_node, pkey_ix, port, wire_read); #if ENABLE_OSM_PERF_MGR_PROFILE do { @@ -1212,6 +1352,7 @@ ib_api_status_t osm_perfmgr_init(osm_perfmgr_t * pm, osm_opensm_t * osm, pm->sweep_time_s = p_opt->perfmgr_sweep_time_s; pm->max_outstanding_queries = p_opt->perfmgr_max_outstanding_queries; pm->osm = osm; + pm->local_port = -1; status = cl_timer_init(&pm->sweep_timer, perfmgr_sweep, pm); if (status != IB_SUCCESS) From hal.rosenstock at gmail.com Thu May 7 06:53:05 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 7 May 2009 09:53:05 -0400 Subject: [ofa-general] Re: [RFC][PATCH] ibnetdiscover: remove report of max hops discovered. In-Reply-To: <20090506180140.6213971e.weiny2@llnl.gov> References: <20090504151005.9a565bc5.weiny2@llnl.gov> <1241543312.18144.18.camel@auk31.llnl.gov> <20090506180140.6213971e.weiny2@llnl.gov> Message-ID: On Wed, May 6, 2009 at 9:01 PM, Ira Weiny wrote: > The number reported as "max hops" from ibnetdiscover can change depending on > the algorithm used to discover the fabric.  As Hal says in the message below > using this number is therefore dangerous. > > If no one is currently using this number I propose the patch below which > removes the "max hops discovered" from the output. If it's removed from the topology output, should there be an option which displays this number ? 
It does provide some idea of the levels in the hierarchy which can be useful when someone provides a topology file for their subnet. -- Hal > Ira > > On Tue, 5 May 2009 14:25:32 -0400 > Hal Rosenstock wrote: > >> Hi Al, >> >> On Tue, May 5, 2009 at 1:08 PM, Al Chu wrote: > > [snip] > >> > >> > Ira says that the output of the hops is actually "max hops used to get >> > from my port to another port during my search of the network".  So the >> > number could change if (hypotehtical example) depth-first-search were >> > used instead of BFS. >> >> Sure; it can depend on how the search is done but isn't it the max >> from the initiated node (which could be different depending on the >> algo used) ? Using that number seems dangerous for that very reason. I >> always thought that number was "nice" to have but nothing more. It >> predated my work on ibnetdiscover. >> >> -- Hal >> > > > From: Ira Weiny > Date: Wed, 6 May 2009 17:56:23 -0700 > Subject: [PATCH] ibnetdiscover: remove report of max hops discovered. 
> > > Signed-off-by: Ira Weiny > --- >  infiniband-diags/src/ibnetdiscover.c |    1 - >  1 files changed, 0 insertions(+), 1 deletions(-) > > diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c > index 1799618..89e4f0f 100644 > --- a/infiniband-diags/src/ibnetdiscover.c > +++ b/infiniband-diags/src/ibnetdiscover.c > @@ -448,7 +448,6 @@ dump_topology(int group, ibnd_fabric_t *fabric) >        struct iter_user_data iter_user_data; > >        fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > -       fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); >        fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", >                fabric->from_node->guid, >                mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F)); > -- > 1.5.4.5 > > From jsquyres at cisco.com Thu May 7 06:54:26 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 7 May 2009 09:54:26 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: Message-ID: On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote: > By the way, what's the desired behavior of the cache if a process > registers, say, address range 0x1000 ... 0x3fff, and then the same > process registers address range 0x2000 ... 0x2fff (with all the same > permissions, etc)? > > The initial registration creates an MR that is still valid for the > smaller virtual address range, so the second registration is much > cheaper if we used the cached registration; but if we use the cache > for > the second registration, and then deregister the first one, we're > stuck > with a too-big range pinned in the cache because of the second > registration. > I don't know what the other MPI's do in this scenario, but here's what OMPI will do: 1. lookup 0x1000-0x3fff in the cache; not find any of it it, and therefore register - add each page to our cache with a refcount of 1 2. 
lookup 0x2000-0x2fff in the cache, find that all the pages are already registered
   - refcount++ on each page in the cache
3. when we go to dereg 0x1000-0x3fff
   - refcount-- on each page in the cache
   - since some pages in the range still have refcount>0, don't do anything further

Specifically: the actual dereg of 0x1000-0x3fff is blocked on also releasing 0x2000-0x2fff.

Note that OMPI will only register a max of X bytes at a time (where X defaults to 2MB). So even if a user calls MPI_SEND(...) with an enormous buffer, we'll register it X/page_size pages at a time, not the entire buffer at once. Hence, the "buffer A is blocked from dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less wasteful than if we had registered/cached the entire huge buffer at once.

Finally, note that if 0x2000-0x2fff had not been registered, the 0x1000-0x3fff pages are not actually deregistered when all the pages' refcounts go to 0 -- they are just moved to the "able to be dereg'ed list". We don't actually dereg them until we later try to reg new memory and fail due to lack of resources. Then we take entries off the "able to be dereg'ed list" and dereg them, then try reg'ing the new memory again.

MVAPICH: do you guys do similar things?
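The refcounting steps above can be sketched roughly as follows. This is only a toy model and not Open MPI's actual code: the names reg_cache_reg and reg_cache_dereg, the fixed-size table, and the 4 KiB page size are all assumptions made for illustration, and no real registration call is made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MAX_PAGES  64   /* toy table: covers addresses 0 .. 0x3ffff only */

/* One entry per page; hypothetical layout, for illustration only. */
struct page_entry {
    int registered;     /* page is (notionally) pinned with the HCA */
    int refcount;       /* live registrations that cover this page */
};

static struct page_entry cache[MAX_PAGES];

/*
 * Register the inclusive byte range [start, end].
 * Returns the number of pages that had to be newly registered
 * (0 means a pure cache hit), or -1 if the range is out of bounds.
 */
int reg_cache_reg(uintptr_t start, uintptr_t end)
{
    size_t first = start >> PAGE_SHIFT;
    size_t last = end >> PAGE_SHIFT;
    int newly = 0;

    if (start > end || last >= MAX_PAGES)
        return -1;

    for (size_t p = first; p <= last; p++) {
        if (!cache[p].registered) {
            cache[p].registered = 1;    /* a real cache would pin here */
            newly++;
        }
        cache[p].refcount++;
    }
    return newly;
}

/*
 * Drop one reference on every page of [start, end].
 * Returns 1 if every page in the range is now at refcount 0, so the
 * range could be moved to the "able to be dereg'ed" list; returns 0 if
 * some other registration still holds a page; -1 if out of bounds.
 * Nothing is actually deregistered here (lazy dereg).
 */
int reg_cache_dereg(uintptr_t start, uintptr_t end)
{
    size_t first = start >> PAGE_SHIFT;
    size_t last = end >> PAGE_SHIFT;
    int all_idle = 1;

    if (start > end || last >= MAX_PAGES)
        return -1;

    for (size_t p = first; p <= last; p++) {
        assert(cache[p].refcount > 0);
        if (--cache[p].refcount > 0)
            all_idle = 0;
    }
    return all_idle;
}
```

With the ranges from the example, registering 0x1000-0x3fff reports three new pages, registering 0x2000-0x2fff reports none, and deregistering the larger range stays blocked until the smaller one is released too.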
(I don't know if HP/Scali/Intel will comment on their registration cache schemes) -- Jeff Squyres Cisco Systems From hal.rosenstock at gmail.com Thu May 7 06:56:38 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 7 May 2009 09:56:38 -0400 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090506093347.bb1b56be.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090506100744.GB10145@sk> <20090506093347.bb1b56be.weiny2@llnl.gov> Message-ID: Ira, On Wed, May 6, 2009 at 12:33 PM, Ira Weiny wrote: > On Wed, 6 May 2009 13:07:44 +0300 > Sasha Khapyorsky wrote: > >> On 14:29 Thu 30 Apr     , Ira Weiny wrote: >> > From: Ira Weiny >> > Date: Wed, 29 Apr 2009 10:15:55 -0700 >> > Subject: [PATCH] Add combined routing support to libibnetdisc >> > >> >    Also allow a scan to start at a switch. >> > >> > Signed-off-by: Ira Weiny >> > --- >> >  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   28 ++++++++++++++++++------ >> >  1 files changed, 21 insertions(+), 7 deletions(-) >> > >> > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c >> > index 0ff5134..fc19633 100644 >> > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c >> > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c >> > @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) >> >  } >> > >> >  static int >> > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) >> > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) >> >  { >> > -   int rc = add_port_to_dpath(path, nextport); >> > -   if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) >> > -           f->fabric.maxhops_discovered = path->cnt; >> > +   int rc = 0; >> > + >> > +   if (portid->lid && !portid->drpath.drslid) { >> > +           /* If we were LID routed >> > +            * AND have not done so already >> > +            * we need to set up the 
drslid >> > +            */ >> > +           ib_portid_t selfportid = { 0 }; >> > +           if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) >> > +                   return -1; >> >> And wouldn't it be better instead of resolving selfport on each >> extend_path() call to keep it already resolved somewhere in fabric >> structure? > > This will only happen 1 time for each fabric being scan'ed because the path is > reused... > > Oh wait a minute, I just reviewed the code...  For the current use case the > path is reused since I am only scanning 1 node.  However, in the general case > this is not true.  Sorry about that.  A new patch is below. Does combined routing always fall back on failure to using directed routing ? Also, would you summarize the use cases for combined routing in ibnetdiscover ? -- Hal > Ira > > > From: Ira Weiny > Date: Wed, 29 Apr 2009 10:15:55 -0700 > Subject: [PATCH] Fix ibnd_discover when the specified ib_portid_t starts LID routed. > > Signed-off-by: Ira Weiny > --- >  infiniband-diags/libibnetdisc/src/ibnetdisc.c |   27 ++++++++++++++++++------ >  infiniband-diags/libibnetdisc/src/internal.h  |    1 + >  2 files changed, 21 insertions(+), 7 deletions(-) > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index 0ff5134..1e93ff8 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -177,11 +177,25 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) >  } > >  static int > -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) >  { > -       int rc = add_port_to_dpath(path, nextport); > -       if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > -               f->fabric.maxhops_discovered = path->cnt; > +       int rc = 0; > + > +       if (portid->lid) { > +               /* If we were 
LID routed we need to set up the drslid */ > +               if (!f->selfportid.lid) > +                       if (ib_resolve_self_via(&f->selfportid, NULL, NULL, > +                                       f->fabric.ibmad_port) < 0) > +                               return -1; > + > +               portid->drpath.drslid = f->selfportid.lid; > +               portid->drpath.drdlid = 0xFFFF; > +       } > + > +       rc = add_port_to_dpath(&portid->drpath, nextport); > + > +       if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) > +               f->fabric.maxhops_discovered = portid->drpath.cnt; >        return (rc); >  } > > @@ -447,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ >                        != IB_PORT_PHYS_STATE_LINKUP) >                return -1; > > -       if (extend_dpath(fabric, &path->drpath, portnum) < 0) > +       if (extend_dpath(fabric, path, portnum) < 0) >                return -1; > >        if (query_node(fabric, &node_buf, &port_buf, path)) { > @@ -546,8 +560,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, >        if (!port) >                IBPANIC("out of memory"); > > -       if (node->node.type != IB_NODE_SWITCH && > -           get_remote_node(fabric, node, port, from, > +       if(get_remote_node(fabric, node, port, from, >                                mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), >                                0) < 0) >                return ((ibnd_fabric_t *)fabric); > diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h > index 4e6bb18..5785e33 100644 > --- a/infiniband-diags/libibnetdisc/src/internal.h > +++ b/infiniband-diags/libibnetdisc/src/internal.h > @@ -88,6 +88,7 @@ struct ibnd_fabric { >        struct ibnd_node *switches; >        struct ibnd_node *ch_adapters; >        struct ibnd_node *routers; > +       ib_portid_t selfportid; >  }; >  
#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) > > -- > 1.5.4.5 > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >

From chocapiiic.tiery at gmail.com Thu May 7 06:58:54 2009 From: chocapiiic.tiery at gmail.com (Thierry) Date: Thu, 7 May 2009 15:58:54 +0200 Subject: [ofa-general] struct ib_sge, how to dump buffer from kernel Message-ID: <8d9c773c0905070658q4844714fh213f383bacfa99b3@mail.gmail.com>

Hi,

I am trying to write a kernel module to monitor communication through an InfiniBand device: my goal is to monitor as many things as I can from kernel space. I have implemented a simple module which can send data from kernel space to user space using a netlink socket.

I think the best place to extract information is inside the ib_post_send function in drivers/infiniband/hw/mlx4/qp.c, where there is an ib_send_wr structure. But I have some trouble reading data from the kernel: I have already made a trap in libibverbs (cmd_post_send function), and I have been able to read struct ibv_send_wr and also ibv_send_wr->ibv_sge->addr.

But in kernel space, I can't read the data at ib_send_wr->ib_sge->addr and I don't understand why: I made a memcpy of addr, using the length in ib_sge->length, and then printed it with printk %s.

Is the ib_send_wr structure a copy of ibv_send_wr, but in kernel space? What does a memory address look like? Do you have any references I can read in order to understand memory addressing?
Regards, Thierry --- char foo[wr->sg_list->length]; memcpy(foo, wr->sg_list->addr, wr->sg_list->length); printk(KERN_INFO "buffer: %s\n", foo); From sashak at voltaire.com Thu May 7 04:34:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 7 May 2009 14:34:04 +0300 Subject: [ofa-general] Re: [PATCH 3/3] Modify '-S' option of iblinkinfo and ibqueryerrors to do a limited scan of the fabric first and then fall back to a full scan which searches for the GUID. In-Reply-To: <20090430143002.89262384.weiny2@llnl.gov> References: <20090430143002.89262384.weiny2@llnl.gov> Message-ID: <20090507113404.GB19236@sk> On 14:30 Thu 30 Apr , Ira Weiny wrote: > From: Ira Weiny > Date: Tue, 28 Apr 2009 16:38:38 -0700 > Subject: [PATCH] Modify '-S' option of iblinkinfo and ibqueryerrors to do a limited scan of the > fabric first and then fall back to a full scan which searches for the GUID. > > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From dorfman.eli at gmail.com Thu May 7 07:09:39 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 07 May 2009 17:09:39 +0300 Subject: [ofa-general] Bug in opensm LID assignement Message-ID: <4A02EBA3.70205@gmail.com> opensm assigns conflicting LIDs to node after lmc change (e.g. 0 to 1) when node guid is in the guid2lid cache. In the following example CA port 1 lid 24 lmc 1 and switch lid is 25 which overlaps with CA port's lid. 
This happens because switch port guid is in the guid2lid cache (0x0008f104003f2aa2 0x0019 0x0019) vendid=0x2c9 devid=0x634a sysimgguid=0x2c902002576a3 caguid=0x2c902002576a0 Ca 2 "H-0002c902002576a0" # "FIG3 HCA-1" [2](2c902002576a2) "S-0008f104003f29d2"[23] # lid 26 lmc 1 "ISR2012/ISR2004 Voltaire sLB-2024" lid 34 4xDDR [1](2c902002576a1) "S-0008f104003f2aa2"[23] # lid 24 lmc 1 "ISR2012/ISR2004 Voltaire sLB-2024" lid 25 4xDDR From sashak at voltaire.com Thu May 7 04:52:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 7 May 2009 14:52:12 +0300 Subject: [ofa-general] Re: [PATCH] osm_port.c: do not force max_op_vls = 0 to 1 In-Reply-To: <4A029038.2040603@voltaire.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> Message-ID: <20090507115212.GC19236@sk> Hi Doron, On 10:39 Thu 07 May , Doron Shoham wrote: > when setting max_op_vls = 0 > do not force it to 1. > 0 is valid value which means "No change" > > Signed-off-by: Doron Shoham > --- > opensm/opensm/osm_port.c | 4 ++-- > opensm/opensm/osm_subnet.c | 8 ++++++++ > 2 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..3679f29 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -379,8 +379,8 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > /* support user limitation of max_op_vls */ > if (op_vls > p_subn->opt.max_op_vls) > op_vls = p_subn->opt.max_op_vls; > - > - if (op_vls == 0) { > + else if (op_vls == 0) { > + /* for non compliant implementations */ > OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > op_vls = 1; I would suggest to not mix zero OperVLs workaround and max_op_vls=0 processing. Just move 'op_vls == 0' check to be above max_op_vls comparison. 
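The suggested ordering can be sketched in isolation like this. calc_op_vls_sketch is a hypothetical stand-alone helper, not the OpenSM function itself; it takes the port's reported OperVLs and the configured max_op_vls as plain integers and assumes both use the PortInfo encoding 0..5.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper showing only the order of the two checks;
 * the real logic lives in osm_physp_calc_link_op_vls().
 */
uint8_t calc_op_vls_sketch(uint8_t op_vls, uint8_t max_op_vls)
{
    /* assumption: both values use the PortInfo OperationalVLs encoding */
    assert(op_vls <= 5 && max_op_vls <= 5);

    /* workaround for ports that report an invalid OperVLs of 0 */
    if (op_vls == 0)
        op_vls = 1;

    /*
     * User limit, applied second: a configured max_op_vls of 0 now
     * yields 0 ("No change") instead of being forced back to 1.
     */
    if (op_vls > max_op_vls)
        op_vls = max_op_vls;

    return op_vls;
}
```

Doing the zero check first keeps the workaround for non-compliant ports separate from the handling of a user-configured max_op_vls of 0.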
Also (need to repeat my original comment) using max_op_vls = 0 will enforce PortInfo update attempt, which actually may not be needed if the only "changed" field is OperVLs changed to 0 ("No change") - see the code in osm_link_mgr.c. Sasha From hnrose at comcast.net Thu May 7 07:33:46 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 7 May 2009 10:33:46 -0400 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message Message-ID: <20090507143346.GA1713@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 2e6c642..17bac73 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, op_vls = p_subn->opt.max_op_vls; if (op_vls == 0) { - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " + OSM_LOG(p_log, OSM_LOG_DEBUG, "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); op_vls = 1; } From dorons at voltaire.com Thu May 7 08:51:19 2009 From: dorons at voltaire.com (Doron Shoham) Date: Thu, 07 May 2009 18:51:19 +0300 Subject: [ofa-general] [PATCH 0/2] osm_port.c: fix op_vls processing In-Reply-To: <20090507115212.GC19236@sk> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> Message-ID: <4A030377.6050202@voltaire.com> From dorons at voltaire.com Thu May 7 08:54:36 2009 From: dorons at voltaire.com (Doron Shoham) Date: Thu, 07 May 2009 18:54:36 +0300 Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A030377.6050202@voltaire.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> Message-ID: <4A03043C.4010709@voltaire.com> check if op_vls = 0 before max_op_vls comparison 
Signed-off-by: Doron Shoham --- opensm/opensm/osm_port.c | 9 +++++---- 1 files changed, 5 insertions(+), 4 deletions(-) diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 2e6c642..4d1bbf2 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, } else op_vls = ib_port_info_get_op_vls(&p_physp->port_info); - /* support user limitation of max_op_vls */ - if (op_vls > p_subn->opt.max_op_vls) - op_vls = p_subn->opt.max_op_vls; - if (op_vls == 0) { + /* for non compliant implementations */ OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); op_vls = 1; } + /* support user limitation of max_op_vls */ + if (op_vls > p_subn->opt.max_op_vls) + op_vls = p_subn->opt.max_op_vls; + OSM_LOG_EXIT(p_log); return op_vls; -- 1.5.4 From dorons at voltaire.com Thu May 7 08:55:17 2009 From: dorons at voltaire.com (Doron Shoham) Date: Thu, 07 May 2009 18:55:17 +0300 Subject: [ofa-general] [PATCH 0/2] osm_port.c: do not enforce PortInfo update if max_op_vls = 0 In-Reply-To: <4A030377.6050202@voltaire.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> Message-ID: <4A030465.90009@voltaire.com> do not enforce PortInfo update if max_op_vls = 0 Signed-off-by: Doron Shoham --- opensm/opensm/osm_port.c | 2 +- opensm/opensm/osm_subnet.c | 8 ++++++++ 2 files changed, 9 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 4d1bbf2..8bf1767 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -383,7 +383,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, op_vls = 1; } /* support user limitation of max_op_vls */ - if (op_vls > p_subn->opt.max_op_vls) + if (p_subn->opt.max_op_vls && op_vls > p_subn->opt.max_op_vls) op_vls = 
p_subn->opt.max_op_vls; diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index ec15f8a..71fc7a0 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) "# switch port connected to a CA or router port\n" "leaf_head_of_queue_lifetime 0x%02x\n\n" "# Limit the maximal operational VLs\n" + "# Virtual Lanes operational on this port\n" + "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n" + "# 0: No change; valid only on Set()\n" + "# 1: VL0\n" + "# 2: VL0, VL1\n" + "# 3: VL0 - VL3\n" + "# 4: VL0 - VL7\n" + "# 5: VL0 - VL14\n" "max_op_vls %u\n\n" "# Force PortInfo:LinkSpeedEnabled on switch ports\n" "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n" -- 1.5.4 From dorons at voltaire.com Thu May 7 08:58:29 2009 From: dorons at voltaire.com (Doron Shoham) Date: Thu, 07 May 2009 18:58:29 +0300 Subject: [ofa-general] [PATCH] saquery: fix -c arguement Message-ID: <4A030525.7090209@voltaire.com> set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO Signed-off-by: Doron Shoham --- infiniband-diags/src/saquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 4dcd712..2ec32cf 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1470,7 +1470,7 @@ static int process_opt(void *context, int ch, char *optarg) node_print_desc = ALL_DESC; break; case 'c': - command = CLASS_PORT_INFO; + command = SAQUERY_CMD_CLASS_PORT_INFO; break; case 'S': query_type = IB_SA_ATTR_SERVICERECORD; -- 1.5.4 From weiny2 at llnl.gov Thu May 7 08:58:07 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 7 May 2009 08:58:07 -0700 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: References: <20090430142958.5811218f.weiny2@llnl.gov> <20090506100744.GB10145@sk>
<20090506093347.bb1b56be.weiny2@llnl.gov> Message-ID: <20090507085807.f1e743bb.weiny2@llnl.gov> On Thu, 7 May 2009 09:56:38 -0400 Hal Rosenstock wrote: > Ira, > > On Wed, May 6, 2009 at 12:33 PM, Ira Weiny wrote: > > On Wed, 6 May 2009 13:07:44 +0300 > > Sasha Khapyorsky wrote: > > [snip] > >> > >> And wouldn't it be better instead of resolving selfport on each > >> extend_path() call to keep it already resolved somewhere in fabric > >> structure? > > > > This will only happen 1 time for each fabric being scan'ed because the path is > > reused... > > > > Oh wait a minute, I just reviewed the code...  For the current use case the > > path is reused since I am only scanning 1 node.  However, in the general case > > this is not true.  Sorry about that.  A new patch is below. > > Does combined routing always fall back on failure to using directed routing ? No, not automatically in the library. > > Also, would you summarize the use cases for combined routing in ibnetdiscover ? > ibnetdiscover does not use this feature. It does a "full scan" which results in only DR routing. iblinkinfo and ibqueryerrors have the ability to request output for a single node. The library was written to be able to scan from a given portid and a number of hops around that node. However, at first this only supported a DR path in the portid. If the user specified something like GUID iblinkinfo would scan the entire fabric and search the data which came back for that node. Of course the problem with is that on a large fabric it could take 8 seconds to come back with a single node of data. If the SM/SA is up and running I decided it would be better to query for the LID of that node and start the scan from there. That is what this patch adds. iblinkinfo and ibqueryerrors will call ibnd_discover_fabric with the "from" == to the portid resolved from the SA and "hops" == 1. 
If resolving the GUID or the limited scan fails ibqueryerrors and iblinkinfo then call the library again for a full fabric scan ("from" == NULL) and then search for the node in the fabric data returned. So that is the use case for doing this in the library. But once again ibnetdiscover does not use this. The other use case I could think of is doing a more extensive scan of multiple hops around a single node. I have not implemented this yet but in my early testing it worked just fine starting with a DR path. I believe this will still work with combined routing. Make sense? Ira From changquing.tang at hp.com Thu May 7 09:07:05 2009 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 7 May 2009 16:07:05 +0000 Subject: [ofa-general] Memory registration redux In-Reply-To: References: Message-ID: <58C6777539C300489D145B0F8E29C3281679DC115F@GVW0673EXC.americas.hpqcorp.net> HP-MPI is pretty much doing the similar thing. --CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Jeff Squyres > Sent: Thursday, May 07, 2009 8:54 AM > To: Roland Dreier > Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny > Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; > Alexander Supalov > Subject: Re: [ofa-general] Memory registration redux > > On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote: > > > By the way, what's the desired behavior of the cache if a process > > registers, say, address range 0x1000 ... 0x3fff, and then the same > > process registers address range 0x2000 ... 0x2fff (with all > the same > > permissions, etc)? 
> > > > The initial registration creates an MR that is still valid for the > > smaller virtual address range, so the second registration is much > > cheaper if we used the cached registration; but if we use the cache > > for the second registration, and then deregister the first > one, we're > > stuck with a too-big range pinned in the cache because of > the second > > registration. > > > > > I don't know what the other MPI's do in this scenario, but > here's what OMPI will do: > > 1. lookup 0x1000-0x3fff in the cache; not find any of it it, > and therefore register > - add each page to our cache with a refcount of 1 2. > lookup 0x2000-0x2fff in the cache, find that all the pages > are already registered > - refcount++ on each page in the cache 3. when we go to > dereg 0x1000-0x3fff > - refcount-- on each page in the cache > - since some pages in the range still have refcount>0, > don't do anything further > > Specifically: the actual dereg of 0x1000-0x3fff is blocked on > also releasing 0x2000-0x2fff. > > Note that OMPI will only register a max of X bytes at a time > (where X defaults to 2MB). So even if a user calls > MPI_SEND(...) with an enormous buffer, we'll register it > X/page_size pages at a time, not the entire buffer at once. > Hence, the "buffer A is blocked from dereg'ing by buffer B" > scenario is *somewhat* mitigated -- it's less wasteful than > if we can registered/cached the entire huge buffer at once. > > Finally, note that if 0x2000-0x2fff had not been registered, > the 0x1000-0x3fff pages are not actually deregistered when > all the pages' > refcounts go to 0 -- they are just moved to the "able to be > dereg'ed list". We don't actually dereg it until we later > try to reg new memory and fail due to lack of resources. > Then we take entries off the "able to be dereg'ed list" and > dereg them, then try reg'ing the new memory again. > > MVAPICH: do you guys do similar things? 
> > (I don't know if HP/Scali/Intel will comment on their > registration cache schemes) > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From koop at cse.ohio-state.edu Thu May 7 09:55:13 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Thu, 7 May 2009 12:55:13 -0400 (EDT) Subject: [ofa-general] Memory registration redux In-Reply-To: <58C6777539C300489D145B0F8E29C3281679DC115F@GVW0673EXC.americas.hpqcorp.net> Message-ID: MVAPICH is also doing pretty much the same thing as well. Matt On Thu, 7 May 2009, Tang, Changqing wrote: > > HP-MPI is pretty much doing the similar thing. --CQ > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > > Jeff Squyres > > Sent: Thursday, May 07, 2009 8:54 AM > > To: Roland Dreier > > Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny > > Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; > > Alexander Supalov > > Subject: Re: [ofa-general] Memory registration redux > > > > On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote: > > > > > By the way, what's the desired behavior of the cache if a process > > > registers, say, address range 0x1000 ... 0x3fff, and then the same > > > process registers address range 0x2000 ... 0x2fff (with all > > the same > > > permissions, etc)?
> > > > > > The initial registration creates an MR that is still valid for the > > > smaller virtual address range, so the second registration is much > > > cheaper if we used the cached registration; but if we use the cache > > > for the second registration, and then deregister the first > > one, we're > > > stuck with a too-big range pinned in the cache because of > > the second > > > registration. > > > > > > > > > I don't know what the other MPI's do in this scenario, but > > here's what OMPI will do: > > > > 1. lookup 0x1000-0x3fff in the cache; not find any of it it, > > and therefore register > > - add each page to our cache with a refcount of 1 2. > > lookup 0x2000-0x2fff in the cache, find that all the pages > > are already registered > > - refcount++ on each page in the cache 3. when we go to > > dereg 0x1000-0x3fff > > - refcount-- on each page in the cache > > - since some pages in the range still have refcount>0, > > don't do anything further > > > > Specifically: the actual dereg of 0x1000-0x3fff is blocked on > > also releasing 0x2000-0x2fff. > > > > Note that OMPI will only register a max of X bytes at a time > > (where X defaults to 2MB). So even if a user calls > > MPI_SEND(...) with an enormous buffer, we'll register it > > X/page_size pages at a time, not the entire buffer at once. > > Hence, the "buffer A is blocked from dereg'ing by buffer B" > > scenario is *somewhat* mitigated -- it's less wasteful than > > if we can registered/cached the entire huge buffer at once. > > > > Finally, note that if 0x2000-0x2fff had not been registered, > > the 0x1000-0x3fff pages are not actually deregistered when > > all the pages' > > refcounts go to 0 -- they are just moved to the "able to be > > dereg'ed list". We don't actually dereg it until we later > > try to reg new memory and fail due to lack of resources. > > Then we take entries off the "able to be dereg'ed list" and > > dereg them, then try reg'ing the new memory again. 
> > > > MVAPICH: do you guys do similar things? > > > > (I don't know if HP/Scali/Intel will comment on their > > registration cache schemes) > > > > -- > > Jeff Squyres > > Cisco Systems > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu May 7 14:46:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 May 2009 14:46:55 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <20090507000231.GB16280@obsidianresearch.com> (Jason Gunthorpe's message of "Wed, 6 May 2009 18:02:31 -0600") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> Message-ID: > > No... every HCA just needs to support register and unregister. It > > doesn't have to support changing the mapping without full unregister and > > reregister. > > Well, I would imagine this entire process to be a HCA specific > operation, so HW that supports a better method can use it, otherwise > it has to register/unregister. Is this a concern today with existing > HCAs? > > Using register/unregister exposes a race for the original case you > brought up - but that race is completely unfixable without hardware > support. At least it now becomes a hw specific race that can be > printk'd and someday fixed in new HW rather than an unsolvable API > problem.. We definitely don't want to duplicate all this logic in every hardware device driver, so most of it needs to be generic. 
If we're adding new low-level driver methods to handle this, that definitely raises the cost of implementing all this. But I guess if we start with a generic register/unregister fallback that drivers can override for better performance, then I think we're in good shape. > > Also this requires potentially walking the page tables of the entire > > process, checking to see if any mappings have changed. We really want > > to keep the information that the MMU notifiers give us, namely which > > virtual address range is changing. > > Walking the page tables of every registration in the process, not the > entire process. Yes... but there are bugs in the bugzilla about mthca being limited to only 8 GB of registration by default or something like that, and having that break Intel MPI in some cases. So some MPI jobs want to have 10s of GBs of registered memory -- walking millions of page table entries for every resync operation seems like a big problem to me. Which means that the MMU notifier has to walk the list of memory registrations and mark any affected ones as dirty (possibly with a hint about which pages were invalidated) as you suggest below. Falling back to the "check every registration" ultra-slow-path I think should never ever happen. > I was thinking more along the lines of having the mmu notifiers put > affected registrations on a per-process (or PD?) dirty linked list, > with the link pointers as part of the registration structure. Set a > dirty flag in the registration too. An extra pointer per registration > and a minor incremental cost to the existing work the mmu notifier > would have to do. Yes, makes sense. > > > Only part I don't immediately see is how to trap creation of new VM > > > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap().. > > > > Why do we care? The initial faulting in of mappings occurs when an MR > > is created. > > Well, exactly, that's the problem. 
If you can't trap mmap you cannot
> do the initial faulting and mapping for a new object that is being
> mapped into an existing MR.
>
> Consider:
>
>   void *a = mmap(0,PAGE_SIZE..);
>   ibv_register();
>   // [..]
>   munmap(a);
>   ibv_synchronize();
>
>   // At this point we want the HCA mapping to point to oblivion
>
>   mmap(a,PAGE_SIZE,MAP_FIXED);
>   ibv_synchronize();
>
>   // And now we want it to point to the new allocation
>
> I use MAP_FIXED to illustrate the point, but Jeff has said the same
> address re-use happens randomly in real apps.

This can be handled I think, although at some cost. Just have the
kernel keep track of which MMU sequence number actually invalidated each
MR, and return (via ibv_synchronize()) the MMU change sequence number
that userspace is in sync with. So in the example above, the first
synchronize after munmap() will fail to fix up the first registration,
since it is pointing to an unmapped virtual address, and hence it will
leave that MR on the dirty list, and return that sequence number as not
being synced up yet. And then the second synchronize will see that MR
still on the dirty list, and try again to find the pages.

Passing the sequence number back to userspace makes it possible for
userspace to know that it still has to call ibv_synchronize() again.

There is the possibility that a 1GB MR will have its last page unmapped,
and end up having 100s of thousands of pages walked again and again in
every synchronize operation.

> This method avoids the problem you noticed, but there is extra work to
> fixup a registration that may never be used again. I strongly suspect
> that in the majority of cases this extra work should be about on the
> same order as userspace calling unregister on the MR.

Yes, also it doesn't match the current MPI way of lazily unregistering
things, and only garbage collecting the refcnt 0 cache entries when a
registration fails.
With this method, if userspace unregisters something, it really is gone, and if it doesn't unregister it, then it really uses up space until userspace explicitly unregisters it. Not sure how MPI implementers feel about that. > Or, ignore the overlapping problem, and use your original technique, > slightly modified: > - Userspace registers a counter with the kernel. Kernel pins the > page, sets up mmu notifiers and increments the counter when > invalidates intersect with registrations > - Kernel maintains a linked list of registrations that have been > invalidated via mmu notifiers using the registration structure > and a dirty bit > - Userspace checks the counter at every cache hit, if different it > calls into the kernel: > MR_Cookie *mrs[100]; > int rc = ibv_get_invalid_mrs(mrs,100); > invalidate_cache(mrs,rc); > // Repeat until drained > > get_invalid_mrs traverses the linked list and returns an > identifying value to userspace, which looks it up in the cache, > calls unregister and removes it from the cache. What's the advantage of this? I have to do the get_invalid_mrs() call a bunch of times, rather than just reading which ones are invalid from the cache directly? - R. From rdreier at cisco.com Thu May 7 14:58:50 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 May 2009 14:58:50 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: (Jeff Squyres's message of "Thu, 7 May 2009 09:54:26 -0400") References: Message-ID: > I don't know what the other MPI's do in this scenario, but here's what > OMPI will do: > > 1. lookup 0x1000-0x3fff in the cache; not find any of it it, and > therefore register > - add each page to our cache with a refcount of 1 > 2. lookup 0x2000-0x2fff in the cache, find that all the pages are > already registered > - refcount++ on each page in the cache > 3. 
when we go to dereg 0x1000-0x3fff > - refcount-- on each page in the cache > - since some pages in the range still have refcount>0, don't do > anything further > > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also > releasing 0x2000-0x2fff. If everyone is doing this, how do you handle the case that Jason pointed out, namely: * you register 0x1000 ... 0x3fff * you want to register 0x2000 ... 0x2fff and have a cache hit * you finish up with 0x1000 ... 0x3fff * app does something (which is valid since you finished up with the bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free() that leads to munmap() or whatever), and your hooks tell you so. * app reallocates a mapping in 0x3000 ... 0x3fff * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked both invalid and in-use in the cache at this point !? - R. From rdreier at cisco.com Thu May 7 15:08:54 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 May 2009 15:08:54 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: fix fast registration implementation In-Reply-To: <200905071501.17670.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 7 May 2009 15:01:16 +0300") References: <200905071501.17670.jackm@dev.mellanox.co.il> Message-ID: OK, I guess we want to make work request read-only by the low-level driver. (I wasn't sure whether the fix should be in mlx4 or the NFS/RDMA code, but OK, this approach seems better overall) Applied. > + u64 *mapped_page_list; fixed this to __be64 to avoid sparse endianness checking problems, and say what the code means a little better. - R. 
From jgunthorpe at obsidianresearch.com Thu May 7 15:48:06 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 7 May 2009 16:48:06 -0600
Subject: [ofa-general] Memory registration redux
In-Reply-To:
References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com>
Message-ID: <20090507224806.GF16280@obsidianresearch.com>

On Thu, May 07, 2009 at 02:46:55PM -0700, Roland Dreier wrote:

> > Using register/unregister exposes a race for the original case you
> > brought up - but that race is completely unfixable without hardware
> > support. At least it now becomes a hw specific race that can be
> > printk'd and someday fixed in new HW rather than an unsolvable API
> > problem..
>
> We definitely don't want to duplicate all this logic in every hardware
> device driver, so most of it needs to be generic. If we're adding new
> low-level driver methods to handle this, that definitely raises the cost
> of implementing all this. But I guess if we start with a generic
> register/unregister fallback that drivers can override for better
> performance, then I think we're in good shape.

Right, I was only thinking of a new driver call that was along the
lines of update_mr_pages() that just updates the HCA's mapping with
new page table entries atomically. It really would be device
specific. If there is no call available then unregister/register +
printk log is a fair generic implementation.

To be clear, what I'm thinking is that this would only be invoked if
the VM is being *replaced*. Simply unmapping VM should do nothing.

> Which means that the MMU notifier has to walk the list of memory
> registrations and mark any affected ones as dirty (possibly with a hint
> about which pages were invalidated) as you suggest below. Falling back
> to the "check every registration" ultra-slow-path I think should never
> ever happen.

Yikes, yes, that makes sense.
And hearing that at least openmpi caps
the registration size makes me think per-page granularity is probably
unnecessary to track.

> > Well, exactly, that's the problem. If you can't trap mmap you cannot
> > do the initial faulting and mapping for a new object that is being
> > mapped into an existing MR.
> >
> > Consider:
> >
> >   void *a = mmap(0,PAGE_SIZE..);
> >   ibv_register();
> >   // [..]
> >   munmap(a);
> >   ibv_synchronize();
> >
> >   // At this point we want the HCA mapping to point to oblivion
> >
> >   mmap(a,PAGE_SIZE,MAP_FIXED);
> >   ibv_synchronize();
> >
> >   // And now we want it to point to the new allocation
> >
> > I use MAP_FIXED to illustrate the point, but Jeff has said the same
> > address re-use happens randomly in real apps.
>
> This can be handled I think, although at some cost. Just have the
> kernel keep track of which MMU sequence number actually invalidated each
> MR, and return (via ibv_synchronize()) the MMU change sequence number
> that userspace is in sync with. So in the example above, the first
> synchronize after munmap() will fail to fix up the first registration,
> since it is pointing to an unmapped virtual address, and hence it will
> leave that MR on the dirty list, and return that sequence number as not
> being synced up yet. And then the second synchronize will see that MR
> still on the dirty list, and try again to find the pages.

I agree some kind of kernel/userspace exchange of the sequence number
is necessary to make all the locking and race conditions work out.

But the problem I'm seeing is how does the sequence number get
incremented by the kernel after the mmap() call in the above sequence?
Which mmu_notifier/etc call back do you hook for that?

The *very best* hook would be one that is called when a mm has new
virtual address space allocated and the verbs layer would then take
the allocated address range and intersect it with the registration
list.
Any registrations that have pages in the allocated region are
marked invalid.

Imagine every call to ibv_synchronize was prefixed with a check that
the sequence number has changed.

> > This method avoids the problem you noticed, but there is extra work to
> > fixup a registration that may never be used again. I strongly suspect
> > that in the majority of cases this extra work should be about on the
> > same order as userspace calling unregister on the MR.
>
> Yes, also it doesn't match the current MPI way of lazily unregistering
> things, and only garbage collecting the refcnt 0 cache entries when a
> registration fails. With this method, if userspace unregisters
> something, it really is gone, and if it doesn't unregister it, then it
> really uses up space until userspace explicitly unregisters it. Not
> sure how MPI implementers feel about that.

Well, mixing the lazy unregister in is not a significant change, just
don't increment the sequence number on munmap and have the kernel do
nothing until pages are mapped into an existing registration. With a
flag both behaviors are possible.

All of this work is mainly to close the hole where mapping new memory
over already registered VM results in RDMA to the wrong pages. Fixing
this hole removes the need to trap memory management syscalls and
solves that data corruption problem.

From there various optimizations can be done, like lazy garbage
collecting registrations that no longer point to mapped memory.

> > Or, ignore the overlapping problem, and use your original technique,
> > slightly modified:
> > - Userspace registers a counter with the kernel.
Kernel pins the
> >   page, sets up mmu notifiers and increments the counter when
> >   invalidates intersect with registrations
> > - Kernel maintains a linked list of registrations that have been
> >   invalidated via mmu notifiers using the registration structure
> >   and a dirty bit
> > - Userspace checks the counter at every cache hit, if different it
> >   calls into the kernel:
> >     MR_Cookie *mrs[100];
> >     int rc = ibv_get_invalid_mrs(mrs,100);
> >     invalidate_cache(mrs,rc);
> >     // Repeat until drained
> >
> >   get_invalid_mrs traverses the linked list and returns an
> >   identifying value to userspace, which looks it up in the cache,
> >   calls unregister and removes it from the cache.
>
> What's the advantage of this? I have to do the get_invalid_mrs() call a
> bunch of times, rather than just reading which ones are invalid from the
> cache directly?

This is a trade-off: the above is a more normal kernel API and lets the
app get a list of changes it can scan. Having the kernel update flags
means if the app wants a list of changes it has to scan all
registrations. Knowing the registration is no good lets you remove it
from the search list and save time on the hot path.

I imagined a call that would return as much in one go as memory is
available (ie 100 entries above) so I doubt more than one call per
event would ever be needed.
Jason From sfr at canb.auug.org.au Thu May 7 18:53:56 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Fri, 8 May 2009 11:53:56 +1000 Subject: [ofa-general] linux-next: infiniband tree build failure Message-ID: <20090508115356.b97b8981.sfr@canb.auug.org.au> Hi Roland, Today's linux-next build (x86_64 allmodconfig) failed like this: drivers/infiniband/hw/mlx4/mr.c: In function 'mlx4_ib_alloc_fast_reg_page_list': drivers/infiniband/hw/mlx4/mr.c:242: error: label 'err_free_mfrpl' used but not defined Caused by commit 88029ff3c862812b81745ae3d6557ede96e2d051 ("IB/mlx4: Don't overwrite fast registration page list when posting work request"). Clearly not build tested :-( I have used the version of the tree from next-20090507. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From rdreier at cisco.com Thu May 7 21:36:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 May 2009 21:36:55 -0700 Subject: [ofa-general] linux-next: infiniband tree build failure In-Reply-To: <20090508115356.b97b8981.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Fri, 8 May 2009 11:53:56 +1000") References: <20090508115356.b97b8981.sfr@canb.auug.org.au> Message-ID: > Today's linux-next build (x86_64 allmodconfig) failed like this: > > drivers/infiniband/hw/mlx4/mr.c: In function 'mlx4_ib_alloc_fast_reg_page_list': > drivers/infiniband/hw/mlx4/mr.c:242: error: label 'err_free_mfrpl' used but not defined > > Caused by commit 88029ff3c862812b81745ae3d6557ede96e2d051 ("IB/mlx4: > Don't overwrite fast registration page list when posting work request"). > Clearly not build tested :-( My fault for editing after applying the patch. Fixed now. 
Thanks, Roland From vlad at lists.openfabrics.org Fri May 8 03:25:01 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 8 May 2009 03:25:01 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090508-0200 daily build status Message-ID: <20090508102501.10072E61327@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 
Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Fri May 8 06:35:38 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 8 May 2009 09:35:38 -0400 Subject: [ofa-general] Re: [PATCH 2/3] Add combined routing support to libibnetdisc In-Reply-To: <20090507085807.f1e743bb.weiny2@llnl.gov> References: <20090430142958.5811218f.weiny2@llnl.gov> <20090506100744.GB10145@sk> <20090506093347.bb1b56be.weiny2@llnl.gov> <20090507085807.f1e743bb.weiny2@llnl.gov> Message-ID: On Thu, May 7, 2009 at 11:58 AM, Ira Weiny wrote: > On Thu, 7 May 2009 09:56:38 -0400 > Hal Rosenstock wrote: > >> Ira, >> >> On Wed, May 6, 2009 at 12:33 PM, Ira Weiny wrote: >> > On Wed, 6 May 2009 13:07:44 +0300 >> > Sasha Khapyorsky wrote: >> > > [snip] > >> >> >> >> And wouldn't it be better instead of resolving selfport on each >> >> extend_path() call to keep it already resolved somewhere in fabric >> >> structure? >> > >> > This will only happen 1 time for each fabric being scan'ed because the path is >> > reused... >> > >> > Oh wait a minute, I just reviewed the code...  For the current use case the >> > path is reused since I am only scanning 1 node.  However, in the general case >> > this is not true.  Sorry about that.  A new patch is below. >> >> Does combined routing always fall back on failure to using directed routing ? > > No, not automatically in the library. > >> >> Also, would you summarize the use cases for combined routing in ibnetdiscover ? >> > > ibnetdiscover does not use this feature.  It does a "full scan" which results > in only DR routing. > > iblinkinfo and ibqueryerrors have the ability to request output for a single > node.  
The library was written to be able to scan from a given portid and a > number of hops around that node.  However, at first this only supported a DR > path in the portid.  If the user specified something like GUID iblinkinfo > would scan the entire fabric and search the data which came back for that > node.  Of course the problem with is that on a large fabric it could take 8 > seconds to come back with a single node of data.  If the SM/SA is up and > running I decided it would be better to query for the LID of that node and > start the scan from there.  That is what this patch adds.  iblinkinfo and > ibqueryerrors will call ibnd_discover_fabric with the "from" == to the portid > resolved from the SA and "hops" == 1.  If resolving the GUID or the limited > scan fails ibqueryerrors and iblinkinfo then call the library again for a full > fabric scan ("from" == NULL) and then search for the node in the fabric data > returned. > > So that is the use case for doing this in the library.  But once again > ibnetdiscover does not use this.  The other use case I could think of is doing > a more extensive scan of multiple hops around a single node.  I have not > implemented this yet but in my early testing it worked just fine starting with > a DR path.  I believe this will still work with combined routing. > > Make sense? Yes, this makes sense. Thanks for clarifying. -- Hal > Ira > > From viral.mehta at einfochips.com Fri May 8 06:45:06 2009 From: viral.mehta at einfochips.com (Viral Mehta) Date: Fri, 08 May 2009 19:15:06 +0530 Subject: [ofa-general] ib_rdma_bw - bandwidth calculation Message-ID: <4A043762.4090003@einfochips.com> Hi, While running below ib_rdma_bw on 32bit platform, I am getting unexpected low throughput. 
Server: ib_rdma_bw -p 5019 -s 1048576 -t 500 -n 5000 -b -c
Client: ib_rdma_bw -p 5019 -s 1048576 -t 500 -n 5000 -b -c 100.168.54.49

(If iterations are changed to 500, I get the expected throughput.)

Looking at the code, I found that ib_rdma_bw.c in the perftest package
has the following code:

>{
>	double cycles_to_units;
>	unsigned long tsize;	/* Transferred size, in megabytes */
>	....
>	....
>	cycles_to_units = get_cpu_mhz(0) * 1000000;
>
>	printf("%d: Bandwidth average: %g MB/sec\n", pid,
>	       tsize * iters * cycles_to_units /
>	       (tcompleted[iters - 1] - tposted[0])
>/ 0x100000);
>}
>

Here, tsize is "unsigned long", which is 4 bytes on 32-bit platforms
and 8 bytes on 64-bit platforms. When I run the test with a 1M data
size and 5000 iterations as above, the calculation (tsize * iters)
overflows the "unsigned long" limit (1048576 * 5000 = 5242880000, which
is more than the 32-bit maximum of 4294967295) and thus gives an
unexpectedly low throughput.

The correct fix should be applied in the ib_rdma_bw application: either
change the calculation from
(tsize * iters * cycles_to_units)
to
( cycles_to_units * tsize * iters )
or change tsize to double.

Should I go ahead and submit a patch?

Viral Mehta,
Embedded Software Engineer,
www.einfochips.com

P.S. - However, I do understand that we can overflow the double boundary
as well if we run the test with a larger data size and more iterations.
A better way to calculate bandwidth would be after every fixed number
of iterations (say 100).

From hal.rosenstock at gmail.com Fri May 8 06:57:58 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Fri, 8 May 2009 09:57:58 -0400
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com>
Message-ID:

On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote:
>
> In addition to the original proposal we suggest allocating special MLID
> for the following MGIDs:
>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>  2.
FF12401bxxxx00000000000000000001 - All hosts
>  3. FF12401bffff0000000000000000004d - all Gateways
>  4. FF12401bxxxx00000000000000000002 - all routers
>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM

It turns out that collapsing multicast groups across PKeys on a single
MLID may not be such a good idea unless partition enforcement by
switches is disabled. There should be different modes of collapsing
depending on whether this is enabled or not.

> For all other cases we suggest that same MLID will be assigned to
> different MGIDs if:
>  1. They share the same P Key
>  2. Same signature - for IPoIB only
>  3. Same LSB bits - bitmask configurable by user (default 10 bits)
>        for example, the following are the same:
>        MGID1:  FF12401bABCD000000000000xxxxx755
>        MGID2:  FF12401bABCD000000000000yyyyyB55

Jason's approach to this was in a thread entitled "IPv6 and IPoIB
scalability issue":
http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
in which he proposed an MGID range (MGID/prefix syntax) for collapsing
IPv6 SNM groups. Additionally, there was the potential to distribute
the matched groups across some number of MLIDs. See also thread "[RFC]
OpenSM and IPv6 Scalability Proposal":
http://lists.openfabrics.org/pipermail/general/2008-June/051226.html

>  Implementation.
>  Since there will be many mgroups shared same mlid, mlid-array entry will contain
>  fleximap holding mgroups.
>  Searching of mgroup will be performed by mlid (index in the array) and mgid -
>  key in the fleximap.
Sasha proposed using an array rather than fleximap for this: http://lists.openfabrics.org/pipermail/general/2008-June/051525.html -- Hal > > >  Slava Strebkov > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vlad at lists.openfabrics.org Sat May 9 03:21:53 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 9 May 2009 03:21:53 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090509-0200 daily build status Message-ID: <20090509102153.70076E615A7@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 
with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dorfman.eli at gmail.com Sat May 9 03:32:06 2009 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Sat, 9 May 2009 13:32:06 +0300 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal In-Reply-To: References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> Message-ID: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote: > On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote: >> >> In addition to the original proposal we suggest allocating special MLID >> for the following MGIDs: >>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes >>  2. FF12401bxxxx00000000000000000001 - All hosts >>  3. FF12401bffff0000000000000000004d  - all Gateways >>  4. FF12401bxxxx00000000000000000002 - all routers >>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM > > It turns out that collapsing multicast groups across PKeys on a single > MLID may not be such a good idea unless partition enforcement > enforcement by switches is disabled. There should be different modes > of collapsing based on this based on whether this is enabled or not. The idea is to allocate a different MLID per each of the above special MGIDs. 
>> For all other cases we suggest that same MLID will be assigned to >> different MGIDs if: >>  1. They share the same P Key >>  2. Same signature - for IPoIB only >>  3. Same LSB bits - bitmask configurable by user (default  10 bits) >>        for example, the following are the same: >>        MGID1:  FF12401bABCD000000000000xxxxx755 >>        MGID2:  FF12401bABCD000000000000yyyyyB55 > > Jason's approach to this was in a thread entitled "IPv6 and IPoIB > scalability issue": > http://lists.openfabrics.org/pipermail/general/2006-November/029621.html > in which he proposed an MGID range (MGID/prefix syntax) for collapsing > IPv6 SNM groups. Additionally, there was the potential to distribute > the matched groups across some number of MLIDs. See also thread "[RFC] > OpenSM and IPv6 Scalability Proposal": > http://lists.openfabrics.org/pipermail/general/2008-June/051226.html > >>  Implementation. >>  Since there will be many mgroups shared same mlid, mlid-array entry >> will contain >>  fleximap holding mgroups. >>  Searching of mgroup will be performed by mlid (index in the array) and >> mgid - >>  key in the fleximap. 
> > Sasha proposed using an array rather than fleximap for this: > http://lists.openfabrics.org/pipermail/general/2008-June/051525.html > > -- Hal > >> >> >>  Slava Strebkov >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Sat May 9 03:41:27 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 9 May 2009 06:41:27 -0400 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal In-Reply-To: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com> References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com> Message-ID: On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman wrote: > On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote: >> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote: >>> >>> In addition to the original proposal we suggest allocating special MLID >>> for the following MGIDs: >>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes >>>  2. FF12401bxxxx00000000000000000001 - All hosts >>>  3. FF12401bffff0000000000000000004d  - all Gateways >>>  4. FF12401bxxxx00000000000000000002 - all routers >>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM >> >> It turns out that collapsing multicast groups across PKeys on a single >> MLID may not be such a good idea unless partition enforcement >> enforcement by switches is disabled. There should be different modes >> of collapsing based on this based on whether this is enabled or not. 
> > The idea is to allocate a different MLID per each of the above special MGIDs. So one MLID per PKey in the MGID ? What's the difference between xxxx's and ABCD in the syntax above ? IPv6 is being collapsed per PKey too, right ? >>> For all other cases we suggest that same MLID will be assigned to >>> different MGIDs if: >>>  1. They share the same P Key >>>  2. Same signature - for IPoIB only >>>  3. Same LSB bits - bitmask configurable by user (default  10 bits) >>>        for example, the following are the same: >>>        MGID1:  FF12401bABCD000000000000xxxxx755 >>>        MGID2:  FF12401bABCD000000000000yyyyyB55 >> >> Jason's approach to this was in a thread entitled "IPv6 and IPoIB >> scalability issue": >> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html >> in which he proposed an MGID range (MGID/prefix syntax) for collapsing >> IPv6 SNM groups. Additionally, there was the potential to distribute >> the matched groups across some number of MLIDs. See also thread "[RFC] >> OpenSM and IPv6 Scalability Proposal": >> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html >> >>>  Implementation. >>>  Since there will be many mgroups shared same mlid, mlid-array entry >>> will contain >>>  fleximap holding mgroups. >>>  Searching of mgroup will be performed by mlid (index in the array) and >>> mgid - >>>  key in the fleximap. 
>>
>> Sasha proposed using an array rather than fleximap for this:
>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html
>>
>> -- Hal
>>
>>>
>>>  Slava Strebkov

From dorfman.eli at gmail.com Sat May 9 04:29:23 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sat, 09 May 2009 14:29:23 +0300
Subject: [ofa-general] [PATCH] opensm/osm_lid_mgr.c bug in opensm LID assignment
Message-ID: <4A056913.7010700@gmail.com>

Wrong check of the persistent LID range:
used LIDs were not properly checked, which caused duplicate LID assignment
in some cases.
Signed-off-by: Eli Dorfman
---
 opensm/opensm/osm_lid_mgr.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 14601e1..e1d5106 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -595,7 +595,7 @@ static boolean_t lid_mgr_is_range_not_persistent(IN osm_lid_mgr_t * p_mgr,
 		return FALSE;
 
 	for (i = lid; i < lid + num_lids; i++)
-		if (p_mgr->used_lids[lid])
+		if (p_mgr->used_lids[i])
 			return FALSE;
 
 	return TRUE;
-- 
1.5.3.6

From dorfman.eli at gmail.com Sat May 9 04:31:30 2009
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Sat, 9 May 2009 14:31:30 +0300
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To:
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
Message-ID: <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com>

On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock wrote:
> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman wrote:
>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote:
>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote:
>>>>
>>>> In addition to the original proposal we suggest allocating special MLID
>>>> for the following MGIDs:
>>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
>>>>  2. FF12401bxxxx00000000000000000001 - All hosts
>>>>  3. FF12401bffff0000000000000000004d  - all Gateways
>>>>  4. FF12401bxxxx00000000000000000002 - all routers
>>>>  5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
>>>
>>> It turns out that collapsing multicast groups across PKeys on a single
>>> MLID may not be such a good idea unless partition enforcement
>>> by switches is disabled. There should be different modes
>>> of collapsing based on whether this is enabled or not.
>>
>> The idea is to allocate a different MLID per each of the above special MGIDs.
>
> So one MLID per PKey in the MGID ?
yes > What's the difference between xxxx's and ABCD in the syntax above ? none. should be the same. > IPv6 is being collapsed per PKey too, right ? yes >>>> For all other cases we suggest that same MLID will be assigned to >>>> different MGIDs if: >>>>  1. They share the same P Key >>>>  2. Same signature - for IPoIB only >>>>  3. Same LSB bits - bitmask configurable by user (default  10 bits) >>>>        for example, the following are the same: >>>>        MGID1:  FF12401bABCD000000000000xxxxx755 >>>>        MGID2:  FF12401bABCD000000000000yyyyyB55 >>> >>> Jason's approach to this was in a thread entitled "IPv6 and IPoIB >>> scalability issue": >>> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html >>> in which he proposed an MGID range (MGID/prefix syntax) for collapsing >>> IPv6 SNM groups. Additionally, there was the potential to distribute >>> the matched groups across some number of MLIDs. See also thread "[RFC] >>> OpenSM and IPv6 Scalability Proposal": >>> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html >>> >>>>  Implementation. >>>>  Since there will be many mgroups shared same mlid, mlid-array entry >>>> will contain >>>>  fleximap holding mgroups. >>>>  Searching of mgroup will be performed by mlid (index in the array) and >>>> mgid - >>>>  key in the fleximap. 
>>> >>> Sasha proposed using an array rather than fleximap for this: >>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html >>> >>> -- Hal >>> >>>> >>>> >>>>  Slava Strebkov >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >> > From hal.rosenstock at gmail.com Sat May 9 05:26:05 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 9 May 2009 08:26:05 -0400 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal In-Reply-To: <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com> References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com> <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com> Message-ID: On Sat, May 9, 2009 at 7:31 AM, Eli Dorfman wrote: > On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock wrote: >> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman wrote: >>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote: >>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote: >>>>> >>>>> In addition to the original proposal we suggest allocating special MLID >>>>> for the following MGIDs: >>>>>  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes >>>>>  2. FF12401bxxxx00000000000000000001 - All hosts >>>>>  3. FF12401bffff0000000000000000004d  - all Gateways >>>>>  4. FF12401bxxxx00000000000000000002 - all routers >>>>>  5. 
FF12601bABCD000000000001ffxxxxxx - IPv6 SNM >>>> >>>> It turns out that collapsing multicast groups across PKeys on a single >>>> MLID may not be such a good idea unless partition enforcement >>>> enforcement by switches is disabled. There should be different modes >>>> of collapsing based on this based on whether this is enabled or not. >>> >>> The idea is to allocate a different MLID per each of the above special MGIDs. >> >> So one MLID per PKey in the MGID ? > yes > >> What's the difference between xxxx's and ABCD in the syntax above ? > none. should be the same. Doesn't the xxxxxx for IPv6 mean mask these nibbles though ? > >> IPv6 is being collapsed per PKey too, right ? > yes > >>>>> For all other cases we suggest that same MLID will be assigned to >>>>> different MGIDs if: >>>>>  1. They share the same P Key >>>>>  2. Same signature - for IPoIB only >>>>>  3. Same LSB bits - bitmask configurable by user (default  10 bits) >>>>>        for example, the following are the same: >>>>>        MGID1:  FF12401bABCD000000000000xxxxx755 >>>>>        MGID2:  FF12401bABCD000000000000yyyyyB55 >>>> >>>> Jason's approach to this was in a thread entitled "IPv6 and IPoIB >>>> scalability issue": >>>> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html >>>> in which he proposed an MGID range (MGID/prefix syntax) for collapsing >>>> IPv6 SNM groups. Additionally, there was the potential to distribute >>>> the matched groups across some number of MLIDs. See also thread "[RFC] >>>> OpenSM and IPv6 Scalability Proposal": >>>> http://lists.openfabrics.org/pipermail/general/2008-June/051226.html >>>> >>>>>  Implementation. >>>>>  Since there will be many mgroups shared same mlid, mlid-array entry >>>>> will contain >>>>>  fleximap holding mgroups. >>>>>  Searching of mgroup will be performed by mlid (index in the array) and >>>>> mgid - >>>>>  key in the fleximap. 
>>>> >>>> Sasha proposed using an array rather than fleximap for this: >>>> http://lists.openfabrics.org/pipermail/general/2008-June/051525.html >>>> >>>> -- Hal >>>> >>>>> >>>>> >>>>>  Slava Strebkov >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>> >> > From dorfman.eli at gmail.com Sat May 9 22:51:33 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 10 May 2009 08:51:33 +0300 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal In-Reply-To: References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com> <694d48600905090431ocd05510y3218575a8a93d75@mail.gmail.com> Message-ID: <4A066B65.8030704@gmail.com> Hal Rosenstock wrote: > On Sat, May 9, 2009 at 7:31 AM, Eli Dorfman wrote: >> On Sat, May 9, 2009 at 1:41 PM, Hal Rosenstock wrote: >>> On Sat, May 9, 2009 at 6:32 AM, Eli Dorfman wrote: >>>> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote: >>>>> On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote: >>>>>> In addition to the original proposal we suggest allocating special MLID >>>>>> for the following MGIDs: >>>>>> 1. FF12401bxxxx000000000000FFFFFFFF - All Nodes >>>>>> 2. FF12401bxxxx00000000000000000001 - All hosts >>>>>> 3. FF12401bffff0000000000000000004d - all Gateways >>>>>> 4. FF12401bxxxx00000000000000000002 - all routers >>>>>> 5. 
FF12601bABCD000000000001ffxxxxxx - IPv6 SNM >>>>> It turns out that collapsing multicast groups across PKeys on a single >>>>> MLID may not be such a good idea unless partition enforcement >>>>> enforcement by switches is disabled. There should be different modes >>>>> of collapsing based on this based on whether this is enabled or not. >>>> The idea is to allocate a different MLID per each of the above special MGIDs. >>> So one MLID per PKey in the MGID ? >> yes >> >>> What's the difference between xxxx's and ABCD in the syntax above ? >> none. should be the same. > > Doesn't the xxxxxx for IPv6 mean mask these nibbles though ? For IPv6 the ABCD is the pkey and xxxxxx is the mask To make it the same as IPv4 groups we can use the following notation (mmmmmm=mask and xxxx=pkey) FF12601bxxxx000000000001ffmmmmmm From slavas at voltaire.com Sat May 9 22:54:55 2009 From: slavas at voltaire.com (Slava Strebkov) Date: Sun, 10 May 2009 08:54:55 +0300 Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal In-Reply-To: References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> Message-ID: <39C75744D164D948A170E9792AF8E7CA01F6F88C@exil.voltaire.com> -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Friday, May 08, 2009 4:58 PM To: Slava Strebkov Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote: > > In addition to the original proposal we suggest allocating special MLID > for the following MGIDs: >  1. FF12401bxxxx000000000000FFFFFFFF - All Nodes >  2. FF12401bxxxx00000000000000000001 - All hosts >  3. FF12401bffff0000000000000000004d  - all Gateways >  4. FF12401bxxxx00000000000000000002 - all routers >  5. 
FF12601bABCD000000000001ffxxxxxx - IPv6 SNM

It turns out that collapsing multicast groups across PKeys on a single
MLID may not be such a good idea unless partition enforcement by switches
is disabled. There should be different modes of collapsing based on
whether this is enabled or not.

> For all other cases we suggest that same MLID will be assigned to
> different MGIDs if:
>  1. They share the same P Key
>  2. Same signature - for IPoIB only
>  3. Same LSB bits - bitmask configurable by user (default  10 bits)
>        for example, the following are the same:
>        MGID1:  FF12401bABCD000000000000xxxxx755
>        MGID2:  FF12401bABCD000000000000yyyyyB55

Jason's approach to this was in a thread entitled "IPv6 and IPoIB
scalability issue":
http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
in which he proposed an MGID range (MGID/prefix syntax) for collapsing
IPv6 SNM groups. Additionally, there was the potential to distribute
the matched groups across some number of MLIDs. See also thread "[RFC]
OpenSM and IPv6 Scalability Proposal":
http://lists.openfabrics.org/pipermail/general/2008-June/051226.html

>  Implementation.
>  Since there will be many mgroups shared same mlid, mlid-array entry
> will contain
>  fleximap holding mgroups.
>  Searching of mgroup will be performed by mlid (index in the array) and
> mgid -
>  key in the fleximap.

Sasha proposed using an array rather than fleximap for this:
http://lists.openfabrics.org/pipermail/general/2008-June/051525.html

We propose an MLID-indexed array, but instead of a list of pointers to
multicast groups, each entry will hold a fleximap sorted by MGID. Looking
up an MGID in the fleximap is faster than scanning a simple list.
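
The grouping rule and the lookup discussed above can be sketched roughly as
follows. This is only an illustration, not OpenSM code: the names
collapse_key, make_key, shares_mlid, mlid_entry and find_mgroup are
hypothetical, and a sorted array searched with bsearch() stands in for the
fleximap. It assumes the usual IPoIB MGID layout (signature in bytes 2-3,
PKey in bytes 4-5) and that the configurable mask covers at most the low
32 bits of the MGID.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MGID_LEN 16

/* Hypothetical key deciding which MGIDs collapse onto one MLID. */
struct collapse_key {
	uint16_t signature;	/* MGID bytes 2-3: 0x401b IPv4, 0x601b IPv6 */
	uint16_t pkey;		/* MGID bytes 4-5 */
	uint32_t low_bits;	/* masked least-significant MGID bits */
};

static struct collapse_key make_key(const uint8_t *mgid, unsigned mask_bits)
{
	struct collapse_key k;
	uint32_t tail = ((uint32_t)mgid[12] << 24) | ((uint32_t)mgid[13] << 16) |
			((uint32_t)mgid[14] << 8) | (uint32_t)mgid[15];

	assert(mask_bits >= 1 && mask_bits <= 32);
	k.signature = (uint16_t)((mgid[2] << 8) | mgid[3]);
	k.pkey = (uint16_t)((mgid[4] << 8) | mgid[5]);
	k.low_bits = mask_bits == 32 ? tail : tail & ((1u << mask_bits) - 1u);
	return k;
}

/* Nonzero when both MGIDs would share an MLID under the three rules. */
static int shares_mlid(const uint8_t *a, const uint8_t *b, unsigned mask_bits)
{
	struct collapse_key ka = make_key(a, mask_bits);
	struct collapse_key kb = make_key(b, mask_bits);

	return ka.signature == kb.signature && ka.pkey == kb.pkey &&
	       ka.low_bits == kb.low_bits;
}

/* One MLID-array entry: the MGIDs collapsed onto it, kept sorted. */
struct mlid_entry {
	uint8_t (*mgids)[MGID_LEN];
	size_t count;
};

static int mgid_cmp(const void *a, const void *b)
{
	return memcmp(a, b, MGID_LEN);
}

/* Look up a group by MLID (array index) and MGID (binary-search key). */
static const uint8_t *find_mgroup(const struct mlid_entry *tbl, size_t n_mlids,
				  size_t mlid_idx, const uint8_t *mgid)
{
	if (mlid_idx >= n_mlids || tbl[mlid_idx].count == 0)
		return NULL;
	return bsearch(mgid, tbl[mlid_idx].mgids, tbl[mlid_idx].count,
		       MGID_LEN, mgid_cmp);
}
```

With the default 10-bit mask, two MGIDs ending in ...755 and ...B55 share
the low bits 0x355, so shares_mlid() reports a match when signature and
PKey are also equal; with a 12-bit mask they no longer match.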
Slava -- Hal > > >  Slava Strebkov > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dorfman.eli at gmail.com Sat May 9 23:42:44 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 10 May 2009 09:42:44 +0300 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: <20090507143346.GA1713@comcast.net> References: <20090507143346.GA1713@comcast.net> Message-ID: <4A067764.3040306@gmail.com> Hal Rosenstock wrote: > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..17bac73 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > op_vls = p_subn->opt.max_op_vls; > > if (op_vls == 0) { > - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > + OSM_LOG(p_log, OSM_LOG_DEBUG, > "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); In this case I think that level should be changed to ERROR since this is not the normal behavior. 
> op_vls = 1; > } From dorfman.eli at gmail.com Sat May 9 23:49:41 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 10 May 2009 09:49:41 +0300 Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A03043C.4010709@voltaire.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com> Message-ID: <4A067905.5060401@gmail.com> Doron Shoham wrote: > check if op_vls = 0 before max_op_vls comparison > > Signed-off-by: Doron Shoham > --- > opensm/opensm/osm_port.c | 9 +++++---- > 1 files changed, 5 insertions(+), 4 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..4d1bbf2 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > } else > op_vls = ib_port_info_get_op_vls(&p_physp->port_info); > > - /* support user limitation of max_op_vls */ > - if (op_vls > p_subn->opt.max_op_vls) > - op_vls = p_subn->opt.max_op_vls; > - > if (op_vls == 0) { > + /* for non compliant implementations */ > OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " I think that level should be OSM_LOG_ERROR. > "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n"); > op_vls = 1; > } > + /* support user limitation of max_op_vls */ > + if (op_vls > p_subn->opt.max_op_vls) > + op_vls = p_subn->opt.max_op_vls; > + > > OSM_LOG_EXIT(p_log); > return op_vls; From dorons at voltaire.com Sun May 10 01:17:11 2009 From: dorons at voltaire.com (Doron Shoham) Date: Sun, 10 May 2009 11:17:11 +0300 Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A067905.5060401@gmail.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com> <4A067905.5060401@gmail.com> Message-ID: <4A068D87.6040801@voltaire.com> check if op_vls = 0 before max_op_vls comparison Signed-off-by: Doron Shoham --- opensm/opensm/osm_port.c | 11 ++++++----- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 2e6c642..41b67ad 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, } else op_vls = ib_port_info_get_op_vls(&p_physp->port_info); - /* support user limitation of max_op_vls */ - if (op_vls > p_subn->opt.max_op_vls) - op_vls = p_subn->opt.max_op_vls; - if (op_vls == 0) { - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " + /* for non compliant implementations */ + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: " "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n"); op_vls = 1; } + /* support user limitation of max_op_vls */ + if (op_vls > p_subn->opt.max_op_vls) + op_vls = p_subn->opt.max_op_vls; + OSM_LOG_EXIT(p_log); return op_vls; -- 1.5.4 From vlad at lists.openfabrics.org Sun May 10 03:24:50 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 10 May 2009 03:24:50 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090510-0200 daily build status Message-ID: <20090510102450.B8AB8E61434@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 
with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Sun May 10 03:46:10 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 10 May 2009 06:46:10 -0400 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: <4A067764.3040306@gmail.com> References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com> Message-ID: On Sun, May 10, 2009 at 2:42 AM, Eli Dorfman (Voltaire) wrote: > Hal Rosenstock wrote: >> Signed-off-by: Hal Rosenstock >> --- >> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >> index 2e6c642..17bac73 100644 >> --- a/opensm/opensm/osm_port.c >> +++ b/opensm/opensm/osm_port.c >> @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >>               op_vls = p_subn->opt.max_op_vls; >> >>       if (op_vls == 0) { >> -             OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " >> +             OSM_LOG(p_log, OSM_LOG_DEBUG, >>                       "Invalid OP_VLS = 0. Forcing correction to 1 (VL0)\n"); > > In this case I think that level should be changed to ERROR since this is not the normal behavior. Sasha has been adamant that any device supplied data errors use something other than ERROR log level. 
-- Hal > >>               op_vls = 1; >>       } > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Sun May 10 03:47:29 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 10 May 2009 06:47:29 -0400 Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A068D87.6040801@voltaire.com> References: <4A00386E.2050300@voltaire.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com> <4A067905.5060401@gmail.com> <4A068D87.6040801@voltaire.com> Message-ID: On Sun, May 10, 2009 at 4:17 AM, Doron Shoham wrote: > check if op_vls = 0 before max_op_vls comparison > > Signed-off-by: Doron Shoham > --- >  opensm/opensm/osm_port.c |   11 ++++++----- >  1 files changed, 6 insertions(+), 5 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..41b67ad 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, >        } else >                op_vls = ib_port_info_get_op_vls(&p_physp->port_info); > > -       /* support user limitation of max_op_vls */ > -       if (op_vls > p_subn->opt.max_op_vls) > -               op_vls = p_subn->opt.max_op_vls; > - >        if (op_vls == 0) { > -               OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > +               /* for non compliant implementations */ > +               OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: " Sasha has been adamant that any device supplied data errors use something other than ERROR log level. -- Hal >                        "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n"); >                op_vls = 1; >        } > +       /* support user limitation of max_op_vls */ > +       if (op_vls > p_subn->opt.max_op_vls) > +               op_vls = p_subn->opt.max_op_vls; > + > >        OSM_LOG_EXIT(p_log); >        return op_vls; > -- > 1.5.4 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dorons at voltaire.com Sun May 10 04:30:13 2009 From: dorons at voltaire.com (Doron Shoham) Date: Sun, 10 May 2009 14:30:13 +0300 Subject: [ofa-general] [PATCH] saquery: fix -c arguement In-Reply-To: <4A030525.7090209@voltaire.com> References: <4A030525.7090209@voltaire.com> Message-ID: <4A06BAC5.40405@voltaire.com> set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO Signed-off-by: Doron Shoham --- infiniband-diags/src/saquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 4dcd712..5920eda 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1470,7 +1470,7 @@ static int process_opt(void *context, int ch, char *optarg) node_print_desc = ALL_DESC; break; case 'c': - command = CLASS_PORT_INFO; + command = SAQUERY_CMD_CLASS_PORT_INFO; break; case 'S': query_type = IB_SA_ATTR_SERVICERECORD; -- 1.5.4 Sorry, forgot ';' Thanks, Doron From amirv at mellanox.co.il Sun May 10 23:52:42 2009 From: amirv at mellanox.co.il (Amir Vadai) Date: Mon, 11 May 2009 09:52:42 +0300 Subject: [ofa-general] SDP error In-Reply-To: References: Message-ID: <4A07CB3A.2030507@mellanox.co.il> Hi, Which kernel is it? What is the command line that you execute? for example for ssh (both in the client and in the server) Please make sure that the module ib_sdp.so is loaded when you run the programs. 
If it wasn't started automatically - please let me know.

- Amir

On 05/05/2009 03:53 PM, anthony garnier wrote:
> Hello,
>
> i`m running a debian 5.0 OS with ofed 1.4, RDMA work very well, but when
> I`m trying to use the SDP protocol with ssh, Netperf or a simple
> Client-Server programming in C, I got socket error like that :
>
> NetPIPE: can't open stream socket! errno=97 (for Netpipe)
>
> Address family not supported by protocol ssh (for ssh)
>
> Address family not supported by protocol (for client-server)
>
> Someone knows those errors?

--
Amir Vadai
Software Eng.
Mellanox Technologies mailto: amirv at mellanox.co.il Tel +972-3-6259539 From vlad at lists.openfabrics.org Mon May 11 03:22:07 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 11 May 2009 03:22:07 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090511-0200 daily build status Message-ID: <20090511102207.BC028E61364@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on 
ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From jsquyres at cisco.com Mon May 11 05:11:50 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 11 May 2009 08:11:50 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com> References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com><382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com><8B573344-0870-4352-8BC2-AED66E1E4234@cisco.com> <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com> Message-ID: <871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com> On May 4, 2009, at 8:25 PM, Jeff Squyres (jsquyres) wrote: > It was suggested today that a teleconference to discuss these issues > might be much more useful (an hour-long teleconference can save a > week's worth of emails!). This will be a technical call to discuss > memory registration issues; it will not be an EWG call. I've setup a > WebEx call for next Monday at the "normal" time: noon US Eastern, 9am > US Pacific, 7pm Israel. The invite will be coming to the ewg and > general lists shortly. > Productive discussion about this issue is still occurring on the list -- I don't think we need this teleconf today. 
-- Jeff Squyres Cisco Systems From perkinjo at cse.ohio-state.edu Mon May 11 05:13:18 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Mon, 11 May 2009 08:13:18 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: Message-ID: <20090511121318.GD3045@cse.ohio-state.edu> On Tue, May 05, 2009 at 04:57:09PM -0400, Jeff Squyres wrote: > Roland and I chatted on the phone today; I think I now understand > Roland's counter-proposal (I clearly didn't before). Let me try to > summarize: > > 1. Add a new verb for "set this userspace flag to 1 if mr X ever becomes > invalid" > 2. Add a new verb for "no longer tell me if mr X ever becomes invalid" > (i.e., remove the effects of #1) > 3. Add run-time query indicating whether #1 works > 4. Add [optional] memory registration caching to libibverbs > > Prior to talking to Roland, I had envisioned *one* flag in userspace > that indicated whether any memory registrations had become invalid. > Roland's idea is that there is one flag *per registration* -- you can > instantly tell whether a specific registration is valid. > > Given this, let's keep the discussion going here in email -- perhaps the > teleconference next Monday may become moot. It looks like there has been more discussion on how to implement this idea. Are we still planning on having this teleconference today? -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From perkinjo at cse.ohio-state.edu Mon May 11 05:14:05 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Mon, 11 May 2009 08:14:05 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: <871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com> References: <11AAF71E-0D36-471E-A9C6-5FC924AF9E7D@cisco.com> <871F48B9-4B7A-4EE5-9D1E-0D9A9A69035E@cisco.com> Message-ID: <20090511121405.GE3045@cse.ohio-state.edu> On Mon, May 11, 2009 at 08:11:50AM -0400, Jeff Squyres wrote: > On May 4, 2009, at 8:25 PM, Jeff Squyres (jsquyres) wrote: > >> It was suggested today that a teleconference to discuss these issues >> might be much more useful (an hour-long teleconference can save a >> week's worth of emails!). This will be a technical call to discuss >> memory registration issues; it will not be an EWG call. I've setup a >> WebEx call for next Monday at the "normal" time: noon US Eastern, 9am >> US Pacific, 7pm Israel. The invite will be coming to the ewg and >> general lists shortly. >> > > > Productive discussion about this issue is still occurring on the list -- > I don't think we need this teleconf today. OK thanks. You can ignore my just sent email. > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL:

From sebastien.dugue at bull.net Mon May 11 05:38:21 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 11 May 2009 14:38:21 +0200
Subject: [ofa-general] [PATCH] mstflint - Fix redirection to /dev/null in hca_self_test.ofed
Message-ID: <20090511143821.4b0746ef@frecb007965>

Redirect 'rpm -qa' stderr to /dev/null instead of null.

Signed-Off-By: Sebastien Dugue
---
 hca_self_test.ofed | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hca_self_test.ofed b/hca_self_test.ofed
index c7c5492..4f29080 100755
--- a/hca_self_test.ofed
+++ b/hca_self_test.ofed
@@ -168,7 +168,7 @@ else
 fi
 if [ $RPM_KER_VER -ne 0 ]; then
-    RPM_CUR_BOOTED_KER=`rpm -qa 2> null| grep kernel-ib | grep $(echo $BOOTED_KER | sed s/-/_/) | wc -l`
+    RPM_CUR_BOOTED_KER=`rpm -qa 2> /dev/null| grep kernel-ib | grep $(echo $BOOTED_KER | sed s/-/_/) | wc -l`
     if [ $RPM_CUR_BOOTED_KER -eq 0 ]; then
         echo -e "Host Driver RPM Check .................. ${red}FAIL"
         tput sgr0
--
1.6.3.rc3.12.gb7937

From sokar6012 at hotmail.com Mon May 11 07:06:29 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Mon, 11 May 2009 14:06:29 +0000
Subject: [ofa-general] Install ofed 1.4 on XEN
Message-ID:

Hi,
I installed ofed 1.4 (from http://alioth.debian.org/projects/pkg-ofed/ ) on Xen. All the packages installed fine and my HCAs are recognized, but when I try to build the kernel module with

module-assistant prepare
module-assistant build ofa-kernel

I get an error in the log file, which is:

Failed executing /usr/bin/quiltmake[1]: *** [kdist_config] Error 1
make[1]: Leaving directory `/usr/src/modules/ofa-kernel`
make: *** [kdist_build] error 2

Is there someone who knows this error?

Regards
Anthony
_________________________________________________________________
Call all your friends and family for free with Windows Live Messenger! Download it now!
http://www.windowslive.fr/messenger/1.asp -------------- next part -------------- An HTML attachment was scrubbed... URL: From gmpc at sanger.ac.uk Mon May 11 07:16:05 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Mon, 11 May 2009 15:16:05 +0100 Subject: [ofa-general] Install ofed 1.4 on XEN In-Reply-To: References: Message-ID: <4A083325.3090008@sanger.ac.uk> anthony garnier wrote: > Hi, > I installed ofed 1.4 (from http://alioth.debian.org/projects/pkg-ofed/ > ) on Xen , all the package are well installed and my HCa are recognized > but when I`m trying to build the kernel module with > > module-assistant prepare > module-assistant build ofa-kernel > > I got an error in the log file wich is : > Failed executing /usr/bin/quiltmake[1]: *** [kdist_config] Error 1 > make[1]: Leaving directory `/usr/src/modules/ofa-kernel` > make: *** [kdist_build] error 2 > > Is where someone who knows this error? Do you have the quilt package installed? It is required for the build. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
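Guy's question about quilt can be turned into a quick pre-build check. The following is only a minimal sketch, not something posted in the thread: the helper name `require_cmds` is hypothetical, and it assumes the build tools are expected to be found on `PATH`.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed only if every named
# command is found on PATH, and report each missing one on stderr.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        # command -v prints the resolved path; discard both output streams
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use before the build would be `require_cmds quilt module-assistant && module-assistant prepare && module-assistant build ofa-kernel`, so that a missing quilt is reported up front instead of as a kdist_config failure.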
From tziporet at mellanox.co.il Mon May 11 08:32:05 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 11 May 2009 18:32:05 +0300
Subject: [ofa-general] OFED 1.4.1-rc5 is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>

Hi,

OFED-1.4.1-rc5 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc5.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.1

Vladimir & Tziporet

========================================================================
Release information:
------------------------------
Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- OEL 4.5: 2.6.9-55.ELsmp
- OEL 5.2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11 GA: 2.6.27.13-1-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4.1-rc4
==========================
- mlx4_en: Updated driver to version 1.4.1 that was released by Mellanox
- Added an error in case of mlx4 library mismatch with kernel (due to XRC support)
- 3 bugs fixed (see attachment)
- Updated bonding package: ib-bonding-0.9.0-40
- Attached kernel git tree changes for details
- Updated documentation

Tasks that should be completed for GA (May 14):
====================================
1. High priority bug fixes - see list below
2.
Complete documentation update Open bugs: ======== bug_id bug_severity op_sys assigned_to 1596 cri Othe Jeffrey.C.Becker at nasa.gov openibd stop failed when nfs is loaded 1616 cri RHEL jon at opengridcomputing.com iommu_alloc error when running connectathon on ppc64 nfs ... 1571 cri RHEL vu at mellanox.com nfsrdma server crash @test5 connectathon basic test, -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed-1.4.1-rc4_rc5.log Type: application/octet-stream Size: 8057 bytes Desc: ofed-1.4.1-rc4_rc5.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed-1.4.1-rc5-fixed-bugs.csv Type: application/octet-stream Size: 469 bytes Desc: ofed-1.4.1-rc5-fixed-bugs.csv URL: From swise at opengridcomputing.com Mon May 11 08:35:09 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 May 2009 10:35:09 -0500 Subject: [ofa-general] [PATCH ofed-1.4.1 relnotes] Update cxgb3 release notes for 1.4.1 Message-ID: <20090511153509.17504.51102.stgit@build.ogc.int> Signed-off-by: Steve Wise --- cxgb3_release_notes.txt | 38 ++++++++------------------------------ 1 files changed, 8 insertions(+), 30 deletions(-) diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt index 5f2edaa..4df6779 100644 --- a/cxgb3_release_notes.txt +++ b/cxgb3_release_notes.txt @@ -1,42 +1,20 @@ Open Fabrics Enterprise Distribution (OFED) CHELSIO T3 RNIC RELEASE NOTES - December 2008 + May 2009 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the -Chelsio S310/320 and R310/320 series adapters. Make sure you choose the -'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms. +Chelsio S series adapters. Make sure you choose the 'cxgb3' and +'libcxgb3' options when generating your ofed-1.4.1 rpms. ============================================ -New for ofed-1.4 +New for ofed-1.4.1 ============================================ -- 7.0 Firmware support. 
See below for more information on updating -your RNIC to the latest firmware. - -- Memory Managment Extensions including: - - Fast register memory regions - - Invalidate local memory region work request - - Zero stag support via the local DMA lkey field - - Read with invalidate local stag work request - -- RDS bcopy mode enabled for iWARP devices - -============================================ -Recent Enhancements -============================================ +- NFSRDMA support. -- Various MPI libraries are enabled via a new iw_cxgb3 module option -called peer2peer. When loading iw_cxgb3, set peer2peer=1 to enable Intel -MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released -with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7). -This option must be set on all systems in your cluster. See more info -below on running these MPIs. NOTE: None of these MPIs are included in -the ofed-1.4 release. Contact the specific vendors for obtaining the -MPI code. Open MPI can be pulled from www.open-mpi.org. - -- Large memory registration. User applications can now register > 30MB -memory regions. +- 7.4 Firmware support. See below for more information on updating +your RNIC to the latest firmware. ============================================ Enabling Various MPIs @@ -64,7 +42,7 @@ chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" "" Intel MPI: ============= -The following env vars enable Intel MPI version 3.1.038. Place these +The following env vars enable Intel MPI version >= 3.1.038. 
Place these in your user env after installing and setting up Intel MPI: export RSH=ssh From swise at opengridcomputing.com Mon May 11 08:36:19 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 May 2009 10:36:19 -0500 Subject: [ofa-general] [PATCH ofed-1.4.1 relnotes] Update cxgb3 release notes for 1.4.1 Message-ID: <20090511153619.17559.19237.stgit@build.ogc.int> Signed-off-by: Steve Wise --- cxgb3_release_notes.txt | 36 +++++++----------------------------- 1 files changed, 7 insertions(+), 29 deletions(-) diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt index 5f2edaa..d1fdafc 100644 --- a/cxgb3_release_notes.txt +++ b/cxgb3_release_notes.txt @@ -1,42 +1,20 @@ Open Fabrics Enterprise Distribution (OFED) CHELSIO T3 RNIC RELEASE NOTES - December 2008 + May 2009 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the -Chelsio S310/320 and R310/320 series adapters. Make sure you choose the -'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms. +Chelsio S series adapters. Make sure you choose the 'cxgb3' and +'libcxgb3' options when generating your ofed-1.4.1 rpms. ============================================ -New for ofed-1.4 +New for ofed-1.4.1 ============================================ -- 7.0 Firmware support. See below for more information on updating -your RNIC to the latest firmware. - -- Memory Managment Extensions including: - - Fast register memory regions - - Invalidate local memory region work request - - Zero stag support via the local DMA lkey field - - Read with invalidate local stag work request +- NFSRDMA support. -- RDS bcopy mode enabled for iWARP devices - -============================================ -Recent Enhancements -============================================ - -- Various MPI libraries are enabled via a new iw_cxgb3 module option -called peer2peer. 
When loading iw_cxgb3, set peer2peer=1 to enable Intel -MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released -with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7). -This option must be set on all systems in your cluster. See more info -below on running these MPIs. NOTE: None of these MPIs are included in -the ofed-1.4 release. Contact the specific vendors for obtaining the -MPI code. Open MPI can be pulled from www.open-mpi.org. - -- Large memory registration. User applications can now register > 30MB -memory regions. +- 7.4 Firmware support. See below for more information on updating +your RNIC to the latest firmware. ============================================ Enabling Various MPIs From swise at opengridcomputing.com Mon May 11 08:37:21 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 May 2009 10:37:21 -0500 Subject: [ofa-general] [PATCH ofed-1.4.1 cxgb3 relnotes] Update cxgb3 release notes for 1.4.1 Message-ID: <20090511153721.17587.46386.stgit@build.ogc.int> Signed-off-by: Steve Wise --- cxgb3_release_notes.txt | 36 +++++++----------------------------- 1 files changed, 7 insertions(+), 29 deletions(-) diff --git a/cxgb3_release_notes.txt b/cxgb3_release_notes.txt index 5f2edaa..d1fdafc 100644 --- a/cxgb3_release_notes.txt +++ b/cxgb3_release_notes.txt @@ -1,42 +1,20 @@ Open Fabrics Enterprise Distribution (OFED) CHELSIO T3 RNIC RELEASE NOTES - December 2008 + May 2009 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the -Chelsio S310/320 and R310/320 series adapters. Make sure you choose the -'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms. +Chelsio S series adapters. Make sure you choose the 'cxgb3' and +'libcxgb3' options when generating your ofed-1.4.1 rpms. ============================================ -New for ofed-1.4 +New for ofed-1.4.1 ============================================ -- 7.0 Firmware support. 
See below for more information on updating -your RNIC to the latest firmware. - -- Memory Managment Extensions including: - - Fast register memory regions - - Invalidate local memory region work request - - Zero stag support via the local DMA lkey field - - Read with invalidate local stag work request +- NFSRDMA support. -- RDS bcopy mode enabled for iWARP devices - -============================================ -Recent Enhancements -============================================ - -- Various MPI libraries are enabled via a new iw_cxgb3 module option -called peer2peer. When loading iw_cxgb3, set peer2peer=1 to enable Intel -MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released -with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7). -This option must be set on all systems in your cluster. See more info -below on running these MPIs. NOTE: None of these MPIs are included in -the ofed-1.4 release. Contact the specific vendors for obtaining the -MPI code. Open MPI can be pulled from www.open-mpi.org. - -- Large memory registration. User applications can now register > 30MB -memory regions. +- 7.4 Firmware support. See below for more information on updating +your RNIC to the latest firmware. ============================================ Enabling Various MPIs From bmr at opengridcomputing.com Mon May 11 11:03:16 2009 From: bmr at opengridcomputing.com (Brian M. 
Rzycki)
Date: Mon, 11 May 2009 13:03:16 -0500
Subject: [ofa-general] OFED 1.4.1-rc5 symbol disagreements on SLES 11 SP0
Message-ID: <8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com>

Greetings,

I have the following SLES 11 SP0 machine:

# cat /proc/cpuinfo | grep '^model name'
model name : AMD Opteron(tm) Processor 244
model name : AMD Opteron(tm) Processor 244
# cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 0
# uname -a
Linux demo1 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
# free -m
             total       used       free     shared    buffers     cached
Mem:          1877        333       1544          0          7        148
-/+ buffers/cache:        177       1700
Swap:         2055          0       2055

I downloaded and installed OFED-1.4.1-rc5.tgz on the machine. I configured one of the Mellanox InfiniBand interfaces and then rebooted the system. When installing I chose 2,3:

2) Install OFED Software
3) All packages (all of Basic, HPC)

Upon reboot I see the following messages in the dmesg log:

ib_iser: disagrees about version of symbol ib_fmr_pool_unmap
ib_iser: Unknown symbol ib_fmr_pool_unmap
ib_iser: disagrees about version of symbol ib_create_cq
ib_iser: Unknown symbol ib_create_cq
ib_iser: disagrees about version of symbol rdma_resolve_addr
ib_iser: Unknown symbol rdma_resolve_addr
ib_iser: disagrees about version of symbol ib_create_fmr_pool
ib_iser: Unknown symbol ib_create_fmr_pool
ib_iser: disagrees about version of symbol ib_dereg_mr
ib_iser: Unknown symbol ib_dereg_mr
ib_iser: disagrees about version of symbol rdma_disconnect
ib_iser: Unknown symbol rdma_disconnect
ib_iser: disagrees about version of symbol rdma_resolve_route
ib_iser: Unknown symbol rdma_resolve_route
ib_iser: disagrees about version of symbol rdma_create_qp
ib_iser: Unknown symbol rdma_create_qp
ib_iser: disagrees about version of symbol ib_destroy_cq
ib_iser: Unknown symbol ib_destroy_cq
ib_iser: disagrees about version of symbol rdma_create_id
ib_iser: Unknown symbol rdma_create_id
ib_iser: disagrees
about version of symbol rdma_destroy_qp ib_iser: Unknown symbol rdma_destroy_qp ib_iser: disagrees about version of symbol ib_get_dma_mr ib_iser: Unknown symbol ib_get_dma_mr ib_iser: disagrees about version of symbol ib_alloc_pd ib_iser: Unknown symbol ib_alloc_pd ib_iser: disagrees about version of symbol rdma_connect ib_iser: Unknown symbol rdma_connect ib_iser: disagrees about version of symbol rdma_destroy_id ib_iser: Unknown symbol rdma_destroy_id ib_iser: disagrees about version of symbol ib_dealloc_pd ib_iser: Unknown symbol ib_dealloc_pd ib_iser: disagrees about version of symbol ib_fmr_pool_map_phys ib_iser: Unknown symbol ib_fmr_pool_map_phys I don't see ib_iser.ko is in the kernel updates directory: # find /lib/modules/$(uname -r)/updates -name '*.ko' | grep iser # It looks like the OFED installer isn't building ib_iser.ko even when I choose 2,3. Thanks, -Brian Rzycki From roel.kluin at gmail.com Mon May 11 13:25:07 2009 From: roel.kluin at gmail.com (Roel Kluin) Date: Mon, 11 May 2009 22:25:07 +0200 Subject: [ofa-general] [PATCH] ehca: remove driver_data direct access of struct device Message-ID: <4A0889A3.8020803@gmail.com> To avoid direct access to the driver_data pointer in struct device, the functions dev_get_drvdata() and dev_set_drvdata() should be used. 
Signed-off-by: Roel Kluin
---
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index 368311c..5acfb4c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct of_device *dev,
 	shca->ofdev = dev;
 	shca->ipz_hca_handle.handle = *handle;
-	dev->dev.driver_data = shca;
+	dev_set_drvdata(&dev->dev, shca);
 	ret = ehca_sense_attributes(shca);
 	if (ret < 0) {
@@ -878,7 +878,7 @@ probe1:
 static int __devexit ehca_remove(struct of_device *dev)
 {
-	struct ehca_shca *shca = dev->dev.driver_data;
+	struct ehca_shca *shca = dev_get_drvdata(&dev->dev);
 	unsigned long flags;
 	int ret;

From jgunthorpe at obsidianresearch.com Mon May 11 14:14:46 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 11 May 2009 15:14:46 -0600
Subject: [ofa-general] Re: [RFC] OpenSM and IPv6 Scalability Proposal
In-Reply-To: <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
References: <39C75744D164D948A170E9792AF8E7CA01F6F886@exil.voltaire.com> <694d48600905090332u29630a75lb2b1bbfb537c287e@mail.gmail.com>
Message-ID: <20090511211446.GC16395@obsidianresearch.com>

On Sat, May 09, 2009 at 01:32:06PM +0300, Eli Dorfman wrote:
> On Fri, May 8, 2009 at 4:57 PM, Hal Rosenstock wrote:
> > On Wed, May 6, 2009 at 6:24 AM, Slava Strebkov wrote:
> >>
> >> In addition to the original proposal we suggest allocating special MLID
> >> for the following MGIDs:
> >>   1. FF12401bxxxx000000000000FFFFFFFF - All Nodes
> >>   2. FF12401bxxxx00000000000000000001 - All hosts
> >>   3. FF12401bffff0000000000000000004d - all Gateways
> >>   4. FF12401bxxxx00000000000000000002 - all routers
> >>   5. FF12601bABCD000000000001ffxxxxxx - IPv6 SNM
> >
> > It turns out that collapsing multicast groups across PKeys on a single
> > MLID may not be such a good idea unless partition
> > enforcement by switches is disabled.
There should be different modes > > of collapsing based on this based on whether this is enabled or not. > > The idea is to allocate a different MLID per each of the above special MGIDs. In practice I think you'd be better to combine the All Nodes, All hosts, All Gatesways, All Routers and IPv4 broadcast group onto a single MLID and then distribute the SNM groups over some number of additional MLIDs in an intelligent manner. The specialty groups are not really used very much, while the purpose of the SNM group is for ND scalability. If your network is large enough to care about this then it is probably also large enough to benefit from multiple SNM groups.. Otherwise, you may as well lump them all together into the broadcast MLID. Jason From caitlin.bestler at gmail.com Mon May 11 14:23:58 2009 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 11 May 2009 14:23:58 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <20090507224806.GF16280@obsidianresearch.com> References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> Message-ID: <469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com> On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe wrote: > > Right, I was only thinking of a new driver call that was along the > lines of update_mr_pages() that just updates the HCA's mapping with > new page table entires atomically. It really would be device > specific. If there is no call available then unregister/register + > printk log is a fair generic implementation. > > To be clear, what I'm thinking is that this would only be invoked if Both the IBTA and RDMAC verbs were defined so that the meaning of L-Key/R-Key/STag + Address could not instantly change from "X" to "Y", only from "X" to NULL and then NULL to "Y". There are a lot of good reasons for this, especially for R-Keys or remotely accessible STags. 
It ensures that all operations that started when the translation was "X" are completed before any that will use the "Y" translation can commence. That is not something we want to accidentally undermine. There really isn't a reason why this rule needed to apply to entire Memory Regions. So I don't see a problem with allowing an update_mr_pages() verb that changes a portion of an MR map, perhaps by optimal machine specific hooks when available, without requiring the entire MR be specified. But it must preserve the guarantee that all operations initiated with translation "X" are completed before any operations for translation "Y" can be initiated. Preserving this guarantee should not be a problem for the free() then reallocate scenarios that have been discussed. From jgunthorpe at obsidianresearch.com Mon May 11 14:40:54 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 11 May 2009 15:40:54 -0600 Subject: [ofa-general] Memory registration redux In-Reply-To: <469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com> References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> <469958e00905111423p5fd15c58s2dfa57cbc4f64c26@mail.gmail.com> Message-ID: <20090511214054.GE16395@obsidianresearch.com> On Mon, May 11, 2009 at 02:23:58PM -0700, Caitlin Bestler wrote: > On Thu, May 7, 2009 at 3:48 PM, Jason Gunthorpe > wrote: > > > > Right, I was only thinking of a new driver call that was along the > > lines of update_mr_pages() that just updates the HCA's mapping with > > new page table entires atomically. It really would be device > > specific. If there is no call available then unregister/register + > > printk log is a fair generic implementation. 
> > > > To be clear, what I'm thinking is that this would only be invoked if > > Both the IBTA and RDMAC verbs were defined so that the meaning of > L-Key/R-Key/STag + Address could not instantly change from "X" to > "Y", only from "X" to NULL and then NULL to "Y". Well, this is sort of a grey area, in one sense the meaning isn't changing, just the underlying phyiscal memory is being moved around by the OS. The notion that the verbs refer to some sort of invisible underlying VM object is nice for an implementation but pretty useless for MPI.. > There are a lot of good reasons for this, especially for R-Keys or > remotely accessible STags. It ensures that all operations that > started when the translation was "X" are completed before any that > will use the "Y" translation can commence. That is not something we > want to accidentally undermine. I'm not sure I see how this helps, synchronizing all this is the responsibility of the application, if it wants to change the mapping then it should be able to, and if it does so with poor timing then it will have races and loose data . As it stands today there are already races where apps can loose data transfered after an unmap() or transfer the wrong data after a mmap() so the current model is already broken from that perspective. Of course an update verb has to operate with similar ordering guarantees to regsiter/unregister relative to the local work request queue - that is to say if the verb is done out-of-line with the WR queue then it must wait for the queue to flush before issuing the update to the HCA - just like unregister - and then wait for the verb to complete before returning to the app - just like register. And we all wish for userspace FRMRs... 
Jason

From greg at kroah.com Mon May 11 14:05:05 2009
From: greg at kroah.com (Greg KH)
Date: Mon, 11 May 2009 14:05:05 -0700
Subject: [ofa-general] Re: [PATCH] infiniband: ehca: remove driver_data direct access of struct device
In-Reply-To:
References: <20090504200022.GA22746@kroah.com>
Message-ID: <20090511210505.GA31999@kroah.com>

On Tue, May 05, 2009 at 07:13:16AM +0200, Hoang-Nam Nguyen wrote:
> Hi,
> This patch looks fine to me. Thanks!

Thanks for reviewing it, I've added your "Acked-by" to the patch in my tree.

greg k-h

From zhouyonghao at ict.ac.cn Mon May 11 19:17:37 2009
From: zhouyonghao at ict.ac.cn (zhouyonghao at ict.ac.cn)
Date: Tue, 12 May 2009 10:17:37 +0800 (CST)
Subject: [ofa-general] How to establish IB communication more effectively?
Message-ID:

Hi all,

I'm using libibverbs to build a cluster memory pool, and I use a TCP/IP handshake to exchange memory information and establish the connection before the IB communication starts. However, I found that this process takes a lot of time (100 ms on a 1 GbE LAN), so I want to use rdma_cm or ib_ucm to handle connection establishment. But I can't find sample code or API documentation; is there anything I missed?

BTW, how is communication usually established in current OFED? Any comparison or suggestion is appreciated; that will help me a lot.

From swise at opengridcomputing.com Mon May 11 20:06:26 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 May 2009 22:06:26 -0500
Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client.
In-Reply-To: <1242092150.16618.15.camel@heimdal.trondhjem.org> References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> Message-ID: <4A08E7B2.1010907@opengridcomputing.com> Trond Myklebust wrote: > On Mon, 2009-05-11 at 21:14 -0400, Tom Talpey wrote: > >> At 08:44 PM 5/11/2009, Trond Myklebust wrote: >> >>> On Mon, 2009-05-11 at 19:13 -0500, Steve Wise wrote: >>> >>>> Trond Myklebust wrote: >>>> >>>>> On Mon, 2009-05-11 at 17:25 -0500, Steve Wise wrote: >>>>> >>>>> >>>>>> Hey Trond, >>>>>> >>>>>> Will this bug fix make 2.6.30? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Steve. >>>>>> >>>>>> >>>>> Not in the form it is in now. As I've said earlier, I'm not happy about >>>>> the sunrpc layer having to circumvent ordinary type checking on >>>>> non-sunrpc structures. >>>>> >>>>> Cheers >>>>> Trond >>>>> >>>> How is it circumventing? It's currently incorrectly casting a pointer >>>> into a u64. That seems just broken to me. Also, its really the sunrpc >>>> rdma transport layer. It deals specifically with rdma. It _should_ >>>> know about rdma interfaces and types. >>>> >>> The fact is that I'm simply not interested enough in rdma to tolerate >>> hacks. If it isn't done cleanly, in a manner that I can maintain, then >>> the whole transport layer comes out... 
>>> >> I know exactly what you want - it's not what the code does now and >> it's not an accessor function to set the hardware's u64 field. What's >> needed is a new function to manage the entire RDMA triplet, and the >> memory registration behind it, in the OFA code side. Put the hardware >> goop below the line, IOW. I'll dust up Steve on this. >> > > This does indeed sound like what I'd looking for. > > There is a huge difference between having code that depends on well > defined rdma interfaces, and code that depends on rdma hacks. A piece of > code that requires casts from a non-local opaque type into another > protocol-dependent non-local type will definitely fall in the latter > category. I really don't care what the current code does, but a fix for > that code is something that does it _correctly_; it is not yet another > hack, whether or not it fixes a bug in the short term. > > Trond > > Trond, I get your point, and we can certainly work on improving this with the rdma developer community. But removing the one-line-broken cast will resolve a current crash situation for 2.6.30. Can't we get this fix in 2.6.30 and work on the API improvements for 2.6.31? I've CCed Roland and the ofa general list to get everyone involved in this thread so we can get this API design change going. I agree we can clean this up moving forward, but lets fix the broken 2.6.30 code. Will this work? Steve. From dotanba at gmail.com Mon May 11 22:58:37 2009 From: dotanba at gmail.com (Dotan Barak) Date: Tue, 12 May 2009 08:58:37 +0300 Subject: =?GB2312?Q?Re=3A_=5Bofa=2Dgeneral=5D_How_to_establish_IB_communcation_m?= =?GB2312?Q?ore_effectively=A3=BF?= In-Reply-To: References: Message-ID: <2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com> You can't find such samples in the verbs library; It can be found in the rdma cma library, you should search for rping or ucmatose. 
Dotan

2009/5/12 :
> Hi all,
>    I'm using libibverbs to build a cluster memory pool, and using TCP/IP
> handshake to exchange memory information and establish the connection
> before the IB communication. While I found this process costed a lot
> of time, 100ms in 1GEth LAN, so I want to use the rdma_cm or ib_ucm to
> handle the establishment. But I don't find sample code or API
> document, is there anything I missed?
>    BTW, how to establish communication in current OFED? Any comparison
> or suggestion is appreciated, that will help me a lot.
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From dorfman.eli at gmail.com Mon May 11 23:51:14 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 12 May 2009 09:51:14 +0300
Subject: [ofa-general] running ib diagnostics blocks
Message-ID: <4A091C62.8050906@gmail.com>

Hi,

What could be the reason that open("/dev/infiniband/umad0", O_RDWR|O_NONBLOCK) blocks and does not return? I did not find any errors in dmesg.

Thanks,
Eli

From or.gerlitz at gmail.com Tue May 12 00:38:05 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 12 May 2009 10:38:05 +0300
Subject: [ofa-general] running ib diagnostics blocks
In-Reply-To: <4A091C62.8050906@gmail.com>
References: <4A091C62.8050906@gmail.com>
Message-ID: <15ddcffd0905120038l34f71fa1m7fba3e4218021b11@mail.gmail.com>

Eli Dorfman wrote:
> What could be the reason that open("/dev/infiniband/umad0",
> O_RDWR|O_NONBLOCK)
> blocks and does not return. I did not find any errors in dmesg.

Eli,

You can examine the kernel stack of all processes, including yours, using sysrq ($ echo 1 > /proc/sys/kernel/sysrq and then $ echo t > /proc/sysrq-trigger) and looking in dmesg, e.g. to see if some other process which deals with umad is in the D state...

Or.
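The sysrq suggestion above can be sketched as a small set of shell helpers. This is a minimal sketch under stated assumptions, not something posted in the thread: the function names are hypothetical, it assumes a procps-style `ps` that accepts `-o pid,stat,wchan:32,comm`, and it assumes the SysRq switch lives at `/proc/sys/kernel/sysrq` (enabling it may not be strictly needed for the `/proc/sysrq-trigger` path, but should be harmless).

```shell
#!/bin/sh
# Keep the header row plus every row whose STAT column (field 2) starts
# with D, i.e. tasks in uninterruptible sleep.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List blocked tasks on the running system; no root needed.
list_blocked() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}

# Ask the kernel to log every task's stack, then show the tail of the
# kernel log. Needs root.
dump_all_stacks() {
    echo 1 > /proc/sys/kernel/sysrq   # enable SysRq functions
    echo t > /proc/sysrq-trigger      # dump task states to the kernel log
    dmesg | tail -n 200
}
```

Running `list_blocked` first is the cheaper step: if another process that has a umad device open shows up in the D state, its WCHAN column hints at where it is stuck, and `dump_all_stacks` can then be used to get the full kernel stack.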
From vlad at dev.mellanox.co.il Tue May 12 00:45:02 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 12 May 2009 10:45:02 +0300
Subject: [ofa-general] OFED 1.4.1-rc5 recall
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
Message-ID: <4A0928FE.9010007@dev.mellanox.co.il>

Hi,
OFED-1.4.1-rc5 was removed from OFA downloads.
OFED-1.4.1-rc6 will be released as soon as the dependency issue between
nfs and ib_core is resolved and tested.

Regards,
Vladimir

From monis at Voltaire.COM Tue May 12 02:31:40 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 12 May 2009 12:31:40 +0300
Subject: Re: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com>
References: <2f3bf9a60905112258m3a7365d8w1fe869cac9bfbd9a@mail.gmail.com>
Message-ID: <4A0941FC.6020606@Voltaire.COM>

Dotan Barak wrote:
> You can't find such samples in the verbs library; they can be found in
> the rdma cma library. You should search for rping or ucmatose.
>
> Dotan
>
> 2009/5/12 :
>> Hi all,
>> I'm using libibverbs to build a cluster memory pool, and using a TCP/IP
>> handshake to exchange memory information and establish the connection
>> before the IB communication. I found this process cost a lot
>> of time, 100ms in a 1GEth LAN, so I want to use the rdma_cm or ib_ucm to
>> handle the establishment. But I can't find sample code or API
>> documentation; is there anything I missed?
>> BTW, how should communication be established in current OFED? Any comparison
>> or suggestion is appreciated, that will help me a lot.
>> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > RPM of librdmacm also includes detailed man pages (man rdma_cm) From vlad at lists.openfabrics.org Tue May 12 03:25:07 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 12 May 2009 03:25:07 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090512-0200 daily build status Message-ID: <20090512102507.212B7E614E7@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 
Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hnguyen at linux.vnet.ibm.com Tue May 12 03:00:41 2009 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Tue, 12 May 2009 12:00:41 +0200 Subject: [ofa-general] [PATCH] perftest: send_lat/bw: Attach to multicast group when QP is in INIT Message-ID: <200905121200.41849.hnguyen@linux.vnet.ibm.com> Subject: [PATCH] perftest: send_lat/bw: Attach to multicast group when QP is in INIT If multicast is enabled, the current code of send_lat/bw attaches the QP to a multicast group while it's still in RESET state. Since the IB spec does not strictly specify the QP state for this operation and ehca's current firmware does not allow attaching in RESET, this patch moves the attach_mcast() function call after QP has been modified to INIT. 
See also discussion thread http://lists.openfabrics.org/pipermail/general/2009-May/059450.html Signed-off-by: Hoang-Nam Nguyen --- send_bw.c | 29 +++++++++++++++-------------- send_lat.c | 30 +++++++++++++++--------------- 2 files changed, 30 insertions(+), 29 deletions(-) diff --git a/send_bw.c b/send_bw.c index afabfa4..9a10ff3 100755 --- a/send_bw.c +++ b/send_bw.c @@ -421,20 +421,6 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, return NULL; } - if ((user_parm->connection_type==UD) && (user_parm->use_mcg) && (!user_parm->servername || user_parm->duplex)) { - union ibv_gid gid; - uint8_t mcg_gid[16] = MCG_GID; - - /* use the local QP number as part of the mcg */ - mcg_gid[11] = (user_parm->servername) ? 0 : 1; - *(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num; - memcpy(gid.raw, mcg_gid, 16); - - if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) { - fprintf(stderr, "Couldn't attach QP to mcg\n"); - return NULL; - } - } } { @@ -457,6 +443,21 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, fprintf(stderr, "Failed to modify UD QP to INIT\n"); return NULL; } + + if ((user_parm->use_mcg) && (!user_parm->servername || user_parm->duplex)) { + union ibv_gid gid; + uint8_t mcg_gid[16] = MCG_GID; + + /* use the local QP number as part of the mcg */ + mcg_gid[11] = (user_parm->servername) ? 
0 : 1; + *(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num; + memcpy(gid.raw, mcg_gid, 16); + + if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) { + fprintf(stderr, "Couldn't attach QP to mcg\n"); + return NULL; + } + } } else if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | diff --git a/send_lat.c b/send_lat.c index 1f21652..e1a1156 100755 --- a/send_lat.c +++ b/send_lat.c @@ -425,21 +425,6 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, fprintf(stderr, "Couldn't create QP\n"); return NULL; } - - if ((user_parm->connection_type==UD) && (user_parm->use_mcg)) { - union ibv_gid gid; - uint8_t mcg_gid[16] = MCG_GID; - - /* use the local QP number as part of the mcg */ - mcg_gid[11] = (user_parm->servername) ? 0 : 1; - *(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num; - memcpy(gid.raw, mcg_gid, 16); - - if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) { - fprintf(stderr, "Couldn't attach QP to mcg\n"); - return NULL; - } - } } { @@ -463,6 +448,21 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, fprintf(stderr, "Failed to modify UD QP to INIT\n"); return NULL; } + + if (user_parm->use_mcg) { + union ibv_gid gid; + uint8_t mcg_gid[16] = MCG_GID; + + /* use the local QP number as part of the mcg */ + mcg_gid[11] = (user_parm->servername) ? 
0 : 1; + *(uint32_t *)(&mcg_gid[12]) = ctx->qp->qp_num; + memcpy(gid.raw, mcg_gid, 16); + + if (ibv_attach_mcast(ctx->qp, &gid, MCG_LID)) { + fprintf(stderr, "Couldn't attach QP to mcg\n"); + return NULL; + } + } } else if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | -- 1.5.5 From hnrose at comcast.net Tue May 12 04:21:03 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 12 May 2009 07:21:03 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Reduce host name length Message-ID: <20090512112103.GA7715@comcast.net> to what's needed (based on NodeDescription length) Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h index 41a5810..33d1920 100644 --- a/opensm/include/opensm/osm_event_plugin.h +++ b/opensm/include/opensm/osm_event_plugin.h @@ -60,7 +60,7 @@ BEGIN_C_DECLS * *********/ -#define OSM_EPI_NODE_NAME_LEN (128) +#define OSM_EPI_NODE_NAME_LEN (65) struct osm_opensm; /** ========================================================================= diff --git a/opensm/include/opensm/osm_perfmgr_db.h b/opensm/include/opensm/osm_perfmgr_db.h index d0eff73..42a47bd 100644 --- a/opensm/include/opensm/osm_perfmgr_db.h +++ b/opensm/include/opensm/osm_perfmgr_db.h @@ -131,7 +131,7 @@ typedef struct db_port { /** ========================================================================= * group port counters for ports into the nodes */ -#define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE << 1) +#define NODE_NAME_SIZE (IB_NODE_DESCRIPTION_SIZE + 1) typedef struct db_node { cl_map_item_t map_item; /* must be first */ uint64_t node_guid; From tziporet at dev.mellanox.co.il Tue May 12 04:33:18 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 12 May 2009 14:33:18 +0300 Subject: [ofa-general] OFED 1.4.1-rc5 symbol disagreements on SLES 11 SP0 In-Reply-To: <8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com> References: 
<8D6B365D-9766-4908-873A-D4DC9BE3A9C9@opengridcomputing.com> Message-ID: <4A095E7E.6030607@mellanox.co.il> Brian M. Rzycki wrote: > Greetings, > > I have the following SLES 11 SP0 machine: > > > It looks like the OFED installer isn't building ib_iser.ko even when I > choose 2,3. > This is the same bug reported on rc5. We removed rc5 and will publish RC6 soon Tziporet From tziporet at dev.mellanox.co.il Tue May 12 04:34:10 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 12 May 2009 14:34:10 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.4.1 cxgb3 relnotes] Update cxgb3 release notes for 1.4.1 In-Reply-To: <20090511153721.17587.46386.stgit@build.ogc.int> References: <20090511153721.17587.46386.stgit@build.ogc.int> Message-ID: <4A095EB2.6070505@mellanox.co.il> Steve Wise wrote: > Signed-off-by: Steve Wise > --- > Applied Tziporet From HNGUYEN at de.ibm.com Tue May 12 05:40:13 2009 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Tue, 12 May 2009 14:40:13 +0200 Subject: [ofa-general] Re: [PATCH] ehca: remove driver_data direct access of struct device In-Reply-To: <4A0889A3.8020803@gmail.com> References: <4A0889A3.8020803@gmail.com> Message-ID: Hi, Thanks for this patch. But I've to NACK because 1) Greg KH has already done a similar patch in his tree. See http://lists.openfabrics.org/pipermail/general/2009-May/059442.html 2) Your patch is incomplete Regards Nam Roel Kluin wrote on 11.05.2009 22:25:07: > From: > > Roel Kluin > > To: > > Hoang-Nam Nguyen/Germany/IBM at IBMDE > > Cc: > > general at lists.openfabrics.org, lkml : > > Date: > > 11.05.2009 22:25 > > Subject: > > [PATCH] ehca: remove driver_data direct access of struct device > > To avoid direct access to the driver_data pointer in struct device, the > functions dev_get_drvdata() and dev_set_drvdata() should be used. 
> > Signed-off-by: Roel Kluin > --- > diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/ > infiniband/hw/ehca/ehca_main.c > index 368311c..5acfb4c 100644 > --- a/drivers/infiniband/hw/ehca/ehca_main.c > +++ b/drivers/infiniband/hw/ehca/ehca_main.c > @@ -749,7 +749,7 @@ static int __devinit ehca_probe(struct of_device *dev, > > shca->ofdev = dev; > shca->ipz_hca_handle.handle = *handle; > - dev->dev.driver_data = shca; > + dev_set_drvdata(&dev->dev, shca); > > ret = ehca_sense_attributes(shca); > if (ret < 0) { > @@ -878,7 +878,7 @@ probe1: > > static int __devexit ehca_remove(struct of_device *dev) > { > - struct ehca_shca *shca = dev->dev.driver_data; > + struct ehca_shca *shca = dev_get_drvdata(&dev->dev); > unsigned long flags; > int ret; > From swise at opengridcomputing.com Tue May 12 09:11:36 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 12 May 2009 11:11:36 -0500 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: <4A08E7B2.1010907@opengridcomputing.com> References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> Message-ID: <4A099FB8.7090603@opengridcomputing.com> Steve Wise wrote: >Trond Myklebust wrote (earlier in this thread): > > All I should need to know is that I can advertise either dma handles or > kernel VAs, and know that I can choose between two functions, say, > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to > initialise the ib_send_wr structure correctly. To align more with the rest of the fast_reg API in ib_verbs.h, I propose: static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr *send_wr, dma_addr_t dma); static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *send_wr, void *kva); Thoughts? From swise at opengridcomputing.com Tue May 12 09:23:31 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 12 May 2009 11:23:31 -0500 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: <4A099FB8.7090603@opengridcomputing.com> References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> <4A099FB8.7090603@opengridcomputing.com> Message-ID: <4A09A283.3090605@opengridcomputing.com> Steve Wise wrote: > Steve Wise wrote: > > >Trond Myklebust wrote (earlier in this thread): > > > > All I should need to know is that I can advertise either dma handles or > > kernel VAs, and know that I can choose between two functions, say, > > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to > > initialise the ib_send_wr structure correctly. > > > To align more with the rest of the fast_reg API in ib_verbs.h, I propose: > > static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr > *send_wr, dma_addr_t dma); > static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr > *send_wr, void *kva); > > Thoughts? > > uncompiled patch: diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c179318..fb56930 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1940,6 +1940,30 @@ static inline void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) } /** + * ib_init_fast_reg_iova_start_dma - initializes the iova_start field + * based on a dma address supplied by the user. 
+ * @wr - struct ib_send_wr pointer to be initialized + * @addr - dma_addr_t value to be used as the iova_start + */ +static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr *wr, + dma_addr_t addr) +{ + wr->wr.fast_reg.iova_start = addr; +} + +/** + * ib_init_fast_reg_iova_start_kva - initializes the iova_start field + * based on a kernel virtual address supplied by the user. + * @wr - struct ib_send_wr pointer to be initialized + * @addr - void * address to be used as the iova_start + */ +static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *wr, + void *addr) +{ + wr->wr.fast_reg.iova_start = (unsigned long)addr; +} + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. */ From sashak at voltaire.com Tue May 12 10:50:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 20:50:57 +0300 Subject: [ofa-general] Re: [PATCH] Fix 2 formatting diff's from old ibqueryerrors. In-Reply-To: <20090506095114.3893f4aa.weiny2@llnl.gov> References: <20090506095114.3893f4aa.weiny2@llnl.gov> Message-ID: <20090512175057.GA27108@sashak.voltaire.com> On 09:51 Wed 06 May , Ira Weiny wrote: > 2 changes I noted in the output from ibqueryerrors. > > "Link Info:" was not being printed when "-r" was used. > > The "header": Errors for 0x "" > > Should only be printed when errors are found. > > The following patch cleans those up. > > Ira > > > From: Ira Weiny > Date: Tue, 28 Apr 2009 14:39:11 -0700 > Subject: [PATCH] Fix 2 formatting diff's from old ibqueryerrors. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From sashak at voltaire.com Tue May 12 10:52:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 20:52:02 +0300 Subject: [ofa-general] Re: [PATCH] Clean up printing of switch heading when printing "down links" only. 
In-Reply-To: <20090506095303.f11659f1.weiny2@llnl.gov> References: <20090506095303.f11659f1.weiny2@llnl.gov> Message-ID: <20090512175202.GB27108@sashak.voltaire.com> On 09:53 Wed 06 May , Ira Weiny wrote: > Another corner case: If there are no down links on a switch and "-d" is selected then the header for that switch should not be printed. > > Ira > > > From: Ira Weiny > Date: Thu, 30 Apr 2009 13:41:38 -0700 > Subject: [PATCH] Clean up printing of switch heading when printing "down links" only. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From sashak at voltaire.com Tue May 12 10:55:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 20:55:41 +0300 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com> Message-ID: <20090512175541.GC27108@sashak.voltaire.com> On 06:46 Sun 10 May , Hal Rosenstock wrote: > > Sasha has been adamant that any device supplied data errors use > something other than ERROR log level. But I think that VERBOSE is more appropriate than for such cases than just DEBUG. Another way is to add another "level" for subnet warnings. Sasha From sashak at voltaire.com Tue May 12 10:56:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 20:56:03 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: <20090507143346.GA1713@comcast.net> References: <20090507143346.GA1713@comcast.net> Message-ID: <20090512175603.GD27108@sashak.voltaire.com> On 10:33 Thu 07 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Tue May 12 11:12:17 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:12:17 +0300 Subject: [ofa-general] [PATCH] saquery: fix -c arguement In-Reply-To: <4A06BAC5.40405@voltaire.com> References: <4A030525.7090209@voltaire.com> <4A06BAC5.40405@voltaire.com> Message-ID: <20090512181217.GE27108@sashak.voltaire.com> On 14:30 Sun 10 May , Doron Shoham wrote: > set SAQUERY_CMD_CLASS_PORT_INFO instead of CLASS_PORT_INFO > > Signed-off-by: Doron Shoham Applied. Thanks. Sasha From hal.rosenstock at gmail.com Tue May 12 11:06:55 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 12 May 2009 14:06:55 -0400 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: <20090512175541.GC27108@sashak.voltaire.com> References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com> <20090512175541.GC27108@sashak.voltaire.com> Message-ID: On Tue, May 12, 2009 at 1:55 PM, Sasha Khapyorsky wrote: > On 06:46 Sun 10 May     , Hal Rosenstock wrote: >> >> Sasha has been adamant that any device supplied data errors use >> something other than ERROR log level. > > But I think that VERBOSE is more appropriate than for such cases than > just DEBUG. Yes, VERBOSE level is more consistent than DEBUG level with what is done elsewhere in OpenSM. -- Hal > Another way is to add another "level" for subnet warnings. > > Sasha > From sashak at voltaire.com Tue May 12 11:15:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:15:04 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_lid_mgr.c bug in opensm LID assignment In-Reply-To: <4A056913.7010700@gmail.com> References: <4A056913.7010700@gmail.com> Message-ID: <20090512181504.GF27108@sashak.voltaire.com> On 14:29 Sat 09 May , Eli Dorfman (Voltaire) wrote: > lid persistent range wrong check > used lids were not properly chekced which > caused duplicate lid assignment in some cases. 
> > Signed-off-by: Eli Dorfman Applied. Thanks. Sasha From sashak at voltaire.com Tue May 12 11:22:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:22:57 +0300 Subject: [ofa-general] Re: [PATCH] opensm/PerfMgr: Reduce host name length In-Reply-To: <20090512112103.GA7715@comcast.net> References: <20090512112103.GA7715@comcast.net> Message-ID: <20090512182257.GG27108@sashak.voltaire.com> On 07:21 Tue 12 May , Hal Rosenstock wrote: > > to what's needed (based on NodeDescription length) > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue May 12 11:39:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:39:25 +0300 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Remove error number from debug level log message In-Reply-To: References: <20090507143346.GA1713@comcast.net> <4A067764.3040306@gmail.com> <20090512175541.GC27108@sashak.voltaire.com> Message-ID: <20090512183925.GI27108@sashak.voltaire.com> On 14:06 Tue 12 May , Hal Rosenstock wrote: > > Yes, VERBOSE level is more consistent than DEBUG level with what is > done elsewhere in OpenSM. Ok, I'm changing to VERBOSE. Sasha From hnrose at comcast.net Tue May 12 11:32:33 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 12 May 2009 14:32:33 -0400 Subject: [ofa-general] [PATCH] opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message to VERBOSE Message-ID: <20090512183233.GA1113@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c index 17bac73..cb8b153 100644 --- a/opensm/opensm/osm_port.c +++ b/opensm/opensm/osm_port.c @@ -381,7 +381,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, op_vls = p_subn->opt.max_op_vls; if (op_vls == 0) { - OSM_LOG(p_log, OSM_LOG_DEBUG, + OSM_LOG(p_log, OSM_LOG_VERBOSE, "Invalid OP_VLS = 0. 
Forcing correction to 1 (VL0)\n"); op_vls = 1; } From sashak at voltaire.com Tue May 12 11:45:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:45:42 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_port.c: Change log level of Invalid OP_VLS 0 message to VERBOSE In-Reply-To: <20090512183233.GA1113@comcast.net> References: <20090512183233.GA1113@comcast.net> Message-ID: <20090512184542.GJ27108@sashak.voltaire.com> On 14:32 Tue 12 May , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Oops, I committed this already in the local branch :) Will use your version instead. Applied. Thanks. Sasha From sashak at voltaire.com Tue May 12 11:51:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:51:41 +0300 Subject: [ofa-general] Re: [PATCH 0/2] osm_port.c: do not enforce PortInfo update if max_op_vls = 0 In-Reply-To: <4A030465.90009@voltaire.com> References: <4A00386E.2050300@voltaire.com> <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A030465.90009@voltaire.com> Message-ID: <20090512185141.GK27108@sashak.voltaire.com> On 18:55 Thu 07 May , Doron Shoham wrote: > do not enforce PortInfo update if max_op_vls = 0 > > Signed-off-by: Doron Shoham > --- > opensm/opensm/osm_port.c | 2 +- > opensm/opensm/osm_subnet.c | 8 ++++++++ > 2 files changed, 9 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 4d1bbf2..8bf1767 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -383,7 +383,7 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > op_vls = 1; > } > /* support user limitation of max_op_vls */ > - if (op_vls > p_subn->opt.max_op_vls) > + if (p_subn->opt.max_op_vls && op_vls > p_subn->opt.max_op_vls) Then you likely want to drop '0' value from the comment in config file template (diff below), no? 
Sasha > op_vls = p_subn->opt.max_op_vls; > > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index ec15f8a..71fc7a0 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1288,6 +1288,14 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > "# switch port connected to a CA or router port\n" > "leaf_head_of_queue_lifetime 0x%02x\n\n" > "# Limit the maximal operational VLs\n" > + "# Virtual Lanes operational on this port\n" > + "# Values are (IB Spec 1.2.1, 14.2.5.6 Table 146 \"PortInfo\")\n" > + "# 0: No change; valid only on Set()\n" > + "# 1: VL0\n" > + "# 2: VL0, VL1\n" > + "# 3: VL0 - VL3\n" > + "# 4: VL0 - VL7\n" > + "# 5: VL0 - VL14\n" > "max_op_vls %u\n\n" > "# Force PortInfo:LinkSpeedEnabled on switch ports\n" > "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n" > -- > 1.5.4 > From sashak at voltaire.com Tue May 12 11:55:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 21:55:20 +0300 Subject: [ofa-general] [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A067905.5060401@gmail.com> References: <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com> <4A067905.5060401@gmail.com> Message-ID: <20090512185520.GL27108@sashak.voltaire.com> On 09:49 Sun 10 May , Eli Dorfman (Voltaire) wrote: > Doron Shoham wrote: > > check if op_vls = 0 before max_op_vls comparison > > > > Signed-off-by: Doron Shoham > > --- > > opensm/opensm/osm_port.c | 9 +++++---- > > 1 files changed, 5 insertions(+), 4 deletions(-) > > > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > > index 2e6c642..4d1bbf2 100644 > > --- a/opensm/opensm/osm_port.c > > +++ b/opensm/opensm/osm_port.c > > @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > > } else > > op_vls = 
ib_port_info_get_op_vls(&p_physp->port_info); > > > > - /* support user limitation of max_op_vls */ > > - if (op_vls > p_subn->opt.max_op_vls) > > - op_vls = p_subn->opt.max_op_vls; > > - > > if (op_vls == 0) { > > + /* for non compliant implementations */ > > OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > > I think that level should be OSM_LOG_ERROR. OSM_LOG_VERBOSE is better - it is not OpenSM error. And also - don't mix two ideas in a single patch :) . Sasha From sashak at voltaire.com Tue May 12 12:00:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 22:00:36 +0300 Subject: [ofa-general] Re: [PATCH 1/2] osm_port.c: check if op_vls = 0 before max_op_vls comparison In-Reply-To: <4A068D87.6040801@voltaire.com> References: <4A0043B0.3030400@gmail.com> <20090506112135.GG10145@sk> <4A029038.2040603@voltaire.com> <20090507115212.GC19236@sk> <4A030377.6050202@voltaire.com> <4A03043C.4010709@voltaire.com> <4A067905.5060401@gmail.com> <4A068D87.6040801@voltaire.com> Message-ID: <20090512190036.GM27108@sashak.voltaire.com> On 11:17 Sun 10 May , Doron Shoham wrote: > check if op_vls = 0 before max_op_vls comparison > > Signed-off-by: Doron Shoham Applied. Thanks. See comments below. > --- > opensm/opensm/osm_port.c | 11 ++++++----- > 1 files changed, 6 insertions(+), 5 deletions(-) > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 2e6c642..41b67ad 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -376,15 +376,16 @@ uint8_t osm_physp_calc_link_op_vls(IN osm_log_t * p_log, > } else > op_vls = ib_port_info_get_op_vls(&p_physp->port_info); > > - /* support user limitation of max_op_vls */ > - if (op_vls > p_subn->opt.max_op_vls) > - op_vls = p_subn->opt.max_op_vls; > - > if (op_vls == 0) { > - OSM_LOG(p_log, OSM_LOG_DEBUG, "ERR 4102: " > + /* for non compliant implementations */ ^^^^ Please care to not introduce trailing spaces. 
> + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4102: " Log level is OSM_LOG_VERBOSE now - merged with Hal's patches. Sasha From sashak at voltaire.com Tue May 12 12:17:26 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 May 2009 22:17:26 +0300 Subject: [ofa-general] Re: [RFC][PATCH] ibnetdiscover: remove report of max hops discovered. In-Reply-To: <20090506180140.6213971e.weiny2@llnl.gov> References: <20090504151005.9a565bc5.weiny2@llnl.gov> <1241543312.18144.18.camel@auk31.llnl.gov> <20090506180140.6213971e.weiny2@llnl.gov> Message-ID: <20090512191726.GN27108@sashak.voltaire.com> On 18:01 Wed 06 May , Ira Weiny wrote: > The number reported as "max hops" from ibnetdiscover can change depending on > the algorithm used to discover the fabric. As Hal says in the message below > using this number is therefore dangerous. > > If no one is currently using this number I propose the patch below which > removes the "max hops discovered" from the output. I don't know about usages, it is rather additional ibnetdiscover info (similar to date/time printout). But it was nice to have - it provides some idea about what ibnetdiscover did. If you want to remove it anyway, at least print it in verbose mode ('-v'). Sasha From arlin.r.davis at intel.com Tue May 12 12:21:23 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 12 May 2009 12:21:23 -0700 Subject: [ofa-general] How to establish IB communcation more effectively? In-Reply-To: References: Message-ID: >Hi all, > I'm using libibverbs to build a cluster memory pool, and >using TCP/IP >handshake to exchange memory information and establish the connection >before the IB communication. While I found this process costed a lot >of time, 100ms in 1GEth LAN, so I want to use the rdma_cm or ib_ucm to >handle the establishment. But I dont't find sample code or API >document, is there anything I missed? > BTW, how to establish communication in current OFED? 
>Any comparison or suggestion is appreciated, that will help me a lot.
>

What scale are you targeting? Your single connection number seems high.

For a connection (socket connect, exchanging QP info, private data, qp modify)
using uDAPL socket cm versus rdma_cm I get:

socket_cm on 1Ge == ~900us
socket_cm on IPoIB (mlx4 ddr) == ~400us
rdma_cm on IB (mlx4 ddr) == ~2200us

As you can see, the path record queries via rdma_cm add a substantial
penalty. With larger scale clusters this really starts to hurt.

You can look at uDAPL (dapl/openib_cma and dapl/openib_scm) source for
examples of a socket cm implementation vs rdma_cm.

With the socket cm version we ran up to 14,400 cores with no problems
using Intel MPI. However, with rdma_cm we had problems reaching 1000 cores
due to IPoIB ARP storms and SA path record query issues.

If someone would step up and provide a scalable SA caching solution in OFED
then rdma_cm could possibly work for us again. Any takers? :^)

-arlin

From or.gerlitz at gmail.com Tue May 12 13:11:46 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 12 May 2009 23:11:46 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To:
References:
Message-ID: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>

Davis, Arlin R wrote:
> For a connection (socket connect, exchanging QP info, private data, qp modify)
> using uDAPL socket cm versus rdma_cm I get:
> socket_cm on 1Ge == ~900us
> socket_cm on IPoIB (mlx4 ddr) == ~400us
> rdma_cm on IB (mlx4 ddr) == ~2200us
> As you can see, the path record queries via rdma_cm add a substantial penalty.

Hi Arlin,

Just to make sure we're on the same page: both IPoIB and the RDMA-CM
use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
rdma_resolve_route), going into details, things look like:

with the rdma-cm:

rdma_resolve_addr
A --> * ARP request (broadcast)
B --> A ARP reply (unicast, before that B does SA path query)

rdma_resolve_route
A does SA path query

rdma_connect
A --> B CM REQ
B --> A CM REP
A --> B CM RTU

with the socket cm / ipoib:

socket connect
A --> * ARP request (broadcast)
B --> A ARP reply (unicast, before that B does SA path query)
A --> B TCP SYN (unicast, A does SA path query!)
B --> A TCP SYN + ACK
A --> B TCP ACK

Looking on the differences between the flows, we can see that --both--
flows have --two-- path queries, so the 400us vs 2200us difference
can't be related to that. So, is it possible that you have counted
rdma_create_qp in the rdma-cm accounting and didn't count
ibv_create_qp in the scm accounting?

Or.

From sean.hefty at intel.com Tue May 12 13:53:37 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 12 May 2009 13:53:37 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
References: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
Message-ID: <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>

>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries

But ipoib caches its path records...

- Sean

From arlin.r.davis at intel.com Tue May 12 14:23:37 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 12 May 2009 14:23:37 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
References: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
Message-ID:

>Davis, Arlin R wrote:
>> For a connection (socket connect, exchanging QP info, private data, qp modify)
>> using uDAPL socket cm versus rdma_cm I get:
>> socket_cm on 1Ge == ~900us
>> socket_cm on IPoIB (mlx4 ddr) == ~400us
>> rdma_cm on IB (mlx4 ddr) == ~2200us
>> As you can see, the path record queries via rdma_cm add a substantial penalty.
>
>Hi Arlin,
>
>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>rdma_resolve_route), going into details, things look like:

I am running IPoIB connected so I assume there is no path query
and I see no difference in IPoIB unconnected mode so I also assume
it caches path records during ARP processing. Can someone confirm?

ARP cache is also hit in all these cases so you can take ARP
request/reply out.

However, with rdma_cm we actually have to pick up the
RDMA_CM_EVENT_ADDR_RESOLVED (arp) event before moving on to the
rdma_resolve_route (path record), and then wait for
RDMA_CM_EVENT_ROUTE_RESOLVED event before moving on to the
rdma_connect call, and then finally wait for
RDMA_CM_EVENT_ESTABLISHED. You start to get the picture of where my
time goes? Not only do we have path record query delays we have a
3 step event processing (waiting/waking on each) just to get connected.

My measurements are on top of uDAPL so everything is equal.
I simply added some timers to dtest around connect and wait for
connection event:

start_timer
dat_ep_connect()
dat_evd_wait()
stop_timer

For example (client side):

eth0 socket_cm: dtest -P ofa-v2-mlx4_0-1 -h cst-55-eth0 -t
IPoIB socket_cm: dtest -P ofa-v2-mlx4_0-1 -h cst-55-ib0 -t
rdma_cm: dtest -P ofa-v2-ib0 -h cst-55-ib0 -t

-arlin

From or.gerlitz at gmail.com Tue May 12 14:32:38 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 13 May 2009 00:32:38 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>
References: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com> <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com>
Message-ID: <15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>

>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries

> But ipoib caches its path records...

Yes, of course. But, to start with, let's analyze the case of each node
running --one-- rank and then take it from there to the case where
each node runs C ranks.

Or.

From or.gerlitz at gmail.com Tue May 12 14:50:02 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 13 May 2009 00:50:02 +0300
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To:
References: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com>
Message-ID: <15ddcffd0905121450r44d43c45vc9bbdc88e5ba4557@mail.gmail.com>

Davis, Arlin R wrote:
>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>>rdma_resolve_route), going into details, things look like:

> I am running IPoIB connected so I assume there is no path query
> and I see no difference in IPoIB unconnected mode so I also assume
> it caches path records during ARP processing. Can someone confirm?

Arlin,

Both the datagram and connected mode issue path query (it's the way IB works).
The datagram mode uses the IB UD (Unreliable Datagram) transport and
once the path is resolved it creates IB AH (Address Handle) which is
used in conjunction with the UD QP. The connected mode uses the IB RC
(Reliable Connection) transport, so path info is used to establish its
connection through the IB CM.

> ARP cache is also hit in all these cases so you can take ARP request/reply out.

I am not with you: by "ARP cache" I assume you refer to the networking
stack neighbour table, correct? so this cache has the entries since
the IPoIB network was also used to spawn the job?

> However, with rdma_cm we actually have to pick up the ADDR_RESOLVED (arp)
> event before moving on to the rdma_resolve_route (path record), and then wait for
> ROUTE_RESOLVED event before moving on to the rdma_connect call, and then
> finally wait for ESTABLISHED. You start to get the picture of where my time goes?
> Not only do we have path record query delays we have a 3 step event
> processing (waiting/waking on each) just to get connected.

Yes, this sounds like a potentially big difference from the TCP case,
let's see how many kernel --> user events we have in both methods --

rdma-cm active side
-----------------------
addr-resolved
route-resolved
established

rdma-cm passive side
--------------------------
connection-request
established

scm active side
------------------
connected

scm passive side
--------------------
connection request
connected

in the rdma-cm framework there are three kernel --> user
transitions/events for the active and two for the passive, where in
the scm framework there are two for the passive but only one for the
active. Also counting user --> kernel transitions, in the rdma-cm
active side there are three vs only one in the scm. This sounds like
where things would probably make a difference.
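
The active-side counts above can be tallied with a small model. This is only a sketch of the bookkeeping, not a measurement and not code from uDAPL or librdmacm: the struct, the enum and the function name are made up for illustration, the rdma-cm entries are just labels taken from this thread, and the scm flow is reduced to the single call and single event counted above.

```c
#include <stddef.h>

/* Direction of one user/kernel boundary crossing during connection setup. */
enum crossing { USER_TO_KERNEL, KERNEL_TO_USER };

/* One step of a connection-setup flow (hypothetical type, labels only). */
struct step {
    enum crossing dir;
    const char *name;
};

/* rdma-cm active side: every call is answered by an asynchronous event. */
static const struct step rdma_cm_active[] = {
    { USER_TO_KERNEL, "rdma_resolve_addr" },
    { KERNEL_TO_USER, "addr-resolved" },
    { USER_TO_KERNEL, "rdma_resolve_route" },
    { KERNEL_TO_USER, "route-resolved" },
    { USER_TO_KERNEL, "rdma_connect" },
    { KERNEL_TO_USER, "established" },
};

/* scm active side as counted in this thread: one call, one wakeup. */
static const struct step scm_active[] = {
    { USER_TO_KERNEL, "connect" },
    { KERNEL_TO_USER, "connected" },
};

#define NSTEPS(a) (sizeof(a) / sizeof((a)[0]))

/* Count how many steps of a flow cross the boundary in the given direction. */
static size_t count_crossings(const struct step *flow, size_t n,
                              enum crossing dir)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        if (flow[i].dir == dir)
            total++;
    return total;
}
```

Calling count_crossings(rdma_cm_active, NSTEPS(rdma_cm_active), KERNEL_TO_USER) gives the three wakeups of the rdma-cm active side, against one for scm_active; each wakeup is a point where the connecting process sleeps and has to be rescheduled.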

I believe it could be fairly easy to have the kernel rdma ucm module
do two successive calls (route resolve and connect) once the local
address is resolved, since at that point the user space consumer can
create their QP, etc.

> Not only do we have path record query delays

So we agree that it's path query --delays-- and for one rank per node
it's the same # of path queries? (Sean)

Or.

From sean.hefty at intel.com Tue May 12 14:55:12 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 12 May 2009 14:55:12 -0700
Subject: [ofa-general] How to establish IB communcation more effectively?
In-Reply-To: <15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>
References: <15ddcffd0905121311j2c450aech7b0f933a6ce5ef91@mail.gmail.com> <89088CD4A69046AF95E5D54D970C5E34@amr.corp.intel.com> <15ddcffd0905121432y2209860fo90f9cfbaa04bc41d@mail.gmail.com>
Message-ID: <2510151DA1D14A058528698BAE8C08C5@amr.corp.intel.com>

>Yes, of course. But, to start with, let's analyze the case of each node
>running --one-- rank and then take it from there to the case where
>each node runs C ranks.

The caching is independent of running MPI though. To get a fair
comparison, you'd probably have to reboot the entire cluster before
running the test and ensure that no other communication between the
nodes occurs over ipoib.

For myself, I'm not sure that the tests are the same. The DAPL
providers create and modify the QPs differently. I'd need to walk
through the code to see whether QP creation time is included and
verify that the QP modify calls are the same.

As for responding to the initial question, using sockets with
hard-coded values seems to be the most common way to establish IB
connections at scale, though I would guess that using the ib_cm with
hard-coded values would work about the same.

- Sean

From ralph.campbell at qlogic.com Tue May 12 15:10:21 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 12 May 2009 15:10:21 -0700
Subject: [ofa-general] qperf: destroy QPs before destroying any other objects
Message-ID: <1242166221.3504.14.camel@chromite.mv.qlogic.com>

The QP contains references to the protection domain (PD), memory
regions (MR), address handles, completion queues (CQ), address
handles (AH), etc.
The QP should be destroyed before any other objects are destroyed
so that the referenced object is not busy.

Signed-off-by: Ralph Campbell

diff --git a/src/rdma.c b/src/rdma.c
index 845c35f..492d240 100644
--- a/src/rdma.c
+++ b/src/rdma.c
@@ -1577,6 +1577,10 @@ show_node_info(DEVICE *dev)
 static void
 rd_close(DEVICE *dev)
 {
+    if (Req.use_cm)
+        cm_close(dev);
+    else
+        ib_close(dev);
     if (dev->ah)
         ibv_destroy_ah(dev->ah);
     if (dev->cq)
@@ -1585,10 +1589,6 @@ rd_close(DEVICE *dev)
         ibv_dealloc_pd(dev->pd);
     if (dev->channel)
         ibv_destroy_comp_channel(dev->channel);
-    if (Req.use_cm)
-        cm_close(dev);
-    else
-        ib_close(dev);
     rd_mrfree(dev);

     memset(dev, 0, sizeof(*dev));

From vlad at lists.openfabrics.org Wed May 13 03:22:28 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 13 May 2009 03:22:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090513-0200 daily build status
Message-ID: <20090513102228.A11F0E6159A@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with
linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From johann at georgex.org Tue May 12 18:55:03 2009
From: johann at georgex.org (johann at georgex.org)
Date: Tue, 12 May 2009 18:55:03 -0700
Subject: [ofa-general] Re: qperf: destroy QPs before destroying any other objects
In-Reply-To: <1242166221.3504.14.camel@chromite.mv.qlogic.com>
References: <1242166221.3504.14.camel@chromite.mv.qlogic.com>
Message-ID: <20090513015503.GA30869@georgex.org>

Ralph,

I've applied the patch and have committed it to the OFED git
repository. Let me know if there is anything else I need to do.

Johann

On Tue, May 12, 2009 at 03:10:21PM -0700, Ralph Campbell wrote:
> The QP contains references to the protection domain (PD), memory
> regions (MR), address handles, completion queues (CQ), address
> handles (AH), etc.
> The QP should be destroyed before any other objects are destroyed
> so that the referenced object is not busy.
>
> Signed-off-by: Ralph Campbell
>
> diff --git a/src/rdma.c b/src/rdma.c
> index 845c35f..492d240 100644
> --- a/src/rdma.c
> +++ b/src/rdma.c
> @@ -1577,6 +1577,10 @@ show_node_info(DEVICE *dev)
>  static void
>  rd_close(DEVICE *dev)
>  {
> +    if (Req.use_cm)
> +        cm_close(dev);
> +    else
> +        ib_close(dev);
>      if (dev->ah)
>          ibv_destroy_ah(dev->ah);
>      if (dev->cq)
> @@ -1585,10 +1589,6 @@ rd_close(DEVICE *dev)
>          ibv_dealloc_pd(dev->pd);
>      if (dev->channel)
>          ibv_destroy_comp_channel(dev->channel);
> -    if (Req.use_cm)
> -        cm_close(dev);
> -    else
> -        ib_close(dev);
>      rd_mrfree(dev);
>
>      memset(dev, 0, sizeof(*dev));
>

From weiny2 at llnl.gov Wed May 13 09:30:20 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 13 May 2009 09:30:20 -0700
Subject: [ofa-general] [PATCH V2] ibnetdiscover: only report max hops discovered when requested
Message-ID: <20090513093020.f85f2a0a.weiny2@llnl.gov>

Added "-m" flag to report this information if the user wants it.

I also changed the text in the message which says "reported max hops
discovered". I don't know if we want to change that text to something
else but I wanted to indicate this number is not constant and may
change. This is true not just if you change the algorithm of discovery
but also if you run from different nodes.

Thoughts,
Ira

From: Ira Weiny
Date: Wed, 6 May 2009 17:56:23 -0700
Subject: [PATCH] ibnetdiscover: only report max hops discovered when requested

Signed-off-by: Ira Weiny
---
 infiniband-diags/src/ibnetdiscover.c | 9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 1799618..98ff1e4 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -65,6 +65,8 @@ static FILE *f;
 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;

+static int max_hops = 0;
+
 /**
  * Define our own conversion functions to maintain compatibility with the old
  * ibnetdiscover which did not use the ibmad conversion functions.
@@ -448,7 +450,8 @@ dump_topology(int group, ibnd_fabric_t *fabric)
 	struct iter_user_data iter_user_data;

 	fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
-	fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered);
+	if (max_hops)
+		fprintf(f, "# Reported max hops discovered: %d\n", fabric->maxhops_discovered);
 	fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n",
 		fabric->from_node->guid,
 		mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F));
@@ -628,6 +631,9 @@ static int process_opt(void *context, int ch, char *optarg)
 	case 'p':
 		ports_report = 1;
 		break;
+	case 'm':
+		max_hops = 1;
+		break;
 	default:
 		return -1;
 	}
@@ -651,6 +657,7 @@ int main(int argc, char **argv)
 		{ "Router_list", 'R', 0, NULL, "list of connected routers" },
 		{ "node-name-map", 1, 1, "", "node name map file" },
 		{ "ports", 'p', 0, NULL, "obtain a ports report" },
+		{ "max_hops", 'm', 0, NULL, "report max hops discovered by the library" },
 		{ 0 }
 	};
 	char usage_args[] = "[topology-file]";
--
1.5.4.5

From jsquyres at cisco.com Wed May 13 10:49:27 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 13:49:27 -0400
Subject: [ofa-general] RPM version numbers are the same
Message-ID: <3DAD41A5-1435-4D2D-8D07-7B5FD8038E1E@cisco.com>

Why are the RPM version numbers the same between rc5 and the current
1.4.1 nightlies?

--
Jeff Squyres
Cisco Systems

From roel.kluin at gmail.com Wed May 13 11:33:43 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Wed, 13 May 2009 20:33:43 +0200
Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020() and init_serdes()
Message-ID: <4A0B1287.4060603@gmail.com>

With a postfix increment i is incremented beyond 10/5k so the error
message will be displayed too soon.

Signed-off-by: Roel Kluin
---
This could occur almost never.

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index b832a7b..4a84d02 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -667,7 +667,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	i = 0;
 	while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000)
 		mdelay(1);
-	if (i >= 10000) {
+	if (i > 10000) {
 		nes_debug(NES_DBG_INIT, "Did not see full soft reset done.\n");
 		return 0;
 	}
@@ -675,7 +675,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	i = 0;
 	while ((nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS) != 0x80) && i++ < 10000)
 		mdelay(1);
-	if (i >= 10000) {
+	if (i > 10000) {
 		printk(KERN_ERR PFX "Internal CPU not ready, status = %02X\n",
 			nes_read_indexed(nesdev, NES_IDX_INT_CPU_STATUS));
 		return 0;
@@ -701,7 +701,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	i = 0;
 	while (((nes_read32(nesdev->regs+NES_SOFTWARE_RESET) & 0x00000040) == 0) && i++ < 10000)
 		mdelay(1);
-	if (i >= 10000) {
+	if (i > 10000) {
 		nes_debug(NES_DBG_INIT, "Did not see port soft reset done.\n");
 		return 0;
 	}
@@ -711,7 +711,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0) &
0x0000000f)) != 0x0000000f) && i++ < 5000)
 		mdelay(1);
-	if (i >= 5000) {
+	if (i > 5000) {
 		nes_debug(NES_DBG_INIT, "Serdes 0 not ready, status=%x\n", u32temp);
 		return 0;
 	}
@@ -722,7 +722,7 @@ static unsigned int nes_reset_adapter_ne020(struct nes_device *nesdev, u8 *OneG_
 	while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1)
 		& 0x0000000f)) != 0x0000000f) && i++ < 5000)
 		mdelay(1);
-	if (i >= 5000) {
+	if (i > 5000) {
 		nes_debug(NES_DBG_INIT, "Serdes 1 not ready, status=%x\n", u32temp);
 		return 0;
 	}
@@ -792,7 +792,7 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count,
 	while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS0)
 		& 0x0000000f)) != 0x0000000f) && i++ < 5000)
 		mdelay(1);
-	if (i >= 5000) {
+	if (i > 5000) {
 		nes_debug(NES_DBG_PHY, "Init: serdes 0 not ready, status=%x\n", u32temp);
 		return 1;
 	}
@@ -815,7 +815,7 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count,
 	while (((u32temp = (nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_STATUS1)
 		& 0x0000000f)) != 0x0000000f) && (i++ < 5000))
 		mdelay(1);
-	if (i >= 5000) {
+	if (i > 5000) {
 		printk("%s: Init: serdes 1 not ready, status=%x\n", __func__, u32temp);
 		/* return 1; */
 	}

From jsquyres at cisco.com Wed May 13 11:34:04 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:34:04 -0400
Subject: [ofa-general] /dev/infiniband/rdma_cm not created
Message-ID: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>

I'm running on rhel4u6 with the 1.4.1 nightly from last night and
sometimes /dev/infiniband/rdma_cm is not created.
I can see its entry in /etc/udev/rules.d/90-ib.rules:

KERNEL="umad*", NAME="infiniband/%k"
KERNEL="issm*", NAME="infiniband/%k"
KERNEL="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"

But only some of these are created:

[11:29] svbu-mpi005:/etc/udev/rules.d % l /dev/infiniband/
total 0
drwxr-xr-x   2 root root      120 May 13 02:39 ./
drwxr-xr-x  10 root root     5740 May 13 09:39 ../
crw-------   1 root root 231,  64 May 13 02:39 issm0
crw-------   1 root root 231,   0 May 13 02:39 umad0
crw-rw-rw-   1 root root 231, 192 May 13 02:39 uverbs0
crw-rw-rw-   1 root root 231, 193 May 13 02:39 uverbs1
[11:29] svbu-mpi005:/etc/udev/rules.d %

I have both an IB HCA and an iWARP RNIC in this server:

hca_id: mthca0
        fw_ver:                 1.2.917
        node_guid:              0005:ad00:0008:bd60
        sys_image_guid:         0005:ad00:0100:d050
        vendor_id:              0x05ad
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_03B0120002
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       34
                        port_lmc:       0x00

hca_id: nes0
        node_guid:              0012:5502:b58c:0000
        sys_image_guid:         0012:5502:b58c:0000
        vendor_id:              0x1255
        vendor_part_id:         256
        hw_ver:                 0x5
        board_id:               NES020 Board ID
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         0
                        port_lid:       1
                        port_lmc:       0x00

I don't see any obvious errors occurring in syslog or dmesg.

What could cause this failure?

--
Jeff Squyres
Cisco Systems

From robert.j.woodruff at intel.com Wed May 13 11:39:15 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 13 May 2009 11:39:15 -0700
Subject: [ofa-general] RE: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>

Is the driver loaded ?
ie., do an /sbin/lsmod to see.

Also are there any messages that would indicate a
problem when you do a dmesg.

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Wednesday, May 13, 2009 11:34 AM
To: OpenFabrics General; OpenFabrics EWG
Subject: [ewg] /dev/infiniband/rdma_cm not created

I'm running on rhel4u6 with the 1.4.1 nightly from last night and
sometimes /dev/infiniband/rdma_cm is not created. I can see its entry
in /etc/udev/rules.d/90-ib.rules:

KERNEL="umad*", NAME="infiniband/%k"
KERNEL="issm*", NAME="infiniband/%k"
KERNEL="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"

But only some of these are created:

[11:29] svbu-mpi005:/etc/udev/rules.d % l /dev/infiniband/
total 0
drwxr-xr-x   2 root root      120 May 13 02:39 ./
drwxr-xr-x  10 root root     5740 May 13 09:39 ../
crw-------   1 root root 231,  64 May 13 02:39 issm0
crw-------   1 root root 231,   0 May 13 02:39 umad0
crw-rw-rw-   1 root root 231, 192 May 13 02:39 uverbs0
crw-rw-rw-   1 root root 231, 193 May 13 02:39 uverbs1
[11:29] svbu-mpi005:/etc/udev/rules.d %

I have both an IB HCA and an iWARP RNIC in this server:

hca_id: mthca0
        fw_ver:                 1.2.917
        node_guid:              0005:ad00:0008:bd60
        sys_image_guid:         0005:ad00:0100:d050
        vendor_id:              0x05ad
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_03B0120002
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       34
                        port_lmc:       0x00

hca_id: nes0
        node_guid:              0012:5502:b58c:0000
        sys_image_guid:         0012:5502:b58c:0000
        vendor_id:              0x1255
        vendor_part_id:         256
        hw_ver:                 0x5
        board_id:               NES020 Board ID
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         0
                        port_lid:       1
                        port_lmc:       0x00

I don't see any obvious errors occurring in syslog or dmesg.

What could cause this failure?

--
Jeff Squyres
Cisco Systems

_______________________________________________
ewg mailing list
ewg at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

From jsquyres at cisco.com Wed May 13 11:54:55 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:54:55 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com>
Message-ID: <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>

On May 13, 2009, at 2:39 PM, Woodruff, Robert J wrote:

> Is the driver loaded ? ie., do an /sbin/lsmod to see.
>

Ah ha -- no, it is not:

[11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
[11:51] svbu-mpi005:/etc/udev/rules.d %

What would cause it to not be loaded? I *assumed* (but didn't check)
that it is loaded as part of OFED's /etc/init.d/openibd. Is that
correct?

> Also are there any messages that would indicate a
> problem when you do a dmesg.
>

As I indicated in my first mail :-), no.

>
>
>
> -----Original Message-----
> From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Squyres
> Sent: Wednesday, May 13, 2009 11:34 AM
> To: OpenFabrics General; OpenFabrics EWG
> Subject: [ewg] /dev/infiniband/rdma_cm not created
>
> I'm running on rhel4u6 with the 1.4.1 nightly from last night and
> sometimes /dev/infiniband/rdma_cm is not created.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

--
Jeff Squyres
Cisco Systems

From jsquyres at cisco.com Wed May 13 11:57:35 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 13 May 2009 14:57:35 -0400
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com> <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com>
Message-ID: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>

On May 13, 2009, at 2:54 PM, Jeff Squyres wrote:

> [11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
> [11:51] svbu-mpi005:/etc/udev/rules.d %
>
> What would cause it to not be loaded? I *assumed* (but didn't
> check) that it is loaded as part of OFED's /etc/init.d/openibd. Is
> that correct?

FWIW, I see the following in /etc/infiniband/openibd.conf:

# Start HCA driver upon boot
ONBOOT=yes
#...
# Load RDMA_CM module
RDMA_CM_LOAD=yes

--
Jeff Squyres
Cisco Systems

From arlin.r.davis at intel.com Wed May 13 12:03:13 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Wed, 13 May 2009 12:03:13 -0700
Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com> <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com> <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
Message-ID:

>FWIW, I see the following in /etc/infiniband/openibd.conf:
>
>
># Load RDMA_CM module
>RDMA_CM_LOAD=yes
>

is RDMA_UCM_LOAD=yes ?

What do you see with "modinfo rdma_cm rdma_ucm" ?

From robert.j.woodruff at intel.com Wed May 13 12:12:33 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 13 May 2009 12:12:33 -0700
Subject: [ofa-general] RE: [ewg] /dev/infiniband/rdma_cm not created
In-Reply-To: <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com> <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com> <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com>

Check to see if some other driver failed to load.
I think I have seen before that if another driver
fails to load, the start script bails out and
does not load the other drivers.

Perhaps try doing a /etc/init.d/openibd restart
manually to see if something is failing to load.

-----Original Message-----
From: Jeff Squyres [mailto:jsquyres at cisco.com]
Sent: Wednesday, May 13, 2009 11:58 AM
To: Jeff Squyres
Cc: Woodruff, Robert J; OpenFabrics General; OpenFabrics EWG; Hefty, Sean
Subject: Re: [ewg] /dev/infiniband/rdma_cm not created

On May 13, 2009, at 2:54 PM, Jeff Squyres wrote:

> [11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
> [11:51] svbu-mpi005:/etc/udev/rules.d %
>
> What would cause it to not be loaded? I *assumed* (but didn't
> check) that it is loaded as part of OFED's /etc/init.d/openibd. Is
> that correct?

FWIW, I see the following in /etc/infiniband/openibd.conf:

# Start HCA driver upon boot
ONBOOT=yes
#...
# Load RDMA_CM module RDMA_CM_LOAD=yes -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Wed May 13 12:12:56 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 13 May 2009 15:12:56 -0400 Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created In-Reply-To: References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com><94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com> <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com> Message-ID: <583F9D06-1DDD-4666-B174-BFB16E10B5B3@cisco.com> On May 13, 2009, at 3:03 PM, Davis, Arlin R wrote: > >FWIW, I see the following in /etc/infiniband/openibd.conf: > > > > > ># Load RDMA_CM module > >RDMA_CM_LOAD=yes > > is RDMA_UCM_LOAD=yes ? > Yes, sorry I didn't see that one first time around: # Load RDMA_UCM module RDMA_UCM_LOAD=yes > What do you see with "modinfo rdma_cm rdma_ucm" ? [root at svbu-mpi055 ~]# modinfo rdma_cm rdma_ucm filename: /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/ infiniband/core/rdma_cm.ko parm: cma_response_timeout:CMA_CM_RESPONSE_TIMEOUT default=20 parm: unify_tcp_port_space:Unify the host TCP and RDMA port space allocation (default=0) parm: tavor_quirk:Tavor performance quirk: limit MTU to 1K if > 0 license: Dual BSD/GPL description: Generic RDMA CM Agent author: Sean Hefty depends: ib_addr,ib_cm,iw_cm,ib_core,ib_sa vermagic: 2.6.9-67.ELsmp SMP gcc-3.4 filename: /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/ infiniband/core/rdma_ucm.ko license: Dual BSD/GPL description: RDMA Userspace Connection Manager Access author: Sean Hefty depends: rdma_cm,ib_uverbs,ib_core,rdma_cm vermagic: 2.6.9-67.ELsmp SMP gcc-3.4 [root at svbu-mpi055 ~]# -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Wed May 13 12:18:50 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 13 May 2009 15:18:50 -0400 Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created In-Reply-To: 
<382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com> References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com> <94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com> <5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com> Message-ID: <7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com> On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote: > Check to see if some other driver failed to load. > I think I have seen before that if another driver > fails to load, the start script bails out and > does not load the other drivers. > > Perhaps try doing a /etc/init.d/openibd restart > manually to see if something is failing to load. > Weird -- doing it manually shows no problem: [root at svbu-mpi055 ~]# /etc/init.d/openibd restart Unloading HCA driver: [ OK ] Loading HCA driver and Access Layer: [ OK ] Setting up InfiniBand network interfaces: Bringing up interface ib0: [ OK ] Bringing up interface ib1: [ OK ] Setting up service network . . . [ done ] [root at svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm crw-rw-rw- 1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm [root at svbu-mpi055 ~]# Something must be going wrong during the bootup. I'm unfortunately several thousand miles from the server and don't have a serial console. I guess I'll insert some initlog's in /etc/init.d/openibd... 
-- Jeff Squyres Cisco Systems From jsquyres at cisco.com Wed May 13 12:59:18 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 13 May 2009 15:59:18 -0400 Subject: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created In-Reply-To: <7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com> References: <25C847DB-BE7C-427D-B0E2-AC7C184DFEC4@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A792@orsmsx507.amr.corp.intel.com><94B101A0-36F4-4B80-B5F4-0ECB65EF685F@cisco.com><5FE760CE-11B9-4668-971B-DD0A89A3ED95@cisco.com><382A478CAD40FA4FB46605CF81FE39F42DC7A825@orsmsx507.amr.corp.intel.com> <7BFC31C1-1B86-47E8-A65D-19A3F4237AAB@cisco.com> Message-ID: <90B10556-C092-4BE0-A58F-8F1184AFBFCC@cisco.com> Ok, I figured it out. I have some creative /etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do nothing if no device is present (or some other esoteric, specific-to-jeffs-cluster criterion is met) -- they call "exit 0" in this case. This apparently causes the top-level /etc/init.d/openibd to exit (!). I've fixed this (they now never call "exit"); now everything works as expected. Upon reflection, I can see that this was totally my fault -- ifcfg-* scripts are always sourced and should therefore never call "exit". But given that /etc/init.d/openibd is sooo complex and has sooo many moving parts, it would be nice if there were a way to track down problems a little more easily; perhaps a "verbose" setting in /etc/infiniband/openibd.conf, or somesuch. Indeed, since OFED is targeted at the datacenter, monitors attached to the servers in question and/or serial consoles may not be readily available. Hence, having the ability to drop some verbose output into syslog during boot, for example, might be quite useful to sysadmins/network admins when troubleshooting. Just my $0.02. Thanks for the tips on where to look, Woody! 
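The failure mode described above can be reproduced in isolation. What follows is only a minimal sketch, not taken from OFED: the files ifcfg-bad, ifcfg-good and init.sh are hypothetical stand-ins for an ifcfg-ib* file and for /etc/init.d/openibd, and /nonexistent/device is a placeholder for whatever condition the real scripts test. It shows that "exit 0" inside a sourced file ends the sourcing script silently and with a success status, which is roughly why nothing looked wrong at boot.

```shell
#!/bin/sh
# Sketch: a sourced file that calls "exit" terminates the script that sourced it.
# All file names below are hypothetical stand-ins, not the real OFED scripts.

tmpdir=$(mktemp -d)

# Stand-in for an ifcfg-style file that bails out with "exit 0".
cat > "$tmpdir/ifcfg-bad" <<'EOF'
[ -e /nonexistent/device ] || exit 0
EOF

# Same check, but the device-specific part is skipped without calling "exit".
cat > "$tmpdir/ifcfg-good" <<'EOF'
if [ -e /nonexistent/device ]; then
    : # device-specific setup would go here
fi
EOF

# Stand-in for the init script: source the config file, then carry on.
cat > "$tmpdir/init.sh" <<'EOF'
. "$1"
echo "loading remaining modules"
EOF

# Run the stand-in init script once with each config file.
bad_output=$(sh "$tmpdir/init.sh" "$tmpdir/ifcfg-bad")
bad_status=$?
good_output=$(sh "$tmpdir/init.sh" "$tmpdir/ifcfg-good")

echo "with exit:    status=$bad_status output='$bad_output'"
echo "without exit: output='$good_output'"

rm -rf "$tmpdir"
```

With the "exit 0" variant the stand-in init script never reaches its echo line and still reports status 0; with the "if" variant it continues as intended. Guarding the optional work with a conditional, as in ifcfg-good, is one way to keep a sourced file from ending its caller.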
On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote: > On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote: > > > Check to see if some other driver failed to load. > > I think I have seen before that if another driver > > fails to load, the start script bails out and > > does not load the other drivers. > > > > Perhaps try doing a /etc/init.d/openibd restart > > manually to see if something is failing to load. > > > > Weird -- doing it manually shows no problem: > > [root at svbu-mpi055 ~]# /etc/init.d/openibd restart > Unloading HCA driver: [ OK ] > Loading HCA driver and Access Layer: [ OK ] > Setting up InfiniBand network interfaces: > Bringing up interface ib0: [ OK ] > Bringing up interface ib1: [ OK ] > Setting up service network . . . [ done ] > [root at svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm > crw-rw-rw- 1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm > [root at svbu-mpi055 ~]# > > Something must be going wrong during the bootup. I'm unfortunately > several thousand miles from the server and don't have a serial > console. I guess I'll insert some initlog's in /etc/init.d/openibd... > > -- > Jeff Squyres > Cisco Systems -- Jeff Squyres Cisco Systems From rdreier at cisco.com Wed May 13 14:35:16 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 May 2009 14:35:16 -0700 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: <4A09A283.3090605@opengridcomputing.com> (Steve Wise's message of "Tue, 12 May 2009 11:23:31 -0500") References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> <4A099FB8.7090603@opengridcomputing.com> <4A09A283.3090605@opengridcomputing.com> Message-ID: > Trond Myklebust wrote (earlier in this thread): > > > > All I should need to know is that I can advertise either dma handles or > > kernel VAs, and know that I can choose between two functions, say, > > ib_send_wr_fastreg_dma_init() and ib_send_wr_fastreg_kva_init() to > > initialise the ib_send_wr structure correctly. I skimmed the earlier thread, and I have to say that I don't quite see what the problem with assigning things to a u64 directly is. You can use any address you want, and I don't quite understand why using the correct cast to avoid sign extension or truncation problems is such a big maintenance burden? The code below really just looks like obfuscation to me -- are we going to want to add something like /** * ib_init_fast_reg_iova_start_u64 - initializes the iova_start field * based on a 64-bit address supplied by the user. 
* @wr - struct ib_send_wr pointer to be initialized * @addr - void * address to be used as the iova_start */ static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *wr, u64 addr) { wr->wr.fast_reg.iova_start = addr; } next, to make sure we don't get confused about assigning a u64 to a u64? It all looks a bit overcomplicated to me. - R. > /** > + * ib_init_fast_reg_iova_start_dma - initializes the iova_start field > + * based on a dma address supplied by the user. > + * @wr - struct ib_send_wr pointer to be initialized > + * @addr - dma_addr_t value to be used as the iova_start > + */ > +static inline void ib_init_fast_reg_iova_start_dma(struct ib_send_wr *wr, > + dma_addr_t addr) > +{ > + wr->wr.fast_reg.iova_start = addr; > +} > + > +/** > + * ib_init_fast_reg_iova_start_kva - initializes the iova_start field > + * based on a kernel virtual address supplied by the user. > + * @wr - struct ib_send_wr pointer to be initialized > + * @addr - void * address to be used as the iova_start > + */ > +static inline void ib_init_fast_reg_iova_start_kva(struct ib_send_wr *wr, > + void *addr) > +{ > + wr->wr.fast_reg.iova_start = (unsigned long)addr; > +} > + > +/** > * ib_alloc_mw - Allocates a memory window. > * @pd: The protection domain associated with the memory window. 
> */ From rdreier at cisco.com Wed May 13 15:18:13 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 May 2009 15:18:13 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a couple of fixes to low-level drivers that fix crashes seen when running NFS/RDMA: Jack Morgenstein (1): IB/mlx4: Don't overwrite fast registration page list when posting work request Roland Dreier (1): Merge branches 'cxgb3' and 'mlx4' into for-linus Steve Wise (1): RDMA/cxgb3: Don't complete flushed send work requests twice drivers/infiniband/hw/cxgb3/cxio_hal.c | 1 + drivers/infiniband/hw/mlx4/mlx4_ib.h | 1 + drivers/infiniband/hw/mlx4/mr.c | 10 ++++++++-- drivers/infiniband/hw/mlx4/qp.c | 2 +- 4 files changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 8d71086..62f9cf2 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -410,6 +410,7 @@ int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) ptr = wq->sq_rptr + count; sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); while (ptr != wq->sq_wptr) { + sqp->signaled = 0; insert_sq_cqe(wq, cq, sqp); ptr++; sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h index 9974e88..8a7dd67 100644 --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h @@ -86,6 +86,7 @@ struct mlx4_ib_mr { struct mlx4_ib_fast_reg_page_list { struct ib_fast_reg_page_list ibfrpl; + __be64 *mapped_page_list; dma_addr_t map; }; diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..8f3666b 100644 --- 
a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -231,7 +231,11 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device if (!mfrpl) return ERR_PTR(-ENOMEM); - mfrpl->ibfrpl.page_list = dma_alloc_coherent(&dev->dev->pdev->dev, + mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL); + if (!mfrpl->ibfrpl.page_list) + goto err_free; + + mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev, size, &mfrpl->map, GFP_KERNEL); if (!mfrpl->ibfrpl.page_list) @@ -242,6 +246,7 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device return &mfrpl->ibfrpl; err_free: + kfree(mfrpl->ibfrpl.page_list); kfree(mfrpl); return ERR_PTR(-ENOMEM); } @@ -252,8 +257,9 @@ void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) struct mlx4_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list); int size = page_list->max_page_list_len * sizeof (u64); - dma_free_coherent(&dev->dev->pdev->dev, size, page_list->page_list, + dma_free_coherent(&dev->dev->pdev->dev, size, mfrpl->mapped_page_list, mfrpl->map); + kfree(mfrpl->ibfrpl.page_list); kfree(mfrpl); } diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index f385a24..20724ae 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1365,7 +1365,7 @@ static void set_fmr_seg(struct mlx4_wqe_fmr_seg *fseg, struct ib_send_wr *wr) int i; for (i = 0; i < wr->wr.fast_reg.page_list_len; ++i) - wr->wr.fast_reg.page_list->page_list[i] = + mfrpl->mapped_page_list[i] = cpu_to_be64(wr->wr.fast_reg.page_list->page_list[i] | MLX4_MTT_FLAG_PRESENT); From rdreier at cisco.com Wed May 13 15:19:17 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 May 2009 15:19:17 -0700 Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020() and init_serdes() In-Reply-To: <4A0B1287.4060603@gmail.com> (Roel Kluin's message of "Wed, 13 May 2009 20:33:43 +0200") References: 
<4A0B1287.4060603@gmail.com> Message-ID: Looks good to me. NES guys? - R. From abenjamin at sgi.com Wed May 13 16:21:47 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Wed, 13 May 2009 16:21:47 -0700 Subject: [ofa-general] [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts Message-ID: <4A0B560B.3090606@sgi.com> When the mthca driver calls request_irq() to allocate interrupt resources, it uses the fixed device name string "ib_mthca". When multiple IB cards are present in the system, every instance of the resource is named "ib_mthca" in /proc/interrupts. This can make it very confusing trying to work out exactly where IB interrupts are going and why. Summary of changes: o Added a new IB core API , ib_init_device() that allocates an ib_device struct and initializes its device name. o Added a new field in mthca_dev struct to hold its device (IRQ) name. o Replaced the call to ib_alloc_device by ib_init_device at mthca device init time. o Modified device name parameter to request_irq() to use the device name allocated by ib_init_device() Signed-off-by: Arputham Benjamin --- a/ofa_kernel-1.4/drivers/infiniband/core/device.c 2008-08-14 16:58:42.962168204 -0700 +++ b/ofa_kernel-1.4/drivers/infiniband/core/device.c 2008-08-14 17:00:31.276257856 -0700 @@ -181,6 +181,40 @@ struct ib_device *ib_alloc_device(size_t EXPORT_SYMBOL(ib_alloc_device); /** + * ib_init_device - allocate and initialize an IB device struct + * @size:size of structure to allocate + * @name:HCA device name + * + * Low-level drivers should use ib_init_device() to allocate &struct + * ib_device and initialize its device name. @size is the size of + * the structure to be allocated, including any private data used by + * the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_init_device(). 
+ */ +struct ib_device *ib_init_device(size_t size, const char *name) +{ + int ret = 0; + struct ib_device *device; + + device = (struct ib_device *) ib_alloc_device(size); + if (device) { + strlcpy(device->name, name, IB_DEVICE_NAME_MAX); + if (strchr(device->name, '%')) { + mutex_lock(&device_mutex); + ret = alloc_name(device->name); + mutex_unlock(&device_mutex); + } + } + if (ret) { + ib_dealloc_device(device); + return NULL; + } + return device; +} +EXPORT_SYMBOL(ib_init_device); + +/** * ib_dealloc_device - free an IB device struct * @device:structure to free * --- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_dev.h 2008-08-14 16:58:42.994168822 -0700 +++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_dev.h 2008-08-14 17:00:31.288258088 -0700 @@ -360,6 +360,7 @@ struct mthca_dev { struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; spinlock_t sm_lock; u8 rate[MTHCA_MAX_PORTS]; + char irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX]; }; #ifdef CONFIG_INFINIBAND_MTHCA_DEBUG --- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_eq.c 2008-08-14 16:58:42.994168822 -0700 +++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_eq.c 2008-08-14 17:00:31.304258396 -0700 @@ -860,17 +860,20 @@ int mthca_init_eq_table(struct mthca_dev if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { static const char *eq_name[] = { - [MTHCA_EQ_COMP] = DRV_NAME " (comp)", - [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", - [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + [MTHCA_EQ_COMP] = " (comp)", + [MTHCA_EQ_ASYNC] = " (async)", + [MTHCA_EQ_CMD] = " (cmd)" }; for (i = 0; i < MTHCA_NUM_EQ; ++i) { + strcpy(&dev->irq_name[i][IB_DEVICE_NAME_MAX], dev->ib_dev.name); + strcat(&dev->irq_name[i][IB_DEVICE_NAME_MAX], eq_name[i]); err = request_irq(dev->eq_table.eq[i].msi_x_vector, mthca_is_memfree(dev) ? 
mthca_arbel_msi_x_interrupt : mthca_tavor_msi_x_interrupt, - 0, eq_name[i], dev->eq_table.eq + i); + 0, &dev->irq_name[i][IB_DEVICE_NAME_MAX], + dev->eq_table.eq + i); if (err) goto err_out_cmd; dev->eq_table.eq[i].have_irq = 1; --- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_main.c 2008-08-14 16:58:42.994168822 -0700 +++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_main.c 2008-08-14 17:03:53.348154342 -0700 @@ -47,6 +47,8 @@ #include "mthca_memfree.h" #include "mthca_wqe.h" +struct ib_device *ib_init_device(size_t size, const char *name); + MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); MODULE_LICENSE("Dual BSD/GPL"); @@ -1091,7 +1093,7 @@ static int __mthca_init_one(struct pci_d } } - mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + mdev = (struct mthca_dev *) ib_init_device(sizeof *mdev, "mthca%d"); if (!mdev) { dev_err(&pdev->dev, "Device struct alloc failed, " "aborting.\n"); --- a/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_provider.c 2008-08-14 16:58:42.998168899 -0700 +++ b/ofa_kernel-1.4/drivers/infiniband/hw/mthca/mthca_provider.c 2008-08-14 17:00:31.336259013 -0700 @@ -1358,7 +1358,6 @@ int mthca_register_device(struct mthca_d if (ret) return ret; - strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.owner = THIS_MODULE; dev->ib_dev.uverbs_abi_ver = MTHCA_UVERBS_ABI_VERSION; From ogerlitz at voltaire.com Thu May 14 00:22:14 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 14 May 2009 10:22:14 +0300 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> <4A099FB8.7090603@opengridcomputing.com> <4A09A283.3090605@opengridcomputing.com> Message-ID: <4A0BC6A6.1070002@voltaire.com> > Trond Myklebust wrote >> All I should need to know is that I can advertise either dma handles or kernel VAs Maybe it's obvious to some people here, but may I ask why there's a need to post either a DMA address or a kernel virtual address? Is it an application need? Is it hardware specific (e.g. IB vs. iWARP vs. vendor implementation)? Or something else? Or. 
From vlad at lists.openfabrics.org Thu May 14 03:24:15 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 14 May 2009 03:24:15 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090514-0200 daily build status Message-ID: <20090514102415.6A014E6118E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 
with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From chien.tin.tung at intel.com Thu May 14 06:28:06 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Thu, 14 May 2009 06:28:06 -0700 Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020() and init_serdes() In-Reply-To: <4A0B1287.4060603@gmail.com> References: <4A0B1287.4060603@gmail.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830350BDFFBE@azsmsx501.amr.corp.intel.com> >With a postfix increment i is incremented beyond 10/5k so the >error message will be displayed too soon. > >Signed-off-by: Roel Kluin >--- >This could occur almost never. Thanks for the patch. Roland please apply. Acked-by: Chien Tung Chien From swise at opengridcomputing.com Thu May 14 06:41:06 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 May 2009 08:41:06 -0500 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: <4A0BC6A6.1070002@voltaire.com> References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> <4A099FB8.7090603@opengridcomputing.com> <4A09A283.3090605@opengridcomputing.com> <4A0BC6A6.1070002@voltaire.com> Message-ID: <4A0C1F72.8050503@opengridcomputing.com> Or Gerlitz wrote: >> Trond Myklebust wrote >>> All I should need to know is that I can advertise either dma handles >>> or kernel VAs > > Maybe its obvious to some people here, but may I ask why there's a > need to post either dma address or kernel virtual address? is it > application need? hardware (e.g IB vs iWARP vs vendor implementation) > specific? or something else? > > Or. > > The NFSRDMA transport uses Fast Register Memory Regions. In this particular section of code, the NFSRDMA client is building a fastreg work request to bind a page list to a fastreg mr. You can read about this in the IBTA spec on memory management extensions, or in the RDMA Verbs draft. Steve. From ogerlitz at Voltaire.com Thu May 14 06:45:26 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Thu, 14 May 2009 16:45:26 +0300 Subject: [ofa-general] Re: [PATCH 2.6.30] xprtrdma: The frmr iova_start values are truncated by the nfs rdma client. 
In-Reply-To: <4A0C1F72.8050503@opengridcomputing.com> References: <20090424190510.3134.90405.stgit@build.ogc.int> <49F31A16.2080806@opengridcomputing.com> <49F4AE86.4090908@opengridcomputing.com> <49f515a5.1d1e640a.1c82.6677@mx.google.com> <49F5ED55.1010607@opengridcomputing.com> <1240855510.8818.9.camel@heimdal.trondhjem.org> <1240856613.8818.16.camel@heimdal.trondhjem.org> <49F60845.4010007@opengridcomputing.com> <1240865214.8818.73.camel@heimdal.trondhjem.org> <4A08A5C6.7040003@opengridcomputing.com> <1242082203.1743.11.camel@heimdal.trondhjem.org> <4A08BF1C.2050204@opengridcomputing.com> <1242089066.1743.19.camel@heimdal.trondhjem.org> <4a08cd7b.48c3f10a.6bb1.fffff6d3@mx.google.com> <1242092150.16618.15.camel@heimdal.trondhjem.org> <4A08E7B2.1010907@opengridcomputing.com> <4A099FB8.7090603@opengridcomputing.com> <4A09A283.3090605@opengridcomputing.com> <4A0BC6A6.1070002@voltaire.com> <4A0C1F72.8050503@opengridcomputing.com> Message-ID: <4A0C2076.8010702@Voltaire.com> Steve Wise wrote: > The NFSRDMA transport uses Fast Register Memory Regions. In this > particular section of code, the NFSRDMA client is building a fastreg > work request to bind a page list to a fastreg mr. You can read about > this in the IBTA spec on memory management extensions, or in the RDMA Verbs draft. Hi Steve, I was aware that the context is a fastreg work request. I was thinking that the spec mandates either a DMA address or a KVA for the iova, but from your reply I assume I was wrong, thanks. Or. From harsha at zresearch.com Thu May 14 10:40:04 2009 From: harsha at zresearch.com (Harshavardhana) Date: Thu, 14 May 2009 23:10:04 +0530 Subject: [ofa-general] GlusterFS 2.0 Release Message-ID: <8a80e9760905141040y5456f1cbqfc79061379fd55ad@mail.gmail.com> Greetings everyone, On behalf of the GlusterFS Team, I'm happy to announce the release of GlusterFS version 2.0. 
Announcement =========== About GlusterFS: GlusterFS is a clustered file system that runs on commodity off-the-shelf hardware, delivering multiple times the scalability and performance of conventional storage. The architecture is modular, stackable and kernel-independent, which makes it easy to customize, install, manage and support different operating systems. Multiple storage systems can be clustered together, supporting petabytes of capacity in a single global namespace. Building a configuration of a few hundred terabytes can be accomplished in less than thirty minutes. GlusterFS Release v2.0: GlusterFS v2.0 has gone through a major revamp in design and development since v1.3. Thanks to thousands of initial users who provided us great feedback and bug reports. There are a number of production deployments now. GlusterFS uses existing disk file systems (such as Ext3, XFS, ZFS..) to store your data as regular files and folders. You can restore the data, even after you uninstall GlusterFS. So, give it a try and let us know. Please forward this message to relevant users. What is in 2.0 release: http://www.gluster.org/docs/index.php/GlusterFS_Features Who is using GlusterFS: http://www.gluster.org/docs/index.php/Who%27s_using_GlusterFS License: GNU GPLv3 Download: http://www.gluster.org/download.php Happy Hacking =========== -- GlusterFS Team -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From weiny2 at llnl.gov Thu May 14 16:04:17 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 14 May 2009 16:04:17 -0700 Subject: [ofa-general] [PATCH] iblinkinfo, ibqueryerrors: prevent core when switch is not found Message-ID: <20090514160417.e7505e06.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 14 May 2009 15:52:42 -0700 Subject: [PATCH] iblinkinfo, ibqueryerrors: prevent core when switch is not found If the switch is not found print nice error message instead of seg faulting Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 11 +++++++++-- infiniband-diags/src/ibqueryerrors.c | 10 ++++++++-- 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index cf38ecb..367056c 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -395,11 +395,18 @@ main(int argc, char **argv) goto close_port; } - if (guid) { + if (guid_str) { ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); - print_switch(sw, NULL); + if (sw) + print_switch(sw, NULL); + else + fprintf(stderr, "Failed to find switch: %s\n", guid_str); } else if (dr_path) { ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path); + if (sw) + print_switch(sw, NULL); + else + fprintf(stderr, "Failed to find switch: %s\n", dr_path); print_switch(sw, NULL); } else { ibnd_iter_nodes_type(fabric, print_switch, IB_NODE_SWITCH, NULL); diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index 525af70..999329e 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -445,10 +445,16 @@ main(int argc, char **argv) if (switch_guid) { ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); - print_node(node, NULL); + if (node) + print_node(node, NULL); + else + fprintf(stderr, "Failed to find node: %s\n", switch_guid_str); } else if (dr_path) { ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); - print_node(node, 
NULL); + if (node) + print_node(node, NULL); + else + fprintf(stderr, "Failed to find node: %s\n", dr_path); } else ibnd_iter_nodes(fabric, print_node, NULL); -- 1.5.4.5 From weiny2 at llnl.gov Thu May 14 16:42:10 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 14 May 2009 16:42:10 -0700 Subject: [ofa-general] [PATCH] iblinkinfo: remove unused file pointer. Message-ID: <20090514164210.8a42f37d.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 14 May 2009 16:39:52 -0700 Subject: [PATCH] iblinkinfo: remove unused file pointer. Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 6 ------ 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 367056c..d422a2a 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -52,7 +52,6 @@ #include char *argv0 = "iblinkinfotest"; -static FILE *f; static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; @@ -294,8 +293,6 @@ main(int argc, char **argv) { 0 } }; - f = stdout; - argv0 = argv[0]; while (1) { @@ -357,9 +354,6 @@ main(int argc, char **argv) argc -= optind; argv += optind; - if (argc && !(f = fopen(argv[0], "w"))) - fprintf(stderr, "can't open file %s for writing", argv[0]); - ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 3); if (!ibmad_port) { fprintf(stderr, "Failed to open %s port %d", ca, ca_port); -- 1.5.4.5 From acceptany at gmail.com Thu May 14 18:41:38 2009 From: acceptany at gmail.com (Jordan) Date: Fri, 15 May 2009 09:41:38 +0800 Subject: [ofa-general] Some problem about the root nodes selection in up/down algorithm Message-ID: <91fe68d50905141841x659cf13dt3076440c7ceeb995@mail.gmail.com> In the function "updn_find_root_nodes_by_min_hop(OUT updn_t * p_updn)", there are two sentences"thd1 = cas_num * 0.9; thd2 = cas_num * 0.05;" I can't understand what the number "0.9, 0.05" means. Why use the number "0.9, 0.05"? 
What's the principle of this root node selection algorithm? From vlad at lists.openfabrics.org Fri May 15 03:46:13 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 15 May 2009 03:46:13 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090515-0200 daily build status Message-ID: <20090515104613.700B8E61112@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on
ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From rdreier at cisco.com Fri May 15 10:17:25 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 15 May 2009 10:17:25 -0700 Subject: [ofa-general] [PATCH] nes: off by one in reset_adapter_ne020() and init_serdes() In-Reply-To: <4A0B1287.4060603@gmail.com> (Roel Kluin's message of "Wed, 13 May 2009 20:33:43 +0200") References: <4A0B1287.4060603@gmail.com> Message-ID: Thanks, I've applied this. From rdreier at cisco.com Fri May 15 14:44:07 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 15 May 2009 14:44:07 -0700 Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <4A0B560B.3090606@sgi.com> (Arputham Benjamin's message of "Wed, 13 May 2009 16:21:47 -0700") References: <4A0B560B.3090606@sgi.com> Message-ID: > When the mthca driver calls request_irq() to allocate interrupt resources, it uses > the fixed device name string "ib_mthca". When multiple IB cards are present in the system, > every instance of the resource is named "ib_mthca" in /proc/interrupts. > This can make it very confusing trying to work out exactly where IB interrupts are going and why. Fundamentally makes sense. Some comments about the specifics: > o Added a new IB core API , ib_init_device() that allocates an ib_device struct > and initializes its device name. seems reasonable. However I don't think we need both ib_init_device() and ib_alloc_device(), and also the "ib_init_device" name doesn't imply that it is allocating memory. 
> o Modified device name parameter to request_irq() to use the device name
> allocated by ib_init_device()

You only did this for mthca and only in the MSI-X case. I would suggest that mthca at least needs to be consistent between MSI-X and non-MSI-X, and it would be desirable to convert other drivers as well. Also the mthca changes really should be separated out from the changes to the core API.

So I would suggest reworking this into a series of patches:

1. Add a function ib_alloc_device_set_name() that does what your ib_init_device() function does. (By the way, there is a problem with your implementation, since alloc_name() just checks the list of registered devices for a collision -- so devices that are allocated but not registered could be assigned the same name, if the kernel ever moves to parallelizing PCI probing or something like that -- so you should probably fix alloc_name() to check a list of all allocated devices or something like that)
2. For each RDMA driver (i.e. each of drivers/infiniband/hw/xxx), convert to using ib_alloc_device_set_name() -- one patch per driver.
3. Remove the old ib_alloc_device() and rename ib_alloc_device_set_name() back to ib_alloc_device().
4. Change mthca to use the device name when naming IRQs, both in MSI-X and INTx mode.
5. [optional] Have other drivers name their IRQs similarly.

One specific thing that puzzles me. You add a field:

@@ -360,6 +360,7 @@ struct mthca_dev {
 struct ib_ah *sm_ah[MTHCA_MAX_PORTS];
 spinlock_t sm_lock;
 u8 rate[MTHCA_MAX_PORTS];
+ char irq_name[MTHCA_NUM_EQ][IB_DEVICE_NAME_MAX];
 };

which looks sane, but then the way you use it is:

> + strcpy(&dev->irq_name[i][IB_DEVICE_NAME_MAX], dev->ib_dev.name);
> + strcat(&dev->irq_name[i][IB_DEVICE_NAME_MAX], eq_name[i]);

Why is the address you want at the position IB_DEVICE_NAME_MAX instead of at index 0?
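To make the indexing question above concrete, here is a minimal user-space sketch (not the mthca code). It assumes a two-dimensional name array like the one in the quoted hunk; NUM_EQ, NAME_MAX_LEN, build_irq_name() and the "-comp" suffix used below are hypothetical stand-ins, and snprintf() is just one bounded way to build the string. The point it illustrates is that row i starts at irq_name[i], i.e. index 0 of that row, while &irq_name[i][NAME_MAX_LEN] is already one past the end of the row.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins for MTHCA_NUM_EQ and IB_DEVICE_NAME_MAX. */
#define NUM_EQ       3
#define NAME_MAX_LEN 64

static char irq_name[NUM_EQ][NAME_MAX_LEN];

/*
 * Write "<dev_name><suffix>" into row i, starting at index 0 of that row.
 * irq_name[i] (equivalently &irq_name[i][0]) is the start of row i;
 * &irq_name[i][NAME_MAX_LEN] would point one past the end of the row.
 * Returns 0 on success, -1 if i is out of range or the name was truncated.
 */
static int build_irq_name(size_t i, const char *dev_name, const char *suffix)
{
	int n;

	if (i >= NUM_EQ)
		return -1;

	/* snprintf() never writes more than sizeof(irq_name[i]) bytes. */
	n = snprintf(irq_name[i], sizeof(irq_name[i]), "%s%s", dev_name, suffix);
	if (n < 0 || (size_t)n >= sizeof(irq_name[i]))
		return -1;

	return 0;
}
```

With this shape, a call such as build_irq_name(0, "mthca0", "-comp") leaves "mthca0-comp" in irq_name[0], and an over-long name is reported rather than silently running into the next row.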
Also (this is theoretical only, since IB_DEVICE_NAME_MAX is much bigger than the size of "mthcaX"), without range checking, since you only allocate IB_DEVICE_NAME_MAX, what prevents the eq_name part from overflowing? In general I don't like seeing strcpy()/strcat() instead of strlcpy()/strlcat(). (And why write this as strcpy followed by strcat instead of a single snprintf()?) - R. From vlad at lists.openfabrics.org Sat May 16 03:23:24 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 16 May 2009 03:23:24 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090516-0200 daily build status Message-ID: <20090516102324.C1753E61508@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.26 Passed on
x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From vlad at lists.openfabrics.org Sun May 17 03:22:57 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 17 May 2009 03:22:57 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090517-0200 daily build status Message-ID: <20090517102258.1B77FE613CA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on 
x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dorfman.eli at gmail.com Sun May 17 07:06:46 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 17 May 2009 17:06:46 +0300 Subject: [ofa-general] [PATCH ] opensm: MFT tables are not set after non full member re-join Message-ID: <4A1019F6.5060900@gmail.com> MFT tables are not set after non full member re-join In case of non full member re-join MFT tables are not set. No need to set or check non full member reference to mlid (port->mcm_list). This list should be used only for full members for cleanup when port goes down. A simple scenario to reproduce this: 1. Full member creates group 2. Non-member joins - MFT sent 3. Full member leaves a. group is deleted but non member port still has a reference to the MLID 4. Full member re-creates the group 5.
Non member re-joins - MFT *NOT* sent to switches Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_sm.h | 3 ++- opensm/opensm/osm_sa_mcmember_record.c | 6 +++--- opensm/opensm/osm_sm.c | 22 +++++++++++++++++++++- 3 files changed, 26 insertions(+), 5 deletions(-) diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h index cc8321d..1a8a577 100644 --- a/opensm/include/opensm/osm_sm.h +++ b/opensm/include/opensm/osm_sm.h @@ -539,7 +539,8 @@ osm_resp_send(IN osm_sm_t * sm, ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * const p_sm, IN const ib_net16_t mlid, - IN const ib_net64_t port_guid); + IN const ib_net64_t port_guid, + IN uint8_t scope_state); /* * PARAMETERS * p_sm diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index 5543221..fe29dd6 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -1039,7 +1039,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) if (!p_mgrp) { char gid_str[INET6_ADDRSTRLEN]; CL_PLOCK_RELEASE(sa->p_lock); - OSM_LOG(sa->p_log, OSM_LOG_DEBUG, + OSM_LOG(sa->p_log, OSM_LOG_INFO, "Failed since multicast group %s not present\n", inet_ntop(AF_INET6, p_recvd_mcmember_rec->mgid.raw, gid_str, sizeof gid_str)); @@ -1309,8 +1309,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) /* do the actual routing (actually schedule the update) */ status = osm_sm_mcgrp_join(sa->sm, mlid, - p_recvd_mcmember_rec->port_gid.unicast. 
- interface_id); + p_recvd_mcmember_rec->port_gid.unicast.interface_id, + p_recvd_mcmember_rec->scope_state); if (status != IB_SUCCESS) { OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: " diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c index daa60ff..b334d39 100644 --- a/opensm/opensm/osm_sm.c +++ b/opensm/opensm/osm_sm.c @@ -468,7 +468,7 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm, /********************************************************************** **********************************************************************/ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, - IN const ib_net64_t port_guid) + IN const ib_net64_t port_guid, IN uint8_t scope_state) { osm_mgrp_t *p_mgrp; osm_port_t *p_port; @@ -515,6 +515,25 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, goto Exit; } + /* if there was no change from the last time + * we processed the group we can skip doing anything + */ + if (p_mgrp->last_change_id == p_mgrp->last_tree_id) { + OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE, + "Skip processing mgrp with lid:0x%X last change id:%u\n", + cl_ntoh16(mlid), p_mgrp->last_change_id); + goto Exit; + } else { + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, + "processing mgrp with lid:0x%X port: 0x%016" PRIx64 " last change id:%u tree id:%u\n", + cl_ntoh16(mlid), cl_ntoh64(port_guid), + p_mgrp->last_change_id, p_mgrp->last_tree_id); + } + + /* add mgrp only to FULL member port. used for cleanup when port goes down */ + if (!(scope_state & IB_JOIN_STATE_FULL)) + goto MgrpProcess; + /* * Check if the object (according to mlid) already exists on this port. 
* If it does - then no need to update it again, and no need to @@ -543,6 +562,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, goto Exit; } +MgrpProcess: status = sm_mgrp_process(p_sm, p_mgrp); CL_PLOCK_RELEASE(p_sm->p_lock); -- 1.5.3.6 From sebastien.dugue at bull.net Mon May 18 00:55:16 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Mon, 18 May 2009 09:55:16 +0200 Subject: [ofa-general] [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages In-Reply-To: <20090518095156.7f9c39e6@frecb007965> References: <20090518095156.7f9c39e6@frecb007965> Message-ID: <20090518095516.6a803492@frecb007965> QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers size to the page size and then allocates page aligned memory using posix_memalign(). However, this allocation is quite wasteful on architectures using 64K pages (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc parameter and chunks are allocated using mmap. thus we end up allocating: (requested size rounded to the page size) + (page size) + (malloc overhead) rounded internally to the page size. So for example, if we request a buffer of page_size bytes, we end up consuming 3 pages. In short, for each QP buffer we allocate, there is an overhead of 2 pages. This is quite visible on large clusters especially where the number of QP can reach several thousands. This patch creates a new function mlx4_alloc_page() for use by mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when the page size is 64K. 
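The page accounting described above can be checked with a small sketch. It is only a model of the formula given in this message, not of glibc internals: pages_memalign(), pages_mmap() and the 16-byte overhead used in the example are hypothetical, and round_up() assumes a power-of-two page size.

```c
#include <stddef.h>

/* Round val up to a multiple of align (align must be a power of two). */
static size_t round_up(size_t val, size_t align)
{
	return (val + align - 1) & ~(align - 1);
}

/*
 * Pages consumed on the posix_memalign() path when the request is served
 * by mmap, following the formula in the message:
 * (size rounded to page) + (page size) + (malloc overhead),
 * with the total rounded up to the page size.
 */
static size_t pages_memalign(size_t size, size_t page, size_t overhead)
{
	return round_up(round_up(size, page) + page + overhead, page) / page;
}

/* Pages consumed when the buffer is mapped directly with mmap(). */
static size_t pages_mmap(size_t size, size_t page)
{
	return round_up(size, page) / page;
}
```

For a request of one 64K page with a small non-zero overhead, pages_memalign(65536, 65536, 16) gives 3 while pages_mmap(65536, 65536) gives 1, which is the 2-page overhead per QP buffer mentioned above.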
Signed-off-by: Sebastien Dugue --- src/buf.c | 40 ++++++++++++++++++++++++++++++++++++++-- src/mlx4.h | 7 +++++++ src/qp.c | 5 +++-- 3 files changed, 48 insertions(+), 4 deletions(-) diff --git a/src/buf.c b/src/buf.c index 0e5f9b6..c8b6823 100644 --- a/src/buf.c +++ b/src/buf.c @@ -35,6 +35,8 @@ #endif /* HAVE_CONFIG_H */ #include +#include +#include #include "mlx4.h" @@ -69,14 +71,48 @@ int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size) if (ret) free(buf->buf); - if (!ret) + if (!ret) { buf->length = size; + buf->type = MLX4_MALIGN; + } return ret; } +#define PAGE_64K (1UL << 16) + +int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size) +{ + int ret; + + /* Use the standard posix_memalign() call for pages < 64K */ + if (page_size < PAGE_64K) + return mlx4_alloc_buf(buf, size, page_size); + + /* Otherwise we can save a lot by using mmap directly */ + buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + if (buf->buf == MAP_FAILED) + return errno; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + munmap(buf->buf, align(size, page_size)); + else { + buf->length = size; + buf->type = MLX4_MMAP; + } + + return ret; + } + void mlx4_free_buf(struct mlx4_buf *buf) { ibv_dofork_range(buf->buf, buf->length); - free(buf->buf); + + if ( buf->type == MLX4_MMAP ) + munmap(buf->buf, buf->length); + else + free(buf->buf); } diff --git a/src/mlx4.h b/src/mlx4.h index 827a201..83547f5 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -161,9 +161,15 @@ struct mlx4_context { pthread_mutex_t db_list_mutex; }; +enum mlx4_buf_type { + MLX4_MMAP, + MLX4_MALIGN +}; + struct mlx4_buf { void *buf; size_t length; + enum mlx4_buf_type type; }; struct mlx4_pd { @@ -288,6 +294,7 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah) } int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size); +int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size); void 
mlx4_free_buf(struct mlx4_buf *buf); uint32_t *mlx4_alloc_db(struct mlx4_context *context, enum mlx4_db_type type); diff --git a/src/qp.c b/src/qp.c index d194ae3..557e255 100644 --- a/src/qp.c +++ b/src/qp.c @@ -604,8 +604,9 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, qp->sq.offset = 0; } - if (mlx4_alloc_buf(&qp->buf, - align(qp->buf_size, to_mdev(pd->context->device)->page_size), + if (mlx4_alloc_page(&qp->buf, + align(qp->buf_size, + to_mdev(pd->context->device)->page_size), to_mdev(pd->context->device)->page_size)) { free(qp->sq.wrid); free(qp->rq.wrid); -- 1.6.3.rc3.12.gb7937 From sebastien.dugue at bull.net Mon May 18 00:51:56 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Mon, 18 May 2009 09:51:56 +0200 Subject: [ofa-general] [PATCH 0/3] - libmthca libmlx4 - Optimize memory allocation of QP buffers with 64K pages Message-ID: <20090518095156.7f9c39e6@frecb007965> Hi, libmthca and libmlx4 allocate QP buffers using posix_memalign(), which results in big memory wastage on architectures with 64K pages. Replacing posix_memalign() with mmap() on those platforms allows to fix this (more description in the patches themselves). Now, for some numbers, a micro benchmark I wrote shows the heap usage and the number of mmaped pages used with posix_memalign() and mmap() respectively for 1000, 2000, up to 8000 QP. 
MTHCA
             posix_memalign                   mmap
    QP        heap  mmaped(pages)        heap  mmaped(pages)
  1000      838736           2988      576512           1000
  2000     1751216           5973     1161264           2000
  3000     2598144           8961     1746016           3000
  4000     3510656          11946     2330704           4000
  5000     4357616          14934     2915440           5000
  6000     5270080          17919     3500176           6000
  7000     6117056          20907     4084912           7000
  8000     6963968          23895     4669632           8000

MLX4
             posix_memalign                   mmap
    QP        heap  mmaped(pages)        heap  mmaped(pages)
  1000     1469424           2982     1010544           1003
  2000     2994048           5958     2010752           2003
  3000     4518672           8934     3010960           3003
  4000     5969520          11913     4002960           4003
  5000     7494176          14889     5003168           5003
  6000     8953248          17868     6003376           6003
  7000    10477856          20844     7003584           7003
  8000    12002496          23820     8003792           8003

This patchset consists of 3 patches:

1. Optimize memory allocation of QP buffers for libmthca
2. Optimize memory allocation of QP buffers for libmlx4
3. Fix the fixes patches for libmlx4 after having applied the previous patch.

Sebastien Dugue

From sebastien.dugue at bull.net Mon May 18 00:55:25 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Mon, 18 May 2009 09:55:25 +0200
Subject: [ofa-general] [PATCH 1/3] libmthca - Optimize memory allocation of QP buffers with 64K pages
In-Reply-To: <20090518095156.7f9c39e6@frecb007965>
References: <20090518095156.7f9c39e6@frecb007965>
Message-ID: <20090518095525.064a0cb5@frecb007965>

QP buffers are allocated with mthca_alloc_buf(), which rounds the buffer size to the page size and then allocates page aligned memory using posix_memalign(). However, this allocation is quite wasteful on architectures using 64K pages (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc parameter and chunks are allocated using mmap. Thus we end up allocating: (requested size rounded to the page size) + (page size) + (malloc overhead) rounded internally to the page size. So for example, if we request a buffer of page_size bytes, we end up consuming 3 pages. In short, for each QP buffer we allocate, there is an overhead of 2 pages.
This is quite visible on large clusters especially where the number of QP can reach several thousands. This patch creates a new function mthca_alloc_page() for use by mthca_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when the page size is 64K. Signed-off-by: Sebastien Dugue --- src/buf.c | 40 ++++++++++++++++++++++++++++++++++++++-- src/mthca.h | 7 +++++++ src/qp.c | 7 ++++--- 3 files changed, 49 insertions(+), 5 deletions(-) diff --git a/src/buf.c b/src/buf.c index 6c1be4f..ae37e9c 100644 --- a/src/buf.c +++ b/src/buf.c @@ -35,6 +35,8 @@ #endif /* HAVE_CONFIG_H */ #include +#include +#include #include "mthca.h" @@ -69,8 +71,38 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) if (ret) free(buf->buf); - if (!ret) + if (!ret) { buf->length = size; + buf->type = MTHCA_MALIGN; + } + + return ret; +} + +#define PAGE_64K (1UL << 16) + +int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size) +{ + int ret; + + /* Use the standard posix_memalign() call for pages < 64K */ + if (page_size < PAGE_64K) + return mthca_alloc_buf(buf, size, page_size); + + /* Otherwise we can save a lot by using mmap directly */ + buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + if (buf->buf == MAP_FAILED) + return errno; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + munmap(buf->buf, align(size, page_size)); + else { + buf->length = size; + buf->type = MTHCA_MMAP; + } return ret; } @@ -78,5 +110,9 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) void mthca_free_buf(struct mthca_buf *buf) { ibv_dofork_range(buf->buf, buf->length); - free(buf->buf); + + if ( buf->type == MTHCA_MMAP ) + munmap(buf->buf, buf->length); + else + free(buf->buf); } diff --git a/src/mthca.h b/src/mthca.h index 66751f3..7db15a7 100644 --- a/src/mthca.h +++ b/src/mthca.h @@ -138,9 +138,15 @@ struct mthca_context { int qp_table_mask; }; +enum mthca_buf_type { + 
MTHCA_MMAP, + MTHCA_MALIGN +}; + struct mthca_buf { void *buf; size_t length; + enum mthca_buf_type type; }; struct mthca_pd { @@ -291,6 +297,7 @@ static inline int mthca_is_memfree(struct ibv_context *ibctx) } int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size); +int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size); void mthca_free_buf(struct mthca_buf *buf); int mthca_alloc_db(struct mthca_db_table *db_tab, enum mthca_db_type type, diff --git a/src/qp.c b/src/qp.c index 84dd206..15f4805 100644 --- a/src/qp.c +++ b/src/qp.c @@ -848,9 +848,10 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift); - if (mthca_alloc_buf(&qp->buf, - align(qp->buf_size, to_mdev(pd->context->device)->page_size), - to_mdev(pd->context->device)->page_size)) { + if (mthca_alloc_page(&qp->buf, + align(qp->buf_size, + to_mdev(pd->context->device)->page_size), + to_mdev(pd->context->device)->page_size)) { free(qp->wrid); return -1; } -- 1.6.3.rc3.12.gb7937 From sebastien.dugue at bull.net Mon May 18 01:06:18 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Mon, 18 May 2009 10:06:18 +0200 Subject: [ofa-general] [PATCH 3/3] libmlx4 - Fix fixes after QP buffers alloc optimization patch to allow build. In-Reply-To: <20090518095156.7f9c39e6@frecb007965> References: <20090518095156.7f9c39e6@frecb007965> Message-ID: <20090518100618.3615f4ed@frecb007965> The patches in 'fixes/' need to be refreshed after the previous patch in order to build properly. 
Signed-off-by: Sebastien Dugue --- fixes/lim_qp_resources.patch | 20 ++++------- fixes/resize_cq_owner_bit.patch | 4 +-- fixes/userspace_dev_lims.patch | 12 ++---- fixes/xrc_consolidated_v2.patch | 68 ++++++++++++++------------------------ fixes/xrc_fix_close_domain.patch | 8 ++--- fixes/xrc_rcv_qp_v2.patch | 12 ++----- 6 files changed, 44 insertions(+), 80 deletions(-) diff --git a/fixes/lim_qp_resources.patch b/fixes/lim_qp_resources.patch index 1f89256..54cc63e 100644 --- a/fixes/lim_qp_resources.patch +++ b/fixes/lim_qp_resources.patch @@ -7,11 +7,9 @@ qp creation also lie within the reported device limits. Signed-off-by: Jack Morgenstein -Index: libmlx4/src/qp.c -=================================================================== ---- libmlx4.orig/src/qp.c 2008-06-04 08:24:45.000000000 +0300 -+++ libmlx4/src/qp.c 2008-06-04 08:24:49.000000000 +0300 -@@ -619,6 +619,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q +--- a/src/qp.c ++++ b/src/qp.c +@@ -622,6 +622,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q enum ibv_qp_type type) { int wqe_size; @@ -19,7 +17,7 @@ Index: libmlx4/src/qp.c wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg); switch (type) { -@@ -636,8 +637,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q +@@ -639,8 +640,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q } qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg); @@ -31,10 +29,8 @@ Index: libmlx4/src/qp.c cap->max_send_wr = qp->sq.max_post; /* -Index: libmlx4/src/verbs.c -=================================================================== ---- libmlx4.orig/src/verbs.c 2008-06-04 08:24:45.000000000 +0300 -+++ libmlx4/src/verbs.c 2008-06-04 08:24:49.000000000 +0300 +--- a/src/verbs.c ++++ b/src/verbs.c @@ -390,12 +390,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv struct ibv_create_qp_resp resp; struct mlx4_qp *qp; @@ -54,9 +50,9 @@ Index: libmlx4/src/verbs.c attr->cap.max_inline_data > 1024) return NULL; -@@ -461,8 +463,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv - if 
(ret) +@@ -464,8 +466,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv goto err_destroy; + pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex); - qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr; + qp->rq.wqe_cnt = attr->cap.max_recv_wr; diff --git a/fixes/resize_cq_owner_bit.patch b/fixes/resize_cq_owner_bit.patch index 6557027..0a5b564 100644 --- a/fixes/resize_cq_owner_bit.patch +++ b/fixes/resize_cq_owner_bit.patch @@ -3,11 +3,9 @@ for the target buffer (and not left as it was in the source buffer). Signed-off-by: Jack Morgenstein -diff --git a/src/cq.c b/src/cq.c -index 68e16e9..8226b6b 100644 --- a/src/cq.c +++ b/src/cq.c -@@ -455,6 +455,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx4_cq *cq, void *buf, int old_cqe) +@@ -478,6 +478,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx cqe = get_cqe(cq, (i & old_cqe)); while ((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) != MLX4_CQE_OPCODE_RESIZE) { diff --git a/fixes/userspace_dev_lims.patch b/fixes/userspace_dev_lims.patch index 07cf638..80d4d14 100644 --- a/fixes/userspace_dev_lims.patch +++ b/fixes/userspace_dev_lims.patch @@ -9,10 +9,8 @@ preferable to breaking the ABI. 
Signed-off-by: Jack Morgenstein -Index: libmlx4/src/mlx4.c -=================================================================== ---- libmlx4.orig/src/mlx4.c 2008-06-03 15:45:18.000000000 +0300 -+++ libmlx4/src/mlx4.c 2008-06-04 08:24:10.000000000 +0300 +--- a/src/mlx4.c ++++ b/src/mlx4.c @@ -104,6 +104,7 @@ static struct ibv_context *mlx4_alloc_co struct ibv_get_context cmd; struct mlx4_alloc_ucontext_resp resp; @@ -42,10 +40,8 @@ Index: libmlx4/src/mlx4.c err_free: free(context); return NULL; -Index: libmlx4/src/mlx4.h -=================================================================== ---- libmlx4.orig/src/mlx4.h 2008-06-03 15:45:18.000000000 +0300 -+++ libmlx4/src/mlx4.h 2008-06-04 08:24:10.000000000 +0300 +--- a/src/mlx4.h ++++ b/src/mlx4.h @@ -83,6 +83,20 @@ #define PFX "mlx4: " diff --git a/fixes/xrc_consolidated_v2.patch b/fixes/xrc_consolidated_v2.patch index 6fbd0a9..78a4f6c 100644 --- a/fixes/xrc_consolidated_v2.patch +++ b/fixes/xrc_consolidated_v2.patch @@ -18,8 +18,6 @@ V2: 2. Changed xrc_ops to more ops 3. 
Check for xrc verbs in ibv_more_ops via AC_CHECK_MEMBER -diff --git a/configure.in b/configure.in -index 25f27f7..46a3a64 100644 --- a/configure.in +++ b/configure.in @@ -42,6 +42,12 @@ AC_CHECK_HEADER(valgrind/memcheck.h, @@ -35,11 +33,9 @@ index 25f27f7..46a3a64 100644 dnl Checks for library functions AC_CHECK_FUNC(ibv_read_sysfs_file, [], -diff --git a/src/cq.c b/src/cq.c -index 68e16e9..c598b87 100644 --- a/src/cq.c +++ b/src/cq.c -@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq { struct mlx4_wq *wq; struct mlx4_cqe *cqe; @@ -50,7 +46,7 @@ index 68e16e9..c598b87 100644 uint32_t g_mlpath_rqpn; uint16_t wqe_index; int is_error; -@@ -221,20 +223,29 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -221,20 +222,29 @@ static int mlx4_poll_one(struct mlx4_cq is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; @@ -84,7 +80,7 @@ index 68e16e9..c598b87 100644 if (is_send) { wq = &(*cur_qp)->sq; -@@ -242,6 +254,10 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -242,6 +252,10 @@ static int mlx4_poll_one(struct mlx4_cq wq->tail += (uint16_t) (wqe_index - (uint16_t) wq->tail); wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; @@ -95,7 +91,7 @@ index 68e16e9..c598b87 100644 } else if ((*cur_qp)->ibv_qp.srq) { srq = to_msrq((*cur_qp)->ibv_qp.srq); wqe_index = htons(cqe->wqe_index); -@@ -387,6 +403,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) +@@ -387,6 +401,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t prod_index; uint8_t owner_bit; int nfreed = 0; @@ -106,7 +102,7 @@ index 68e16e9..c598b87 100644 /* * First we need to find the current producer index, so we -@@ -405,7 +425,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) +@@ -405,7 +423,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, */ while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, 
prod_index & cq->ibv_cq.cqe); @@ -120,8 +116,6 @@ index 68e16e9..c598b87 100644 if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index)); ++nfreed; -diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h -index 20a40c9..1b1253c 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -68,6 +68,14 @@ struct mlx4_resize_cq { @@ -152,8 +146,6 @@ index 20a40c9..1b1253c 100644 +#endif + #endif /* MLX4_ABI_H */ -diff --git a/src/mlx4.c b/src/mlx4.c -index 671e849..27ca75d 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -68,6 +68,16 @@ struct { @@ -173,7 +165,7 @@ index 671e849..27ca75d 100644 static struct ibv_context_ops mlx4_ctx_ops = { .query_device = mlx4_query_device, .query_port = mlx4_query_port, -@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ +@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_co for (i = 0; i < MLX4_QP_TABLE_SIZE; ++i) context->qp_table[i].refcnt = 0; @@ -189,7 +181,7 @@ index 671e849..27ca75d 100644 for (i = 0; i < MLX4_NUM_DB_TYPE; ++i) context->db_list[i] = NULL; -@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ +@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_co pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE); context->ibv_ctx.ops = mlx4_ctx_ops; @@ -199,8 +191,6 @@ index 671e849..27ca75d 100644 if (mlx4_query_device(&context->ibv_ctx, &dev_attrs)) goto query_free; -diff --git a/src/mlx4.h b/src/mlx4.h -index 8643d8f..3eadb98 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -79,6 +79,11 @@ @@ -248,7 +238,7 @@ index 8643d8f..3eadb98 100644 struct mlx4_db_page *db_list[MLX4_NUM_DB_TYPE]; pthread_mutex_t db_list_mutex; }; -@@ -260,6 +284,11 @@ struct mlx4_ah { +@@ -266,6 +290,11 @@ struct mlx4_ah { struct mlx4_av av; }; @@ -260,7 +250,7 @@ index 8643d8f..3eadb98 100644 static inline unsigned long align(unsigned long val, unsigned long align) { return (val + align - 1) & ~(align - 
1); -@@ -304,6 +333,13 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah) +@@ -310,6 +339,13 @@ static inline struct mlx4_ah *to_mah(str return to_mxxx(ah, ah); } @@ -272,9 +262,9 @@ index 8643d8f..3eadb98 100644 +#endif + int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size); + int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size); void mlx4_free_buf(struct mlx4_buf *buf); - -@@ -350,6 +386,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *srq, int ind); +@@ -357,6 +393,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq * int mlx4_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr); @@ -285,7 +275,7 @@ index 8643d8f..3eadb98 100644 struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int mlx4_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, -@@ -380,5 +420,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, struct ibv_ah_attr *attr, +@@ -387,5 +427,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, st void mlx4_free_av(struct mlx4_ah *ah); int mlx4_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); @@ -302,11 +292,9 @@ index 8643d8f..3eadb98 100644 + #endif /* MLX4_H */ -diff --git a/src/qp.c b/src/qp.c -index 01e8580..2f02430 100644 --- a/src/qp.c +++ b/src/qp.c -@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, +@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; @@ -315,7 +303,7 @@ index 01e8580..2f02430 100644 (wr->send_flags & IBV_SEND_SIGNALED ? htonl(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr->send_flags & IBV_SEND_SOLICITED ? 
-@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, +@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, size = sizeof *ctrl / 16; switch (ibqp->qp_type) { @@ -325,7 +313,7 @@ index 01e8580..2f02430 100644 case IBV_QPT_RC: case IBV_QPT_UC: switch (wr->opcode) { -@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp_cap *cap, enum ibv_qp_type type, +@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp size += sizeof (struct mlx4_wqe_raddr_seg); break; @@ -333,7 +321,7 @@ index 01e8580..2f02430 100644 case IBV_QPT_RC: size += sizeof (struct mlx4_wqe_raddr_seg); /* -@@ -631,6 +635,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, +@@ -632,6 +636,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q case IBV_QPT_UC: case IBV_QPT_RC: @@ -341,11 +329,9 @@ index 01e8580..2f02430 100644 wqe_size -= sizeof (struct mlx4_wqe_raddr_seg); break; -diff --git a/src/srq.c b/src/srq.c -index ba2ceb9..1350792 100644 --- a/src/srq.c +++ b/src/srq.c -@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, +@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd return 0; } @@ -399,8 +385,6 @@ index ba2ceb9..1350792 100644 + pthread_mutex_unlock(&ctx->xrc_srq_table_mutex); +} + -diff --git a/src/verbs.c b/src/verbs.c -index 400050c..b7c9c8e 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -368,18 +368,36 @@ int mlx4_query_srq(struct ibv_srq *srq, @@ -447,7 +431,7 @@ index 400050c..b7c9c8e 100644 return 0; } -@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv qp->sq.wqe_cnt = align_queue_size(attr->cap.max_send_wr + qp->sq_spare_wqes); qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr); @@ -456,7 +440,7 @@ index 400050c..b7c9c8e 100644 attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0; else { if (attr->cap.max_recv_sge < 1) -@@ -433,7 +451,7 @@ struct ibv_qp 
*mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; @@ -465,7 +449,7 @@ index 400050c..b7c9c8e 100644 qp->db = mlx4_alloc_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ); if (!qp->db) goto err_free; -@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv } cmd.buf_addr = (uintptr_t) qp->buf.buf; @@ -474,7 +458,7 @@ index 400050c..b7c9c8e 100644 cmd.db_addr = 0; else cmd.db_addr = (uintptr_t) qp->db; -@@ -485,7 +503,7 @@ err_destroy: +@@ -489,7 +507,7 @@ err_destroy: err_rq_db: pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex); @@ -483,7 +467,7 @@ index 400050c..b7c9c8e 100644 mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db); err_free: -@@ -544,7 +562,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, +@@ -548,7 +566,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, st mlx4_cq_clean(to_mcq(qp->send_cq), qp->qp_num, NULL); mlx4_init_qp_indices(to_mqp(qp)); @@ -492,16 +476,16 @@ index 400050c..b7c9c8e 100644 *to_mqp(qp)->db = 0; } -@@ -603,7 +621,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp) - +@@ -611,7 +629,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp) mlx4_unlock_cqs(ibqp); + pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex); - if (!ibqp->srq) + if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC) mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db); free(qp->sq.wrid); if (qp->rq.wqe_cnt) -@@ -661,3 +679,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +@@ -669,3 +687,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, { return ibv_cmd_detach_mcast(qp, gid, lid); } @@ -605,8 +589,6 @@ index 400050c..b7c9c8e 100644 + return 0; +} +#endif -diff --git a/src/wqe.h b/src/wqe.h -index 6f7f309..fa2f8ac 100644 --- a/src/wqe.h +++ b/src/wqe.h @@ -65,7 
+65,7 @@ struct mlx4_wqe_ctrl_seg { diff --git a/fixes/xrc_fix_close_domain.patch b/fixes/xrc_fix_close_domain.patch index dfad7ac..3af2640 100644 --- a/fixes/xrc_fix_close_domain.patch +++ b/fixes/xrc_fix_close_domain.patch @@ -6,11 +6,9 @@ Need to pass this upward to caller. Signed-off-by: Jack Morgenstein -Index: libmlx4/src/verbs.c -=================================================================== ---- libmlx4.orig/src/verbs.c 2008-09-01 10:51:11.000000000 +0300 -+++ libmlx4/src/verbs.c 2008-09-01 10:52:40.000000000 +0300 -@@ -774,9 +774,11 @@ +--- a/src/verbs.c ++++ b/src/verbs.c +@@ -782,9 +782,11 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom int mlx4_close_xrc_domain(struct ibv_xrc_domain *d) { diff --git a/fixes/xrc_rcv_qp_v2.patch b/fixes/xrc_rcv_qp_v2.patch index 311c500..00ffd53 100644 --- a/fixes/xrc_rcv_qp_v2.patch +++ b/fixes/xrc_rcv_qp_v2.patch @@ -5,11 +5,9 @@ Signed-off-by: Jack Morgenstein V2: 1. xrc_ops changed to more_ops -diff --git a/src/mlx4.c b/src/mlx4.c -index 27ca75d..e5ded78 100644 --- a/src/mlx4.c +++ b/src/mlx4.c -@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops = { +@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops .create_xrc_srq = mlx4_create_xrc_srq, .open_xrc_domain = mlx4_open_xrc_domain, .close_xrc_domain = mlx4_close_xrc_domain, @@ -21,11 +19,9 @@ index 27ca75d..e5ded78 100644 #endif }; #endif -diff --git a/src/mlx4.h b/src/mlx4.h -index 3eadb98..6307a2d 100644 --- a/src/mlx4.h +++ b/src/mlx4.h -@@ -429,6 +429,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context, +@@ -436,6 +436,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom int fd, int oflag); int mlx4_close_xrc_domain(struct ibv_xrc_domain *d); @@ -47,11 +43,9 @@ index 3eadb98..6307a2d 100644 #endif -diff --git a/src/verbs.c b/src/verbs.c -index b7c9c8e..8261eae 100644 --- a/src/verbs.c +++ b/src/verbs.c -@@ -778,4 +778,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc_domain *d) +@@ -786,4 +786,59 @@ int mlx4_close_xrc_domain(struct 
ibv_xrc free(d); return 0; } -- 1.6.3.rc3.12.gb7937

From eli at mellanox.co.il Mon May 18 01:55:24 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 18 May 2009 11:55:24 +0300
Subject: [ofa-general] [PATCH 1/2] mlx4_core: Use module parameter for number of MTTs per segment
Message-ID: <20090518085524.GA16094@mtls03>

The current MTT allocator uses kmalloc to allocate a buffer for its
buddy system implementation and is therefore limited in the number of
MTT segments it can control. As a result, the amount of memory that can
be registered is limited too. This patch adds a module parameter that
controls the number of MTT entries each segment represents, which allows
more memory to be registered with the same number of segments.

Signed-off-by: Eli Cohen
---
 drivers/net/mlx4/main.c     |   14 ++++++++++++--
 drivers/net/mlx4/mr.c       |    6 +++---
 drivers/net/mlx4/profile.c  |    2 +-
 include/linux/mlx4/device.h |    1 +
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 30bea96..018348c 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -100,6 +100,10 @@ module_param_named(use_prio, use_prio, bool, 0444);
 MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports "
 		  "(0/1, default 0)");
 
+static int log_mtts_per_seg = ilog2(MLX4_MTT_ENTRY_PER_SEG);
+module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444);
+MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT entries per segment (1-5)");
+
 int mlx4_check_port_params(struct mlx4_dev *dev,
 			   enum mlx4_port_type *port_type)
 {
@@ -203,12 +207,13 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev->caps.max_cqes = dev_cap->max_cq_sz - 1;
 	dev->caps.reserved_cqs = dev_cap->reserved_cqs;
 	dev->caps.reserved_eqs = dev_cap->reserved_eqs;
+	dev->caps.mtts_per_seg = 1 << log_mtts_per_seg;
 	dev->caps.reserved_mtts = DIV_ROUND_UP(dev_cap->reserved_mtts,
-					       MLX4_MTT_ENTRY_PER_SEG);
+					       dev->caps.mtts_per_seg);
dev->caps.reserved_mrws = dev_cap->reserved_mrws; dev->caps.reserved_uars = dev_cap->reserved_uars; dev->caps.reserved_pds = dev_cap->reserved_pds; - dev->caps.mtt_entry_sz = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz; + dev->caps.mtt_entry_sz = dev->caps.mtts_per_seg * dev_cap->mtt_entry_sz; dev->caps.max_msg_sz = dev_cap->max_msg_sz; dev->caps.page_size_cap = ~(u32) (dev_cap->min_page_sz - 1); dev->caps.flags = dev_cap->flags; @@ -1304,6 +1309,11 @@ static int __init mlx4_verify_params(void) return -1; } + if ((log_mtts_per_seg < 1) || (log_mtts_per_seg > 5)) { + printk(KERN_WARNING "mlx4_core: bad log_mtts_per_seg: %d\n", log_mtts_per_seg); + return -1; + } + return 0; } diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 0caf74c..3b8973d 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -209,7 +209,7 @@ int mlx4_mtt_init(struct mlx4_dev *dev, int npages, int page_shift, } else mtt->page_shift = page_shift; - for (mtt->order = 0, i = MLX4_MTT_ENTRY_PER_SEG; i < npages; i <<= 1) + for (mtt->order = 0, i = dev->caps.mtts_per_seg; i < npages; i <<= 1) ++mtt->order; mtt->first_seg = mlx4_alloc_mtt_range(dev, mtt->order); @@ -350,7 +350,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) mpt_entry->pd_flags |= cpu_to_be32(MLX4_MPT_PD_FLAG_FAST_REG | MLX4_MPT_PD_FLAG_RAE); mpt_entry->mtt_sz = cpu_to_be32((1 << mr->mtt.order) * - MLX4_MTT_ENTRY_PER_SEG); + dev->caps.mtts_per_seg); } else { mpt_entry->flags |= cpu_to_be32(MLX4_MPT_FLAG_SW_OWNS); } @@ -391,7 +391,7 @@ static int mlx4_write_mtt_chunk(struct mlx4_dev *dev, struct mlx4_mtt *mtt, (start_index + npages - 1) / (PAGE_SIZE / sizeof (u64))) return -EINVAL; - if (start_index & (MLX4_MTT_ENTRY_PER_SEG - 1)) + if (start_index & (dev->caps.mtts_per_seg - 1)) return -EINVAL; mtts = mlx4_table_find(&priv->mr_table.mtt_table, mtt->first_seg + diff --git a/drivers/net/mlx4/profile.c b/drivers/net/mlx4/profile.c index cebdf32..bd22df9 100644 --- a/drivers/net/mlx4/profile.c 
+++ b/drivers/net/mlx4/profile.c
@@ -98,7 +98,7 @@ u64 mlx4_make_profile(struct mlx4_dev *dev,
 	profile[MLX4_RES_EQ].size = dev_cap->eqc_entry_sz;
 	profile[MLX4_RES_DMPT].size = dev_cap->dmpt_entry_sz;
 	profile[MLX4_RES_CMPT].size = dev_cap->cmpt_entry_sz;
-	profile[MLX4_RES_MTT].size = MLX4_MTT_ENTRY_PER_SEG * dev_cap->mtt_entry_sz;
+	profile[MLX4_RES_MTT].size = dev->caps.mtts_per_seg * dev_cap->mtt_entry_sz;
 	profile[MLX4_RES_MCG].size = MLX4_MGM_ENTRY_SIZE;
 
 	profile[MLX4_RES_QP].num = request->num_qp;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 3aff8a6..ce7cc6c 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -210,6 +210,7 @@ struct mlx4_caps {
 	int num_comp_vectors;
 	int num_mpts;
 	int num_mtt_segs;
+	int mtts_per_seg;
 	int fmr_reserved_mtts;
 	int reserved_mtts;
 	int reserved_mrws;
-- 
1.6.3

From eli at mellanox.co.il Mon May 18 01:55:51 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 18 May 2009 11:55:51 +0300
Subject: [ofa-general] [PATCH 2/2] ib_mthca: Use module parameter for number of MTTs per segment
Message-ID: <20090518085551.GA16106@mtls03>

The current MTT allocator uses kmalloc to allocate a buffer for its
buddy system implementation and is therefore limited in the number of
MTT segments it can control. As a result, the amount of memory that can
be registered is limited too. This patch adds a module parameter that
controls the number of MTT entries each segment represents, which allows
more memory to be registered with the same number of segments.
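
As a rough illustration of the arithmetic described above (this is not part of the patch), the sketch below shows how log_mtts_per_seg could translate into entries per segment, segment size, buddy order and coverable memory. The function names are hypothetical and do not exist in the driver; the 8-byte entry size and the 1-5 range are taken from the diff, while the assumption that each MTT entry maps exactly one page is a simplification made here.

```c
#include <stdint.h>

/*
 * Illustration only: the helper names below are hypothetical and do not
 * exist in the driver.  Only log_mtts_per_seg, its valid range (1-5) and
 * the 8-byte entry size are taken from the patch.
 */

/* Number of MTT entries per segment, or 0 if the parameter is out of range. */
unsigned int mtts_per_seg(int log_mtts_per_seg)
{
	if (log_mtts_per_seg < 1 || log_mtts_per_seg > 5)
		return 0;
	return 1u << log_mtts_per_seg;
}

/* Segment size in bytes: each MTT entry is 8 bytes wide. */
unsigned int mtt_seg_size(int log_mtts_per_seg)
{
	return mtts_per_seg(log_mtts_per_seg) * 8;
}

/*
 * Buddy order needed to hold npages entries, mirroring the loop used in
 * the patch.  Returns -1 if per_seg is 0.
 */
int mtt_order(unsigned int per_seg, unsigned int npages)
{
	int order = 0;
	unsigned int i;

	if (per_seg == 0)
		return -1;
	for (i = per_seg; i < npages; i <<= 1)
		++order;
	return order;
}

/* Memory covered by num_segs segments, assuming one page per MTT entry. */
uint64_t max_registered_bytes(uint64_t num_segs, unsigned int per_seg,
			      uint64_t page_size)
{
	return num_segs * per_seg * page_size;
}
```

With 1024 segments and 4 KiB pages (both made-up numbers), going from 8 to 32 entries per segment would raise the coverable memory from 32 MiB to 128 MiB, and a 1024-page registration would need buddy order 5 instead of 7.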
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mthca/mthca_cmd.c | 2 +- drivers/infiniband/hw/mthca/mthca_dev.h | 1 + drivers/infiniband/hw/mthca/mthca_main.c | 17 ++++++++++++++--- drivers/infiniband/hw/mthca/mthca_mr.c | 16 ++++++++-------- drivers/infiniband/hw/mthca/mthca_profile.c | 4 ++-- 5 files changed, 26 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 6d55f9d..8c2ed99 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1059,7 +1059,7 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); if (mthca_is_memfree(dev)) dev_lim->reserved_mtts = ALIGN((1 << (field >> 4)) * sizeof(u64), - MTHCA_MTT_SEG_SIZE) / MTHCA_MTT_SEG_SIZE; + dev->limits.mtt_seg_size) / dev->limits.mtt_seg_size; else dev_lim->reserved_mtts = 1 << (field >> 4); MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 2525901..9ef611f 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -159,6 +159,7 @@ struct mthca_limits { int reserved_eqs; int num_mpts; int num_mtt_segs; + int mtt_seg_size; int fmr_reserved_mtts; int reserved_mtts; int reserved_mrws; diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 1d83cf7..13da9f1 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -125,6 +125,10 @@ module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444); MODULE_PARM_DESC(fmr_reserved_mtts, "number of memory translation table segments reserved for FMR"); +static int log_mtts_per_seg = ilog2(MTHCA_MTT_SEG_SIZE / 8); +module_param_named(log_mtts_per_seg, log_mtts_per_seg, int, 0444); +MODULE_PARM_DESC(log_mtts_per_seg, "Log2 number of MTT 
entries per segment (1-5)"); + static char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; @@ -162,6 +166,7 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) int err; u8 status; + mdev->limits.mtt_seg_size = (1 << log_mtts_per_seg) * 8; err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); if (err) { mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); @@ -460,11 +465,11 @@ static int mthca_init_icm(struct mthca_dev *mdev, } /* CPU writes to non-reserved MTTs, while HCA might DMA to reserved mtts */ - mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * MTHCA_MTT_SEG_SIZE, - dma_get_cache_alignment()) / MTHCA_MTT_SEG_SIZE; + mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * mdev->limits.mtt_seg_size, + dma_get_cache_alignment()) / mdev->limits.mtt_seg_size; mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, - MTHCA_MTT_SEG_SIZE, + mdev->limits.mtt_seg_size, mdev->limits.num_mtt_segs, mdev->limits.reserved_mtts, 1, 0); @@ -1315,6 +1320,12 @@ static void __init mthca_validate_profile(void) printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n", hca_profile.fmr_reserved_mtts); } + + if ((log_mtts_per_seg < 1) || (log_mtts_per_seg > 5)) { + printk(KERN_WARNING PFX "bad log_mtts_per_seg (%d). 
Using default - %d\n", + log_mtts_per_seg, ilog2(MTHCA_MTT_SEG_SIZE / 8)); + log_mtts_per_seg = ilog2(MTHCA_MTT_SEG_SIZE / 8); + } } static int __init mthca_init(void) diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index 882e6b7..d606edf 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -220,7 +220,7 @@ static struct mthca_mtt *__mthca_alloc_mtt(struct mthca_dev *dev, int size, mtt->buddy = buddy; mtt->order = 0; - for (i = MTHCA_MTT_SEG_SIZE / 8; i < size; i <<= 1) + for (i = dev->limits.mtt_seg_size / 8; i < size; i <<= 1) ++mtt->order; mtt->first_seg = mthca_alloc_mtt_range(dev, mtt->order, buddy); @@ -267,7 +267,7 @@ static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, while (list_len > 0) { mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + - mtt->first_seg * MTHCA_MTT_SEG_SIZE + + mtt->first_seg * dev->limits.mtt_seg_size + start_index * 8); mtt_entry[1] = 0; for (i = 0; i < list_len && i < MTHCA_MAILBOX_SIZE / 8 - 2; ++i) @@ -326,7 +326,7 @@ static void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, u64 __iomem *mtts; int i; - mtts = dev->mr_table.tavor_fmr.mtt_base + mtt->first_seg * MTHCA_MTT_SEG_SIZE + + mtts = dev->mr_table.tavor_fmr.mtt_base + mtt->first_seg * dev->limits.mtt_seg_size + start_index * sizeof (u64); for (i = 0; i < list_len; ++i) mthca_write64_raw(cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT), @@ -345,10 +345,10 @@ static void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, /* For Arbel, all MTTs must fit in the same page. 
*/ BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE); /* Require full segments */ - BUG_ON(s % MTHCA_MTT_SEG_SIZE); + BUG_ON(s % dev->limits.mtt_seg_size); mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg + - s / MTHCA_MTT_SEG_SIZE, &dma_handle); + s / dev->limits.mtt_seg_size, &dma_handle); BUG_ON(!mtts); @@ -479,7 +479,7 @@ int mthca_mr_alloc(struct mthca_dev *dev, u32 pd, int buffer_size_shift, if (mr->mtt) mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + - mr->mtt->first_seg * MTHCA_MTT_SEG_SIZE); + mr->mtt->first_seg * dev->limits.mtt_seg_size); if (0) { mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); @@ -626,7 +626,7 @@ int mthca_fmr_alloc(struct mthca_dev *dev, u32 pd, goto err_out_table; } - mtt_seg = mr->mtt->first_seg * MTHCA_MTT_SEG_SIZE; + mtt_seg = mr->mtt->first_seg * dev->limits.mtt_seg_size; if (mthca_is_memfree(dev)) { mr->mem.arbel.mtts = mthca_table_find(dev->mr_table.mtt_table, @@ -908,7 +908,7 @@ int mthca_init_mr_table(struct mthca_dev *dev) dev->mr_table.mtt_base); dev->mr_table.tavor_fmr.mtt_base = - ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE); + ioremap(addr, mtts * dev->limits.mtt_seg_size); if (!dev->mr_table.tavor_fmr.mtt_base) { mthca_warn(dev, "MTT ioremap for FMR failed.\n"); err = -ENOMEM; diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index d168c25..8edb28a 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -94,7 +94,7 @@ s64 mthca_make_profile(struct mthca_dev *dev, profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; profile[MTHCA_RES_MPT].size = dev_lim->mpt_entry_sz; - profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_MTT].size = dev->limits.mtt_seg_size; profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; profile[MTHCA_RES_UARC].size = 
request->uarc_size;
@@ -232,7 +232,7 @@ s64 mthca_make_profile(struct mthca_dev *dev,
 			dev->limits.num_mtt_segs = profile[i].num;
 			dev->mr_table.mtt_base = profile[i].start;
 			init_hca->mtt_base = profile[i].start;
-			init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7;
+			init_hca->mtt_seg_sz = ffs(dev->limits.mtt_seg_size) - 7;
 			break;
 		case MTHCA_RES_UAR:
 			dev->limits.num_uars = profile[i].num;
-- 
1.6.3

From PHF at zurich.ibm.com Mon May 18 02:52:27 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Mon, 18 May 2009 11:52:27 +0200
Subject: [ofa-general] RPATH issue with libibverbs (OFED 1.4)
Message-ID:

Hi,

I am no longer able to build libibverbs due to an RPATH issue. Can you
give me some advice as to how to solve it?

When running the 'install.pl' script, I get the following output:

...
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' /root/OFED/1.4/OFED-1.4/SRPMS/libibverbs-1.1.2-1.ofed1.4.src.rpm
Failed to build libibverbs RPM
See /tmp/OFED.2614.logs/libibverbs.rpmbuild.log

The last few lines from that log are:

ERROR 0001: file '/usr/bin/ibv_asyncwatch' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_srq_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_devices' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_devinfo' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_rc_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_ud_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
ERROR 0001: file '/usr/bin/ibv_uc_pingpong' contains a standard rpath '/usr/lib64' in [/usr/lib64]
error: Bad exit status from /var/tmp/rpm-tmp.52084 (%install)

I am running the following Fedora kernel: 2.6.27.21-78.2.41.fc9.x86_64

On another machine with the exact same setup, the installation works fine.

Many thanks and kind regards,
Philip

-- 
Philip Frey
IBM Zurich Research Laboratory
Saumerstrasse 4 | Phone: +41 44 724 8613
CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com

From vlad at lists.openfabrics.org Mon May 18 03:21:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 18 May 2009 03:21:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090518-0200 daily build status
Message-ID: <20090518102113.2EE37E61348@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with
linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From Bert.Wiegers at t-systems-sfr.com Mon May 18 04:10:42 2009 From: Bert.Wiegers at t-systems-sfr.com (Wiegers, Bert) Date: Mon, 18 May 2009 13:10:42 +0200 Subject: [ofa-general] MTU in IPoIB In-Reply-To: <20090414185748.5ea98ae7@beno.local.bs> References: <200904112233.51105.bs_lists@aakef.fastmail.fm> <20090414091223.c7911402.weiny2@llnl.gov> <20090414185748.5ea98ae7@beno.local.bs> Message-ID: <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com> Hi, In our default-setup we are using IPoIB. This is set up with a MTU of 65520 ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.75.107.32 Bcast:10.75.255.255 Mask:255.255.0.0 inet6 addr: fe80::214:4fa4:d3ba:25/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:9318 errors:0 dropped:0 overruns:0 frame:0 TX packets:4197 errors:0 dropped:10 overruns:0 carrier:0 collisions:0 txqueuelen:4096 RX bytes:25032362 (23.8 Mb) TX bytes:636320 (621.4 Kb) Our dmesg on the other hand shows these hints: ib_core: module not supported by Novell, setting U taint flag. ib_mad: module not supported by Novell, setting U taint flag. ib_mthca: module not supported by Novell, setting U taint flag. mlx4_ib: module not supported by Novell, setting U taint flag. 
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) ib_ipath: module not supported by Novell, setting U taint flag. cxgb3: module not supported by Novell, setting U taint flag. iw_cxgb3: module not supported by Novell, setting U taint flag. ib_umad: module not supported by Novell, setting U taint flag. NET: Registered protocol family 10 lo: Disabled Privacy Extensions IPv6 over IPv4 tunneling driver ib_uverbs: module not supported by Novell, setting U taint flag. ib_sa: module not supported by Novell, setting U taint flag. ib_cm: module not supported by Novell, setting U taint flag. ib_ipoib: module not supported by Novell, setting U taint flag. ADDRCONF(NETDEV_UP): ib0: link is not ready ib0: enabling connected mode will cause multicast packet drops ib0: mtu > 2044 will cause multicast packet drops. ib0: mtu > 2044 will cause multicast packet drops. ib1: enabling connected mode will cause multicast packet drops ib1: mtu > 2044 will cause multicast packet drops. ib1: mtu > 2044 will cause multicast packet drops. ib_addr: module not supported by Novell, setting U taint flag. iw_cm: module not supported by Novell, setting U taint flag. rdma_cm: module not supported by Novell, setting U taint flag. ib_sdp: module not supported by Novell, setting U taint flag. NET: Registered protocol family 27 qlgc_vnic: module not supported by Novell, setting U taint flag. QLGC_VNIC: Initializing QLogic Corp. Virtual NIC (VNIC) driver version 1.3.0.0.4 rdma_ucm: module not supported by Novell, setting U taint flag. scsi_transport_iscsi: module not supported by Novell, setting U taint flag. Loading iSCSI transport class v2.0-869. libiscsi: module not supported by Novell, setting U taint flag. iscsi_tcp: module not supported by Novell, setting U taint flag. iscsi: registered transport (tcp) ib_iser: module not supported by Novell, setting U taint flag. iscsi: registered transport (iser) md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. 
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready eth0: no IPv6 routers present ib0: no IPv6 routers present So should I limit the MTU to 2044? Thanks. Bert From Robert at saq.co.uk Mon May 18 05:18:40 2009 From: Robert at saq.co.uk (Robert Dunkley) Date: Mon, 18 May 2009 13:18:40 +0100 Subject: [ofa-general] MTU in IPoIB References: <200904112233.51105.bs_lists@aakef.fastmail.fm><20090414091223.c7911402.weiny2@llnl.gov><20090414185748.5ea98ae7@beno.local.bs> <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com> Message-ID: Hi, This warning is normal. If you don't need Multicast then it is of no concern at all. If you have an app that uses multicast then you will have to limit the MTU (In this case you might be better off using reliable transmission - not connected mode). Rob -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Wiegers, Bert Sent: 18 May 2009 12:11 To: general at lists.openfabrics.org Subject: [ofa-general] MTU in IPoIB Hi, In our default-setup we are using IPoIB. This is set up with a MTU of 65520 ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.75.107.32 Bcast:10.75.255.255 Mask:255.255.0.0 inet6 addr: fe80::214:4fa4:d3ba:25/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:9318 errors:0 dropped:0 overruns:0 frame:0 TX packets:4197 errors:0 dropped:10 overruns:0 carrier:0 collisions:0 txqueuelen:4096 RX bytes:25032362 (23.8 Mb) TX bytes:636320 (621.4 Kb) Our dmesg on the other hand shows these hints: ib_core: module not supported by Novell, setting U taint flag. ib_mad: module not supported by Novell, setting U taint flag. ib_mthca: module not supported by Novell, setting U taint flag. mlx4_ib: module not supported by Novell, setting U taint flag. 
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) ib_ipath: module not supported by Novell, setting U taint flag. cxgb3: module not supported by Novell, setting U taint flag. iw_cxgb3: module not supported by Novell, setting U taint flag. ib_umad: module not supported by Novell, setting U taint flag. NET: Registered protocol family 10 lo: Disabled Privacy Extensions IPv6 over IPv4 tunneling driver ib_uverbs: module not supported by Novell, setting U taint flag. ib_sa: module not supported by Novell, setting U taint flag. ib_cm: module not supported by Novell, setting U taint flag. ib_ipoib: module not supported by Novell, setting U taint flag. ADDRCONF(NETDEV_UP): ib0: link is not ready ib0: enabling connected mode will cause multicast packet drops ib0: mtu > 2044 will cause multicast packet drops. ib0: mtu > 2044 will cause multicast packet drops. ib1: enabling connected mode will cause multicast packet drops ib1: mtu > 2044 will cause multicast packet drops. ib1: mtu > 2044 will cause multicast packet drops. ib_addr: module not supported by Novell, setting U taint flag. iw_cm: module not supported by Novell, setting U taint flag. rdma_cm: module not supported by Novell, setting U taint flag. ib_sdp: module not supported by Novell, setting U taint flag. NET: Registered protocol family 27 qlgc_vnic: module not supported by Novell, setting U taint flag. QLGC_VNIC: Initializing QLogic Corp. Virtual NIC (VNIC) driver version 1.3.0.0.4 rdma_ucm: module not supported by Novell, setting U taint flag. scsi_transport_iscsi: module not supported by Novell, setting U taint flag. Loading iSCSI transport class v2.0-869. libiscsi: module not supported by Novell, setting U taint flag. iscsi_tcp: module not supported by Novell, setting U taint flag. iscsi: registered transport (tcp) ib_iser: module not supported by Novell, setting U taint flag. iscsi: registered transport (iser) md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. 
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready eth0: no IPv6 routers present ib0: no IPv6 routers present So should I limit the MTU to 2044? Thanks. Bert From ogerlitz at Voltaire.com Mon May 18 05:20:19 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Mon, 18 May 2009 15:20:19 +0300 Subject: [ofa-general] MTU in IPoIB In-Reply-To: <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com> References: <200904112233.51105.bs_lists@aakef.fastmail.fm> <20090414091223.c7911402.weiny2@llnl.gov> <20090414185748.5ea98ae7@beno.local.bs> <5C59564D023C844F9C20CCD32687E5F4210E2EE6@SFREXMBX01.acds.t-systems-sfr.com> Message-ID: <4A115283.6020803@Voltaire.com> Wiegers, Bert wrote: > In our default-setup we are using IPoIB. This is set up with a MTU of 65520 > ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 [...] > ib0: enabling connected mode will cause multicast packet drops > ib0: mtu > 2044 will cause multicast packet drops. > So should I limit the MTU to 2044?
Please take a look at Documentation/infiniband/ipoib.txt; specifically, commit b49ca "IPoIB: Document newish features" should help you understand things better. Or. From tziporet at mellanox.co.il Mon May 18 06:06:27 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 18 May 2009 16:06:27 +0300 Subject: [ofa-general] EWG/OFED meeting agenda for today (May 18) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com> This is the agenda for today's EWG/OFED meeting: 1. OFED 1.4.1 bugs status and decision on RC6 date 1628 blo andy.grover at oracle.com RDS in 1.4.1 cannot connect to RDS in 1.3.1 - I think Andy sent a fix for this 1596 cri Jeffrey.C.Becker at nasa.gov openibd stop failed when nfs is loaded - Jeff B. is working on this - need update 1616 cri swise at opengridcomputing.com iommu_alloc error when running connectathon on ppc64 nfs ... - I think Steve sent a patch for this 1571 cri vu at mellanox.com nfsrdma server crash @test5 connectathon basic test, - Need update from Vu We had a problematic RC5 which we deleted. We are now waiting for the resolution of bug 1596 2. New memory registration API - update from Jeff S. 3. OFED 1.5 status update - all 4. Open discussion Tziporet From swise at opengridcomputing.com Mon May 18 07:20:36 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 18 May 2009 09:20:36 -0500 Subject: [ofa-general] Re: EWG/OFED meeting agenda for today (May 18) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD02B7BFDF@mtlexch01.mtl.com> Message-ID: <4A116EB4.4060102@opengridcomputing.com> Tziporet Koren wrote: > > This is the agenda for today's EWG/OFED meeting: > > 1.
OFED 1.4.1 bugs status and decision on RC6 date > > 1628 blo andy.grover at oracle.com RDS in 1.4.1 cannot connect to RDS in > 1.3.1 - I think Andy sent a fix for this > > 1596 cri Jeffrey.C.Becker at nasa.gov openibd stop failed when nfs is > loaded - Jeff B. is working on this - need update > > 1616 cri swise at opengridcomputing.com iommu_alloc error when running > connectathon on ppc64 nfs … - I think Steve sent a patch for this > I did. I just now closed 1616. From worleys at gmail.com Mon May 18 09:21:00 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 10:21:00 -0600 Subject: [ofa-general] SRP aggregate bandwidth decreasing as threads increase Message-ID: I'm seeing peak performance at ~4 threads (1.6GB/s); with >10 threads, aggregate performance drops significantly. This is not a drive issue: locally, the drives get their best performance at >~32 threads and maintain their aggregate bandwidth well beyond that. Is there any tunable parameter or source code change in the initiator or target code that would affect performance with a high thread count? Thanks, Chris From jsquyres at cisco.com Mon May 18 09:24:48 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 18 May 2009 12:24:48 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: Message-ID: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com> On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote: > > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also > > releasing 0x2000-0x2fff. > > If everyone is doing this, how do you handle the case that Jason > pointed > out, namely: > > * you register 0x1000 ... 0x3fff > * you want to register 0x2000 ... 0x2fff and have a cache hit > * you finish up with 0x1000 ... 0x3fff > * app does something (which is valid since you finished up with the > bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free() > that leads to munmap() or whatever), and your hooks tell you so.
> * app reallocates a mapping in 0x3000 ... 0x3fff > * you want to re-register 0x1000 ... 0x3fff -- but it has to be > marked > both invalid and in-use in the cache at this point !? > Sorry; this mail slipped by me and I just saw it now. If this can actually happen -- that the mapping of 0x1000 ... 0x3fff can change even though it is still registered -- then we're screwed: we have no way of knowing that this is now invalid (Open MPI, at least -- can't speak for others). Is there a way to detect this condition in userspace? -- Jeff Squyres Cisco Systems From worleys at gmail.com Mon May 18 09:57:10 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 10:57:10 -0600 Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 10:52 AM, Sufficool, Stanley wrote: > IIRC, The SRP Target code has many context switches that throttle > performance at higher thread counts. Can anything be done to reduce the context switches? Is there, for example, one thread on the target per user thread that may best be pinned? Thanks, Chris > >> -----Original Message----- >> From: Chris Worley [mailto:worleys at gmail.com] >> Sent: Monday, May 18, 2009 9:21 AM >> To: OpenIB; scst-devel >> Subject: [Scst-devel] SRP aggregate bandwidth decreasing as >> threads increase >> >> >> I'm seeing peak performance at ~4 threads (1.6GB/s), w/ >> threads >10 I'm seeing aggregate performance drop >> significantly.  This is not a drive issue: locally, the >> drives get best performance >~32 threads, and maintain their >> aggregate way beyond that. >> >> Is there any tunable parameter or source code change in the >> initiator or target code that would effect performance with a >> high thread count?
>> >> Thanks, >> >> Chris From ssufficool at rov.sbcounty.gov Mon May 18 09:52:12 2009 From: ssufficool at rov.sbcounty.gov (Sufficool, Stanley) Date: Mon, 18 May 2009 09:52:12 -0700 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: Message-ID: IIRC, The SRP Target code has many context switches that throttle performance at higher thread counts. > -----Original Message----- > From: Chris Worley [mailto:worleys at gmail.com] > Sent: Monday, May 18, 2009 9:21 AM > To: OpenIB; scst-devel > Subject: [Scst-devel] SRP aggregate bandwidth decreasing as > threads increase > > > I'm seeing peak performance at ~4 threads (1.6GB/s), w/ > threads >10 I'm seeing aggregate performance drop > significantly. This is not a drive issue: locally, the > drives get best performance >~32 threads, and maintain their > aggregate way beyond that. > > Is there any tunable parameter or source code change in the > initiator or target code that would effect performance with a > high thread count? > > Thanks, > > Chris > > -------------------------------------------------------------- > ---------------- > Crystal Reports - New Free Runtime and 30 Day Trial > Check out the new simplified licensing option that enables > unlimited royalty-free distribution of the report engine > for externally facing server and web deployment. 
> http://p.sf.net/sfu/businessobjects > _______________________________________________ > Scst-devel mailing list > Scst-devel at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel > From bart.vanassche at gmail.com Mon May 18 10:22:43 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 18 May 2009 19:22:43 +0200 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley wrote: > IIRC, The SRP Target code has many context switches that throttle > performance at higher thread counts. That depends on which version of ib_srpt you are using. The ib_srpt kernel module has a parameter called "thread" that controls whether disk I/O is handled in a different thread from the one that communicates over InfiniBand (thread=1) or in the same thread (thread=0). For older versions of the ib_srpt kernel module the default was thread=1, which indeed caused a lot of context switches. On December 3, 2008 (SCST Subversion revision 594) the default was changed from thread=1 to thread=0 because the latter results in better performance. Bart. From caitlin.bestler at gmail.com Mon May 18 11:02:23 2009 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 18 May 2009 11:02:23 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com> References: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com> Message-ID: <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com> On Mon, May 18, 2009 at 9:24 AM, Jeff Squyres wrote: > On May 7, 2009, at 5:58 PM, Roland Dreier (rdreier) wrote: > >>  > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also >>  > releasing 0x2000-0x2fff. >> >> If everyone is doing this, how do you handle the case that Jason pointed >> out, namely: >> >>  * you register 0x1000 ...
0x3fff >>  * you want to register 0x2000 ... 0x2fff and have a cache hit >>  * you finish up with 0x1000 ... 0x3fff >>  * app does something (which is valid since you finished up with the >>   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free() >>   that leads to munmap() or whatever), and your hooks tell you so. >>  * app reallocates a mapping in 0x3000 ... 0x3fff >>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked >>   both invalid and in-use in the cache at this point !? >> > > > Sorry; this mail slipped by me and I just saw it now. > > If this can actually happen -- that the mapping of 0x1000 ... 0x3fff can > change even though it is still registered, then we're screwed -- we have no > way of knowing that this is now invalid (Open MPI, at least -- can't speak > for others). > > Is there a way to detect condition this in userspace? > How does 0x1000 to 0x3fff get registered as a single Memory Region? If it is legitimate to free() 0x3000..0x3fff then how can there ever be a legitimate reference to 0x1000..0x3fff? If there is no such single reference, I don't see how a Memory Region is ever created covering that range. If the user creates the Memory Region, then they are responsible for not free()ing a portion of it. Would the MPI library ever create a single large memory region based on two distinct Sends? From generationgnu at yahoo.com Mon May 18 11:04:10 2009 From: generationgnu at yahoo.com (Sam Haxor) Date: Mon, 18 May 2009 11:04:10 -0700 (PDT) Subject: [Scst-devel] [ofa-general] RE: SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: <977100.12599.qm@web111918.mail.gq1.yahoo.com> If we pin the thread to a CPU (say CPU-X), can we then pass a hint to the BE driver to process the response completion on the same CPU-X? This way the CPU cache will be utilized efficiently. I don't know how the IB driver etc. works, so I can't contribute code right now.
But this is how any sample implementation could/should look like - BE driver creates the following - QUEUES[NR_CPUS]; QUEUES { CMD QUEUE[SOME_QUEUE_DEPTH]; RSP QUEUE[SOME_QUEUE_DEPTH]; }; 1) scsi-mid down-calls BE driver, and also passes a hint aka 'thread-CPU-X'. 2) BE transmits cmd on QUEUES[thread-CPU-X]->CMD QUEUE[slot_index]; 3) The adapter(HBA/HCA) will interrupt the BE driver on the 'thread-CPU-X'. 3.1) Now it is the BE drivers responsibility to affinitize the response-draining with the corresponding CPU @ driver load time. Ciao ----- Original Message ---- > From: Bart Van Assche > To: "Sufficool, Stanley" > Cc: Chris Worley ; scst-devel ; OpenIB > Sent: Monday, May 18, 2009 1:22:43 PM > Subject: Re: [Scst-devel] [ofa-general] RE: SRP aggregate bandwidth decreasing as threads increase > > On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley > wrote: > > IIRC, The SRP Target code has many context switches that throttle > > performance at higher thread counts. > > Depends on which version of ib_srpt you are using. The ib_srpt kernel > module has a parameter called "thread" which allows to control whether > disk I/O is handled in another thread than the one that communicates > over InfiniBand (thread=1) or in the same thread (thread=0). For older > versions of the ib_srpt kernel module the default was thread=1, which > caused indeed a lot of context switches. On December 3, 2008 (SCST > Subversion revision 594) the default has been changed from thread=1 to > thread=0 because the latter results in better performance. > > Bart.
From vst at vlnb.net Mon May 18 11:23:22 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 18 May 2009 22:23:22 +0400 Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: <4A11A79A.8070204@vlnb.net> Chris Worley, on 05/18/2009 08:21 PM wrote: > I'm seeing peak performance at ~4 threads (1.6GB/s), w/ threads >10 > I'm seeing aggregate performance drop significantly. This is not a > drive issue: locally, the drives get best performance >~32 threads, > and maintain their aggregate way beyond that. > > Is there any tunable parameter or source code change in the initiator > or target code that would effect performance with a high thread count? Check README of ib_srpt from the SCST SVN trunk. > Thanks, > > Chris
From jsquyres at cisco.com Mon May 18 11:24:33 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 18 May 2009 14:24:33 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com> References: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com> <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com> Message-ID: <167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com> On May 18, 2009, at 2:02 PM, Caitlin Bestler wrote: > >> > Specifically: the actual dereg of 0x1000-0x3fff is blocked on > also > >> > releasing 0x2000-0x2fff. > >> > >> If everyone is doing this, how do you handle the case that Jason > pointed > >> out, namely: > >> > >> * you register 0x1000 ... 0x3fff > >> * you want to register 0x2000 ... 0x2fff and have a cache hit > >> * you finish up with 0x1000 ... 0x3fff > >> * app does something (which is valid since you finished up with > the > >> bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg > free() > >> that leads to munmap() or whatever), and your hooks tell you so. > >> * app reallocates a mapping in 0x3000 ... 0x3fff > >> * you want to re-register 0x1000 ... 0x3fff -- but it has to be > marked > >> both invalid and in-use in the cache at this point !? > I think I mis-parsed the above scenario in my previous response. When our memory hooks tell us that memory is about to be removed from the process, we unregister all pages in the relevant region and remove those entries from the cache. So the next time you look in the cache for 0x3000-0x3fff, it won't be there -- it'll be treated as cache-cold. > How does 0x1000 to 0x3fff get registered as a single Memory Region?
> If it is legitimate to free() 0x3000..0x3fff then how can there ever > be a > legitimate reference to 0x1000..0x3fff? If there is no such single > reference, > I don't see how a Memory Region is every created covering that range. > > If the user creates the Memory Region, then they are responsible for > not > free()ing a portion of it. > Agreed. If an application does that, it deserves what it gets. > Would the MPI library ever create a single large memory region based > on > two distinct Sends? > Per my prior mail, Open MPI registers chunks at a time. Each chunk is potentially a multiple of pages. So yes, you could end up having a single registration that spans the buffers used in multiple, distinct MPI sends. We reference count by page to ensure that deregistrations do not occur prematurely. For example, if page X contains the end of one large buffer and the beginning of another, both of which are being used in ongoing non-blocking MPI communications, then page X's entry in our cache will have a refcount == 2. OMPI won't allow the registration containing that page to become eligible for deregistering until the cache entry's refcount goes down to 0. See my prior mail for a more complex example of our cache's behavior. -- Jeff Squyres Cisco Systems From worleys at gmail.com Mon May 18 11:40:58 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 12:40:58 -0600 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 11:22 AM, Bart Van Assche wrote: > On Mon, May 18, 2009 at 6:52 PM, Sufficool, Stanley > wrote: >> IIRC, The SRP Target code has many context switches that throttle >> performance at higher thread counts. > > Depends on which version of ib_srpt you are using.
The ib_srpt kernel > module has a parameter called "thread" which allows to control whether > disk I/O is handled in another thread than the one that communicates > over InfiniBand (thread=1) or in the same thread (thread=0). For older > versions of the ib_srpt kernel module the default was thread=1, which > caused indeed a lot of context switches. On December 3, 2008 (SCST > Subversion revision 594) the default has been changed from thread=1 to > thread=0 because the latter results in better performance. I won't have access to the targets until tomorrow (at which point I may not have internet access), so I'm trying to gather a few possible solutions today. I'm using a very recent version of the SCST target code, it would only be ~1 month old. So, I'm guessing I have the "thread=0" code. Maybe, for a high thread count, this needs to be "=1"? Is there a way to control the number of threads once "thread=1" is set? Does it spawn one thread per initiator thread? Any other ideas of things to try? Thanks, Chris > > Bart. > From worleys at gmail.com Mon May 18 11:56:27 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 12:56:27 -0600 Subject: [ofa-general] Re: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: <4A11A79A.8070204@vlnb.net> References: <4A11A79A.8070204@vlnb.net> Message-ID: On Mon, May 18, 2009 at 12:23 PM, Vladislav Bolkhovitin wrote: > Chris Worley, on 05/18/2009 08:21 PM wrote: >> >> I'm seeing peak performance at ~4 threads (1.6GB/s), w/ threads >10 >> I'm seeing aggregate performance drop significantly.  This is not a >> drive issue: locally, the drives get best performance >~32 threads, >> and maintain their aggregate way beyond that. >> >> Is there any tunable parameter or source code change in the initiator >> or target code that would effect performance with a high thread count? > > Check README of ib_srpt from the SCST SVN trunk. There are 42 README's in scst. 
Do you mean the one in scst/trunk/srpt which talks of three performance issues: 1) Minimizing QUEUEFULL conditions. 2) Setting IRQ affinity on the drives. 3) Setting "thread=1". ? Chris From arlin.r.davis at intel.com Mon May 18 12:07:30 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 18 May 2009 12:07:30 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages Message-ID: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> getpid provides process ids which are not unique. Use unique thread id's in debug messages to help isolate issues across many device opens with multiple CM threads. Signed-off-by: Arlin Davis --- dapl/common/dapl_debug.c | 2 +- dapl/udapl/linux/dapl_osd.h | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c index ba33cfc..20ee405 100644 --- a/dapl/common/dapl_debug.c +++ b/dapl/common/dapl_debug.c @@ -49,7 +49,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...) 
if (type & g_dapl_dbg_type) { if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) { va_start(args, fmt); - fprintf(stdout, "%s:%d: ", _ptr_host_, + fprintf(stdout, "%s:%lx: ", _ptr_host_, dapl_os_getpid()); dapl_os_vprintf(fmt, args); va_end(args); diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index 1c098c5..0378a70 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) -#define dapl_os_getpid getpid - +#define dapl_os_getpid (long int)pthread_self #endif /* _DAPL_OSD_H_ */ -- 1.5.2.5 From arlin.r.davis at intel.com Mon May 18 12:07:38 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 18 May 2009 12:07:38 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) dtest: add connection timers on client side Message-ID: Add timers for active connections and print results. Allow polling or wait on conn event. 
Signed-off-by: Arlin Davis --- test/dtest/dtest.c | 34 ++++++++++++++++++++++++---------- 1 files changed, 24 insertions(+), 10 deletions(-) diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 6ff7798..f1f0f2b 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -183,6 +183,7 @@ struct dt_time { double rdma_rd_total; double rtt; double close; + double conn; }; struct dt_time time; @@ -197,6 +198,7 @@ static int verbose = 0; static int polling = 0; static int poll_count = 0; static int rdma_wr_poll_count = 0; +static int conn_poll_count = 0; static int rdma_rd_poll_count[MAX_RDMA_RD] = { 0 }; static int delay = 0; static int buf_len = RDMA_BUFFER_SIZE; @@ -617,6 +619,9 @@ complete: } printf("%d: EP create: %10.2lf usec\n", getpid(), time.epc); printf("%d: EP free: %10.2lf usec\n", getpid(), time.epf); + if (!server) + printf("%d: connect: %10.2lf usec, poll_cnt=%d\n", + getpid(), time.conn, conn_poll_count); printf("%d: TOTAL: %10.2lf usec\n", getpid(), time.total); #if defined(_WIN32) || defined(_WIN64) @@ -843,6 +848,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) /* setup receive rdma buffer to initial string to be overwritten */ strcpy((char *)rbuf, "blah, blah, blah\n"); + /* clear event structure */ + memset(&event, 0, sizeof(DAT_EVENT)); + if (server) { /* SERVER */ /* create the service point for server listen */ @@ -962,6 +970,7 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) pdata[i] = i + 1; LOGPRINTF("%d Connecting to server\n", getpid()); + start = get_time(); ret = dat_ep_connect(h_ep, &remote_addr, conn_id, @@ -979,14 +988,18 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) printf("%d Waiting for connect response\n", getpid()); - ret = dat_evd_wait(h_conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore); - if (ret != DAT_SUCCESS) { - fprintf(stderr, "%d Error dat_evd_wait: %s\n", - getpid(), DT_RetToString(ret)); - return (ret); - } else - LOGPRINTF("%d dat_evd_wait for h_conn_evd 
completed\n", - getpid()); + if (polling) + while (DAT_GET_TYPE(dat_evd_dequeue(h_conn_evd, &event)) == + DAT_QUEUE_EMPTY) + conn_poll_count++; + else + ret = dat_evd_wait(h_conn_evd, DAT_TIMEOUT_INFINITE, + 1, &event, &nmore); + + if (!server) { + stop = get_time(); + time.conn += ((stop - start) * 1.0e6); + } #ifdef TEST_REJECT_WITH_PRIVATE_DATA if (event.event_number != DAT_CONNECTION_EVENT_PEER_REJECTED) { @@ -1012,8 +1025,9 @@ DAT_RETURN connect_ep(char *hostname, DAT_CONN_QUAL conn_id) #endif if (event.event_number != DAT_CONNECTION_EVENT_ESTABLISHED) { - fprintf(stderr, "%d Error unexpected conn event : %s\n", - getpid(), DT_EventToSTr(event.event_number)); + fprintf(stderr, "%d Error unexpected conn event : 0x%x %s\n", + getpid(), event.event_number, + DT_EventToSTr(event.event_number)); return (DAT_ABORT); } -- 1.5.2.5 From arlin.r.davis at intel.com Mon May 18 12:08:24 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 18 May 2009 12:08:24 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) scm: multi-hca CM processing broken. Need cr thread wakeup mechanism per HCA. Message-ID: Currently there is only one pipe across all device opens. This results in some posted CR work getting delayed or not processed at all. Provide pipe for each device open and cr thread created and manage on a per device level. 
Signed-off-by: Arlin Davis --- dapl/openib_scm/dapl_ib_cm.c | 23 ++++++------ dapl/openib_scm/dapl_ib_util.c | 74 +++++++++++++++++++++++++--------------- dapl/openib_scm/dapl_ib_util.h | 1 + 3 files changed, 58 insertions(+), 40 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index a2b02eb..9cad5be 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -54,8 +54,6 @@ #include "dapl_ib_util.h" #include "dapl_osd.h" -extern DAPL_SOCKET g_scm[2]; - #if defined(_WIN32) || defined(_WIN64) enum DAPL_FD_EVENTS { DAPL_FD_READ = 0x1, @@ -282,7 +280,7 @@ static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->lock); /* wakeup work thread */ - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cm_destroy: thread wakeup error = %s\n", strerror(errno)); @@ -299,7 +297,7 @@ static void dapli_cm_queue(struct ib_cm_handle *cm_ptr) dapl_os_unlock(&cm_ptr->hca->ib_trans.lock); /* wakeup CM work thread */ - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cm_queue: thread wakeup error = %s\n", strerror(errno)); @@ -1210,7 +1208,8 @@ dapls_ib_remove_conn_listener(IN DAPL_IA * ia_ptr, IN DAPL_SP * sp_ptr) /* cr_thread will free */ cm_ptr->state = SCM_DESTROY; sp_ptr->cm_srvc_handle = NULL; - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(cm_ptr->hca->ib_trans.scm[1], + "w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cm_destroy: thread wakeup error = %s\n", strerror(errno)); @@ -1312,7 +1311,7 @@ dapls_ib_reject_connection(IN dp_ib_cm_handle_t cm_ptr, /* cr_thread will destroy CR */ cm_ptr->state = SCM_REJECTED; - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(cm_ptr->hca->ib_trans.scm[1], "w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cm_destroy: thread wakeup error = %s\n", 
strerror(errno)); @@ -1552,7 +1551,7 @@ void cr_thread(void *arg) while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { dapl_fd_zero(set); - dapl_fd_set(g_scm[0], set, DAPL_FD_READ); + dapl_fd_set(hca_ptr->ib_trans.scm[0], set, DAPL_FD_READ); if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) next_cr = dapl_llist_peek_head(&hca_ptr->ib_trans.list); @@ -1652,9 +1651,8 @@ void cr_thread(void *arg) &cr->dst.ia_address)-> sin_addr)); - /* POLLUP, NVAL, or poll error, issue event if connected */ - if (cr->state == SCM_CONNECTED) - dapli_socket_disconnect(cr); + /* POLLUP, NVAL, or poll error. - DISC */ + dapli_socket_disconnect(cr); } dapl_os_lock(&hca_ptr->ib_trans.lock); @@ -1664,8 +1662,9 @@ void cr_thread(void *arg) dapl_select(set); /* if pipe used to wakeup, consume */ - while (dapl_poll(g_scm[0], DAPL_FD_READ) == DAPL_FD_READ) { - if (recv(g_scm[0], rbuf, 2, 0) == -1) + while (dapl_poll(hca_ptr->ib_trans.scm[0], + DAPL_FD_READ) == DAPL_FD_READ) { + if (recv(hca_ptr->ib_trans.scm[0], rbuf, 2, 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cr_thread: read pipe error = %s\n", strerror(errno)); diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index c95b0c2..30c71fa 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -58,7 +58,6 @@ static const char rcsid[] = "$Id: $"; #include int g_dapl_loopback_connection = 0; -DAPL_SOCKET g_scm[2]; enum ibv_mtu dapl_ib_mtu(int mtu) { @@ -138,22 +137,7 @@ static DAT_RETURN getlocalipaddr(DAT_SOCK_ADDR * addr, int addr_len) return ret; } -/* - * dapls_ib_init, dapls_ib_release - * - * Initialize Verb related items for device open - * - * Input: - * none - * - * Output: - * none - * - * Returns: - * 0 success, -1 error - * - */ -int32_t dapls_ib_init(void) +static int32_t create_cr_pipe(IN DAPL_HCA * hca_ptr) { DAPL_SOCKET listen_socket; struct sockaddr_in addr; @@ -179,32 +163,58 @@ int32_t dapls_ib_init(void) if (ret) goto err1; - g_scm[1] = socket(AF_INET, SOCK_STREAM, 
IPPROTO_TCP); - if (g_scm[1] == DAPL_INVALID_SOCKET) + hca_ptr->ib_trans.scm[1] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (hca_ptr->ib_trans.scm[1] == DAPL_INVALID_SOCKET) goto err1; - ret = connect(g_scm[1], (struct sockaddr *)&addr, sizeof(addr)); + ret = connect(hca_ptr->ib_trans.scm[1], + (struct sockaddr *)&addr, sizeof(addr)); if (ret) goto err2; - g_scm[0] = accept(listen_socket, NULL, NULL); - if (g_scm[0] == DAPL_INVALID_SOCKET) + hca_ptr->ib_trans.scm[0] = accept(listen_socket, NULL, NULL); + if (hca_ptr->ib_trans.scm[0] == DAPL_INVALID_SOCKET) goto err2; closesocket(listen_socket); return 0; err2: - closesocket(g_scm[1]); + closesocket(hca_ptr->ib_trans.scm[1]); err1: closesocket(listen_socket); return 1; } +static void destroy_cr_pipe(IN DAPL_HCA * hca_ptr) +{ + closesocket(hca_ptr->ib_trans.scm[0]); + closesocket(hca_ptr->ib_trans.scm[1]); +} + + +/* + * dapls_ib_init, dapls_ib_release + * + * Initialize Verb related items for device open + * + * Input: + * none + * + * Output: + * none + * + * Returns: + * 0 success, -1 error + * + */ +int32_t dapls_ib_init(void) +{ + return 0; +} + int32_t dapls_ib_release(void) { - closesocket(g_scm[0]); - closesocket(g_scm[1]); return 0; } @@ -382,6 +392,14 @@ DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA * hca_ptr) /* initialize CM list for listens on this HCA */ dapl_llist_init_head(&hca_ptr->ib_trans.list); + /* initialize pipe, user level wakeup on select */ + if (create_cr_pipe(hca_ptr)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: failed to init cr pipe - %s\n", + strerror(errno)); + goto bail; + } + /* create thread to process inbound connect request */ hca_ptr->ib_trans.cr_state = IB_THREAD_INIT; dat_status = dapl_os_thread_create(cr_thread, @@ -455,21 +473,21 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA * hca_ptr) /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(hca_ptr->ib_trans.scm[1], 
"w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_UTIL, " thread_destroy: thread wakeup err = %s\n", strerror(errno)); while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); - if (send(g_scm[1], "w", sizeof "w", 0) == -1) + if (send(hca_ptr->ib_trans.scm[1], "w", sizeof "w", 0) == -1) dapl_log(DAPL_DBG_TYPE_UTIL, " thread_destroy: thread wakeup err = %s\n", strerror(errno)); dapl_os_sleep_usec(2000); } dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); - + destroy_cr_pipe(hca_ptr); /* no longer need pipe */ return (DAT_SUCCESS); } diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 5493312..e924572 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -304,6 +304,7 @@ typedef struct _ib_hca_transport uint8_t tclass; uint8_t mtu; DAT_NAMED_ATTR named_attr; + DAPL_SOCKET scm[2]; } ib_hca_transport_t; /* provider specfic fields for shared memory support */ -- 1.5.2.5 From arlin.r.davis at intel.com Mon May 18 12:08:41 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 18 May 2009 12:08:41 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) windows: add build files for openib_scm, remove /Wp64 build option. Message-ID: Add build files for windows socket cm and change build option on windows providers. 
The new Win7 WDK issues a deprecated compiler option warning for /Wp64 (Enable 64-bit porting warnings). Signed-off-by: Arlin Davis --- dapl/openib_cma/SOURCES | 2 +- dapl/openib_scm/SOURCES | 53 ++++++++++++++++++++++++++++++++++++++++++++++ dapl/openib_scm/udapl.rc | 48 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 102 insertions(+), 1 deletions(-) create mode 100644 dapl/openib_scm/SOURCES create mode 100644 dapl/openib_scm/udapl.rc diff --git a/dapl/openib_cma/SOURCES b/dapl/openib_cma/SOURCES index 29e836e..e59ef35 100644 --- a/dapl/openib_cma/SOURCES +++ b/dapl/openib_cma/SOURCES @@ -53,4 +53,4 @@ TARGETLIBS= \ $(TARGETPATH)\*\librdmacmd.lib !endif -MSC_WARNING_LEVEL = /W1 /wd4113 /Wp64 +MSC_WARNING_LEVEL = /W1 /wd4113 diff --git a/dapl/openib_scm/SOURCES b/dapl/openib_scm/SOURCES new file mode 100644 index 0000000..f9204d9 --- /dev/null +++ b/dapl/openib_scm/SOURCES @@ -0,0 +1,53 @@ +!if $(FREEBUILD) +TARGETNAME=dapl2-ofa-scm +!else +TARGETNAME=dapl2-ofa-scmd +!endif + +TARGETPATH = ..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) +TARGETTYPE = DYNLINK +DLLENTRY = _DllMainCRTStartup + +!if $(_NT_TOOLS_VERSION) == 0x700 +DLLDEF=$O\udapl_ofa_scm_exports.def +!else +DLLDEF=$(OBJ_PATH)\$O\udapl_ofa_scm_exports.def +!endif + +USE_MSVCRT = 1 + +SOURCES = \ + udapl.rc \ + ..\dapl_common_src.c \ + ..\dapl_udapl_src.c \ + dapl_ib_cq.c \ + dapl_ib_extensions.c \ + dapl_ib_mem.c \ + dapl_ib_qp.c \ + dapl_ib_util.c \ + dapl_ib_cm.c + +INCLUDES = ..\include;..\common;windows;..\..\dat\include;\ + ..\..\dat\udat\windows;..\udapl\windows;\ + ..\..\..\..\inc;..\..\..\..\inc\user;..\..\..\libibverbs\include + +DAPL_OPTS = -DEXPORT_DAPL_SYMBOLS -DDAT_EXTENSIONS -DSOCK_CM -DOPENIB -DCQ_WAIT_OBJECT + +USER_C_FLAGS = $(USER_C_FLAGS) $(DAPL_OPTS) + +!if !$(FREEBUILD) +USER_C_FLAGS = $(USER_C_FLAGS) -DDAPL_DBG +!endif + +TARGETLIBS= \ + $(SDK_LIB_PATH)\kernel32.lib \ + $(SDK_LIB_PATH)\ws2_32.lib \ +!if $(FREEBUILD) + $(TARGETPATH)\*\dat2.lib \ + $(TARGETPATH)\*\libibverbs.lib 
+!else + $(TARGETPATH)\*\dat2d.lib \ + $(TARGETPATH)\*\libibverbsd.lib +!endif + +MSC_WARNING_LEVEL = /W1 /wd4113 diff --git a/dapl/openib_scm/udapl.rc b/dapl/openib_scm/udapl.rc new file mode 100644 index 0000000..8550256 --- /dev/null +++ b/dapl/openib_scm/udapl.rc @@ -0,0 +1,48 @@ +/* + * Copyright (c) 2007, 2009 Intel Corporation. All rights reserved. + * + * This software is available to you under the OpenIB.org BSD license + * below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + + +#include + +#define VER_FILETYPE VFT_DLL +#define VER_FILESUBTYPE VFT2_UNKNOWN + +#if DBG +#define VER_FILEDESCRIPTION_STR "Direct Access Provider Library v2.0 (OFA socket-cm) (Debug)" +#define VER_INTERNALNAME_STR "dapl2-ofa-scmd.dll" +#define VER_ORIGINALFILENAME_STR "dapl2-ofa-scmd.dll" +#else +#define VER_FILEDESCRIPTION_STR "Direct Access Provider Library v2.0 (OFA socket-cm)" +#define VER_INTERNALNAME_STR "dapl2-ofa-scm.dll" +#define VER_ORIGINALFILENAME_STR "dapl2-ofa-scm.dll" +#endif + +#include -- 1.5.2.5 From sean.hefty at intel.com Mon May 18 12:15:57 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 18 May 2009 12:15:57 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages In-Reply-To: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> Message-ID: please copy the ofw mail list on dapl changes >diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h >index 1c098c5..0378a70 100644 >--- a/dapl/udapl/linux/dapl_osd.h >+++ b/dapl/udapl/linux/dapl_osd.h >@@ -572,8 +572,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) > #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) > #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) > >-#define dapl_os_getpid getpid >- >+#define dapl_os_getpid (long int)pthread_self Maybe add a new call, dapl_os_get_thread_id or something similar, to avoid confusion with the name and what the call returns. From bart.vanassche at gmail.com Mon May 18 12:18:22 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 18 May 2009 21:18:22 +0200 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 8:40 PM, Chris Worley wrote: > Any other ideas of things to try? Depends on the workload that is running on the initiators. 
Are the initiators performing linear I/O or block I/O ? Which I/O scheduler is being used by the initiator systems, and how has it been configured ? Which I/O scheduler has been configured on the target, and with which parameters ? As you probably know, you can find these parameters under /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or scst_vdisk ? And what is the kernel version of the target system ? By the way, an important I/O performance regression has been fixed in kernel 2.6.29 (see also http://lwn.net/Articles/325307/). Bart. From worleys at gmail.com Mon May 18 12:41:52 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 13:41:52 -0600 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche wrote: > On Mon, May 18, 2009 at 8:40 PM, Chris Worley wrote: >> Any other ideas of things to try? > > Depends on the workload that is running on the initiators. Are the > initiators performing linear I/O or block I/O ? I'm not sure what "linear I/O" is. It is block I/O at 56KB chunks; using direct I/O. > Which I/O scheduler is > being used by the initiator systems, and how has it been configured ? The noop scheduler is being used on the targets and initiators. All the standard schedulers performed worse. > Which I/O scheduler has been configured on the target, and with which > parameters ? As you probably know, you can find these parameters under > /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or > scst_vdisk ? scst_vdisk. > And what is the kernel version of the target system ? Ubuntu 8.10 with a 2.6.27 kernel, if my memory serves me correctly. > By > the way, an important I/O performance regression has been fixed in > kernel 2.6.29 (see also http://lwn.net/Articles/325307/). Thanks, I'll try that. Chris > > Bart. 
> From arlin.r.davis at intel.com Mon May 18 12:47:56 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 18 May 2009 12:47:56 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages In-Reply-To: References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> Message-ID: >please copy the ofw mail list on dapl changes ok >Maybe add a new call, dapl_os_get_thread_id or something >similar, to avoid >confusion with the name and what the call returns. What about this... diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c index 20ee405..6c6eeb5 100644 --- a/dapl/common/dapl_debug.c +++ b/dapl/common/dapl_debug.c @@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...) if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) { va_start(args, fmt); fprintf(stdout, "%s:%lx: ", _ptr_host_, - dapl_os_getpid()); + dapl_os_gettid()); dapl_os_vprintf(fmt, args); va_end(args); } diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index 0378a70..e0e30bf 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) -#define dapl_os_getpid (long int)pthread_self +#define dapl_os_getpid (int)getpid +#define dapl_os_gettid (long int)pthread_self #endif /* _DAPL_OSD_H_ */ From bart.vanassche at gmail.com Mon May 18 12:55:39 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 18 May 2009 21:55:39 +0200 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 9:41 PM, Chris Worley wrote: > On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche > wrote: >> Which I/O scheduler has been configured on the target, and with which >> parameters ? 
As you probably know, you can find these parameters under >> /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or >> scst_vdisk ? > > scst_vdisk. The default number of kernel threads for scst_vdisk is five (kernel module parameter num_threads). It might be interesting to experiment with this parameter. Bart. From sean.hefty at intel.com Mon May 18 13:14:19 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 18 May 2009 13:14:19 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages In-Reply-To: References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> Message-ID: <5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com> >diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c >index 20ee405..6c6eeb5 100644 >--- a/dapl/common/dapl_debug.c >+++ b/dapl/common/dapl_debug.c >@@ -50,7 +50,7 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char >*fmt, ...) > if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) { > va_start(args, fmt); > fprintf(stdout, "%s:%lx: ", _ptr_host_, >- dapl_os_getpid()); >+ dapl_os_gettid()); > dapl_os_vprintf(fmt, args); > va_end(args); > } >diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h >index 0378a70..e0e30bf 100644 >--- a/dapl/udapl/linux/dapl_osd.h >+++ b/dapl/udapl/linux/dapl_osd.h >@@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) > #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) > #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) > >-#define dapl_os_getpid (long int)pthread_self >+#define dapl_os_getpid (int)getpid >+#define dapl_os_gettid (long int)pthread_self That's fine - what about Windows? 
:) From worleys at gmail.com Mon May 18 13:46:55 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 18 May 2009 14:46:55 -0600 Subject: [ofa-general] RE: [Scst-devel] SRP aggregate bandwidth decreasing as threads increase In-Reply-To: References: Message-ID: On Mon, May 18, 2009 at 1:55 PM, Bart Van Assche wrote: > On Mon, May 18, 2009 at 9:41 PM, Chris Worley wrote: >> On Mon, May 18, 2009 at 1:18 PM, Bart Van Assche >> wrote: >>> Which I/O scheduler has been configured on the target, and with which >>> parameters ? As you probably know, you can find these parameters under >>> /sys/class/block/sda/queue/{*,*/*}. Are you using scst_disk or >>> scst_vdisk ? >> >> scst_vdisk. > > The default number of kernel threads for scst_vdisk is five (kernel > module parameter num_threads). It might be interesting to experiment > with this parameter. Thanks! I'll try that too. Chris From rdreier at cisco.com Mon May 18 14:15:11 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 18 May 2009 14:15:11 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com> (Jeff Squyres's message of "Mon, 18 May 2009 14:24:33 -0400") References: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com> <469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com> <167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com> Message-ID: > When our memory hooks tell us that memory is about to be removed from > the process, we unregister all pages in the relevant region and remove > those entries from the cache. So the next time you look in the cache > for 0x3000-0x3fff, it won't be there -- it'll be treated as > cache-cold. So you want the registration cache to be reference counted per-page? Seems like potentially a lot of overhead -- if someone registers a million pages, then to check for a cache hit, you have to potentially check millions of reference counts. > > How does 0x1000 to 0x3fff get registered as a single Memory Region? 
> > If it is legitimate to free() 0x3000..0x3fff then how can there ever > > be a > > legitimate reference to 0x1000..0x3fff? If there is no such single > > reference, > > I don't see how a Memory Region is ever created covering that range. > > > > If the user creates the Memory Region, then they are responsible for > > not > > free()ing a portion of it. > > > > Agreed. If an application does that, it deserves what it gets. Hang on. The whole point of MR caching is exactly that you don't unregister a memory region, even after you're done using the memory it covers, in the hope that you'll want to reuse that registration. And the whole point of this thread is that an application can then free() some of the memory that is still registered in the cache. > Per my prior mail, Open MPI registers chunks at a time. Each chunk is > potentially a multiple of pages. So yes, you could end up having a > single registration that spans the buffers used in multiple, distinct > MPI sends. We reference count by page to ensure that deregistrations > do not occur prematurely. Hmm, I'm worried that the exact semantics of the memory cache seem to be tied into how the MPI implementation is registering memory. Open MPI happens to work in small chunks (I guess) and so your cache is tailored for that use case. I know the original proposal was an attempt to come up with something that all the MPIs can agree on, but it didn't cover the full semantics, at least not for cases like the overlapping sub-registrations that we're discussing here. Is there still one set of semantics everyone can agree on? - R. 
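[Illustration, not part of the original message.] The per-page reference counting discussed in this thread can be sketched roughly as below. This is a minimal userspace sketch under stated assumptions, not Open MPI's or libibverbs' code: the names cache_acquire, cache_release and cache_invalidate are hypothetical, pages are assumed to be 4 KiB, the cache is a small flat table scanned linearly (chosen for brevity, not speed), and no real verbs registration is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12   /* assumption: 4 KiB pages */
#define CACHE_SLOTS 64   /* assumption: small fixed table, scanned linearly */

/* One cached page; all names here are hypothetical, for illustration only. */
struct page_entry {
	uintptr_t page;  /* page number, i.e. address >> PAGE_SHIFT */
	unsigned refs;   /* uses currently outstanding on this page */
	int valid;       /* nonzero while the page counts as registered */
};

static struct page_entry cache[CACHE_SLOTS];

static struct page_entry *lookup(uintptr_t page)
{
	size_t i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].valid && cache[i].page == page)
			return &cache[i];
	return NULL;
}

static struct page_entry *alloc_slot(uintptr_t page)
{
	size_t i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (!cache[i].valid) {
			cache[i].page = page;
			cache[i].refs = 0;
			cache[i].valid = 1;
			return &cache[i];
		}
	return NULL;
}

/*
 * Take one reference on every page covering [addr, addr + len), len > 0.
 * Returns the number of pages that were cache-cold (a real cache would
 * register those), or -1 if the table is full.
 */
int cache_acquire(uintptr_t addr, size_t len)
{
	uintptr_t first = addr >> PAGE_SHIFT;
	uintptr_t last = (addr + len - 1) >> PAGE_SHIFT;
	uintptr_t p;
	int misses = 0;

	for (p = first; p <= last; p++) {
		struct page_entry *e = lookup(p);

		if (!e) {
			e = alloc_slot(p);
			if (!e) {
				/* undo the references taken so far */
				while (p-- > first)
					lookup(p)->refs--;
				return -1;
			}
			misses++;
		}
		e->refs++;
	}
	return misses;
}

/* Drop one reference per page; entries stay cached (lazy deregistration). */
void cache_release(uintptr_t addr, size_t len)
{
	uintptr_t last = (addr + len - 1) >> PAGE_SHIFT;
	uintptr_t p;

	for (p = addr >> PAGE_SHIFT; p <= last; p++) {
		struct page_entry *e = lookup(p);

		if (e && e->refs > 0)
			e->refs--;
	}
}

/*
 * Called from a (hypothetical) memory hook when [addr, addr + len) is
 * about to leave the process: forget those pages so that the next lookup
 * is cache-cold. Returns the number of entries dropped.
 */
int cache_invalidate(uintptr_t addr, size_t len)
{
	uintptr_t last = (addr + len - 1) >> PAGE_SHIFT;
	uintptr_t p;
	int dropped = 0;

	for (p = addr >> PAGE_SHIFT; p <= last; p++) {
		struct page_entry *e = lookup(p);

		if (e) {
			e->valid = 0;
			dropped++;
		}
	}
	return dropped;
}
```

With the addresses used in the thread: acquiring 0x1000..0x3fff touches three pages, all cold the first time and all hits after a release. If the hook then invalidates 0x3000..0x3fff, a later acquire of the same range sees only that one page as cold. The sketch does not address the open question above, namely what should happen when a page with outstanding references is invalidated.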
From arlin.r.davis at intel.com Mon May 18 14:33:08 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 18 May 2009 14:33:08 -0700 Subject: [ofa-general] [PATCH] uDAPL (v2.0) linux_osd: use pthread_self instead of getpid for debug messages In-Reply-To: <5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com> References: <1CC1B86C7C264FBD8AF0ED57A2F9BF4D@amr.corp.intel.com> <5B78B104FBFB4D7789FE9B9F7F476F51@amr.corp.intel.com> Message-ID: > >That's fine - what about Windows? :) > > Yes, I do windows. Please verify following patch: diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c index 20ee405..6723217 100644 --- a/dapl/common/dapl_debug.c +++ b/dapl/common/dapl_debug.c @@ -49,8 +49,8 @@ void dapl_internal_dbg_log(DAPL_DBG_TYPE type, const char *fmt, ...) if (type & g_dapl_dbg_type) { if (DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest) { va_start(args, fmt); - fprintf(stdout, "%s:%lx: ", _ptr_host_, - dapl_os_getpid()); + fprintf(stdout, "%s:%x: ", _ptr_host_, + dapl_os_gettid()); dapl_os_vprintf(fmt, args); va_end(args); } diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index 0378a70..cb61cae 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -572,7 +572,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) -#define dapl_os_getpid (long int)pthread_self +#define dapl_os_getpid (DAT_UINT32)getpid +#define dapl_os_gettid (DAT_UINT32)pthread_self #endif /* _DAPL_OSD_H_ */ diff --git a/dapl/udapl/windows/dapl_osd.h b/dapl/udapl/windows/dapl_osd.h index cdfeb24..416a24b 100644 --- a/dapl/udapl/windows/dapl_osd.h +++ b/dapl/udapl/windows/dapl_osd.h @@ -501,11 +501,8 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) return strtol(nptr, endptr, base); } -STATIC __inline int -dapl_os_getpid(void) -{ - return (int)GetCurrentProcessId(); -} +#define dapl_os_getpid 
(DAT_UINT32)GetCurrentProcessId() +#define dapl_os_gettid (DAT_UINT32)GetCurrentThreadId() /* * Debug Helper Functions From abenjamin at sgi.com Mon May 18 16:44:25 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Mon, 18 May 2009 16:44:25 -0700 Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: References: <4A0B560B.3090606@sgi.com> Message-ID: <4A11F2D9.6080107@sgi.com> Roland Dreier wrote: > why is the address you want at the position IB_DEVICE_NAME_MAX instead > of at index 0? It should be 0. Thanks for pointing that out. > In general I don't like seeing strcpy()/strcat() instead of > strlcpy()/strlcat(). > > > - R. > I'll modify the code to use snprintf(). Thank you for your help. Regards, Benjamin From Zhen.Liang at Sun.COM Mon May 18 21:37:32 2009 From: Zhen.Liang at Sun.COM (Liang Zhen) Date: Tue, 19 May 2009 12:37:32 +0800 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes In-Reply-To: <1233654242.1364.39.camel@pyren.uio.no> References: <1233654242.1364.39.camel@pyren.uio.no> Message-ID: <4A12378C.8030101@sun.com> Hi Ole, Have you got a solution for this? I think we got exactly the same problem on 4600 with ofed-1.4.1-rc4: lspci output: 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0) and error messages from dmesg: mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008) mlx4_core: Initializing 0000:03:00.0 mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1, reducing to 1. mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1 mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5) mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting. mlx4_core: probe of 0000:03:00.0 failed with error -5 Thanks Liang Ole Widar Saastad wrote: > I have problems using the OFED 1.4 software on the Sun x4600 nodes. > Need help to get this to work. We plan to run GPFS over IB on these > nodes in addition to MPI. 
> > Sun 4600 nodes with 8 quad core cpus, > Quad-Core AMD Opteron(tm) Processor 8380 > > OS is Rocks release 4. > centos-release-4-4.2/x86_64/ > > Linux compute-0-0.local 2.6.9-67.0.15.ELlargesmp #1 SMP Thu May 8 > 11:03:57 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux > > > Needless to say our 300+ nodes (SUN x2200 with quad core) runs fine with > OFED 1.4 (and 1.3), they have the almost the same kernel : > Linux compute-4-0.local 2.6.9-67.0.15.ELsmp #1 SMP Thu May 8 10:50:20 > EDT 2008 x86_64 x86_64 x86_64 GNU/Linux > Same except ELsmp and not ELlargesmp. > > More information: > > dmesg prints out the following error message : > > Losing some ticks... checking if CPU frequency changed. > modulecmd[17499]: segfault at 0000007fc0b01688 rip 000000000060aa38 rsp 0000007fbfffcfd8 error 6 > mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008) > mlx4_core: Initializing 0000:02:00.0 > ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 19 (level, low) -> IRQ 193 > PCI: Setting latency timer of device 0000:02:00.0 to 64 > mlx4_core 0000:02:00.0: Requested number of MACs is too much for port 1, reducing to 1. > MSI INIT SUCCESS > mlx4_core 0000:02:00.0: command 0x13 failed: fw status = 0x1 > mlx4_core 0000:02:00.0: SW2HW_EQ failed (-5) > mlx4_core 0000:02:00.0: Failed to initialize event queue table, aborting. 
> mlx4_core: probe of 0000:02:00.0 failed with error -5 > > The following software is installed: > > Select Option [1-5]:3 > kernel-ib > libibverbs > libibverbs-devel > libibverbs-utils > libmthca > libmlx4 > libcxgb3 > libnes > libipathverbs > libibcommon > libibcommon-devel > libibumad > libibumad-devel > ofed-docs > ofed-scripts > ibvexdmtools > qlgc_vnic_daemon > > > Just to be sure the card is present : > lspci returns : > 02:00.0 InfiniBand: Mellanox Technologies: Unknown device 634a (rev a0) > > > From dorfman.eli at gmail.com Tue May 19 01:56:36 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 19 May 2009 11:56:36 +0300 Subject: [ofa-general] Re: [PATCH ] opensm: MFT tables are not set after non full member re-join In-Reply-To: <4A1019F6.5060900@gmail.com> References: <4A1019F6.5060900@gmail.com> Message-ID: <4A127444.8080707@gmail.com> Eli Dorfman (Voltaire) wrote: > MFT tables are not set after non full member re-join > > In case of non full member re-join MFT tables are not set. > No need to set or check non full member reference to mlid (port->mcm_list). > This list should be used only for full members for cleanup when port goes down. > > A simple scenarion to reproduce this: > 1. Full member creates group > 2. Non-member join - MFT sent > 3. Full member leave > a. group is deleted but non member port has still reference to the MLID > 4. Full member re-creates the group > 5. 
Non member re-joins - MFT *NOT* sent to switches > > Signed-off-by: Eli Dorfman > --- > opensm/include/opensm/osm_sm.h | 3 ++- > opensm/opensm/osm_sa_mcmember_record.c | 6 +++--- > opensm/opensm/osm_sm.c | 22 +++++++++++++++++++++- > 3 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h > index cc8321d..1a8a577 100644 > --- a/opensm/include/opensm/osm_sm.h > +++ b/opensm/include/opensm/osm_sm.h > @@ -539,7 +539,8 @@ osm_resp_send(IN osm_sm_t * sm, > ib_api_status_t > osm_sm_mcgrp_join(IN osm_sm_t * const p_sm, > IN const ib_net16_t mlid, > - IN const ib_net64_t port_guid); > + IN const ib_net64_t port_guid, > + IN uint8_t scope_state); > /* > * PARAMETERS > * p_sm > diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c > index 5543221..fe29dd6 100644 > --- a/opensm/opensm/osm_sa_mcmember_record.c > +++ b/opensm/opensm/osm_sa_mcmember_record.c > @@ -1039,7 +1039,7 @@ static void mcmr_rcv_leave_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) > if (!p_mgrp) { > char gid_str[INET6_ADDRSTRLEN]; > CL_PLOCK_RELEASE(sa->p_lock); > - OSM_LOG(sa->p_log, OSM_LOG_DEBUG, > + OSM_LOG(sa->p_log, OSM_LOG_INFO, > "Failed since multicast group %s not present\n", > inet_ntop(AF_INET6, p_recvd_mcmember_rec->mgid.raw, > gid_str, sizeof gid_str)); > @@ -1309,8 +1309,8 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) > > /* do the actual routing (actually schedule the update) */ > status = osm_sm_mcgrp_join(sa->sm, mlid, > - p_recvd_mcmember_rec->port_gid.unicast. 
> - interface_id); > + p_recvd_mcmember_rec->port_gid.unicast.interface_id, > + p_recvd_mcmember_rec->scope_state); > > if (status != IB_SUCCESS) { > OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 1B14: " > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c > index daa60ff..b334d39 100644 > --- a/opensm/opensm/osm_sm.c > +++ b/opensm/opensm/osm_sm.c > @@ -468,7 +468,7 @@ static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm, > /********************************************************************** > **********************************************************************/ > ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, > - IN const ib_net64_t port_guid) > + IN const ib_net64_t port_guid, IN uint8_t scope_state) > { > osm_mgrp_t *p_mgrp; > osm_port_t *p_port; > @@ -515,6 +515,25 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, > goto Exit; > } > > + /* if there was no change from the last time > + * we processed the group we can skip doing anything > + */ > + if (p_mgrp->last_change_id == p_mgrp->last_tree_id) { > + OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE, > + "Skip processing mgrp with lid:0x%X last change id:%u\n", > + cl_ntoh16(mlid), p_mgrp->last_change_id); > + goto Exit; > + } else { > + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG, > + "processing mgrp with lid:0x%X port: 0x%016" PRIx64 " last change id:%u tree id:%u\n", > + cl_ntoh16(mlid), cl_ntoh64(port_guid), > + p_mgrp->last_change_id, p_mgrp->last_tree_id); > + } > + > + /* add mgrp only to FULL member port. used for cleanup when port goes down */ > + if (!(scope_state & IB_JOIN_STATE_FULL)) > + goto MgrpProcess; > + > /* > * Check if the object (according to mlid) already exists on this port. 
> * If it does - then no need to update it again, and no need to > @@ -543,6 +562,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, > goto Exit; > } > > +MgrpProcess: > status = sm_mgrp_process(p_sm, p_mgrp); > CL_PLOCK_RELEASE(p_sm->p_lock); > The following fixes a bug in the above PATCH that will lock the opensm when multicast group was not changed. diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c index b334d39..28cd76f 100644 --- a/opensm/opensm/osm_sm.c +++ b/opensm/opensm/osm_sm.c @@ -519,6 +519,7 @@ ib_api_status_t osm_sm_mcgrp_join(IN osm_sm_t * p_sm, IN const ib_net16_t mlid, * we processed the group we can skip doing anything */ if (p_mgrp->last_change_id == p_mgrp->last_tree_id) { + CL_PLOCK_RELEASE(p_sm->p_lock); OSM_LOG(p_sm->p_log, OSM_LOG_VERBOSE, "Skip processing mgrp with lid:0x%X last change id:%u\n", cl_ntoh16(mlid), p_mgrp->last_change_id); From vlad at lists.openfabrics.org Tue May 19 03:25:55 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 19 May 2009 03:25:55 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090519-0200 daily build status Message-ID: <20090519102555.601EBE615C4@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 
with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sebastien.dugue at bull.net Tue May 19 05:09:21 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Tue, 19 May 2009 14:09:21 +0200 Subject: [ofa-general] [PATCH] perftest - Fix proc_get_cpu_mhz() on IA64 Message-ID: <20090519140921.3ea442b6@frecb007965> On IA64, proc_get_cpu_mhz() must use the ITC frequency rather than the CPU frequency. 
Signed-off-by: Sebastien Dugue --- get_clock.c | 18 +++++++++++++----- 1 files changed, 13 insertions(+), 5 deletions(-) diff --git a/get_clock.c b/get_clock.c index 0acb074..cc86452 100755 --- a/get_clock.c +++ b/get_clock.c @@ -144,12 +144,20 @@ static double proc_get_cpu_mhz(int no_cpu_freq_fail) while(fgets(buf, sizeof(buf), f)) { double m; int rc; + +#if defined (__ia64__) + /* Use the ITC frequency on IA64 */ + rc = sscanf(buf, "itc MHz : %lf", &m); +#elif defined (__PPC__) || defined (__PPC64__) + /* PPC has a different format as well */ + rc = sscanf(buf, "clock : %lf", &m); +#else rc = sscanf(buf, "cpu MHz : %lf", &m); - if (rc != 1) { /* PPC has a different format */ - rc = sscanf(buf, "clock : %lf", &m); - if (rc != 1) - continue; - } +#endif + + if (rc != 1) + continue; + if (mhz == 0.0) { mhz = m; continue; -- 1.6.3.rc3.12.gb7937 From tziporet at dev.mellanox.co.il Tue May 19 07:28:16 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 19 May 2009 17:28:16 +0300 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes In-Reply-To: <4A12378C.8030101@sun.com> References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com> Message-ID: <4A12C200.4000708@mellanox.co.il> Liang Zhen wrote: > Hi Ole, > Have you got solution for this? I think we got exactly same problem on > 4600 with ofed-1.4.1-rc4: > lspci output: > 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe > 2.0 2.5GT/s] (rev a0) > > and error messages from dmesg: > > mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008) > mlx4_core: Initializing 0000:03:00.0 > mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1, > reducing to 1. > mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1 > mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5) > mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting. 
> mlx4_core: probe of 0000:03:00.0 failed with error -5 > > Can you send me the FW version and board type? Since the driver is not loading, you can use mstflint to get this data. Please use: The devices can be accessed by their PCI ID as displayed by lspci (bus:dev.fn). Example: # List all Mellanox devices > /sbin/lspci -d 15b3: 02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368 (rev a0) # Use mstflint tool to query the firmware on this device > mstflint -d 02:00.0 q Tziporet From jsquyres at cisco.com Tue May 19 07:57:30 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 19 May 2009 10:57:30 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <730ECD2D-62E3-4EC7-92E7-A50FAD12D432@cisco.com><469958e00905181102s758ca2f3uddb09e8d604bd030@mail.gmail.com><167AB191-8F5E-4B82-823A-B2A04E2BF76D@cisco.com> Message-ID: <2639C2E6-BFD7-463C-AF1D-382F360F6236@cisco.com> On May 18, 2009, at 5:15 PM, Roland Dreier (rdreier) wrote: > So you want the registration cache to be reference counted per-page? > Seems like potentially a lot of overhead -- if someone registers a > million pages, then to check for a cache hit, you have to potentially > check millions of reference counts. > Our caches are hash tables of balanced red-black trees. So in practice, we won't be trolling through anywhere near a million reference counts to find a hit. > Hang on. The whole point of MR caching is exactly that you don't > unregister a memory region, even after you're done using the memory it > covers, in the hope that you'll want to reuse that registration. And > the whole point of this thread is that an application can then free() > some of the memory that is still registered in the cache. > Sorry -- the implication that I took from Caitlyn's text was that the memory was *used* after it was freed. That is clearly erroneous. What OMPI does (and apparently other MPIs do) is simply invalidate any registration for free'd memory. 
Additionally, we won't unregister memory while there is at least one use of it outstanding (that MPI knows about, such as a pending non-blocking communication). We lazily unregister just for exactly the case you're talking about (might want to use it for verbs communication again later). > > Per my prior mail, Open MPI registers chucks at a time. Each > chunk is > > potentially a multiple of pages. So yes, you could end up having a > > single registration that spans the buffers used in multiple, > distinct > > MPI sends. We reference count by page to ensure that > deregistrations > > do not occur prematurely. > > Hmm, I'm worried that the exact semantics of the memory cache seem > to be > tied into how the MPI implementation is registering memory. Open MPI > happens to work in small chunks (I guess) and so your cache is > tailored > for that use case. I know the original proposal was an attempt to > come > up with something that all the MPIs can agree on, but it didn't cover > the full semantics, at least not for cases like the overlapping > sub-registrations that we're discussing here. Is there still one > set of > semantics everyone can agree on? > So just to be clear -- let's separate the two issues that are evolving from this thread: 1. fix the hole where memory returned to the OS cannot be guaranteed to be caught by userspace (and therefore may still stay registered and/ or invalidate userspace registration cache entries) 2. have libibverbs include some form of memory registration caching (potentially using the solution to #1 to know when to invalidate reg. cache entries) Personally, I would prioritize them in the issues in this order. Did a solution for #1 get agreed upon? I admit that I got lost in the kernel discussion of issues between you, Jason, etc. 
Agreeing on registration caching semantics may take a little more
discussion (although, as someone pointed out earlier, if libibverbs'
reg caching is optional, then the verbs-based app can choose to use it
or their own scheme).

-- 
Jeff Squyres
Cisco Systems

From akepner at sgi.com Tue May 19 14:55:05 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 19 May 2009 14:55:05 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup()
Message-ID: <20090519215505.GN6837@sgi.com>

We've seen a few instances of a crash in ipoib_neigh_cleanup() due to
the use of a stale pointer:

 848         neigh = *to_ipoib_neigh(n);   <- read neigh (no locking)
 .....
 858         spin_lock_irqsave(&priv->lock, flags);
 859
 860         if (neigh->ah)   <--- at this point neigh may be stale
 861                 ah = neigh->ah;
 862         list_del(&neigh->list);
 863         ipoib_neigh_free(n->dev, neigh);
 864
 865         spin_unlock_irqrestore(&priv->lock, flags);

(Mentioned this here:
http://lists.openfabrics.org/pipermail/ewg/2008-April/006459.html)

We've been using a patch which re-reads neigh after the spinlock
is taken. It's been effective in practice, but there's still a
window where it's possible to use the stale pointer.

I've been looking into a proper fix for this, and I'd like to
solicit any ideas.

First thought was to use RCU, e.g., instead of to_ipoib_neigh(),
use:

static inline struct ipoib_neigh* ipoib_neigh_retrieve(struct neighbour *n)
{
	struct ipoib_neigh **np;

	np = (void*) n + ALIGN(offsetof(struct neighbour, ha) +
			       INFINIBAND_ALEN, sizeof(void *));
	return rcu_dereference(*np);
}

static inline void ipoib_neigh_assign(struct neighbour *n,
				      struct ipoib_neigh *in)
{
	struct ipoib_neigh **np;

	np = (void*) n + ALIGN(offsetof(struct neighbour, ha) +
			       INFINIBAND_ALEN, sizeof(void *));
	rcu_assign_pointer(*np, in);
}

where ipoib_neigh_retrieve() is done under rcu_read_lock() and
ipoib_neigh_assign() under some spinlock (ipoib_dev_priv's lock
might be repurposed for that use).
But that approach gets more complicated than seems warranted (partly
because there's a need to promote readers to writers in a few
places...).

Second thought was to use new locks to serialize access to the
ipoib_neigh pointer stashed away in struct neighbour. Something
like:

struct ipoib_neigh_lock
{
	spinlock_t sl;
}__attribute__((__aligned__(SMP_CACHE_BYTES)));

#define IPOIB_LOCK_SHIFT 6
#define IPOIB_LOCK_SIZE (1 << IPOIB_LOCK_SHIFT)
#define IPOIB_LOCK_MASK (IPOIB_LOCK_SIZE -1)

static struct ipoib_neigh_lock
ipoib_neigh_locks[IPOIB_LOCK_SIZE] __cacheline_aligned;

static inline void lock_ipoib_neigh(unsigned int hval)
{
	spin_lock(&ipoib_neigh_locks[hval & IPOIB_LOCK_MASK].sl);
}

static inline void unlock_ipoib_neigh(unsigned int hval)
{
	spin_unlock(&ipoib_neigh_locks[hval & IPOIB_LOCK_MASK].sl);
}

unsigned int ipoib_neigh_hval(struct neighbour *n);

....

static void ipoib_neigh_cleanup(struct neighbour *n)
{
	.....
	unsigned int hval = ipoib_neigh_hval(n);

	lock_ipoib_neigh(hval);
	neigh = *to_ipoib_neigh(n);
	if (neigh)
		priv = netdev_priv(neigh->dev);
	else
		return;
	....
	spin_lock_irqsave(&priv->lock, flags);

	if (neigh->ah)
		ah = neigh->ah;
	list_del(&neigh->list);
	ipoib_neigh_free(n->dev, neigh);

	spin_unlock_irqrestore(&priv->lock, flags);
	unlock_ipoib_neigh(hval);
	....

This seems much simpler, but maybe there are better approaches?

-- 
Arthur

From rdreier at cisco.com Tue May 19 15:01:13 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 19 May 2009 15:01:13 -0700
Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages
In-Reply-To: <20090518095516.6a803492@frecb007965> (sebastien dugue's message of "Mon, 18 May 2009 09:55:16 +0200")
References: <20090518095156.7f9c39e6@frecb007965> <20090518095516.6a803492@frecb007965>
Message-ID:

 > QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers
 > size to the page size and then allocates page aligned memory using
 > posix_memalign().
> > However, this allocation is quite wasteful on architectures using 64K pages > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc > parameter and chunks are allocated using mmap. thus we end up allocating: > > (requested size rounded to the page size) + (page size) + (malloc overhead) > > rounded internally to the page size. > > So for example, if we request a buffer of page_size bytes, we end up > consuming 3 pages. In short, for each QP buffer we allocate, there is an > overhead of 2 pages. This is quite visible on large clusters especially where > the number of QP can reach several thousands. > > This patch creates a new function mlx4_alloc_page() for use by > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when > the page size is 64K. makes sense I guess. It would be nice if glibc() were smart enough to know that mmap(MAP_ANONYMOUS) is going to give something page-aligned anyway, but it seems that malloc overhead (required to make the memory from posix_memalign() work with free()) is going to cost at least one extra page, which as you point out is pretty bad with 64KB pages. (Of course 64KB pages are a disaster for any workload that deals with small objects of any kind, but that's another story) However I wonder why we want to make this optimization only for 64KB pages. It seems the code would be simpler if we just had our own page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it unconditionally everywhere. Or is it not actually better even on sane-sized (ie 4KB) page systems? It seems we still have the malloc overhead which is going to cost us another page? - R. 
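
A minimal sketch of the unconditional mmap(MAP_ANONYMOUS) page-aligned
allocator floated above, for illustration only. The names page_buf,
page_buf_alloc() and page_buf_free() are hypothetical; this is not the
libmlx4 code, and error handling is kept deliberately simple.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical buffer descriptor (an assumption, not struct mlx4_buf). */
struct page_buf {
	void	*addr;		/* page-aligned start of the mapping */
	size_t	 length;	/* size rounded up to whole pages */
};

/* Round size up to a multiple of page_size, which must be a power of two. */
static size_t round_up_to_page(size_t size, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size + page_size - 1) & ~(page_size - 1);
}

/* Map anonymous, private, read/write memory; returns 0 on success, -1 on error. */
static int page_buf_alloc(struct page_buf *buf, size_t size)
{
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || size == 0)
		return -1;

	buf->length = round_up_to_page(size, (size_t)page_size);
	buf->addr = mmap(NULL, buf->length, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf->addr == MAP_FAILED) {
		buf->addr = NULL;
		buf->length = 0;
		return -1;
	}
	return 0;
}

/* Unmap the buffer; the length stored at allocation time is required. */
static void page_buf_free(struct page_buf *buf)
{
	if (buf->addr)
		munmap(buf->addr, buf->length);
	buf->addr = NULL;
	buf->length = 0;
}
```

Since the mapping comes straight from the kernel it is page-aligned and
zero-filled, and no malloc bookkeeping sits next to it; the price is an
mmap()/munmap() system call pair per buffer, and the caller has to keep
the length around for the unmap.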
From abenjamin at sgi.com Tue May 19 19:41:52 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Tue, 19 May 2009 19:41:52 -0700 Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: References: <4A0B560B.3090606@sgi.com> Message-ID: <4A136DF0.7000402@sgi.com> Roland Dreier wrote: > So I would suggest reworking this into a series of patches: > > 1. Add a function ib_alloc_device_set_name() that does what your > ib_init_device() function does. (By the way, there is a problem with > your implementation, since alloc_name() just checks the list of > registered devices for a collision -- so devices that are allocated > but not registered could be assigned the same name, if the kernel > ever moves to parallelizing PCI probing or something like that -- so > you should probably fix alloc_name() to check a list of all allocated > devices or something like that) > The current implementation of IB core module doesn't maintain a list of allocated IB devices. Are you suggesting that we create a separate list of allocated but not registered devices in addition to the existing list of registered devices. Please clarify. Alternatively, we can use the existing registered devices list named 'device_list' in the IB core module to keep track of both allocated and registered devices. Currently, the ib_device can be in one of three states(IB_DEV_UNINITIALIZED, IB_DEV_REGISTERED, IB_DEV_UNREGISTERED). We can enhance this to include 'INITIALIZED' state and add the ib_device to 'device_list' with this new state at ib_alloc_device_set_name() time. In this case, there will be no changes to alloc_name() as it is already checking for device name collision in a single list irrespective of the state of the device. > 2. For each RDMA driver (ie each of drivers/infiniband/hw/xxx), convert > to using ib_init_device_alloc_name() -- one patch per driver. 
> I wanted to point out that the proposed patch will not fix the /proc/interrupts reporting issue for ConnectX IB devices because request_irq() is done by mlx4_core and not by mlx4_ib. Also, mlx4_core doesn't plug into IB core module. > 3. Remove the old ib_alloc_device() and rename > ib_alloc_device_set_name() back to ib_alloc_device(). > > - R. > I assume that there will be a transition period to allow deprecation of ib_alloc_device_set_name() before we can apply this patch. Is my assumption correct? Regards, Benjamin From rdreier at cisco.com Tue May 19 21:04:39 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 19 May 2009 21:04:39 -0700 Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <4A136DF0.7000402@sgi.com> (Arputham Benjamin's message of "Tue, 19 May 2009 19:41:52 -0700") References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> Message-ID: [please hit "enter" every 70-80 characters or so, it makes email easier to read and quote] > The current implementation of IB core module doesn't maintain a list > of allocated IB devices. Are you suggesting that we create a separate > list of allocated but not registered devices in addition to the > existing list of registered devices. Please clarify. > Alternatively, we can use the existing registered devices list named > 'device_list' in the IB core module to keep track of both allocated > and registered devices. Currently, the ib_device can be in one of > three states(IB_DEV_UNINITIALIZED, IB_DEV_REGISTERED, > IB_DEV_UNREGISTERED). We can enhance this to include 'INITIALIZED' > state and add the ib_device to 'device_list' with this new state at > ib_alloc_device_set_name() time. In this case, there will be no > changes to alloc_name() as it is already checking for device name > collision in a single list irrespective of the state of the device. The second solution (adding an INITIALIZED state) seems simpler. 
In fact we could get rid of the UNINITIALIZED state after the patch
series since there wouldn't be a way to allocate an uninitialized
structure.

 > I wanted to point out that the proposed patch will not fix the
 > /proc/interrupts reporting issue for ConnectX IB devices because
 > request_irq() is done by mlx4_core and not by mlx4_ib. Also,
 > mlx4_core doesn't plug into IB core module.

Good point. So I guess we should try to come up with a more general way
that works for mlx4 as well. Perhaps enhance the PCI core so that all
MSI-X vectors for a device are reported in the /sys hierarchy (analogous
to the existing irq file that is under /sys/devices), which would work
for all possible devices, rather than having an RDMA-specific method?

 > I assume that there will be a transition period to allow deprecation
 > of ib_alloc_device_set_name() before we can apply this patch. Is my
 > assumption correct?

No, once all the drivers in the kernel are converted to the new API,
there's no longer any point in keeping the old API (especially given how
rare new RDMA drivers are).

 - R.

From zafargilani at gmail.com Tue May 19 21:25:53 2009
From: zafargilani at gmail.com (Zafar Gilani)
Date: Wed, 20 May 2009 09:25:53 +0500
Subject: [ofa-general] Executing IB Verbs/RDMA client/server code via JNI
Message-ID: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com>

I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to
implement a communication device over InfiniBand fabric. I have executed
client/server code (most part from Roland Dreier, CISCO) and it works
absolutely fine. However, when I try to call the same thing via JNI, the code
gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context
"cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve
the error.

The JVM actually crashes, stating that the error is at frame "ibv_alloc_pd"
and that the crash happened in native code.
Though this is understandable, but the error is not, since the same code works when executed directly with c compiler but gives trouble with JNI. Compilers: java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) Environment: Red Hat 4.12 2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz I have attached the compressed file that contains all the files (.java, .h, .c and .log). I was hoping that someone could may be point me in the right direction. Any help will be greatly appreciated. Regards, -- Syed Zafar ul Hussan Gilani | BIT-7 Research Student | CHPSC MSP 2008-09 NUST SEECS | http://hpc.niit.edu.pk/~zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: From zafargilani at gmail.com Tue May 19 21:27:23 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Wed, 20 May 2009 09:27:23 +0500 Subject: [ofa-general] Re: Executing IB Verbs/RDMA client/server code via JNI In-Reply-To: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com> References: <7d4423d30905192125s16bc1ee9jd65d564e37275cda@mail.gmail.com> Message-ID: <7d4423d30905192127h47379d3ch61070741d9292f88@mail.gmail.com> Sorry forgot to attach the tarball. On Wed, May 20, 2009 at 9:25 AM, Zafar Gilani wrote: > I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to > implement a communication device over InfiniBand fabric. I have executed > client/server code (most part from Roland Dreier, CISCO) and it works > absolutely fine. However when I try to call the same thing via JNI, the code > gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context > "cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve > the error. > > JVM actually crashes stating that error exists at frame: "ibv_alloc_pd" and > that crashed happened in native code. 
Though this is understandable, but the > error is not, since the same code works when executed directly with c > compiler but gives trouble with JNI. > > Compilers: > java version "1.6.0_07" > Java(TM) SE Runtime Environment (build 1.6.0_07-b06) > Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) > > gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) > > Environment: > Red Hat 4.12 > 2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz > > I have attached the compressed file that contains all the files (.java, .h, > .c and .log). I was hoping that someone could may be point me in the right > direction. > > Any help will be greatly appreciated. > > Regards, > -- > Syed Zafar ul Hussan Gilani | BIT-7 > Research Student | CHPSC > MSP 2008-09 > NUST SEECS | http://hpc.niit.edu.pk/~zafar > -- Syed Zafar ul Hussan Gilani | BIT-7 Research Student | CHPSC MSP 2008-09 NUST SEECS | http://hpc.niit.edu.pk/~zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: jni.tar Type: application/x-tar Size: 49664 bytes Desc: not available URL: From sebastien.dugue at bull.net Tue May 19 23:00:47 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Wed, 20 May 2009 08:00:47 +0200 Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages In-Reply-To: References: <20090518095156.7f9c39e6@frecb007965> <20090518095516.6a803492@frecb007965> Message-ID: <20090520080047.3d20cce7@frecb007965> Hi Roland, On Tue, 19 May 2009 15:01:13 -0700 Roland Dreier wrote: > > QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers > > size to the page size and then allocates page aligned memory using > > posix_memalign(). 
> > > > However, this allocation is quite wasteful on architectures using 64K pages > > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc > > parameter and chunks are allocated using mmap. thus we end up allocating: > > > > (requested size rounded to the page size) + (page size) + (malloc overhead) > > > > rounded internally to the page size. > > > > So for example, if we request a buffer of page_size bytes, we end up > > consuming 3 pages. In short, for each QP buffer we allocate, there is an > > overhead of 2 pages. This is quite visible on large clusters especially where > > the number of QP can reach several thousands. > > > > This patch creates a new function mlx4_alloc_page() for use by > > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when > > the page size is 64K. > > makes sense I guess. It would be nice if glibc() were smart enough to > know that mmap(MAP_ANONYMOUS) is going to give something page-aligned > anyway, If you mean in the posix_memalign() path, then yes it'd be really nice. > but it seems that malloc overhead (required to make the memory > from posix_memalign() work with free()) is going to cost at least one > extra page, which as you point out is pretty bad with 64KB pages. (Of > course 64KB pages are a disaster for any workload that deals with small > objects of any kind, but that's another story) Yep, agreed. > > However I wonder why we want to make this optimization only for 64KB > pages. It seems the code would be simpler if we just had our own > page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it > unconditionally everywhere. Or is it not actually better even on > sane-sized (ie 4KB) page systems? It seems we still have the malloc > overhead which is going to cost us another page? Well not really, because if we stay below MMAP_THRESHOLD, as we do with 4K pages, the only overhead is malloc's chaining structure. 
The extra space used to align the buffer is released before posix_memalign() returns, but that does increase fragmentation of mallocs chunks. Also, for 4K pages, mmap() systematically results in a syscall whereas posix_memalign() does not necessarily, but as we're not on a fast path I'm not sure what would be best. I don't mind converting all QP buffers allocation to mmap(), but I'd like to hear what people think. Thanks Roland, Sebastien. From ogerlitz at Voltaire.com Wed May 20 00:14:34 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Wed, 20 May 2009 10:14:34 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090519215505.GN6837@sgi.com> References: <20090519215505.GN6837@sgi.com> Message-ID: <4A13ADDA.5040908@Voltaire.com> akepner at sgi.com wrote: > We've seen a few instances of a crash in ipoib_neigh_cleanup() due to the use of a stale pointer: > 848 neigh = *to_ipoib_neigh(n); <- read neigh (no locking) > ..... > 858 spin_lock_irqsave(&priv->lock, flags); > 860 if (neigh->ah) <--- at this point neigh may be stale [...] > I've been looking into a proper fix for this, and I'd like to solicit any ideas. Before going into possible solutions, could you say what kernel are you using? With this or related problems being around for couple of years, I always wanted to understand (A) why access to from-the-kernel-point-of-view-to-be-destructed-neighbour be protected? and (B) how come it can becomes stale? before 2.6.17-20 or so this could have happen since the ipoib neighbour destructor could have been called for NON ipoib neighbours - which for my understanding isn't the case any more in modern kernels see commit ecbb416939da77c0d107409976499724baddce7b "[NET]: Fix neighbour destructor handling" Or. 
From vlad at lists.openfabrics.org Wed May 20 03:22:31 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 20 May 2009 03:22:31 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090520-0200 daily build status Message-ID: <20090520102231.5EEB4E61401@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 
with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From eli at dev.mellanox.co.il Wed May 20 04:10:36 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 20 May 2009 14:10:36 +0300 Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages In-Reply-To: <20090520080047.3d20cce7@frecb007965> References: <20090518095156.7f9c39e6@frecb007965> <20090518095516.6a803492@frecb007965> <20090520080047.3d20cce7@frecb007965> Message-ID: <20090520111036.GA13831@mtls03> On Wed, May 20, 2009 at 08:00:47AM +0200, sebastien dugue wrote: > > Well not really, because if we stay below MMAP_THRESHOLD, as we do > with 4K pages, the only overhead is malloc's chaining structure. The > extra space used to align the buffer is released before posix_memalign() > returns, but that does increase fragmentation of mallocs chunks. > > Also, for 4K pages, mmap() systematically results in a syscall whereas > posix_memalign() does not necessarily, but as we're not on a fast path > I'm not sure what would be best. I don't mind converting all QP buffers > allocation to mmap(), but I'd like to hear what people think. > If the only reasoning behind using a MMAP_THRESHOLD is to avoid the system call for smaller allocations, then I think we'd better use a uniform allocation scheme -- mmap -- as you proposed and not distinguish between the two cases. 
From sebastien.dugue at bull.net Wed May 20 04:39:06 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Wed, 20 May 2009 13:39:06 +0200 Subject: [ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages In-Reply-To: <20090520111036.GA13831@mtls03> References: <20090518095156.7f9c39e6@frecb007965> <20090518095516.6a803492@frecb007965> <20090520080047.3d20cce7@frecb007965> <20090520111036.GA13831@mtls03> Message-ID: <20090520133906.552fee01@frecb007965> On Wed, 20 May 2009 14:10:36 +0300 Eli Cohen wrote: > On Wed, May 20, 2009 at 08:00:47AM +0200, sebastien dugue wrote: > > > > Well not really, because if we stay below MMAP_THRESHOLD, as we do > > with 4K pages, the only overhead is malloc's chaining structure. The > > extra space used to align the buffer is released before posix_memalign() > > returns, but that does increase fragmentation of mallocs chunks. > > > > Also, for 4K pages, mmap() systematically results in a syscall whereas > > posix_memalign() does not necessarily, but as we're not on a fast path > > I'm not sure what would be best. I don't mind converting all QP buffers > > allocation to mmap(), but I'd like to hear what people think. > > > > If the only reasoning behind using a MMAP_THRESHOLD is to avoid the > system call for smaller allocations, Well, that's not the only reason. From what I understand, for small allocations, glibc's malloc can recycle freed heap chunks much more easily than mmapped chunks. Also the mmapped chunk must be zeroed by the kernel before being handed to the user which does not comes for free. > then I think we'd better use a > uniform allocation scheme -- mmap -- as you proposed and not > distinguish between the two cases. > I will respin those patches early next week if nobody disagrees with this route. Thanks, Sebastien. 
From zafargilani at gmail.com Wed May 20 09:31:12 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Wed, 20 May 2009 21:31:12 +0500 Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI Message-ID: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com> This is my second message on the list. This one is exactly same as first one, my previous did not receive any replies. I will be thankful if anyone could point out the problem in the code files. Problem is explained below: I am an undergrad student doing my FYP. I am using IB Verbs and RDMA CM to implement a communication device over InfiniBand fabric. I have executed client/server code (most part from Roland Dreier, CISCO) and it works absolutely fine. However when I try to call the same thing via JNI, the code gets stuck at "ibv_alloc_pd". I have checked the "cm_id", the ibv_context "cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve the error. JVM actually crashes stating that error exists at frame: "ibv_alloc_pd" and that crash happened in native code. Though this is understandable, but the error is not, since the same code works when executed directly with c compiler but gives trouble with JNI.
Compilers: java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) Environment: Red Hat 4.12 2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz Compressed file (jni.tar) that contains all the files (.java, .h, .c and .log) is available at [http://hpc.niit.edu.pk/~zafar/work/ib/jni.tar]. I was hoping that someone could may be point me in the right direction. Any help will be greatly appreciated. Thanks, -- Syed Zafar ul Hussan Gilani | BIT-7 Research Student | CHPSC MSP 2008-09 NUST SEECS | http://hpc.niit.edu.pk/~zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed May 20 10:28:38 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 20 May 2009 10:28:38 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090519215505.GN6837@sgi.com> (akepner@sgi.com's message of "Tue, 19 May 2009 14:55:05 -0700") References: <20090519215505.GN6837@sgi.com> Message-ID: > We've seen a few instances of a crash in ipoib_neigh_cleanup() due to > the use of a stale pointer: > > > 848 neigh = *to_ipoib_neigh(n); <- read neigh (no locking) > ..... > 858 spin_lock_irqsave(&priv->lock, flags); > 859 > 860 if (neigh->ah) <--- at this point neigh may be stale > 861 ah = neigh->ah; > 862 list_del(&neigh->list); > 863 ipoib_neigh_free(n->dev, neigh); > 864 > 865 spin_unlock_irqrestore(&priv->lock, flags); I'd like to understand the bug first -- how is the neighbour being destroyed out from under us in ipoib_neigh_cleanup()? I would have thought the cleanup function would run when no references to the struct remain but before it's freed. - R. 
From sokar6012 at hotmail.com Wed May 20 12:58:31 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Wed, 20 May 2009 19:58:31 +0000
Subject: [ofa-general] Infiniband with Xen
Message-ID:

Hi,

I'm currently working on the latest version of Xen with Debian (lenny), and I have done 2 PCI passthroughs. The first one is with eth1 and I got no problem with this one, but the second one is with an InfiniBand adapter (MT23108 Cougar revision A1, latest firmware 3.5) and it's not working. I got this message with dmesg on my DomU:

[ 4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[ 4.111047] ib_mthca: Initializing 0000:00:00.0
[ 4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
[ 4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
[ 15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
[ 15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.

Do you know if there is a solution, like previously with XEN smartio or Xen-IB (which is no longer developed), to do High Performance VMM-Bypass I/O in Virtual Machines?

From abenjamin at sgi.com Wed May 20 13:24:34 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 20 May 2009 15:24:34 -0500
Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts
References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com>
Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com>

> > I wanted to point out that the proposed patch will not fix the
> > /proc/interrupts reporting issue for ConnectX IB devices because
> > request_irq() is done by mlx4_core and not by mlx4_ib.
Also, > > mlx4_core doesn't plug into IB core module. > Good point. So I guess we should try to come up with a more general way > that works for mlx4 as well. Perhaps enhance the PCI core so that all > MSI-X vectors for a device are reported in the /sys hierarchy (analogous > to the existing irq file that is under /sys/devices), which would work > for all possible devices, rather than having an RDMA-specific method? > - R. Can I proceed with the ib_alloc_device_set_name()IB core API changes, and mthca driver changes we agreed? After we test and apply these patches, we can take a look at how we can fix mlx4 as well. Please confirm. Regards, Benjamin -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed May 20 13:54:07 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 20 May 2009 13:54:07 -0700 Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> (Arputham Benjamin's message of "Wed, 20 May 2009 15:24:34 -0500") References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> Message-ID: > Can I proceed with the ib_alloc_device_set_name()IB core API changes, > and mthca driver changes we agreed? After we test and apply these > patches, we can take a look at how we can fix mlx4 as well. I think it would be much better to come up with a way to handle mlx4 as well. There's not much point in making core changes if they don't fix the issue for all drivers. - R. 
From rdreier at cisco.com Wed May 20 13:55:49 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 20 May 2009 13:55:49 -0700
Subject: [ofa-general] Infiniband with Xen
In-Reply-To: (anthony garnier's message of "Wed, 20 May 2009 19:58:31 +0000")
References:
Message-ID:

 > I'm currently working on the latest version of Xen with Debian (lenny), and I have done 2 PCI passthroughs. The first one is with eth1 and I got no problem with this one, but the second one is with an InfiniBand adapter (MT23108 Cougar revision A1, latest firmware 3.5) and it's not working.
 > I got this message with dmesg on my DomU:
 >
 > [ 4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
 > [ 4.111047] ib_mthca: Initializing 0000:00:00.0
 > [ 4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
 > [ 4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
 > [ 15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
 > [ 15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.
 >
 > Do you know if there is a solution, like previously with XEN smartio or Xen-IB (which is no longer developed), to do High Performance VMM-Bypass I/O in Virtual Machines?

I'm not sure about smartio or Xen-IB, but you could try assigning both HCA PCI devices to your domU and see if it works better (the HCA should appear in lspci as both a PCI bridge and an actual HCA device, and the driver expects to find both).

 - R.
From akepner at sgi.com Wed May 20 14:37:03 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 20 May 2009 14:37:03 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: References: <20090519215505.GN6837@sgi.com> Message-ID: <20090520213703.GT6837@sgi.com> On Wed, May 20, 2009 at 10:28:38AM -0700, Roland Dreier wrote: > > > We've seen a few instances of a crash in ipoib_neigh_cleanup() due to > > the use of a stale pointer: > > > > > > 848 neigh = *to_ipoib_neigh(n); <- read neigh (no locking) > > ..... > > 858 spin_lock_irqsave(&priv->lock, flags); > > 859 > > 860 if (neigh->ah) <--- at this point neigh may be stale > > 861 ah = neigh->ah; > > 862 list_del(&neigh->list); > > 863 ipoib_neigh_free(n->dev, neigh); > > 864 > > 865 spin_unlock_irqrestore(&priv->lock, flags); > > I'd like to understand the bug first -- how is the neighbour being > destroyed out from under us in ipoib_neigh_cleanup()? I would have > thought the cleanup function would run when no references to the struct > remain but before it's freed. > I should have been more specific - it's not the ipoib_neigh structure pointer itself, but the list inside the structure where we've found a problem. The specific crash we are trying to fix is when someone does list_del(&neigh->list) just before we acquire the lock at line 860. (But all the callers of list_del(&neigh->list) subsequently call ipoib_neigh_free(), too, so the neigh pointer is bad.) 
The signature of the crash is like this:

Unable to handle kernel paging request at 0000000000100108 RIP:
                                          ^^^^^^^^^^^^^^^^ LIST_POISON1+0x8
{:ib_ipoib:ipoib_neigh_cleanup+368}
PGD 4152b3067 PUD 413ee4067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /class/infiniband/mthca1/node_type
CPU 7
Modules linked in: sg sd_mod crc32c libcrc32c rdma_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi loop numatools xpmem i2c_i801 libata i2c_core scsi_mod shpchp pci_hotplug nfs lockd nfs_acl af_packet sunrpc e1000
Pid: 0, comm: swapper Tainted: G U 2.6.16.54-0.2.5-smp #1
RIP: 0010:[] {:ib_ipoib:ipoib_neigh_cleanup+368}
RSP: 0018:ffff81042088bea8 EFLAGS: 00010082
RAX: 0000000000200200 RBX: ffff8104162fdd40 RCX: ffff8104162fdd98
RDX: 0000000000100100 RSI: ffff8104162fdd40 RDI: ffff81041b2f8500
RBP: ffff8103c7600480 R08: ffff81041e7b10f0 R09: 0000000000000000
R10: ffff810420885e48 R11: 0000000000003a98 R12: ffff81041aa39480
R13: ffff81041b2f8500 R14: 0000000000000246 R15: ffffffff803d3ff0
FS: 0000000000000000(0000) GS:ffff810420fdc2c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100108 CR3: 0000000417469000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810420884000, task ffff81042083f100)
Stack: ffff8103c7600480 0000000000000000 ffff8103c7600480 0000000108a8ec53
       ffffffff803c2700 ffffffff80284ebd ffff8103c7600480 ffffffff803862a0
       ffff81041610c380 ffffffff802871ca
Call Trace: {neigh_destroy+197}
       {neigh_periodic_timer+249} {neigh_periodic_timer+0}
       {run_timer_softirq+348} {__do_softirq+85}
       {call_softirq+30} {do_softirq+44}
       {mwait_idle+0} {apic_timer_interrupt+132}
       {mwait_idle+0} {mwait_idle+54}
       {cpu_idle+151} {start_secondary+1240}

--
Arthur

From abenjamin at sgi.com Wed May 20 16:54:05 2009
From: abenjamin at sgi.com (Arputham Benjamin)
Date: Wed, 20 May 2009
18:54:05 -0500 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> >> Can I proceed with the ib_alloc_device_set_name()IB core API changes, >> and mthca driver changes we agreed? After we test and apply these >> patches, we can take a look at how we can fix mlx4 as well. > I think it would be much better to come up with a way to handle mlx4 as > well. There's not much point in making core changes if they don't fix > the issue for all drivers. > - R. I wanted to add some clarification. We have two types of IB devices: 1)Devices that can operate as an InfiniBand adapter only 2)Devices that can operate as an InfiniBand adapter or as an Ethernet NIC As per the current implementation of OFED stack, the driver architecture of #2 is very different from #1 because it needs to make sure InfiniBand and Ethernet functions can share the device without interfering with each other. I was thinking that we can fix /proc/interrupts issue for case#1 first and worry about #2 later because the design to fix /proc/interrupts for mlx4 case is going to be different and independent just as the driver design is different and independent for the two cases today. We don't have a common kernel module in OFED stack that plugs into both types of IB devices as far as interrupt resource allocation is concerned. I think creating such a module would be a fundamental S/W arch change and would require a lot of changes to adopt to it. Please let me know if you still think we need a common solution for both cases mentioned above. Any suggestions at a high level for such a common solution? Thank you for your help. Regards, Benjamin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ahomike at us.ibm.com Wed May 20 17:23:01 2009 From: ahomike at us.ibm.com (Mike Aho) Date: Wed, 20 May 2009 19:23:01 -0500 Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X Message-ID: I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped for an RNIC (Ethernet RDMA card). Is there doc somewhere to explain how the fields are addressed differently than an IB HCA? Is there a set of conventions? Mike Aho -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Wed May 20 18:47:12 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 May 2009 20:47:12 -0500 Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X In-Reply-To: References: Message-ID: <4A14B2A0.8000704@opengridcomputing.com> Mike Aho wrote: > I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped > for an RNIC (Ethernet RDMA card). Is there doc somewhere to explain > how the fields are addressed differently than an IB HCA? Is there a > set of conventions? > > Mike Aho Hey Mike, Unfortunately, no doc. What fields are not clear? Steve. 
From jgunthorpe at obsidianresearch.com Wed May 20 20:28:51 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 20 May 2009 21:28:51 -0600 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> Message-ID: <20090521032851.GC11690@obsidianresearch.com> On Wed, May 20, 2009 at 06:54:05PM -0500, Arputham Benjamin wrote: > I was thinking that we can fix /proc/interrupts issue for case#1 first > and worry about #2 later because the design to fix /proc/interrupts > for mlx4 case is going to be different and independent just as the > driver design is different and independent for the two cases today. I notice on my system with 2.6.28 some drivers are appending the PCI ID: 17: 101272606 IO-APIC-fasteoi ATI IXP, radeon at pci:0000:01:05.0 This is alot simpler than trying to create small monotonic numbers.. Jason From ogerlitz at voltaire.com Wed May 20 22:32:22 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 21 May 2009 08:32:22 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090520213703.GT6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> Message-ID: <4A14E766.1010005@voltaire.com> akepner at sgi.com wrote: > I should have been more specific [...] {:ib_ipoib:ipoib_neigh_cleanup+368} yes, being more specific here helps because > Pid: 0, comm: swapper Tainted: G U 2.6.16.54-0.2.5-smp its very likely that the problem you face in 2.6.16 was fixed by the commit I pointed on in my previous reply on this thread. Or. 
From ogerlitz at voltaire.com Wed May 20 22:41:56 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 21 May 2009 08:41:56 +0300 Subject: [ofa-general] ibv_devinfo -v conventions for RNIC In-Reply-To: References: Message-ID: <4A14E9A4.9090600@voltaire.com> Mike Aho wrote: > Is there doc somewhere to explain how the fields are addressed > differently than an IB HCA? Is there a set of conventions? try https://wiki.openfabrics.org/tiki-index.php?page=Verbs%3A+Infiniband+vs+iWARP Or. From vlad at lists.openfabrics.org Thu May 21 03:23:31 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 21 May 2009 03:23:31 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090521-0200 daily build status Message-ID: <20090521102332.4133CE60E8C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on 
x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From amirv at mellanox.co.il Thu May 21 04:58:06 2009 From: amirv at mellanox.co.il (Amir Vadai) Date: Thu, 21 May 2009 14:58:06 +0300 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP Message-ID: <4A1541CE.9090509@mellanox.co.il> Sean Hi, I need to know how is the LocalAckTimeout for a new QP is calculated (for an SDP QP). Is there a way to change it through a module parameter? If not, what is the right way to change it? Thanks, Amir From Zhen.Liang at Sun.COM Thu May 21 05:03:12 2009 From: Zhen.Liang at Sun.COM (Liang Zhen) Date: Thu, 21 May 2009 20:03:12 +0800 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes In-Reply-To: <4A12C200.4000708@mellanox.co.il> References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com> <4A12C200.4000708@mellanox.co.il> Message-ID: <4A154300.6020100@sun.com> Tziporet, I get two x4600 and think they are same, but on the one failed to startup when I run mstflint: mstflint -d 03:00.0 q Warning: memory access to device 03:00.0 failed: Input/output error. Warning: Fallback on IO: much slower, and unsafe if device in use. 
*** ERROR *** Can not open 03:00.0: Not a directory MFE_CR_ERROR

On the other one (which load driver without error):

mstflint -d 03:00.0 q
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           00066a0098006abd 00066a00a0006abd 00066a01a0006abd 00066a0098006abd
Board ID:        j (MT_00A0000001)
VSD:             j
PSID:            MT_00A0000001

mstflint -d 03:00.0 v
Failsafe image:

    Invariant
      /0x00000028-0x0000095f (0x000938)/ (BOOT2) - OK

    Primary Image
      /0x00010000-0x00010107 (0x000108)/ (Pointer Sector)- OK
      /0x00030028-0x000308af (0x000888)/ (BOOT2) - OK
      /0x000308b0-0x00034feb (0x00473c)/ (BOOT2) - OK
      /0x00034fec-0x00035edb (0x000ef0)/ (Configuration) - OK
      /0x00035edc-0x00035f0f (0x000034)/ (GUID) - OK
      /0x00035f10-0x0003ed63 (0x008e54)/ (DDR) - OK
      /0x0003ed64-0x0004d63b (0x00e8d8)/ (DDR) - OK
      /0x0004d63c-0x00050573 (0x002f38)/ (DDR) - OK
      /0x00050574-0x0005204f (0x001adc)/ (DDR) - OK
      /0x00052050-0x0006accf (0x018c80)/ (DDR) - OK
      /0x0006acd0-0x0007f23f (0x014570)/ (DDR) - OK
      /0x0007f240-0x0007f253 (0x000014)/ (Configuration) - OK
      /0x0007f254-0x0007f297 (0x000044)/ (Jump addresses) - OK
      /0x0007f298-0x0007f33f (0x0000a8)/ (FW Configuration) - OK

    Secondary Image
      /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
      /0x00080028-0x000808af (0x000888)/ (BOOT2) - OK
      /0x000808b0-0x00084feb (0x00473c)/ (BOOT2) - OK
      /0x00084fec-0x00085edb (0x000ef0)/ (Configuration) - OK
      /0x00085edc-0x00085f0f (0x000034)/ (GUID) - OK
      /0x00085f10-0x0008ed63 (0x008e54)/ (DDR) - OK
      /0x0008ed64-0x0009d63b (0x00e8d8)/ (DDR) - OK
      /0x0009d63c-0x000a0573 (0x002f38)/ (DDR) - OK
      /0x000a0574-0x000a204f (0x001adc)/ (DDR) - OK
      /0x000a2050-0x000baccf (0x018c80)/ (DDR) - OK
      /0x000bacd0-0x000cf23f (0x014570)/ (DDR) - OK
      /0x000cf240-0x000cf253 (0x000014)/ (Configuration) - OK
      /0x000cf254-0x000cf297 (0x000044)/ (Jump addresses) - OK
      /0x000cf298-0x000cf33f (0x0000a8)/ (FW Configuration) - OK

FW image verification succeeded. Image is bootable.
Thanks Liang Tziporet Koren wrote: > Liang Zhen wrote: > >> Hi Ole, >> Have you got solution for this? I think we got exactly same problem on >> 4600 with ofed-1.4.1-rc4: >> lspci output: >> 03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe >> 2.0 2.5GT/s] (rev a0) >> >> and error messages from dmesg: >> >> mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008) >> mlx4_core: Initializing 0000:03:00.0 >> mlx4_core 0000:03:00.0: Requested number of MACs is too much for port 1, >> reducing to 1. >> mlx4_core 0000:03:00.0: command 0x13 failed: fw status = 0x1 >> mlx4_core 0000:03:00.0: SW2HW_EQ failed (-5) >> mlx4_core 0000:03:00.0: Failed to initialize event queue table, aborting. >> mlx4_core: probe of 0000:03:00.0 failed with error -5 >> >> >> > Can you send me the FW version and board type > Since the driver is not loading you can use mstflint to get this data > Please use: > > The devices can be accessed by their PCI ID as displayed by lspci > (bus:dev.fn). 
> Example:
> # List all Mellanox devices
>
>> /sbin/lspci -d 15b3:
>>
> 02:00.0 Ethernet controller: Mellanox Technologies Unknown device 6368
> (rev a0)
>
> # Use mstflint tool to query the firmware on this device
>
>> mstflint -d 02:00.0 q
>>
>
> Tziporet
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From sokar6012 at hotmail.com Thu May 21 05:05:13 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Thu, 21 May 2009 12:05:13 +0000
Subject: [ofa-general] Infiniband with Xen
In-Reply-To:
References:
Message-ID:

Hi,

I have also tried to pass through the PCI bridge, but it doesn't work. I got this with dmesg on my Dom0:

[ 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0 (this is the HCA)
[ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1 (this is eth1)
[ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2 (this is the HCA bridge)
[ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2. This may be harmless, but if you have problems with your device:
[ 1298.655747] 1) see permissive attribute in sysfs
[ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.

> From: rdreier at cisco.com
> To: sokar6012 at hotmail.com
> CC: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Infiniband with Xen
> Date: Wed, 20 May 2009 13:55:49 -0700
>
> > I'm currently working on the latest version of Xen with Debian (lenny), and I have done 2 PCI passthroughs. The first one is with eth1 and I got no problem with this one, but the second one is with an InfiniBand adapter (MT23108 Cougar revision A1, latest firmware 3.5) and it's not working.
> > I got this message with dmesg on my DomU:
> >
> > [ 4.111023] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> > [ 4.111047] ib_mthca: Initializing 0000:00:00.0
> > [ 4.111501] ib_mthca 0000:00:00.0: enabling device (0000 -> 0002)
> > [ 4.112859] ib_mthca 0000:00:00.0: No bridge found for 0000:00:00.0
> > [ 15.526745] ib_mthca 0000:00:00.0: PCI device did not come back after reset, aborting.
> > [ 15.526765] ib_mthca 0000:00:00.0: Failed to reset HCA, aborting.
> >
> > Do you know if there is a solution, like previously with XEN smartio or Xen-IB (which is no longer developed), to do High Performance VMM-Bypass I/O in Virtual Machines?
>
> I'm not sure about smartio or Xen-IB, but you could try assigning both
> HCA PCI devices to your domU and see if it works better (the HCA should
> appear in lspci as both a PCI bridge and an actual HCA device, and the
> driver expects to find both)
>
> - R.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ahomike at us.ibm.com Thu May 21 05:42:16 2009 From: ahomike at us.ibm.com (Mike Aho) Date: Thu, 21 May 2009 07:42:16 -0500 Subject: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X In-Reply-To: <4A14B2A0.8000704@opengridcomputing.com> References: <4A14B2A0.8000704@opengridcomputing.com> Message-ID: Steve, Below is the verbose version I got as an example for a Chelsio card on a Dell machine. I REALLY like that an RNIC can show its settings via ibv_devinfo. But an RNIC is different from an IB card and perhaps ibv_devinfo should have an indicator [by port?] that differentiates an RNIC from an IB HCA and do the output differently. I doubt 1.4.1 can change this short term but 1.5 could incorporate a meaningful change. Developing a "readme" to cover it would recover separate maintenance and be challenging to keep updated and RNIC meaningful output would be more useful. It appears that the node_guid and sys_image_guid incorporate the MAC addresses into the format. Are these fields really useful as GUIDs? This also assumes a single port RNIC and I could see a two-port RNIC coming along or an IO card with an IB port and Ethernet (RNIC) port. I could see MAC address(es) being shown in the port subsection based on a port indicator enum of IB HCA, Ethernet, etc. The max_raw_ipv6_qp and max_raw_eth_qp seem superfluous on a plain RNIC but perhaps make sense on a combined RNIC and HCA adapter. So I think it can stay. Under the port subsection, the use of the enum values for mtu size are good for an IB HCA but seem not applicable to an RNIC. Perhaps these should move from an enum to a range of values for RNIC. I can live with some of the other IB artifacts under the port subsection such as pkey and the table lengths but these could be suppressed on an RNIC port indicator in the future. The active width, speed, and physical state need to change to address an RNIC port. 
Mike Aho

hca_id: cxgb3_0
        fw_ver:                         7.0.0
        node_guid:                      0007:4305:6009:0000
        sys_image_guid:                 0007:4305:6009:0000
        vendor_id:                      0x1425
        vendor_part_id:                 48
        hw_ver:                         0x1
        board_id:                       1425.30
        phys_port_cnt:                  1
        max_mr_size:                    0x100000000
        page_size_cap:                  0xffff000
        max_qp:                         32736
        max_qp_wr:                      1023
        device_cap_flags:               0x00228000
        max_sge:                        4
        max_sge_rd:                     1
        max_cq:                         32767
        max_cqe:                        8192
        max_mr:                         32768
        max_pd:                         32767
        max_qp_rd_atom:                 8
        max_ee_rd_atom:                 0
        max_res_rd_atom:                0
        max_qp_init_rd_atom:            8
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_NONE (0)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        0
        max_pkeys:                      0
        local_ca_ack_delay:             0
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        max_msg_sz:             0xffffffff
                        port_cap_flags:         0x009f0000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            1
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           5.0 Gbps (2)
                        phys_state:             invalid physical state (0)

Mike Aho

From: Steve Wise
To: Mike Aho/Rochester/IBM at IBMUS
Cc: general at lists.openfabrics.org
Date: 05/20/2009 08:47 PM
Subject: Re: [ofa-general] ibv_devinfo -v conventions for RNIC in 1.4.1-X

Mike Aho wrote:
> I ran ibv_devinfo -v for 1.4.1-rc4 and got all the information dumped
> for an RNIC (Ethernet RDMA card). Is there doc somewhere to explain
> how the fields are addressed differently than an IB HCA? Is there a
> set of conventions?
>
> Mike Aho

Hey Mike,

Unfortunately, no doc. What fields are not clear?

Steve.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tziporet at dev.mellanox.co.il Thu May 21 06:01:38 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 21 May 2009 16:01:38 +0300 Subject: [ofa-general] Problems using OFED 1.4 on largesmp nodes In-Reply-To: <4A154300.6020100@sun.com> References: <1233654242.1364.39.camel@pyren.uio.no> <4A12378C.8030101@sun.com> <4A12C200.4000708@mellanox.co.il> <4A154300.6020100@sun.com> Message-ID: <4A1550B2.4090506@mellanox.co.il> Liang Zhen wrote: > Tziporet, > > I get two x4600 and think they are same, but on the one failed to > startup when I run mstflint: > mstflint -d 03:00.0 q > Warning: memory access to device 03:00.0 failed: Input/output error. > Warning: Fallback on IO: much slower, and unsafe if device in use. > *** ERROR *** Can not open 03:00.0: Not a directory MFE_CR_ERROR > So you have some HW error. From mails with Ole, I understand his issue was also a HW issue. In his case it was an issue of having both the FC card and the IB card on the same south bridge. Please approach your HW vendor for resolution. Tziporet From hal.rosenstock at gmail.com Thu May 21 06:20:00 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 21 May 2009 09:20:00 -0400 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP In-Reply-To: <4A1541CE.9090509@mellanox.co.il> References: <4A1541CE.9090509@mellanox.co.il> Message-ID: On Thu, May 21, 2009 at 7:58 AM, Amir Vadai wrote: > Sean Hi, > > I need to know how is the LocalAckTimeout for a new QP is calculated (for an > SDP QP). It's a function of the packet life time on the path from source to destination and the local CA ack delay. > Is there a way to change it through a module parameter? > If not, what is the right way to change it? It depends on the SM being used. OpenSM has a way to change the path PLT returned/used. 
-- Hal

> Thanks,
> Amir
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From tziporet at mellanox.co.il Thu May 21 08:33:24 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 21 May 2009 18:33:24 +0300
Subject: [ofa-general] OFED 1.4.1-rc6 is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD020E76C4@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD02A2B828@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02C12251@mtlexch01.mtl.com>

Hi,

OFED-1.4.1-rc6 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc6.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.1

Vladimir & Tziporet

========================================================================

Release information:
------------------------------
Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- OEL 4.5: 2.6.9-55.ELsmp
- OEL 5.2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11 GA: 2.6.27.13-1-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4.1-rc4
==========================
- Fixed all backport issues with NFS/RDMA
- mlx4_en: Updated driver to version 1.4.1 that was released by Mellanox
-
Added an error in case of mlx4 library mismatch with kernel (due to XRC support)
- 7 bugs fixed (see attachment)
- Updated bonding package: ib-bonding-0.9.0-40
- Updated MVAPICH package: 1.1.0-3355
- Updated documentation

Tasks that should be completed for GA (May 27):
====================================
1. Critical bug fixes - see list below
2. Complete documentation update

Open bugs:
========
bug_id  bug_severity  op_sys  assigned_to
1630    cri           RHEL    amirv at mellanox.co.il
        sdp module fails to compile with gcc 3.4 on i386 - we already have a patch but did not want to risk rc6

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc6-fixed-bugs.csv
Type: application/octet-stream
Size: 814 bytes
Desc: ofed-1.4.1-rc6-fixed-bugs.csv
URL:

From arlin.r.davis at intel.com Thu May 21 09:16:34 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Thu, 21 May 2009 09:16:34 -0700
Subject: [ofa-general] Default value of LocalAckTimeout for a new QP
In-Reply-To:
References: <4A1541CE.9090509@mellanox.co.il>
Message-ID:

>> Sean Hi,
>>
>> I need to know how is the LocalAckTimeout for a new QP is
>calculated (for an
>> SDP QP).
>
>It's a function of the packet life time on the path from source to
>destination and the local CA ack delay.
>
>> Is there a way to change it through a module parameter?
>> If not, what is the right way to change it?
>
>It depends on the SM being used. OpenSM has a way to change the path
>PLT returned/used.

You can actually modify the path record (rdma_cm_id.route->path_rec) before the rdma_connect, after the RDMA_CM_EVENT_ROUTE_RESOLVED event, and rdma_cm will pick up the change for the internal QP modify.

-arlin From akepner at sgi.com Thu May 21 12:39:10 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 21 May 2009 12:39:10 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A14E766.1010005@voltaire.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> Message-ID: <20090521193910.GX6837@sgi.com> On Thu, May 21, 2009 at 08:32:22AM +0300, Or Gerlitz wrote: > .... > >Pid: 0, comm: swapper Tainted: G U 2.6.16.54-0.2.5-smp > its very likely that the problem you face in 2.6.16 was fixed by the > commit I pointed on in my previous reply on this thread. > Hmmm, it's not obvious to me that that commit (ecbb416939da77c0d107409976499724baddce7b) would be relevant to the bug that I mentioned earlier. -- Arthur From amirv at mellanox.co.il Thu May 21 12:54:04 2009 From: amirv at mellanox.co.il (Amir Vadai) Date: Thu, 21 May 2009 22:54:04 +0300 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP References: <4A1541CE.9090509@mellanox.co.il> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com> Hal Hi, Assuming I am using OpenSM - how can I tell it to do it? Thanks, Amir ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Thu 21-May-09 4:20 PM To: Amir Vadai Cc: Sean Hefty; Nimrod Gindi; OpenIB Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP On Thu, May 21, 2009 at 7:58 AM, Amir Vadai wrote: > Sean Hi, > > I need to know how is the LocalAckTimeout for a new QP is calculated (for an > SDP QP). It's a function of the packet life time on the path from source to destination and the local CA ack delay. > Is there a way to change it through a module parameter? > If not, what is the right way to change it? It depends on the SM being used. OpenSM has a way to change the path PLT returned/used. 
-- Hal > Thanks, > Amir > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amirv at mellanox.co.il Thu May 21 12:57:00 2009 From: amirv at mellanox.co.il (Amir Vadai) Date: Thu, 21 May 2009 22:57:00 +0300 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP References: <4A1541CE.9090509@mellanox.co.il> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E809@mtlexch01.mtl.com> For that I will need to access structures that are private to CMA. I hoped there is a way to do it in SDP code only or preferably from the environment. - Amir ________________________________ From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] Sent: Thu 21-May-09 7:16 PM To: Hal Rosenstock; Amir Vadai Cc: Nimrod Gindi; OpenIB Subject: RE: [ofa-general] Default value of LocalAckTimeout for a new QP >> Sean Hi, >> >> I need to know how is the LocalAckTimeout for a new QP is >calculated (for an >> SDP QP). > >It's a function of the packet life time on the path from source to >destination and the local CA ack delay. > >> Is there a way to change it through a module parameter? >> If not, what is the right way to change it? > >It depends on the SM being used. OpenSM has a way to change the path >PLT returned/used. You can actually modify the path record (rdma_cm_id.route->path_rec) before the rdma_connect, after the RDMA_CM_EVENT_ROUTE_RESOLVED event, and rdma_cm will pick up the change for the internal QP modify. -arlin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Thu May 21 13:01:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 21 May 2009 16:01:36 -0400 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com> References: <4A1541CE.9090509@mellanox.co.il> <5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com> Message-ID: On Thu, May 21, 2009 at 3:54 PM, Amir Vadai wrote: > Hal Hi, > > Assuming I  am using OpenSM - how can I tell  it to do  it? Assuming this is a non QoS configuration, you need to configure the subnet_timeout. -- Hal > > Thanks, > Amir > ________________________________ > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Thu 21-May-09 4:20 PM > To: Amir Vadai > Cc: Sean Hefty; Nimrod Gindi; OpenIB > Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP > > On Thu, May 21, 2009 at 7:58 AM, Amir Vadai wrote: >> Sean Hi, >> >> I need to know how is the LocalAckTimeout for a new QP is calculated (for >> an >> SDP QP). > > It's a function of the packet life time on the path from source to > destination and the local CA ack delay. > >> Is there a way to change it through a module parameter? >> If not, what is the right way to change it? > > It depends on the SM being used. OpenSM has a way to change the path > PLT returned/used. 
> > -- Hal > >> Thanks, >> Amir >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > From amirv at mellanox.co.il Thu May 21 13:04:32 2009 From: amirv at mellanox.co.il (Amir Vadai) Date: Thu, 21 May 2009 23:04:32 +0300 Subject: [ofa-general] Default value of LocalAckTimeout for a new QP References: <4A1541CE.9090509@mellanox.co.il><5D49E7A8952DC44FB38C38FA0D758EAD61E808@mtlexch01.mtl.com> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD61E80A@mtlexch01.mtl.com> Thanks, Will try it. - Amir ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Thu 21-May-09 11:01 PM To: Amir Vadai Cc: Sean Hefty; Nimrod Gindi; OpenIB Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP On Thu, May 21, 2009 at 3:54 PM, Amir Vadai wrote: > Hal Hi, > > Assuming I am using OpenSM - how can I tell it to do it? Assuming this is a non QoS configuration, you need to configure the subnet_timeout. -- Hal > > Thanks, > Amir > ________________________________ > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Thu 21-May-09 4:20 PM > To: Amir Vadai > Cc: Sean Hefty; Nimrod Gindi; OpenIB > Subject: Re: [ofa-general] Default value of LocalAckTimeout for a new QP > > On Thu, May 21, 2009 at 7:58 AM, Amir Vadai wrote: >> Sean Hi, >> >> I need to know how is the LocalAckTimeout for a new QP is calculated (for >> an >> SDP QP). > > It's a function of the packet life time on the path from source to > destination and the local CA ack delay. > >> Is there a way to change it through a module parameter? >> If not, what is the right way to change it? > > It depends on the SM being used. OpenSM has a way to change the path > PLT returned/used. 
> > -- Hal > >> Thanks, >> Amir >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dave at thedillows.org Thu May 21 13:34:15 2009 From: dave at thedillows.org (David Dillow) Date: Thu, 21 May 2009 16:34:15 -0400 Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI In-Reply-To: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com> References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com> Message-ID: <1242938055.4422.7.camel@obelisk.thedillows.org> On Wed, 2009-05-20 at 21:31 +0500, Zafar Gilani wrote: > This is my second message on the list. This one is exactly same as > first one, my previous did not receive any replies. I will be thankful > if anyone could point out the problem in the code files. Problem is > explained below: You sent several almost identical messages to this list in a very short period of time, with a question that while not exactly "could you do my homework for me?" is not too far removed. You then complain that you did not get a response within 12 hours, 8+ hours of which the US contingent of this list were likely asleep or otherwise away from their computers. The rest of the list was likely trying to solve their own problems at their workplace. This is a mostly volunteer effort; I don't think many people, if any, get paid to monitor this list. While people here are generally interested in helping people out, I think your expectations may need to be adjusted a bit closer to reality. 
From akepner at sgi.com Thu May 21 14:00:49 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 21 May 2009 14:00:49 -0700
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup()
In-Reply-To: <4A13ADDA.5040908@Voltaire.com>
References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com>
Message-ID: <20090521210049.GY6837@sgi.com>

As a recap to this thrilling thread, I'm tracking down a panic
with a backtrace like this:

    ib_ipoib:ipoib_neigh_cleanup+368
    ....
    neigh_periodic_timer+0
    run_timer_softirq+348
    __do_softirq+85
    call_softirq+30
    do_softirq+44
    .....

And the following helpful hint:

Unable to handle kernel paging request at 0000000000100108
                                          ^^^^^^^^^^^^^^^^
                                          LIST_POISON1+0x8

So, we're in ipoib_neigh_cleanup(), doing the list_del():

static void ipoib_neigh_cleanup(struct neighbour *n)
{
	.......
	neigh = *to_ipoib_neigh(n);
	.....
	spin_lock_irqsave(&priv->lock, flags);

	if (neigh->ah)
		ah = neigh->ah;
	list_del(&neigh->list);
	ipoib_neigh_free(n->dev, neigh);

	spin_unlock_irqrestore(&priv->lock, flags);

This has been practically impossible to reproduce (and I
don't even have the original crashdump available any longer).

What would prevent a race between a tx completion (with an
error) and the cleanup of a neighbour? In that case the tx
completion handler could do:

ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
	........
	if (wc->status != IB_WC_SUCCESS &&
	    wc->status != IB_WC_WR_FLUSH_ERR) {
		.....
		spin_lock_irqsave(&priv->lock, flags);
		neigh = tx->neigh;

		if (neigh) {
			neigh->cm = NULL;
			list_del(&neigh->list);
			.......
		spin_unlock_irqrestore(&priv->lock, flags);

While ipoib_neigh_cleanup() could grab the (now stale) neigh, and
crash like above.

(I've tried simulating tx completion failures to trigger this
behavior, but haven't gotten lucky yet....)
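
The LIST_POISON1+0x8 hint above can be checked with a small userspace sketch. It assumes the list_del() of kernels from this era, which calls __list_del(entry->prev, entry->next) and then sets entry->next to LIST_POISON1 (0x00100100) and entry->prev to LIST_POISON2 (0x00200200). The struct below is a userspace copy of the kernel layout, and second_list_del_fault_addr() is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the kernel's doubly linked list node layout. */
struct list_head {
	struct list_head *next, *prev;
};

/* Poison values written by list_del() (assumed: pre-POISON_POINTER_DELTA kernels). */
#define LIST_POISON1 ((uintptr_t)0x00100100)
#define LIST_POISON2 ((uintptr_t)0x00200200)

/*
 * First address written when list_del() runs a second time on the same
 * entry: __list_del() starts with "next->prev = prev", and entry->next
 * already holds LIST_POISON1 from the first deletion.
 */
uintptr_t second_list_del_fault_addr(void)
{
	return LIST_POISON1 + offsetof(struct list_head, prev);
}
```

On a 64-bit build the prev member sits at offset 8, which gives 0x100108, the faulting address in the oops. That is consistent with the same neigh being passed to list_del() twice.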
-- 
Arthur

From rdreier at cisco.com Thu May 21 15:22:48 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 21 May 2009 15:22:48 -0700
Subject: [ofa-general] Re: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts
In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> (Arputham Benjamin's message of "Wed, 20 May 2009 18:54:05 -0500")
References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com>
Message-ID:

 > I wanted to add some clarification.
 >
 > We have two types of IB devices:
 > 1)Devices that can operate as an InfiniBand adapter only
 > 2)Devices that can operate as an InfiniBand adapter or as an Ethernet NIC
 >
 > As per the current implementation of OFED stack, the driver architecture
 > of #2 is very different from #1 because it needs to make sure InfiniBand
 > and Ethernet functions can share the device without interfering with
 > each other.
 >
 > I was thinking that we can fix /proc/interrupts issue for case#1 first
 > and worry about #2 later because the design to fix /proc/interrupts
 > for mlx4 case is going to be different and independent just as the
 > driver design is different and independent for the two cases today.

I disagree. A verbs consumer of mlx4 doesn't have to worry about the
internal design of the driver being different from mthca, and I would
hope that carries over to identifying interrupts. It's much better for
users if we can just come up with a solution that handles both of your
cases at once, rather than an ad hoc solution for a subset of drivers.

 > Please let me know if you still think we need a common solution for
 > both cases mentioned above. Any suggestions at a high level for such
 > a common solution?

I already suggested adding MSI-X vector information to
/sys/devices/... to match the existing "irq" file there.
That would allow userspace to figure out which interrupt belonged where. Jason's idea of adding the PCI device name to the interrupt name seems viable to me as well. - R. From rdreier at cisco.com Thu May 21 15:33:19 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 21 May 2009 15:33:19 -0700 Subject: [ofa-general] Infiniband with Xen In-Reply-To: (anthony garnier's message of "Thu, 21 May 2009 12:05:13 +0000") References: Message-ID: > I have tried to also to passthrough the pci bridge but It doesn't work, I got that with de dmesg on my Dom0 : > > 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0 ( this is HCA) > [ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1 ( This is eth1) > [ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2 ( this is the HCA bridge) > [ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2. This may be harmless, but if you have problems with your device: > [ 1298.655747] 1) see permissive attribute in sysfs > [ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. Yes, the driver does set the PCI bridge back up after resetting the HCA (to restore all the configuration values that are lost during reset). So you need to set up Xen so that the driver is allowed to restore the HCA and PCI bridge config space. Also it's not clear to me from these messages whether the HCA is put into the domU PCI topology as being under the PCI bridge -- that is probably required for the driver to work. - R. 
From abenjamin at sgi.com Thu May 21 16:23:17 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Thu, 21 May 2009 18:23:17 -0500 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> Message-ID: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> > I disagree. A verbs consumer of mlx4 doesn't have to worry about the > internal design of the driver being different from mthca, and I would > hope that carries over to indentifying interrupts. It's much better for > users if we can just come up with a solution that handles both of your > cases at once, rather than an ad hoc solution for a subset of drivers. I was not suggesting that we change the interface to verbs consumer/user or how we present the interrupt info. to user between mlx4 and mthca. I agree that it's much better if we can just come up with a solution that handles both. Any plan to merge the functionality of ib_core and mlx4_core into something like 'ofa_core' that will control resource allocation for both Infiniband and Ethernet functions? A single core will help in any similar resource issues. > I already suggested adding MSI-X vector information to > /sys/devices/... to match the existing "irq" file there. That would > allow userspace to figure out which interrupt belonged where. Jason's > idea of adding the PCI device name to the interrupt name seems viable to > me as well. > - R. Don't we need both /sys/devices/... and /proc/interrupts? Regards, Benjamin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Thu May 21 17:05:05 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 21 May 2009 17:05:05 -0700 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> (Arputham Benjamin's message of "Thu, 21 May 2009 18:23:17 -0500") References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> Message-ID: > Any plan to merge the functionality of ib_core and mlx4_core into something > like 'ofa_core' that will control resource allocation for both Infiniband > and Ethernet functions? A single core will help in any similar resource issues. No, since they do pretty different things. > Don't we need both /sys/devices/... and /proc/interrupts? Not sure what you mean. If we put msi-x info under /sys, then you can figure out which interrupts belong to a given HCA by following the device link from /sys/class/infiniband. Similarly if /proc/interrupts gives the PCI device, then you have the same ability. So either way works as far as I can tell. 
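
The /proc/interrupts variant discussed above (the PCI device name appearing in the interrupt name) could be consumed from userspace roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and it only assumes that each per-IRQ line starts with the IRQ number followed by a colon and that the interrupt name on that line contains the PCI address. The exact name format was still under discussion in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the IRQ number of a /proc/interrupts line that mentions `tag`
 * (for example a PCI address), or -1 if the line does not mention it or
 * does not start with a numeric IRQ (e.g. the "NMI:" summary lines).
 */
int irq_if_tagged(const char *line, const char *tag)
{
	char *end;
	long irq;

	if (!strstr(line, tag))
		return -1;
	irq = strtol(line, &end, 10);	/* skips the leading padding */
	if (end == line || *end != ':')
		return -1;
	return (int)irq;
}

/* Print every IRQ whose line mentions `tag`; returns the match count or -1. */
int print_irqs_for(const char *tag)
{
	char line[4096];
	int matches = 0;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		int irq = irq_if_tagged(line, tag);

		if (irq >= 0) {
			printf("%d\n", irq);
			matches++;
		}
	}
	fclose(f);
	return matches;
}
```

Calling print_irqs_for("0000:04:00.0") would list the vectors of that HCA if the driver put the PCI address into its interrupt names. With the sysfs approach the same information would instead be read from files under the device directory.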
From zafargilani at gmail.com Thu May 21 21:36:00 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Fri, 22 May 2009 09:36:00 +0500 Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI In-Reply-To: <1242938055.4422.7.camel@obelisk.thedillows.org> References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com> <1242938055.4422.7.camel@obelisk.thedillows.org> Message-ID: <7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com> > You sent several almost identical messages to this list in a very short > period of time, First of all I apologize for sending more than one message, but the reason is that I did not see any list of discussions on the lists.openfabrics.org, just a digest sent daily, so I wasn't sure whether anybody will have a look at it or not. Apart from that I think you have misunderstood my email. When I said "This is my second message on the list. This one is exactly same as first one, my previous did not receive any replies.", it did not imply that you people are supposed to reply, it was to ensure that if anybody has previously seen my first message, should not waste his/her time at this. Another reason was to see my message in the daily digest general, it is easy to miss a message in such a list. It would be my mistake if somebody could have possibly helped me but did not see my problem. > with a question that while not exactly "could you do my > homework for me?" is not too far removed. Secondly I am not asking you or anybody to do my home work. I have explained in the email what I have already tried, and requested for help such that somebody could give me any ideas or pointers in the right direction, since there must be highly experienced people on the list. Reason for giving the code was for better understanding of the context of the problem for someone who is willing to give his 10-15 minutes in this regard. 
> You then complain that you did > not get a response within 12 hours, 8+ hours of which the US contingent > of this list were likely asleep or otherwise away from their computers. Could you point out the part in my email where I am complaining explicitly or implicitly? Secondly is everyone from the US on this list? Last time I remember internet did not have territorial jurisdiction. Regards, Zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: From jgunthorpe at obsidianresearch.com Thu May 21 21:46:53 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 21 May 2009 22:46:53 -0600 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> Message-ID: <20090522044653.GF11690@obsidianresearch.com> On Thu, May 21, 2009 at 06:23:17PM -0500, Arputham Benjamin wrote: > > I already suggested adding MSI-X vector information to > > /sys/devices/... to match the existing "irq" file there. That would > > allow userspace to figure out which interrupt belonged where. Jason's > > idea of adding the PCI device name to the interrupt name seems viable to > > me as well. > Don't we need both /sys/devices/... and /proc/interrupts? You don't need the device name in proc/interrupts, that is just for easy use of cat.. FWIW, this problem is not really an IB problem, but more a Linux problem, there should have a better interface for matching MSI vectors to the PCI device to the counters. Fixing up the MSI vector routines in PCI core to note the vector numbers in sysfs would help everyone. 
Jason From dave at thedillows.org Thu May 21 23:05:50 2009 From: dave at thedillows.org (David Dillow) Date: Fri, 22 May 2009 02:05:50 -0400 Subject: [ofa-general] Problem executing IB Verbs/RDMA code via JNI In-Reply-To: <7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com> References: <7d4423d30905200931r7e6a49aas1254ec644ba028c1@mail.gmail.com> <1242938055.4422.7.camel@obelisk.thedillows.org> <7d4423d30905212136w27c7db0ev61736ba267f9eb78@mail.gmail.com> Message-ID: <1242972350.4422.38.camel@obelisk.thedillows.org> On Fri, 2009-05-22 at 09:36 +0500, Zafar Gilani wrote: > First of all I apologize for sending more than one message, but the > reason is that I did not see any list of discussions on the > lists.openfabrics.org, just a digest sent daily, so I wasn't sure > whether anybody will have a look at it or not. Apart from that I think > you have misunderstood my email. Nice backpedaling, but 1) The list is archived at lists.openfabrics.org, and it is hard to miss the link. 2) The archive has a copy of your message to me, within 90 minutes of you sending it. 3) This is not a high-volume list, and "my message might be missed" really doesn't hold water. > > You then complain that you did > > not get a response within 12 hours, 8+ hours of which the US > contingent > > of this list were likely asleep or otherwise away from their > computers. > > Could you point out the part in my email where I am complaining > explicitly or implicitly? "No one answered my email" in 12 hours is generally considered a whine. > Secondly is everyone from the US on this list? Last time I remember > internet did not have territorial jurisdiction. No, only a very small percentage of the US population is on this list. I'd be mildly surprised if less than half of the list is based in the same timezones as the US, though. I pointed out that good part of the list membership was asleep, and the reset were working on their day jobs. You seem to have cut that part. 
In any event, if the code works natively and not inside the JVM, I would suggest investigating what is different in the broken environment. You seem to have forgotten the error log in the tarball, nor did you state which side crashes -- client, server, or both. You also don't specify which OFED version you are running. This makes it hard for the people that actually want to help you. From vlad at lists.openfabrics.org Fri May 22 03:22:20 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 22 May 2009 03:22:20 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090522-0200 daily build status Message-ID: <20090522102220.BA12BE613EC@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed 
on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From yossi.openib at gmail.com Fri May 22 03:24:17 2009 From: yossi.openib at gmail.com (Yossi Etigin) Date: Fri, 22 May 2009 13:24:17 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090521193910.GX6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> Message-ID: <4A167D51.1060607@gmail.com> akepner at sgi.com wrote: > On Thu, May 21, 2009 at 08:32:22AM +0300, Or Gerlitz wrote: >> .... >>> Pid: 0, comm: swapper Tainted: G U 2.6.16.54-0.2.5-smp >> its very likely that the problem you face in 2.6.16 was fixed by the >> commit I pointed on in my previous reply on this thread. >> > > Hmmm, it's not obvious to me that that commit > (ecbb416939da77c0d107409976499724baddce7b) would be relevant > to the bug that I mentioned earlier. > So, ipoib tries to list_del(neigh) twice because the second time the condition (neigh != NULL) is not protected with a lock. 
How about this:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index ab2c192..993b5a7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
 
+	spin_lock_irqsave(&priv->lock, flags);
+
 	neigh = *to_ipoib_neigh(n);
 	if (neigh)
 		priv = netdev_priv(neigh->dev);
 	else
-		return;
+		goto out;
 	ipoib_dbg(priv,
 		  "neigh_cleanup for %06x %pI6\n",
 		  IPOIB_QPN(n->ha),
 		  n->ha + 4);
 
-	spin_lock_irqsave(&priv->lock, flags);
-
 	if (neigh->ah)
 		ah = neigh->ah;
 	list_del(&neigh->list);
 	ipoib_neigh_free(n->dev, neigh);
 
+out:
 	spin_unlock_irqrestore(&priv->lock, flags);
 
 	if (ah)

From sokar6012 at hotmail.com Fri May 22 04:34:29 2009
From: sokar6012 at hotmail.com (anthony garnier)
Date: Fri, 22 May 2009 11:34:29 +0000
Subject: [ofa-general] Infiniband with Xen
In-Reply-To:
References:
Message-ID:

Hi,

You told me " you need to set up Xen so that the driver is allowed to restore the HCA and PCI bridge config space."
But how can I set up xen to allow the driver to restore the HCA and pci config space?

> From: rdreier at cisco.com
> To: sokar6012 at hotmail.com
> CC: general at lists.openfabrics.org
> Subject: Re: [ofa-general] Infiniband with Xen
> Date: Thu, 21 May 2009 15:33:19 -0700
>
> > I have tried to also to passthrough the pci bridge but It doesn't work, I got that with de dmesg on my Dom0 :
> >
> > 1297.293367] pciback: vpci: 0000:04:00.0: assign to virtual slot 0 ( this is HCA)
> > [ 1297.295538] pciback: vpci: 0000:03:07.1: assign to virtual slot 1 ( This is eth1)
> > [ 1297.298346] pciback: vpci: 0000:03:08.0: assign to virtual slot 2 ( this is the HCA bridge)
> > [ 1298.655743] pciback 0000:03:08.0: Driver tried to write to a read-only configuration space field at offset 0x3e, size 2.
This may be harmless, but if you have problems with your device: > > [ 1298.655747] 1) see permissive attribute in sysfs > > [ 1298.655749] 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. > > Yes, the driver does set the PCI bridge back up after resetting the HCA > (to restore all the configuration values that are lost during reset). > So you need to set up Xen so that the driver is allowed to restore the > HCA and PCI bridge config space. > > Also it's not clear to me from these messages whether the HCA is put > into the domU PCI topology as being under the PCI bridge -- that is > probably required for the driver to work. > > - R. _________________________________________________________________ Vous voulez savoir ce que vous pouvez faire avec le nouveau Windows Live ? Lancez-vous ! http://www.microsoft.com/windows/windowslive/default.aspx -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at comcast.net Fri May 22 04:42:34 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 22 May 2009 07:42:34 -0400 Subject: [ofa-general] [PATCH 1/2] [TRIVIAL] opensm/osm_ucast_lash.c: Fix commentary typo Message-ID: <20090522114234.GB29953@comcast.net> Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index fa8e7e9..e034d6f 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -94,7 +94,7 @@ static void connect_switches(lash_t * p_lash, int sw1, int sw2, int phy_port_1) if (sw1 == sw2) return; - /* see if we are alredy linked to sw2 */ + /* see if we are already linked to sw2 */ for (i = 0; i < num; i++) { l = node->links[i]; From hnrose at comcast.net Fri May 22 04:41:10 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 22 May 2009 07:41:10 -0400 Subject: [ofa-general] [PATCH] opensm/osm_mesh.c: Use define rather than hard coded constant 
Message-ID: <20090522114110.GA29953@comcast.net> Add LARGE define and use it Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_mesh.c b/opensm/opensm/osm_mesh.c index 263d29e..1867876 100644 --- a/opensm/opensm/osm_mesh.c +++ b/opensm/opensm/osm_mesh.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2008 System Fabric Works, Inc. + * Copyright (c) 2008,2009 System Fabric Works, Inc. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -50,6 +50,7 @@ #define MAX_DEGREE (8) #define MAX_DIMENSION (8) +#define LARGE (0x7fffffff) /* * characteristic polynomials for selected 1d through 8d tori @@ -594,7 +595,7 @@ static int get_switch_metric(lash_t *p_lash, int sw) /* make all distances big except s1 to itself */ for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) - p_lash->switches[sw2]->node->temp = 0x7fffffff; + p_lash->switches[sw2]->node->temp = LARGE; s1->node->temp = 0; @@ -603,7 +604,7 @@ static int get_switch_metric(lash_t *p_lash, int sw) for (sw2 = 0; sw2 < p_lash->num_switches; sw2++) { s2 = p_lash->switches[sw2]; - if (s2->node->temp == 0x7fffffff) + if (s2->node->temp == LARGE) continue; for (j = 0; j < s2->node->num_links; j++) { sw3 = s2->node->links[j]->switch_id; @@ -1120,7 +1121,7 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed) s->node->coord = calloc(dimension, sizeof(int)); for (i = 0; i < dimension; i++) - s->node->coord[i] = (sw == seed)? 0 : 0x7fffffff; + s->node->coord[i] = (sw == seed) ? 
0 : LARGE; for (i = 0; i < s->node->num_links; i++) if (s->node->axes[i] == 0) @@ -1137,7 +1138,7 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed) for (sw = 0; sw < num_switches; sw++) { s = p_lash->switches[sw]; - if (s->node->coord[0] == 0x7fffffff) + if (s->node->coord[0] == LARGE) continue; for (j = 0; j < s->node->num_links; j++) { @@ -1172,15 +1173,15 @@ static int measure_geometry(lash_t *p_lash, mesh_t *mesh, int seed) mesh->size = calloc(dimension, sizeof(int)); for (i = 0; i < dimension; i++) { - max[i] = -0x7fffffff; - min[i] = 0x7fffffff; + max[i] = -LARGE; + min[i] = LARGE; } for (sw = 0; sw < num_switches; sw++) { s = p_lash->switches[sw]; for (i = 0; i < dimension; i++) { - if (s->node->coord[i] == 0x7fffffff) + if (s->node->coord[i] == LARGE) continue; if (s->node->coord[i] > max[i]) max[i] = s->node->coord[i]; From hnrose at comcast.net Fri May 22 04:43:46 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 22 May 2009 07:43:46 -0400 Subject: [ofa-general] [PATCH 2/2] opensm/osm_ucast_lash.c: Use calloc rather than malloc/memset Message-ID: <20090522114346.GC29953@comcast.net> Signed-off-by: Robert Pearson Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index e034d6f..a987eb3 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -62,9 +62,9 @@ typedef struct _reachable_dest { static cdg_vertex_t *create_cdg_vertex(unsigned num_switches) { - cdg_vertex_t *v = malloc(sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); + cdg_vertex_t *v; - memset(v, 0, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); + v = calloc(1, sizeof(*v) + (num_switches - 1) * sizeof(v->deps[0])); return v; } @@ -838,13 +838,12 @@ static int lash_core(lash_t * p_lash) } } - switch_bitmap = malloc(num_switches * num_switches * sizeof(int)); + switch_bitmap = calloc(num_switches * num_switches, sizeof(int)); if (!switch_bitmap) { OSM_LOG(p_log, 
OSM_LOG_ERROR, "ERR 4D04: " "Failed allocating switch_bitmap - out of memory\n"); goto Exit; } - memset(switch_bitmap, 0, num_switches * num_switches * sizeof(int)); for (i = 0; i < num_switches; i++) { for (dest_switch = 0; dest_switch < num_switches; dest_switch++) @@ -1145,10 +1144,9 @@ static int discover_network_properties(lash_t * p_lash) p_lash->num_switches = cl_qmap_count(&p_subn->sw_guid_tbl); - p_lash->switches = malloc(p_lash->num_switches * sizeof(switch_t *)); + p_lash->switches = calloc(p_lash->num_switches, sizeof(switch_t *)); if (!p_lash->switches) return -1; - memset(p_lash->switches, 0, p_lash->num_switches * sizeof(switch_t *)); vl_min = 5; /* set to a high value */ @@ -1251,11 +1249,10 @@ static lash_t *lash_create(osm_opensm_t * p_osm) { lash_t *p_lash; - p_lash = malloc(sizeof(lash_t)); + p_lash = calloc(1, sizeof(lash_t)); if (!p_lash) return NULL; - memset(p_lash, 0, sizeof(lash_t)); p_lash->p_osm = p_osm; return (p_lash); From zafargilani at gmail.com Fri May 22 08:04:01 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Fri, 22 May 2009 20:04:01 +0500 Subject: [ofa-general] Problem in client/server code with JNI Message-ID: <7d4423d30905220804p596a139aq351d8fc956aa32c4@mail.gmail.com> I am using IB Verbs and RDMA CM to implement a communication device over InfiniBand fabric. I have executed client/server code (most part from Roland Dreier, CISCO) and it works absolutely fine. The server listens for requests, client sends two integers to the server and server returns their sum. When I try to call the same thing via JNI, the code gets stuck at method "ibv_alloc_pd()" (line 170) in the client code (nativeclient.c). I have checked the rdma_cm_id "cm_id", the ibv_context "cm_id->verbs" and the protection domain "ibv_pd" but I am unable to resolve the error. JVM actually crashes stating that error exists at frame: "ibv_alloc_pd" and that crash happened in native code. 
This is understandable, but the error is not, since the same code works when executed directly with a C compiler but gives trouble with JNI. Compilers: java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) Environment: Red Hat 4.12 2 Intel(R) Xeon(TM) 5060 CPU dual-core hyperthreading 3.20GHz OFED version 1.4 Compressed file (jni.tar) that contains all the files (.java, .h, .c and .log) is available at [http://hpc.niit.edu.pk/~zafar/work/ib/jni.tar] for better understanding. I was hoping that someone could maybe give me some pointers/suggestions in the right direction. Any help will be greatly appreciated. P.S.: Structure of the code: nativeclient.c [native code for client] nativeserver.c [native code for server] RdmaOpsServer.java [Server code calling native server code] RdmaOpsClient.java [Client code calling native client code] Thanks, -- Syed Zafar ul Hussan Gilani | BIT-7 Research Student | CHPSC MSP 2008-09 NUST SEECS | http://hpc.niit.edu.pk/~zafar From akepner at sgi.com Fri May 22 08:52:08 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 22 May 2009 08:52:08 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A167D51.1060607@gmail.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> Message-ID: <20090522155208.GB6837@sgi.com> On Fri, May 22, 2009 at 01:24:17PM +0300, Yossi Etigin wrote: > ... > So, ipoib tries to list_del(neigh) twice because the second time > the condition (neigh != NULL) is not protected with a lock. 
> How about this: > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index ab2c192..993b5a7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n) > unsigned long flags; > struct ipoib_ah *ah = NULL; > > + spin_lock_irqsave(&priv->lock, flags); <-- deadlock here > + > neigh = *to_ipoib_neigh(n); > if (neigh) > priv = netdev_priv(neigh->dev); > else > - return; > + goto out; > ipoib_dbg(priv, > "neigh_cleanup for %06x %pI6\n", > IPOIB_QPN(n->ha), > n->ha + 4); > > - spin_lock_irqsave(&priv->lock, flags); > - > if (neigh->ah) > ah = neigh->ah; > list_del(&neigh->list); > ipoib_neigh_free(n->dev, neigh); > > +out: > spin_unlock_irqrestore(&priv->lock, flags); > > if (ah) > This is essentially what I did first time around, but a deadlock on the line marked above was quickly found. Instead what we've been doing is: --- e/ofa_kernel-1.3.1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-06-06 12:04:20.791744390 -0700 +++ f/ofa_kernel-1.3.1/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-06-06 12:10:14.129143660 -0700 @@ -835,11 +835,14 @@ static void ipoib_neigh_cleanup(struct n IPOIB_GID_RAW_ARG(n->ha + 4)); spin_lock_irqsave(&priv->lock, flags); - - if (neigh->ah) - ah = neigh->ah; - list_del(&neigh->list); - ipoib_neigh_free(n->dev, neigh); + + neigh = *to_ipoib_neigh(n); + if (neigh) { + if (neigh->ah) + ah = neigh->ah; + list_del(&neigh->list); + ipoib_neigh_free(n->dev, neigh); + } spin_unlock_irqrestore(&priv->lock, flags); This has worked in practice, but it obviously leaves a small hole open. 
-- Arthur From rdreier at cisco.com Fri May 22 10:25:21 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 22 May 2009 10:25:21 -0700 Subject: [ofa-general] Infiniband with Xen In-Reply-To: (anthony garnier's message of "Fri, 22 May 2009 11:34:29 +0000") References: Message-ID: > You told me " you need to set up Xen so that the driver is allowed to restore the HCA and PCI bridge config space." > But how can I set up xen to allow the driver to restore the HCA and pci config space? No idea -- I've never used xen pci-passthrough. But this line in your log might be a clue: > > [ 1298.655747] 1) see permissive attribute in sysfs From yossi.openib at gmail.com Fri May 22 10:34:57 2009 From: yossi.openib at gmail.com (Yossi Etigin) Date: Fri, 22 May 2009 20:34:57 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090522155208.GB6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> Message-ID: <4A16E241.1050604@gmail.com> akepner at sgi.com wrote: > On Fri, May 22, 2009 at 01:24:17PM +0300, Yossi Etigin wrote: >> ... >> So, ipoib tries to list_del(neigh) twice because the second time >> the condition (neigh != NULL) is not protected with a lock. 
>> How about this: >> >> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c >> index ab2c192..993b5a7 100644 >> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c >> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c >> @@ -845,23 +845,24 @@ static void ipoib_neigh_cleanup(struct neighbour *n) >> unsigned long flags; >> struct ipoib_ah *ah = NULL; >> >> + spin_lock_irqsave(&priv->lock, flags); <-- deadlock here >> + >> neigh = *to_ipoib_neigh(n); >> if (neigh) >> priv = netdev_priv(neigh->dev); >> else >> - return; >> + goto out; >> ipoib_dbg(priv, >> "neigh_cleanup for %06x %pI6\n", >> IPOIB_QPN(n->ha), >> n->ha + 4); >> >> - spin_lock_irqsave(&priv->lock, flags); >> - >> if (neigh->ah) >> ah = neigh->ah; >> list_del(&neigh->list); >> ipoib_neigh_free(n->dev, neigh); >> >> +out: >> spin_unlock_irqrestore(&priv->lock, flags); >> >> if (ah) >> > > This is essentially what I did first time around, but a deadlock on > the line marked above was quickly found. > Interesting... what does it deadlock with? And what is the hole your fix leaves? If the (neigh!=NULL) check passes with the spinlock held, shouldn't it be OK to list_del() it? --Yossi From valdes at anl.gov Fri May 22 11:40:06 2009 From: valdes at anl.gov (John Valdes) Date: Fri, 22 May 2009 13:40:06 -0500 Subject: [ofa-general] SRP on RHEL 5.3/OFED 1.3 vs RHEL 5.1/OFED 1.2? Message-ID: <20090522184006.GE26282@starfish.mcs.anl.gov> Hi all, We have a storage array (a DDN 9550) attached to 8 servers via IB. This setup has been running fine for the last 1.5 years or so, with the servers running RHEL 5.1 and the OFED (OpenIB) 1.2 stack that's included with RHEL 5.1. Recently, we tried to upgrade to new servers running RHEL 5.3 with its bundled OFED 1.3 stack, but now we're seeing frequent timeouts resulting in LUN resets and SCSI command aborts between the servers and the DDN. 
As far as we can tell, our IB setup on the servers under 5.3 is identical to the setup under 5.1, so we don't know why we're seeing the timeouts and resets. Is anyone aware of any changes when using IB SRP w/ RHEL 5.3 and OFED 1.3 vs RHEL 5.1/OFED 1.2 which might be causing this?

For reference, here are some of the details of our setup:

OLD CONFIGURATION
-----------------
* SuperMicro P4DP6 motherboard, w/ dual Xeon CPUs (x86, single core "Prestonia"), all circa 2002 hardware
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X, 128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN)
* RHEL 5.1 w/ bundled OFED/OpenIB 1.2
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as: "srp_daemon -coe" with options "max_sect=8192,max_cmd_per_lun=5" given in /etc/srp_daemon.conf (Note that due to a bug in the OFED 1.2 srp_daemon, the "max_sect=8192" option is ignored, which is OK since we weren't taking advantage of that option).
* 7 DDN LUNs are accessed by all 8 servers as clustered logical volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN per server) as a plain disk holding an ext3 filesystem.

NEW CONFIGURATION
-----------------
* SuperMicro H8DME-2 motherboard, w/ dual quad-core AMD Opteron 2342, x86_64
* Cisco SFS-HCA-X2T7-A1 IB HCA (aka Mellanox Cougar Cub), 133 MHz PCI-X, 128 MB memory, Firmware v3.5.917, dual port (port 1 attached to DDN) --same card as in old configuration, physically moved to new servers
* RHEL 5.3 w/ bundled OFED/OpenIB 1.3
* ib_mthca module loaded w/o any extra options
* ib_srp module loaded w/ option "srp_sg_tablesize=255"
* Connection to DDN established using "srp_daemon" invoked as: "srp_daemon -coe -f /etc/ofed/srp_daemon.conf" with options "max_sect=8192,max_cmd_per_lun=5" srp_daemon.conf
* 7 DDN LUNs are accessed by all 8 servers as clustered logical volumes (under RedHat's CLVM) holding GFS filesystems.
* 8 unique (not-shared) DDN LUNs are accessed by the servers (one LUN per server) as a plain disk holding an ext3 filesystem.

With the new configuration, timeouts/resets have frequently occurred when starting up CLVM on the servers (eg, when the servers scan the LUNs looking for the Linux (clustered) LVM data) as well as when doing I/O to the mounted filesystems. Just to make sure the CLVM/GFS setup wasn't causing problems, we tested the plain ext3 filesystem on the non-shared LUN from one of the new servers, and when doing a simple "dd" to the LUN, we were still seeing timeouts and LUN resets.

Does any of this sound familiar to anyone? Do you have a recommended IB/SRP setup for RHEL 5.3?
John ---------------------------------------------------------------------- John Valdes Mathematics and Computer Science Division valdes at anl.gov Argonne National Laboratory From akepner at sgi.com Fri May 22 11:44:03 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 22 May 2009 11:44:03 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A16E241.1050604@gmail.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com> Message-ID: <20090522184403.GF6837@sgi.com> On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote: > ... > Interesting... what does it deadlock with? > And what is the hole your fix leaves? If the (neigh!=NULL) check passes > with the spinlock held, shouldn't it be OK to list_del() it? > Unfortunately, I don't have enough information to answer that question any longer (it's an old, closed bug). But the crash dump showed a hang like this: PID: 8643 TASK: ffff810130f060c0 CPU: 3 COMMAND: "sshd" ....... #3 [ffff81013b3d7ea0] .text.lock.spinlock at ffffffff802ea2df (via _spin_lock_irqsave) #4 [ffff81013b3d7ea0] ipoib_neigh_cleanup at ffffffff883f8972 #5 [ffff81013b3d7ed0] neigh_destroy at ffffffff8029011c crash> dis ipoib_neigh_cleanup 0xffffffff883f8952 : push %r13 0xffffffff883f8954 : push %r12 0xffffffff883f8956 : push %rbp 0xffffffff883f8957 : mov %rdi,%rbp 0xffffffff883f895a : push %rbx 0xffffffff883f895b : sub $0x8,%rsp 0xffffffff883f895f : mov 0x18(%rdi),%rax 0xffffffff883f8963 : lea 0x500(%rax),%r12 0xffffffff883f896a : mov %r12,%rdi 0xffffffff883f896d : callq 0xffffffff802ea1a5 <_spin_lock_irqsave> 0xffffffff883f8972 : mov %rax,%rsi ..... 
and here is the patch to ipoib_neigh_cleanup() that was in use: diff -rup c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c --- c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 13:25:23.131563415 -0700 +++ d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 15:24:31.475721847 -0700 @@ -821,11 +821,15 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + spin_lock_irqsave(&priv->lock, flags); neigh = *to_ipoib_neigh(n); + spin_unlock_irqrestore(&priv->lock, flags); if (neigh) { - priv = netdev_priv(neigh->dev); - ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", - n->dev->name); + if (priv != netdev_priv(neigh->dev)) { + ipoib_dbg(priv, "neigh_destructor for bonding device: " + "%s\n", n->dev->name); + priv = netdev_priv(neigh->dev); + } } else return; ipoib_dbg(priv, @@ -835,10 +839,13 @@ static void ipoib_neigh_cleanup(struct n spin_lock_irqsave(&priv->lock, flags); - if (neigh->ah) - ah = neigh->ah; - list_del(&neigh->list); - ipoib_neigh_free(n->dev, neigh); + neigh = *to_ipoib_neigh(n); + if (neigh) { + if (neigh->ah) + ah = neigh->ah; + list_del(&neigh->list); + ipoib_neigh_free(n->dev, neigh); + } spin_unlock_irqrestore(&priv->lock, flags); I'll see if I can reproduce the deadlock (with a new kernel). 
-- Arthur From akepner at sgi.com Fri May 22 13:44:45 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 22 May 2009 13:44:45 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A16E241.1050604@gmail.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com> Message-ID: <20090522204445.GG6837@sgi.com> On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote: > ... > Interesting... what does it deadlock with? (My previous mail was addressing only the question above. I overlooked what follows.) > And what is the hole your fix leaves? Well, in this small window: static void ipoib_neigh_cleanup(struct neighbour *n) { struct ipoib_neigh *neigh; struct ipoib_dev_priv *priv = netdev_priv(n->dev); unsigned long flags; struct ipoib_ah *ah = NULL; neigh = *to_ipoib_neigh(n); <------- from here if (neigh) priv = netdev_priv(neigh->dev); else return; ipoib_dbg(priv, "neigh_cleanup for %06x %pI6\n", IPOIB_QPN(n->ha), n->ha + 4); <------------ to here spin_lock_irqsave(&priv->lock, flags); we could be using a no-longer-valid neigh. > If the (neigh!=NULL) check passes > with the spinlock held, shouldn't it be OK to list_del() it? Yeah, that should be OK. 
-- Arthur From yossi.openib at gmail.com Fri May 22 14:13:11 2009 From: yossi.openib at gmail.com (Yossi Etigin) Date: Sat, 23 May 2009 00:13:11 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090522184403.GF6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com> <20090522184403.GF6837@sgi.com> Message-ID: <4A171567.7060001@gmail.com> akepner at sgi.com wrote: > diff -rup c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c > --- c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 13:25:23.131563415 -0700 > +++ d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 15:24:31.475721847 -0700 > @@ -821,11 +821,15 @@ static void ipoib_neigh_cleanup(struct n > unsigned long flags; > struct ipoib_ah *ah = NULL; > > + spin_lock_irqsave(&priv->lock, flags); > neigh = *to_ipoib_neigh(n); > + spin_unlock_irqrestore(&priv->lock, flags); > if (neigh) { > - priv = netdev_priv(neigh->dev); > - ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", > - n->dev->name); > + if (priv != netdev_priv(neigh->dev)) { > + ipoib_dbg(priv, "neigh_destructor for bonding device: " > + "%s\n", n->dev->name); > + priv = netdev_priv(neigh->dev); > + } > } else > return; > ipoib_dbg(priv, Now I see that the patch that caused the deadlock is a little more than moving spin_lock_irqsave() a few lines up in the code. The code above looks a little suspicious. The spin_lock_irqsave() above looks redundant - someone could kfree the neigh after you release the lock and you get a corrupted `priv'. 
Besides, I see that in the 1.3.1 code there is a test 'if (n->dev->type != ARPHRD_INFINIBAND)', check this out: http://www.mail-archive.com/general at lists.openfabrics.org/msg00839.html --Yossi From yossi.openib at gmail.com Fri May 22 14:16:24 2009 From: yossi.openib at gmail.com (Yossi Etigin) Date: Sat, 23 May 2009 00:16:24 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090522204445.GG6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com> <20090522204445.GG6837@sgi.com> Message-ID: <4A171628.8090307@gmail.com> akepner at sgi.com wrote: > On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote: >> ... >> Interesting... what does it deadlock with? > > (My previous mail was addressing only the question above. I > overlooked what follows.) > >> And what is the hole your fix leaves? > > Well, in this small window: > > static void ipoib_neigh_cleanup(struct neighbour *n) > { > struct ipoib_neigh *neigh; > struct ipoib_dev_priv *priv = netdev_priv(n->dev); > unsigned long flags; > struct ipoib_ah *ah = NULL; > > neigh = *to_ipoib_neigh(n); <------- from here > if (neigh) > priv = netdev_priv(neigh->dev); > else > return; > ipoib_dbg(priv, > "neigh_cleanup for %06x %pI6\n", > IPOIB_QPN(n->ha), > n->ha + 4); <------------ to here > spin_lock_irqsave(&priv->lock, flags); > > > we could be using a no-longer-valid neigh. > >> If the (neigh!=NULL) check passes >> with the spinlock held, shouldn't it be OK to list_del() it? > > Yeah, that should be OK. > So it's a catch.. You can't take priv out of neigh without being protected by the spinlock (or someone will kfree the neigh), but you need priv to get the spinlock in the first place.. 
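The "re-read the pointer while holding the lock" idea from this thread can be sketched in user space. This is only an illustration under stated assumptions, not the IPoIB code and not a proposed fix: fake_priv, fake_slot, fake_neigh, slot_cleanup and demo_double_cleanup are hypothetical names, a pthread mutex stands in for priv->lock, and the lock owner is assumed to be reachable without dereferencing neigh. That assumption is exactly what fails in the bonding case discussed above, so the sketch only covers the easy half of the problem.

```c
/* User-space sketch only: every name here is a hypothetical stand-in,
 * not one of the IPoIB structures or functions. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct fake_neigh {
	int on_list;               /* stands in for list membership */
};

struct fake_priv {
	pthread_mutex_t lock;      /* stands in for priv->lock */
	int removals;              /* how many times an entry was unlinked */
};

/* Stands in for the per-neighbour slot returned by to_ipoib_neigh(). */
struct fake_slot {
	struct fake_priv *priv;    /* ASSUMPTION: reachable without touching neigh */
	struct fake_neigh *neigh;  /* may already have been cleared by someone else */
};

/* Returns 1 if this call tore the entry down, 0 if it was already gone. */
static int slot_cleanup(struct fake_slot *slot)
{
	struct fake_neigh *neigh;
	int did_work = 0;

	pthread_mutex_lock(&slot->priv->lock);
	neigh = slot->neigh;       /* read the pointer only while holding the lock */
	if (neigh) {
		neigh->on_list = 0;    /* list_del() equivalent */
		slot->priv->removals++;
		slot->neigh = NULL;    /* later callers see "nothing to do" */
		free(neigh);
		did_work = 1;
	}
	pthread_mutex_unlock(&slot->priv->lock);
	return did_work;
}

/* Runs cleanup twice on the same slot and reports how many unlinks happened. */
static int demo_double_cleanup(void)
{
	struct fake_priv priv = { PTHREAD_MUTEX_INITIALIZER, 0 };
	struct fake_slot slot = { &priv, NULL };

	slot.neigh = malloc(sizeof(*slot.neigh));
	if (!slot.neigh)
		return -1;
	slot.neigh->on_list = 1;

	slot_cleanup(&slot);       /* first caller unlinks and frees */
	slot_cleanup(&slot);       /* second caller finds NULL and does nothing */
	return priv.removals;
}
```

With the check and the unlink under the same lock, the second call is a no-op rather than a second list_del(). The sketch does not address how to find the right lock when it lives in an object that can only be reached through neigh.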
From akepner at sgi.com Fri May 22 14:24:38 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 22 May 2009 14:24:38 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A171628.8090307@gmail.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A167D51.1060607@gmail.com> <20090522155208.GB6837@sgi.com> <4A16E241.1050604@gmail.com> <20090522204445.GG6837@sgi.com> <4A171628.8090307@gmail.com> Message-ID: <20090522212438.GH6837@sgi.com> On Sat, May 23, 2009 at 12:16:24AM +0300, Yossi Etigin wrote: > .... > So it's a catch.. You can't take priv out of neigh without being > protected by the spinlock (or someone will kfree the neigh), but you > need priv to get the spinlock in the first place.. Exactly. -- Arthur From vlad at lists.openfabrics.org Sat May 23 03:22:59 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 23 May 2009 03:22:59 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090523-0200 daily build status Message-ID: <20090523102259.B2CA7E61445@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on 
x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Sat May 23 22:11:32 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 24 May 2009 08:11:32 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090521193910.GX6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> Message-ID: <4A18D704.1020000@voltaire.com> akepner at sgi.com wrote: > Hmmm, it's not obvious to me that that commit ecbb416939da77c0d107409976499724baddce7b would be relevant to the bug that I mentioned earlier If it's not related to the phenomenon addressed by that commit, then, repeating the question posed by Roland: how come a neigh cleanup callback is invoked when someone out there has a ref on the 
neighbour? Also, I'd like to clarify with you whether the rest of this thread applies only to 2.6.16 and possibly older kernels, or to the current mainline bits. Or. From vlad at lists.openfabrics.org Sun May 24 03:21:59 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 24 May 2009 03:21:59 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090524-0200 daily build status Message-ID: <20090524102159.6FECBE612F2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with 
linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From vlad at lists.openfabrics.org Mon May 25 03:21:45 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 25 May 2009 03:21:45 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090525-0200 daily build status Message-ID: <20090525102146.158C4E61571@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp 
Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From zafargilani at gmail.com Mon May 25 08:36:02 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Mon, 25 May 2009 20:36:02 +0500 Subject: [ofa-general] Sending two integers via RDMA_WRITE Message-ID: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> Hi, I am trying to send two integers (essentially a buf array of type uint32_t) to the server via RDMA_WRITE method. 
The following is a piece of code that I rewrote: ++++++++++++++++++++++++++++++++++ buf[0] = strtoul(argv[2], NULL, 0); buf[1] = strtoul(argv[3], NULL, 0); printf("%d + %d = ", buf[0], buf[1]); buf[0] = htonl(buf[0]); buf[1] = htonl(buf[1]); /* ----------------------------------- ---- START - write operation 1 ---- ----------------------------------- */ int c; for(c = 0; c < 2; c++) { sge.addr = (uintptr_t) buf + ((uint32_t)c*sizeof(uint32_t)); sge.length = sizeof(uint32_t); sge.lkey = mr->lkey; send_wr.wr_id = (uint64_t)(c+1);//1; send_wr.opcode = IBV_WR_RDMA_WRITE; send_wr.sg_list = &sge; send_wr.num_sge = 1; send_wr.wr.rdma.rkey = ntohl(server_pdata.buf_rkey); send_wr.wr.rdma.remote_addr = ntohll(server_pdata.buf_va); if(ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr)) return 1; } ++++++++++++++++++++++++++++++++++ I receive no compilation errors but it does not write to remote memory. Any suggestions as to what might be wrong? Thanks, -- Syed Zafar ul Hussan Gilani | BIT-7 Research Student | CHPSC MSP 2008-09 NUST SEECS | http://hpc.niit.edu.pk/~zafar From dotanba at gmail.com Mon May 25 10:46:47 2009 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 25 May 2009 19:46:47 +0200 Subject: [ofa-general] Sending two integers via RDMA_WRITE In-Reply-To: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> Message-ID: <4A1AD987.2080200@gmail.com> Hi. Why do you use ntohl() on the rkey/remote_addr? Which QP type is it (RC or UC)? Did you poll for a completion and check that the status is good? Dotan Zafar Gilani wrote: > Hi, > > I am trying to send two integers (essentially a buf array of type > uint32_t) to the server via RDMA_WRITE method. 
The following is piece > of code that I rewrote: > > ++++++++++++++++++++++++++++++++++ > > buf[0] = strtoul(argv[2], NULL, 0); > buf[1] = strtoul(argv[3], NULL, 0); > > printf("%d + %d = ", buf[0], buf[1]); > > buf[0] = htonl(buf[0]); > buf[1] = htonl(buf[1]); > > /* ----------------------------------- > ---- START - write operation 1 ---- > ----------------------------------- */ > > int c; > for(c = 0; c < 2; c++) > { > sge.addr = (uintptr_t) buf + ((uint32_t)c*sizeof(uint32_t)); > sge.length = sizeof(uint32_t); > sge.lkey = mr->lkey; > > send_wr.wr_id = (uint64_t)(c+1);//1; > send_wr.opcode = IBV_WR_RDMA_WRITE; > send_wr.sg_list = &sge; > send_wr.num_sge = 1; > send_wr.wr.rdma.rkey = ntohl(server_pdata.buf_rkey); > send_wr.wr.rdma.remote_addr = ntohll(server_pdata.buf_va); > > if(ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr)) > return 1; > } > > ++++++++++++++++++++++++++++++++++ > > I receive no compilation errors but it does not write to remote > memory. Any suggestions of what might be wrong? > > Thanks, > -- > Syed Zafar ul Hussan Gilani | BIT-7 > Research Student | CHPSC > MSP 2008-09 > NUST SEECS | http://hpc.niit.edu.pk/~zafar > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From zafargilani at gmail.com Mon May 25 11:08:26 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Mon, 25 May 2009 23:08:26 +0500 Subject: [ofa-general] Sending two integers via RDMA_WRITE In-Reply-To: <4A1AD987.2080200@gmail.com> References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> <4A1AD987.2080200@gmail.com> Message-ID: <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com> Thanks for the reply. 
I read your name under the Author for maybe all the IBV structs/operations at linux.die.net. So I am highly impressed by the work (can only dream of it myself). :)

1. I don't know why the original author (Roland Dreier) used ntohl() for rkey and remote_addr. Using it on the buffer (buf), though, is essential in order to convert the byte order from network to host.

2. I am using QP type RC for a reliable connection.

3. Yes, I am checking for that, but the code gets stuck before that, around when I call ibv_get_cq_event to wait for the next completion event in the event channel. I think the second write (second iteration of the for loop) is not working properly, since when I try to send buf[0] via RDMA_WRITE and buf[1] via SEND it works fine.

The code I am talking about:

while(1) {
    if(ibv_get_cq_event(comp_chan, &evt_cq, &cq_context)) // here it gets stuck
        return 1;

    // does not print this
    printf("after get_cq_event\n"); fflush(stdout);

    if(ibv_req_notify_cq(cq, 0))
        return 1;

    printf("after req_notify_cq\n"); fflush(stdout);

    if(ibv_poll_cq(cq, 1, &wc) != 1)
        return 1;

    printf("after poll_cq\n"); fflush(stdout);

    if(wc.status != IBV_WC_SUCCESS)
        return 1;

    printf("after wc.status\n"); fflush(stdout);

    if(wc.wr_id == 0) {
        printf("%d\n", ntohl(buf[0])); fflush(stdout);
        return 0;
    }
}

Your thoughts on this?

Thanks,
zafar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dotanba at gmail.com Mon May 25 12:27:57 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Mon, 25 May 2009 21:27:57 +0200
Subject: [ofa-general] Sending two integers via RDMA_WRITE
In-Reply-To: <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> <4A1AD987.2080200@gmail.com> <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com>
Message-ID: <4A1AF13D.8040506@gmail.com>

Zafar Gilani wrote:
> Thanks for the reply.
> I read your name under the Author for maybe all
> the IBV structs/operations at linux.die.net. So
> I am highly impressed by the work (can only dream of it myself). :)

thanks ...

>
> 1. I don't know why the original author (Roland Dreier) has used
> ntohl() for rkey and remote_addr. Though to use it on buffer (buf) is
> essential in order to transfer the byte order from network to host.

I need to check the test code to check the reason....

>
> 2. I am using QP type RC for reliable connection.

This means that if there is an error you will get a bad completion
(if you had used a UC QP, then in case of an error on the receiver side
the packet would have been dropped).

>
> 3. Yes I am checking for that but the code gets stuck before that,
> around when I call ibv_get_cq_event to wait for next completion event
> in the event channel. I think the second write (second iteration of
> for loop) is not working properly since when I try to send buf[0] via
> RDMA_WRITE and buf[1] via SEND then it works fine.

Using ibv_get_cq_event can be tricky: you must arm the CQ (call
ibv_req_notify_cq) BEFORE a completion can enter that CQ.

To make things more clear, why won't you just poll the CQ for completion
directly? (without using the CQ events)

I believe that you will get a completion with error...

Dotan

>
> The code I am talking about:
>
> while(1) {
>     if(ibv_get_cq_event(comp_chan, &evt_cq, &cq_context)) // here it
> gets stuck
>         return 1;
>
>     // does not print this
>     printf("after get_cq_event\n"); fflush(stdout);
>
>     if(ibv_req_notify_cq(cq, 0))
>         return 1;
>
>     printf("after req_notify_cq\n"); fflush(stdout);
>
>     if(ibv_poll_cq(cq, 1, &wc) != 1)
>         return 1;
>
>     printf("after poll_cq\n"); fflush(stdout);
>
>     if(wc.status != IBV_WC_SUCCESS)
>         return 1;
>
>     printf("after wc.status\n"); fflush(stdout);
>
>     if(wc.wr_id == 0) {
>         printf("%d\n", ntohl(buf[0])); fflush(stdout);
>         return 0;
>     }
> }
>
> Your thoughts on this?
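
A minimal sketch of the byte-order question raised in this exchange (why ntohl()/ntohll() on the rkey and remote address, and htonl() on the payload). It assumes the server advertised its buffer address and rkey in network byte order, which the thread does not show; the names sketch_pdata, sketch_htonll and sketch_ntohll are hypothetical helpers for illustration, not libibverbs or librdmacm APIs. Note that ntohll() is not a standard C library function, so a program has to supply its own.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire layout: both fields are assumed to be stored in
 * network (big-endian) byte order by the peer that advertises them. */
struct sketch_pdata {
    uint64_t buf_va;
    uint32_t buf_rkey;
};

/* Hypothetical helper: host-to-network conversion for 64-bit values.
 * Writes the most significant byte first, whatever the host endianness. */
static uint64_t sketch_htonll(uint64_t host)
{
    uint64_t net;
    unsigned char *p = (unsigned char *)&net;
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(host & 0xff);
        host >>= 8;
    }
    return net;
}

/* Hypothetical helper: network-to-host conversion for 64-bit values.
 * Reads the bytes in memory order and treats the first as most significant. */
static uint64_t sketch_ntohll(uint64_t net)
{
    const unsigned char *p = (const unsigned char *)&net;
    uint64_t host = 0;
    for (int i = 0; i < 8; i++)
        host = (host << 8) | p[i];
    return host;
}

/* Decode the advertised values into host order before they are used
 * as the remote address and rkey of an RDMA write work request. */
static void sketch_decode(const struct sketch_pdata *wire,
                          uint64_t *remote_addr, uint32_t *rkey)
{
    *remote_addr = sketch_ntohll(wire->buf_va);
    *rkey = ntohl(wire->buf_rkey);
}
```

If the peer instead sent these fields in host order, the conversions would be wrong on little-endian machines, so both sides have to agree on the convention.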
> > Thanks, > zafar From zafargilani at gmail.com Mon May 25 12:55:48 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Tue, 26 May 2009 00:55:48 +0500 Subject: [ofa-general] Sending two integers via RDMA_WRITE In-Reply-To: <4A1AF13D.8040506@gmail.com> References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> <4A1AD987.2080200@gmail.com> <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com> <4A1AF13D.8040506@gmail.com> Message-ID: <7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com> I am attaching the code files (client.c and server.c). I hope I am not bugging you that much! Is there any guide for IBV/RDMA CM? I am using IBA Specification but that only provides the theory, I am looking for something close to javadocs (if any). I will try the rest of what you said in your mail when I can log onto the cluster (for the time being it does not seem to be responding). I will let you know when I do so. Meanwhile kindly have a look at the code, lot of comments for my own good. :) Thanks, zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: client.c Type: application/octet-stream Size: 17153 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: server.c Type: application/octet-stream Size: 5024 bytes Desc: not available URL: From dotanba at gmail.com Mon May 25 22:04:19 2009 From: dotanba at gmail.com (Dotan Barak) Date: Tue, 26 May 2009 08:04:19 +0300 Subject: [ofa-general] Sending two integers via RDMA_WRITE In-Reply-To: <7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com> References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> <4A1AD987.2080200@gmail.com> <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com> <4A1AF13D.8040506@gmail.com> <7d4423d30905251255w3cc9a1c3ha1226251388baf2b@mail.gmail.com> Message-ID: <2f3bf9a60905252204l5e5e9bbbj7e25e9b6badb30c@mail.gmail.com> RDMA Write doesn't produce any completion in the receiver side. Dotan On Mon, May 25, 2009 at 10:55 PM, Zafar Gilani wrote: > I am attaching the code files (client.c and server.c). I hope I am not > bugging you that much! Is there any guide for IBV/RDMA CM? I am using IBA > Specification but that only provides the theory, I am looking for something > close to javadocs (if any). > > I will try the rest of what you said in your mail when I can log onto the > cluster (for the time being it does not seem to be responding). I will let > you know when I do so. Meanwhile kindly have a look at the code, lot of > comments for my own good. :) > > Thanks, > zafar > From zafargilani at gmail.com Mon May 25 22:45:24 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Tue, 26 May 2009 10:45:24 +0500 Subject: [ofa-general] Sending two integers via RDMA_WRITE In-Reply-To: <4A1AF13D.8040506@gmail.com> References: <7d4423d30905250836r431e7d1eh170f3bc22b731a55@mail.gmail.com> <4A1AD987.2080200@gmail.com> <7d4423d30905251108m5bc951ach4da2027698b0ffb6@mail.gmail.com> <4A1AF13D.8040506@gmail.com> Message-ID: <7d4423d30905252245o869fd93p96facb1d1b2b94e6@mail.gmail.com> To make things more clear, why won't you just poll the CQ for completion directly? 
(without using the CQ events) I believe that you will get a completion with error... Yes I tried polling directly and it returns a negative number. What is the remedy for this? Is my for loop logically correct (client.c)? I also tried polling the server CQ directly (server.c) and polling here also returns a negative number, which means that data write is not working properly thus no completion events. What do you suggest I do? I am obviously lost! :( -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Tue May 26 03:24:27 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 26 May 2009 03:24:27 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090526-0200 daily build status Message-ID: <20090526102428.39CF2E61597@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with 
linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From He.Huang at Sun.COM Tue May 26 13:03:46 2009 From: He.Huang at Sun.COM (Isaac Huang) Date: Tue, 26 May 2009 16:03:46 -0400 Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state Message-ID: <20090526200346.GQ4239@sun.com> Hi all, 1. In a previous discussion: http://www.mail-archive.com/general at lists.openfabrics.org/msg19820.html It was mentioned that: You're allowed to destroy a QP earlier, but you have a remote chance of getting into trouble if you reuse the same QP number before any stale packets have drained from the fabric. If rdma_destroy_qp is called on a QP before it exits the TimeWait state (i.e. after RDMA_CM_EVENT_DISCONNECTED but before RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent rdma_create_qp would reuse the same QP while it's still in TimeWait? I'd think that rdma_destroy_qp should not make a TimeWait QP immediately reusable, but wouldn't be surprised if otherwise. 2. In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP could enter the TimeWait state without having entered the Established state first, via the RTU timeout. 
Could a RDMA_CM_EVENT_TIMEWAIT_EXIT happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to clean up some resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on the passive side.

Thanks,
Isaac

From or.gerlitz at gmail.com Tue May 26 13:22:24 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 26 May 2009 23:22:24 +0300
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state
In-Reply-To: <20090526200346.GQ4239@sun.com>
References: <20090526200346.GQ4239@sun.com>
Message-ID: <15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com>

On Tue, May 26, 2009 at 11:03 PM, Isaac Huang wrote:

> If rdma_destroy_qp is called on a QP before it exits the TimeWait state
> (i.e. after RDMA_CM_EVENT_DISCONNECTED but before
> RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent
> rdma_create_qp would reuse the same QP while it's still in TimeWait?

YES - rdma_destroy/create_qp are basically wrappers around ib_destroy/create_qp, and the latter two are not aware in any way of the QP state from the CM point of view.

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sean.hefty at intel.com Tue May 26 13:43:25 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 26 May 2009 13:43:25 -0700
Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state
In-Reply-To: <20090526200346.GQ4239@sun.com>
References: <20090526200346.GQ4239@sun.com>
Message-ID: <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com>

>In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP
>could enter the TimeWait state without having entered the Established
>state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT
>happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a
>RDMA_CM_EVENT_ESTABLISHED?
If yes, our ULP would have to cleanup some >resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side. Yes, it's possible to enter timewait without going through established. I'd have to walk through the code at this point to identify all of the cases. Note that a lot (most?) connections between QPs are established out of band using TCP, and these are not tracked by the CM or go through any sort of timewait before potentially being reused. - Sean From rdreier at cisco.com Tue May 26 16:13:08 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 26 May 2009 16:13:08 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <20090507224806.GF16280@obsidianresearch.com> (Jason Gunthorpe's message of "Thu, 7 May 2009 16:48:06 -0600") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> Message-ID: > > > Or, ignore the overlapping problem, and use your original technique, > > > slightly modified: > > > - Userspace registers a counter with the kernel. Kernel pins the > > > page, sets up mmu notifiers and increments the counter when > > > invalidates intersect with registrations > > > - Kernel maintains a linked list of registrations that have been > > > invalidated via mmu notifiers using the registration structure > > > and a dirty bit > > > - Userspace checks the counter at every cache hit, if different it > > > calls into the kernel: > > > MR_Cookie *mrs[100]; > > > int rc = ibv_get_invalid_mrs(mrs,100); > > > invalidate_cache(mrs,rc); > > > // Repeat until drained > > > > > > get_invalid_mrs traverses the linked list and returns an > > > identifying value to userspace, which looks it up in the cache, > > > calls unregister and removes it from the cache. > > > > What's the advantage of this? 
I have to do the get_invalid_mrs() call a > > bunch of times, rather than just reading which ones are invalid from the > > cache directly? > > This is a trade off, the above is a more normal kernel API and lets > the app get an list of changes it can scan. Having the kernel update > flags means if the app wants a list of changes it has to scan all > registrations. The more I thought about this, the more I liked the idea, until I liked it so much that I actually went ahead and prototyped this. A preliminary version is below -- *very* lightly tested, and no doubt there are obvious bugs that any real use or review will uncover. But I thought I'd throw it out and hope for comments and/or testing. I'm actually pretty happy with how small and simple this ended up being. I'll reply to this message with a simple test program I've used to sanity check this. === [PATCH] ummunot: Userspace support for MMU notifications As discussed in and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help. We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows: 1. ioctl() to register/unregister an address range to watch in the kernel (cf struct ummunot_register_ioctl in ). 2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf struct ummunot_event in ). select()/poll()/epoll() and SIGIO are handled for this IO. 3. 
mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call. NOT-Signed-off-by: Roland Dreier --- drivers/char/Kconfig | 12 ++ drivers/char/Makefile | 1 + drivers/char/ummunot.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ummunot.h | 85 +++++++++ 4 files changed, 555 insertions(+), 0 deletions(-) create mode 100644 drivers/char/ummunot.c create mode 100644 include/linux/ummunot.h diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..91fe068 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,18 @@ config DEVPORT depends on ISA || PCI default y +config UMMUNOT + tristate "Userspace MMU notifications" + select MMU_NOTIFIER + help + The ummunot (userspace MMU notification) driver creates a + character device that can be used by userspace libraries to + get notifications when an application's memory mapping + changed. This is used, for example, by RDMA libraries to + improve the reliability of memory registration caching, since + the kernel's MMU notifications can be used to know precisely + when to shoot down a cached registration. + source "drivers/s390/char/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..dcbcd7c 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o obj-$(CONFIG_GPIO_VR41XX) += vr41xx_giu.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_UMMUNOT) += ummunot.o obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c new file mode 100644 index 0000000..1341edc --- /dev/null +++ b/drivers/char/ummunot.c @@ -0,0 +1,457 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Userspace MMU notifiers"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UMMUNOT_FLAG_DIRTY = 1, + UMMUNOT_FLAG_HINT = 2, +}; + +struct ummunot_reg { + u64 user_cookie; + unsigned long start; + unsigned long end; + unsigned long hint_start; + unsigned long hint_end; + unsigned long flags; + struct rb_node node; + struct list_head list; +}; + +struct ummunot_file { + struct mmu_notifier mmu_notifier; + struct mm_struct *mm; + struct rb_root reg_tree; + struct list_head dirty_list; + u64 *counter; + spinlock_t lock; + wait_queue_head_t read_wait; + struct fasync_struct *async_queue; +}; + +static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn) +{ + return container_of(mn, struct ummunot_file, mmu_notifier); +} + +static void ummunot_handle_not(struct mmu_notifier *mn, + unsigned long start, unsigned long end) +{ + struct ummunot_file *priv = to_ummunot_file(mn); + struct rb_node *n; + struct ummunot_reg *reg; + unsigned long flags; + int hit = 0; + + spin_lock_irqsave(&priv->lock, flags); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->start >= end) + break; + + if ((reg->start <= start && reg->end > start) || + (reg->start <= end && reg->end > end)) { + hit = 1; + + if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_add_tail(&reg->list, &priv->dirty_list); + + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + clear_bit(UMMUNOT_FLAG_HINT, &reg->flags); + } else { + set_bit(UMMUNOT_FLAG_HINT, &reg->flags); + reg->hint_start = start; + reg->hint_end = end; + } + } + } + + if (hit) { + ++(*priv->counter); + flush_dcache_page(virt_to_page(priv->counter)); + wake_up_interruptible(&priv->read_wait); + kill_fasync(&priv->async_queue, SIGIO, POLL_IN); + } + + 
spin_unlock_irqrestore(&priv->lock, flags); +} + +static void ummunot_inval_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long addr) +{ + ummunot_handle_not(mn, addr, addr + PAGE_SIZE); +} + +static void ummunot_inval_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + ummunot_handle_not(mn, start, end); +} + +static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = { + .invalidate_page = ummunot_inval_page, + .invalidate_range_start = ummunot_inval_range_start, +}; + +static int ummunot_open(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv; + int ret; + + if (filp->f_mode & FMODE_WRITE) + return -EINVAL; + + priv = kmalloc(sizeof *priv, GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->counter = (void *) get_zeroed_page(GFP_KERNEL); + if (!priv->counter) { + ret = -ENOMEM; + goto err; + } + + priv->reg_tree = RB_ROOT; + INIT_LIST_HEAD(&priv->dirty_list); + spin_lock_init(&priv->lock); + init_waitqueue_head(&priv->read_wait); + priv->async_queue = NULL; + + priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops; + /* + * Register notifier last, since notifications can occur as + * soon as we register.... 
+ */ + ret = mmu_notifier_register(&priv->mmu_notifier, current->mm); + if (ret) + goto err_page; + + priv->mm = current->mm; + atomic_inc(&priv->mm->mm_count); + + filp->private_data = priv; + + return 0; + +err_page: + free_page((unsigned long) priv->counter); + +err: + kfree(priv); + return ret; +} + +static int ummunot_close(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv = filp->private_data; + struct rb_node *n; + struct ummunot_reg *reg; + + mmu_notifier_unregister(&priv->mmu_notifier, priv->mm); + mmdrop(priv->mm); + free_page((unsigned long) priv->counter); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + rb_erase(n, &priv->reg_tree); + kfree(reg); + } + + kfree(priv); + + return 0; +} + +static ssize_t ummunot_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ummunot_file *priv = filp->private_data; + struct ummunot_reg *reg; + ssize_t ret; + struct ummunot_event *events; + int max; + int n; + + events = (void *) get_zeroed_page(GFP_KERNEL); + if (!events) { + ret = -ENOMEM; + goto out; + } + + spin_lock_irq(&priv->lock); + + while (list_empty(&priv->dirty_list)) { + spin_unlock_irq(&priv->lock); + + if (filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + goto out; + } + + if (wait_event_interruptible(priv->read_wait, + !list_empty(&priv->dirty_list))) { + ret = -ERESTARTSYS; + goto out; + } + + spin_lock_irq(&priv->lock); + } + + max = min(PAGE_SIZE, count) / sizeof *events; + + for (n = 0; n < max; ++n) { + if (list_empty(&priv->dirty_list)) { + events[n].type = UMMUNOT_EVENT_TYPE_LAST; + events[n].user_cookie_counter = *priv->counter; + ++n; + break; + } + + reg = list_first_entry(&priv->dirty_list, struct ummunot_reg, + list); + + events[n].type = UMMUNOT_EVENT_TYPE_INVAL; + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + events[n].flags = UMMUNOT_EVENT_FLAG_HINT; + events[n].hint_start = reg->hint_start; + events[n].hint_end = 
reg->hint_end; + } + events[n].user_cookie_counter = reg->user_cookie; + + list_del(&reg->list); + reg->flags = 0; + } + + spin_unlock_irq(&priv->lock); + + if (copy_to_user(buf, events, n * sizeof *events)) + ret = -EFAULT; + else + ret = n * sizeof *events; + +out: + free_page((unsigned long) events); + return ret; +} + +static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ummunot_file *priv = filp->private_data; + + poll_wait(filp, &priv->read_wait, wait); + + return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM); +} + +static long ummunot_register_region(struct ummunot_file *priv, + struct ummunot_register_ioctl __user *arg) +{ + struct ummunot_register_ioctl parm; + struct ummunot_reg *reg, *treg; + struct rb_node **n = &priv->reg_tree.rb_node; + struct rb_node *pn = NULL; + + if (copy_from_user(&parm, arg, sizeof parm)) + return -EFAULT; + + if (parm.intf_version != UMMUNOT_INTF_VERSION) + return -EINVAL; + + reg = kmalloc(sizeof *reg, GFP_KERNEL); + if (!reg) + return -ENOMEM; + + reg->user_cookie = parm.user_cookie; + reg->start = parm.start; + reg->end = parm.end; + reg->flags = 0; + + spin_lock_irq(&priv->lock); + + while (*n) { + treg = rb_entry(pn, struct ummunot_reg, node); + pn = *n; + if (reg->start <= treg->start) + n = &pn->rb_left; + else + n = &pn->rb_right; + } + + rb_link_node(&reg->node, pn, n); + rb_insert_color(&reg->node, &priv->reg_tree); + + spin_unlock_irq(&priv->lock); + + return 0; +} + +static long ummunot_unregister_region(struct ummunot_file *priv, + __u64 __user *arg) +{ + u64 user_cookie; + struct rb_node *n; + struct ummunot_reg *reg; + int ret = -EINVAL; + + if (get_user(user_cookie, arg)) + return -EFAULT; + + spin_lock_irq(&priv->lock); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->user_cookie == user_cookie) { + rb_erase(n, &priv->reg_tree); + if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_del(&reg->list); 
+ kfree(reg); + ret = 0; + break; + } + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunot_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + struct ummunot_file *priv = filp->private_data; + void __user *argp = (void __user *) arg; + + switch (cmd) { + case UMMUNOT_REGISTER_REGION: + return ummunot_register_region(priv, argp); + case UMMUNOT_UNREGISTER_REGION: + return ummunot_unregister_region(priv, argp); + default: + return -ENOIOCTLCMD; + } +} + +static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct ummunot_file *priv = vma->vm_private_data; + + if (vmf->pgoff != 0) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(priv->counter); + get_page(vmf->page); + + return 0; + +} + +static struct vm_operations_struct ummunot_vm_ops = { + .fault = ummunot_fault, +}; + +static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ummunot_file *priv = filp->private_data; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || + vma->vm_pgoff != 0) + return -EINVAL; + + vma->vm_ops = &ummunot_vm_ops; + vma->vm_private_data = priv; + + return 0; +} + +static int ummunot_fasync(int fd, struct file *filp, int on) +{ + struct ummunot_file *priv = filp->private_data; + + return fasync_helper(fd, filp, on, &priv->async_queue); +} + +static const struct file_operations ummunot_fops = { + .owner = THIS_MODULE, + .open = ummunot_open, + .release = ummunot_close, + .read = ummunot_read, + .poll = ummunot_poll, + .unlocked_ioctl = ummunot_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ummunot_ioctl, +#endif + .mmap = ummunot_mmap, + .fasync = ummunot_fasync, +}; + +static struct miscdevice ummunot_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ummunot", + .fops = &ummunot_fops, +}; + +static int __init ummunot_init(void) +{ + return misc_register(&ummunot_misc); +} + +static void __exit ummunot_cleanup(void) +{ + misc_deregister(&ummunot_misc); +} + +module_init(ummunot_init); 
+module_exit(ummunot_cleanup); diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h new file mode 100644 index 0000000..e1abd89 --- /dev/null +++ b/include/linux/ummunot.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef _LINUX_UMMUNOT_H +#define _LINUX_UMMUNOT_H + +#include +#include + +#define UMMUNOT_INTF_VERSION 1 + +enum { + UMMUNOT_EVENT_TYPE_INVAL = 0, + UMMUNOT_EVENT_TYPE_LAST = 1, +}; + +enum { + UMMUNOT_EVENT_FLAG_HINT = 1 << 0, +}; + +/* + * If type field is INVAL, then user_cookie_counter holds the + * user_cookie for the region being reported; if the HINT flag is set + * then hint_start/hint_end hold the start and end of the mapping that + * was invalidated. (If HINT is not set, then multiple events + * invalidated parts of the registered range and hint_start/hint_end + * should be ignored) + * + * If type is LAST, then the read operation has emptied the list of + * invalidated regions, and user_cookie_counter holds the value of the + * kernel's generation counter when the empty list occurred. The + * other fields are not filled in for this event. + */ +struct ummunot_event { + __u32 type; + __u32 flags; + __u64 hint_start; + __u64 hint_end; + __u64 user_cookie_counter; +}; + +struct ummunot_register_ioctl { + __u32 intf_version; /* in */ + __u32 reserved1; + __u64 start; /* in */ + __u64 end; /* in */ + __u64 user_cookie; /* in */ +}; + +#define UMMUNOT_MAGIC 'U' + +#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \ + struct ummunot_register_ioctl) +#define UMMUNOT_UNREGISTER_REGION _IOW(UMMUNOT_MAGIC, 2, __u64) + +#endif /* _LINUX_UMMUNOT_H */ -- 1.6.0.4 From rdreier at cisco.com Tue May 26 16:13:58 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 26 May 2009 16:13:58 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: (Roland Dreier's message of "Tue, 26 May 2009 16:13:08 -0700") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> Message-ID: Here's the test program: #include #include #include #include #include #include #include #include #define UMMUNOT_INTF_VERSION 1 enum { 
UMMUNOT_EVENT_TYPE_INVAL = 0, UMMUNOT_EVENT_TYPE_LAST = 1, }; enum { UMMUNOT_EVENT_FLAG_HINT = 1 << 0, }; /* * If type field is INVAL, then user_cookie_counter holds the * user_cookie for the region being reported; if the HINT flag is set * then hint_start/hint_end hold the start and end of the mapping that * was invalidated. (If HINT is not set, then multiple events * invalidated parts of the registered range and hint_start/hint_end * should be ignored) * * If type is LAST, then the read operation has emptied the list of * invalidated regions, and user_cookie_counter holds the value of the * kernel's generation counter when the empty list occurred. The * other fields are not filled in for this event. */ struct ummunot_event { __u32 type; __u32 flags; __u64 hint_start; __u64 hint_end; __u64 user_cookie_counter; }; struct ummunot_register_ioctl { __u32 intf_version; /* in */ __u32 reserved1; __u64 start; /* in */ __u64 end; /* in */ __u64 user_cookie; /* in */ }; #define UMMUNOT_MAGIC 'U' #define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \ struct ummunot_register_ioctl) #define UMMUNOT_UNREGISTER_REGION _IOW(UMMUNOT_MAGIC, 2, __u64) static int umn_fd; static volatile unsigned long long *umn_counter; static int umn_init(void) { umn_fd = open("/dev/ummunot", O_RDONLY); if (umn_fd < 0) { perror("open"); return 1; } umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ, MAP_SHARED, umn_fd, 0); if (umn_counter == MAP_FAILED) { perror("mmap"); return 1; } return 0; } static int umn_register(void *buf, size_t size, __u64 cookie) { struct ummunot_register_ioctl r = { .intf_version = UMMUNOT_INTF_VERSION, .start = (unsigned long) buf, .end = (unsigned long) buf + size, .user_cookie = cookie, }; if (ioctl(umn_fd, UMMUNOT_REGISTER_REGION, &r)) { perror("ioctl"); return 1; } return 0; } static int umn_unregister(__u64 cookie) { if (ioctl(umn_fd, UMMUNOT_UNREGISTER_REGION, &cookie)) { perror("ioctl"); return 1; } return 0; } int main(int argc, char *argv[]) { int 
page_size = sysconf(_SC_PAGESIZE); void *t; if (umn_init()) return 1; if (*umn_counter != 0) { fprintf(stderr, "counter = %lld (expected 0)\n", *umn_counter); return 1; } t = mmap(NULL, 3 * page_size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0); if (umn_register(t, 3 * page_size, 123)) return 1; munmap(t + page_size, page_size); printf("ummunot events: %lld\n", *umn_counter); if (*umn_counter > 0) { struct ummunot_event ev[2]; int len; int i; len = read(umn_fd, &ev, sizeof ev); if (len < 0) { perror("read"); return 1; } printf("read %d events (%d tot)\n", (int) (len / sizeof ev[0]), len); for (i = 0; i < (int) (len / sizeof ev[0]); ++i) { switch (ev[i].type) { case UMMUNOT_EVENT_TYPE_INVAL: printf("[%3d]: inval cookie %lld\n", i, ev[i].user_cookie_counter); if (ev[i].flags & UMMUNOT_EVENT_FLAG_HINT) printf(" hint %llx...%llx\n", ev[i].hint_start, ev[i].hint_end); break; case UMMUNOT_EVENT_TYPE_LAST: printf("[%3d]: empty up to %lld\n", i, ev[i].user_cookie_counter); break; default: printf("[%3d]: unknown event type %d\n", i, ev[i].type); break; } } } umn_unregister(123); munmap(t, page_size); printf("ummunot events: %lld\n", *umn_counter); return 0; } From jgunthorpe at obsidianresearch.com Tue May 26 16:51:58 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 26 May 2009 17:51:58 -0600 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> Message-ID: <20090526235158.GG29521@obsidianresearch.com> On Tue, May 26, 2009 at 04:13:08PM -0700, Roland Dreier wrote: > > > > Or, ignore the overlapping problem, and use your original technique, > > > > slightly modified: > > > > - Userspace registers a counter with the kernel. 
Kernel pins the > > > > page, sets up mmu notifiers and increments the counter when > > > > invalidates intersect with registrations > > > > - Kernel maintains a linked list of registrations that have been > > > > invalidated via mmu notifiers using the registration structure > > > > and a dirty bit > > > > - Userspace checks the counter at every cache hit, if different it > > > > calls into the kernel: > > > > MR_Cookie *mrs[100]; > > > > int rc = ibv_get_invalid_mrs(mrs,100); > > > > invalidate_cache(mrs,rc); > > > > // Repeat until drained > > > > > > > > get_invalid_mrs traverses the linked list and returns an > > > > identifying value to userspace, which looks it up in the cache, > > > > calls unregister and removes it from the cache. > > > > > > What's the advantage of this? I have to do the get_invalid_mrs() call a > > > bunch of times, rather than just reading which ones are invalid from the > > > cache directly? > > > > This is a trade off, the above is a more normal kernel API and lets > > the app get an list of changes it can scan. Having the kernel update > > flags means if the app wants a list of changes it has to scan all > > registrations. > > The more I thought about this, the more I liked the idea, until I liked > it so much that I actually went ahead and prototyped this. A > preliminary version is below -- *very* lightly tested, and no doubt > there are obvious bugs that any real use or review will uncover. But I > thought I'd throw it out and hope for comments and/or testing. I'm > actually pretty happy with how small and simple this ended up being. Seems reasonable to me. This doesn't catch all mmap cases, ie this kind of stuff: t = mmap(NULL, 3 * page_size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0); if (umn_register(t, 3 * page_size, 123)) return 1; t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0); // Event? Probably munmap(t,page_size); // Event? 
No, no MAP_POPULATE t = mmap(t,page_size,PROT_READ,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,-1,0); // Event? No And I guess the use of MAP_POPULATE is deliberate, as that's how the mmu notifier works. So the use model for an MPI would be to call ibv_register/umn_register and watch for events. Any event at all means the entire region is toast and must be re-registered the next time someone calls with that address. ibv_register does the same as MAP_POPULATE internally. The MPI library uses the result of this to build a list of invalidated regions. From time to time the MPI library should unregister those regions. If that is the use then the kernel side should probably also be a one-shot type of interface. I'm also trying to think of a use case outside of RDMA and failing - if the kernel hasn't pinned the pages being watched through some other means it seems useless as a general feature? Jason From klakshman03 at hotmail.com Tue May 26 20:10:33 2009 From: klakshman03 at hotmail.com (lakshmana swamy) Date: Wed, 27 May 2009 08:40:33 +0530 Subject: [ofa-general] ***SPAM*** nfs over rdma in Cent OS 5.2 Message-ID: Hi all, I am going to set up NFS over RDMA on 10 clustered nodes running the *CentOS 5.2* operating system. Since NFS over RDMA is not supported by the default CentOS 5.2 kernel, we need to upgrade the kernel version. *Which versions of the kernel, nfs-utils and OFED are the best combination* for a smooth installation? Thanks, laxman -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jon at opengridcomputing.com Wed May 27 02:52:18 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 27 May 2009 04:52:18 -0500 Subject: [ofa-general] ***SPAM*** nfs over rdma in Cent OS 5.2 In-Reply-To: References: Message-ID: <20090527095216.GA21367@opengridcomputing.com> On Wed, May 27, 2009 at 08:40:33AM +0530, lakshmana swamy wrote: > Hi all, > > I am going to set up NFS over RDMA on 10 clustered nodes running the *CentOS 5.2* operating system. Since NFS over RDMA is not supported by the default CentOS 5.2 kernel, we need to upgrade the kernel version. *Which versions of the kernel, nfs-utils and OFED are the best combination* for a smooth installation? OFED 1.4.1 will have support for the stock CentOS/RHEL 5.2 kernel, and should be coming out any day now. If you do not want to wait for it to come out, then you should use the 2.6.28 kernel, nfs-utils 1.5, and OFED 1.4. The basics of how to set it up can be found in the Linux kernel documentation at Documentation/filesystems/nfs-rdma.txt Thanks, Jon > Thanks, > laxman From Line.Holen at Sun.COM Wed May 27 03:07:41 2009 From: Line.Holen at Sun.COM (Line.Holen at Sun.COM) Date: Wed, 27 May 2009 12:07:41 +0200 Subject: [ofa-general] [PATCH] osm_ucast_ftree.c Allow horizontal links between switches of max rank Message-ID: <4A1D10ED.6040001@Sun.COM> This patch makes it legal to have cross links (horizontal links) between switches at max rank. 
These switches do have same rank, so hop count cannot be calculated based on rank anymore. The horizontal links are treated as downlinks. Switch A has a downlink to B while B has a downlink to A. Tests on lids and also number of hops makes sure that we don't loop back and forth across the link. Signed-off-by: Frank Olaf Sem-Jacobsen Signed-off-by: Line Holen --- diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 8ed2f74..1f1d0ff 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1,4 +1,5 @@ /* + * Copyright (c) 2009 Simula Research Laboratory. All rights reserved. * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved. @@ -1495,7 +1496,8 @@ static void fabric_make_indexing(IN ftree_fabric_t * p_ftree) p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw. p_sw; - if (tuple_assigned(p_remote_sw->tuple)) { + if (tuple_assigned(p_remote_sw->tuple) || + (p_sw->rank == p_remote_sw->rank)) { /* this switch has been already indexed */ continue; } @@ -1903,12 +1905,11 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, IN ftree_sw_t * p_prev_sw, IN uint16_t target_lid, - IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, IN boolean_t is_target_a_sw, - IN uint8_t highest_rank_in_route, - IN uint16_t reverse_hops) + IN uint16_t reverse_hops, + IN uint8_t current_hops) { ftree_sw_t *p_remote_sw; uint16_t ports_num; @@ -1919,6 +1920,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, uint16_t j; uint16_t k; boolean_t created_route = FALSE; + uint8_t least_hops; /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); @@ -1968,14 +1970,15 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, Set on the remote switch 
how to get to the target_lid - set LFT(target_lid) on the remote switch to the remote port */ p_remote_sw = p_group->remote_hca_or_sw.p_sw; + least_hops = sw_get_least_hops(p_remote_sw, target_lid); - if (sw_get_least_hops(p_remote_sw, target_lid) != OSM_NO_PATH) { + if ((least_hops != OSM_NO_PATH) && (least_hops <= current_hops)) { /* Loop in the fabric - we already routed the remote switch on our way UP, and now we see it again on our way DOWN */ OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Loop of lenght %d in the fabric:\n " "Switch %s (LID %u) closes loop through switch %s (LID %u)\n", - (p_remote_sw->rank - highest_rank_in_route) * 2, + current_hops, tuple_to_str(p_remote_sw->tuple), p_group->base_lid, tuple_to_str(p_sw->tuple), @@ -2022,20 +2025,17 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, p_remote_sw->p_osm_sw->new_lft[target_lid] = p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Switch %s: set path to CA LID %u through port %u\n", + "Switch %s: set path to CA LID %u through port %u, hops %u\n", tuple_to_str(p_remote_sw->tuple), target_lid, - p_min_port->remote_port_num); + p_min_port->remote_port_num, + current_hops + 1); /* On the remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. */ set_hops_on_remote_sw(p_group, target_lid, - ((target_rank - - highest_rank_in_route) + - (p_remote_sw->rank - - highest_rank_in_route) + - reverse_hops * 2), + current_hops + 1 + reverse_hops * 2, is_target_a_sw); } @@ -2050,13 +2050,13 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* Recursion step: Assign upgoing ports by stepping down, starting on REMOTE switch */ created_route |= fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ - NULL, /* prev. position - NULL to mark that we went down and not up */ + p_sw, /* prev. 
position */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether the target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ - highest_rank_in_route, reverse_hops); /* highest visited point in the tree before going down */ + reverse_hops, + current_hops + 1); } /* done scanning all the down-going port groups */ @@ -2087,12 +2087,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, IN ftree_sw_t * p_prev_sw, IN uint16_t target_lid, - IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, IN boolean_t is_target_a_sw, IN uint16_t reverse_hop_credit, - IN uint16_t reverse_hops) + IN uint16_t reverse_hops, + IN uint8_t current_hops) { ftree_sw_t *p_remote_sw; uint16_t ports_num; @@ -2110,12 +2110,11 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, fabric_route_upgoing_by_going_down(p_ftree, p_sw, /* local switch - used as a route-upgoing alg. 
start point */ p_prev_sw, /* switch that we went up from (NULL means that we went down) */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this path to HCA should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ - p_sw->rank, /* the highest visited point in the tree before going down */ - reverse_hops); /* Number of reverse_hops done up to this point */ + reverse_hops, /* Number of reverse_hops done up to this point */ + current_hops); /* recursion stop condition - if it's a root switch, */ if (p_sw->rank == 0) { @@ -2140,12 +2139,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ p_sw, /* this switch - prev. position switch for the function */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ - reverse_hops + 1); /* Number of reverse_hops done up to this point */ + reverse_hops + 1, /* Number of reverse_hops done up to this point */ + current_hops + 1); } } @@ -2244,17 +2243,17 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, new_lft[target_lid] = p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Switch %s: set path to CA LID %u through port %u\n", + "Switch %s: set path to CA LID %u through port %u, hops %u\n", tuple_to_str(p_remote_sw->tuple), target_lid, - p_min_port->remote_port_num); + p_min_port->remote_port_num, current_hops + 1); } /* On the remote switch that is 
pointed by the min_group, set hops for ALL the ports in the remote group. */ set_hops_on_remote_sw(p_min_group, target_lid, - target_rank - p_remote_sw->rank + - 2 * reverse_hops, is_target_a_sw); + current_hops + 1 + 2 * reverse_hops, + is_target_a_sw); } /* Recursion step: @@ -2262,12 +2261,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ p_sw, /* this switch - prev. position switch for the function */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit, /* Remaining reverse_hops allowed */ - reverse_hops); /* Number of reverse_hops done up to this point */ + reverse_hops, /* Number of reverse_hops done up to this point */ + current_hops + 1); } /* we're done for the third case */ @@ -2335,22 +2334,21 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, /* On the remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. */ - set_hops_on_remote_sw(p_group, target_lid, - target_rank - p_remote_sw->rank + - 2 * reverse_hops, is_target_a_sw); + current_hops + 1 + 2 * reverse_hops, + is_target_a_sw); /* Recursion step: Assign downgoing ports by stepping up, starting on REMOTE switch. */ fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ p_sw, /* this switch - prev. 
position switch for the function */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is real or dummy */ FALSE, /* whether this is path to HCA that should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit, /* Remaining reverse_hops allowed */ - reverse_hops); /* Number of reverse_hops done up to this point */ + reverse_hops, /* Number of reverse_hops done up to this point */ + current_hops + 1); } /* If we don't have any reverse hop credits, we are done */ @@ -2374,12 +2372,12 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, fabric_route_downgoing_by_going_up(p_ftree, p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ p_sw, /* this switch - prev. position switch for the function */ target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is real or dummy */ TRUE, /* whether this is path to HCA that should by tracked by counters */ is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ - reverse_hops + 1); /* Number of reverse_hops done up to this point */ + reverse_hops + 1, /* Number of reverse_hops done up to this point */ + current_hops + 1); } } /* ftree_fabric_route_downgoing_by_going_up() */ @@ -2451,7 +2449,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) p_port->port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "Switch %s: set path to CN LID %u through port %u\n", + "Switch %s: set path to CN LID %u through port %u, hop 1\n", tuple_to_str(p_sw->tuple), hca_lid, p_port->port_num); @@ -2464,12 +2462,12 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. 
position switch */ hca_lid, /* LID that we're routing to */ - p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ FALSE, /* wheter target lid is a switch or not */ 0, /* Number of reverse hops allowed */ - 0); /* Number of reverse hops done yet */ + 0, /* Number of reverse hops done yet */ + 1); /* count how many real targets have been routed from this leaf switch */ routed_targets_on_leaf++; @@ -2492,12 +2490,12 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. position switch */ 0, /* LID that we're routing to - ignored for dummy HCA */ - 0, /* rank of the LID that we're routing to - ignored for dummy HCA */ FALSE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ FALSE, /* Wheter the target LID is a switch or not */ 0, /* Number of reverse hops allowed */ - 0); /* Number of reverse hops done yet */ + 0, /* Number of reverse hops done yet */ + 1); } } } @@ -2579,12 +2577,12 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. position switch */ hca_lid, /* LID that we're routing to */ - p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ FALSE, /* Wheter the target LID is a switch or not */ p_hca_port_group->is_io ? 
p_ftree->p_osm->subn.opt.max_reverse_hops : 0, /* Number or reverse hops allowed */ - 0); /* Number or reverse hops done yet */ + 0, /* Number or reverse hops done yet */ + 1); } /* done with all the port groups of this HCA - go to next HCA */ } @@ -2632,12 +2630,12 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree) fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. position switch */ p_sw->base_lid, /* LID that we're routing to */ - p_sw->rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is a real or dummy */ FALSE, /* whether this path to HCA should by tracked by counters */ TRUE, /* Wheter the target LID is a switch or not */ 0, /* Number of reverse hops allowed */ - 0); /* Number of reverse hops done yet */ + 0, /* Number of reverse hops done yet */ + 0); } OSM_LOG_EXIT(&p_ftree->p_osm->log); @@ -3058,7 +3056,8 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, p_remote_hca_or_sw = (void *)p_remote_sw; - if (abs(p_sw->rank - p_remote_sw->rank) != 1) { + if ((abs(p_sw->rank - p_remote_sw->rank) != 1) && + (p_sw->rank != p_ftree->max_switch_rank)) { OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_ERROR, "ERR AB16: " "Illegal link between switches with ranks %u and %u:\n" From vlad at lists.openfabrics.org Wed May 27 03:24:52 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 27 May 2009 03:24:52 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090527-0200 daily build status Message-ID: <20090527102452.59AAFE615C2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with 
linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From Line.Holen at Sun.COM Wed May 27 03:32:23 2009 From: Line.Holen at Sun.COM (Line.Holen at Sun.COM) Date: Wed, 27 May 2009 12:32:23 +0200 Subject: [ofa-general] [PATCH] osm_dump.c dump port if lft is set up Message-ID: <4A1D16B7.7070300@Sun.COM> dump_ucast_routes() claims that a node is unreachable if the number of hops to it is unknown. 
This is changed to print actual port and give proper warning about hops. Signed-off-by: Line Holen --- diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c index 946ee6a..08b3156 100644 --- a/opensm/opensm/osm_dump.c +++ b/opensm/opensm/osm_dump.c @@ -1,4 +1,5 @@ /* + * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. @@ -201,7 +202,7 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt) } if (num_hops == OSM_NO_PATH) { - fprintf(file, "UNREACHABLE\n"); + fprintf(file, "%03u : HOPS UNKNOWN\n", port_num); continue; } From jsquyres at cisco.com Wed May 27 10:34:22 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 May 2009 13:34:22 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com><20090506222638.GA16280@obsidianresearch.com><20090507000231.GB16280@obsidianresearch.com><20090507224806.GF16280@obsidianresearch.com> Message-ID: <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> On May 26, 2009, at 7:13 PM, Roland Dreier (rdreier) wrote: > /* > * If type field is INVAL, then user_cookie_counter holds the > * user_cookie for the region being reported; if the HINT flag is set > * then hint_start/hint_end hold the start and end of the mapping that > * was invalidated. (If HINT is not set, then multiple events > * invalidated parts of the registered range and hint_start/hint_end > * should be ignored) > I don't quite grok this. Is the intent that HINT will only be set if an *entire* hint_start/hint_end range is invalidated by a single event? 
I.e., if only part of the hint_start/hint_end range is invalidated, you'll get the cookie back, but not what part of the range is invalid (because assumedly the entire IBV registration is now invalid anyway)? > * If type is LAST, then the read operation has emptied the list of > * invalidated regions, and user_cookie_counter holds the value of the > * kernel's generation counter when the empty list occurred. The > * other fields are not filled in for this event. > Just to be clear -- we're supposed to keep reading events until we get a LAST event? > if (*umn_counter != 0) { > fprintf(stderr, "counter = %lld (expected 0)\n", > *umn_counter); > return 1; > } > Some clarification questions about umn_counter: 1. Will it increase by 1 each time a page (or set of pages?) is removed from a user process? 2. Does it change if pages are *added* to a user process? I.e., does the counter indicate *removals* or *changes* to the user process page table? > t = mmap(NULL, 3 * page_size, PROT_READ, > MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0); > > if (umn_register(t, 3 * page_size, 123)) > return 1; > > munmap(t + page_size, page_size); > > printf("ummunot events: %lld\n", *umn_counter); > > if (*umn_counter > 0) { > Is the *umn_counter value guaranteed to have been changed by the time munmap() returns? > struct ummunot_event ev[2]; > Did you pick [2] here simply because you're only expecting an INVAL and a LAST event in this specific example? I'm assuming that we should normally loop over reading until we get LAST, correct? 
> > int len; > int i; > > len = read(umn_fd, &ev, sizeof ev); > printf("read %d events (%d tot)\n", len / sizeof > ev[0], len); > > for (i = 0; i < len / sizeof ev[0]; ++i) { > switch (ev[i].type) { > case UMMUNOT_EVENT_TYPE_INVAL: > printf("[%3d]: inval cookie %lld\n", > i, ev[i].user_cookie_counter); > if (ev[i].flags & > UMMUNOT_EVENT_FLAG_HINT) > printf(" hint %llx...%lx\n", > ev[i].hint_start, > ev[i].hint_end); > break; > case UMMUNOT_EVENT_TYPE_LAST: > printf("[%3d]: empty up to %lld\n", > i, ev[i].user_cookie_counter); > break; > default: > printf("[%3d]: unknown event type %d > \n", > i, ev[i].type); > break; > } > } > } > > umn_unregister(123); > What happens if I register multiple regions with the same cookie value? Is a process responsible for guaranteeing that it umn_unregister()s everything before exiting, or will all pending registrations be cleaned up/unregistered/whatever when a process exits? -- Jeff Squyres Cisco Systems From rdreier at cisco.com Wed May 27 10:49:57 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 10:49:57 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> (Jeff Squyres's message of "Wed, 27 May 2009 13:34:22 -0400") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: > > /* > > * If type field is INVAL, then user_cookie_counter holds the > > * user_cookie for the region being reported; if the HINT flag is set > > * then hint_start/hint_end hold the start and end of the mapping that > > * was invalidated. (If HINT is not set, then multiple events > > * invalidated parts of the registered range and hint_start/hint_end > > * should be ignored) > I don't quite grok this. 
Is the intent that HINT will only be set if > an *entire* hint_start/hint_end range is invalidated by a single > event? I.e., if only part of the hint_start/hint_end range is > invalidated, you'll get the cookie back, but not what part of the > range is invalid (because assumedly the entire IBV registration is now > invalid anyway)? Basically, I just keep one hint_start/hint_end. If multiple events hit the same registration then I just give up and don't give you a hint. > > * If type is LAST, then the read operation has emptied the list of > > * invalidated regions, and user_cookie_counter holds the value of the > > * kernel's generation counter when the empty list occurred. The > > * other fields are not filled in for this event. > Just to be clear -- we're supposed to keep reading events until we get > a LAST event? Yes, that's probably the sanest use case. > 1. Will it increase by 1 each time a page (or set of pages?) is > removed from a user process? As it stands it increases by 1 every time there is an MMU notification, even if that notification hits multiple registrations. It wouldn't be hard to change that to count the number of events generated if that works better. > 2. Does it change if pages are *added* to a user process? I.e., does > the counter indicate *removals* or *changes* to the user process page > table? No, additions don't trigger any MMU notification -- that's inherent in the design of the MMU notifiers stuff. The idea is that you have a "secondary MMU" and MMU notifications are the equivalent of TLB shootdowns; the secondary MMU is responsible for populating itself on faults etc. > Is the *unm_counter value guaranteed to have been changed by the time > munmap() returns? Yes. > Did you pick [2] here simply because you're only expecting an INVAL > and a LAST event in this specific example? I'm assuming that we > should normally loop over reading until we get LAST, correct? Right. 
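[Editorial illustration, not part of the original message.] The "keep reading until LAST" usage agreed on above can be sketched in userspace roughly as follows. This is a minimal sketch under stated assumptions: the event layout mirrors the struct ummunot_event in the posted header, while the callback type and the names umn_count_cb, umn_handle_batch and umn_drain are hypothetical helpers invented for this example. It also assumes that, if a read fills the buffer exactly as the dirty list empties, no LAST event is delivered and a later read on a non-blocking fd fails with EAGAIN, which is how the posted ummunot_read() appears to behave.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Local mirror of the event layout in the patch's include/linux/ummunot.h. */
enum { UMMUNOT_EVENT_TYPE_INVAL = 0, UMMUNOT_EVENT_TYPE_LAST = 1 };
enum { UMMUNOT_EVENT_FLAG_HINT = 1 << 0 };

struct ummunot_event {
	uint32_t type;
	uint32_t flags;
	uint64_t hint_start;
	uint64_t hint_end;
	uint64_t user_cookie_counter;
};

/* Hypothetical per-registration callback; not part of the patch. */
typedef void (*umn_inval_cb)(uint64_t cookie, const struct ummunot_event *ev,
			     void *arg);

/* Example callback: counts invalidated registrations in *(int *) arg. */
static void umn_count_cb(uint64_t cookie, const struct ummunot_event *ev,
			 void *arg)
{
	(void) cookie;
	(void) ev;
	++*(int *) arg;
}

/*
 * Walk one batch of n events. Returns 1 and stores the generation
 * counter in *gen once a LAST event is seen, 0 if more reads are needed.
 */
static int umn_handle_batch(const struct ummunot_event *ev, size_t n,
			    umn_inval_cb cb, void *arg, uint64_t *gen)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		if (ev[i].type == UMMUNOT_EVENT_TYPE_LAST) {
			*gen = ev[i].user_cookie_counter;
			return 1;
		}
		if (ev[i].type == UMMUNOT_EVENT_TYPE_INVAL)
			cb(ev[i].user_cookie_counter, &ev[i], arg);
	}
	return 0;
}

/*
 * Read batches until LAST. Returns 0 on LAST, 1 if a non-blocking fd
 * ran dry first (EAGAIN), -1 on end-of-file or any other error.
 */
static int umn_drain(int fd, umn_inval_cb cb, void *arg, uint64_t *gen)
{
	struct ummunot_event ev[16];
	ssize_t len;

	for (;;) {
		len = read(fd, ev, sizeof ev);
		if (len < 0)
			return errno == EAGAIN ? 1 : -1;
		if (len == 0)
			return -1;
		if (umn_handle_batch(ev, (size_t) len / sizeof ev[0],
				     cb, arg, gen))
			return 0;
	}
}
```

A caller would typically invalidate its cached registration for each cookie inside the callback and remember the returned generation value for later comparison against the mmap()ed counter.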
> What happens if I register multiple regions with the same cookie value? You get in trouble -- I need to fix things to reject duplicated cookies actually, because otherwise there's no way to unregister. > Is a process responsible for guaranteeing that it umn_unregister()s > everything before exiting, or will all pending registrations be > cleaned up/unregistered/whatever when a process exits? The kernel cleans up of course to handle crashes etc. - R. From rdreier at cisco.com Wed May 27 11:13:34 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 11:13:34 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: (Roland Dreier's message of "Wed, 27 May 2009 10:49:57 -0700") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: Fixed version below -- returns EINVAL for an attempt to reuse a user cookie (since otherwise unregister would get confused). === ummunot: Userspace support for MMU notifications As discussed in and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help. We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows: 1. ioctl() to register/unregister an address range to watch in the kernel (cf struct ummunot_register_ioctl in ). 2. 
read() to retrieve events generated when a mapping in a watched address range is invalidated (cf struct ummunot_event in ). select()/poll()/epoll() and SIGIO are handled for this IO. 3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call. NOT-YET-Signed-off-by: Roland Dreier --- drivers/char/Kconfig | 12 ++ drivers/char/Makefile | 1 + drivers/char/ummunot.c | 457 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ummunot.h | 85 +++++++++ 4 files changed, 555 insertions(+), 0 deletions(-) create mode 100644 drivers/char/ummunot.c create mode 100644 include/linux/ummunot.h diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..91fe068 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,18 @@ config DEVPORT depends on ISA || PCI default y +config UMMUNOT + tristate "Userspace MMU notifications" + select MMU_NOTIFIER + help + The ummunot (userspace MMU notification) driver creates a + character device that can be used by userspace libraries to + get notifications when an application's memory mapping + changed. This is used, for example, by RDMA libraries to + improve the reliability of memory registration caching, since + the kernel's MMU notifications can be used to know precisely + when to shoot down a cached registration. 
+ source "drivers/s390/char/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..dcbcd7c 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o obj-$(CONFIG_GPIO_VR41XX) += vr41xx_giu.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_UMMUNOT) += ummunot.o obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c new file mode 100644 index 0000000..1341edc --- /dev/null +++ b/drivers/char/ummunot.c @@ -0,0 +1,457 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Userspace MMU notifiers"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UMMUNOT_FLAG_DIRTY = 1, + UMMUNOT_FLAG_HINT = 2, +}; + +struct ummunot_reg { + u64 user_cookie; + unsigned long start; + unsigned long end; + unsigned long hint_start; + unsigned long hint_end; + unsigned long flags; + struct rb_node node; + struct list_head list; +}; + +struct ummunot_file { + struct mmu_notifier mmu_notifier; + struct mm_struct *mm; + struct rb_root reg_tree; + struct list_head dirty_list; + u64 *counter; + spinlock_t lock; + wait_queue_head_t read_wait; + struct fasync_struct *async_queue; +}; + +static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn) +{ + return container_of(mn, struct ummunot_file, mmu_notifier); +} + +static void ummunot_handle_not(struct mmu_notifier *mn, + unsigned long start, unsigned long end) +{ + struct ummunot_file *priv = to_ummunot_file(mn); + struct rb_node *n; + struct ummunot_reg *reg; + unsigned long flags; + int hit = 0; + + spin_lock_irqsave(&priv->lock, flags); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->start >= end) + break; + + if ((reg->start <= start && reg->end > start) || + (reg->start <= end && reg->end > end)) { + hit = 1; + + if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_add_tail(&reg->list, &priv->dirty_list); + + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + clear_bit(UMMUNOT_FLAG_HINT, &reg->flags); + } else { + set_bit(UMMUNOT_FLAG_HINT, &reg->flags); + 
reg->hint_start = start; + reg->hint_end = end; + } + } + } + + if (hit) { + ++(*priv->counter); + flush_dcache_page(virt_to_page(priv->counter)); + wake_up_interruptible(&priv->read_wait); + kill_fasync(&priv->async_queue, SIGIO, POLL_IN); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static void ummunot_inval_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long addr) +{ + ummunot_handle_not(mn, addr, addr + PAGE_SIZE); +} + +static void ummunot_inval_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + ummunot_handle_not(mn, start, end); +} + +static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = { + .invalidate_page = ummunot_inval_page, + .invalidate_range_start = ummunot_inval_range_start, +}; + +static int ummunot_open(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv; + int ret; + + if (filp->f_mode & FMODE_WRITE) + return -EINVAL; + + priv = kmalloc(sizeof *priv, GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->counter = (void *) get_zeroed_page(GFP_KERNEL); + if (!priv->counter) { + ret = -ENOMEM; + goto err; + } + + priv->reg_tree = RB_ROOT; + INIT_LIST_HEAD(&priv->dirty_list); + spin_lock_init(&priv->lock); + init_waitqueue_head(&priv->read_wait); + priv->async_queue = NULL; + + priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops; + /* + * Register notifier last, since notifications can occur as + * soon as we register.... 
+ */ + ret = mmu_notifier_register(&priv->mmu_notifier, current->mm); + if (ret) + goto err_page; + + priv->mm = current->mm; + atomic_inc(&priv->mm->mm_count); + + filp->private_data = priv; + + return 0; + +err_page: + free_page((unsigned long) priv->counter); + +err: + kfree(priv); + return ret; +} + +static int ummunot_close(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv = filp->private_data; + struct rb_node *n; + struct ummunot_reg *reg; + + mmu_notifier_unregister(&priv->mmu_notifier, priv->mm); + mmdrop(priv->mm); + free_page((unsigned long) priv->counter); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + rb_erase(n, &priv->reg_tree); + kfree(reg); + } + + kfree(priv); + + return 0; +} + +static ssize_t ummunot_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ummunot_file *priv = filp->private_data; + struct ummunot_reg *reg; + ssize_t ret; + struct ummunot_event *events; + int max; + int n; + + events = (void *) get_zeroed_page(GFP_KERNEL); + if (!events) { + ret = -ENOMEM; + goto out; + } + + spin_lock_irq(&priv->lock); + + while (list_empty(&priv->dirty_list)) { + spin_unlock_irq(&priv->lock); + + if (filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + goto out; + } + + if (wait_event_interruptible(priv->read_wait, + !list_empty(&priv->dirty_list))) { + ret = -ERESTARTSYS; + goto out; + } + + spin_lock_irq(&priv->lock); + } + + max = min(PAGE_SIZE, count) / sizeof *events; + + for (n = 0; n < max; ++n) { + if (list_empty(&priv->dirty_list)) { + events[n].type = UMMUNOT_EVENT_TYPE_LAST; + events[n].user_cookie_counter = *priv->counter; + ++n; + break; + } + + reg = list_first_entry(&priv->dirty_list, struct ummunot_reg, + list); + + events[n].type = UMMUNOT_EVENT_TYPE_INVAL; + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + events[n].flags = UMMUNOT_EVENT_FLAG_HINT; + events[n].hint_start = reg->hint_start; + events[n].hint_end = 
reg->hint_end; + } + events[n].user_cookie_counter = reg->user_cookie; + + list_del(&reg->list); + reg->flags = 0; + } + + spin_unlock_irq(&priv->lock); + + if (copy_to_user(buf, events, n * sizeof *events)) + ret = -EFAULT; + else + ret = n * sizeof *events; + +out: + free_page((unsigned long) events); + return ret; +} + +static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ummunot_file *priv = filp->private_data; + + poll_wait(filp, &priv->read_wait, wait); + + return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM); +} + +static long ummunot_register_region(struct ummunot_file *priv, + struct ummunot_register_ioctl __user *arg) +{ + struct ummunot_register_ioctl parm; + struct ummunot_reg *reg, *treg; + struct rb_node **n = &priv->reg_tree.rb_node; + struct rb_node *pn = NULL; + + if (copy_from_user(&parm, arg, sizeof parm)) + return -EFAULT; + + if (parm.intf_version != UMMUNOT_INTF_VERSION) + return -EINVAL; + + reg = kmalloc(sizeof *reg, GFP_KERNEL); + if (!reg) + return -ENOMEM; + + reg->user_cookie = parm.user_cookie; + reg->start = parm.start; + reg->end = parm.end; + reg->flags = 0; + + spin_lock_irq(&priv->lock); + + while (*n) { + treg = rb_entry(pn, struct ummunot_reg, node); + pn = *n; + if (reg->start <= treg->start) + n = &pn->rb_left; + else + n = &pn->rb_right; + } + + rb_link_node(&reg->node, pn, n); + rb_insert_color(&reg->node, &priv->reg_tree); + + spin_unlock_irq(&priv->lock); + + return 0; +} + +static long ummunot_unregister_region(struct ummunot_file *priv, + __u64 __user *arg) +{ + u64 user_cookie; + struct rb_node *n; + struct ummunot_reg *reg; + int ret = -EINVAL; + + if (get_user(user_cookie, arg)) + return -EFAULT; + + spin_lock_irq(&priv->lock); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->user_cookie == user_cookie) { + rb_erase(n, &priv->reg_tree); + if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_del(&reg->list); 
+ kfree(reg); + ret = 0; + break; + } + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunot_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + struct ummunot_file *priv = filp->private_data; + void __user *argp = (void __user *) arg; + + switch (cmd) { + case UMMUNOT_REGISTER_REGION: + return ummunot_register_region(priv, argp); + case UMMUNOT_UNREGISTER_REGION: + return ummunot_unregister_region(priv, argp); + default: + return -ENOIOCTLCMD; + } +} + +static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct ummunot_file *priv = vma->vm_private_data; + + if (vmf->pgoff != 0) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(priv->counter); + get_page(vmf->page); + + return 0; + +} + +static struct vm_operations_struct ummunot_vm_ops = { + .fault = ummunot_fault, +}; + +static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ummunot_file *priv = filp->private_data; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || + vma->vm_pgoff != 0) + return -EINVAL; + + vma->vm_ops = &ummunot_vm_ops; + vma->vm_private_data = priv; + + return 0; +} + +static int ummunot_fasync(int fd, struct file *filp, int on) +{ + struct ummunot_file *priv = filp->private_data; + + return fasync_helper(fd, filp, on, &priv->async_queue); +} + +static const struct file_operations ummunot_fops = { + .owner = THIS_MODULE, + .open = ummunot_open, + .release = ummunot_close, + .read = ummunot_read, + .poll = ummunot_poll, + .unlocked_ioctl = ummunot_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ummunot_ioctl, +#endif + .mmap = ummunot_mmap, + .fasync = ummunot_fasync, +}; + +static struct miscdevice ummunot_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ummunot", + .fops = &ummunot_fops, +}; + +static int __init ummunot_init(void) +{ + return misc_register(&ummunot_misc); +} + +static void __exit ummunot_cleanup(void) +{ + misc_deregister(&ummunot_misc); +} + +module_init(ummunot_init); 
+module_exit(ummunot_cleanup); diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h new file mode 100644 index 0000000..e1abd89 --- /dev/null +++ b/include/linux/ummunot.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef _LINUX_UMMUNOT_H +#define _LINUX_UMMUNOT_H + +#include +#include + +#define UMMUNOT_INTF_VERSION 1 + +enum { + UMMUNOT_EVENT_TYPE_INVAL = 0, + UMMUNOT_EVENT_TYPE_LAST = 1, +}; + +enum { + UMMUNOT_EVENT_FLAG_HINT = 1 << 0, +}; + +/* + * If type field is INVAL, then user_cookie_counter holds the + * user_cookie for the region being reported; if the HINT flag is set + * then hint_start/hint_end hold the start and end of the mapping that + * was invalidated. (If HINT is not set, then multiple events + * invalidated parts of the registered range and hint_start/hint_end + * should be ignored) + * + * If type is LAST, then the read operation has emptied the list of + * invalidated regions, and user_cookie_counter holds the value of the + * kernel's generation counter when the empty list occurred. The + * other fields are not filled in for this event. + */ +struct ummunot_event { + __u32 type; + __u32 flags; + __u64 hint_start; + __u64 hint_end; + __u64 user_cookie_counter; +}; + +struct ummunot_register_ioctl { + __u32 intf_version; /* in */ + __u32 reserved1; + __u64 start; /* in */ + __u64 end; /* in */ + __u64 user_cookie; /* in */ +}; + +#define UMMUNOT_MAGIC 'U' + +#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \ + struct ummunot_register_ioctl) +#define UMMUNOT_UNREGISTER_REGION _IOW(UMMUNOT_MAGIC, 2, __u64) + +#endif /* _LINUX_UMMUNOT_H */ -- 1.6.0.4 From zafargilani at gmail.com Wed May 27 11:21:50 2009 From: zafargilani at gmail.com (Zafar Gilani) Date: Wed, 27 May 2009 23:21:50 +0500 Subject: [ofa-general] Verbs or RDMA code via JNI Message-ID: <7d4423d30905271121k415bf7b1ob3327a60f333c2fe@mail.gmail.com> This information is very important for me and I will greatly appreciate anyone who could help me. My question is: Has anyone tried executing native C (verbs or RDMA) code via JNI (Java Native Interface). If somebody has, then kindly let me know whether it was successful or not. 
Thanks, -- Zafar -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsquyres at cisco.com Wed May 27 12:02:36 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 May 2009 15:02:36 -0400 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com><20090506222638.GA16280@obsidianresearch.com><20090507000231.GB16280@obsidianresearch.com><20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: Other MPI implementors -- what do you think of this scheme? On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote: > > > > /* > > > * If type field is INVAL, then user_cookie_counter holds the > > > * user_cookie for the region being reported; if the HINT flag > is set > > > * then hint_start/hint_end hold the start and end of the > mapping that > > > * was invalidated. (If HINT is not set, then multiple events > > > * invalidated parts of the registered range and hint_start/ > hint_end > > > * should be ignored) > > > I don't quite grok this. Is the intent that HINT will only be > set if > > an *entire* hint_start/hint_end range is invalidated by a single > > event? I.e., if only part of the hint_start/hint_end range is > > invalidated, you'll get the cookie back, but not what part of the > > range is invalid (because assumedly the entire IBV registration > is now > > invalid anyway)? > > Basically, I just keep one hint_start/hint_end. If multiple events > hit > the same registration then I just give up and don't give you a hint. > > > > * If type is LAST, then the read operation has emptied the > list of > > > * invalidated regions, and user_cookie_counter holds the value > of the > > > * kernel's generation counter when the empty list occurred. The > > > * other fields are not filled in for this event. > > > Just to be clear -- we're supposed to keep reading events until > we get > > a LAST event? 
> > Yes, that's probably the sanest use case. > > > 1. Will it increase by 1 each time a page (or set of pages?) is > > removed from a user process? > > As it stands it increases by 1 every time there is an MMU > notification, > even if that notification hits multiple registrations. It wouldn't be > hard to change that to count the number of events generated if that > works better. > > > 2. Does it change if pages are *added* to a user process? I.e., > does > > the counter indicate *removals* or *changes* to the user process > page > > table? > > No, additions don't trigger any MMU notification -- that's inherent in > the design of the MMU notifiers stuff. The idea is that you have a > "secondary MMU" and MMU notifications are the equivalent of TLB > shootdowns; the secondary MMU is responsible for populating itself on > faults etc. > > > Is the *unm_counter value guaranteed to have been changed by the > time > > munmap() returns? > > Yes. > > > Did you pick [2] here simply because you're only expecting an INVAL > > and a LAST event in this specific example? I'm assuming that we > > should normally loop over reading until we get LAST, correct? > > Right. > > > What happens if I register multiple regions with the same cookie > value? > > You get in trouble -- I need to fix things to reject duplicated > cookies > actually, because otherwise there's no way to unregister. > > > Is a process responsible for guaranteeing that it umn_unregister()s > > everything before exiting, or will all pending registrations be > > cleaned up/unregistered/whatever when a process exits? > > The kernel cleans up of course to handle crashes etc. > > - R. > -- Jeff Squyres Cisco Systems From swise at opengridcomputing.com Wed May 27 12:08:52 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 27 May 2009 14:08:52 -0500 Subject: [ofa-general] [PATCH] RDMA/cxgb3: Report correct port state and mtu. 
Message-ID: <20090527190852.16426.82898.stgit@build.ogc.int> Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 32 +++++++++++++++++++++++++-- 1 files changed, 30 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 160ef48..e2a6321 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -1152,12 +1153,39 @@ static int iwch_query_device(struct ib_device *ibdev, static int iwch_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props) { + struct iwch_dev *dev; + struct net_device *netdev; + struct in_device *inetdev; + PDBG("%s ibdev %p\n", __func__, ibdev); + dev = to_iwch_dev(ibdev); + netdev = dev->rdev.port_info.lldevs[port-1]; + memset(props, 0, sizeof(struct ib_port_attr)); props->max_mtu = IB_MTU_4096; - props->active_mtu = IB_MTU_2048; - props->state = IB_PORT_ACTIVE; + if (netdev->mtu >= 4096) + props->active_mtu = IB_MTU_4096; + else if (netdev->mtu >= 2048) + props->active_mtu = IB_MTU_2048; + else if (netdev->mtu >= 1024) + props->active_mtu = IB_MTU_1024; + else if (netdev->mtu >= 512) + props->active_mtu = IB_MTU_512; + else + props->active_mtu = IB_MTU_256; + + if (!netif_carrier_ok(netdev)) + props->state = IB_PORT_DOWN; + else { + inetdev = in_dev_get(netdev); + if (inetdev->ifa_list) + props->state = IB_PORT_ACTIVE; + else + props->state = IB_PORT_INIT; + in_dev_put(inetdev); + } + props->port_cap_flags = IB_PORT_CM_SUP | IB_PORT_SNMP_TUNNEL_SUP | From swise at opengridcomputing.com Wed May 27 12:35:42 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 27 May 2009 14:35:42 -0500 Subject: [ofa-general] [PATCH] RDMA/cxgb3: limit fastreg size based on T3 limitations. Message-ID: <20090527193542.24913.25649.stgit@build.ogc.int> T3 firmware only supports one WR's worth of page list. 
The driver currently allows 2 WR's worth, which doesn't work for T3. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_wr.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index ff9be1a..32e3b14 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -176,7 +176,7 @@ struct t3_send_wr { struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ }; -#define T3_MAX_FASTREG_DEPTH 24 +#define T3_MAX_FASTREG_DEPTH 10 #define T3_MAX_FASTREG_FRAG 10 struct t3_fastreg_wr { From valdes at anl.gov Wed May 27 13:07:11 2009 From: valdes at anl.gov (John Valdes) Date: Wed, 27 May 2009 15:07:11 -0500 Subject: [ofa-general] ib_mthca catastrophic error detected Message-ID: <20090527200711.GA7605@starfish.mcs.anl.gov> All, I had posted last week about SRP problems we've been having after upgrading from some old servers running RHEL 5.1 to new servers running RHEL 5.3. We're still trying to isolate the cause of the problems, but one of the symptoms we're seeing is that occasionally when under stress (well, if you can call doing a "dd" from /dev/zero to the SRP target stress...), the ib_mthca driver will report a "Catastrophic error": ib_mthca 0000:04:00.0: Catastrophic error detected: internal error host8: ib_srp: failed receive status 4 ib_srp: host8: add qp_in_err timer host8: ib_srp: failed receive status 5 ib_mthca 0000:04:00.0: buf[00]: 00000000 ib_mthca 0000:04:00.0: buf[01]: 00000000 ib_mthca 0000:04:00.0: buf[02]: 00000000 ib_mthca 0000:04:00.0: buf[03]: 00000000 ib_mthca 0000:04:00.0: buf[04]: 00000000 ib_mthca 0000:04:00.0: buf[05]: 00000000 ib_mthca 0000:04:00.0: buf[06]: 00000000 ib_mthca 0000:04:00.0: buf[07]: 00000000 ib_mthca 0000:04:00.0: buf[08]: 00000000 ib_mthca 0000:04:00.0: buf[09]: 00000000 ib_mthca 0000:04:00.0: buf[0a]: 00000000 ib_mthca 0000:04:00.0: buf[0b]: 00000000 ib_mthca 0000:04:00.0: buf[0c]: 00000000 ib_mthca 
0000:04:00.0: buf[0d]: 00000000 ib_mthca 0000:04:00.0: buf[0e]: 00000000 ib_mthca 0000:04:00.0: buf[0f]: 00000000 host8: ib_srp: srp_qp_in_err_timer called Checking back through the list archives, the consensus seems to be that these are due to card problems, usually with the firmware. We've never had this problem w/ the old servers under RHEL 5.1 w/ the bundled OFED 1.2, but maybe the new servers and/or the RHEL 5.3 w/ OFED 1.3 is pushing the card harder and/or tickling a bug in the firmware? The cards are Cisco branded Mellanox Cougar Cub cards; "tvflash -i" identifies them as: HCA #0: MT23108, Cougar Cub, revision A1 Primary image is v3.5.917 build 3.2.0.149, with label 'HCA.CougarCub.A1' Secondary image is v3.3.005 build 3.2.0.67, with label 'HCA.CougarCub.A1' Vital Product Data Product Name: Cougar cub P/N: SFS-HCA-X2T7-A1 E/C: Rev: A0 S/N: CS0636X00286 Freq/Power: PW=12W;PCI 66MHZ;PCI-X 133MHZ Date Code: 0636 Checksum: Ok Unfortunately, v3.5.917 seems to be the latest version of the firmware listed on Cisco's website, at least that I could find. Is anyone aware of any issues with this version of the firmware? John ---------------------------------------------------------------------- John Valdes Mathematics and Computer Science Division valdes at anl.gov Argonne National Laboratory From jphgross at gmail.com Wed May 27 13:07:20 2009 From: jphgross at gmail.com (Jason Gross) Date: Wed, 27 May 2009 13:07:20 -0700 Subject: [ofa-general] OFED 1.3.1 and 1.4.1 compatibility. Message-ID: <899683130905271307w70650706qf97b3374735a8e3b@mail.gmail.com> Hi All, We are planning to upgrade some of our nodes to OFED 1.4.1 while leaving a few at 1.3.1. Are there any known incompatibilities between the two versions, specifically for IPoIB or the Open SM? We have a custom application that uses IB verbs, which I expect shouldn't be affected. Thanks! Jason -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Wed May 27 14:36:01 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 14:36:01 -0700 Subject: [ofa-general] Memory registration redux In-Reply-To: (Roland Dreier's message of "Wed, 27 May 2009 11:13:34 -0700") References: <20090506214628.GM2590@obsidianresearch.com> <20090506222638.GA16280@obsidianresearch.com> <20090507000231.GB16280@obsidianresearch.com> <20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: Sigh... real version that returns EINVAL for an attempt to reuse a user cookie (since otherwise unregister would get confused). Previous posting was the old patch, sorry. === ummunot: Userspace support for MMU notifications As discussed in and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help. We solve this not by implementing the full API proposed in the email linked above but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunot, that creates a /dev/ummunot node. A userspace process can open this node read-only and use the fd as follows: 1. ioctl() to register/unregister an address range to watch in the kernel (cf struct ummunot_register_ioctl in ). 2. read() to retrieve events generated when a mapping in a watched address range is invalidated (cf struct ummunot_event in ). select()/poll()/epoll() and SIGIO are handled for this IO. 3. mmap() one page at offset 0 to map a kernel page that contains a generation counter that is incremented each time an event is generated. This allows userspace to have a fast path that checks that no events have occurred without a system call. 
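[Editorial illustration, not part of the original patch description.] Steps 1 and 3 above can be sketched from the userspace side as follows. This is a rough sketch under stated assumptions: the ioctl structure and command number mirror the header posted in this thread, the `end` field is assumed to be exclusive (which the overlap test in ummunot_handle_not() suggests but the posting does not state), and struct umn_cache together with the helpers umn_fill_register, umn_register and umn_cache_maybe_stale are hypothetical names invented for this example. The counter pointer is assumed to come from an mmap() of one page at offset 0 of the opened /dev/ummunot fd.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local mirror of the registration ABI in the patch's include/linux/ummunot.h. */
#define UMMUNOT_INTF_VERSION 1

struct ummunot_register_ioctl {
	uint32_t intf_version;	/* in */
	uint32_t reserved1;
	uint64_t start;		/* in */
	uint64_t end;		/* in */
	uint64_t user_cookie;	/* in */
};

#define UMMUNOT_MAGIC 'U'
#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \
				      struct ummunot_register_ioctl)

/* Fill a request for [addr, addr + len); end is assumed to be exclusive. */
static void umn_fill_register(struct ummunot_register_ioctl *r,
			      const void *addr, size_t len, uint64_t cookie)
{
	memset(r, 0, sizeof *r);
	r->intf_version = UMMUNOT_INTF_VERSION;
	r->start = (uintptr_t) addr;
	r->end = r->start + len;
	r->user_cookie = cookie;
}

/* Step 1: register a range on an fd obtained by opening /dev/ummunot. */
int umn_register(int fd, const void *addr, size_t len, uint64_t cookie)
{
	struct ummunot_register_ioctl r;

	umn_fill_register(&r, addr, len, cookie);
	return ioctl(fd, UMMUNOT_REGISTER_REGION, &r);
}

/* Hypothetical cache-side state for the step 3 fast path. */
struct umn_cache {
	const volatile uint64_t *counter; /* page mmap()ed from the fd */
	uint64_t seen;			  /* generation at the last drain */
};

/*
 * Fast path: no system call. Nonzero means at least one event was
 * generated since the last drain, so the caller should read() events.
 */
static int umn_cache_maybe_stale(const struct umn_cache *c)
{
	return *c->counter != c->seen;
}
```

The design intent described above is that a registration cache checks umn_cache_maybe_stale() before every cache hit and only falls back to read() when the counter has moved.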
Signed-off-by: Roland Dreier --- drivers/char/Kconfig | 12 ++ drivers/char/Makefile | 1 + drivers/char/ummunot.c | 469 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ummunot.h | 85 +++++++++ 4 files changed, 567 insertions(+), 0 deletions(-) create mode 100644 drivers/char/ummunot.c create mode 100644 include/linux/ummunot.h diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..91fe068 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,18 @@ config DEVPORT depends on ISA || PCI default y +config UMMUNOT + tristate "Userspace MMU notifications" + select MMU_NOTIFIER + help + The ummunot (userspace MMU notification) driver creates a + character device that can be used by userspace libraries to + get notifications when an application's memory mapping + changed. This is used, for example, by RDMA libraries to + improve the reliability of memory registration caching, since + the kernel's MMU notifications can be used to know precisely + when to shoot down a cached registration. + source "drivers/s390/char/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..dcbcd7c 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o obj-$(CONFIG_GPIO_VR41XX) += vr41xx_giu.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_UMMUNOT) += ummunot.o obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c new file mode 100644 index 0000000..ebfd038 --- /dev/null +++ b/drivers/char/ummunot.c @@ -0,0 +1,469 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Userspace MMU notifiers"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UMMUNOT_FLAG_DIRTY = 1, + UMMUNOT_FLAG_HINT = 2, +}; + +struct ummunot_reg { + u64 user_cookie; + unsigned long start; + unsigned long end; + unsigned long hint_start; + unsigned long hint_end; + unsigned long flags; + struct rb_node node; + struct list_head list; +}; + +struct ummunot_file { + struct mmu_notifier mmu_notifier; + struct mm_struct *mm; + struct rb_root reg_tree; + struct list_head dirty_list; + u64 *counter; + spinlock_t lock; + wait_queue_head_t read_wait; + struct fasync_struct *async_queue; +}; + +static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn) +{ + return container_of(mn, struct ummunot_file, mmu_notifier); +} + +static void ummunot_handle_not(struct mmu_notifier *mn, + unsigned long start, unsigned long end) +{ + struct ummunot_file *priv = to_ummunot_file(mn); + struct rb_node *n; + struct ummunot_reg *reg; + unsigned long flags; + int hit = 0; + + spin_lock_irqsave(&priv->lock, flags); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->start >= end) + break; + + if ((reg->start <= start && reg->end > start) || + (reg->start <= end && reg->end > end)) { + hit = 1; + + if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_add_tail(&reg->list, &priv->dirty_list); + + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + clear_bit(UMMUNOT_FLAG_HINT, &reg->flags); + } else { + set_bit(UMMUNOT_FLAG_HINT, &reg->flags); + reg->hint_start = start; + reg->hint_end = end; + } + } + } + + if (hit) { + ++(*priv->counter); + flush_dcache_page(virt_to_page(priv->counter)); + wake_up_interruptible(&priv->read_wait); + kill_fasync(&priv->async_queue, SIGIO, POLL_IN); + } + + 
spin_unlock_irqrestore(&priv->lock, flags); +} + +static void ummunot_inval_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long addr) +{ + ummunot_handle_not(mn, addr, addr + PAGE_SIZE); +} + +static void ummunot_inval_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + ummunot_handle_not(mn, start, end); +} + +static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = { + .invalidate_page = ummunot_inval_page, + .invalidate_range_start = ummunot_inval_range_start, +}; + +static int ummunot_open(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv; + int ret; + + if (filp->f_mode & FMODE_WRITE) + return -EINVAL; + + priv = kmalloc(sizeof *priv, GFP_KERNEL); + if (!priv) + return -ENOMEM; + + priv->counter = (void *) get_zeroed_page(GFP_KERNEL); + if (!priv->counter) { + ret = -ENOMEM; + goto err; + } + + priv->reg_tree = RB_ROOT; + INIT_LIST_HEAD(&priv->dirty_list); + spin_lock_init(&priv->lock); + init_waitqueue_head(&priv->read_wait); + priv->async_queue = NULL; + + priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops; + /* + * Register notifier last, since notifications can occur as + * soon as we register.... 
+ */ + ret = mmu_notifier_register(&priv->mmu_notifier, current->mm); + if (ret) + goto err_page; + + priv->mm = current->mm; + atomic_inc(&priv->mm->mm_count); + + filp->private_data = priv; + + return 0; + +err_page: + free_page((unsigned long) priv->counter); + +err: + kfree(priv); + return ret; +} + +static int ummunot_close(struct inode *inode, struct file *filp) +{ + struct ummunot_file *priv = filp->private_data; + struct rb_node *n; + struct ummunot_reg *reg; + + mmu_notifier_unregister(&priv->mmu_notifier, priv->mm); + mmdrop(priv->mm); + free_page((unsigned long) priv->counter); + + for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + rb_erase(n, &priv->reg_tree); + kfree(reg); + } + + kfree(priv); + + return 0; +} + +static ssize_t ummunot_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ummunot_file *priv = filp->private_data; + struct ummunot_reg *reg; + ssize_t ret; + struct ummunot_event *events; + int max; + int n; + + events = (void *) get_zeroed_page(GFP_KERNEL); + if (!events) { + ret = -ENOMEM; + goto out; + } + + spin_lock_irq(&priv->lock); + + while (list_empty(&priv->dirty_list)) { + spin_unlock_irq(&priv->lock); + + if (filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + goto out; + } + + if (wait_event_interruptible(priv->read_wait, + !list_empty(&priv->dirty_list))) { + ret = -ERESTARTSYS; + goto out; + } + + spin_lock_irq(&priv->lock); + } + + max = min(PAGE_SIZE, count) / sizeof *events; + + for (n = 0; n < max; ++n) { + if (list_empty(&priv->dirty_list)) { + events[n].type = UMMUNOT_EVENT_TYPE_LAST; + events[n].user_cookie_counter = *priv->counter; + ++n; + break; + } + + reg = list_first_entry(&priv->dirty_list, struct ummunot_reg, + list); + + events[n].type = UMMUNOT_EVENT_TYPE_INVAL; + if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) { + events[n].flags = UMMUNOT_EVENT_FLAG_HINT; + events[n].hint_start = reg->hint_start; + events[n].hint_end = 
reg->hint_end; + } + events[n].user_cookie_counter = reg->user_cookie; + + list_del(&reg->list); + reg->flags = 0; + } + + spin_unlock_irq(&priv->lock); + + if (copy_to_user(buf, events, n * sizeof *events)) + ret = -EFAULT; + else + ret = n * sizeof *events; + +out: + free_page((unsigned long) events); + return ret; +} + +static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ummunot_file *priv = filp->private_data; + + poll_wait(filp, &priv->read_wait, wait); + + return list_empty(&priv->dirty_list) ? 0 : (POLLIN | POLLRDNORM); +} + +static long ummunot_register_region(struct ummunot_file *priv, + struct ummunot_register_ioctl __user *arg) +{ + struct ummunot_register_ioctl parm; + struct ummunot_reg *reg, *treg; + struct rb_node **n = &priv->reg_tree.rb_node; + struct rb_node *pn; + int ret = 0; + + if (copy_from_user(&parm, arg, sizeof parm)) + return -EFAULT; + + if (parm.intf_version != UMMUNOT_INTF_VERSION) + return -EINVAL; + + reg = kmalloc(sizeof *reg, GFP_KERNEL); + if (!reg) + return -ENOMEM; + + reg->user_cookie = parm.user_cookie; + reg->start = parm.start; + reg->end = parm.end; + reg->flags = 0; + + spin_lock_irq(&priv->lock); + + for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) { + reg = rb_entry(pn, struct ummunot_reg, node); + + if (reg->user_cookie == parm.user_cookie) { + ret = -EINVAL; + goto out; + } + } + + pn = NULL; + while (*n) { + treg = rb_entry(pn, struct ummunot_reg, node); + pn = *n; + if (reg->start <= treg->start) + n = &pn->rb_left; + else + n = &pn->rb_right; + } + + rb_link_node(&reg->node, pn, n); + rb_insert_color(&reg->node, &priv->reg_tree); + +out: + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunot_unregister_region(struct ummunot_file *priv, + __u64 __user *arg) +{ + u64 user_cookie; + struct rb_node *n; + struct ummunot_reg *reg; + int ret = -EINVAL; + + if (get_user(user_cookie, arg)) + return -EFAULT; + + spin_lock_irq(&priv->lock); + + for (n = 
rb_first(&priv->reg_tree); n; n = rb_next(n)) { + reg = rb_entry(n, struct ummunot_reg, node); + + if (reg->user_cookie == user_cookie) { + rb_erase(n, &priv->reg_tree); + if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags)) + list_del(&reg->list); + kfree(reg); + ret = 0; + break; + } + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +static long ummunot_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + struct ummunot_file *priv = filp->private_data; + void __user *argp = (void __user *) arg; + + switch (cmd) { + case UMMUNOT_REGISTER_REGION: + return ummunot_register_region(priv, argp); + case UMMUNOT_UNREGISTER_REGION: + return ummunot_unregister_region(priv, argp); + default: + return -ENOIOCTLCMD; + } +} + +static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct ummunot_file *priv = vma->vm_private_data; + + if (vmf->pgoff != 0) + return VM_FAULT_SIGBUS; + + vmf->page = virt_to_page(priv->counter); + get_page(vmf->page); + + return 0; + +} + +static struct vm_operations_struct ummunot_vm_ops = { + .fault = ummunot_fault, +}; + +static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ummunot_file *priv = filp->private_data; + + if (vma->vm_end - vma->vm_start != PAGE_SIZE || + vma->vm_pgoff != 0) + return -EINVAL; + + vma->vm_ops = &ummunot_vm_ops; + vma->vm_private_data = priv; + + return 0; +} + +static int ummunot_fasync(int fd, struct file *filp, int on) +{ + struct ummunot_file *priv = filp->private_data; + + return fasync_helper(fd, filp, on, &priv->async_queue); +} + +static const struct file_operations ummunot_fops = { + .owner = THIS_MODULE, + .open = ummunot_open, + .release = ummunot_close, + .read = ummunot_read, + .poll = ummunot_poll, + .unlocked_ioctl = ummunot_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ummunot_ioctl, +#endif + .mmap = ummunot_mmap, + .fasync = ummunot_fasync, +}; + +static struct miscdevice ummunot_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = 
"ummunot", + .fops = &ummunot_fops, +}; + +static int __init ummunot_init(void) +{ + return misc_register(&ummunot_misc); +} + +static void __exit ummunot_cleanup(void) +{ + misc_deregister(&ummunot_misc); +} + +module_init(ummunot_init); +module_exit(ummunot_cleanup); diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h new file mode 100644 index 0000000..e1abd89 --- /dev/null +++ b/include/linux/ummunot.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenFabrics BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef _LINUX_UMMUNOT_H +#define _LINUX_UMMUNOT_H + +#include +#include + +#define UMMUNOT_INTF_VERSION 1 + +enum { + UMMUNOT_EVENT_TYPE_INVAL = 0, + UMMUNOT_EVENT_TYPE_LAST = 1, +}; + +enum { + UMMUNOT_EVENT_FLAG_HINT = 1 << 0, +}; + +/* + * If type field is INVAL, then user_cookie_counter holds the + * user_cookie for the region being reported; if the HINT flag is set + * then hint_start/hint_end hold the start and end of the mapping that + * was invalidated. (If HINT is not set, then multiple events + * invalidated parts of the registered range and hint_start/hint_end + * should be ignored) + * + * If type is LAST, then the read operation has emptied the list of + * invalidated regions, and user_cookie_counter holds the value of the + * kernel's generation counter when the empty list occurred. The + * other fields are not filled in for this event. + */ +struct ummunot_event { + __u32 type; + __u32 flags; + __u64 hint_start; + __u64 hint_end; + __u64 user_cookie_counter; +}; + +struct ummunot_register_ioctl { + __u32 intf_version; /* in */ + __u32 reserved1; + __u64 start; /* in */ + __u64 end; /* in */ + __u64 user_cookie; /* in */ +}; + +#define UMMUNOT_MAGIC 'U' + +#define UMMUNOT_REGISTER_REGION _IOWR(UMMUNOT_MAGIC, 1, \ + struct ummunot_register_ioctl) +#define UMMUNOT_UNREGISTER_REGION _IOW(UMMUNOT_MAGIC, 2, __u64) + +#endif /* _LINUX_UMMUNOT_H */ -- 1.6.0.4 From rdreier at cisco.com Wed May 27 14:39:10 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 14:39:10 -0700 Subject: [ofa-general] Re: [PATCH 2/2] ib_mthca: Use module parameter for number of MTTs per segment In-Reply-To: <20090518085551.GA16106@mtls03> (Eli Cohen's message of "Mon, 18 May 2009 11:55:51 +0300") References: <20090518085551.GA16106@mtls03> Message-ID: Sigh... unfortunate to add a tunable that people have to mess with rather than just making things work automatically somehow. Anyway, applied these. 
From rdreier at cisco.com Wed May 27 14:42:53 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 14:42:53 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Report correct port state and mtu. In-Reply-To: <20090527190852.16426.82898.stgit@build.ogc.int> (Steve Wise's message of "Wed, 27 May 2009 14:08:52 -0500") References: <20090527190852.16426.82898.stgit@build.ogc.int> Message-ID: OK, applied. Would be nice if we had a better way to report MTU, but whatever... From rdreier at cisco.com Wed May 27 14:42:57 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 27 May 2009 14:42:57 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: limit fastreg size based on T3 limitations. In-Reply-To: <20090527193542.24913.25649.stgit@build.ogc.int> (Steve Wise's message of "Wed, 27 May 2009 14:35:42 -0500") References: <20090527193542.24913.25649.stgit@build.ogc.int> Message-ID: thanks, applied. From akepner at sgi.com Wed May 27 16:27:21 2009 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 27 May 2009 16:27:21 -0700 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A18D704.1020000@voltaire.com> References: <20090519215505.GN6837@sgi.com> <20090520213703.GT6837@sgi.com> <4A14E766.1010005@voltaire.com> <20090521193910.GX6837@sgi.com> <4A18D704.1020000@voltaire.com> Message-ID: <20090527232721.GD5819@sgi.com> On Sun, May 24, 2009 at 08:11:32AM +0300, Or Gerlitz wrote: > ... how come a neigh cleanup > callback is invoked when someone out there has a ref on the neighbour? Don't know if you saw all of this thread, but in: http://lists.openfabrics.org/pipermail/general/2009-May/059730.html I mentioned a race between a tx completion (with an error) and ipoib_neigh_cleanup(), which could happen even if the callback is invoked at the correct time (as far as the neighbour code is concerned). > ... 
> also I'd like to clarify with you if the rest of this thread applies > only to 2.6.16 and possibly more old kernels, or to the current mainline > bits? > Although I've only seen the bug with 2.6.16 vintage kernels (and maybe only once) , I think it's still possible in the latest code via the mechanism I mentioned above (and maybe other ways, too). The best idea I've got so far is to use a new set of locks to consistently read/write the struct ipoib_neigh pointer that's stashed away in the neighbour structures. -- Arthur From abenjamin at sgi.com Wed May 27 18:41:28 2009 From: abenjamin at sgi.com (Arputham Benjamin) Date: Wed, 27 May 2009 18:41:28 -0700 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> Message-ID: <4A1DEBC8.9050207@sgi.com> Roland Dreier wrote: > > Don't we need both /sys/devices/... and /proc/interrupts? > > Not sure what you mean. If we put msi-x info under /sys, then you can > figure out which interrupts belong to a given HCA by following the > device link from /sys/class/infiniband. Similarly if /proc/interrupts > gives the PCI device, then you have the same ability. So either way > works as far as I can tell. Linux is supposed to move away from procfs to sysfs for this type of device related info. However, /proc/interrupts is still present in the latest distro releases (for example, SLES11) and OFED needs to provide support for this in procfs until the /proc/interrupts support is removed from kernel. 
Also, OFED implementation needs to be consistent with other Ethernet device drivers present in a system as OFED includes both Infiniband and Ethernet functions( for example, ConnectX) I wanted to summarize what we had discussed so far. 1) Enhance sysfs to include info. found in /proc/interrupts: I have not seen full sysfs support for Ethernet devices . I have seen IRQ number info but no interrupt counters on a per CPU basis. Do we know when the full support for ethernet devices will be available in sysfs? We can enhance OFED at the same time ethernet support is made available in the kernel. 2) Use PCI ID in /proc/interrupts: I have not seen Ethernet devices follow this convention. Also, OFED tools currently use the device name convention mthcaX, mlx4_X etc. Therefore, this approach is not preferable for consistency reason. As an alternative to #2, 3) We can add dev_alloc_name() functionality to mlx4_core similar to alloc_name() present in ib_core. This is consistent with other ethernet device driver implementations using the function dev_alloc_name() present in the kernel. (Please see .../net/core/dev.c) Any objection for going with #3? Regards, Benjamin From He.Huang at Sun.COM Wed May 27 21:33:19 2009 From: He.Huang at Sun.COM (Isaac Huang) Date: Thu, 28 May 2009 00:33:19 -0400 Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state In-Reply-To: <15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com> References: <20090526200346.GQ4239@sun.com> <15ddcffd0905261322t3df149fuaf27ebcd17a8ac46@mail.gmail.com> Message-ID: <20090528043319.GW4494@sun.com> On Tue, May 26, 2009 at 11:22:24PM +0300, Or Gerlitz wrote: > On Tue, May 26, 2009 at 11:03 PM, Isaac Huang <[1]He.Huang at sun.com> > wrote: > > If rdma_destroy_qp is called on a QP before it exits the TimeWait > state (i.e. 
after RDMA_CM_EVENT_DISCONNECTED but before > RDMA_CM_EVENT_TIMEWAIT_EXIT), is it possible that a subsequent > rdma_create_qp would reuse the same QP while it's still in TimeWait? > > YES - as rdma_destroy/create_qp are basically wrappers to > ib_destroy/create_qp and the latter two are not aware by any means to > the QP state from the CM point of view. Thanks, they should probably be called CM ID states instead of QP states. Isaac From He.Huang at Sun.COM Wed May 27 21:43:35 2009 From: He.Huang at Sun.COM (Isaac Huang) Date: Thu, 28 May 2009 00:43:35 -0400 Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state In-Reply-To: <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com> References: <20090526200346.GQ4239@sun.com> <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com> Message-ID: <20090528044335.GX4494@sun.com> On Tue, May 26, 2009 at 01:43:25PM -0700, Sean Hefty wrote: > >In 12.9.6 of the Infiniband Architecture v1.2, it seemed that a QP > >could enter the TimeWait state without having entered the Established > >state first, via the RTU timeout. Could a RDMA_CM_EVENT_TIMEWAIT_EXIT > >happen right after a RDMA_CM_EVENT_CONNECT_REQUEST without a > >RDMA_CM_EVENT_ESTABLISHED? If yes, our ULP would have to cleanup some > >resources in case RDMA_CM_EVENT_TIMEWAIT_EXIT happens on passive side. > > Yes, it's possible to enter timewait without going through established. I'd > have to walk through the code at this point to identify all of the cases. Thanks, I followed cm_enter_timewait() call sites and found that it could be entered via several paths without going through IB_CM_ESTABLISHED. > Note that a lot (most?) connections between QPs are established out of band > using TCP, and these are not tracked by the CM or go through any sort of > timewait before potentially being reused. I don't quite understand this. Could you please point me to places (code, IB spec, so on) where I could poke around? 
Thanks, Isaac From sean.hefty at intel.com Wed May 27 22:50:22 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 27 May 2009 22:50:22 -0700 Subject: [ofa-general] two questions about RDMA_CM_EVENT_TIMEWAIT_EXIT and the TimeWait state In-Reply-To: <20090528044335.GX4494@sun.com> References: <20090526200346.GQ4239@sun.com> <93D6D1D3B9C94280A5277CE07A98663C@amr.corp.intel.com> <20090528044335.GX4494@sun.com> Message-ID: <51A9484D90DE44DFA7A3F6F31A2E6D94@amr.corp.intel.com> >> Note that a lot (most?) connections between QPs are established out of band >> using TCP, and these are not tracked by the CM or go through any sort of >> timewait before potentially being reused. > >I don't quite understand this. Could you please point me to places >(code, IB spec, so on) where I could poke around? MPIs typically connect QPs by connecting over sockets and exchanging the QP information that way. The QPs are then modified directly using a combination of locally read and hard-coded values. The libibverbs examples along with the perftest programs can connect QPs in this fashion. From pashash at gmail.com Thu May 28 00:09:42 2009 From: pashash at gmail.com (Pavel Shamis (Pasha)) Date: Thu, 28 May 2009 10:09:42 +0300 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com><20090506222638.GA16280@obsidianresearch.com><20090507000231.GB16280@obsidianresearch.com><20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: <4A1E38B6.70305@dev.mellanox.co.il> Sounds good to me, Jeff Squyres wrote: > Other MPI implementors -- what do you think of this scheme? 
> > > On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote: > >> >> > > /* >> > > * If type field is INVAL, then user_cookie_counter holds the >> > > * user_cookie for the region being reported; if the HINT flag >> is set >> > > * then hint_start/hint_end hold the start and end of the >> mapping that >> > > * was invalidated. (If HINT is not set, then multiple events >> > > * invalidated parts of the registered range and >> hint_start/hint_end >> > > * should be ignored) >> >> > I don't quite grok this. Is the intent that HINT will only be set if >> > an *entire* hint_start/hint_end range is invalidated by a single >> > event? I.e., if only part of the hint_start/hint_end range is >> > invalidated, you'll get the cookie back, but not what part of the >> > range is invalid (because assumedly the entire IBV registration is >> now >> > invalid anyway)? >> >> Basically, I just keep one hint_start/hint_end. If multiple events hit >> the same registration then I just give up and don't give you a hint. >> >> > > * If type is LAST, then the read operation has emptied the list of >> > > * invalidated regions, and user_cookie_counter holds the value >> of the >> > > * kernel's generation counter when the empty list occurred. The >> > > * other fields are not filled in for this event. >> >> > Just to be clear -- we're supposed to keep reading events until we >> get >> > a LAST event? >> >> Yes, that's probably the sanest use case. >> >> > 1. Will it increase by 1 each time a page (or set of pages?) is >> > removed from a user process? >> >> As it stands it increases by 1 every time there is an MMU notification, >> even if that notification hits multiple registrations. It wouldn't be >> hard to change that to count the number of events generated if that >> works better. >> >> > 2. Does it change if pages are *added* to a user process? I.e., does >> > the counter indicate *removals* or *changes* to the user process page >> > table? 
>> >> No, additions don't trigger any MMU notification -- that's inherent in >> the design of the MMU notifiers stuff. The idea is that you have a >> "secondary MMU" and MMU notifications are the equivalent of TLB >> shootdowns; the secondary MMU is responsible for populating itself on >> faults etc. >> >> > Is the *unm_counter value guaranteed to have been changed by the time >> > munmap() returns? >> >> Yes. >> >> > Did you pick [2] here simply because you're only expecting an INVAL >> > and a LAST event in this specific example? I'm assuming that we >> > should normally loop over reading until we get LAST, correct? >> >> Right. >> >> > What happens if I register multiple regions with the same cookie >> value? >> >> You get in trouble -- I need to fix things to reject duplicated cookies >> actually, because otherwise there's no way to unregister. >> >> > Is a process responsible for guaranteeing that it umn_unregister()s >> > everything before exiting, or will all pending registrations be >> > cleaned up/unregistered/whatever when a process exits? >> >> The kernel cleans up of course to handle crashes etc. >> >> - R. >> > > From sebastien.dugue at bull.net Thu May 28 01:20:59 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Thu, 28 May 2009 10:20:59 +0200 Subject: [ofa-general] [PATCH 0/3] V2 - libmthca libmlx4 - Optimize memory allocation of QP buffers Message-ID: <20090528102059.2fd85540@frecb007965> Hi, here is a re-spin of the QP buffers memory allocation optimization patches in which QP buffers are allocated using mmap() regardless of the page size. Changes V1 -> V2: ---------------- - Use mmap whatever the page size, not only with 64K pages. libmthca and libmlx4 allocate QP buffers using posix_memalign(), which results in big memory wastage on architectures with 64K pages. Replacing posix_memalign() with mmap() allows to fix this (more description in the patches themselves). 
Now, for some numbers, a micro benchmark I wrote shows the heap usage and the number of mmaped pages used with posix_memalign() and mmap() respectively for 1000, 2000, up to 8000 QP.

MTHCA

         posix_memalign               mmap
  QP       heap  mmaped(pages)        heap  mmaped(pages)
1000     838736           2988      576512           1000
2000    1751216           5973     1161264           2000
3000    2598144           8961     1746016           3000
4000    3510656          11946     2330704           4000
5000    4357616          14934     2915440           5000
6000    5270080          17919     3500176           6000
7000    6117056          20907     4084912           7000
8000    6963968          23895     4669632           8000

MLX4

         posix_memalign               mmap
  QP       heap  mmaped(pages)        heap  mmaped(pages)
1000    1469424           2982     1010544           1003
2000    2994048           5958     2010752           2003
3000    4518672           8934     3010960           3003
4000    5969520          11913     4002960           4003
5000    7494176          14889     5003168           5003
6000    8953248          17868     6003376           6003
7000   10477856          20844     7003584           7003
8000   12002496          23820     8003792           8003

This patchset consists in 3 patches:
1. Optimize memory allocation of QP buffers for libmthca
2. Optimize memory allocation of QP buffers for libmlx4
3. Fix the fixes patches for libmlx4 after having applied the previous patch.

Sebastien Dugue

From sebastien.dugue at bull.net Thu May 28 01:24:26 2009
From: sebastien.dugue at bull.net (sebastien dugue)
Date: Thu, 28 May 2009 10:24:26 +0200
Subject: [ofa-general] [PATCH 3/3] V2 - libmlx4 - Fix fixes after QP buffers alloc optimization patch to allow build
In-Reply-To: <20090528102059.2fd85540@frecb007965>
References: <20090528102059.2fd85540@frecb007965>
Message-ID: <20090528102426.2ea0d35e@frecb007965>

The patches in 'fixes/' need to be refreshed after the previous patch in order to build properly.
Signed-off-by: Sebastien Dugue --- fixes/lim_qp_resources.patch | 20 ++++------- fixes/resize_cq_owner_bit.patch | 4 +-- fixes/userspace_dev_lims.patch | 12 ++---- fixes/xrc_consolidated_v2.patch | 68 ++++++++++++++------------------------ fixes/xrc_fix_close_domain.patch | 8 ++--- fixes/xrc_rcv_qp_v2.patch | 12 ++----- 6 files changed, 44 insertions(+), 80 deletions(-) diff --git a/fixes/lim_qp_resources.patch b/fixes/lim_qp_resources.patch index 1f89256..54cc63e 100644 --- a/fixes/lim_qp_resources.patch +++ b/fixes/lim_qp_resources.patch @@ -7,11 +7,9 @@ qp creation also lie within the reported device limits. Signed-off-by: Jack Morgenstein -Index: libmlx4/src/qp.c -=================================================================== ---- libmlx4.orig/src/qp.c 2008-06-04 08:24:45.000000000 +0300 -+++ libmlx4/src/qp.c 2008-06-04 08:24:49.000000000 +0300 -@@ -619,6 +619,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q +--- a/src/qp.c ++++ b/src/qp.c +@@ -622,6 +622,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q enum ibv_qp_type type) { int wqe_size; @@ -19,7 +17,7 @@ Index: libmlx4/src/qp.c wqe_size = (1 << qp->sq.wqe_shift) - sizeof (struct mlx4_wqe_ctrl_seg); switch (type) { -@@ -636,8 +637,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q +@@ -639,8 +640,9 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q } qp->sq.max_gs = wqe_size / sizeof (struct mlx4_wqe_data_seg); @@ -31,10 +29,8 @@ Index: libmlx4/src/qp.c cap->max_send_wr = qp->sq.max_post; /* -Index: libmlx4/src/verbs.c -=================================================================== ---- libmlx4.orig/src/verbs.c 2008-06-04 08:24:45.000000000 +0300 -+++ libmlx4/src/verbs.c 2008-06-04 08:24:49.000000000 +0300 +--- a/src/verbs.c ++++ b/src/verbs.c @@ -390,12 +390,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv struct ibv_create_qp_resp resp; struct mlx4_qp *qp; @@ -54,9 +50,9 @@ Index: libmlx4/src/verbs.c attr->cap.max_inline_data > 1024) return NULL; -@@ -461,8 +463,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv - if 
(ret) +@@ -464,8 +466,14 @@ struct ibv_qp *mlx4_create_qp(struct ibv goto err_destroy; + pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex); - qp->rq.wqe_cnt = qp->rq.max_post = attr->cap.max_recv_wr; + qp->rq.wqe_cnt = attr->cap.max_recv_wr; diff --git a/fixes/resize_cq_owner_bit.patch b/fixes/resize_cq_owner_bit.patch index 6557027..0a5b564 100644 --- a/fixes/resize_cq_owner_bit.patch +++ b/fixes/resize_cq_owner_bit.patch @@ -3,11 +3,9 @@ for the target buffer (and not left as it was in the source buffer). Signed-off-by: Jack Morgenstein -diff --git a/src/cq.c b/src/cq.c -index 68e16e9..8226b6b 100644 --- a/src/cq.c +++ b/src/cq.c -@@ -455,6 +455,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx4_cq *cq, void *buf, int old_cqe) +@@ -478,6 +478,8 @@ void mlx4_cq_resize_copy_cqes(struct mlx cqe = get_cqe(cq, (i & old_cqe)); while ((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) != MLX4_CQE_OPCODE_RESIZE) { diff --git a/fixes/userspace_dev_lims.patch b/fixes/userspace_dev_lims.patch index 07cf638..80d4d14 100644 --- a/fixes/userspace_dev_lims.patch +++ b/fixes/userspace_dev_lims.patch @@ -9,10 +9,8 @@ preferable to breaking the ABI. 
Signed-off-by: Jack Morgenstein -Index: libmlx4/src/mlx4.c -=================================================================== ---- libmlx4.orig/src/mlx4.c 2008-06-03 15:45:18.000000000 +0300 -+++ libmlx4/src/mlx4.c 2008-06-04 08:24:10.000000000 +0300 +--- a/src/mlx4.c ++++ b/src/mlx4.c @@ -104,6 +104,7 @@ static struct ibv_context *mlx4_alloc_co struct ibv_get_context cmd; struct mlx4_alloc_ucontext_resp resp; @@ -42,10 +40,8 @@ Index: libmlx4/src/mlx4.c err_free: free(context); return NULL; -Index: libmlx4/src/mlx4.h -=================================================================== ---- libmlx4.orig/src/mlx4.h 2008-06-03 15:45:18.000000000 +0300 -+++ libmlx4/src/mlx4.h 2008-06-04 08:24:10.000000000 +0300 +--- a/src/mlx4.h ++++ b/src/mlx4.h @@ -83,6 +83,20 @@ #define PFX "mlx4: " diff --git a/fixes/xrc_consolidated_v2.patch b/fixes/xrc_consolidated_v2.patch index 6fbd0a9..78a4f6c 100644 --- a/fixes/xrc_consolidated_v2.patch +++ b/fixes/xrc_consolidated_v2.patch @@ -18,8 +18,6 @@ V2: 2. Changed xrc_ops to more ops 3. 
Check for xrc verbs in ibv_more_ops via AC_CHECK_MEMBER -diff --git a/configure.in b/configure.in -index 25f27f7..46a3a64 100644 --- a/configure.in +++ b/configure.in @@ -42,6 +42,12 @@ AC_CHECK_HEADER(valgrind/memcheck.h, @@ -35,11 +33,9 @@ index 25f27f7..46a3a64 100644 dnl Checks for library functions AC_CHECK_FUNC(ibv_read_sysfs_file, [], -diff --git a/src/cq.c b/src/cq.c -index 68e16e9..c598b87 100644 --- a/src/cq.c +++ b/src/cq.c -@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -194,8 +194,9 @@ static int mlx4_poll_one(struct mlx4_cq { struct mlx4_wq *wq; struct mlx4_cqe *cqe; @@ -50,7 +46,7 @@ index 68e16e9..c598b87 100644 uint32_t g_mlpath_rqpn; uint16_t wqe_index; int is_error; -@@ -221,20 +223,29 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -221,20 +222,29 @@ static int mlx4_poll_one(struct mlx4_cq is_error = (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR; @@ -84,7 +80,7 @@ index 68e16e9..c598b87 100644 if (is_send) { wq = &(*cur_qp)->sq; -@@ -242,6 +254,10 @@ static int mlx4_poll_one(struct mlx4_cq *cq, +@@ -242,6 +252,10 @@ static int mlx4_poll_one(struct mlx4_cq wq->tail += (uint16_t) (wqe_index - (uint16_t) wq->tail); wc->wr_id = wq->wrid[wq->tail & (wq->wqe_cnt - 1)]; ++wq->tail; @@ -95,7 +91,7 @@ index 68e16e9..c598b87 100644 } else if ((*cur_qp)->ibv_qp.srq) { srq = to_msrq((*cur_qp)->ibv_qp.srq); wqe_index = htons(cqe->wqe_index); -@@ -387,6 +403,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) +@@ -387,6 +401,10 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t prod_index; uint8_t owner_bit; int nfreed = 0; @@ -106,7 +102,7 @@ index 68e16e9..c598b87 100644 /* * First we need to find the current producer index, so we -@@ -405,7 +425,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq) +@@ -405,7 +423,12 @@ void __mlx4_cq_clean(struct mlx4_cq *cq, */ while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, 
prod_index & cq->ibv_cq.cqe); @@ -120,8 +116,6 @@ index 68e16e9..c598b87 100644 if (srq && !(cqe->owner_sr_opcode & MLX4_CQE_IS_SEND_MASK)) mlx4_free_srq_wqe(srq, ntohs(cqe->wqe_index)); ++nfreed; -diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h -index 20a40c9..1b1253c 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -68,6 +68,14 @@ struct mlx4_resize_cq { @@ -152,8 +146,6 @@ index 20a40c9..1b1253c 100644 +#endif + #endif /* MLX4_ABI_H */ -diff --git a/src/mlx4.c b/src/mlx4.c -index 671e849..27ca75d 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -68,6 +68,16 @@ struct { @@ -173,7 +165,7 @@ index 671e849..27ca75d 100644 static struct ibv_context_ops mlx4_ctx_ops = { .query_device = mlx4_query_device, .query_port = mlx4_query_port, -@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ +@@ -124,6 +134,15 @@ static struct ibv_context *mlx4_alloc_co for (i = 0; i < MLX4_QP_TABLE_SIZE; ++i) context->qp_table[i].refcnt = 0; @@ -189,7 +181,7 @@ index 671e849..27ca75d 100644 for (i = 0; i < MLX4_NUM_DB_TYPE; ++i) context->db_list[i] = NULL; -@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_context(struct ibv_device *ibdev, int cmd_ +@@ -156,6 +175,9 @@ static struct ibv_context *mlx4_alloc_co pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE); context->ibv_ctx.ops = mlx4_ctx_ops; @@ -199,8 +191,6 @@ index 671e849..27ca75d 100644 if (mlx4_query_device(&context->ibv_ctx, &dev_attrs)) goto query_free; -diff --git a/src/mlx4.h b/src/mlx4.h -index 8643d8f..3eadb98 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -79,6 +79,11 @@ @@ -248,7 +238,7 @@ index 8643d8f..3eadb98 100644 struct mlx4_db_page *db_list[MLX4_NUM_DB_TYPE]; pthread_mutex_t db_list_mutex; }; -@@ -260,6 +284,11 @@ struct mlx4_ah { +@@ -266,6 +290,11 @@ struct mlx4_ah { struct mlx4_av av; }; @@ -260,7 +250,7 @@ index 8643d8f..3eadb98 100644 static inline unsigned long align(unsigned long val, unsigned long align) { return (val + align - 1) & ~(align - 
1); -@@ -304,6 +333,13 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah) +@@ -310,6 +339,13 @@ static inline struct mlx4_ah *to_mah(str return to_mxxx(ah, ah); } @@ -272,9 +262,9 @@ index 8643d8f..3eadb98 100644 +#endif + int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size); + int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size); void mlx4_free_buf(struct mlx4_buf *buf); - -@@ -350,6 +386,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq *srq, int ind); +@@ -357,6 +393,10 @@ void mlx4_free_srq_wqe(struct mlx4_srq * int mlx4_post_srq_recv(struct ibv_srq *ibsrq, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr); @@ -285,7 +275,7 @@ index 8643d8f..3eadb98 100644 struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr); int mlx4_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, -@@ -380,5 +420,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, struct ibv_ah_attr *attr, +@@ -387,5 +427,16 @@ int mlx4_alloc_av(struct mlx4_pd *pd, st void mlx4_free_av(struct mlx4_ah *ah); int mlx4_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); @@ -302,11 +292,9 @@ index 8643d8f..3eadb98 100644 + #endif /* MLX4_H */ -diff --git a/src/qp.c b/src/qp.c -index 01e8580..2f02430 100644 --- a/src/qp.c +++ b/src/qp.c -@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, +@@ -226,7 +226,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, ctrl = wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1)); qp->sq.wrid[ind & (qp->sq.wqe_cnt - 1)] = wr->wr_id; @@ -315,7 +303,7 @@ index 01e8580..2f02430 100644 (wr->send_flags & IBV_SEND_SIGNALED ? htonl(MLX4_WQE_CTRL_CQ_UPDATE) : 0) | (wr->send_flags & IBV_SEND_SOLICITED ? 
-@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, +@@ -243,6 +243,9 @@ int mlx4_post_send(struct ibv_qp *ibqp, size = sizeof *ctrl / 16; switch (ibqp->qp_type) { @@ -325,7 +313,7 @@ index 01e8580..2f02430 100644 case IBV_QPT_RC: case IBV_QPT_UC: switch (wr->opcode) { -@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp_cap *cap, enum ibv_qp_type type, +@@ -543,6 +546,7 @@ void mlx4_calc_sq_wqe_size(struct ibv_qp size += sizeof (struct mlx4_wqe_raddr_seg); break; @@ -333,7 +321,7 @@ index 01e8580..2f02430 100644 case IBV_QPT_RC: size += sizeof (struct mlx4_wqe_raddr_seg); /* -@@ -631,6 +635,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *qp, struct ibv_qp_cap *cap, +@@ -632,6 +636,7 @@ void mlx4_set_sq_sizes(struct mlx4_qp *q case IBV_QPT_UC: case IBV_QPT_RC: @@ -341,11 +329,9 @@ index 01e8580..2f02430 100644 wqe_size -= sizeof (struct mlx4_wqe_raddr_seg); break; -diff --git a/src/srq.c b/src/srq.c -index ba2ceb9..1350792 100644 --- a/src/srq.c +++ b/src/srq.c -@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd, struct ibv_srq_attr *attr, +@@ -167,3 +167,53 @@ int mlx4_alloc_srq_buf(struct ibv_pd *pd return 0; } @@ -399,8 +385,6 @@ index ba2ceb9..1350792 100644 + pthread_mutex_unlock(&ctx->xrc_srq_table_mutex); +} + -diff --git a/src/verbs.c b/src/verbs.c -index 400050c..b7c9c8e 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -368,18 +368,36 @@ int mlx4_query_srq(struct ibv_srq *srq, @@ -447,7 +431,7 @@ index 400050c..b7c9c8e 100644 return 0; } -@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -415,7 +433,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv qp->sq.wqe_cnt = align_queue_size(attr->cap.max_send_wr + qp->sq_spare_wqes); qp->rq.wqe_cnt = align_queue_size(attr->cap.max_recv_wr); @@ -456,7 +440,7 @@ index 400050c..b7c9c8e 100644 attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0; else { if (attr->cap.max_recv_sge < 1) -@@ -433,7 +451,7 @@ struct ibv_qp 
*mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -433,7 +451,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; @@ -465,7 +449,7 @@ index 400050c..b7c9c8e 100644 qp->db = mlx4_alloc_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ); if (!qp->db) goto err_free; -@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) +@@ -442,7 +460,7 @@ struct ibv_qp *mlx4_create_qp(struct ibv } cmd.buf_addr = (uintptr_t) qp->buf.buf; @@ -474,7 +458,7 @@ index 400050c..b7c9c8e 100644 cmd.db_addr = 0; else cmd.db_addr = (uintptr_t) qp->db; -@@ -485,7 +503,7 @@ err_destroy: +@@ -489,7 +507,7 @@ err_destroy: err_rq_db: pthread_mutex_unlock(&to_mctx(pd->context)->qp_table_mutex); @@ -483,7 +467,7 @@ index 400050c..b7c9c8e 100644 mlx4_free_db(to_mctx(pd->context), MLX4_DB_TYPE_RQ, qp->db); err_free: -@@ -544,7 +562,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, +@@ -548,7 +566,7 @@ int mlx4_modify_qp(struct ibv_qp *qp, st mlx4_cq_clean(to_mcq(qp->send_cq), qp->qp_num, NULL); mlx4_init_qp_indices(to_mqp(qp)); @@ -492,16 +476,16 @@ index 400050c..b7c9c8e 100644 *to_mqp(qp)->db = 0; } -@@ -603,7 +621,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp) - +@@ -611,7 +629,7 @@ int mlx4_destroy_qp(struct ibv_qp *ibqp) mlx4_unlock_cqs(ibqp); + pthread_mutex_unlock(&to_mctx(ibqp->context)->qp_table_mutex); - if (!ibqp->srq) + if (!ibqp->srq && ibqp->qp_type != IBV_QPT_XRC) mlx4_free_db(to_mctx(ibqp->context), MLX4_DB_TYPE_RQ, qp->db); free(qp->sq.wrid); if (qp->rq.wqe_cnt) -@@ -661,3 +679,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid) +@@ -669,3 +687,103 @@ int mlx4_detach_mcast(struct ibv_qp *qp, { return ibv_cmd_detach_mcast(qp, gid, lid); } @@ -605,8 +589,6 @@ index 400050c..b7c9c8e 100644 + return 0; +} +#endif -diff --git a/src/wqe.h b/src/wqe.h -index 6f7f309..fa2f8ac 100644 --- a/src/wqe.h +++ b/src/wqe.h @@ -65,7 
+65,7 @@ struct mlx4_wqe_ctrl_seg { diff --git a/fixes/xrc_fix_close_domain.patch b/fixes/xrc_fix_close_domain.patch index dfad7ac..3af2640 100644 --- a/fixes/xrc_fix_close_domain.patch +++ b/fixes/xrc_fix_close_domain.patch @@ -6,11 +6,9 @@ Need to pass this upward to caller. Signed-off-by: Jack Morgenstein -Index: libmlx4/src/verbs.c -=================================================================== ---- libmlx4.orig/src/verbs.c 2008-09-01 10:51:11.000000000 +0300 -+++ libmlx4/src/verbs.c 2008-09-01 10:52:40.000000000 +0300 -@@ -774,9 +774,11 @@ +--- a/src/verbs.c ++++ b/src/verbs.c +@@ -782,9 +782,11 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom int mlx4_close_xrc_domain(struct ibv_xrc_domain *d) { diff --git a/fixes/xrc_rcv_qp_v2.patch b/fixes/xrc_rcv_qp_v2.patch index 311c500..00ffd53 100644 --- a/fixes/xrc_rcv_qp_v2.patch +++ b/fixes/xrc_rcv_qp_v2.patch @@ -5,11 +5,9 @@ Signed-off-by: Jack Morgenstein V2: 1. xrc_ops changed to more_ops -diff --git a/src/mlx4.c b/src/mlx4.c -index 27ca75d..e5ded78 100644 --- a/src/mlx4.c +++ b/src/mlx4.c -@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops = { +@@ -74,6 +74,11 @@ static struct ibv_more_ops mlx4_more_ops .create_xrc_srq = mlx4_create_xrc_srq, .open_xrc_domain = mlx4_open_xrc_domain, .close_xrc_domain = mlx4_close_xrc_domain, @@ -21,11 +19,9 @@ index 27ca75d..e5ded78 100644 #endif }; #endif -diff --git a/src/mlx4.h b/src/mlx4.h -index 3eadb98..6307a2d 100644 --- a/src/mlx4.h +++ b/src/mlx4.h -@@ -429,6 +429,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_domain(struct ibv_context *context, +@@ -436,6 +436,21 @@ struct ibv_xrc_domain *mlx4_open_xrc_dom int fd, int oflag); int mlx4_close_xrc_domain(struct ibv_xrc_domain *d); @@ -47,11 +43,9 @@ index 3eadb98..6307a2d 100644 #endif -diff --git a/src/verbs.c b/src/verbs.c -index b7c9c8e..8261eae 100644 --- a/src/verbs.c +++ b/src/verbs.c -@@ -778,4 +778,59 @@ int mlx4_close_xrc_domain(struct ibv_xrc_domain *d) +@@ -786,4 +786,59 @@ int mlx4_close_xrc_domain(struct 
ibv_xrc free(d); return 0; } -- 1.6.3.1 From sebastien.dugue at bull.net Thu May 28 01:22:49 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Thu, 28 May 2009 10:22:49 +0200 Subject: [ofa-general] [PATCH 1/3] V2 - libmthca - Optimize memory allocation of QP buffers In-Reply-To: <20090528102059.2fd85540@frecb007965> References: <20090528102059.2fd85540@frecb007965> Message-ID: <20090528102249.2ca01866@frecb007965> QP buffers are allocated with mthca_alloc_buf(), which rounds the buffers size to the page size and then allocates page aligned memory using posix_memalign(). However, this allocation is quite wasteful on architectures using 64K pages (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc parameter and chunks are allocated using mmap. thus we end up allocating: (requested size rounded to the page size) + (page size) + (malloc overhead) rounded internally to the page size. So for example, if we request a buffer of page_size bytes, we end up consuming 3 pages. In short, for each QP buffer we allocate, there is an overhead of 2 pages. This is quite visible on large clusters especially where the number of QP can reach several thousands. This patch creates a new function mthca_alloc_page() for use by mthca_alloc_qp_buf() that does an mmap() instead of a posix_memalign(). 
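
The accounting described above can be sketched as a small worked example (not part of the patch). The helper names and the 16-byte malloc overhead used in the note below are assumptions picked for illustration; the real glibc bookkeeping overhead may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Round val up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t val, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (val + align - 1) & ~(align - 1);
}

/*
 * Pages consumed when a page-aligned posix_memalign() request is served
 * from an mmap'ed chunk: (size rounded to a page) + (one page of alignment
 * slack) + (malloc overhead), all rounded up to a whole number of pages.
 * malloc_overhead is a hypothetical parameter, not a glibc constant.
 */
static size_t pages_via_memalign(size_t size, size_t page_size,
				 size_t malloc_overhead)
{
	size_t chunk = round_up(size, page_size) + page_size + malloc_overhead;

	return round_up(chunk, page_size) / page_size;
}

/* Pages consumed when the buffer is mmap'ed directly. */
static size_t pages_via_mmap(size_t size, size_t page_size)
{
	return round_up(size, page_size) / page_size;
}
```

With 64K pages, a one-page request and an assumed 16-byte overhead, pages_via_memalign(65536, 65536, 16) gives 3 while pages_via_mmap(65536, 65536) gives 1, which matches the 2-page overhead per QP buffer mentioned above.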
Signed-off-by: Sebastien Dugue --- src/buf.c | 34 ++++++++++++++++++++++++++++++++-- src/mthca.h | 7 +++++++ src/qp.c | 7 ++++--- 3 files changed, 43 insertions(+), 5 deletions(-) diff --git a/src/buf.c b/src/buf.c index 6c1be4f..499edeb 100644 --- a/src/buf.c +++ b/src/buf.c @@ -35,6 +35,8 @@ #endif /* HAVE_CONFIG_H */ #include +#include +#include #include "mthca.h" @@ -69,8 +71,32 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) if (ret) free(buf->buf); - if (!ret) + if (!ret) { buf->length = size; + buf->type = MTHCA_MALIGN; + } + + return ret; +} + +int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size) +{ + int ret; + + /* Use mmap directly to allocate an aligned buffer */ + buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + if (buf->buf == MAP_FAILED) + return errno; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + munmap(buf->buf, align(size, page_size)); + else { + buf->length = size; + buf->type = MTHCA_MMAP; + } return ret; } @@ -78,5 +104,9 @@ int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size) void mthca_free_buf(struct mthca_buf *buf) { ibv_dofork_range(buf->buf, buf->length); - free(buf->buf); + + if ( buf->type == MTHCA_MMAP ) + munmap(buf->buf, buf->length); + else + free(buf->buf); } diff --git a/src/mthca.h b/src/mthca.h index 66751f3..7db15a7 100644 --- a/src/mthca.h +++ b/src/mthca.h @@ -138,9 +138,15 @@ struct mthca_context { int qp_table_mask; }; +enum mthca_buf_type { + MTHCA_MMAP, + MTHCA_MALIGN +}; + struct mthca_buf { void *buf; size_t length; + enum mthca_buf_type type; }; struct mthca_pd { @@ -291,6 +297,7 @@ static inline int mthca_is_memfree(struct ibv_context *ibctx) } int mthca_alloc_buf(struct mthca_buf *buf, size_t size, int page_size); +int mthca_alloc_page(struct mthca_buf *buf, size_t size, int page_size); void mthca_free_buf(struct mthca_buf *buf); int mthca_alloc_db(struct mthca_db_table *db_tab, 
enum mthca_db_type type, diff --git a/src/qp.c b/src/qp.c index 84dd206..15f4805 100644 --- a/src/qp.c +++ b/src/qp.c @@ -848,9 +848,10 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, qp->buf_size = qp->send_wqe_offset + (qp->sq.max << qp->sq.wqe_shift); - if (mthca_alloc_buf(&qp->buf, - align(qp->buf_size, to_mdev(pd->context->device)->page_size), - to_mdev(pd->context->device)->page_size)) { + if (mthca_alloc_page(&qp->buf, + align(qp->buf_size, + to_mdev(pd->context->device)->page_size), + to_mdev(pd->context->device)->page_size)) { free(qp->wrid); return -1; } -- 1.6.3.1 From sebastien.dugue at bull.net Thu May 28 01:23:58 2009 From: sebastien.dugue at bull.net (sebastien dugue) Date: Thu, 28 May 2009 10:23:58 +0200 Subject: [ofa-general] [PATCH 2/3] V2 - libmlx4 - Optimize memory allocation of QP buffers In-Reply-To: <20090528102059.2fd85540@frecb007965> References: <20090528102059.2fd85540@frecb007965> Message-ID: <20090528102358.0c5b2124@frecb007965> QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers size to the page size and then allocates page aligned memory using posix_memalign(). However, this allocation is quite wasteful on architectures using 64K pages (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc parameter and chunks are allocated using mmap. thus we end up allocating: (requested size rounded to the page size) + (page size) + (malloc overhead) rounded internally to the page size. So for example, if we request a buffer of page_size bytes, we end up consuming 3 pages. In short, for each QP buffer we allocate, there is an overhead of 2 pages. This is quite visible on large clusters especially where the number of QP can reach several thousands. This patch creates a new function mlx4_alloc_page() for use by mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign(). 
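
A minimal standalone sketch of the mmap-based allocation described above (an illustration, not the patch itself). The names page_buf, page_buf_alloc and page_buf_free are hypothetical, and the ibv_dontfork_range() call the patch makes is left out here so the sketch builds without libibverbs.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* How the buffer was obtained, so the free path can pick the matching call. */
enum page_buf_type {
	PAGE_BUF_MALIGN,	/* posix_memalign(), released with free() */
	PAGE_BUF_MMAP		/* mmap(), released with munmap() */
};

struct page_buf {
	void *buf;
	size_t length;
	enum page_buf_type type;
};

/* Round val up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t val, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (val + align - 1) & ~(align - 1);
}

/* Allocate a page-aligned buffer straight from mmap(); returns 0 or errno. */
static int page_buf_alloc(struct page_buf *b, size_t size, size_t page_size)
{
	size_t len = round_up(size, page_size);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return errno;

	b->buf = p;
	b->length = len;	/* keep the mapped length for munmap() */
	b->type = PAGE_BUF_MMAP;
	return 0;
}

/* Release the buffer with the call that matches how it was allocated. */
static void page_buf_free(struct page_buf *b)
{
	if (b->type == PAGE_BUF_MMAP)
		munmap(b->buf, b->length);
	else
		free(b->buf);
	b->buf = NULL;
	b->length = 0;
}
```

Recording the rounded length in the buffer keeps the munmap() length equal to the mmap() length even when the caller passes an unaligned size.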
Signed-off-by: Sebastien Dugue --- src/buf.c | 34 ++++++++++++++++++++++++++++++++-- src/mlx4.h | 7 +++++++ src/qp.c | 5 +++-- 3 files changed, 42 insertions(+), 4 deletions(-) diff --git a/src/buf.c b/src/buf.c index 0e5f9b6..73565e6 100644 --- a/src/buf.c +++ b/src/buf.c @@ -35,6 +35,8 @@ #endif /* HAVE_CONFIG_H */ #include +#include +#include #include "mlx4.h" @@ -69,14 +71,42 @@ int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size) if (ret) free(buf->buf); - if (!ret) + if (!ret) { buf->length = size; + buf->type = MLX4_MALIGN; + } return ret; } +int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size) +{ + int ret; + + /* Use mmap directly to allocate an aligned buffer */ + buf->buf = mmap(0 ,align(size, page_size) , PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + if (buf->buf == MAP_FAILED) + return errno; + + ret = ibv_dontfork_range(buf->buf, size); + if (ret) + munmap(buf->buf, align(size, page_size)); + else { + buf->length = size; + buf->type = MLX4_MMAP; + } + + return ret; + } + void mlx4_free_buf(struct mlx4_buf *buf) { ibv_dofork_range(buf->buf, buf->length); - free(buf->buf); + + if ( buf->type == MLX4_MMAP ) + munmap(buf->buf, buf->length); + else + free(buf->buf); } diff --git a/src/mlx4.h b/src/mlx4.h index 827a201..83547f5 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -161,9 +161,15 @@ struct mlx4_context { pthread_mutex_t db_list_mutex; }; +enum mlx4_buf_type { + MLX4_MMAP, + MLX4_MALIGN +}; + struct mlx4_buf { void *buf; size_t length; + enum mlx4_buf_type type; }; struct mlx4_pd { @@ -288,6 +294,7 @@ static inline struct mlx4_ah *to_mah(struct ibv_ah *ibah) } int mlx4_alloc_buf(struct mlx4_buf *buf, size_t size, int page_size); +int mlx4_alloc_page(struct mlx4_buf *buf, size_t size, int page_size); void mlx4_free_buf(struct mlx4_buf *buf); uint32_t *mlx4_alloc_db(struct mlx4_context *context, enum mlx4_db_type type); diff --git a/src/qp.c b/src/qp.c index d194ae3..557e255 100644 --- a/src/qp.c +++ 
b/src/qp.c @@ -604,8 +604,9 @@ int mlx4_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, qp->sq.offset = 0; } - if (mlx4_alloc_buf(&qp->buf, - align(qp->buf_size, to_mdev(pd->context->device)->page_size), + if (mlx4_alloc_page(&qp->buf, + align(qp->buf_size, + to_mdev(pd->context->device)->page_size), to_mdev(pd->context->device)->page_size)) { free(qp->sq.wrid); free(qp->rq.wrid); -- 1.6.3.1 From vlad at lists.openfabrics.org Thu May 28 03:24:45 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 28 May 2009 03:24:45 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090528-0200 daily build status Message-ID: <20090528102445.68B4BE28179@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with 
linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From tziporet at mellanox.co.il Thu May 28 03:37:49 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 28 May 2009 13:37:49 +0300
Subject: [ofa-general] OFED 1.4.1 GA is available
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD02C12252@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD02C12252@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD02D5F39D@mtlexch01.mtl.com>

I am pleased to announce that OFED-1.4.1 GA release is done

The tarball is available on:
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.1

Vladimir & Tziporet

========================================================================

Release information:
------------------------------

Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- OEL 4.5: 2.6.9-55.ELsmp
- OEL 5.2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11 GA: 2.6.27.13-1-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4.0
==========================
- New OSes: Added support for RHEL 5.3 and SLES11
- NFS/RDMA: In beta quality with backports for RHEL 5.2, 5.3 and SLES 10 SP2
- Updated MPI packages:
  - Open MPI 1.3.2 - new version - see OpenMPI release notes for details
  - MVAPICH 1.1.0-3355 - bug fixes version
- Updated bonding package: ib-bonding-0.9.0-40
- Updated DAPL: compat-dapl-1.2.14 and dapl-2.0.19
- Updated opensm version to include critical bug fixes
- Fixed RDS iWARP support and fixed stability issues
- Low level drivers updated: ehca, mlx4, cxgb3, nes, ipath, mthca
- mstflint update
- Bug fixes

See each component release notes for details on enhancements and bug fixes

From swise at opengridcomputing.com Thu May 28 06:49:26 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 28 May 2009 08:49:26 -0500
Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Report correct port state and mtu.
In-Reply-To:
References: <20090527190852.16426.82898.stgit@build.ogc.int>
Message-ID: <4A1E9666.2060807@opengridcomputing.com>

Roland Dreier wrote:
> OK, applied. Would be nice if we had a better way to report MTU, but whatever...
>

Agreed. At this point iWARP is still just plugging into the IB
infrastructure for much of this. Got any ideas on how to do this better?

From Bob.Ciotti at nasa.gov Thu May 28 10:57:57 2009
From: Bob.Ciotti at nasa.gov (Bob Ciotti)
Date: Thu, 28 May 2009 10:57:57 -0700
Subject: [ofa-general] SubnAdmGet (6777)
Message-ID: <20090528175757.GA95655@nas.nasa.gov>

Sorry to bounce this off the list - should it be too remedial. I promise
that I've been consuming a lot of the spec and OFA code.
Maybe you consider that a promise or a warning that we will be more
active :|

Our configuration is >6000 CA in a mix of infinihostIII/connectx and
longbow extenders and >800 24 port switches on a single subnet (SGI ICE
with lots of other stuff plugged in). It's DDR everywhere except across
the longbows. Hosts range from a few different generations of x86 xeon,
x86 opteron and itanium. We use lustre but have the srp traffic on a
separate subnet.

A few weeks ago connection setup times were mentioned on this list along
with ARP and path record lookups not being scalable. We experience these
problems as well and need to address these scalability issues. I have
quite a bit of test data and a few different ideas to bounce off the list
RE path records, once I am a little more versed in the spec. There has
already been some work done to limit ARP traffic.

Today's question has to do with SM errors.
We have been seeing lots of these - sometimes more than others. Digging
around some, it appears that the 6777 represents the number of duplicates?
This value fluctuates around some, but not a lot. Comments in the code
indicate that any value >1 is a problem. The question is: is it OK for
this to be happening, and how does it occur?

We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
We are currently running a pre 1.4 top of tree pull from back in Dec.
bob

May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)

....

-------------------------------------------------------------------------
Robert B. Ciotti                              Supercomputing Systems Lead
NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
NASA Ames Research Center                              FAX (650) 604-4377
Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
-------------------------------------------------------------------------

From hal.rosenstock at gmail.com Thu May 28 12:06:38 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 28 May 2009 15:06:38 -0400
Subject: [ofa-general] SubnAdmGet (6777)
In-Reply-To: <20090528175757.GA95655@nas.nasa.gov>
References: <20090528175757.GA95655@nas.nasa.gov>
Message-ID:

On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti wrote:
>
> Sorry to bounce this off the list - should it be too remedial. I promise
> that I've been consuming a lot of the spec and OFA code. Maybe you consider
> that a promise or a warning that we will be more active :|
>
> Our configuration is >6000 CA in a mix of infinihostIII/connectx and
> longbow extenders and >800 24 port switches on a single subnet (SGI ICE
> with lots of other stuff plugged in). It's DDR everywhere except across the
> longbows. Hosts range from a few different generations of x86 xeon, x86
> opteron and itanium. We use lustre but have the srp traffic on a separate
> subnet.
>
> A few weeks ago connection setup times were mentioned on this list along
> with ARP and path record lookups not being scalable. We experience these
> problems as well and need to address these scalability issues. I have quite
> a bit of test data and a few different ideas to bounce off the list RE path
> records, once I am a little more versed in the spec. There has already been
> some work done to limit ARP traffic.
>
> Today's question has to do with SM errors.
> We have been seeing lots of these - sometimes more than others. Digging
> around some, it appears that the 6777 represents the number of duplicates?
> This value fluctuates around some, but not a lot. Comments in the code
> indicate that any value >1 is a problem. The question is: is it OK for
> this to be happening, and how does it occur?

It's an error (and an error status of too many records is returned to the
SA client in the end node). Gets are only allowed to return 1 record
(GetTable requests can deal with more than 1 record in the response),
yet many records that satisfied the request were found by the SA when
responding to the Get.

Any idea on what the specific get is that causes this to occur?

-- Hal

> We will probably do an update to the 1.4 or 1.4.1 SM in the next few days.
> We are currently running a pre 1.4 top of tree pull from back in Dec. bob
>
>
> May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > .... > > > > ------------------------------------------------------------------------- > Robert B. Ciotti                              Supercomputing Systems Lead > NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408 > NASA Ames Research Center                              FAX (650) 604-4377 > Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov > ------------------------------------------------------------------------- > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu May 28 15:31:10 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 28 May 2009 15:31:10 -0700 Subject: [ofa-general] RE: [PATCH] core/mthca: Distinguish multiple IB cards in /proc/interrupts In-Reply-To: <4A1DEBC8.9050207@sgi.com> (Arputham Benjamin's message of "Wed, 27 May 2009 18:41:28 -0700") References: <4A0B560B.3090606@sgi.com> <4A136DF0.7000402@sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992691@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992692@cf--amer001e--3.americas.sgi.com> <1AB9A794DBDDF54A8A81BE2296F7BDFE992696@cf--amer001e--3.americas.sgi.com> <4A1DEBC8.9050207@sgi.com> Message-ID: > > Not sure what you mean. If we put msi-x info under /sys, then you can > > figure out which interrupts belong to a given HCA by following the > > device link from /sys/class/infiniband. 
Similarly if /proc/interrupts > > gives the PCI device, then you have the same ability. So either way > > works as far as I can tell. > Linux is supposed to move away from procfs to sysfs for this type of device > related info. However, /proc/interrupts is still present in the latest > distro > releases (for example, SLES11) and OFED needs to provide support > for this in procfs until the /proc/interrupts support is removed from > kernel. I think we're talking past each other. I agree that /proc/interrupts is still needed. However, there are two things I see that we can add, and each one suffices to make everything unambiguous: 1) Add PCI device info ("mlx4-comp-1 at pci...") to the interrupt name. Then if userspace cares about the interrupts for device "foo", it can look at the /sys/class/infiniband/foo/device symlink to find the PCI device, and then look in /proc/interrupts for all interrupts related to that PCI device. *OR* 2) Add /sys/devices/pci.../msix/vectorN files (or something like that) so userspace can similarly follow the /sys/class/infiniband/foo/device symlink to the PCI directory and read the MSI-X vector numbers for the device, and then get all info for that interrupt from /proc/interrupts, /proc/irq/NNN/smp_affinity, etc. Either option by itself is completely sufficient. > I have not seen full sysfs support for Ethernet devices. > I have seen IRQ number info but no interrupt counters on a per CPU basis. > Do we know when the full support for ethernet devices will be available > in sysfs? We can enhance OFED at the same time ethernet support is made > available in the kernel. Umm... for ethernet you can get per-CPU counters from /proc/interrupts, if you know the IRQ number. But if you have multiple MSI-X interrupts, then you have to get the IRQ number some other way. > 3) We can add dev_alloc_name() functionality to mlx4_core similar to > alloc_name() > present in ib_core.
This is consistent with other ethernet device driver > implementations > using the function dev_alloc_name() present in the kernel. (Please see > .../net/core/dev.c) Not sure how this could work. If mlx4_core is allocating device numbers, and I have 3 adapters, only 2 of which are IB HCAs and 1 of which is an ethernet adapter, then how would mlx4_core assign numbers that match what the RDMA layer will use? - R. From Bob.Ciotti at nasa.gov Thu May 28 16:41:33 2009 From: Bob.Ciotti at nasa.gov (Bob Ciotti) Date: Thu, 28 May 2009 16:41:33 -0700 Subject: [ofa-general] SubnAdmGet (6777) In-Reply-To: References: <20090528175757.GA95655@nas.nasa.gov> Message-ID: <20090528234133.GA45460@nas.nasa.gov> On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote: > On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti wrote: > > > > Sorry to bounce this off the list - should it be too remedial. I promise > > that I've been consuming a lot of the spec and OFA code. Maybe you consider > > that a promise or a warning we will be more active :| > > > > Our configuration is >6000 CA in a mix of infinihostIII/connectx and > > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE > > with lots of other stuff plugged in). It's DDR everywhere except across the > > longbows. Hosts range from a few different generations of x86 xeon, x86 > > opteron and itanium. We use lustre but have the srp traffic on a separate > > subnet. > > > > A few weeks ago connection setup times were mentioned on this list along > > with ARP and path record lookups not being scalable. We experience these > > problems as well and need to address these scalability issues. I have quite > > a bit of test data and a few different ideas to bounce off the list RE path > > records, once I am a little more versed in the spec. There has already been > > some work done to limit ARP traffic. > > > > Today's question has to do with SM errors. > > We have been seeing lots of these - sometimes more than others.
Digging > > around some it appears that the 6777 represents the number of duplicates? > > This value fluctuates around some, but not a lot. Comments in the code > > indicate that any value >1 is a problem. Question is, should or is this > > OK to be happening and how does it occur? > > It's an error (and error status of too many records is returned to the > SA client in the end node). > > Gets are only allowed to return 1 record (GetTable requests can deal > with more than 1 record in the response) yet many were found by the SA > that satisfied the request in responding to the Get. Any idea on what > the specific get is that causes this to occur? That's the problem. At the debug level we are running at, I can't pin down the source. Is there a state I can go look for on the clients to see what it's trying to do? bob > -- Hal > > > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days. > > We are currently running a pre 1.4 top of tree pull from back in dec. bob > > > > > > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777) > > May 28 00:07:18 118151 [649EF940] 0x01 ->
osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
> >
> > ....
> >
> >
> >
> > -------------------------------------------------------------------------
> > Robert B. Ciotti                              Supercomputing Systems Lead
> > NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
> > NASA Ames Research Center                              FAX (650) 604-4377
> > Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
> > -------------------------------------------------------------------------
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >

From vlad at lists.openfabrics.org Fri May 29 03:24:13 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 29 May 2009 03:24:13 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090529-0200 daily build status
Message-ID: <20090529102413.3C9B4E61689@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64
with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hry at platform.com Thu May 28 23:28:54 2009 From: hry at platform.com (Hans Westgaard Ry) Date: Fri, 29 May 2009 08:28:54 +0200 Subject: [ofa-general] Memory registration redux In-Reply-To: References: <20090506214628.GM2590@obsidianresearch.com><20090506222638.GA16280@obsidianresearch.com><20090507000231.GB16280@obsidianresearch.com><20090507224806.GF16280@obsidianresearch.com> <5019F239-149F-49E1-8C23-436DE6094AB2@cisco.com> Message-ID: <4A1F80A6.2010709@platform.com> The scheme looks fine to me! Hans W. Ry Jeff Squyres wrote: > Other MPI implementors -- what do you think of this scheme? > > > On May 27, 2009, at 1:49 PM, Roland Dreier (rdreier) wrote: > >> >> > > /* >> > > * If type field is INVAL, then user_cookie_counter holds the >> > > * user_cookie for the region being reported; if the HINT flag >> is set >> > > * then hint_start/hint_end hold the start and end of the >> mapping that >> > > * was invalidated. (If HINT is not set, then multiple events >> > > * invalidated parts of the registered range and >> hint_start/hint_end >> > > * should be ignored) >> >> > I don't quite grok this. Is the intent that HINT will only be set if >> > an *entire* hint_start/hint_end range is invalidated by a single >> > event?
I.e., if only part of the hint_start/hint_end range is >> > invalidated, you'll get the cookie back, but not what part of the >> > range is invalid (because assumedly the entire IBV registration is >> now >> > invalid anyway)? >> >> Basically, I just keep one hint_start/hint_end. If multiple events hit >> the same registration then I just give up and don't give you a hint. >> >> > > * If type is LAST, then the read operation has emptied the list of >> > > * invalidated regions, and user_cookie_counter holds the value >> of the >> > > * kernel's generation counter when the empty list occurred. The >> > > * other fields are not filled in for this event. >> >> > Just to be clear -- we're supposed to keep reading events until we >> get >> > a LAST event? >> >> Yes, that's probably the sanest use case. >> >> > 1. Will it increase by 1 each time a page (or set of pages?) is >> > removed from a user process? >> >> As it stands it increases by 1 every time there is an MMU notification, >> even if that notification hits multiple registrations. It wouldn't be >> hard to change that to count the number of events generated if that >> works better. >> >> > 2. Does it change if pages are *added* to a user process? I.e., does >> > the counter indicate *removals* or *changes* to the user process page >> > table? >> >> No, additions don't trigger any MMU notification -- that's inherent in >> the design of the MMU notifiers stuff. The idea is that you have a >> "secondary MMU" and MMU notifications are the equivalent of TLB >> shootdowns; the secondary MMU is responsible for populating itself on >> faults etc. >> >> > Is the *unm_counter value guaranteed to have been changed by the time >> > munmap() returns? >> >> Yes. >> >> > Did you pick [2] here simply because you're only expecting an INVAL >> > and a LAST event in this specific example? I'm assuming that we >> > should normally loop over reading until we get LAST, correct? >> >> Right. 
>> >> > What happens if I register multiple regions with the same cookie >> value? >> >> You get in trouble -- I need to fix things to reject duplicated cookies >> actually, because otherwise there's no way to unregister. >> >> > Is a process responsible for guaranteeing that it umn_unregister()s >> > everything before exiting, or will all pending registrations be >> > cleaned up/unregistered/whatever when a process exits? >> >> The kernel cleans up of course to handle crashes etc. >> >> - R. >> > > From hal.rosenstock at gmail.com Fri May 29 06:09:49 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 29 May 2009 09:09:49 -0400 Subject: [ofa-general] SubnAdmGet (6777) In-Reply-To: <20090528234133.GA45460@nas.nasa.gov> References: <20090528175757.GA95655@nas.nasa.gov> <20090528234133.GA45460@nas.nasa.gov> Message-ID: On Thu, May 28, 2009 at 7:41 PM, Bob Ciotti wrote: > On Thu, May 28, 2009 at 02:06:38PM -0500, Hal Rosenstock wrote: >> On Thu, May 28, 2009 at 1:57 PM, Bob Ciotti wrote: >> > >> > Sorry to bounce this off the list - should it be too remedial. I promise >> > that I've been consuming a lot of the spec and OFA code. Maybe you consider >> > that a promise or a warning we will be more active :| >> > >> > Our configuration is >6000 CA in a mix of infinihostIII/connectx and >> > longbow extenders and >800 24 port switches on a single subnet. (SGI ICE >> > with lots of other stuff plugged in). Its DDR everywhere except across the >> > longbows. Hosts range from a few different generations of x86 xeon, x86 >> > opteron and itanium. We use lustre but have the srp traffic on a separate >> > subnet. >> > >> > A few weeks ago connection setup times were mentioned on this list along >> > with ARP and path record lookups not being scalable. We experience these >> > problems as well and need to address these scalability issues. 
I have quite >> > a bit of test data and a few different ideas to bounce off the list RE path >> > records, once I am a little more versed in the spec. There has already been >> > some work done to limit ARP traffic. >> > >> > Today's question has to do with SM errors. >> > We have been seeing lots of these - sometimes more than others. Digging >> > around some it appears that the 6777 represents the number of duplicates? >> > This value fluctuates around some, but not a lot. Comments in the code >> > indicate that any value >1 is a problem. Question is, should or is this >> > OK to be happening and how does it occur? >> >> It's an error (and error status of too many records is returned to the >> SA client in the end node). >> >> Gets are only allowed to return 1 record (GetTable requests can deal >> with more than 1 record in the response) yet many were found by the SA >> that satisfied the request in responding to the Get. Any idea on what >> the specific get is that causes this to occur? > > That's the problem. At the debug level we are running at, I can't pin down > the source. Can you change the debug level? If not, can you instrument OpenSM (add some debug info into osm_sa_path_record.c)? > Is there a state I can go look for on the clients to see what > it's trying to do? Perhaps use madeye. -- Hal > bob > > >> -- Hal >> >> > We will probably do an update to the 1.4 or 1.4.1 SM in the next few days. >> > We are currently running a pre 1.4 top of tree pull from back in dec.
bob
>> >
>> >
>> > May 28 00:07:13 324705 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 336912 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 338067 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:13 570497 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 417242 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:14 899329 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:15 245929 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:17 076558 [6E38940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 118151 [649EF940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 328783 [76EE7940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 341440 [CFF50940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:18 578154 [E2448940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 425249 [BDA58940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:19 907407 [19330940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> > May 28 00:07:20 267818 [F4940940] 0x01 -> osm_sa_respond: ERR 4C05: Got more than one record for SubnAdmGet (6777)
>> >
>> > ....
>> >
>> >
>> >
>> > -------------------------------------------------------------------------
>> > Robert B. Ciotti                              Supercomputing Systems Lead
>> > NASA Advanced Supercomputing (NAS) Division            TEL (650) 604-4408
>> > NASA Ames Research Center                              FAX (650) 604-4377
>> > Moffett Field, CA 94035-1000                          Bob.Ciotti at NASA.gov
>> > -------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > general mailing list
>> > general at lists.openfabrics.org
>> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>> >
>> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>> >
>

From hnrose at comcast.net Fri May 29 08:35:15 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 29 May 2009 11:35:15 -0400
Subject: [ofa-general] [PATCH] libibmad/resolve.c: Determine SL properly
Message-ID: <20090529153515.GA10301@comcast.net>

rather than assuming SL 0

Signed-off-by: Hal Rosenstock
---
diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c
index 691bdc3..f17da11 100644
--- a/libibmad/src/resolve.c
+++ b/libibmad/src/resolve.c
@@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout,
 		return -1;

 	mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid);
+	mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl);

 	return ib_portid_set(sm_id, lid, 0, 0);
 }
@@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 {
 	ib_portid_t sm_portid;
 	char buf[IB_SA_DATA_SIZE] = { 0 };
+	ib_portid_t self = { 0 };
+	uint64_t selfguid;
+	ibmad_gid_t selfgid;
+	uint8_t nodeinfo[64];

 	if (!sm_id) {
 		sm_id = &sm_portid;
 		if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0)
 			return -1;
 	}
+
+	if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport))
+		return -1;
+	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &selfguid);
+	mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX);
+	mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid);
+
 	if (*(uint64_t *) & portid->gid == 0)
 		mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F,
 				IB_DEFAULT_SUBN_PREFIX);
@@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid,
 		mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid);

 	if ((portid->lid =
-	     ib_path_query_via(srcport, portid->gid, portid->gid, sm_id,
+	     ib_path_query_via(srcport, selfgid, portid->gid, sm_id,
 			       buf)) < 0)
 		return -1;

+	mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl);
 	return 0;
 }

@@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid,
 		return -1;

 	mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid);
+	mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl);
 	mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix);
 	mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid);


From hnrose at comcast.net Fri May 29 12:31:12 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 29 May 2009 15:31:12 -0400
Subject: [ofa-general] [PATCH] infiniband-diags/ibdiag_common.c: Eliminate compile warning on x86_64 archs
Message-ID: <20090529193112.GA14170@comcast.net>

src/ibdiag_common.c: In function pretty_print
src/ibdiag_common.c:95: warning: field precision should have type int, but argument 3 has type long int

Signed-off-by: Hal Rosenstock
---
diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c
index 4ffa3f0..6fb8e01 100644
--- a/infiniband-diags/src/ibdiag_common.c
+++ b/infiniband-diags/src/ibdiag_common.c
@@ -92,7 +92,7 @@ static void pretty_print(int start, int width, const char *str)
 		}
 		if (e - str == 1)
 			e = p;
-		fprintf(stderr, "%.*s\n%*s", e - str, str, start, "");
+		fprintf(stderr, "%.*s\n%*s", (int)(e - str), str, start, "");
 		str = e;
 	}
 }

From vlad at lists.openfabrics.org Sat May 30 03:28:18 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 30 May 2009 03:28:18 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090530-0200 daily build status
Message-ID:
<20090530102819.1F56BE613C4@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with 
linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at Voltaire.com Sat May 30 23:41:54 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Sun, 31 May 2009 09:41:54 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <20090521210049.GY6837@sgi.com> References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com> <20090521210049.GY6837@sgi.com> Message-ID: <4A2226B2.5070604@Voltaire.com> akepner at sgi.com wrote @ http://lists.openfabrics.org/pipermail/general/2009-May/059730.html > What would prevent a race between a tx completion (with an > error) and the cleanup of a neighbour? Okay, so maybe this code/design of using the stashed ipoib_neighbour at the tx completion code is the root cause of all these troubles?! From a quick look at the code and two patches that touched this area (f56bcd801... "Use separate CQ for UD send completions" and 57ce41d1... "Fix transmit queue stalling forever") - I see that the original tx cq handler - ipoib_ib_handle_tx_wc() doesn't touch the neighbour but today is called only from the drain timer & dev-stop flows. Now, ipoib_cm_handle_tx_wc() is called for "normal" flow both for datagram and connected modes, and this function touches the neighbour. I am not sure why commit f56bcd801... made UD completions go through ipoib_cm_handle_tx_wc() nor why this function must use the neighbour to access the data structure it needs; maybe Eli can comment on that? Or.
From eli at dev.mellanox.co.il Sun May 31 00:21:15 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 31 May 2009 10:21:15 +0300 Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup() In-Reply-To: <4A2226B2.5070604@Voltaire.com> References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com> <20090521210049.GY6837@sgi.com> <4A2226B2.5070604@Voltaire.com> Message-ID: <20090531072115.GA9211@mtls03> On Sun, May 31, 2009 at 09:41:54AM +0300, Or Gerlitz wrote: > akepner at sgi.com wrote @ http://lists.openfabrics.org/pipermail/general/2009-May/059730.html > > What would prevent a race between a tx completion (with an > > error) and the cleanup of a neighbour? > > Okay, so maybe this code/design of using the stashed ipoib_neighbour at the tx > completion code is the root cause of all these troubles?! > > From a quick look at the code and two patches that touched this area (f56bcd801... "Use separate CQ for UD send completions" and 57ce41d1... "Fix transmit queue stalling forever") - I see that the original tx cq handler - ipoib_ib_handle_tx_wc() doesn't touch the neighbour but today is called only from the drain timer & dev-stop flows. Now, ipoib_cm_handle_tx_wc() is called for "normal" flow both for datagram and connected modes, and this function touches the neighbour. Or, I don't follow you - ipoib_cm_handle_tx_wc() called ipoib_neigh_free() from the first commit. Also please note the following designation of CQs: recv_cq: used for all receives and for CM send send_cq: used for UD send Thus, since in ipoib_poll() we poll "recv_cq", any non-receive completion must be that of a CM mode send. > > I am not sure why commit f56bcd801... made UD completions go through ipoib_cm_handle_tx_wc() nor why this function must use the neighbour to access the data structure it needs; maybe Eli can comment on that? > > Or.
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at lists.openfabrics.org Sun May 31 03:25:20 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 31 May 2009 03:25:20 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090531-0200 daily build status Message-ID: <20090531102520.E5604E616A5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on 
ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From ogerlitz at Voltaire.com Sun May 31 04:34:56 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Sun, 31 May 2009 14:34:56 +0300
Subject: [ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup()
In-Reply-To: <20090531072115.GA9211@mtls03>
References: <20090519215505.GN6837@sgi.com> <4A13ADDA.5040908@Voltaire.com> <20090521210049.GY6837@sgi.com> <4A2226B2.5070604@Voltaire.com> <20090531072115.GA9211@mtls03>
Message-ID: <4A226B60.9070006@Voltaire.com>

Eli Cohen wrote:
> ipoib_cm_handle_tx_wc() called ipoib_neigh_free() from the first commit.

Okay, thanks for pointing this out. Looking at the code, I'm not sure why the non-success/flush path of ipoib_cm_handle_tx_wc() must access the neighbour while ipoib_ib_handle_tx_wc can get away with only a warning print... do we agree that accessing the neighbour from the cm tx completion flow is buggy?

Or.

From dorfman.eli at gmail.com Sun May 31 07:44:46 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Sun, 31 May 2009 17:44:46 +0300
Subject: [ofa-general] [PATCH] infiniband-diags: Do not change logical state on SubnAdmSet
Message-ID: <4A2297DE.3050707@gmail.com>

Do not change logical state on SubnAdmSet

When changing physical state do not change logical port state.
From the IB spec:
When writing PortInfo:PortState, only legal transitions are valid.
So if PortState is ACTIVE and we try to set it to ACTIVE this will fail.

This patch allows reset in a single MAD.

Signed-off-by: Eli Dorfman
---
 infiniband-diags/src/ibportstate.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index 65c9ca1..d19a2e5 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -275,8 +275,10 @@ int main(int argc, char **argv)

 	/* Only if one of the "set" options is chosen */
 	if (port_op) {
-		if (port_op == 1)	/* Enable port */
+		if (port_op == 1) {	/* Enable port */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 2); /* Polling */
+			mad_set_field(data, 0, IB_PORT_STATE_F, 0); /* No Change */
+		}
 		else if ((port_op == 2) || (port_op == 3)) {	/* Disable port */
 			mad_set_field(data, 0, IB_PORT_STATE_F, 1); /* Down */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 3); /* Disabled */
@@ -292,6 +294,7 @@ int main(int argc, char **argv)

 		if (port_op == 3) {	/* Reset port - so also enable */
 			mad_set_field(data, 0, IB_PORT_PHYS_STATE_F, 2); /* Polling */
+			mad_set_field(data, 0, IB_PORT_STATE_F, 0); /* No Change */
 			err = set_port_info(&portid, data, portnum, port_op);
 			if (err < 0)
 				IBERROR("smp set portinfo failed");
--
1.5.5
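
For readers following the ibportstate patch, the field values it writes per operation can be summarised in a small standalone C sketch. This is only an illustration, not code from infiniband-diags: `plan_port_steps`, `struct port_fields` and the enum names are hypothetical, and the sketch does not call libibmad. Only the numeric values (0 = No Change, 1 = Down, 2 = Polling, 3 = Disabled) and the port_op numbering (1 enable, 2 disable, 3 reset) are taken from the diff comments. How many MADs are actually sent for each operation is not visible in the hunks shown, so the sketch models ordered steps of field values only.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values as annotated in the diff comments. */
enum { PORT_STATE_NO_CHANGE = 0, PORT_STATE_DOWN = 1 };
enum { PHYS_STATE_POLLING = 2, PHYS_STATE_DISABLED = 3 };

/* port_op numbering as used in the diff: 1 enable, 2 disable, 3 reset. */
enum { OP_ENABLE = 1, OP_DISABLE = 2, OP_RESET = 3 };

/* Hypothetical container for the two PortInfo fields one step writes. */
struct port_fields {
	int state;      /* value written to IB_PORT_STATE_F */
	int phys_state; /* value written to IB_PORT_PHYS_STATE_F */
};

/*
 * Hypothetical helper: fill `steps` with the field values written for
 * `op`, in order, and return the number of steps (0 for any other op).
 */
int plan_port_steps(int op, struct port_fields steps[2])
{
	const struct port_fields enable = { PORT_STATE_NO_CHANGE, PHYS_STATE_POLLING };
	const struct port_fields disable = { PORT_STATE_DOWN, PHYS_STATE_DISABLED };
	int n = 0;

	assert(steps != NULL);

	if (op == OP_ENABLE)
		steps[n++] = enable;
	else if (op == OP_DISABLE || op == OP_RESET)
		steps[n++] = disable;

	if (op == OP_RESET)	/* reset: disable, then also enable */
		steps[n++] = enable;

	return n;
}
```

With this sketch, an enable yields one step with state 0 (No Change) and physical state 2 (Polling), which is the point of the patch: the logical state is left alone when only the physical state is being changed.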