From vlad at dev.mellanox.co.il Sun Jul 1 02:26:28 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 01 Jul 2007 12:26:28 +0300
Subject: [ofa-general] Re: [GIT PULL 00/10] ofed_1_2 - Chelsio Bug Fixes
In-Reply-To: <20070629212752.18132.98709.stgit@dell3.ogc.int>
References: <20070629212752.18132.98709.stgit@dell3.ogc.int>
Message-ID: <46877344.3050108@dev.mellanox.co.il>

Steve Wise wrote:
> Vlad,
>
> The following patches are bug fixes to the rdma and low level chelsio
> drivers for ofed-1.2. All of these patches are upstream in either 2.6.22
> or pending for 2.6.23 and need to be pulled into ofed-1.2.
>
> I plan to make these available to chelsio customers either through a
> series of patches, or a full ofa_kernel tarball.
>
> Please pull these from:
>
> http://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2
>
> Thanks,
>
> Steve.
>

Done,

Regards,
Vladimir

From vlad at lists.openfabrics.org Sun Jul 1 02:44:26 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Sun, 1 Jul 2007 02:44:26 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070701-0200 daily build status
Message-ID: <20070701094427.29D4DE60854@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod
  --with-user_mad-mod --with-user_access-mod --with-mthca-mod
  --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod
  --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.12
Passed on ia64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on powerpc with linux-2.6.17
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.17
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:

From mst at dev.mellanox.co.il Sun Jul 1 04:39:54 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 14:39:54 +0300
Subject: [ofa-general] round_jiffies()
Message-ID: <20070701113954.GM19343@mellanox.co.il>

Hi,
Wrt the recent OLS "Getting more from tickless" talk,
I started wondering whether we should be using round_jiffies
for stale connection detection work.
Ideas?

--
MST

From glebn at voltaire.com Sun Jul 1 05:16:23 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 15:16:23 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070630220530.GB7554@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
Message-ID: <20070701121623.GD17699@minantech.com>

On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S.
Tsirkin wrote:
> > Quoting Roland Dreier :
> > Subject: Re: [PATCH RFC] sharing userspace IB objects
> >
> > > This is not directly related to SRC: this is an effort
> > > to make it possible to share QPs, CQ etc across processes
> > > in the same way as they can be currently shared across threads.
> > > So assuming that we want multiple processes to post to
> > > the same QP, how can we support this?
> >
> > This looks like a lot of work for an unknown gain. Who is going to
> > really use this? ie is it worth the trouble?
>
> I think Dror is the best person to answer this.
> Dror, could you please explain the need for shared send queue?
>
SSQ is needed for scalability, no need to explain this (by the way RD is
needed for the same reason too. What's Mellanox's plan to support it? It
is a part of the spec after all, so why invent new shiny stuff when it is
still possible to achieve better scalability without it).

We are discussing your implementation proposal and in my opinion it
doesn't fit application needs. I may be wrong here, so if there is
somebody who thinks that sending random completions to random processes
is the best idea ever and absence of this "feature" is the only thing
that stops him from IB adoption he may chime in here and voice his
opinion.

Looking at Dror's slides, on slide 6 "Scalable Reliable Connection" I
see that the wire protocol is extended to send DST SRQ as part of a header.
The receiver side then puts the completion on the appropriate CQ according
to this field. Does your proposal address this? How? Who will put this
additional data on the wire (HW, libibverbs, or maybe the app)?

Also I don't see this in Dror's slides, but completion of a local operation
should be demultiplexed to the appropriate CQ too. The WQE may contain an
additional field, for instance, that will tell where to put a completion.
Once again, who will do the demux in your proposal (HW, libibverbs or app)?
The right answer is most certainly HW in both cases, so will Hermon support
this? Or maybe you want to demultiplex everything inside libibverbs? In
this case I want to see a design of this (preferably with performance
analysis).

--
			Gleb.

From glebn at voltaire.com Sun Jul 1 05:19:48 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 15:19:48 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070630220530.GB7554@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
Message-ID: <20070701121948.GE17699@minantech.com>

On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S. Tsirkin wrote:
> > Quoting Roland Dreier :
> > Subject: Re: [PATCH RFC] sharing userspace IB objects
> >
> > > This is not directly related to SRC: this is an effort
> > > to make it possible to share QPs, CQ etc across processes
> > > in the same way as they can be currently shared across threads.
> > > So assuming that we want multiple processes to post to
> > > the same QP, how can we support this?
> >
> > This looks like a lot of work for an unknown gain. Who is going to
> > really use this? ie is it worth the trouble?
>
> I think Dror is the best person to answer this.
And, by the way, gdror at lists.openfabrics.org bounces for me.

--
			Gleb.

From mst at dev.mellanox.co.il Sun Jul 1 07:08:08 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 17:08:08 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701121623.GD17699@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
Message-ID: <20070701140808.GS19343@mellanox.co.il>

> Looking at Dror's slides, on slide 6 "Scalable Reliable Connection" I
> see that the wire protocol is extended to send DST SRQ as part of a header.
> The receiver side then puts the completion on the appropriate CQ according
> to this field. Does your proposal address this? How? Who will put this
> additional data on the wire (HW, libibverbs, or maybe the app)?

This is SRC, which is a hardware extension, and is mostly an orthogonal issue.

My proposal only deals with SSQ for now.
For SRC we'll need to define new "SRC domain" objects and an API to share them
between apps. I expect that we'll be able to basically use the same API as for
sharing other objects.

It is true that for best scalability we probably need both SSQ and SRC,
but let's try to focus on sharing APIs for now.

> Also I don't see this in Dror's slides, but completion of a local operation
> should be demultiplexed to the appropriate CQ too. The WQE may contain an
> additional field, for instance, that will tell where to put a completion.
> Once again, who will do the demux in your proposal (HW, libibverbs or app)?
> The right answer is most certainly HW in both cases, so will Hermon support
> this? Or maybe you want to demultiplex everything inside libibverbs? In
> this case I want to see a design of this (preferably with performance
> analysis).

Since hardware cannot do this demultiplexing, I think the right thing
is to do this inside MPI, encoding the necessary data in the WRID field.

--
MST
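
The suggestion earlier in this thread, to demultiplex completions in software by encoding the necessary data in the WRID field, can be illustrated with a small sketch. It is not taken from any posted patch. It only assumes that the work request ID is an opaque 64-bit value (as `wr_id` is in libibverbs); the 16-bit rank / 48-bit sequence split and the names `encode_wr_id`, `decode_wr_id` and `demultiplex` are hypothetical and chosen for illustration.

```python
from collections import defaultdict, deque

# Hypothetical layout (an assumption, not from the thread):
# top 16 bits = rank of the posting process, low 48 bits = sequence number.
RANK_BITS = 16
SEQ_BITS = 64 - RANK_BITS
SEQ_MASK = (1 << SEQ_BITS) - 1


def encode_wr_id(owner_rank, seq):
    """Pack the poster's rank and a per-process sequence number into 64 bits."""
    if not 0 <= owner_rank < (1 << RANK_BITS):
        raise ValueError("owner_rank out of range")
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("seq out of range")
    return (owner_rank << SEQ_BITS) | seq


def decode_wr_id(wr_id):
    """Split a 64-bit work request ID back into (owner_rank, seq)."""
    return wr_id >> SEQ_BITS, wr_id & SEQ_MASK


def demultiplex(polled_wr_ids, queues=None):
    """Route each polled work request ID to a software queue keyed by owner rank."""
    if queues is None:
        queues = defaultdict(deque)
    for wr_id in polled_wr_ids:
        owner, seq = decode_wr_id(wr_id)
        queues[owner].append(seq)
    return queues
```

In a real implementation the per-owner queues would have to live in memory shared between the processes and be protected against concurrent access; this sketch models only the encoding and routing step.
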

From gdror at mellanox.co.il Sun Jul 1 09:27:24 2007
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Sun, 1 Jul 2007 19:27:24 +0300
Subject: [ofa-general] RE: Re: [PATCH RFC] sharing userspace IB objects
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Gleb Natapov
> Sent: Sunday, July 01, 2007 3:16 PM
> To: Michael S. Tsirkin
> Cc: Roland Dreier; gdror at lists.openfabrics.org;
> openib-general at openib.org
> Subject: Re: Re: [PATCH RFC] sharing userspace IB objects
>
> On Sun, Jul 01, 2007 at 01:05:30AM +0300, Michael S.
> Tsirkin wrote:
> > > Quoting Roland Dreier :
> > > Subject: Re: [PATCH RFC] sharing userspace IB objects
> > >
> > > > This is not directly related to SRC: this is an effort
> > > > to make it possible to share QPs, CQ etc across processes
> > > > in the same way as they can be currently shared across threads.
> > > > So assuming that we want multiple processes to post to
> > > > the same QP, how can we support this?
> > >
> > > This looks like a lot of work for an unknown gain. Who is going to
> > > really use this? ie is it worth the trouble?
> >
> > I think Dror is the best person to answer this.
> > Dror, could you please explain the need for shared send queue?
> >
> SSQ is needed for scalability, no need to explain this (by
> the way RD is needed for the same reason too. What's Mellanox's
> plan to support it?

RD is not supported in hardware today. Implementing RD is extremely
complicated. To solve the scalability issues on MPI-like applications
we believe that SRC and SSQ are the right solutions. They are much
simpler to implement in both software and hardware. By MPI-like I refer
to applications that have some level of trust between two processes of
the same application. RD also has some performance issues as it only
supports one message in the air. Those performance issues are solved
by design in SRC/SSQ.

> It is a part of the spec after all, so why invent new shiny
> stuff when it is still possible to achieve better scalability
> without it).

It's truly about complexity. And as I mentioned at the OFA meeting in
Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.

> We are discussing your implementation proposal and in my
> opinion it doesn't fit application needs. I may be wrong
> here, so if there is somebody who thinks that sending random
> completions to random processes is the best idea ever and
> absence of this "feature" is the only thing that stops him
> from IB adoption he may chime in here and voice his opinion.

Your input about how to demultiplex send completions on SSQ is
valuable. Unfortunately it is not supported in the current generation.
What I can suggest here is not new on this thread, but:
1) all pollers see the same CQ, only the poller that sees a completion
   that belongs to it takes it out of the CQ
2) only one process polls the CQ, if a completion doesn't belong to the
   poller, the poller will put it in a SW queue for the right process.
   The other processes just poll on the SW queue
3) the SQ will have a "completed WQE index" reported. Everybody can
   look at it and determine how many WQEs completed. This one has
   some cons because the CQ is not shared here... need to bake this
   one more.
If we wrap one of these into the right API, once there is HW available
that can do the SSQ CQ demultiplexing, it can work without any API change.

>
> Looking at Dror's slides, on slide 6 "Scalable Reliable
> Connection" I see that the wire protocol is extended to send DST
> SRQ as part of a header.
> The receiver side then puts the completion on the appropriate CQ
> according to this field. Does your proposal address this? How?

SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
unfortunately.
But I think that with the right API we can abstract this, and later on
have better performance for it.

> Who will put this additional data on the wire (HW, libibverbs,
> or maybe the app)? Also I don't see this in Dror's slides, but
> completion of a local operation should be demultiplexed to
> the appropriate CQ too. The WQE may contain an additional field, for
> instance, that will tell where to put a completion. Once
> again, who will do the demux in your proposal (HW, libibverbs
> or app)? The right answer is most certainly HW in both cases,
> so will Hermon support this?
> Or maybe you want to demultiplex everything inside
> libibverbs? In this case I want to see a design of this
> (preferably with performance analysis).

One thing to mention. The way I see it is according to the order of the
slides.
First get SRC going, improve the scalability. Then SSQ can be added to
further improve scalability. In other words I am suggesting that maybe
we can worry about the SSQ deficiencies a bit later :)

>
> --
> 			Gleb.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From glebn at voltaire.com Sun Jul 1 09:36:15 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 19:36:15 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701140808.GS19343@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
 <20070701140808.GS19343@mellanox.co.il>
Message-ID: <20070701163615.GA31673@minantech.com>

On Sun, Jul 01, 2007 at 05:08:08PM +0300, Michael S. Tsirkin wrote:
> > Looking at Dror's slides, on slide 6 "Scalable Reliable Connection" I
> > see that the wire protocol is extended to send DST SRQ as part of a header.
> > The receiver side then puts the completion on the appropriate CQ according
> > to this field. Does your proposal address this? How? Who will put this
> > additional data on the wire (HW, libibverbs, or maybe the app)?
>
> This is SRC, which is a hardware extension, and is mostly an orthogonal issue.
I don't agree. You don't usually create a QP only for sends. And indeed if
we look at slide 8 "Shared Send Queue" we see that demultiplexing of
receive and the additional header field are there. Also slide 11 defines
the SSQ API on top of the SRC API and it makes perfect sense. I don't see
anywhere in these slides that SSQ is mentioned on its own without SRC.

> My proposal only deals with SSQ for now.
> For SRC we'll need to define new "SRC domain" objects and an API to share them
> between apps.
> I expect that we'll be able to basically use the same API as for
> sharing other objects.
So lack of HW support for SRC stops you from implementing it, but lack
of HW support for SSQ doesn't really bother you at all.

>
> It is true that for best scalability we probably need both SSQ and SRC,
> but let's try to focus on sharing APIs for now.
The sharing API is a small and boring detail. We need to understand
application needs and design for them.

>
> > Also I don't see this in Dror's slides, but completion of a local operation
> > should be demultiplexed to the appropriate CQ too. The WQE may contain an
> > additional field, for instance, that will tell where to put a completion.
> > Once again, who will do the demux in your proposal (HW, libibverbs or app)?
> > The right answer is most certainly HW in both cases, so will Hermon support
> > this? Or maybe you want to demultiplex everything inside libibverbs? In
> > this case I want to see a design of this (preferably with performance
> > analysis).
>
> Since hardware cannot do this demultiplexing, I think the right thing
> is to do this inside MPI, encoding the necessary data in the WRID field.
>
It translates to: "Marketing wants new TLAs to be implemented fast. We
don't have HW support for that so we implement something to get rid of
marketing guys and the rest is not our problem and you MPI folk go deal
with that mess (you are already used to it anyway)"

--
			Gleb.

From mst at dev.mellanox.co.il Sun Jul 1 12:00:30 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 1 Jul 2007 22:00:30 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701163615.GA31673@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
 <20070701140808.GS19343@mellanox.co.il>
 <20070701163615.GA31673@minantech.com>
Message-ID: <20070701190030.GA12737@mellanox.co.il>

> > My proposal only deals with SSQ for now.
> > For SRC we'll need to define new "SRC domain" objects and an API to share them
> > between apps. I expect that we'll be able to basically use the same API as for
> > sharing other objects.
>
> So lack of HW support for SRC stops you from implementing it, but lack
> of HW support for SSQ doesn't really bother you at all.

The proposal lets you share any object across processes, same as we can do
across threads at the moment, potentially with any hardware that supports IB
spec 1.2. This can be used for both send and receive queues, CQs, etc.

SRC is a separate hardware extension. Speaking about "SRC without
hardware support" simply does not make sense to me.

--
MST

From glebn at voltaire.com Sun Jul 1 12:05:16 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 22:05:16 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
 <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
Message-ID: <20070701190516.GB31673@minantech.com>

On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> > SSQ is needed for scalability, no need to explain this (by
> > the way RD is needed for the same reason too. What's Mellanox's
> > plan to support it?
>
> RD is not supported in hardware today. Implementing RD is extremely
> complicated.
> To solve the scalability issues on MPI-like applications
> we believe that SRC and SSQ are the right solutions. They are much
> simpler to implement in both software and hardware. By MPI-like I refer
> to applications that have some level of trust between two processes of
> the same application. RD also has some performance issues as it only
> supports one message in the air. Those performance issues are solved
> by design in SRC/SSQ.
>
Didn't know about the RD limitation. Is this a shortcoming of the IB spec
or a general limitation of reliable datagram? RD looks much nicer to me
than SRC/SSQ.

> > It is a part of the spec after all, so why invent new shiny
> > stuff when it is still possible to achieve better scalability
> > without it).
>
> It's truly about complexity. And as I mentioned at the OFA meeting in
> Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
>
> > We are discussing your implementation proposal and in my
> > opinion it doesn't fit application needs. I may be wrong
> > here, so if there is somebody who thinks that sending random
> > completions to random processes is the best idea ever and
> > absence of this "feature" is the only thing that stops him
> > from IB adoption he may chime in here and voice his opinion.
>
> Your input about how to demultiplex send completions on SSQ is
> valuable. Unfortunately it is not supported in the current generation.
> What I can suggest here is not new on this thread, but:
> 1) all pollers see the same CQ, only the poller that sees a completion
>    that belongs to it takes it out of the CQ
Progress of one process depends on all other processes on the same node.
Not good at all.

> 2) only one process polls the CQ, if a completion doesn't belong to the
>    poller, the poller will put it in a SW queue for the right process.
>    The other processes just poll on the SW queue
Not good, for the same reason.
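
The objection to option 1 (progress of one process depends on the others) can be made concrete with a toy model. This is only a sketch under stated assumptions: it assumes completions can be taken from the shared CQ only in order, from the head, and the class name `SharedCQ` and its `poll` method are hypothetical, not part of libibverbs or of the proposed API.

```python
from collections import deque


class SharedCQ:
    """Toy model of option 1: every process polls one shared completion queue."""

    def __init__(self, completions):
        # Each entry is (owner_rank, wqe_id), in hardware completion order.
        self._entries = deque(completions)

    def poll(self, rank):
        """Return the head entry's wqe_id if it belongs to `rank`, else None.

        Assumption: entries are consumed strictly in order, so a process
        cannot skip over a completion that belongs to another process.
        """
        if self._entries and self._entries[0][0] == rank:
            return self._entries.popleft()[1]
        return None
```

With the queue `[(0, "a"), (1, "b"), (0, "c")]`, rank 1 gets nothing until rank 0 has polled "a", and rank 0 then cannot reach "c" until rank 1 has polled "b". A process that stops polling stalls everyone queued behind it.
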
As a variant, each process can poll the HW CQ and its SW CQ; if a
completion from the HW CQ belongs to another process, it puts it on the
appropriate SW CQ. I don't think that a reasonable API will require such
effort from applications (and I am not even talking about all the locking
overhead and cache bouncing that will result from such an implementation,
but latency will be bad, that's for sure).

> 3) the SQ will have a "completed WQE index" reported. Everybody can
>    look at it and determine how many WQEs completed. This one has
>    some cons because the CQ is not shared here... need to bake this
>    one more.
And where will the application get the WC? Or should it maintain its own
queue of WQEs?

> If we wrap one of these into the right API, once there is HW available
> that can do the SSQ CQ demultiplexing, it can work without any API change.
>
That is something I don't see in the proposed API.

> >
> > Looking at Dror's slides, on slide 6 "Scalable Reliable
> > Connection" I see that the wire protocol is extended to send DST
> > SRQ as part of a header.
> > The receiver side then puts the completion on the appropriate CQ
> > according to this field. Does your proposal address this? How?
>
> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
> unfortunately.
Is it possible to add this with only a FW upgrade?

> But I think that with the right API we can abstract this, and later on
> have better performance for it.
>
> > Who will put this additional data on the wire (HW, libibverbs,
> > or maybe the app)? Also I don't see this in Dror's slides, but
> > completion of a local operation should be demultiplexed to
> > the appropriate CQ too. The WQE may contain an additional field, for
> > instance, that will tell where to put a completion. Once
> > again, who will do the demux in your proposal (HW, libibverbs
> > or app)? The right answer is most certainly HW in both cases,
> > so will Hermon support this?
> > Or maybe you want to demultiplex everything inside
> > libibverbs?
> > In this case I want to see a design of this
> > (preferably with performance analysis).
>
> One thing to mention. The way I see it is according to the order of the
> slides. First get SRC going, improve the scalability. Then SSQ can be
> added to further improve scalability. In other words I am suggesting
> that maybe we can worry about the SSQ deficiencies a bit later :)
>
That is my point! Let's do it once, let's do it right, and let's do it
when HW is ready :)

--
			Gleb.

From glebn at voltaire.com Sun Jul 1 12:08:15 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Sun, 1 Jul 2007 22:08:15 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701190030.GA12737@mellanox.co.il>
References: <20070625130604.GH15343@mellanox.co.il>
 <20070626070641.GM15343@mellanox.co.il>
 <20070630220530.GB7554@mellanox.co.il>
 <20070701121623.GD17699@minantech.com>
 <20070701140808.GS19343@mellanox.co.il>
 <20070701163615.GA31673@minantech.com>
 <20070701190030.GA12737@mellanox.co.il>
Message-ID: <20070701190815.GC31673@minantech.com>

On Sun, Jul 01, 2007 at 10:00:30PM +0300, Michael S. Tsirkin wrote:
> > > My proposal only deals with SSQ for now.
> > > For SRC we'll need to define new "SRC domain" objects and an API to share them
> > > between apps. I expect that we'll be able to basically use the same API as for
> > > sharing other objects.
> >
> > So lack of HW support for SRC stops you from implementing it, but lack
> > of HW support for SSQ doesn't really bother you at all.
>
> The proposal lets you share any object across processes, same as we can do
> across threads at the moment, potentially with any hardware that supports IB
> spec 1.2. This can be used for both send and receive queues, CQs, etc.
Great. The change is big, the API is complex. What is the use case?

>
> SRC is a separate hardware extension. Speaking about "SRC without
> hardware support" simply does not make sense to me.
> Just like speaking about SSQ without hardware support doesn't make sense to me. I am glad that we agree on something. -- Gleb. From sean.hefty at intel.com Sun Jul 1 21:51:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 1 Jul 2007 21:51:36 -0700 Subject: [ofa-general] RE: [GIT PULL] please pull rdma-dev.git for 2.6.23 In-Reply-To: <20070701060953.GG7554@mellanox.co.il> Message-ID: <000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com> >> ib/cm: include HCA ACK delay in local ACK timeout > >I have not seen this and archive search does not give me anything http://lists.openfabrics.org/pipermail/general/2007-May/036657.html >There were several bugs in the local SA patches that you posted originally, >and SA cache was enabled by default which we decided was not a good idea. I'm aware of one bug that you reported. A fix was posted: http://lists.openfabrics.org/pipermail/general/2007-June/037234.html I do not recall any other bugs being reported. I disagree that enabling the cache by default is a bad idea, but it is disabled in the patches to merge upstream. >Could the latest revision of the patches to be pulled be posted >to list please? The patches are available here: http://www.openfabrics.org/git/?p=~shefty/rdma-dev.git;a=shortlog;h=for-roland I can repost tomorrow, but I believe that only 3 lines have changed since the last posting. Two listed in the patch above, and the change to disable the cache. - Sean From mst at dev.mellanox.co.il Sun Jul 1 22:11:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 08:11:28 +0300 Subject: [ofa-general] Re: [GIT PULL] please pull rdma-dev.git for 2.6.23 In-Reply-To: <000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com> References: <20070701060953.GG7554@mellanox.co.il> <000001c7bc64$ace52b80$a3c8180a@amr.corp.intel.com> Message-ID: <20070702051116.GB5018@mellanox.co.il> > I can repost tomorrow, but I believe that only 3 lines have changed since the > last posting. 
Two listed in the patch above, and the change to disable the
> cache.

Please do repost the final version. Thanks.

--
MST

From vlad at lists.openfabrics.org Mon Jul 2 02:44:02 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 2 Jul 2007 02:44:02 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070702-0200 daily build status
Message-ID: <20070702094402.7F002E60808@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod
    --with-user_mad-mod --with-user_access-mod --with-mthca-mod
    --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod
    --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:

From ogerlitz at voltaire.com Mon Jul 2 02:46:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 2 Jul 2007 12:46:27 +0300 (IDT)
Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs
Message-ID:

Hi Sean,

When the process on the passive side is slow to react to a SIDR
request, a --retry-- sent by the active side CM causes the passive side
CM to send a SIDR REP with IB_SIDR_REJECT status. This makes the active
side CM deliver an IB_CM_SIDR_REP_RECEIVED event with status
IB_SIDR_REJECT etc. Later, when the process calls
rdma_accept --> ib_send_cm_sidr_rep etc, another SIDR REP with status
IB_SIDR_SUCCESS is sent, but it's too late.

This seems to be solved by the patch below; however, I see that for
duplicate REQs the code is much more involved, which means I might be
over-simplifying here...

To reproduce the problem and see the effect of the fix, you can run a
passive udaddy, suspend it, then run an active udaddy, and then resume
the passive one. Without the patch, the active side gets
RDMA_CM_EVENT_UNREACHABLE with status 2, whereas with the patch it
works fine.

Or.
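The retry race described above can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions: the names
below (sidr_state, on_sidr_req, on_app_accept) are hypothetical and are
not the kernel's cm.c API. It shows the intended behaviour only: a retry
that arrives while the first REQ is still waiting for the application is
dropped quietly instead of being answered with a reject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the passive side; not kernel code. */
enum sidr_state { SIDR_IDLE, SIDR_REQ_PENDING };
enum sidr_action { SIDR_DELIVER, SIDR_DROP_DUP };

struct passive_side {
	enum sidr_state state;
	int rejects_sent;	/* stays 0: the model never rejects a retry */
	int accepts_sent;	/* replies carrying a success status */
};

/* Called for every SIDR REQ that arrives, including retries. */
static enum sidr_action on_sidr_req(struct passive_side *p)
{
	assert(p != NULL);
	if (p->state == SIDR_REQ_PENDING)
		return SIDR_DROP_DUP;	/* retry: stay quiet, keep waiting */
	p->state = SIDR_REQ_PENDING;
	return SIDR_DELIVER;		/* first copy: hand it to the listener */
}

/* Called when the (possibly slow) application finally accepts. */
static void on_app_accept(struct passive_side *p)
{
	assert(p != NULL && p->state == SIDR_REQ_PENDING);
	p->accepts_sent++;
	p->state = SIDR_IDLE;
}
```

Calling on_sidr_req() twice before on_app_accept() mirrors the suspended
udaddy case: the second call reports a dropped duplicate, and the only
reply the active side ever sees is the later success.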
----------------------------------

Don't reject SIDR REQ retries which are received before the passive
side had the chance to send SIDR REP.

Signed-off-by: Or Gerlitz

Index: ofa_kernel-1.2/drivers/infiniband/core/cm.c
===================================================================
--- ofa_kernel-1.2.orig/drivers/infiniband/core/cm.c	2007-07-02 12:20:13.000000000 +0300
+++ ofa_kernel-1.2/drivers/infiniband/core/cm.c	2007-07-02 12:35:17.000000000 +0300
@@ -746,7 +746,8 @@ retest:
 		break;
 	case IB_CM_SIDR_REQ_RCVD:
 		spin_unlock_irqrestore(&cm_id_priv->lock, flags);
-		cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
+		if (!err)
+			cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT);
 		break;
 	case IB_CM_REQ_SENT:
 		ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
@@ -2835,7 +2836,7 @@ static int cm_sidr_req_handler(struct cm
 	cur_cm_id_priv = cm_insert_remote_sidr(cm_id_priv);
 	if (cur_cm_id_priv) {
 		spin_unlock_irqrestore(&cm.lock, flags);
-		goto out; /* Duplicate message. */
+		goto out_dup; /* Duplicate message. */
 	}
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
 					sidr_req_msg->service_id,
@@ -2858,6 +2859,9 @@ static int cm_sidr_req_handler(struct cm
 	cm_process_work(cm_id_priv, work);
 	cm_deref_id(cur_cm_id_priv);
 	return 0;
+out_dup:
+	cm_destroy_id(&cm_id_priv->id, -1);
+	return -EINVAL;
 out:
 	ib_destroy_cm_id(&cm_id_priv->id);
 	return -EINVAL;

From gdror at dev.mellanox.co.il Mon Jul 2 04:00:56 2007
From: gdror at dev.mellanox.co.il (Dror Goldenberg)
Date: Mon, 02 Jul 2007 14:00:56 +0300
Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070701190516.GB31673@minantech.com>
References: <20070625130604.GH15343@mellanox.co.il>
	<20070701121623.GD17699@minantech.com>
	<6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com>
	<20070701190516.GB31673@minantech.com>
Message-ID: <4688DAE8.2050205@dev.mellanox.co.il>

Gleb Natapov wrote:
> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>
>>> SSQ is needed for scalability, no need to explain this (by
>>> the way RD is needed for the same reason too. What's Mellanox
>>> plan to support it?
>>>
>> RD is not supported in hardware today. Implementing RD is extremely
>> complicated. To solve the scalability issues on MPI like applications
>> we believe that SRC and SSQ are the right solutions. It is much simpler
>> for implementation by both software and hardware. By MPI-like I refer
>> to applications that have some level of trust between two processes of
>> the
>> same application. RD also has some performance issues as it only
>> supports one message in the air. Those performance issues are solved
>> by design in SRC/SSQ.
>>
>>
> Didn't know about RD limitation. Is this shortcomings of IB spec or
> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ.
>
The RD limitation is part of the IB spec.
>
>>> It is a part of Spec after all, so why to invent new shiny
>>> staff when it is still possible to achieve better scalability
>>> without them).
>>> >> It's truly about complexity. And as I mentioned in OFA meeting at >> Sonoma, >> Mellanox is willing to contribute SRC/SSQ to the IB spec as well. >> >> >>> We are discussing you implementation proposal and in my >>> opinion it doesn't fit application needs. I may be wrong >>> here, so if there is somebody who things that sending random >>> completion to random processes it the best idea ever and >>> absence of this "feature" is the only thing that stops him >>> from IB adoption he may chime in here and voice his opinion. >>> >> Your input about how to demultiplex send completions on SSQ is >> valuable. Unfortunately it is not supported in the current generation. >> What I can suggest here is, not new on this thread, but: >> 1) all pollers see the same CQ, only the poller that sees the completion >> that >> belongs to takes it out of the CQ >> > Progress of one process depend on all other processes on the same node. Not > good at all. > In MPI, it happens many times that all processes depends on each other to make forward progress, this way or the other. I am not saying that this is the ideal solution, but there is some price involved in sharing resources. You can always upgrade resources for a process that utilizes them, e.g. if communication pattern is that each process talks with 4 neighbors, then let it has dedicated unshared QPs. > >> 2) only one process polls the CQ, if it doesn't belong to the poller, >> the >> poller will put it in a SW queue to the right process. The other >> processes just poll on the SW queue >> > Not good of the same reason. > > As the variant each process can poll HW CQ and SW CQ if completion from HW CQ > belong to another process put it on appropriate SW CQ. I don't think > that reasonable API will require such afford from applications (and I am > not talking about all locking overhead and cache bouncing that will > result from such implementation, but latency will be bad that's for sure). 
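The software demultiplexing variant quoted above (every process polls
the shared hardware CQ and forwards completions it does not own) could
look roughly like the sketch below. It is a single-address-space model
with hypothetical names (sw_wc, sw_cq, demux_completion); a real version
would need the queues in shared memory plus the locking whose cost is
mentioned above, and none of this is an existing libibverbs interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NPROC 4		/* processes sharing the send queue (assumed) */
#define RING  16	/* entries per software queue (assumed) */

/* Hypothetical completion record: owner is the process that posted the WR. */
struct sw_wc {
	uint64_t wr_id;
	int owner;
};

struct sw_cq {
	struct sw_wc ring[RING];
	unsigned head, tail;	/* head: next to read, tail: next to write */
};

static struct sw_cq sw_cqs[NPROC];

/* Append to a software queue; returns 0 on success, -1 if it is full. */
static int sw_cq_push(struct sw_cq *q, struct sw_wc wc)
{
	if (q->tail - q->head == RING)
		return -1;
	q->ring[q->tail++ % RING] = wc;
	return 0;
}

/* Take the oldest entry; returns 1 if one was copied to *wc, 0 if empty. */
static int sw_cq_pop(struct sw_cq *q, struct sw_wc *wc)
{
	if (q->head == q->tail)
		return 0;
	*wc = q->ring[q->head++ % RING];
	return 1;
}

/*
 * Called by process `me` for each completion it pulled from the shared
 * hardware CQ. Returns 1 if the completion is its own, 0 if it was
 * forwarded to the owner's software queue, -1 if that queue is full.
 */
static int demux_completion(int me, struct sw_wc wc)
{
	assert(wc.owner >= 0 && wc.owner < NPROC);
	if (wc.owner == me)
		return 1;
	return sw_cq_push(&sw_cqs[wc.owner], wc) == 0 ? 0 : -1;
}
```

Each process would then check its own sw_cqs[] slot with sw_cq_pop()
in addition to the hardware CQ, which is where the extra polling work
comes from.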
> I don't think that polling on SQ completions are in the latency path. You usually need it in order to free networking buffers. In any case I understand your point. > >> 3) the SQ will have a "completed WQE index" reported. Everybody can >> look at it and determine how many WQEs completed. This one has >> some cons because the CQ is not shared here... need to bake this >> one more. >> > And where application will get WC? Or should it maintain its own queue > of WQEs? > In this method, each app should have its own queue. > >> If we wrap one of these into the right API, once there is HW available >> that >> can do the SSQ CQ demultiplexing, it can work without any API change. >> >> > That is something I don't see in proposed API. > > >>> Looking at the Dror's slides on slide 6 "Scalable Reliable >>> Connection" I see that wire protocol is extended to send DST >>> SRQ as part of a header. >>> Receiver side then puts completion to appropriate CQ >>> according this field. Have you proposition address this? How? >>> >> SRC indeed includes demultiplexing of the CQ. SSQ does not currently, >> unfortunately. >> > Is it possible to add this only with FW upgrade? > Unfortunately no. > >> But I think that with the right API we can abstract this, and later on >> have better performance for it. >> >> >>> Who will put this additional data on a wire (HW or libibverbs >>> may be app)? Also I don't see this in Dror's slide, but >>> completion of local operation should be demultiplexed to >>> appropriate CQ too. WQE may contain additional field, for >>> instance, that will tell where to put a completion. Once >>> again who will do the demux in you proposition (HW, libiverbs >>> or app)? The right answer is most certainly HW in both cases >>> so will Hermon support this? >>> Or may be you want to demultiplex everything inside >>> libibvers? In this case I want to see design of this >>> (preferably with performance analysis). >>> >> One thing to mention. 
The way I see it is according to the order of the >> slides. First get SRC going, improve the scalability. Then SSQ can be >> added to further improve scalability. In other words I am suggesting >> that maybe we can worry with the SSQ deficiencies a bit later :) >> >> > That is my point! Let's do it once lets do it right and lets do it when HW > is ready :) > SRC is ready in HW, it can be implemented in SW now and will significantly help scalability. We can resume SSQ discussion or other alternatives later on... > -- > Gleb. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Mon Jul 2 04:11:58 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Jul 2007 07:11:58 -0400 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070701190516.GB31673@minantech.com> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> Message-ID: <1183374715.4377.127455.camel@hal.voltaire.com> On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: > On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: > > > SSQ is needed for scalability, no need to explain this (by > > > the way RD is needed for the same reason too. What's Mellanox > > > plan to support it? > > > > RD is not supported in hardware today. Implementing RD is extremely > > complicated. To solve the scalability issues on MPI like applications > > we believe that SRC and SSQ are the right solutions. It is much simpler > > for implementation by both software and hardware. By MPI-like I refer > > to applications that have some level of trust between two processes of > > the > > same application. 
RD also has some performance issues as it only > > supports one message in the air. Those performance issues are solved > > by design in SRC/SSQ. > > > Didn't know about RD limitation. Is this shortcomings of IB spec or > general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ. I think Dror is referring to number of messages in flight per EEC and number of messages in flight per QP being limited to 1 per IBA spec. Number of messages enqueued per EEC/QP is implementation dependent. -- Hal [snip...] From gdror at dev.mellanox.co.il Mon Jul 2 05:58:25 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Mon, 02 Jul 2007 15:58:25 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <1183374715.4377.127455.camel@hal.voltaire.com> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> Message-ID: <4688F671.40408@dev.mellanox.co.il> Hal Rosenstock wrote: > On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: > >> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: >> >>>> SSQ is needed for scalability, no need to explain this (by >>>> the way RD is needed for the same reason too. What's Mellanox >>>> plan to support it? >>>> >>> RD is not supported in hardware today. Implementing RD is extremely >>> complicated. To solve the scalability issues on MPI like applications >>> we believe that SRC and SSQ are the right solutions. It is much simpler >>> for implementation by both software and hardware. By MPI-like I refer >>> to applications that have some level of trust between two processes of >>> the >>> same application. RD also has some performance issues as it only >>> supports one message in the air. Those performance issues are solved >>> by design in SRC/SSQ. >>> >>> >> Didn't know about RD limitation. 
Is this shortcomings of IB spec or >> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ. >> > > I think Dror is referring to number of messages in flight per EEC and > number of messages in flight per QP being limited to 1 per IBA spec. > Number of messages enqueued per EEC/QP is implementation dependent. > > -- Hal > Correct. The number of messages in flight per EEC is 1 per IB spec. The fact that IB requires SQ WQEs to complete in order, even if their destination is different EECs, makes it pretty challenging to have an implementation that can really process more than one message simultaneously per QP. > [snip...] > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From mst at dev.mellanox.co.il Mon Jul 2 06:00:57 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 16:00:57 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <4688F671.40408@dev.mellanox.co.il> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> Message-ID: <20070702130057.GB17858@mellanox.co.il> > Quoting Dror Goldenberg : > Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects > > Hal Rosenstock wrote: > >On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: > > > >>On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: > >> > >>>>SSQ is needed for scalability, no need to explain this (by > >>>>the way RD is needed for the same reason too. What's Mellanox > >>>>plan to support it? > >>>> > >>>RD is not supported in hardware today. 
Implementing RD is extremely > >>>complicated. To solve the scalability issues on MPI like applications > >>>we believe that SRC and SSQ are the right solutions. It is much simpler > >>>for implementation by both software and hardware. By MPI-like I refer > >>>to applications that have some level of trust between two processes of > >>>the > >>>same application. RD also has some performance issues as it only > >>>supports one message in the air. Those performance issues are solved > >>>by design in SRC/SSQ. > >>> > >>> > >>Didn't know about RD limitation. Is this shortcomings of IB spec or > >>general limitation of reliable datagram? RD looks much nice to me then > >>SRC/SSQ. > >> > > > >I think Dror is referring to number of messages in flight per EEC and > >number of messages in flight per QP being limited to 1 per IBA spec. > >Number of messages enqueued per EEC/QP is implementation dependent. > > > >-- Hal > > > Correct. The number of messages in flight per EEC is 1 per IB spec. > The fact that IB requires SQ WQEs to complete in order, even if their > destination is different EECs, makes it pretty challenging to have an > implementation that can really process more than one message > simultaneously per QP. Hmm, I guess this requirement could easily be relaxed - in a way similiar to what was done for SRQ - without breaking applications. WRID is sufficient to identify the WR even without ordering guarantees. 
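To illustrate the WRID point: a consumer can match each completion to
its work request by the 64-bit wr_id alone, so the lookup itself does
not depend on completions arriving in posting order. A minimal sketch,
assuming a small fixed table of outstanding requests; pending_wr,
wr_post and wr_complete are hypothetical names, not libibverbs calls,
and error completions are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_PENDING 8	/* table size, chosen arbitrarily for the sketch */

/* Hypothetical bookkeeping for posted-but-uncompleted work requests. */
struct pending_wr {
	uint64_t wr_id;
	void *buf;	/* buffer to release on completion */
	int in_use;
};

static struct pending_wr pending[MAX_PENDING];

/* Record a posted WR; returns 0 on success, -1 if the table is full. */
static int wr_post(uint64_t wr_id, void *buf)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!pending[i].in_use) {
			pending[i].wr_id = wr_id;
			pending[i].buf = buf;
			pending[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Resolve a completion by wr_id, in whatever order it arrives. */
static void *wr_complete(uint64_t wr_id)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (pending[i].in_use && pending[i].wr_id == wr_id) {
			pending[i].in_use = 0;
			return pending[i].buf;
		}
	}
	return NULL;	/* unknown or already completed */
}
```

Posting WRs 1, 2, 3 and then completing 3, 1, 2 returns the right buffer
each time; a second completion for the same wr_id returns NULL.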
-- MST From glebn at voltaire.com Mon Jul 2 06:03:49 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 2 Jul 2007 16:03:49 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070702130057.GB17858@mellanox.co.il> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <20070702130057.GB17858@mellanox.co.il> Message-ID: <20070702130349.GJ17699@minantech.com> On Mon, Jul 02, 2007 at 04:00:57PM +0300, Michael S. Tsirkin wrote: > > >>>RD is not supported in hardware today. Implementing RD is extremely > > >>>complicated. To solve the scalability issues on MPI like applications > > >>>we believe that SRC and SSQ are the right solutions. It is much simpler > > >>>for implementation by both software and hardware. By MPI-like I refer > > >>>to applications that have some level of trust between two processes of > > >>>the > > >>>same application. RD also has some performance issues as it only > > >>>supports one message in the air. Those performance issues are solved > > >>>by design in SRC/SSQ. > > >>> > > >>> > > >>Didn't know about RD limitation. Is this shortcomings of IB spec or > > >>general limitation of reliable datagram? RD looks much nice to me then > > >>SRC/SSQ. > > >> > > > > > >I think Dror is referring to number of messages in flight per EEC and > > >number of messages in flight per QP being limited to 1 per IBA spec. > > >Number of messages enqueued per EEC/QP is implementation dependent. > > > > > >-- Hal > > > > > Correct. The number of messages in flight per EEC is 1 per IB spec. 
> > The fact that IB requires SQ WQEs to complete in order, even if their > > destination is different EECs, makes it pretty challenging to have an > > implementation that can really process more than one message > > simultaneously per QP. > > Hmm, I guess this requirement could easily be relaxed - in a way > similiar to what was done for SRQ - without breaking applications. Especially as there are no applications that use RD because there is not HCA that support it. -- Gleb. From ogerlitz at voltaire.com Mon Jul 2 06:07:25 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 02 Jul 2007 16:07:25 +0300 Subject: [ofa-general] IPoIB-CM UC mode Message-ID: <4688F88D.4090806@voltaire.com> Dror, can you please clarify A) if the IBTA change to allow attaching SRQ to UC QPs is done? B) when it would be possible for you guys to support SRQ/UC in the FW? Michael, If Dror says yes on both... what would it take to implement IPoIB-CM/UC? Is there any --other-- part of the stack (eg mthca,cm) that needs to be enhanced for that? Or. From halr at voltaire.com Mon Jul 2 06:29:10 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Jul 2007 09:29:10 -0400 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <4688F671.40408@dev.mellanox.co.il> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> Message-ID: <1183382948.4377.136789.camel@hal.voltaire.com> On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote: > Hal Rosenstock wrote: > > On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: > > > >> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: > >> > >>>> SSQ is needed for scalability, no need to explain this (by > >>>> the way RD is needed for the same reason too. What's Mellanox > >>>> plan to support it? 
> >>>> > >>> RD is not supported in hardware today. Implementing RD is extremely > >>> complicated. To solve the scalability issues on MPI like applications > >>> we believe that SRC and SSQ are the right solutions. It is much simpler > >>> for implementation by both software and hardware. By MPI-like I refer > >>> to applications that have some level of trust between two processes of > >>> the > >>> same application. RD also has some performance issues as it only > >>> supports one message in the air. Those performance issues are solved > >>> by design in SRC/SSQ. > >>> > >>> > >> Didn't know about RD limitation. Is this shortcomings of IB spec or > >> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ. > >> > > > > I think Dror is referring to number of messages in flight per EEC and > > number of messages in flight per QP being limited to 1 per IBA spec. > > Number of messages enqueued per EEC/QP is implementation dependent. > > > > -- Hal > > > Correct. The number of messages in flight per EEC is 1 per IB spec. > The fact that IB requires SQ WQEs to complete in order, even if their > destination is different EECs, Where's this requirement in the spec (and could this be relaxed as it seems like it is overly "specified") ? Just wondering... -- Hal > makes it pretty challenging to have an > implementation that can really process more than one message > simultaneously per QP. > > > [snip...] 
> > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > From halr at voltaire.com Mon Jul 2 06:32:25 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Jul 2007 09:32:25 -0400 Subject: [ofa-general] Re: [PATCH] opensm: use osm_get_node/port_by_guid() funcs In-Reply-To: <20070630210503.GA14390@sashak.voltaire.com> References: <20070630210503.GA14390@sashak.voltaire.com> Message-ID: <1183383145.4377.137053.camel@hal.voltaire.com> On Sat, 2007-06-30 at 17:05, Sasha Khapyorsky wrote: > Similar to osm_get_switch_by_guid() use existing osm_get_node_by_guid() > and osm_get_port_by_guid() helper funcs for those objects by guid > resolving - this simplifies the flow in many cases. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From jackm at dev.mellanox.co.il Mon Jul 2 07:36:18 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 2 Jul 2007 17:36:18 +0300 Subject: [ofa-general] [PATCH 1 of 2] mlx4: Add new Mellanox device IDs Message-ID: <200707021736.18855.jackm@dev.mellanox.co.il> Add new Mellanox device IDs. 
Signed-off-by: Jack Morgenstein

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 41eafeb..0fd4a5f 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -911,6 +911,8 @@ static struct pci_device_id mlx4_pci_table[] = {
 	{ PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */
 	{ PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */
 	{ PCI_VDEVICE(MELLANOX, 0x6354) }, /* MT25408 "Hermon" QDR */
+	{ PCI_VDEVICE(MELLANOX, 0x6732) }, /* MT25408 "Hermon" DDR PCIEx-gen2 */
+	{ PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIEx-gen2 */
 	{ 0, }
 };
 

From jackm at dev.mellanox.co.il Mon Jul 2 07:37:34 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 2 Jul 2007 17:37:34 +0300
Subject: [ofa-general] [PATCH 2 of 2] libmlx4: Add new Mellanox device IDs
Message-ID: <200707021737.34303.jackm@dev.mellanox.co.il>

Add new Mellanox device ID's

Signed-off-by: Jack Morgenstein

diff --git a/src/mlx4.c b/src/mlx4.c
index 3684b50..178d214 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -65,6 +65,15 @@
 #define PCI_DEVICE_ID_MELLANOX_HERMON_QDR 0x6354
 #endif
 
+#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_DDR_PCIEX_G2
+#define PCI_DEVICE_ID_MELLANOX_HERMON_DDR_PCIEX_G2 0x6732
+#endif
+
+#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_QDR_PCIEX_G2
+#define PCI_DEVICE_ID_MELLANOX_HERMON_QDR_PCIEX_G2 0x673c
+#endif
+
+
 #define HCA(v, d) \
 	{ .vendor = PCI_VENDOR_ID_##v, \
 	  .device = PCI_DEVICE_ID_MELLANOX_##d }
@@ -76,6 +85,8 @@ struct {
 	HCA(MELLANOX, HERMON_SDR),
 	HCA(MELLANOX, HERMON_DDR),
 	HCA(MELLANOX, HERMON_QDR),
+	HCA(MELLANOX, HERMON_DDR_PCIEX_G2),
+	HCA(MELLANOX, HERMON_QDR_PCIEX_G2),
 };
 
 static struct ibv_context_ops mlx4_ctx_ops = {

From mst at dev.mellanox.co.il Mon Jul 2 07:53:28 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 2 Jul 2007 17:53:28 +0300 Subject: [ofa-general] Re: IPoIB-CM UC mode In-Reply-To: <4688F88D.4090806@voltaire.com> References: <4688F88D.4090806@voltaire.com> Message-ID: <20070702145328.GC17858@mellanox.co.il> > Quoting Or Gerlitz : > Subject: IPoIB-CM UC mode > > Dror, > > can you please clarify > > A) if the IBTA change to allow attaching SRQ to UC QPs is done? > B) when it would be possible for you guys to support SRQ/UC in the FW? > > Michael, > > If Dror says yes on both... what would it take to implement IPoIB-CM/UC? Given hardware support, just using UC is easy. The largest bit of work would be to add connection liveness detection code to active side. Hopefully not too bad either. > Is there any --other-- part of the stack (eg mthca,cm) that needs to be > enhanced for that? Not a whole lot. We need an API to detect this feature support in HW. There could be a bit of work in mthca to detect HW/FW support for this feature, and enable connecting UC QPs to SRQ. There could be a bit of debugging work in CM in case we hit some bugs with LAP messages (which I plan to use for liveness detection). -- MST From gshipman at lanl.gov Mon Jul 2 08:15:49 2007 From: gshipman at lanl.gov (Galen Shipman) Date: Mon, 2 Jul 2007 09:15:49 -0600 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070701121623.GD17699@minantech.com> References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630220530.GB7554@mellanox.co.il> <20070701121623.GD17699@minantech.com> Message-ID: <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov> > Looking at the Dror's slides on slide 6 "Scalable Reliable > Connection" I > see that wire protocol is extended to send DST SRQ as part of a > header. > Receiver side then puts completion to appropriate CQ according this > field. Have you proposition address this? How? Who will put this > additional data on a wire (HW or libibverbs may be app)? 
Also I don't
> see this in Dror's slide, but completion of local operation should be
> demultiplexed to appropriate CQ too. WQE may contain additional field,
> for instance, that will tell where to put a completion. Once again who
> will do the demux in you proposition (HW, libiverbs or app)? The right
> answer is most certainly HW in both cases so will Hermon support this?
> Or may be you want to demultiplex everything inside libibvers? In this
> case I want to see design of this (preferably with performance
> analysis).
>

While I think the SRC design makes sense I also have concerns
regarding SSQ. As Gleb has pointed out, if the hardware doesn't do the
demux then the application has to. It sounds like there are two
proposals to deal with this hardware limitation in software (sigh).

1) Process A polls CQ, if WQE belongs to Process B, Process A will
   drop the WQE in a shared memory region that Process B will poll.
   So we end up re-implementing shared memory completion semantics
   all over again for SSQ. I am concerned that there is both a
   latency hit (on average) and a memory scaling issue in that
   multiple QPs will now be replaced with shared memory completion
   fifos and a single SSQ QP.

2) Process A peeks CQ, if WQE belongs to Process B, it doesn't
   process it
   This has very bad implications for real applications, having to
   context switch to receive a WQE is bad

In my opinion the demux belongs in the hardware, otherwise we end up
complicating an already complicated code base to support a feature
which unless I am missing something will have no benefit to real
applications.

- Galen

> --
> Gleb.
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general From rdreier at cisco.com Mon Jul 2 09:25:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:25:35 -0700 Subject: [ofa-general] Re: round_jiffies() In-Reply-To: <20070701113954.GM19343@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 1 Jul 2007 14:39:54 +0300") References: <20070701113954.GM19343@mellanox.co.il> Message-ID: > I started wondering whether we should be using round_jiffies > for stale connection detection work. Yes, I've had this type of cleanup on my todo list for a while. There are probably quite a few places in drivers/infiniband where we should use round_jiffies or deferrable timers/delayed work. - R. From rdreier at cisco.com Mon Jul 2 09:27:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:27:19 -0700 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <1183382948.4377.136789.camel@hal.voltaire.com> (Hal Rosenstock's message of "02 Jul 2007 09:29:10 -0400") References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <1183382948.4377.136789.camel@hal.voltaire.com> Message-ID: > > Correct. The number of messages in flight per EEC is 1 per IB spec. > > The fact that IB requires SQ WQEs to complete in order, even if their > > destination is different EECs, > > Where's this requirement in the spec (and could this be relaxed as it > seems like it is overly "specified") ? Just wondering... I don't think we want to relax the requirement that work requests complete in order. 
It's hard enough to get applications correct without having to worry about out-of-order completions, and I think specifying all the corner cases would be a nightmare. Eg do we allow successful completions after a completion with error? and so on... From rdreier at cisco.com Mon Jul 2 09:36:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:36:54 -0700 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070630222419.GE7554@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 1 Jul 2007 01:24:19 +0300") References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> Message-ID: > Generally, I think it would be nice if this could work > in the same way as with multiple threads: a single process does > destroy, the rest must not use the same object after this, > synchronisation it up to the app. > > But you made me realise that we need an API for non-controlling processes to > release the userspace resources without destroying the kernel-level object. What is a non-controlling process? To the uverbs code in the kernel there is only one file structure that happens to be shared by multiple processes. But they are all equal. > > I think there are probably bugs > > in the locked_vm accounting in the kernel right now -- it doesn't take > > into account the possibility of passing context fds from one process > > to another. > > Hmm, might be a good idea to fix the bugs anyway, no? Yes, I guess we need to take a reference on the mm structure in ib_umem_get() and only drop it after we free the umem. > > Should process B be > > able to destroy it? What if process A is still alive -- should > > process B be able to destroy the QP? > > I think in practice a single process will do this. > My approach generally is: let's have same rules as for multiple threads. I don't think it's quite as simple as saying that it's just like multiple threads. 
Creating/destroying QPs from a PD shared by multiple processes opens lots of problems. Let's take the mthca case: the userspace driver needs to have a table of QPN -> QP struct so that it can look up which QP a completion belongs to. This means that if process A and process B share QP X, then X has to be in the QP table of both processes. OK, that's fine, when QP X gets passed from A to B, then B can put it in the table. But what happens when B destroys QP X? How does process A know to take X out of its table? What if process A has died in the meantime? Or what if process A and process B share CQ 1, and process B creates QP Y in a non-shared PD but attaches it to CQ 1? What happens when process A polls a completion for QP Y from CQ 1? > > I guess we need this to be able to re-mmap doorbell pages etc, right? > > I wonder if there's a better way around that... maybe extending the > > kernel interface so that unrelated processes can share a context, eg > > by putting contexts in a filesystem or something like that. > > Hmm, I don't have principal objection, however this would mean > we'd have to change kernel-user interface again. the proposed > API extensions can mostly be done in userspace only. > > And it seems to me like much more work that just let the app > use unix domain sockets, for me. What are the advantages of this approach? > > Further, since there is already an existing kernel interface for this, > should we be inventing our own? The advantage is that sharing objects in a filesystem by doing open(), protected by permissions etc. is much more familiar than passing fds through sockets. I'm not sure it makes sense but shared memory + unix domain socket fd passing is not a very natural way for most people to program. - R. 
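To make the QP table problem above concrete, here is a small standalone model of a per-process QPN -> QP table. It is only a sketch: `struct qp`, `struct qp_table`, the table size and the helper names are made up for illustration and are not taken from libmthca.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QP_TABLE_SIZE 256       /* hypothetical size, chosen for the example */

struct qp {                     /* stand-in for the userspace driver's QP struct */
    uint32_t qpn;
};

struct qp_table {               /* each process keeps its own copy */
    struct qp *slot[QP_TABLE_SIZE];
};

/* Record a QP so completions carrying its QPN can be resolved. */
void qp_table_store(struct qp_table *t, struct qp *qp)
{
    t->slot[qp->qpn % QP_TABLE_SIZE] = qp;
}

/* Look up the QP a completion belongs to; NULL if this process has no entry. */
struct qp *qp_table_find(const struct qp_table *t, uint32_t qpn)
{
    struct qp *qp = t->slot[qpn % QP_TABLE_SIZE];
    return (qp && qp->qpn == qpn) ? qp : NULL;
}

/* Drop the entry when this process destroys the QP. */
void qp_table_clear(struct qp_table *t, uint32_t qpn)
{
    if (qp_table_find(t, qpn))
        t->slot[qpn % QP_TABLE_SIZE] = NULL;
}
```

If processes A and B both store QP X and B later calls `qp_table_clear()` on its own table, A's table still returns X from `qp_table_find()`. Nothing in this model tells A that the entry is stale, which is the question raised above.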
From rdreier at cisco.com Mon Jul 2 09:38:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:38:17 -0700 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070701121623.GD17699@minantech.com> (Gleb Natapov's message of "Sun, 1 Jul 2007 15:16:23 +0300") References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630220530.GB7554@mellanox.co.il> <20070701121623.GD17699@minantech.com> Message-ID: Based on what Gleb is saying, I think I agree with Dror: let's get SRC designed and then think about SSQ. And then if that generalizes further, we can do that -- but I don't think going for full generality immediately looks like it is producing something that anyone is interested in using, and it opens a ton of very difficult problems. From rdreier at cisco.com Mon Jul 2 09:42:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:42:14 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address In-Reply-To: <1183142276.18911.337.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 29 Jun 2007 11:37:56 -0700") References: <1183142276.18911.337.camel@brick.pathscale.com> Message-ID: > - for (; i >= 0; --i) > - ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); > + for (; i > 0; --i) > + ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE); Michael -- this looks rather clearly correct to me. Any objection to applying it? - R. From rdreier at cisco.com Mon Jul 2 09:43:00 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 09:43:00 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address In-Reply-To: <1183142276.18911.337.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 29 Jun 2007 11:37:56 -0700") References: <1183142276.18911.337.camel@brick.pathscale.com> Message-ID: ralph -- how did you find this bug? 
Hit it in practice or just code review? I'm trying to decide whether to get this into 2.6.22, or whether it can wait for 2.6.23. - R. From jsquyres at cisco.com Mon Jul 2 09:42:33 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 2 Jul 2007 18:42:33 +0200 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov> References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630220530.GB7554@mellanox.co.il> <20070701121623.GD17699@minantech.com> <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov> Message-ID: <9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com> On Jul 2, 2007, at 5:15 PM, Galen Shipman wrote: > While I think the SRC design makes sense I also have concerns > regarding SSQ. > As Gleb has pointed out, if the hardware doesn't do the demux then > the application has to. It sounds like there are two proposals to > deal with this hardware limitation in software (sigh). > > 1) Process A polls CQ, if WQE belongs to Process B, Process A will > drop the WQE in a shared memory region that Process B will poll. > [snip] > 2) Process A peeks CQ, if WQE belongs to Process B, it doesn't > process it [snip] > > In my opinion the demux belongs in the hardware, otherwise we end > up complicating an already complicated code base to support a > feature which unless I am missing something will have no benefit to > real applications. I agree. I cannot see how SSQ will be useful in Open MPI -- it makes the code *much* more complicated and effectively guarantees to add latency for the common case. I don't see how to explain it better than Gleb/Galen already did. If Mellanox wants to implement SSQ for other reasons, fine. But based on the explanations so far, I don't see us using it in [Open] MPI. 
-- Jeff Squyres
Cisco Systems

From hnguyen at linux.vnet.ibm.com Mon Jul 2 10:19:26 2007
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 2 Jul 2007 19:19:26 +0200
Subject: [ofa-general] idr_get_new_above() limitation?
Message-ID: <200707021919.27251.hnguyen@linux.vnet.ibm.com>

Hello,

For ehca device driver we're intending to utilize idr_get_new_above() and have written a test case, which I'm attaching at the end. Basically it tries to get an idr token above a lower boundary by calling idr_get_new_above() and then uses idr_find() to check if the returned token can be found. Here is our observation with 2.6.22-rc7 on ppc64:

Use lower boundary 0x3ffffffc

[root at xyz idr_bug]# insmod idr_test_mod.ko start=1073741820
insmod: error inserting 'idr_test_mod.ko': -1 Unknown symbol in module
[root at xyz idr_bug]# dmesg -c
i=3ffffffc token=3ffffffc t=000000003ffffffc
i=3ffffffd token=3ffffffd t=000000003ffffffd
i=3ffffffe token=3ffffffe t=000000003ffffffe
i=3fffffff token=3fffffff t=000000003fffffff
i=40000000 token=40000000 t=0000000000000000
Invalid object 0000000000000000. Expected 40000000

That means token 0x40000000 seems to be the "upper boundary" of idr_find(). However the behaviour is not consistent in that it was returned by idr_get_new_above(). Looking at

void *idr_find(struct idr *idp, int id)
{
        int n;
        struct idr_layer *p;

        n = idp->layers * IDR_BITS;
        p = idp->top;

        /* Mask off upper bits we don't use for the search. */
        id &= MAX_ID_MASK;

        if (id >= (1 << n))
                return NULL;

        while (n > 0 && p) {
                n -= IDR_BITS;
                p = p->ary[(id >> n) & IDR_MASK];
        }
        return((void *)p);
}

we found that the if-condition has failed:

layers = 5
IDR_BITS = 6
n = 30
(id >= (1 << n)) = (0x40000000 >= 0x40000000) = 1

Since MAX_ID_MASK=0x7fffffff, I'm wondering if 0x40000000 is the actual upper boundary. Any hints or suggestions are appreciated.

Thanks!
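The range check above can be reproduced in plain userspace C. This is only an arithmetic sketch using the values reported in this message (IDR_BITS = 6, MAX_ID_MASK = 0x7fffffff); `idr_find_rejects()` is a made-up helper, not a kernel function, and unlike the kernel code it does the shift in 64 bits so that a layer count above 5 can be evaluated without overflow.

```c
#include <assert.h>

#define IDR_BITS    6           /* value reported above for ppc64 */
#define MAX_ID_MASK 0x7fffffff  /* value reported above */

/*
 * Returns 1 if the "id >= (1 << n)" test in idr_find() would make it
 * return NULL for a tree with the given number of layers, 0 otherwise.
 */
int idr_find_rejects(int layers, int id)
{
    int n = layers * IDR_BITS;
    unsigned long long limit = 1ULL << n;  /* widened on purpose, see above */

    id &= MAX_ID_MASK;
    return (unsigned long long)id >= limit;
}
```

With layers = 5 the limit is 1 << 30 = 0x40000000, so 0x3fffffff passes the check and 0x40000000 is rejected, matching the dmesg output. With six layers the same id would pass; that is arithmetic only and says nothing about whether idr_get_new_above() actually grows the tree.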
Nam

/* header names were stripped by the archive; module.h and idr.h assumed */
#include <linux/module.h>
#include <linux/idr.h>

MODULE_LICENSE("GPL");

int start_opt = 0x7e000000;
module_param_named(start, start_opt, int, 0);
MODULE_PARM_DESC(start, "Start token for idr_get_new_above(). Default 0x7e000000");

static int __init idr_test_init(void)
{
        DEFINE_IDR(idr);
        int token, ret;
        unsigned long i;

        for (i = start_opt; i <= MAX_ID_MASK; i++) {
                void * t;

                /* preallocate idr memory before each allocation attempt */
                if (!idr_pre_get(&idr, GFP_KERNEL)) {
                        printk(KERN_ERR "ERROR: Out of mem\n");
                        return -ENOENT;
                }
                /* store i itself as the object so idr_find() can be checked */
                ret = idr_get_new_above(&idr, (void*)i, start_opt, &token);
                switch (ret) {
                case 0:
                        t = idr_find(&idr, token);
                        printk(KERN_ERR "i=%lx token=%x t=%p\n", i, token, t);
                        if (t != (void*)i) {
                                printk(KERN_ERR "Invalid object %p. Expected %lx\n", t, i);
                                return -ENOENT;
                        }
                        break;
                case -EAGAIN:
                        /* retry the same i after another idr_pre_get() */
                        i--;
                        printk("idr_get_new_above() ret=-EAGAIN\n");
                        break;
                default:
                        printk(KERN_ERR "ERROR: Out of mem\n");
                        break;
                }
        }
        /*
         * return an error in any case since we don't need the module
         * loaded anyway.
         */
        return -ENOENT;
}

static void __exit idr_test_exit(void)
{
        printk(KERN_ERR "module exit\n");
}

module_init(idr_test_init);
module_exit(idr_test_exit);

From mst at dev.mellanox.co.il Mon Jul 2 10:48:06 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 2 Jul 2007 20:48:06 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <1183382948.4377.136789.camel@hal.voltaire.com> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <1183382948.4377.136789.camel@hal.voltaire.com> Message-ID: <20070702174806.GE17858@mellanox.co.il> > Quoting Hal Rosenstock : > Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects > > On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote: > > Hal Rosenstock wrote: > > > On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: > > > > > >> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: > > >> > > >>>> SSQ is needed for scalability, no need to explain this (by > > >>>> the way RD is needed for the same reason too. What's Mellanox > > >>>> plan to support it? > > >>>> > > >>> RD is not supported in hardware today. Implementing RD is extremely > > >>> complicated. To solve the scalability issues on MPI like applications > > >>> we believe that SRC and SSQ are the right solutions. It is much simpler > > >>> for implementation by both software and hardware. By MPI-like I refer > > >>> to applications that have some level of trust between two processes of > > >>> the > > >>> same application. RD also has some performance issues as it only > > >>> supports one message in the air. Those performance issues are solved > > >>> by design in SRC/SSQ. > > >>> > > >>> > > >> Didn't know about RD limitation. Is this shortcomings of IB spec or > > >> general limitation of reliable datagram? RD looks much nice to me then SRC/SSQ. > > >> > > > > > > I think Dror is referring to number of messages in flight per EEC and > > > number of messages in flight per QP being limited to 1 per IBA spec. 
> > > Number of messages enqueued per EEC/QP is implementation dependent. > > > > > > -- Hal > > > > > Correct. The number of messages in flight per EEC is 1 per IB spec. > > The fact that IB requires SQ WQEs to complete in order, even if their > > destination is different EECs, > > Where's this requirement in the spec (and could this be relaxed as it > seems like it is overly "specified") ? Just wondering... For example: 10.8.5 RETURNING COMPLETED WORK REQUESTS ... Except for RD Receive Work Queues and Receive Work Queues associ- ated with an SRQ, Work Completions are always returned in the order submitted to a given Work Queue with respect to other Work Requests on that Work Queue. -- MST From mst at dev.mellanox.co.il Mon Jul 2 11:39:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 21:39:06 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <1183382948.4377.136789.camel@hal.voltaire.com> Message-ID: <20070702183906.GH17858@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects > > > > Correct. The number of messages in flight per EEC is 1 per IB spec. > > > The fact that IB requires SQ WQEs to complete in order, even if their > > > destination is different EECs, > > > > Where's this requirement in the spec (and could this be relaxed as it > > seems like it is overly "specified") ? Just wondering... > > I don't think we want to relax the requirement that work requests > complete in order. It's hard enough to get applications correct > without having to worry about out-of-order completions, Hmm, they seem to deal fine with this in case of SRQ. Why not here? 
I guess this depends on the application, but let's look at something like IPoIB or SDP: all we do when we get a send completion is look up a WR and free it. It won't be too hard to deal with out of order, either. If an app uses a pointer as WRID, it's even easier. > and I think > specifying all the corner cases would be a nightmare. Eg do we allow > successful completions after a completion with error? and so on... However, as Dror notes, the in-order requirement simply moves the complexity to hardware. Which might be one of the reasons why there are no HW implementations of RD out there. -- MST From mst at dev.mellanox.co.il Mon Jul 2 11:46:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 21:46:30 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> Message-ID: <20070702184630.GI17858@mellanox.co.il> > > > I think there are probably bugs > > > in the locked_vm accounting in the kernel right now -- it doesn't take > > > into account the possibility of passing context fds from one process > > > to another. > > > > Hmm, might be a good idea to fix the bugs anyway, no? > > Yes, I guess we need to take a reference on the mm structure in > ib_umem_get() and only drop it after we free the umem. IMO this would create problems at process exit time. Maybe set umem->mm at umem_get time, and, in umem_release, just validate that umem->mm == current->mm? -- MST From mst at dev.mellanox.co.il Mon Jul 2 11:56:30 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 2 Jul 2007 21:56:30 +0300 Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address In-Reply-To: References: <1183142276.18911.337.camel@brick.pathscale.com> Message-ID: <20070702185630.GJ17858@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address > > > - for (; i >= 0; --i) > > - ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); > > + for (; i > 0; --i) > > + ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE); > > Michael -- this looks rather clearly correct to me. Any objection to > applying it? Yes, the patch looks clearly correct to me. I recently saw a crash on one system which looks like it could be related: Call Trace: {:ib_ipoib:ipoib_cm_alloc_rx_skb+796} {:ib_ipoib:ipoib_cm_post_receive+119} {selinux_socket_sock_rcv_skb+530} {:ib_ipoib:ipoib_cm_handle_rx_wc+477} {sock_def_readable+16} {udp_queue_rcv_skb+827} {udp_rcv+1153} {:ib_ipoib:ipoib_ib_completion+144} {activate_task+124} {:ib_mthca:mthca_eq_int+215} {:ib_mthca:mthca_arbel_interrupt+56} {handle_IRQ_event+41} {do_IRQ+197} {ret_from_intr+0} I hoped to get this patch stress tested and report whether it helps before Acking. But it seems this won't happen soon since that system is busy. It's probably best to apply this patch. Acked-by: Michael S. Tsirkin -- MST From or.gerlitz at gmail.com Mon Jul 2 12:22:34 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 2 Jul 2007 22:22:34 +0300 Subject: [ofa-general] Re: IPoIB-CM UC mode In-Reply-To: <20070702145328.GC17858@mellanox.co.il> References: <4688F88D.4090806@voltaire.com> <20070702145328.GC17858@mellanox.co.il> Message-ID: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> On 7/2/07, Michael S. Tsirkin wrote: > > > Quoting Or Gerlitz : > Is there any --other-- part of the stack (eg mthca,cm) that needs to be > > enhanced for that? > > Not a whole lot. 
> We need an API to detect this feature support in HW. There could be a bit > of > work in mthca to detect HW/FW support for this feature, and enable > connecting UC > QPs to SRQ. There could be a bit of debugging work in CM in case we hit > some > bugs with LAP messages (which I plan to use for liveness detection). > Thanks for the info. Can you please elaborate a little more on the LAP based liveness detection mechanism you were thinking about? I might want to deploy it in another app. Actually, looking on IPoIB-CM RC based implementation I don't really understand its "liveness detection" mechanism... In ipoib_cm_send_req() I see that the code sets both the RC QP retries AND rnr retries to 0... doesn't this mean that a single RNR NAK would cause a TX QP to move to ERROR? can you clarify how do you use the "R" of "RC" here? thanks, Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Jul 2 12:25:58 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 02 Jul 2007 12:25:58 -0700 Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs In-Reply-To: References: Message-ID: <46895146.3050700@ichips.intel.com> > This seems to be solved with the below patch, however, i see that > for duplicate REQs the code is much more involved, which means i > might be over-simplifying here... I don't think you're over-simplifying. The REQ handling seems more involved because the connected state machine is more complex. REQ handling deals with duplicate messages by waiting to set the cm id state. We could do the same in the sidr req handler. Since we're in this part of the code, I'll create a patch to fix the 'todo' comment in this function, to ensure that both patches fit together cleanly. 
Thanks - Sean From or.gerlitz at gmail.com Mon Jul 2 12:33:25 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 2 Jul 2007 22:33:25 +0300 Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs In-Reply-To: <46895146.3050700@ichips.intel.com> References: <46895146.3050700@ichips.intel.com> Message-ID: <15ddcffd0707021233h45233dd8jfc213053138a5e01@mail.gmail.com> On 7/2/07, Sean Hefty wrote: > > > This seems to be solved with the below patch, however, i see that > > for duplicate REQs the code is much more involved, which means i > > might be over-simplifying here... > > I don't think you're over-simplifying. The REQ handling seems more > involved because the connected state machine is more complex. OK REQ handling deals with duplicate messages by waiting to set the cm id > state. We could do the same in the sidr req handler. Can you clarify what "waiting to set the cm id state" means? > Since we're in this part of the code, I'll create a patch to fix the > 'todo' comment in this function, to ensure that both patches fit > together cleanly. Assuming you refer to "todo: reply with no match" in cm_sidr_req_handler, what else need to be added to the current code? is it sending the REP with a different status (ie not 2) or sending a REJ? Or.
From gdror at dev.mellanox.co.il Mon Jul 2 12:42:22 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Mon, 02 Jul 2007 22:42:22 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com> References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630220530.GB7554@mellanox.co.il> <20070701121623.GD17699@minantech.com> <892F6C42-ED3F-42F4-8D97-B801DAAA3CD9@lanl.gov> <9870CCF3-2F68-4B88-9039-CE688FA18F80@cisco.com> Message-ID: <4689551E.7020204@dev.mellanox.co.il> Jeff Squyres wrote: > On Jul 2, 2007, at 5:15 PM, Galen Shipman wrote: > >> While I think the SRC design makes sense I also have concerns >> regarding SSQ. >> As Gleb has pointed out, if the hardware doesn't do the demux then >> the application has to. It sounds like there are two proposals to >> deal with this hardware limitation in software (sigh). >> >> 1) Process A polls CQ, if WQE belongs to Process B, Process A will >> drop the WQE in a shared memory region that Process B will poll. [snip] >> 2) Process A peeks CQ, if WQE belongs to Process B, it doesn't >> process it [snip] >> >> In my opinion the demux belongs in the hardware, otherwise we end up >> complicating an already complicated code base to support a feature >> which unless I am missing something will have no benefit to real >> applications. I agree about this deficiency and unfortunately I don't think we can do anything about it with the current generation. As I said before, I don't have a quantitative data about how this might affect the overall performance of the application. If polling the CQ of the SQ is not in the critical performance path, it may end up having a negligible impact. But it might as well turn up to have some impact on performance. > > I agree.
I cannot see how SSQ will be useful in Open MPI -- it makes > the code *much* more complicated and effectively guarantees to add > latency for the common case. I don't see how to explain it better > than Gleb/Galen already did. > > If Mellanox wants to implement SSQ for other reasons, fine. But based > on the explanations so far, I don't see us using it in [Open] MPI. The main intention of SSQ is for MPI which dominates the large/huge clusters. The intention is to help in scalability, which may have some impact on performance in some cases. I think that for now we should first start with SRC, and thus significantly improve the scalability. Let us worry a bit later about SSQ. > > --Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From gdror at dev.mellanox.co.il Mon Jul 2 12:49:44 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Mon, 02 Jul 2007 22:49:44 +0300 Subject: [ofa-general] Re: IPoIB-CM UC mode In-Reply-To: <4688F88D.4090806@voltaire.com> References: <4688F88D.4090806@voltaire.com> Message-ID: <468956D8.6020502@dev.mellanox.co.il> Or Gerlitz wrote: > Dror, > > can you please clarify > > A) if the IBTA change to allow attaching SRQ to UC QPs is done? Unfortunately I don't think I can comment on this outside an IBTA forum. But as an IBTA member, you can check out the SWG mail-thread or wait for the next spec membership review. > B) when it would be possible for you guys to support SRQ/UC in the FW? You probably want support in both firmware and driver. Let me check. > > Michael, > > If Dror says yes on both... what would it take to implement IPoIB-CM/UC? > > Is there any --other-- part of the stack (eg mthca,cm) that needs to > be enhanced for that? > > Or. 
> > From mst at dev.mellanox.co.il Mon Jul 2 12:53:14 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 22:53:14 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> Message-ID: <20070702195314.GA31169@mellanox.co.il> > > Quoting Or Gerlitz : > > Subject: Re: Re: IPoIB-CM UC mode > > > > On 7/2/07, Michael S. Tsirkin wrote: > > > > > Quoting Or Gerlitz : > > > > > Is there any --other-- part of the stack (eg mthca,cm) that needs to be > > > enhanced for that? > > > > Not a whole lot. > > We need an API to detect this feature support in HW. There could be a bit of > > work in mthca to detect HW/FW support for this feature, and enable connecting UC > > QPs to SRQ. There could be a bit of debugging work in CM in case we hit some > > bugs with LAP messages (which I plan to use for liveness detection). > > > > Thanks for the info. Can you please elaborate a little more on the LAP based > liveness detection mechanism you were thinking about? I might want to deploy > it in another app. With UC, if the remote side looses our QP, we get no indication whatsoever. But we don't want to destroy/recreate connections unless strictly necessary. So we must send something that will force remote side to respond. One such message is LAP with current primary path used as proposed alternate path. Remote will respond with APR with AP status 5 if the connection is there, and status 1 if it is not. > Actually, looking on IPoIB-CM RC based implementation I don't really > understand its "liveness detection" mechanism... In ipoib_cm_send_req() I see > that the code sets both the RC QP retries AND rnr retries to 0... doesn't this > mean that a single RNR NAK would cause a TX QP to move to ERROR? Yes, this is from spec, BTW. 
More importantly, a timeout will cause error, and we'll retry connection on next packet. > can you > clarify how do you use the "R" of "RC" here? The two reasons I used RC is because 1. UC does not support SRQ yet. 2. It's easier to detect connection is alive. -- MST From mst at dev.mellanox.co.il Mon Jul 2 12:59:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 2 Jul 2007 22:59:28 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> Message-ID: <20070702195927.GB31169@mellanox.co.il> > > > I guess we need this to be able to re-mmap doorbell pages etc, right? > > > I wonder if there's a better way around that... maybe extending the > > > kernel interface so that unrelated processes can share a context, eg > > > by putting contexts in a filesystem or something like that. > > The advantage is that sharing objects in a filesystem by doing open(), > protected by permissions etc. is much more familiar than passing fds > through sockets. I'm not sure it makes sense but shared memory + unix > domain socket fd passing is not a very natural way for most people to > program. Could you please clarify how do you envision this done? Do we just create our own filesystem? Reason I ask, we'll need something like this for SRC domain too ... -- MST From mshefty at ichips.intel.com Mon Jul 2 13:03:36 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 02 Jul 2007 13:03:36 -0700 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070702195314.GA31169@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> Message-ID: <46895A18.2000100@ichips.intel.com> > So we must send something that will force remote side to respond. 
One such > message is LAP with current primary path used as proposed alternate path. > Remote will respond with APR with AP status 5 if the connection is there, and > status 1 if it is not. I didn't follow this. Is this just an out of band keep alive message? Why not use DREQ to indicate that the connection went away under normal circumstances, and a send failure in an abnormal termination case? - Sean From gdror at dev.mellanox.co.il Mon Jul 2 13:08:43 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Mon, 02 Jul 2007 23:08:43 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070702174806.GE17858@mellanox.co.il> References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <1183382948.4377.136789.camel@hal.voltaire.com> <20070702174806.GE17858@mellanox.co.il> Message-ID: <46895B4B.8080909@dev.mellanox.co.il> Michael S. Tsirkin wrote: >> Quoting Hal Rosenstock : >> Subject: Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects >> >> On Mon, 2007-07-02 at 08:58, Dror Goldenberg wrote: >> >>> Hal Rosenstock wrote: >>> >>>> On Sun, 2007-07-01 at 15:05, Gleb Natapov wrote: >>>> >>>> >>>>> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote: >>>>> >>>>> >>>>>>> SSQ is needed for scalability, no need to explain this (by >>>>>>> the way RD is needed for the same reason too. What's Mellanox >>>>>>> plan to support it? >>>>>>> >>>>>>> >>>>>> RD is not supported in hardware today. Implementing RD is extremely >>>>>> complicated. To solve the scalability issues on MPI like applications >>>>>> we believe that SRC and SSQ are the right solutions. It is much simpler >>>>>> for implementation by both software and hardware. 
By MPI-like I refer >>>>>> to applications that have some level of trust between two processes of >>>>>> the >>>>>> same application. RD also has some performance issues as it only >>>>>> supports one message in the air. Those performance issues are solved >>>>>> by design in SRC/SSQ. >>>>>> >>>>>> >>>>>> >>>>> Didn't know about RD limitation. Is this a shortcoming of the IB spec or >>>>> a general limitation of reliable datagram? RD looks much nicer to me than SRC/SSQ. >>>>> >>>>> >>>> I think Dror is referring to number of messages in flight per EEC and >>>> number of messages in flight per QP being limited to 1 per IBA spec. >>>> Number of messages enqueued per EEC/QP is implementation dependent. >>>> >>>> -- Hal >>>> >>>> >>> Correct. The number of messages in flight per EEC is 1 per IB spec. >>> The fact that IB requires SQ WQEs to complete in order, even if their >>> destination is different EECs, >>> >> Where's this requirement in the spec (and could this be relaxed as it >> seems like it is overly "specified") ? Just wondering... >> > > For example: > 10.8.5 RETURNING COMPLETED WORK REQUESTS > > ... > > Except for RD Receive Work Queues and Receive Work Queues > associated with an SRQ, Work Completions are always returned in the order > submitted to a given Work Queue with respect to other Work Requests on > that Work Queue. > > I referred to: o10-52: If the CI supports RD Service, Work Requests submitted to the same RD Send Queue shall complete in the same order in which they were submitted. And I agree with Roland that it isn't worth breaking it. And even if you do want to break it, it is still a mess to actually implement it in hardware and that is the main reason you see no RD implementations out there.
From or.gerlitz at gmail.com Mon Jul 2 13:13:42 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 2 Jul 2007 23:13:42 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070702195314.GA31169@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> Message-ID: <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> On 7/2/07, Michael S. Tsirkin wrote: > > > > Quoting Or Gerlitz : > Thanks for the info. Can you please elaborate a little more on the LAP > based > > liveness detection mechanism you were thinking about? I might want to > deploy > > it in another app. > > With UC, if the remote side loses our QP, we get no indication > whatsoever. But > we don't want to destroy/recreate connections unless strictly necessary. Why do we care if the remote side lost our QP? My thinking is that we (TX QP) should care if the remote side (RX QP) is still there, and this is achieved by RC as you explain below. So we must send something that will force remote side to respond. One such > message is LAP with current primary path used as proposed alternate path. > Remote will respond with APR with AP status 5 if the connection is there, > and > status 1 if it is not. Got it. The current app I was referring to uses UD and not UC, so I guess LAP is not possible. > Actually, looking on IPoIB-CM RC based implementation I don't really > > understand its "liveness detection" mechanism... In ipoib_cm_send_req() > I see > > that the code sets both the RC QP retries AND rnr retries to 0... > doesn't this > > mean that a single RNR NAK would cause a TX QP to move to ERROR? > > Yes, this is from spec, BTW. > More importantly, a timeout will cause error, and we'll retry connection > on next packet. So with the current IPoIB-CM implementation, a single RNR NAK and/or a single ACK loss would cause a reconnection, wow...
this does not sound very ready for production... My understanding is that A) as the IP layer is seen as unreliable, RC buys us nothing B) the current code usage of RC B.1) is inefficient by nature since it loads the IB fabrics with ACKs and NAKs B.2) reconnects on each loss/nak - adds more inefficiency we should move to UC am I missing something, what does RC buy us that UC does not? > can you > > clarify how do you use the "R" of "RC" here? > > The two reasons I used RC is because > 1. UC does not support SRQ yet. > 2. It's easier to detect connection is alive. > I wanted to understand the "how" in detail and not high-level (2 above) or env reasons (1 above) thanks anyway, Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Mon Jul 2 13:38:00 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 2 Jul 2007 13:38:00 -0700 Subject: [ofa-general] [PATCH] fix flow of handling duplicate SIDR REQs In-Reply-To: <15ddcffd0707021233h45233dd8jfc213053138a5e01@mail.gmail.com> Message-ID: <000501c7bce8$e191c5d0$3c98070a@amr.corp.intel.com> > Can you clarify what "waiting to set the cm id state" means? Something like this: diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index c7007c4..3dca385 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -2794,7 +2794,6 @@ static int cm_sidr_req_handler(struct cm_work *work) work->mad_recv_wc->recv_buf.grh, &cm_id_priv->av); cm_id_priv->id.remote_id = sidr_req_msg->request_id; - cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; cm_id_priv->tid = sidr_req_msg->hdr.tid; atomic_inc(&cm_id_priv->work_count); @@ -2813,6 +2812,7 @@ static int cm_sidr_req_handler(struct cm_work *work) /* todo: reply with no match */ goto out; /* No match.
*/ } + cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD; atomic_inc(&cur_cm_id_priv->refcount); spin_unlock_irq(&cm.lock); > Assuming you refer to "todo: reply with no match" in cm_sidr_req_handler, > what else need to be added to the current code? is it sending the REP > with a different status (ie not 2) or sending a REJ? This is all that needs to be done. The status should be 1, not 2. At this point, it's likely just a couple lines to fix. - Sean From sean.hefty at intel.com Mon Jul 2 14:00:19 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 2 Jul 2007 14:00:19 -0700 Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support In-Reply-To: <20070702051116.GB5018@mellanox.co.il> Message-ID: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com> Add SA client support for notice/trap registration using InformInfo. Clients can use the ib_sa interface to register for SA events based on trap numbers, and receive SA event notification. This allows clients to receive notification, such as GID in/out of service. Signed-off-by: Sean Hefty --- Back by popular demand! Reposting of the local SA patches!
drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/notice.c | 749 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/core/sa.h | 16 + drivers/infiniband/core/sa_query.c | 316 +++++++++++++++ include/rdma/ib_sa.h | 171 ++++++++ 5 files changed, 1251 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index cb1ab3e..7c5b5ed 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o +ib_sa-y := sa_query.o multicast.o notice.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c new file mode 100644 index 0000000..e4c73c8 --- /dev/null +++ b/drivers/infiniband/core/notice.c @@ -0,0 +1,749 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void inform_add_one(struct ib_device *device); +static void inform_remove_one(struct ib_device *device); + +static struct ib_client inform_client = { + .name = "ib_notice", + .add = inform_add_one, + .remove = inform_remove_one +}; + +static struct ib_sa_client sa_client; +static struct workqueue_struct *inform_wq; + +struct inform_device; + +struct inform_port { + struct inform_device *dev; + spinlock_t lock; + struct rb_root table; + atomic_t refcount; + struct completion comp; + u8 port_num; +}; + +struct inform_device { + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int end_port; + struct inform_port port[0]; +}; + +enum inform_state { + INFORM_IDLE, + INFORM_REGISTERING, + INFORM_MEMBER, + INFORM_BUSY, + INFORM_ERROR +}; + +struct inform_member; + +struct inform_group { + u16 trap_number; + struct rb_node node; + struct inform_port *port; + spinlock_t lock; + struct work_struct work; + struct list_head pending_list; + struct list_head active_list; + struct list_head notice_list; + struct inform_member *last_join; + int members; + enum inform_state join_state; /* State relative to SA */ + atomic_t refcount; + enum inform_state state; + struct ib_sa_query *query; + int query_id; +}; + +struct 
inform_member { + struct ib_inform_info info; + struct ib_sa_client *client; + struct inform_group *group; + struct list_head list; + enum inform_state state; + atomic_t refcount; + struct completion comp; +}; + +struct inform_notice { + struct list_head list; + struct ib_sa_notice notice; +}; + +static void reg_handler(int status, struct ib_sa_inform *inform, + void *context); +static void unreg_handler(int status, struct ib_sa_inform *inform, + void *context); + +static struct inform_group *inform_find(struct inform_port *port, + u16 trap_number) +{ + struct rb_node *node = port->table.rb_node; + struct inform_group *group; + + while (node) { + group = rb_entry(node, struct inform_group, node); + if (trap_number < group->trap_number) + node = node->rb_left; + else if (trap_number > group->trap_number) + node = node->rb_right; + else + return group; + } + return NULL; +} + +static struct inform_group *inform_insert(struct inform_port *port, + struct inform_group *group) +{ + struct rb_node **link = &port->table.rb_node; + struct rb_node *parent = NULL; + struct inform_group *cur_group; + + while (*link) { + parent = *link; + cur_group = rb_entry(parent, struct inform_group, node); + if (group->trap_number < cur_group->trap_number) + link = &(*link)->rb_left; + else if (group->trap_number > cur_group->trap_number) + link = &(*link)->rb_right; + else + return cur_group; + } + rb_link_node(&group->node, parent, link); + rb_insert_color(&group->node, &port->table); + return NULL; +} + +static void deref_port(struct inform_port *port) +{ + if (atomic_dec_and_test(&port->refcount)) + complete(&port->comp); +} + +static void release_group(struct inform_group *group) +{ + struct inform_port *port = group->port; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + if (atomic_dec_and_test(&group->refcount)) { + rb_erase(&group->node, &port->table); + spin_unlock_irqrestore(&port->lock, flags); + kfree(group); + deref_port(port); + } else + 
spin_unlock_irqrestore(&port->lock, flags); +} + +static void deref_member(struct inform_member *member) +{ + if (atomic_dec_and_test(&member->refcount)) + complete(&member->comp); +} + +static void queue_reg(struct inform_member *member) +{ + struct inform_group *group = member->group; + unsigned long flags; + + spin_lock_irqsave(&group->lock, flags); + list_add(&member->list, &group->pending_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + spin_unlock_irqrestore(&group->lock, flags); +} + +static int send_reg(struct inform_group *group, struct inform_member *member) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.subscribe = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number); + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + group->last_join = member; + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + port->port_num, &inform, 3000, GFP_KERNEL, + reg_handler, group,&group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static int send_unreg(struct inform_group *group) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(group->trap_number); + inform.trap.generic.qpn = IB_QP1; + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + 
port->port_num, &inform, 3000, GFP_KERNEL, + unreg_handler, group, &group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static void join_group(struct inform_group *group, struct inform_member *member) +{ + member->state = INFORM_MEMBER; + group->members++; + list_move(&member->list, &group->active_list); +} + +static int fail_join(struct inform_group *group, struct inform_member *member, + int status) +{ + spin_lock_irq(&group->lock); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + return member->info.callback(status, &member->info, NULL); +} + +static void process_group_error(struct inform_group *group) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct inform_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + group->members--; + member->state = INFORM_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->info.callback(-ENETRESET, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + group->join_state = INFORM_IDLE; + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); +} + +/* + * Report a notice to all active subscribers. We use a temporary list to + * handle unsubscription requests while the notice is being reported, which + * avoids holding the group lock while in the user's callback. 
+ */ +static void process_notice(struct inform_group *group, + struct inform_notice *info_notice) +{ + struct inform_member *member; + struct list_head list; + int ret; + + INIT_LIST_HEAD(&list); + + spin_lock_irq(&group->lock); + list_splice_init(&group->active_list, &list); + while (!list_empty(&list)) { + + member = list_entry(list.next, struct inform_member, list); + atomic_inc(&member->refcount); + list_move(&member->list, &group->active_list); + spin_unlock_irq(&group->lock); + + ret = member->info.callback(0, &member->info, + &info_notice->notice); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + spin_unlock_irq(&group->lock); +} + +static void inform_work_handler(struct work_struct *work) +{ + struct inform_group *group; + struct inform_member *member; + struct ib_inform_info *info; + struct inform_notice *info_notice; + int status, ret; + + group = container_of(work, typeof(*group), work); +retest: + spin_lock_irq(&group->lock); + while (!list_empty(&group->pending_list) || + !list_empty(&group->notice_list) || + (group->state == INFORM_ERROR)) { + + if (group->state == INFORM_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + + if (!list_empty(&group->notice_list)) { + info_notice = list_entry(group->notice_list.next, + struct inform_notice, list); + list_del(&info_notice->list); + spin_unlock_irq(&group->lock); + process_notice(group, info_notice); + kfree(info_notice); + goto retest; + } + + member = list_entry(group->pending_list.next, + struct inform_member, list); + info = &member->info; + atomic_inc(&member->refcount); + + if (group->join_state == INFORM_MEMBER) { + join_group(group, member); + spin_unlock_irq(&group->lock); + ret = info->callback(0, info, NULL); + } else { + spin_unlock_irq(&group->lock); + status = send_reg(group, member); + if (!status) { + deref_member(member); + return; + } + ret = fail_join(group, member, status); + 
} + + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + if (!group->members && (group->join_state == INFORM_MEMBER)) { + group->join_state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + if (send_unreg(group)) + goto retest; + } else { + group->state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + release_group(group); + } +} + +/* + * Fail a join request if it is still active - at the head of the pending queue. + */ +static void process_join_error(struct inform_group *group, int status) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + member = list_entry(group->pending_list.next, + struct inform_member, list); + if (group->last_join == member) { + atomic_inc(&member->refcount); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = member->info.callback(status, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + } else + spin_unlock_irq(&group->lock); +} + +static void reg_handler(int status, struct ib_sa_inform *inform, void *context) +{ + struct inform_group *group = context; + + if (status) + process_join_error(group, status); + else + group->join_state = INFORM_MEMBER; + + inform_work_handler(&group->work); +} + +static void unreg_handler(int status, struct ib_sa_inform *rec, void *context) +{ + struct inform_group *group = context; + + inform_work_handler(&group->work); +} + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice) +{ + struct inform_device *dev; + struct inform_port *port; + struct inform_group *group; + struct inform_notice *info_notice; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return 0; /* No one to give notice to. */ + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irq(&port->lock); + group = inform_find(port, __be16_to_cpu(notice->trap. 
+ generic.trap_num)); + if (!group) { + spin_unlock_irq(&port->lock); + return 0; + } + + atomic_inc(&group->refcount); + spin_unlock_irq(&port->lock); + + info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL); + if (!info_notice) { + release_group(group); + return -ENOMEM; + } + + info_notice->notice = *notice; + + spin_lock_irq(&group->lock); + list_add(&info_notice->list, &group->notice_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + inform_work_handler(&group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + return 0; +} + +static struct inform_group *acquire_group(struct inform_port *port, + u16 trap_number, gfp_t gfp_mask) +{ + struct inform_group *group, *cur_group; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + group = inform_find(port, trap_number); + if (group) + goto found; + spin_unlock_irqrestore(&port->lock, flags); + + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + + group->port = port; + group->trap_number = trap_number; + INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); + INIT_LIST_HEAD(&group->notice_list); + INIT_WORK(&group->work, inform_work_handler); + spin_lock_init(&group->lock); + + spin_lock_irqsave(&port->lock, flags); + cur_group = inform_insert(port, group); + if (cur_group) { + kfree(group); + group = cur_group; + } else + atomic_inc(&port->refcount); +found: + atomic_inc(&group->refcount); + spin_unlock_irqrestore(&port->lock, flags); + return group; +} + +/* + * We serialize all join requests to a single group to make our lives much + * easier. Otherwise, two users could try to join the same group + * simultaneously, with different configurations, one could leave while the + * join is in progress, etc., which makes locking around error recovery + * difficult. 
+ */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context) +{ + struct inform_device *dev; + struct inform_member *member; + struct ib_inform_info *info; + int ret; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return ERR_PTR(-ENODEV); + + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); + + ib_sa_client_get(client); + member->client = client; + member->info.trap_number = trap_number; + member->info.callback = callback; + member->info.context = context; + init_completion(&member->comp); + atomic_set(&member->refcount, 1); + member->state = INFORM_REGISTERING; + + member->group = acquire_group(&dev->port[port_num - dev->start_port], + trap_number, gfp_mask); + if (!member->group) { + ret = -ENOMEM; + goto err; + } + + /* + * The user will get the info structure in their callback. They + * could then free the info structure before we can return from + * this routine. So we save the pointer to return before queuing + * any callback. 
+ */ + info = &member->info; + queue_reg(member); + return info; + +err: + ib_sa_client_put(member->client); + kfree(member); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_sa_register_inform_info); + +void ib_sa_unregister_inform_info(struct ib_inform_info *info) +{ + struct inform_member *member; + struct inform_group *group; + + member = container_of(info, struct inform_member, info); + group = member->group; + + spin_lock_irq(&group->lock); + if (member->state == INFORM_MEMBER) + group->members--; + + list_del_init(&member->list); + + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + /* Continue to hold reference on group until callback */ + queue_work(inform_wq, &group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + deref_member(member); + wait_for_completion(&member->comp); + ib_sa_client_put(member->client); + kfree(member); +} +EXPORT_SYMBOL(ib_sa_unregister_inform_info); + +static void inform_groups_lost(struct inform_port *port) +{ + struct inform_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct inform_group, node); + spin_lock(&group->lock); + if (group->state == INFORM_IDLE) { + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + group->state = INFORM_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void inform_event_handler(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct inform_device *dev; + + dev = container_of(handler, struct inform_device, event_handler); + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + inform_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } +} + +static 
void inform_add_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + port->dev = dev; + port->port_num = dev->start_port + i; + spin_lock_init(&port->lock); + port->table = RB_ROOT; + init_completion(&port->comp); + atomic_set(&port->refcount, 1); + } + + dev->device = device; + ib_set_client_data(device, &inform_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, inform_event_handler); + ib_register_event_handler(&dev->event_handler); +} + +static void inform_remove_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(inform_wq); + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + deref_port(port); + wait_for_completion(&port->comp); + } + + kfree(dev); +} + +int notice_init(void) +{ + int ret; + + inform_wq = create_singlethread_workqueue("ib_inform"); + if (!inform_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + + ret = ib_register_client(&inform_client); + if (ret) + goto err; + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); + return ret; +} + +void notice_cleanup(void) +{ + ib_unregister_client(&inform_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); +} diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index 24c93fd..b8eac66 
100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -63,4 +63,20 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, int mcast_init(void); void mcast_cleanup(void); +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice); + +int notice_init(void); +void notice_cleanup(void); + #endif /* SA_H */ diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 4791d01..23d1081 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -62,10 +62,12 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; + struct ib_mad_agent *notice_agent; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; u8 port_num; + struct ib_device *device; }; struct ib_sa_device { @@ -102,6 +104,12 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; +struct ib_sa_inform_query { + void (*callback)(int, struct ib_sa_inform *, void *); + void *context; + struct ib_sa_query sa_query; +}; + static void ib_sa_add_one(struct ib_device *device); static void ib_sa_remove_one(struct ib_device *device); @@ -353,6 +361,110 @@ static const struct ib_field service_rec_table[] = { .size_bits = 2*64 }, }; +#define INFORM_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_inform, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_inform *) 0)->field, \ + .field_name = "sa_inform:" #field + +static const struct ib_field inform_table[] = { + { INFORM_FIELD(gid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { INFORM_FIELD(lid_range_begin), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 16 }, + { 
INFORM_FIELD(lid_range_end), + .offset_words = 4, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 5, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(is_generic), + .offset_words = 5, + .offset_bits = 16, + .size_bits = 8 }, + { INFORM_FIELD(subscribe), + .offset_words = 5, + .offset_bits = 24, + .size_bits = 8 }, + { INFORM_FIELD(type), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.trap_num), + .offset_words = 6, + .offset_bits = 16, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.qpn), + .offset_words = 7, + .offset_bits = 0, + .size_bits = 24 }, + { RESERVED, + .offset_words = 7, + .offset_bits = 24, + .size_bits = 3 }, + { INFORM_FIELD(trap.generic.resp_time), + .offset_words = 7, + .offset_bits = 27, + .size_bits = 5 }, + { RESERVED, + .offset_words = 8, + .offset_bits = 0, + .size_bits = 8 }, + { INFORM_FIELD(trap.generic.producer_type), + .offset_words = 8, + .offset_bits = 8, + .size_bits = 24 }, +}; + +#define NOTICE_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_notice, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_notice *) 0)->field, \ + .field_name = "sa_notice:" #field + +static const struct ib_field notice_table[] = { + { NOTICE_FIELD(is_generic), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(type), + .offset_words = 0, + .offset_bits = 1, + .size_bits = 7 }, + { NOTICE_FIELD(trap.generic.producer_type), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 24 }, + { NOTICE_FIELD(trap.generic.trap_num), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { NOTICE_FIELD(issuer_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { NOTICE_FIELD(notice_toggle), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(notice_count), + .offset_words = 2, + .offset_bits = 1, + .size_bits = 15 }, + { NOTICE_FIELD(data_details), + .offset_words = 2, + .offset_bits = 
16, + .size_bits = 432 }, + { NOTICE_FIELD(issuer_gid), + .offset_words = 16, + .offset_bits = 0, + .size_bits = 128 }, +}; + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -929,6 +1041,153 @@ err1: return ret; } +static void ib_sa_inform_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_inform_query *query = + container_of(sa_query, struct ib_sa_inform_query, sa_query); + + if (mad) { + struct ib_sa_inform rec; + + ib_unpack(inform_table, ARRAY_SIZE(inform_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_inform_release(struct ib_sa_query *sa_query) +{ + kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query)); +} + +/** + * ib_sa_informinfo_query - Start an InformInfo registration. + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Inform record to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when notice handler registration completes, + * times out or is canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * This function sends inform info to register with SA to receive + * in-service notice. + * The callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_informinfo_query() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query.
+ */ +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_inform *rec, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_inform_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + struct ib_sa_mad *mad; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; + } + + ib_sa_client_get(client); + query->sa_query.client = client; + query->callback = callback; + query->context = context; + + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); + + query->sa_query.callback = callback ? 
ib_sa_inform_callback : NULL; + query->sa_query.release = ib_sa_inform_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_SET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_INFORM_INFO); + + ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); + if (ret < 0) + goto err2; + + return ret; + +err2: + *sa_query = NULL; + ib_sa_client_put(query->sa_query.client); + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kfree(query); + return ret; +} + +static void ib_sa_notice_resp(struct ib_sa_port *port, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_send_buf *mad_buf; + struct ib_sa_mad *mad; + int ret; + + mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0, + IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + GFP_KERNEL); + if (IS_ERR(mad_buf)) + return; + + mad = mad_buf->mad; + memcpy(mad, mad_recv_wc->recv_buf.mad, sizeof *mad); + mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP; + + spin_lock_irq(&port->ah_lock); + kref_get(&port->sm_ah->ref); + mad_buf->context[0] = &port->sm_ah->ref; + mad_buf->ah = port->sm_ah->ah; + spin_unlock_irq(&port->ah_lock); + + ret = ib_post_send_mad(mad_buf, NULL); + if (ret) + goto err; + + return; +err: + kref_put(mad_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_buf); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -982,9 +1241,36 @@ static void recv_handler(struct ib_mad_agent *mad_agent, ib_free_recv_mad(mad_recv_wc); } +static void notice_resp_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + kref_put(mad_send_wc->send_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_send_wc->send_buf); +} + +static void notice_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_notice notice; + + port = mad_agent->context; + mad = 
(struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice); + + if (!notice_dispatch(port->device, port->port_num, &notice)) + ib_sa_notice_resp(port, mad_recv_wc); + ib_free_recv_mad(mad_recv_wc); +} + static void ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; + struct ib_mad_reg_req reg_req = { + .mgmt_class = IB_MGMT_CLASS_SUBN_ADM, + .mgmt_class_version = 2 + }; int s, e, i; if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) @@ -1018,6 +1304,16 @@ static void ib_sa_add_one(struct ib_device *device) if (IS_ERR(sa_dev->port[i].agent)) goto err; + sa_dev->port[i].device = device; + set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask); + sa_dev->port[i].notice_agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + &reg_req, 0, notice_resp_handler, + notice_handler, &sa_dev->port[i]); + + if (IS_ERR(sa_dev->port[i].notice_agent)) + goto err; + INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah); } @@ -1040,8 +1336,14 @@ static void ib_sa_add_one(struct ib_device *device) return; err: - while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + while (--i >= 0) { + if (!IS_ERR(sa_dev->port[i].notice_agent)) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); + } + if (!IS_ERR(sa_dev->port[i].agent)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + } kfree(sa_dev); @@ -1061,6 +1363,7 @@ static void ib_sa_remove_one(struct ib_device *device) flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); } @@ -1089,7 +1392,15 @@ static int __init ib_sa_init(void) goto err2; } + ret = notice_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize notice handling\n"); + goto err3; + } + return 0; +err3: + mcast_cleanup(); err2: ib_unregister_client(&sa_client); err1: 
@@ -1099,6 +1410,7 @@ err1: static void __exit ib_sa_cleanup(void) { mcast_cleanup(); + notice_cleanup(); ib_unregister_client(&sa_client); idr_destroy(&query_idr); } diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 5e26b2f..83d8157 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -254,6 +254,127 @@ struct ib_sa_service_rec { u64 data64[2]; }; +enum { + IB_SA_EVENT_TYPE_FATAL = 0x0, + IB_SA_EVENT_TYPE_URGENT = 0x1, + IB_SA_EVENT_TYPE_SECURITY = 0x2, + IB_SA_EVENT_TYPE_SM = 0x3, + IB_SA_EVENT_TYPE_INFO = 0x4, + IB_SA_EVENT_TYPE_EMPTY = 0x7F, + IB_SA_EVENT_TYPE_ALL = 0xFFFF +}; + +enum { + IB_SA_EVENT_PRODUCER_TYPE_CA = 0x1, + IB_SA_EVENT_PRODUCER_TYPE_SWITCH = 0x2, + IB_SA_EVENT_PRODUCER_TYPE_ROUTER = 0x3, + IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER = 0x4, + IB_SA_EVENT_PRODUCER_TYPE_ALL = 0xFFFFFF +}; + +enum { + IB_SA_SM_TRAP_GID_IN_SERVICE = 64, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65, + IB_SA_SM_TRAP_CREATE_MC_GROUP = 66, + IB_SA_SM_TRAP_DELETE_MC_GROUP = 67, + IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128, + IB_SA_SM_TRAP_LINK_INTEGRITY = 129, + IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130, + IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131, + IB_SA_SM_TRAP_BAD_M_KEY = 256, + IB_SA_SM_TRAP_BAD_P_KEY = 257, + IB_SA_SM_TRAP_BAD_Q_KEY = 258, + IB_SA_SM_TRAP_SWITCH_BAD_P_KEY = 259, + IB_SA_SM_TRAP_ALL = 0xFFFF +}; + +struct ib_sa_inform { + union ib_gid gid; + __be16 lid_range_begin; + __be16 lid_range_end; + u8 is_generic; + u8 subscribe; + __be16 type; + union { + struct { + __be16 trap_num; + __be32 qpn; + u8 resp_time; + __be32 producer_type; + } generic; + struct { + __be16 device_id; + __be32 qpn; + u8 resp_time; + __be32 vendor_id; + } vendor; + } trap; +}; + +struct ib_sa_notice { + u8 is_generic; + u8 type; + union { + struct { + __be32 producer_type; + __be16 trap_num; + } generic; + struct { + __be32 vendor_id; + __be16 device_id; + } vendor; + } trap; + __be16 issuer_lid; + __be16 notice_count; + u8 notice_toggle; + /* + * Align data 
16 bits off 64 bit field to match InformInfo definition. + * Data contained within this field will then align properly. + * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1. + */ + u8 reserved[5]; + u8 data_details[54]; + union ib_gid issuer_gid; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_GID_IN_SERVICE = 64 + * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65 + * IB_SA_SM_TRAP_CREATE_MC_GROUP = 66 + * IB_SA_SM_TRAP_DELETE_MC_GROUP = 67 + */ +struct ib_sa_notice_data_gid { + u8 reserved[6]; + u8 gid[16]; + u8 padding[32]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128 + */ +struct ib_sa_notice_data_port_change { + __be16 lid; + u8 padding[52]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_LINK_INTEGRITY = 129 + * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130 + * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131 + */ +struct ib_sa_notice_data_port_error { + u8 reserved[2]; + __be16 lid; + u8 port_num; + u8 padding[49]; +}; + struct ib_sa_client { atomic_t users; struct completion comp; @@ -382,4 +503,54 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num, struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +struct ib_inform_info { + void *context; + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice); + u16 trap_number; +}; + +/** + * ib_sa_register_inform_info - Registers to receive notice events. + * @device: Device associated with the registration. + * @port_num: Port on the specified device to associate with the registration. + * @trap_number: InformInfo trap number to register for. + * @gfp_mask: GFP mask for memory allocations. + * @callback: User callback invoked once the registration completes and to + * report noticed events. + * @context: User specified context stored with the ib_inform_reg structure. + * + * This call initiates a registration request with the SA for the specified + * trap number. 
If the operation is started successfully, it returns + * an ib_inform_info structure that is used to track the registration operation. + * Users must free this structure by calling ib_sa_unregister_inform_info, + * even if the operation later fails. (The callback status is non-zero.) + * + * If the registration fails, the callback status will be non-zero. If the + * registration succeeds, the callback status will be zero, but the notice + * parameter will be NULL. If the notice parameter is not NULL, a trap or + * notice is being reported to the user. + * + * A status of -ENETRESET indicates that an error occurred which requires + * reregistration. + */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context); + +/** + * ib_sa_unregister_inform_info - Releases an InformInfo registration. + * @info: InformInfo registration tracking structure. + * + * This call blocks until the registration request is destroyed. It may + * not be called from within the registration callback. + */ +void ib_sa_unregister_inform_info(struct ib_inform_info *info); + #endif /* IB_SA_H */ From sean.hefty at intel.com Mon Jul 2 14:02:08 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 2 Jul 2007 14:02:08 -0700 Subject: [ofa-general] [PATCH 2/2] ib/sa: Add local SA path record caching In-Reply-To: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com> Message-ID: <000701c7bcec$40c933a0$3c98070a@amr.corp.intel.com> Query and store path records locally to decrease path record query time and offload SA flooding during the start-up of large clustered jobs. Signed-off-by: Sean Hefty --- Now, this version is a thing of beauty. 
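For readers skimming the archive, the caching idea in the description above (keep path records in a local table keyed by destination GID and answer lookups from that table rather than from the SA) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not code from the patch: every toy_* name, the fixed-size arrays and the MAX_DESTS/MAX_PATHS limits are hypothetical, and picking the least-used record is just one plausible policy when several paths to a destination are cached.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN   16
#define MAX_DESTS 8	/* illustrative limit, not taken from the patch */
#define MAX_PATHS 4	/* illustrative limit, not taken from the patch */

/* Hypothetical, heavily simplified stand-in for a cached path record. */
struct toy_path {
	uint16_t dlid;		/* destination LID */
	unsigned long lookups;	/* how many times this path was handed out */
};

/* All cached paths to one destination GID. */
struct toy_dest {
	uint8_t gid[GID_LEN];
	struct toy_path paths[MAX_PATHS];
	int path_count;
};

struct toy_cache {
	struct toy_dest dests[MAX_DESTS];
	int dest_count;
};

/* Linear search by GID; a real cache would use a tree or hash. */
static struct toy_dest *toy_find(struct toy_cache *c, const uint8_t *gid)
{
	for (int i = 0; i < c->dest_count; i++)
		if (!memcmp(c->dests[i].gid, gid, GID_LEN))
			return &c->dests[i];
	return NULL;
}

/* Store one path for a destination; returns 0 on success, -1 if full. */
static int toy_insert(struct toy_cache *c, const uint8_t *gid, uint16_t dlid)
{
	struct toy_dest *d = toy_find(c, gid);

	if (!d) {
		if (c->dest_count == MAX_DESTS)
			return -1;
		d = &c->dests[c->dest_count++];
		memcpy(d->gid, gid, GID_LEN);
		d->path_count = 0;
	}
	if (d->path_count == MAX_PATHS)
		return -1;
	d->paths[d->path_count].dlid = dlid;
	d->paths[d->path_count].lookups = 0;
	d->path_count++;
	return 0;
}

/* Return the least-used cached path, or NULL on a cache miss. */
static struct toy_path *toy_lookup(struct toy_cache *c, const uint8_t *gid)
{
	struct toy_dest *d = toy_find(c, gid);
	struct toy_path *best;

	if (!d || !d->path_count)
		return NULL;
	best = &d->paths[0];
	for (int i = 1; i < d->path_count; i++)
		if (d->paths[i].lookups < best->lookups)
			best = &d->paths[i];
	best->lookups++;
	return best;
}
```

A caller would treat a NULL return from toy_lookup() as a miss and fall back to a normal SA query. With two paths stored for one GID, repeated lookups alternate between them because each hit bumps that path's counter.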
drivers/infiniband/core/Makefile | 2 drivers/infiniband/core/local_sa.c | 1275 +++++++++++++++++++++++++++++++++++ drivers/infiniband/core/multicast.c | 50 - drivers/infiniband/core/sa.h | 23 + drivers/infiniband/core/sa_query.c | 107 ++- include/rdma/ib_local_sa.h | 83 ++ include/rdma/ib_sa.h | 3 7 files changed, 1467 insertions(+), 76 deletions(-) diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 7c5b5ed..f646040 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -13,7 +13,7 @@ ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o notice.o +ib_sa-y := sa_query.o multicast.o notice.o local_sa.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c new file mode 100644 index 0000000..6c073a3 --- /dev/null +++ b/drivers/infiniband/core/local_sa.c @@ -0,0 +1,1275 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand subnet administration caching"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + SA_DB_MAX_PATHS_PER_DEST = 0x7F, + SA_DB_MIN_RETRY_TIMER = 4000, /* 4 sec */ + SA_DB_MAX_RETRY_TIMER = 256000 /* 256 sec */ +}; + +static int set_paths_per_dest(const char *val, struct kernel_param *kp); +static unsigned long paths_per_dest = 0; +module_param_call(paths_per_dest, set_paths_per_dest, param_get_ulong, + &paths_per_dest, 0644); +MODULE_PARM_DESC(paths_per_dest, "Maximum number of paths to retrieve " + "to each destination (DGID). 
Set to 0 " + "to disable cache."); + +static int set_subscribe_inform_info(const char *val, struct kernel_param *kp); +static char subscribe_inform_info = 1; +module_param_call(subscribe_inform_info, set_subscribe_inform_info, + param_get_bool, &subscribe_inform_info, 0644); +MODULE_PARM_DESC(subscribe_inform_info, + "Subscribe for SA InformInfo/Notice events."); + +static int do_refresh(const char *val, struct kernel_param *kp); +module_param_call(refresh, do_refresh, NULL, NULL, 0200); + +static unsigned long retry_timer = SA_DB_MIN_RETRY_TIMER; + +enum sa_db_lookup_method { + SA_DB_LOOKUP_LEAST_USED, + SA_DB_LOOKUP_RANDOM +}; + +static int set_lookup_method(const char *val, struct kernel_param *kp); +static int get_lookup_method(char *buf, struct kernel_param *kp); +static unsigned long lookup_method; +module_param_call(lookup_method, set_lookup_method, get_lookup_method, + &lookup_method, 0644); +MODULE_PARM_DESC(lookup_method, "Method used to return path records when " + "multiple paths exist to a given destination."); + +static void sa_db_add_dev(struct ib_device *device); +static void sa_db_remove_dev(struct ib_device *device); + +static struct ib_client sa_db_client = { + .name = "local_sa", + .add = sa_db_add_dev, + .remove = sa_db_remove_dev +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(lock); +static rwlock_t rwlock; +static struct workqueue_struct *sa_wq; +static struct ib_sa_client sa_client; + +enum sa_db_state { + SA_DB_IDLE, + SA_DB_REFRESH, + SA_DB_DESTROY +}; + +struct sa_db_port { + struct sa_db_device *dev; + struct ib_mad_agent *agent; + /* Limit number of outstanding MADs to SA to reduce SA flooding */ + struct ib_mad_send_buf *msg; + u16 sm_lid; + u8 sm_sl; + struct ib_inform_info *in_info; + struct ib_inform_info *out_info; + struct rb_root paths; + struct list_head update_list; + unsigned long update_id; + enum sa_db_state state; + struct work_struct work; + union ib_gid gid; + int port_num; +}; + +struct sa_db_device { + struct 
list_head list; + struct ib_device *device; + struct ib_event_handler event_handler; + int start_port; + int port_count; + struct sa_db_port port[0]; +}; + +struct ib_sa_iterator { + struct ib_sa_iterator *next; +}; + +struct ib_sa_attr_iter { + struct ib_sa_iterator *iter; + unsigned long flags; +}; + +struct ib_sa_attr_list { + struct ib_sa_iterator iter; + struct ib_sa_iterator *tail; + unsigned long update_id; + union ib_gid gid; + struct rb_node node; +}; + +struct ib_path_rec_info { + struct ib_sa_iterator iter; /* keep first */ + struct ib_sa_path_rec rec; + unsigned long lookups; +}; + +struct ib_sa_mad_iter { + struct ib_mad_recv_wc *recv_wc; + struct ib_mad_recv_buf *recv_buf; + int attr_size; + int attr_offset; + int data_offset; + int data_left; + void *attr; + u8 attr_data[0]; +}; + +enum sa_update_type { + SA_UPDATE_FULL, + SA_UPDATE_ADD, + SA_UPDATE_REMOVE +}; + +struct update_info { + struct list_head list; + union ib_gid gid; + enum sa_update_type type; +}; + +struct sa_path_request { + struct work_struct work; + struct ib_sa_client *client; + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_path_rec path_rec; +}; + +static void process_updates(struct sa_db_port *port); + +static void free_attr_list(struct ib_sa_attr_list *attr_list) +{ + struct ib_sa_iterator *cur; + + for (cur = attr_list->iter.next; cur; cur = attr_list->iter.next) { + attr_list->iter.next = cur->next; + kfree(cur); + } + attr_list->tail = &attr_list->iter; +} + +static void remove_attr(struct rb_root *root, struct ib_sa_attr_list *attr_list) +{ + rb_erase(&attr_list->node, root); + free_attr_list(attr_list); + kfree(attr_list); +} + +static void remove_all_attrs(struct rb_root *root) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + write_lock_irq(&rwlock); + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + 
remove_attr(root, attr_list); + } + write_unlock_irq(&rwlock); +} + +static void remove_old_attrs(struct rb_root *root, unsigned long update_id) +{ + struct rb_node *node, *next_node; + struct ib_sa_attr_list *attr_list; + + write_lock_irq(&rwlock); + for (node = rb_first(root); node; node = next_node) { + next_node = rb_next(node); + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + if (attr_list->update_id != update_id) + remove_attr(root, attr_list); + } + write_unlock_irq(&rwlock); +} + +static struct ib_sa_attr_list *insert_attr_list(struct rb_root *root, + struct ib_sa_attr_list *attr_list) +{ + struct rb_node **link = &root->rb_node; + struct rb_node *parent = NULL; + struct ib_sa_attr_list *cur_attr_list; + int cmp; + + while (*link) { + parent = *link; + cur_attr_list = rb_entry(parent, struct ib_sa_attr_list, node); + cmp = memcmp(&cur_attr_list->gid, &attr_list->gid, + sizeof attr_list->gid); + if (cmp < 0) + link = &(*link)->rb_left; + else if (cmp > 0) + link = &(*link)->rb_right; + else + return cur_attr_list; + } + rb_link_node(&attr_list->node, parent, link); + rb_insert_color(&attr_list->node, root); + return NULL; +} + +static struct ib_sa_attr_list *find_attr_list(struct rb_root *root, u8 *gid) +{ + struct rb_node *node = root->rb_node; + struct ib_sa_attr_list *attr_list; + int cmp; + + while (node) { + attr_list = rb_entry(node, struct ib_sa_attr_list, node); + cmp = memcmp(&attr_list->gid, gid, sizeof attr_list->gid); + if (cmp < 0) + node = node->rb_left; + else if (cmp > 0) + node = node->rb_right; + else + return attr_list; + } + return NULL; +} + +static int insert_attr(struct rb_root *root, unsigned long update_id, void *key, + struct ib_sa_iterator *iter) +{ + struct ib_sa_attr_list *attr_list; + void *err; + + write_lock_irq(&rwlock); + attr_list = find_attr_list(root, key); + if (!attr_list) { + write_unlock_irq(&rwlock); + attr_list = kmalloc(sizeof *attr_list, GFP_KERNEL); + if (!attr_list) + return -ENOMEM; + + 
attr_list->iter.next = NULL; + attr_list->tail = &attr_list->iter; + attr_list->update_id = update_id; + memcpy(attr_list->gid.raw, key, sizeof attr_list->gid); + + write_lock_irq(&rwlock); + err = insert_attr_list(root, attr_list); + if (err) { + write_unlock_irq(&rwlock); + kfree(attr_list); + return PTR_ERR(err); + } + } else if (attr_list->update_id != update_id) { + free_attr_list(attr_list); + attr_list->update_id = update_id; + } + + attr_list->tail->next = iter; + iter->next = NULL; + attr_list->tail = iter; + write_unlock_irq(&rwlock); + return 0; +} + +static struct ib_sa_mad_iter *ib_sa_iter_create(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_mad_iter *iter; + struct ib_sa_mad *mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + int attr_size, attr_offset; + + attr_offset = be16_to_cpu(mad->sa_hdr.attr_offset) * 8; + attr_size = 64; /* path record length */ + if (attr_offset < attr_size) + return ERR_PTR(-EINVAL); + + iter = kzalloc(sizeof *iter + attr_size, GFP_KERNEL); + if (!iter) + return ERR_PTR(-ENOMEM); + + iter->data_left = mad_recv_wc->mad_len - IB_MGMT_SA_HDR; + iter->recv_wc = mad_recv_wc; + iter->recv_buf = &mad_recv_wc->recv_buf; + iter->attr_offset = attr_offset; + iter->attr_size = attr_size; + return iter; +} + +static void ib_sa_iter_free(struct ib_sa_mad_iter *iter) +{ + kfree(iter); +} + +static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter) +{ + struct ib_sa_mad *mad; + int left, offset = 0; + + while (iter->data_left >= iter->attr_offset) { + while (iter->data_offset < IB_MGMT_SA_DATA) { + mad = (struct ib_sa_mad *) iter->recv_buf->mad; + + left = IB_MGMT_SA_DATA - iter->data_offset; + if (left < iter->attr_size) { + /* copy first piece of the attribute */ + iter->attr = &iter->attr_data; + memcpy(iter->attr, + &mad->data[iter->data_offset], left); + offset = left; + break; + } else if (offset) { + /* copy the second piece of the attribute */ + memcpy(iter->attr + offset, &mad->data[0], + iter->attr_size - offset); + 
iter->data_offset = iter->attr_size - offset; + offset = 0; + } else { + iter->attr = &mad->data[iter->data_offset]; + iter->data_offset += iter->attr_size; + } + + iter->data_left -= iter->attr_offset; + goto out; + } + iter->data_offset = 0; + iter->recv_buf = list_entry(iter->recv_buf->list.next, + struct ib_mad_recv_buf, list); + } + iter->attr = NULL; +out: + return iter->attr; +} + +/* + * Copy path records from a received response and insert them into our cache. + * A path record in the MADs are in network order, packed, and may + * span multiple MAD buffers, just to make our life hard. + */ +static void update_path_db(struct sa_db_port *port, + struct ib_mad_recv_wc *mad_recv_wc, + enum sa_update_type type) +{ + struct ib_sa_mad_iter *iter; + struct ib_path_rec_info *path_info; + void *attr; + int ret; + + iter = ib_sa_iter_create(mad_recv_wc); + if (IS_ERR(iter)) + return; + + port->update_id += (type == SA_UPDATE_FULL); + + while ((attr = ib_sa_iter_next(iter)) && + (path_info = kmalloc(sizeof *path_info, GFP_KERNEL))) { + + ib_sa_unpack_attr(&path_info->rec, attr, IB_SA_ATTR_PATH_REC); + + ret = insert_attr(&port->paths, port->update_id, + path_info->rec.dgid.raw, &path_info->iter); + if (ret) { + kfree(path_info); + break; + } + } + ib_sa_iter_free(iter); + + if (type == SA_UPDATE_FULL) + remove_old_attrs(&port->paths, port->update_id); +} + +static struct ib_mad_send_buf *get_sa_msg(struct sa_db_port *port, + struct update_info *update) +{ + struct ib_ah_attr ah_attr; + struct ib_mad_send_buf *msg; + + msg = ib_create_send_mad(port->agent, 1, 0, 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, GFP_KERNEL); + if (IS_ERR(msg)) + return NULL; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port->sm_lid; + ah_attr.sl = port->sm_sl; + ah_attr.port_num = port->port_num; + + msg->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(msg->ah)) { + ib_free_send_mad(msg); + return NULL; + } + + msg->timeout_ms = retry_timer; + msg->retries = 0; + 
msg->context[0] = port; + msg->context[1] = update; + return msg; +} + +static __be64 form_tid(u32 hi_tid) +{ + static atomic_t tid; + return cpu_to_be64((((u64) hi_tid) << 32) | + ((u32) atomic_inc_return(&tid))); +} + +static void format_path_req(struct sa_db_port *port, + struct update_info *update, + struct ib_mad_send_buf *msg) +{ + struct ib_sa_mad *mad = msg->mad; + struct ib_sa_path_rec path_rec; + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + mad->mad_hdr.method = IB_SA_METHOD_GET_TABLE; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->mad_hdr.tid = form_tid(msg->mad_agent->hi_tid); + + mad->sa_hdr.comp_mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH; + + memset(&path_rec, 0, sizeof path_rec); + path_rec.sgid = port->gid; + path_rec.numb_path = (u8) paths_per_dest; + + if (update->type == SA_UPDATE_ADD) { + mad->sa_hdr.comp_mask |= IB_SA_PATH_REC_DGID; + memcpy(&path_rec.dgid, &update->gid, sizeof path_rec.dgid); + } + + ib_sa_pack_attr(mad->data, &path_rec, IB_SA_ATTR_PATH_REC); +} + +static int send_query(struct sa_db_port *port, + struct update_info *update) +{ + int ret; + + port->msg = get_sa_msg(port, update); + if (!port->msg) + return -ENOMEM; + + format_path_req(port, update, port->msg); + + ret = ib_post_send_mad(port->msg, NULL); + if (ret) + goto err; + + return 0; + +err: + ib_destroy_ah(port->msg->ah); + ib_free_send_mad(port->msg); + return ret; +} + +static void add_update(struct sa_db_port *port, u8 *gid, + enum sa_update_type type) +{ + struct update_info *update; + + update = kmalloc(sizeof *update, GFP_KERNEL); + if (update) { + if (gid) + memcpy(&update->gid, gid, sizeof update->gid); + update->type = type; + list_add(&update->list, &port->update_list); + } + + if (port->state == SA_DB_IDLE) { + port->state = SA_DB_REFRESH; + process_updates(port); + } +} + +static void clean_update_list(struct 
sa_db_port *port) +{ + struct update_info *update; + + while (!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + list_del(&update->list); + kfree(update); + } +} + +static int notice_handler(int status, struct ib_inform_info *info, + struct ib_sa_notice *notice) +{ + struct sa_db_port *port = info->context; + struct ib_sa_notice_data_gid *gid_data; + struct ib_inform_info **pinfo; + enum sa_update_type type; + + if (info->trap_number == IB_SA_SM_TRAP_GID_IN_SERVICE) { + pinfo = &port->in_info; + type = SA_UPDATE_ADD; + } else { + pinfo = &port->out_info; + type = SA_UPDATE_REMOVE; + } + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY || !*pinfo) { + mutex_unlock(&lock); + return 0; + } + + if (notice) { + gid_data = (struct ib_sa_notice_data_gid *) + &notice->data_details; + add_update(port, gid_data->gid, type); + mutex_unlock(&lock); + } else if (status == -ENETRESET) { + *pinfo = NULL; + mutex_unlock(&lock); + } else { + if (status) + *pinfo = ERR_PTR(-EINVAL); + port->state = SA_DB_IDLE; + clean_update_list(port); + mutex_unlock(&lock); + queue_work(sa_wq, &port->work); + } + + return status; +} + +static int reg_in_info(struct sa_db_port *port) +{ + int ret = 0; + + port->in_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_IN_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->in_info)) + ret = PTR_ERR(port->in_info); + + return ret; +} + +static int reg_out_info(struct sa_db_port *port) +{ + int ret = 0; + + port->out_info = ib_sa_register_inform_info(&sa_client, + port->dev->device, + port->port_num, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE, + GFP_KERNEL, notice_handler, + port); + if (IS_ERR(port->out_info)) + ret = PTR_ERR(port->out_info); + + return ret; +} + +static void unsubscribe_port(struct sa_db_port *port) +{ + if (port->in_info && !IS_ERR(port->in_info)) + ib_sa_unregister_inform_info(port->in_info); + + if 
(port->out_info && !IS_ERR(port->out_info)) + ib_sa_unregister_inform_info(port->out_info); + + port->out_info = NULL; + port->in_info = NULL; + +} + +static void cleanup_port(struct sa_db_port *port) +{ + unsubscribe_port(port); + + clean_update_list(port); + remove_all_attrs(&port->paths); +} + +static int update_port_info(struct sa_db_port *port) +{ + struct ib_port_attr port_attr; + int ret; + + ret = ib_query_port(port->dev->device, port->port_num, &port_attr); + if (ret) + return ret; + + if (port_attr.state != IB_PORT_ACTIVE) + return -ENODATA; + + ret = ib_get_cached_gid(port->dev->device, port->port_num, + 0, &port->gid); + if (ret) + return ret; + + port->sm_lid = port_attr.sm_lid; + port->sm_sl = port_attr.sm_sl; + return 0; +} + +static void process_updates(struct sa_db_port *port) +{ + struct update_info *update; + struct ib_sa_attr_list *attr_list; + int ret; + + if (!paths_per_dest || update_port_info(port)) { + cleanup_port(port); + goto out; + } + + /* Event registration is an optimization, so ignore failures. 
*/ + if (subscribe_inform_info) { + if (!port->out_info) { + ret = reg_out_info(port); + if (!ret) + return; + } + + if (!port->in_info) { + ret = reg_in_info(port); + if (!ret) + return; + } + } else + unsubscribe_port(port); + + while (!list_empty(&port->update_list)) { + update = list_entry(port->update_list.next, + struct update_info, list); + + if (update->type == SA_UPDATE_REMOVE) { + write_lock_irq(&rwlock); + attr_list = find_attr_list(&port->paths, + update->gid.raw); + if (attr_list) + remove_attr(&port->paths, attr_list); + write_unlock_irq(&rwlock); + } else { + ret = send_query(port, update); + if (!ret) + return; + + } + list_del(&update->list); + kfree(update); + } +out: + port->state = SA_DB_IDLE; +} + +static void refresh_port_db(struct sa_db_port *port) +{ + if (port->state == SA_DB_DESTROY) + return; + + if (port->state == SA_DB_REFRESH) { + clean_update_list(port); + ib_cancel_mad(port->agent, port->msg); + } + + add_update(port, NULL, SA_UPDATE_FULL); +} + +static void refresh_dev_db(struct sa_db_device *dev) +{ + int i; + + for (i = 0; i < dev->port_count; i++) + refresh_port_db(&dev->port[i]); +} + +static void refresh_db(void) +{ + struct sa_db_device *dev; + + list_for_each_entry(dev, &dev_list, list) + refresh_dev_db(dev); +} + +static int do_refresh(const char *val, struct kernel_param *kp) +{ + mutex_lock(&lock); + refresh_db(); + mutex_unlock(&lock); + return 0; +} + +static int get_lookup_method(char *buf, struct kernel_param *kp) +{ + return sprintf(buf, + "%c %d round robin\n" + "%c %d random", + (lookup_method == SA_DB_LOOKUP_LEAST_USED) ? '*' : ' ', + SA_DB_LOOKUP_LEAST_USED, + (lookup_method == SA_DB_LOOKUP_RANDOM) ? 
'*' : ' ', + SA_DB_LOOKUP_RANDOM); +} + +static int set_lookup_method(const char *val, struct kernel_param *kp) +{ + unsigned long method; + int ret = 0; + + method = simple_strtoul(val, NULL, 0); + + switch (method) { + case SA_DB_LOOKUP_LEAST_USED: + case SA_DB_LOOKUP_RANDOM: + lookup_method = method; + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static int set_paths_per_dest(const char *val, struct kernel_param *kp) +{ + int ret; + + mutex_lock(&lock); + ret = param_set_ulong(val, kp); + if (ret) + goto out; + + if (paths_per_dest > SA_DB_MAX_PATHS_PER_DEST) + paths_per_dest = SA_DB_MAX_PATHS_PER_DEST; + refresh_db(); +out: + mutex_unlock(&lock); + return ret; +} + +static int set_subscribe_inform_info(const char *val, struct kernel_param *kp) +{ + int ret; + + ret = param_set_bool(val, kp); + if (ret) + return ret; + + return do_refresh(val, kp); +} + +static void port_work_handler(struct work_struct *work) +{ + struct sa_db_port *port; + + port = container_of(work, typeof(*port), work); + mutex_lock(&lock); + refresh_port_db(port); + mutex_unlock(&lock); +} + +static void handle_event(struct ib_event_handler *event_handler, + struct ib_event *event) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + + dev = container_of(event_handler, typeof(*dev), event_handler); + port = &dev->port[event->element.port_num - dev->start_port]; + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + case IB_EVENT_PKEY_CHANGE: + case IB_EVENT_PORT_ACTIVE: + queue_work(sa_wq, &port->work); + break; + default: + break; + } +} + +static void ib_free_path_iter(struct ib_sa_attr_iter *iter) +{ + read_unlock_irqrestore(&rwlock, iter->flags); +} + +static int ib_create_path_iter(struct ib_device *device, u8 port_num, + union ib_gid *dgid, struct ib_sa_attr_iter *iter) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + struct ib_sa_attr_list *list; + + 
dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return -ENODEV; + + port = &dev->port[port_num - dev->start_port]; + + read_lock_irqsave(&rwlock, iter->flags); + list = find_attr_list(&port->paths, dgid->raw); + if (!list) { + ib_free_path_iter(iter); + return -ENODATA; + } + + iter->iter = &list->iter; + return 0; +} + +static struct ib_sa_path_rec *ib_get_next_path(struct ib_sa_attr_iter *iter) +{ + struct ib_path_rec_info *next_path; + + iter->iter = iter->iter->next; + if (iter->iter) { + next_path = container_of(iter->iter, struct ib_path_rec_info, iter); + return &next_path->rec; + } else + return NULL; +} + +static int cmp_rec(struct ib_sa_path_rec *src, + struct ib_sa_path_rec *dst, ib_sa_comp_mask comp_mask) +{ + /* DGID check already done */ + if (comp_mask & IB_SA_PATH_REC_SGID && + memcmp(&src->sgid, &dst->sgid, sizeof src->sgid)) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_DLID && src->dlid != dst->dlid) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_SLID && src->slid != dst->slid) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_RAW_TRAFFIC && + src->raw_traffic != dst->raw_traffic) + return -EINVAL; + + if (comp_mask & IB_SA_PATH_REC_FLOW_LABEL && + src->flow_label != dst->flow_label) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_HOP_LIMIT && + src->hop_limit != dst->hop_limit) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS && + src->traffic_class != dst->traffic_class) + return -EINVAL; + if (comp_mask & IB_SA_PATH_REC_REVERSIBLE && + dst->reversible && !src->reversible) + return -EINVAL; + /* Numb path check already done */ + if (comp_mask & IB_SA_PATH_REC_PKEY && src->pkey != dst->pkey) + return -EINVAL; + + if (comp_mask & IB_SA_PATH_REC_SL && src->sl != dst->sl) + return -EINVAL; + + if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_MTU_SELECTOR, + IB_SA_PATH_REC_MTU, dst->mtu_selector, + src->mtu, dst->mtu)) + return -EINVAL; + if (ib_sa_check_selector(comp_mask, IB_SA_PATH_REC_RATE_SELECTOR, 
+ IB_SA_PATH_REC_RATE, dst->rate_selector, + src->rate, dst->rate)) + return -EINVAL; + if (ib_sa_check_selector(comp_mask, + IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR, + IB_SA_PATH_REC_PACKET_LIFE_TIME, + dst->packet_life_time_selector, + src->packet_life_time, dst->packet_life_time)) + return -EINVAL; + + return 0; +} + +static struct ib_sa_path_rec *get_random_path(struct ib_sa_attr_iter *iter, + struct ib_sa_path_rec *req_path, + ib_sa_comp_mask comp_mask) +{ + struct ib_sa_path_rec *path, *rand_path = NULL; + int num, count = 0; + + for (path = ib_get_next_path(iter); path; + path = ib_get_next_path(iter)) { + if (!cmp_rec(path, req_path, comp_mask)) { + get_random_bytes(&num, sizeof num); + if ((num % ++count) == 0) + rand_path = path; + } + } + + return rand_path; +} + +static struct ib_sa_path_rec *get_next_path(struct ib_sa_attr_iter *iter, + struct ib_sa_path_rec *req_path, + ib_sa_comp_mask comp_mask) +{ + struct ib_path_rec_info *cur_path, *next_path = NULL; + struct ib_sa_path_rec *path; + unsigned long lookups = ~0; + + for (path = ib_get_next_path(iter); path; + path = ib_get_next_path(iter)) { + if (!cmp_rec(path, req_path, comp_mask)) { + + cur_path = container_of(iter->iter, struct ib_path_rec_info, + iter); + if (cur_path->lookups < lookups) { + lookups = cur_path->lookups; + next_path = cur_path; + } + } + } + + if (next_path) { + next_path->lookups++; + return &next_path->rec; + } else + return NULL; +} + +static void report_path(struct work_struct *work) +{ + struct sa_path_request *req; + + req = container_of(work, struct sa_path_request, work); + req->callback(0, &req->path_rec, req->context); + ib_sa_client_put(req->client); + kfree(req); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP 
mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_path_rec_get(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct sa_path_request *req; + struct ib_sa_attr_iter iter; + struct ib_sa_path_rec *path_rec; + int ret; + + if (!paths_per_dest) + goto query_sa; + + if (!(comp_mask & IB_SA_PATH_REC_DGID) || + !(comp_mask & IB_SA_PATH_REC_NUMB_PATH) || rec->numb_path != 1) + goto query_sa; + + req = kmalloc(sizeof *req, gfp_mask); + if (!req) + goto query_sa; + + ret = ib_create_path_iter(device, port_num, &rec->dgid, &iter); + if (ret) + goto free_req; + + if (lookup_method == SA_DB_LOOKUP_RANDOM) + path_rec = get_random_path(&iter, rec, comp_mask); + else + path_rec = get_next_path(&iter, rec, comp_mask); + + if (!path_rec) + goto free_iter; + + memcpy(&req->path_rec, path_rec, sizeof *path_rec); + ib_free_path_iter(&iter); + + INIT_WORK(&req->work, report_path); + req->client = client; + req->callback = callback; + req->context = context; + + ib_sa_client_get(client); + queue_work(sa_wq, &req->work); + 
*sa_query = ERR_PTR(-EEXIST); + return 0; + +free_iter: + ib_free_path_iter(&iter); +free_req: + kfree(req); +query_sa: + return ib_sa_path_rec_query(client, device, port_num, rec, comp_mask, + timeout_ms, gfp_mask, callback, context, + sa_query); +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct sa_db_port *port; + struct update_info *update; + struct ib_mad_send_buf *msg; + enum sa_update_type type; + + msg = (struct ib_mad_send_buf *) (unsigned long) mad_recv_wc->wc->wr_id; + port = msg->context[0]; + update = msg->context[1]; + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY || + update != list_entry(port->update_list.next, + struct update_info, list)) { + mutex_unlock(&lock); + } else { + type = update->type; + mutex_unlock(&lock); + update_path_db(mad_agent->context, mad_recv_wc, type); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_send_buf *msg; + struct sa_db_port *port; + struct update_info *update; + int ret; + + msg = mad_send_wc->send_buf; + port = msg->context[0]; + update = msg->context[1]; + + mutex_lock(&lock); + if (port->state == SA_DB_DESTROY) + goto unlock; + + if (update == list_entry(port->update_list.next, + struct update_info, list)) { + + if (mad_send_wc->status == IB_WC_RESP_TIMEOUT_ERR && + msg->timeout_ms < SA_DB_MAX_RETRY_TIMER) { + + msg->timeout_ms <<= 1; + ret = ib_post_send_mad(msg, NULL); + if (!ret) { + mutex_unlock(&lock); + return; + } + } + list_del(&update->list); + kfree(update); + } + process_updates(port); +unlock: + mutex_unlock(&lock); + + ib_destroy_ah(msg->ah); + ib_free_send_mad(msg); +} + +static int init_port(struct sa_db_device *dev, int port_num) +{ + struct sa_db_port *port; + int ret = 0; + + port = &dev->port[port_num - dev->start_port]; + port->dev = dev; + port->port_num = port_num; + 
INIT_WORK(&port->work, port_work_handler); + port->paths = RB_ROOT; + INIT_LIST_HEAD(&port->update_list); + + port->agent = ib_register_mad_agent(dev->device, port_num, IB_QPT_GSI, + NULL, IB_MGMT_RMPP_VERSION, + send_handler, recv_handler, port); + if (IS_ERR(port->agent)) + ret = PTR_ERR(port->agent); + + return ret; +} + +static void destroy_port(struct sa_db_port *port) +{ + mutex_lock(&lock); + port->state = SA_DB_DESTROY; + mutex_unlock(&lock); + + ib_unregister_mad_agent(port->agent); + cleanup_port(port); + flush_workqueue(sa_wq); +} + +static void sa_db_add_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + struct sa_db_port *port; + int s, e, i, ret; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + dev = kzalloc(sizeof *dev + (e - s + 1) * sizeof *port, GFP_KERNEL); + if (!dev) + return; + + dev->start_port = s; + dev->port_count = e - s + 1; + dev->device = device; + for (i = 0; i < dev->port_count; i++) { + ret = init_port(dev, s + i); + if (ret) + goto err; + } + + ib_set_client_data(device, &sa_db_client, dev); + + INIT_IB_EVENT_HANDLER(&dev->event_handler, device, handle_event); + + mutex_lock(&lock); + list_add_tail(&dev->list, &dev_list); + refresh_dev_db(dev); + mutex_unlock(&lock); + + ib_register_event_handler(&dev->event_handler); + return; +err: + while (i--) + destroy_port(&dev->port[i]); + kfree(dev); +} + +static void sa_db_remove_dev(struct ib_device *device) +{ + struct sa_db_device *dev; + int i; + + dev = ib_get_client_data(device, &sa_db_client); + if (!dev) + return; + + ib_unregister_event_handler(&dev->event_handler); + flush_workqueue(sa_wq); + + for (i = 0; i < dev->port_count; i++) + destroy_port(&dev->port[i]); + + mutex_lock(&lock); + list_del(&dev->list); + mutex_unlock(&lock); + + kfree(dev); +} + +int sa_db_init(void) +{ + int ret; + + rwlock_init(&rwlock); + 
sa_wq = create_singlethread_workqueue("local_sa"); + if (!sa_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + ret = ib_register_client(&sa_db_client); + if (ret) + goto err; + + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); + return ret; +} + +void sa_db_cleanup(void) +{ + ib_unregister_client(&sa_db_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(sa_wq); +} diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index 1e13ab4..f49eb75 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -238,34 +238,6 @@ static u8 get_leave_state(struct mcast_group *group) return leave_state & group->rec.join_state; } -static int check_selector(ib_sa_comp_mask comp_mask, - ib_sa_comp_mask selector_mask, - ib_sa_comp_mask value_mask, - u8 selector, u8 src_value, u8 dst_value) -{ - int err; - - if (!(comp_mask & selector_mask) || !(comp_mask & value_mask)) - return 0; - - switch (selector) { - case IB_SA_GT: - err = (src_value <= dst_value); - break; - case IB_SA_LT: - err = (src_value >= dst_value); - break; - case IB_SA_EQ: - err = (src_value != dst_value); - break; - default: - err = 0; - break; - } - - return err; -} - static int cmp_rec(struct ib_sa_mcmember_rec *src, struct ib_sa_mcmember_rec *dst, ib_sa_comp_mask comp_mask) { @@ -278,24 +250,24 @@ static int cmp_rec(struct ib_sa_mcmember_rec *src, return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_MLID && src->mlid != dst->mlid) return -EINVAL; - if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR, - IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector, - src->mtu, dst->mtu)) + if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_MTU_SELECTOR, + IB_SA_MCMEMBER_REC_MTU, dst->mtu_selector, + src->mtu, dst->mtu)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_TRAFFIC_CLASS && src->traffic_class != dst->traffic_class) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_PKEY && 
src->pkey != dst->pkey) return -EINVAL; - if (check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR, - IB_SA_MCMEMBER_REC_RATE, dst->rate_selector, - src->rate, dst->rate)) + if (ib_sa_check_selector(comp_mask, IB_SA_MCMEMBER_REC_RATE_SELECTOR, + IB_SA_MCMEMBER_REC_RATE, dst->rate_selector, + src->rate, dst->rate)) return -EINVAL; - if (check_selector(comp_mask, - IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR, - IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME, - dst->packet_life_time_selector, - src->packet_life_time, dst->packet_life_time)) + if (ib_sa_check_selector(comp_mask, + IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR, + IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME, + dst->packet_life_time_selector, + src->packet_life_time, dst->packet_life_time)) return -EINVAL; if (comp_mask & IB_SA_MCMEMBER_REC_SL && src->sl != dst->sl) return -EINVAL; diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index b8eac66..0f19dde 100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -48,6 +48,29 @@ static inline void ib_sa_client_put(struct ib_sa_client *client) complete(&client->comp); } +int ib_sa_check_selector(ib_sa_comp_mask comp_mask, + ib_sa_comp_mask selector_mask, + ib_sa_comp_mask value_mask, + u8 selector, u8 src_value, u8 dst_value); + +int ib_sa_pack_attr(void *dst, void *src, int attr_id); + +int ib_sa_unpack_attr(void *dst, void *src, int attr_id); + +int ib_sa_path_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int sa_db_init(void); +void sa_db_cleanup(void); + int ib_sa_mcmember_rec_query(struct ib_sa_client *client, struct ib_device *device, u8 port_num, u8 method, diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 23d1081..3634486 
100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -465,6 +465,58 @@ static const struct ib_field notice_table[] = { .size_bits = 128 }, }; +int ib_sa_check_selector(ib_sa_comp_mask comp_mask, + ib_sa_comp_mask selector_mask, + ib_sa_comp_mask value_mask, + u8 selector, u8 src_value, u8 dst_value) +{ + int err; + + if (!(comp_mask & selector_mask) || !(comp_mask & value_mask)) + return 0; + + switch (selector) { + case IB_SA_GT: + err = (src_value <= dst_value); + break; + case IB_SA_LT: + err = (src_value >= dst_value); + break; + case IB_SA_EQ: + err = (src_value != dst_value); + break; + default: + err = 0; + break; + } + + return err; +} + +int ib_sa_pack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} + +int ib_sa_unpack_attr(void *dst, void *src, int attr_id) +{ + switch (attr_id) { + case IB_SA_ATTR_PATH_REC: + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), src, dst); + break; + default: + return -EINVAL; + } + return 0; +} + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -734,41 +786,16 @@ static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } -/** - * ib_sa_path_rec_get - Start a Path get query - * @client:SA client - * @device:device to send query on - * @port_num: port number to send query on - * @rec:Path Record to send in query - * @comp_mask:component mask to send in query - * @timeout_ms:time to wait for response - * @gfp_mask:GFP mask to use for internal allocations - * @callback:function called when query completes, times out or is - * canceled - * @context:opaque user context passed to callback - * @sa_query:query context, used to cancel query - * - * Send a Path Record Get query to the SA 
to look up a path. The - * callback function will be called when the query completes (or - * fails); status is 0 for a successful response, -EINTR if the query - * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error - * occurred sending the query. The resp parameter of the callback is - * only valid if status is 0. - * - * If the return value of ib_sa_path_rec_get() is negative, it is an - * error code. Otherwise it is a query ID that can be used to cancel - * the query. - */ -int ib_sa_path_rec_get(struct ib_sa_client *client, - struct ib_device *device, u8 port_num, - struct ib_sa_path_rec *rec, - ib_sa_comp_mask comp_mask, - int timeout_ms, gfp_t gfp_mask, - void (*callback)(int status, - struct ib_sa_path_rec *resp, - void *context), - void *context, - struct ib_sa_query **sa_query) +int ib_sa_path_rec_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); @@ -825,7 +852,6 @@ err1: kfree(query); return ret; } -EXPORT_SYMBOL(ib_sa_path_rec_get); static void ib_sa_service_rec_callback(struct ib_sa_query *sa_query, int status, @@ -1398,7 +1424,15 @@ static int __init ib_sa_init(void) goto err3; } + ret = sa_db_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize local SA\n"); + goto err4; + } + return 0; +err4: + notice_cleanup(); err3: mcast_cleanup(); err2: @@ -1409,6 +1443,7 @@ err1: static void __exit ib_sa_cleanup(void) { + sa_db_cleanup(); mcast_cleanup(); notice_cleanup(); ib_unregister_client(&sa_client); diff --git a/include/rdma/ib_local_sa.h b/include/rdma/ib_local_sa.h new file mode 100644 index 0000000..e62d8b0 --- /dev/null +++ b/include/rdma/ib_local_sa.h @@ -0,0 +1,83 @@ +/* + * 
Copyright (c) 2006 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_LOCAL_SA_H +#define IB_LOCAL_SA_H + +#include + +/** + * ib_get_path_rec - Query the local SA database for path information. + * @device: The local device to query. + * @port_num: The port of the local device being queried. + * @sgid: The source GID of the path record. + * @dgid: The destination GID of the path record. + * @pkey: The protection key of the path record. + * @rec: A reference to a path record structure that will receive a copy of + * the response. + * + * Returns a copy of a path record meeting the specified criteria to the + * location referenced by %rec. 
A return value < 0 indicates that an error + * occurred processing the request, or no path record was found. + */ +int ib_get_path_rec(struct ib_device *device, u8 port_num, union ib_gid *sgid, + union ib_gid *dgid, u16 pkey, struct ib_sa_path_rec *rec); + +/** + * ib_create_path_iter - Create an iterator that may be used to walk through + * a list of path records. + * @device: The local device to retrieve path records for. + * @port_num: The port of the local device. + * @dgid: The destination GID of the path record. + * + * This call allocates an iterator that is used to walk through a list of + * cached path records. All path records accessed by the iterator will have the + * specified DGID. User should not hold the iterator for an extended period of + * time, and must free it by calling ib_free_attr_iter. + */ +struct ib_sa_attr_iter *ib_create_path_iter(struct ib_device *device, + u8 port_num, union ib_gid *dgid); + +/** + * ib_free_attr_iter - Release an attribute iterator. + * @iter: The iterator to free. + */ +void ib_free_attr_iter(struct ib_sa_attr_iter *iter); + +/** + * ib_get_next_attr - Retrieve the next attribute referenced by an iterator. + * @iter: A reference to an iterator that points to the next attribute to + * retrieve. 
+ */
+void *ib_get_next_attr(struct ib_sa_attr_iter *iter);
+
+#endif /* IB_LOCAL_SA_H */
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 83d8157..ae52904 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -553,4 +553,7 @@ ib_sa_register_inform_info(struct ib_sa_client *client,
  */
 void ib_sa_unregister_inform_info(struct ib_inform_info *info);
 
+int ib_sa_pack_attr(void *dst, void *src, int attr_id);
+int ib_sa_unpack_attr(void *dst, void *src, int attr_id);
+
 #endif /* IB_SA_H */

From rdreier at cisco.com Mon Jul 2 14:05:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 02 Jul 2007 14:05:18 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070702195927.GB31169@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 2 Jul 2007 22:59:28 +0300")
References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il>
Message-ID: 

> Could you please clarify how do you envision this done?
> Do we just create our own filesystem?
>
> Reason I ask, we'll need something like this for SRC domain too ...

I don't have a really clear idea. "Look at spufs" is about as far as
I got.

- R.

From sean.hefty at intel.com Mon Jul 2 15:47:21 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 15:47:21 -0700
Subject: [ofa-general] [PATCH] ib/cm: fix handling of duplicate SIDR REQs
In-Reply-To: 
Message-ID: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>

Fix handling of duplicate SIDR REQs to avoid sending a reject if one is
detected. Duplicates should simply be discarded.

Signed-off-by: Sean Hefty
---
I went with moving where the state was set, as it seemed a little cleaner
to me. The REQ_RCVD state implies that we can send a SIDR REP by calling
ib_send_cm_sidr_rep(), which is not the case. The REQ_RCVD state also
indicates that the cm_id_priv is located in the remote_sidr_table, but the
insertion failed, so we should not try to remove the item later.

 drivers/infiniband/core/cm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index c7007c4..9135a8c 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2794,7 +2794,6 @@ static int cm_sidr_req_handler(struct cm_work *work)
 				work->mad_recv_wc->recv_buf.grh,
 				&cm_id_priv->av);
 	cm_id_priv->id.remote_id = sidr_req_msg->request_id;
-	cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
 	cm_id_priv->tid = sidr_req_msg->hdr.tid;
 	atomic_inc(&cm_id_priv->work_count);
 
@@ -2804,6 +2803,7 @@ static int cm_sidr_req_handler(struct cm_work *work)
 		spin_unlock_irq(&cm.lock);
 		goto out; /* Duplicate message. */
 	}
+	cm_id_priv->id.state = IB_CM_SIDR_REQ_RCVD;
 	cur_cm_id_priv = cm_find_listen(cm_id->device,
 					sidr_req_msg->service_id,
 					sidr_req_msg->private_data);

From sean.hefty at intel.com Mon Jul 2 15:51:31 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 2 Jul 2007 15:51:31 -0700
Subject: [ofa-general] [PATCH] ib/cm: send no match if a SIDR REQ does not match a listen
In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com>
Message-ID: <000901c7bcfb$8842a270$3c98070a@amr.corp.intel.com>

If a SIDR REQ does not match a listen, we should reply with status value 1
(service ID not supported), rather than dropping through to the default case
of status 2 (rejected by service provider). This also fixes a bug where the
cm_id_priv is removed from the remote_sidr_table twice.

Signed-off-by: Sean Hefty
---
 drivers/infiniband/core/cm.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 9135a8c..9820c67 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2808,9 +2808,8 @@ static int cm_sidr_req_handler(struct cm_work *work)
 					sidr_req_msg->service_id,
 					sidr_req_msg->private_data);
 	if (!cur_cm_id_priv) {
-		rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table);
 		spin_unlock_irq(&cm.lock);
-		/* todo: reply with no match */
+		cm_reject_sidr_req(cm_id_priv, IB_SIDR_UNSUPPORTED);
 		goto out; /* No match. */
 	}
 	atomic_inc(&cur_cm_id_priv->refcount);

From akpm at linux-foundation.org Mon Jul 2 15:56:33 2007
From: akpm at linux-foundation.org (Andrew Morton)
Date: Mon, 2 Jul 2007 15:56:33 -0700
Subject: [ofa-general] Re: idr_get_new_above() limitation?
In-Reply-To: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
Message-ID: <20070702155633.720b5667.akpm@linux-foundation.org>

On Mon, 2 Jul 2007 19:19:26 +0200 Hoang-Nam Nguyen wrote:

> For ehca device driver we're intending to utilize
> idr_get_new_above() and have written a test case, which I'm attaching
> at the end. Basically it tries to get an idr token above a lower boundary
> by calling idr_get_new_above() and then uses idr_find() to check if
> the returned token can be found.
> Here is our observation with 2.6.22-rc7 on ppc64:
>
> Use lower boundary 0x3ffffffc
> [root at xyz idr_bug]# insmod idr_test_mod.ko start=1073741820
> insmod: error inserting 'idr_test_mod.ko': -1 Unknown symbol in module
> [root at xyz idr_bug]# dmesg -c
> i=3ffffffc token=3ffffffc t=000000003ffffffc
> i=3ffffffd token=3ffffffd t=000000003ffffffd
> i=3ffffffe token=3ffffffe t=000000003ffffffe
> i=3fffffff token=3fffffff t=000000003fffffff
> i=40000000 token=40000000 t=0000000000000000
> Invalid object 0000000000000000. Expected 40000000
>
> That means token 0x40000000 seems to be the "upper boundary" of idr_find().
> However the behaviour is not consistent in that it was returned by
> idr_get_new_above().
>
> Looking at void *idr_find(struct idr *idp, int id)
> {
> 	int n;
> 	struct idr_layer *p;
>
> 	n = idp->layers * IDR_BITS;
> 	p = idp->top;
>
> 	/* Mask off upper bits we don't use for the search. */
> 	id &= MAX_ID_MASK;
>
> 	if (id >= (1 << n))
> 		return NULL;
>
> 	while (n > 0 && p) {
> 		n -= IDR_BITS;
> 		p = p->ary[(id >> n) & IDR_MASK];
> 	}
> 	return((void *)p);
> }
> we found that the if-condition has failed:
> layers = 5
> IDR_BITS = 6
> n = 30
> (id >= (1 << n)) = (0x40000000 >= 0x40000000) = 1
>
> Since MAX_ID_MASK=0x7fffffff, I'm wondering if 0x40000000 is the actual
> upper boundary. Any hints or suggestions are appreciated.

Looks like a bug to me. Really an IDR tree on 32-bit should go all the way
up to 0xffffffff. Certainly up to 0x7fffffff.

And the fact that idr_find() disagrees with idr_get_new_above() is a big hint
that the code is getting it wrong.

From jim.houston at ccur.com Mon Jul 2 17:31:40 2007
From: jim.houston at ccur.com (Jim Houston)
Date: Mon, 02 Jul 2007 20:31:40 -0400
Subject: [ofa-general] Re: idr_get_new_above() limitation?
In-Reply-To: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
Message-ID: <1183422700.3130.27.camel@localhost.localdomain>

On Mon, 2007-07-02 at 19:19 +0200, Hoang-Nam Nguyen wrote:

> i=3fffffff token=3fffffff t=000000003fffffff
> i=40000000 token=40000000 t=0000000000000000
> Invalid object 0000000000000000. Expected 40000000
>
> That means token 0x40000000 seems to be the "upper boundary" of idr_find().
> However the behaviour is not consistent in that it was returned by
> idr_get_new_above().

Hi Nam,

Yes this is a bug. Thanks for the great test module.

The problem is in idr_get_new_above_int() in the loop which adds new
layers to the top of the radix tree.
It is failing the "layers < (MAX_LEVEL - 1)" test. It doesn't allocate
the new layer but still calls sub_alloc() which relies on having the new
layer properly constructed. I believe that it is allocating the slot which
corresponds to id = 0.

I believe this is an off by one error in calculating the MAX_LEVEL value.
I will do a more careful review and post a fix in the next day or so.

I have been in Ottawa for OLS. I'm flying home tomorrow.

Jim Houston - Concurrent Computer Corp.

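As a rough illustration of the range check discussed in the
idr_get_new_above() thread, the following user-space sketch reproduces only
the arithmetic that Hoang-Nam Nguyen reported (layers = 5, IDR_BITS = 6,
MAX_ID_MASK = 0x7fffffff). The helper name idr_find_rejects() is hypothetical
and is not a kernel function. It models just the "id >= (1 << n)" test quoted
from idr_find(), not the tree walk and not the allocation path that Jim
Houston suspects.

```c
/* Constants as quoted in the thread for 2.6.22-rc7 on ppc64. */
#define IDR_BITS     6
#define MAX_ID_MASK  0x7fffffff

/*
 * Hypothetical helper (an assumption for illustration, not kernel API).
 * It mirrors the range check at the top of idr_find() as quoted in the
 * report: returns 1 when the lookup would bail out with NULL before
 * walking the tree, and 0 when the id passes the check.
 */
static int idr_find_rejects(int layers, int id)
{
	int n = layers * IDR_BITS;	/* 5 layers * 6 bits = 30 */

	/* Mask off upper bits, as the quoted idr_find() does. */
	id &= MAX_ID_MASK;

	return id >= (1 << n);
}
```

With layers = 5, idr_find_rejects(5, 0x3fffffff) evaluates to 0 and
idr_find_rejects(5, 0x40000000) evaluates to 1, which is consistent with the
dmesg output in the report: the token 0x40000000 was handed out by
idr_get_new_above() but could not be found again.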
From rdreier at cisco.com Mon Jul 2 20:41:48 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 20:41:48 -0700 Subject: [ofa-general] [PATCH 1 of 2] mlx4: Add new Mellanox device IDs In-Reply-To: <200707021736.18855.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 2 Jul 2007 17:36:18 +0300") References: <200707021736.18855.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From rdreier at cisco.com Mon Jul 2 20:46:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 20:46:51 -0700 Subject: [ofa-general] [PATCH 2 of 2] libmlx4: Add new Mellanox device IDs In-Reply-To: <200707021737.34303.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 2 Jul 2007 17:37:34 +0300") References: <200707021737.34303.jackm@dev.mellanox.co.il> Message-ID: thanks, I decided that there was no point in having these defines so I just did it like the kernel: commit 040743fb06cf2abf9f302ee6f5870fd3fe944868 Author: Roland Dreier Date: Mon Jul 2 20:45:40 2007 -0700 Add new device IDs for PCIe gen2 HCAs Also just use hex device IDs plus comments instead of creating defines that are only used once. 
Signed-off-by: Roland Dreier diff --git a/src/mlx4.c b/src/mlx4.c index 3684b50..b2e2ba9 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -53,29 +53,19 @@ #define PCI_VENDOR_ID_MELLANOX 0x15b3 #endif -#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_SDR -#define PCI_DEVICE_ID_MELLANOX_HERMON_SDR 0x6340 -#endif - -#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_DDR -#define PCI_DEVICE_ID_MELLANOX_HERMON_DDR 0x634a -#endif - -#ifndef PCI_DEVICE_ID_MELLANOX_HERMON_QDR -#define PCI_DEVICE_ID_MELLANOX_HERMON_QDR 0x6354 -#endif - #define HCA(v, d) \ { .vendor = PCI_VENDOR_ID_##v, \ - .device = PCI_DEVICE_ID_MELLANOX_##d } + .device = d } struct { unsigned vendor; unsigned device; } hca_table[] = { - HCA(MELLANOX, HERMON_SDR), - HCA(MELLANOX, HERMON_DDR), - HCA(MELLANOX, HERMON_QDR), + HCA(MELLANOX, 0x6340), /* MT25408 "Hermon" SDR */ + HCA(MELLANOX, 0x634a), /* MT25408 "Hermon" DDR */ + HCA(MELLANOX, 0x6354), /* MT25408 "Hermon" QDR */ + HCA(MELLANOX, 0x6732), /* MT25408 "Hermon" DDR PCIe gen2 */ + HCA(MELLANOX, 0x673c), /* MT25408 "Hermon" QDR PCIe gen2 */ }; static struct ibv_context_ops mlx4_ctx_ops = { From rdreier at cisco.com Mon Jul 2 20:50:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 20:50:52 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a fix for a crash in IPoIB and new device IDs for mlx4: Jack Morgenstein (1): mlx4_core: Add new Mellanox device IDs Ralph Campbell (1): IPoIB/cm: Partial error clean up unmaps wrong address drivers/infiniband/ulp/ipoib/ipoib_cm.c | 4 ++-- drivers/net/mlx4/main.c | 2 ++ 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 5ffc464..ea74d1e 100644 --- 
a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -148,8 +148,8 @@ partial_error: ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); - for (; i >= 0; --i) - ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); + for (; i > 0; --i) + ib_dma_unmap_single(priv->ca, mapping[i], PAGE_SIZE, DMA_FROM_DEVICE); dev_kfree_skb_any(skb); return NULL; diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 41eafeb..c3da2a2 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -911,6 +911,8 @@ static struct pci_device_id mlx4_pci_table[] = { { PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */ { PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */ { PCI_VDEVICE(MELLANOX, 0x6354) }, /* MT25408 "Hermon" QDR */ + { PCI_VDEVICE(MELLANOX, 0x6732) }, /* MT25408 "Hermon" DDR PCIe gen2 */ + { PCI_VDEVICE(MELLANOX, 0x673c) }, /* MT25408 "Hermon" QDR PCIe gen2 */ { 0, } }; From rdreier at cisco.com Mon Jul 2 20:56:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 02 Jul 2007 20:56:58 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ work requests In-Reply-To: <200706211201.58440.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 12:01:58 +0300") References: <200706211201.58440.jackm@dev.mellanox.co.il> Message-ID: I trust you guys on this, but have you thought about whether blueflame makes sense for RDMA read requests? After all, an RDMA read requires the responder to send potentially a large amount of data to complete, and even for small requests I would think that latency-sensitive apps would avoid it. Is there an MPI implementation or other app that you know of where this really helps? - R. 
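For the IPoIB/cm partial_error hunk in the pull request above, the index arithmetic can be illustrated with a small user-space sketch. It assumes (the message does not spell this out) that mapping[0] holds the head buffer, that fragment k is stored in mapping[k + 1], and that the error path is entered with i equal to the index of the fragment that failed, so only mapping[1..i] are valid. The helper names slots_old and slots_new are hypothetical and only record which slots each loop would unmap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: mapping[0] is the head buffer and fragment k lives in
 * mapping[k + 1]. If setup fails at fragment i, only mapping[1..i] were
 * mapped; mapping[i + 1] was never filled in. */

/* Pre-patch loop: records the slots it would unmap and returns the count. */
static size_t slots_old(int i, int *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (; i >= 0; --i)
        out[n++] = i + 1;   /* first pass touches mapping[i + 1] */
    return n;
}

/* Post-patch loop: visits mapping[i] down to mapping[1] and nothing else. */
static size_t slots_new(int i, int *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (; i > 0; --i)
        out[n++] = i;
    return n;
}
```

With a failure at fragment 2, the old loop visits slots 3, 2, 1 while the new loop visits slots 2, 1. Under the assumption stated above, slot 3 was never mapped, which matches the "unmaps wrong address" wording of the commit title.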
From yonic at voltaire.com Mon Jul 2 22:53:20 2007 From: yonic at voltaire.com (Yonathan Cohen) Date: Tue, 3 Jul 2007 08:53:20 +0300 Subject: [ofa-general] req_notify_cq is NULL Message-ID: <39C75744D164D948A170E9792AF8E7CA4F236B@exil.voltaire.com> Hello, I am creating a listener like so: cma_id = rdma_create_id(cma_handler, my_context, RDMA_PS_TCP); Then I call bind: memset(&addr, 0, sizeof addr); addr.sin_port = htons(port); addr.sin_family = AF_INET; addr.sin_addr.s_addr = INADDR_ANY; rdma_bind_addr(cma_id, (struct sockaddr *)&addr); And listen: rdma_listen(cma_id, 0); But when the event handler (cma_handler) is called back, the "struct rdma_cm_id *" has its API function "req_notify_cq" (i.e. rdma_cm_id->device->req_notify_cq) set to NULL, although other API functions such as create_cq and create_qp are set (I'm not sure whether with a valid pointer). I added a printk to mthca_register_device() (in mthca_provider.c), which at insmod logs that "req_notify_cq" is in fact set to an address. I'm using a Mellanox HCA: "Mellanox Technologies MT23108 InfiniHost", so it's not memfree and req_notify_cq is set to mthca_tavor_arm_cq. But still, when RDMA_CM_EVENT_CONNECT_REQUEST is received, this function is NULL. Please help. __________________________________________________________ Cohen Yonatan | +972-9-9718607 (o) Software.
Eng, Storage group Voltaire - The Grid Backbone www.voltaire.com From yonic at voltaire.com Mon Jul 2 22:56:13 2007 From: yonic at voltaire.com (Yonathan Cohen) Date: Tue, 3 Jul 2007 08:56:13 +0300 Subject: [ofa-general] RE: [ewg] req_notify_cq is NULL In-Reply-To: <39C75744D164D948A170E9792AF8E7CA4F236B@exil.voltaire.com> Message-ID: <39C75744D164D948A170E9792AF8E7CA4F236D@exil.voltaire.com> > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org > [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Yonathan Cohen > Sent: Tuesday, July 03, 2007 8:53 AM > To: OpenFabrics EWG; general at lists.openfabrics.org > Subject: [ewg] req_notify_cq is NULL > > Hello, > I am creating a listener like so : > cma_id = rdma_create_id(cma_handler, my_context, RDMA_PS_TCP); > > And then call bind : > memset(&addr, 0, sizeof addr); > addr.sin_port = htons(port); > addr.sin_family = AF_INET; > addr.sin_addr.s_addr = INADDR_ANY; > rdma_bind_addr(cma_id, (struct sockaddr *)&addr); > > And listen : > rdma_listen(cma_id, 0); > > But when the event_handler ( cma_handler ) is called-back the "struct > rdma_cm_id* " > Has its api func "req_notify_cq" ( i.e. > rdma_cm_id->device->req_notify->cq ) set to NULL. > Although, other api funcs like create_cq and create_qp are > set with ( im not sure with a valid pointer ) > > I added a printk to mthca_register_device() ( in > mthca_provider.c ) which at insmod logs that "req_notify_cq" > is in fact set with an address. > Im using a mellanox HCA : "Mellanox Technologies MT23108 InfiniHost" > So its not memfree and req_notify_cq is set with mthca_tavor_arm_cq. > But still when the RMDA_CM_EVENT_CONNECT_REQUEST is received > this func is NULL. > > Please help. > > __________________________________________________________ > Cohen Yonatan | +972-9-9718607 (o) > Software. 
Eng, Storage group > Voltaire - The Grid Backbone > www.voltaire.com > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > Btw - Im using ofed1.2 ga. __________________________________________________________ Cohen Yonatan | +972-9-9718607 (o) Software. Eng, Storage group Voltaire - The Grid Backbone www.voltaire.com From mst at dev.mellanox.co.il Mon Jul 2 23:00:49 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 09:00:49 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> Message-ID: <20070703060049.GF1147@mellanox.co.il> > we should move to UC For HW that supports UC with SRQ, yes. -- MST From mst at dev.mellanox.co.il Mon Jul 2 23:10:29 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 09:10:29 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <46895A18.2000100@ichips.intel.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> Message-ID: <20070703061029.GG1147@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > >So we must send something that will force remote side to respond. One such > >message is LAP with current primary path used as proposed alternate path. > >Remote will respond with APR with AP status 5 if the connection is there, > >and > >status 1 if it is not. > > I didn't follow this. Is this just an out of band keep alive message? Yes. Exactly.
> Why not use DREQ to indicate that the connection went away under normal > circumstances, Yes, clearly we do this. Keepalives cover the failure cases: remote is down, or has rebooted, or all DREQs were lost, etc ... > and a send failure in an abnormal termination case? What do you mean by "send failure"? Completion with error? We only get these with RC, not with UC. -- MST From ogerlitz at voltaire.com Tue Jul 3 01:50:52 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 3 Jul 2007 11:50:52 +0300 (IDT) Subject: [ofa-general] consumer data buffer ownership for inline sends Message-ID: Hi Roland, Looking on mthca_arbel_post_send / mthca_tavor_post_send at libmthca we see that the inline code copies the data on the library wqe buffer etc. Does this means that for inline sends, when ibv_post_send returns, the consumer owns back the data buffer associated with this send? Can this be stated as the official policy of libibverbs? Or. From ogerlitz at voltaire.com Tue Jul 3 01:56:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 11:56:07 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703061029.GG1147@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> Message-ID: <468A0F27.3020909@voltaire.com> Michael S.
Tsirkin wrote: >> Quoting Sean Hefty : >> Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode >> >>> So we must send something that will force remote side to respond. One such >>> message is LAP with current primary path used as proposed alternate path. >>> Remote will respond with APR with AP status 5 if the connection is there, >>> and >>> status 1 if it is not. >> I didn't follow this. Is this just an out of band keep alive message? > > Yes. Exactly. Michael, You may know that for each neighbour, the Linux network stack sends every m jiffies a --unicast-- ARP probe, where after n jiffies there is no ARP reply, it sends a broadcast ARP. The default values are m=30*HZ and n=30*HZ, but you can change them, its net.ipv4.neigh.default{gc_interval,gc_stale_time} My understanding it that it solves everything, no need for keep alives Do I missing anything here? Or.
From mst at dev.mellanox.co.il Tue Jul 3 02:16:39 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 12:16:39 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A0F27.3020909@voltaire.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> Message-ID: <20070703091639.GJ1147@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > Michael S. Tsirkin wrote: > >>Quoting Sean Hefty : > >>Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > >> > >>>So we must send something that will force remote side to respond. One > >>>such > >>>message is LAP with current primary path used as proposed alternate path. > >>>Remote will respond with APR with AP status 5 if the connection is > >>>there, and > >>>status 1 if it is not. > >>I didn't follow this. Is this just an out of band keep alive message? > > > >Yes. Exactly. > > Michael, > > You may know that for each neighbour, the Linux network stack sends > every m jiffies a --unicast-- ARP probe, where after n jiffies there is > no ARP reply, it sends a broadcast ARP. > > The default values are m=30*HZ and n=30*HZ, but you can change them, > its net.ipv4.neigh.default{gc_interval,gc_stale_time} > > My understanding it that it solves everything, no need for keep alives > > Do I missing anything here? How does this solve the problem? If the remote side has lost the connection, unicast ARPs will get dropped but broadcast ARPs will get answered to.
We'd need to re-create the connection if this happens - but is there a way to detect this? -- MST From ogerlitz at voltaire.com Tue Jul 3 02:42:01 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 12:42:01 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703091639.GJ1147@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> Message-ID: <468A19E9.2090707@voltaire.com> Michael S. Tsirkin wrote: >>>> I didn't follow this. Is this just an out of band keep alive message? >>> Yes. Exactly. >> You may know that for each neighbour, the Linux network stack sends >> every m jiffies a --unicast-- ARP probe, where after n jiffies there is >> no ARP reply, it sends a broadcast ARP. > How does this solve the problem? > If the remote side has lost the connection, unicast ARPs will get dropped > but broadcast ARPs will get answered to. We'd need to re-create the connection > if this happens - but is there a way to detect this? Yes, I know that there is a way to register for kernel level neighbour update events, so on each neighbour update, ipoib cm reconnects, plus you can remove the fast path memcmp we do today on the remote GUID, and we done :) This is b/c it covers both the case that the unicast arp probe was not replied either since the --GID-- we have is not the correct one (eg under HA scheme) or that the remote --QP-- is not what we think. Or. 
From ogerlitz at voltaire.com Tue Jul 3 02:44:58 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 12:44:58 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703060049.GF1147@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> <20070703060049.GF1147@mellanox.co.il> Message-ID: <468A1A9A.5060208@voltaire.com> Michael S. Tsirkin wrote: >> we should move to UC > > For HW that supports UC with SRQ, yes. Dror did not mention the HW; my understanding is that this aspect is fine. Now, assuming the need for a liveness protocol is behind us (and if not, it can be implemented as you suggested), the problem is narrowed to having the FW support SRQ/UC. Once this is in place, the IPoIB-CM/UC implementation can start; later, when the IBTA is done spec-ing it, it will no longer be non-compliant. Same as with the SRC: you don't wait for it to be standard before doing the implementation. Or.
From vlad at lists.openfabrics.org Tue Jul 3 02:45:54 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 3 Jul 2007 02:45:54 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070703-0200 daily build status Message-ID: <20070703094554.7DA48E6085E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 
Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From mst at dev.mellanox.co.il Tue Jul 3 02:47:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 12:47:03 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A19E9.2090707@voltaire.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> Message-ID: <20070703094703.GA12153@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > Michael S. Tsirkin wrote: > > >>>>I didn't follow this. Is this just an out of band keep alive message? > > >>>Yes. Exactly. > > >>You may know that for each neighbour, the Linux network stack sends > >>every m jiffies a --unicast-- ARP probe, where after n jiffies there is > >>no ARP reply, it sends a broadcast ARP. > > >How does this solve the problem? > >If the remote side has lost the connection, unicast ARPs will get dropped > >but broadcast ARPs will get answered to. We'd need to re-create the > >connection > >if this happens - but is there a way to detect this? 
> > Yes, I know that there is a way to register for kernel level neighbour > update events, so on each neighbour update, ipoib cm reconnects, plus > you can remove the fast path memcmp we do today on the remote GUID, and > we done :) > > This is b/c it covers both the case that the unicast arp probe was not > replied either since the --GID-- we have is not the correct one (eg > under HA scheme) or that the remote --QP-- is not what we think. In the typical case (remote side reboots) both the GID and the UD QPN stay the same, so it seems there won't be any neighbour update, right? If so, while playing with neighbour update events might get us data path speed-up, it will not solve the problem of detecting the connection is alive. -- MST From ogerlitz at voltaire.com Tue Jul 3 02:55:48 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 12:55:48 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703094703.GA12153@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> Message-ID: <468A1D24.6060903@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> Yes, I know that there is a way to register for kernel level neighbour >> update events, so on each neighbour update, ipoib cm reconnects, plus >> you can remove the fast path memcmp we do today on the remote GUID, and >> we done :) > In the typical case (remote side reboots) both the GID and the UD QPN stay the > same, so it seems there won't be any neighbour update, right? If so, while > playing with neighbour update events might get us data path speed-up, it will > not solve the problem of detecting the connection is alive. 
I don't think we should give up here, first there might be a way (event) and if not lets change the kernel :) to know that the neighbouring subsystem issued a broadcast arp on a nieghbour. Second, let me think... What did the people who wrote the RFC said about the need / implementation of liveness protocol? Or. From ogerlitz at voltaire.com Tue Jul 3 03:29:47 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 13:29:47 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A1D24.6060903@voltaire.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> Message-ID: <468A251B.70901@voltaire.com> Or Gerlitz wrote: > Michael S. Tsirkin wrote: >> In the typical case (remote side reboots) both the GID and the UD QPN >> stay the same, so it seems there won't be any neighbour update, right? If so, >> while playing with neighbour update events might get us data path speed-up, >> it will not solve the problem of detecting the connection is alive. > Second, let me think... OK, if IPoIB-CM was using bi-directional connection, problem is solved, since the remote side re-connects (to send the ARP reply) and either the CM or IPoIB-CM the CM consumer invalidates the existing connection. Also with uni-directional connections, when the remote side re-connects to us, it can put in the private data its RX QPN (or 0 if there's no such). The ipoib-cm CM callback can compare this QPN against what it knows on the remote and if its different, re-connect. This can be further simplified, but lets first take it high-level. Can you remind me what was --the-- reasoning for uni directional connections? Or. 
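The QPN comparison proposed in the message above for uni-directional connections can be sketched roughly as follows. This is a hypothetical illustration, not existing IPoIB-CM code: struct peer_state, needs_reconnect() and the treatment of a stored QPN of 0 as "no TX connection yet" are assumptions made for the example; only the idea of comparing the advertised RX QPN with the known one comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-neighbour state: the remote RX QPN our TX side targets. */
struct peer_state {
    uint32_t known_rx_qpn;  /* 0: we hold no TX connection to this peer */
};

/* Called when the peer connects to us and advertises its current RX QPN in
 * the private data (0 if it has none). Returns true when our TX connection
 * points at a QP the peer no longer reports and should be re-created. */
static bool needs_reconnect(const struct peer_state *peer,
                            uint32_t advertised_rx_qpn)
{
    assert(peer != NULL);
    if (peer->known_rx_qpn == 0)
        return false;       /* nothing of ours to invalidate */
    return advertised_rx_qpn != peer->known_rx_qpn;
}
```

A peer that rebooted would advertise 0 or a new QPN, so the comparison fails and the local side re-connects; a peer that still reports the known QPN leaves the existing connection alone.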
From mst at dev.mellanox.co.il Tue Jul 3 03:36:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 13:36:27 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A1D24.6060903@voltaire.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> Message-ID: <20070703103627.GB12153@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > Michael S. Tsirkin wrote: > >>Quoting Or Gerlitz : > > >>Yes, I know that there is a way to register for kernel level neighbour > >>update events, so on each neighbour update, ipoib cm reconnects, plus > >>you can remove the fast path memcmp we do today on the remote GUID, and > >>we done :) > > >In the typical case (remote side reboots) both the GID and the UD QPN stay > >the > >same, so it seems there won't be any neighbour update, right? If so, while > >playing with neighbour update events might get us data path speed-up, it > >will > >not solve the problem of detecting the connection is alive. > > I don't think we should give up here, first there might be a way (event) > and if not lets change the kernel :) to know that the neighbouring > subsystem issued a broadcast arp on a nieghbour. > Second, let me think... Frankly, I like the idea of using our own keepalive better: it will also work if we have e.g. multiple connections per neighbour. > What did the people who wrote the RFC said about the need / > implementation of liveness protocol? That it's a general IB problem and should be addressed at IB level. Which it seems to be - with CM. 
-- MST From mst at dev.mellanox.co.il Tue Jul 3 04:00:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 14:00:03 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A251B.70901@voltaire.com> References: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <468A251B.70901@voltaire.com> Message-ID: <20070703110003.GC12153@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > Or Gerlitz wrote: > >Michael S. Tsirkin wrote: > > >>In the typical case (remote side reboots) both the GID and the UD QPN > >>stay the same, so it seems there won't be any neighbour update, right? > >>If so, while playing with neighbour update events might get us data path > >>speed-up, it will not solve the problem of detecting the connection is > >>alive. > > >Second, let me think... I don't see why are you trying to get rid of keepalives. With RC we currently have an arbitrary ACK timeout, and this is no different, and quite easy to implement. > OK, if IPoIB-CM was using bi-directional connection, problem is solved, > since the remote side re-connects (to send the ARP reply) and either the > CM or IPoIB-CM the CM consumer invalidates the existing connection. Why should this invalidate the existing connection? IMO killing a connection simply because remote connected wouldn't be spec compliant: spec allows multiple connections to a single host, and it's easy to imagine a setup where this will be useful e.g. for performance reasons (I actually have such a project on my todo list). 
> Also with uni-directional connections, when the remote side re-connects > to us, it can put in the private data its RX QPN (or 0 if there's no > such). The ipoib-cm CM callback can compare this QPN against what it > knows on the remote and if its different, re-connect. This can be > further simplified, but lets first take it high-level. What if remote already has a connection to us? Anyway, this is clearly outside the existing spec. > Can you remind me what was --the-- reasoning for uni directional > connections? Lots of reasons. Simplicity of implementation. Solution to tricky dead/livelock scenarios with crossing connection requests. Fault containment. Ability to extend to multiple connections per host in the future. It just looks like a good idea. -- MST From ogerlitz at voltaire.com Tue Jul 3 04:07:14 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 14:07:14 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703110003.GC12153@mellanox.co.il> References: <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <468A251B.70901@voltaire.com> <20070703110003.GC12153@mellanox.co.il> Message-ID: <468A2DE2.9040702@voltaire.com> Michael S. Tsirkin wrote: > I don't see why are you trying to get rid of keepalives. > With RC we currently have an arbitrary ACK timeout, and this > is no different, and quite easy to implement. Since we agree (?) that RC is bad for IPoIB-CM and I want to find a way for a UC based implementation to avoid implementing a dedicated keep alive protocol. As for all your other comments, I need to think more, will get back to it later this week. Or. 
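The QPN comparison proposed earlier in this exchange for uni-directional connections can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct peer_state` and `peer_needs_reconnect()` are hypothetical names, not IPoIB-CM code, and carrying the RX QPN in the REQ private data is the proposal under discussion, which (as noted above) lies outside the existing spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-neighbour state; not the real IPoIB-CM structures. */
struct peer_state {
    uint32_t known_rx_qpn;  /* RX QPN last learned for the remote, 0 if none */
    bool     have_tx_conn;  /* true if we hold a TX connection to the remote */
};

/*
 * Called when the remote side connects to us. advertised_rx_qpn is the
 * value the remote placed in its private data (0 means "no RX QP").
 * Returns true if our existing TX connection looks stale and should be
 * re-established. The newly advertised QPN is always recorded.
 */
bool peer_needs_reconnect(struct peer_state *p, uint32_t advertised_rx_qpn)
{
    bool stale = p->have_tx_conn && advertised_rx_qpn != p->known_rx_qpn;

    p->known_rx_qpn = advertised_rx_qpn;
    return stale;
}
```

A rebooted remote would advertise 0 (or a new QPN), which differs from the recorded value and triggers the reconnect. The sketch does not address the case where the remote already has a connection to us.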
From mst at dev.mellanox.co.il Tue Jul 3 04:16:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 14:16:33 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A2DE2.9040702@voltaire.com> References: <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <468A251B.70901@voltaire.com> <20070703110003.GC12153@mellanox.co.il> <468A2DE2.9040702@voltaire.com> Message-ID: <20070703111633.GE12153@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > Michael S. Tsirkin wrote: > > >I don't see why are you trying to get rid of keepalives. > >With RC we currently have an arbitrary ACK timeout, and this > >is no different, and quite easy to implement. > > Since we agree (?) that RC is bad for IPoIB-CM and I want to find a way > for a UC based implementation to avoid implementing a dedicated keep > alive protocol. > > As for all your other comments, I need to think more, will get back to > it later this week. Not sure it's worth the effort: just scanning the list of active connections once in a while and sending a LAP message seems easy enough. 
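A minimal sketch of such a periodic scan, assuming a simple table of connections and a caller-supplied hook that would send the LAP. All names here (`struct conn`, `keepalive_scan`, `keepalive_reply`, the tick-based timestamps) are hypothetical and chosen for illustration; the choice to declare a connection dead after one unanswered probe, with no retry, is also an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum conn_state { CONN_ALIVE, CONN_PROBING, CONN_DEAD };

struct conn {
    enum conn_state state;
    uint64_t last_rx;     /* time of last data or probe reply, in ticks */
    uint64_t probe_sent;  /* time the outstanding probe was sent */
};

/* Hypothetical hook: a real implementation would send a CM LAP here. */
typedef void (*send_probe_fn)(struct conn *c);

/* Record a probe reply (or any received traffic) for a live connection. */
void keepalive_reply(struct conn *c, uint64_t now)
{
    if (c->state != CONN_DEAD) {
        c->state = CONN_ALIVE;
        c->last_rx = now;
    }
}

/*
 * One pass over the connection table. Idle connections get a probe;
 * connections whose probe went unanswered for probe_timeout ticks are
 * marked dead so the caller can re-create them.
 * Returns the number of connections newly declared dead in this pass.
 */
size_t keepalive_scan(struct conn *tbl, size_t n, uint64_t now,
                      uint64_t idle_limit, uint64_t probe_timeout,
                      send_probe_fn send_probe)
{
    size_t dead = 0;

    for (size_t i = 0; i < n; i++) {
        struct conn *c = &tbl[i];

        switch (c->state) {
        case CONN_ALIVE:
            if (now - c->last_rx >= idle_limit) {
                c->state = CONN_PROBING;
                c->probe_sent = now;
                if (send_probe)
                    send_probe(c);
            }
            break;
        case CONN_PROBING:
            if (now - c->probe_sent >= probe_timeout) {
                c->state = CONN_DEAD;
                dead++;
            }
            break;
        case CONN_DEAD:
            break;
        }
    }
    return dead;
}
```

The scan would be driven from a timer or delayed work item; the interval and timeouts are policy left to the caller.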
-- MST From ogerlitz at voltaire.com Tue Jul 3 04:41:59 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 14:41:59 +0300 Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support In-Reply-To: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com> References: <000601c7bceb$ffff3400$3c98070a@amr.corp.intel.com> Message-ID: <468A3607.8090008@voltaire.com> Sean Hefty wrote: > +static void inform_event_handler(struct ib_event_handler *handler, > + struct ib_event *event) > +{ > + struct inform_device *dev; > + > + dev = container_of(handler, struct inform_device, event_handler); > + > + switch (event->event) { > + case IB_EVENT_PORT_ERR: > + case IB_EVENT_LID_CHANGE: > + case IB_EVENT_SM_CHANGE: > + case IB_EVENT_CLIENT_REREGISTER: > + inform_groups_lost(&dev->port[event->element.port_num - > + dev->start_port]); I think you want to act here only if event->element.port_num is the port this inform_device is associated with (similar to IPoIB), also the same for mcast_event_handler. Or. From ogerlitz at voltaire.com Tue Jul 3 05:26:01 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 03 Jul 2007 15:26:01 +0300 Subject: [ofa-general] Re: [PATCH] ib/cm: fix handling of duplicate SIDR REQs In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com> References: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com> Message-ID: <468A4059.8060809@voltaire.com> Sean Hefty wrote: > Fix handling to duplicate SIDR REQs to avoid sending a reject if > one is detected. Duplicates should simply be discarded. Hi Sean, Thanks for the fast (as usual...) patches, I am not sure I will be able to test it today, will let you know by tomorrow. Or. 
From tziporet at mellanox.co.il Tue Jul 3 08:03:50 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 3 Jul 2007 18:03:50 +0300 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans Message-ID: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Meeting minutes are available also on OFA Wiki: https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007 Abbreviated minutes / summary * OFED 1.2.1 support release - we plan a support release on beginning of August. * OFED 1.3 - decided that its most important to close the schedule and focus on most important features based on this schedule * Based on discussion in the meeting it seems that the best target for OFED 1.3 is November 07 * Most important features (from representative who participated in the meeting) * Voltaire: ConnectX stable release * IBM - IPOIB CM without SRQ * Qlogic: Package convenient for distros; ConnectX stable * iWARP: Chelsio: Get to GA level and NFSoRDMA integration. NetEffect: Get the drivers into OFED * Mellanox: ConnectX stable release; new package; QoS Action Items: 1. Other EWG members (Cisco, Intel, Labs) - send most important features for 1.3 2. Tziporet - set a meeting with Redhat & Novell to close the new package definition 3. Tziporet - publish OFED 1.3 schedule 4. MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 - is there any specific requests toward SC07 Detailed Minutes * OFED 1.2.1 release: * Companies that are mostly interested in such release (e.g. IBM, Chelsio) will do most of testing for their HW. * Not all companies are committed to QA this release, so in the release notes we will mention this limitation. * There are weekly builds of OFED 1.2 branch. Any other build should be requested from Vlad. * OFED 1.2.c: * All agree its important to have this code stream, and why it cannot be the same as 1.2, and that we cannot wait for 1.3. 
* There are companies that are currently using this code stream, and this will prevent them from participating in QA of 1.2.1 * OFED 1.3: * There was a discussion on whether to have the release in November 07 or January 08 (all agreed that December is not a good month) * Decision was to reduce features and have a release this year => November * There were no participants from the labs or MPI, thus we lack information on important features that should be ready for SC07 Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 From tziporet at dev.mellanox.co.il Tue Jul 3 08:28:11 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 03 Jul 2007 18:28:11 +0300 Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ work requests In-Reply-To: References: <200706211201.58440.jackm@dev.mellanox.co.il> Message-ID: <468A6B0B.4070703@mellanox.co.il> Roland Dreier wrote: > I trust you guys on this, but have you thought about whether blueflame > makes sense for RDMA read requests? After all, an RDMA read requires > the responder to send potentially a large amount of data to complete, > and even for small requests I would think that latency-sensitive apps > would avoid it. Is there an MPI implementation or other app that you > know of where this really helps? > > You can run the ib_read_lat test and see the latency improvement for each message size. We also have customers for whom this improvement is important.
Tziporet From panda at cse.ohio-state.edu Tue Jul 3 08:30:26 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 3 Jul 2007 11:30:26 -0400 (EDT) Subject: [ofa-general] Re: [ewg] OFED July 2, meeting summary on next OFED plans In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> from "Tziporet Koren" at Jul 03, 2007 06:03:50 PM Message-ID: <200707031530.l63FUQtj027641@xi.cse.ohio-state.edu> Tziporet, I was on travel last week and yesterday. Thus, I could neither send back a reply before the conference call nor could attend the conference call. > Action Items: > 4. MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 - > is there any specific requests toward SC07 We plan to have MVAPICH 1.0 and MVAPICH2 1.0 for OFED 1.3. As Shaun indicated during yesterday's call, we are working on MVAPICH2 1.0 with a set of new features and plan to release it in near future. This can definitely be included in OFED 1.3. We have also started working on MVAPICH 1.0. Depending on the feature freeze date for OFED 1.3, we can finalize the feature list for MVAPICH 1.0. Thanks, DK From mshefty at ichips.intel.com Tue Jul 3 10:02:17 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jul 2007 10:02:17 -0700 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703103627.GB12153@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <20070703103627.GB12153@mellanox.co.il> Message-ID: <468A8119.5070104@ichips.intel.com> > That it's a general IB problem and should be addressed at IB level. > Which it seems to be - with CM. 
I understand the simplicity that using LAP for an out-of-band keep alive message can give you, but that's not the intent of the message. (You could also use REQ/REJ or SIDR REQ/SIDR REP messages for this carrying the right private data...) If we don't want to require apps to send in-band keep alive messages, then I think we should explore all potential out-of-band solutions. For example, event registration could be used to detect that a remote node has gone down. We could use per node keep alive messages, rather than per connection messages. We could add a new out-of-band keep alive message. Or clearly define that LAP is the preferred way of for all connections to do keep alives. - Sean From ardavis at ichips.intel.com Tue Jul 3 10:06:34 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 03 Jul 2007 10:06:34 -0700 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Message-ID: <468A821A.10704@ichips.intel.com> Tziporet Koren wrote: > > Meeting minutes are available also on OFA Wiki: > _https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007_ > > *Abbreviated minutes / summary* > > * OFED 1.2.1 support release - we plan a support release on > beginning of August. > * OFED 1.3 - decided that its most important to close the schedule > and focus on most important features based on this schedule > o Based on discussion in the meeting it seems that the best > target for OFED 1.3 is November 07 > > * Most important features (from representative who participated in > the meeting) > o Voltaire: ConnectX stable release > o IBM - IPOIB CM without SRQ > o Qlogic: Package convenient for distros; ConnectX stable > o iWARP: Chelsio: Get to GA level and NFSoRDMA integration. 
> NetEffect: Get the drivers into OFED > o Mellanox: ConnectX stable release; new package; QoS > Intel: uDAPL 2.0 with IB extensions, installation/packaging, rdma_cm counters, performance manager From mst at dev.mellanox.co.il Tue Jul 3 10:23:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 20:23:12 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A8119.5070104@ichips.intel.com> References: <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <20070703103627.GB12153@mellanox.co.il> <468A8119.5070104@ichips.intel.com> Message-ID: <20070703172312.GE22937@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > >That it's a general IB problem and should be addressed at IB level. > >Which it seems to be - with CM. > > I understand the simplicity that using LAP for an out-of-band keep alive > message can give you, but that's not the intent of the message. I guess so - but even if the responder happens to do a modify QP as a result, and erroneously responds with APR, that's not too bad. > (You > could also use REQ/REJ or SIDR REQ/SIDR REP messages for this carrying > the right private data...) Hmm, I don't see how REQ gives you data on existing connection. Further, this would need a spec extension to define private data format then? LAP trick works out of the box ... > If we don't want to require apps to send in-band keep alive messages, > then I think we should explore all potential out-of-band solutions. I actually think a single working solution is enough. No need to explore all of them :). > For > example, event registration could be used to detect that a remote node > has gone down. 
> We could use per node keep alive messages, rather than > per connection messages. No, these won't address cases such as DREQ timeout after remote decides to close connection, without reboot. > We could add a new out-of-band keep alive > Or clearly define that LAP is the preferred way of for all > connections to do keep alives. Sure, someone might need to talk at IBTA about these clarifications. -- MST From sean.hefty at intel.com Tue Jul 3 10:29:22 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 3 Jul 2007 10:29:22 -0700 Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support In-Reply-To: <468A3607.8090008@voltaire.com> Message-ID: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com> >> +static void inform_event_handler(struct ib_event_handler *handler, >> + struct ib_event *event) >> +{ >> + struct inform_device *dev; >> + >> + dev = container_of(handler, struct inform_device, event_handler); >> + >> + switch (event->event) { >> + case IB_EVENT_PORT_ERR: >> + case IB_EVENT_LID_CHANGE: >> + case IB_EVENT_SM_CHANGE: >> + case IB_EVENT_CLIENT_REREGISTER: >> + inform_groups_lost(&dev->port[event->element.port_num - >> + dev->start_port]); > >I think you want to act here only if event->element.port_num is the port >this inform_device is associated with (similar to IPoIB), also the same >for mcast_event_handler. IPoIB registers its event handler per port, so requires the extra check. Both the multicast and inform info modules register their event handlers per device, so the check isn't necessary. 
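The difference between the two registration styles can be sketched as below. The structures and function names are hypothetical stand-ins, not the code from the patch; the range check in `port_for_event()` is a defensive addition for the sketch and is not something the patch contains.

```c
#include <stdbool.h>
#include <stddef.h>

struct port_ctx {
    int groups_lost;  /* placeholder for per-port state */
};

/* Hypothetical per-device context with one slot per port. */
struct dev_ctx {
    unsigned start_port;      /* first port number on this device */
    unsigned end_port;        /* last port number on this device */
    struct port_ctx port[2];  /* size fixed only for this sketch */
};

/*
 * Per-device handler: every event belongs to some port of this device,
 * so the handler indexes the port array instead of filtering.
 */
struct port_ctx *port_for_event(struct dev_ctx *dev, unsigned port_num)
{
    if (port_num < dev->start_port || port_num > dev->end_port)
        return NULL;
    return &dev->port[port_num - dev->start_port];
}

/*
 * Per-port handler: it sees events for all ports of the device, so it
 * must ignore those that are not for its own port.
 */
bool per_port_should_handle(unsigned my_port, unsigned event_port)
{
    return my_port == event_port;
}
```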
- Sean From bob.kossey at hp.com Tue Jul 3 10:29:19 2007 From: bob.kossey at hp.com (Bob Kossey) Date: Tue, 03 Jul 2007 13:29:19 -0400 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans Message-ID: <468A876F.2030208@hp.com> Tziporet Koren wrote: >/ />/ * Most important features (from representative who participated in />/ the meeting) />/ o Voltaire: ConnectX stable release />/ o IBM - IPOIB CM without SRQ />/ o Qlogic: Package convenient for distros; ConnectX stable />/ o iWARP: Chelsio: Get to GA level and NFSoRDMA integration. />/ NetEffect: Get the drivers into OFED />/ o Mellanox: ConnectX stable release; new package; QoS /> HP: Full ConnectX support, installation/packaging changes, Perfmon independent of OpenSM. Bob From halr at voltaire.com Tue Jul 3 10:33:19 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jul 2007 13:33:19 -0400 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <468A876F.2030208@hp.com> References: <468A876F.2030208@hp.com> Message-ID: <1183483998.4377.254391.camel@hal.voltaire.com> On Tue, 2007-07-03 at 13:29, Bob Kossey wrote: > Tziporet Koren wrote: > > >/ > />/ * Most important features (from representative who participated in > />/ the meeting) > />/ o Voltaire: ConnectX stable release > />/ o IBM - IPOIB CM without SRQ > />/ o Qlogic: Package convenient for distros; ConnectX stable > />/ o iWARP: Chelsio: Get to GA level and NFSoRDMA integration. > />/ NetEffect: Get the drivers into OFED > />/ o Mellanox: ConnectX stable release; new package; QoS > /> > > HP: Full ConnectX support, installation/packaging changes, > Perfmon independent of OpenSM. What do you mean by "Perfmon independent of OpenSM" ? 
-- Hal > Bob > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bob.kossey at hp.com Tue Jul 3 10:53:39 2007 From: bob.kossey at hp.com (Bob Kossey) Date: Tue, 03 Jul 2007 13:53:39 -0400 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <1183483998.4377.254391.camel@hal.voltaire.com> References: <468A876F.2030208@hp.com> <1183483998.4377.254391.camel@hal.voltaire.com> Message-ID: <468A8D23.2020007@hp.com> Hal Rosenstock wrote: > >> >> HP: Full ConnectX support, installation/packaging changes, >> Perfmon independent of OpenSM. >> > > What do you mean by "Perfmon independent of OpenSM" ? > > -- Hal > > I interpreted from the exchange below that there is a dependency between PerfMgr and OpenSM. If that is not accurate, or if it will be eliminated, great. On Thu, 2007-06-28 at 03:24, Eitan Zahavi wrote: >/ > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: />/ > > In the last months it is the second time I hear people />/ > complaining the />/ > > current monitoring solution in OFA is integrated with OpenSM. />/ > />/ > I must have missed this both times (didn't see this in Mark's />/ > post) and the statement itself is somewhat inaccurate as well. / >/ Private talks - I hope they will speak up for themselves now... / Please encourage them to do so. >/ > > These people do not use OpenSM but do use OFED. />/ > />/ > I'm not sure I'm following what you mean here. />/ > />/ > If you mean that some people want to run PerfMgr without the />/ > SM/SA aspects (so that they can run a vendor based SM), that />/ > is the next thing we are adding to the implementation. />/ Exactly. OK when is that coming? / Should be part of OFED 1.3. 
From norman.woo at oracle.com Tue Jul 3 10:55:41 2007 From: norman.woo at oracle.com (Norman Woo) Date: Tue, 03 Jul 2007 10:55:41 -0700 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Message-ID: <468A8D9D.2040709@oracle.com> The proposed feature list for OFED 1.3 sent out on 6/26/2007 included the Asynch I/O for SDP, is this feature now being drop for the 1.3 release? Oracle has spent considerable effort to support SDP in our products and Oracle proposes that Asynch I/O for SDP be included in the OFED 1.3. What is required to include this feature for OFED 1.3? Regards, Norman Tziporet Koren wrote: > > Meeting minutes are available also on OFA Wiki: > _https://wiki.openfabrics.org/tiki-index.php?page=Teleconf+07-02-2007_ > > *Abbreviated minutes / summary* > > * OFED 1.2.1 support release - we plan a support release on > beginning of August. > * OFED 1.3 - decided that its most important to close the schedule > and focus on most important features based on this schedule > o Based on discussion in the meeting it seems that the best > target for OFED 1.3 is November 07 > > * Most important features (from representative who participated in > the meeting) > o Voltaire: ConnectX stable release > o IBM - IPOIB CM without SRQ > o Qlogic: Package convenient for distros; ConnectX stable > o iWARP: Chelsio: Get to GA level and NFSoRDMA integration. > NetEffect: Get the drivers into OFED > o Mellanox: ConnectX stable release; new package; QoS > > *Action Items:* > > 1. Other EWG members (Cisco, Intel, Labs) - send most important > features for 1.3 > 2. Tziporet - set a meeting with Redhat & Novell to close the new > package definition > 3. Tziporet - publish OFED 1.3 schedule > 4. 
MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 - > is there any specific requests toward SC07 > > *Detailed Minutes* > > * OFED 1.2.1 release: > o Companies that are mostly interested in such release (e.g. > IBM, Chelsio) will do most of testing for their HW. > o Not all companies are committed to QA this release, so in > the release notes we will mention this limitation. > o There are weekly builds of OFED 1.2 branch. Any other > build should be requested from Vlad. > > * OFED 1.2.c: > o All agree its important to have this code stream, and why > it cannot be the same as 1.2, and that we cannot wait for > 1.3. > o There are companies that are currently using this code > stream and this will prevent them to participate in QA of > 1.2.1 > > * OFED 1.3: > o There was a discussion if we wish to have the release on > November 07 or January 08 (all agreed that December is not > a good month) > o Decision was to reduce features and have a release this > year => November > o There were no participants from the labs or MPI thus we > lack information on important features that should be > ready for SC07 > > > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: _tziporet at mellanox.co.il_ > Tel +972-4-9097200, ext 380 From mshefty at ichips.intel.com Tue Jul 3 11:14:23 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jul 2007 11:14:23 -0700 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703172312.GE22937@mellanox.co.il> References: <20070702195314.GA31169@mellanox.co.il> <46895A18.2000100@ichips.intel.com> <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com>
<20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <20070703103627.GB12153@mellanox.co.il> <468A8119.5070104@ichips.intel.com> <20070703172312.GE22937@mellanox.co.il> Message-ID: <468A91FF.3040804@ichips.intel.com> > Hmm, I don't see how REQ gives you data on existing connection. Further, > this would need a spec extension to define private data format then? > LAP trick works out of the box ... LAP keep-alives requires the apps to implement the keep alive timers and detection, but sends the messages out-of-band. Why not send the messages in-band? Would it make more sense to implement the entire keep-alive solution in the CM? > I actually think a single working solution is enough. > No need to explore all of them :). I'm not saying implement all of them, just make sure that we have the best solution. I can't think of one that I like better than using LAP, but it feels like the CM protocol / MADs are being hijacked. For example, if there's only one path between two nodes, LAP doesn't really make any sense, but it ends up being used. Should we instead look at adding new CM messages for just this purpose? >> For >> example, event registration could be used to detect that a remote node >> has gone down. >> We could use per node keep alive messages, rather than >> per connection messages. > > No, these won't address cases such as DREQ timeout after remote > decides to close connection, without reboot. Per node keep alive messages could. It depends on what data is carried in the message (e.g. all currently connected QPs to the node in question). I mentioned this because it may be more efficient under some circumstances. 
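As a rough illustration of the per-node idea: if a keep-alive message from a node listed every QPN it currently considers connected, the receiver could flag any local connection whose remote QPN is missing from that list. The sketch below assumes exactly that; the message layout and the names `qpn_listed` and `find_stale_conns` are hypothetical, and no such message is defined by the CM protocol.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if qpn appears in the list carried by the keep-alive. */
static bool qpn_listed(const uint32_t *listed, size_t nlisted, uint32_t qpn)
{
    for (size_t i = 0; i < nlisted; i++)
        if (listed[i] == qpn)
            return true;
    return false;
}

/*
 * local_remote_qpn[i] is the remote QPN that local connection i believes
 * it is connected to. listed[] is the set of QPNs the remote node reports
 * as connected. stale[i] is set to true for every local connection the
 * remote no longer lists. Returns the number of stale connections.
 */
size_t find_stale_conns(const uint32_t *local_remote_qpn, size_t nlocal,
                        const uint32_t *listed, size_t nlisted,
                        bool *stale)
{
    size_t count = 0;

    for (size_t i = 0; i < nlocal; i++) {
        stale[i] = !qpn_listed(listed, nlisted, local_remote_qpn[i]);
        if (stale[i])
            count++;
    }
    return count;
}
```

With this shape, a single message covers all connections to a node, which is where the efficiency gain over per-connection messages would come from. It also catches a connection the remote closed without rebooting, since that QPN drops out of the list.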
- Sean From tziporet at dev.mellanox.co.il Tue Jul 3 11:36:11 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 03 Jul 2007 21:36:11 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A1A9A.5060208@voltaire.com> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> <20070703060049.GF1147@mellanox.co.il> <468A1A9A.5060208@voltaire.com> Message-ID: <468A971B.5090507@mellanox.co.il> Or Gerlitz wrote: > Michael S. Tsirkin wrote: >>> we should move to UC >> >> For HW that supports UC with SRQ, yes. > > Dror did not mention the HW, my understanding is that this aspect is > fine... now, assuming the need for liveness protocol is behind us, and > if not, it can be implemented as you suggested. the problem is > narrowed to have the FW support SRQ/UC. We still don't have a solid plan for this in the FW Tziporet From mst at dev.mellanox.co.il Tue Jul 3 11:37:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 21:37:03 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A91FF.3040804@ichips.intel.com> References: <20070703061029.GG1147@mellanox.co.il> <468A0F27.3020909@voltaire.com> <20070703091639.GJ1147@mellanox.co.il> <468A19E9.2090707@voltaire.com> <20070703094703.GA12153@mellanox.co.il> <468A1D24.6060903@voltaire.com> <20070703103627.GB12153@mellanox.co.il> <468A8119.5070104@ichips.intel.com> <20070703172312.GE22937@mellanox.co.il> <468A91FF.3040804@ichips.intel.com> Message-ID: <20070703183703.GG22937@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] Re: Re: IPoIB-CM UC mode > > >Hmm, I don't see how REQ gives you data on existing connection. Further, > >this would need a spec extension to define private data format then? > >LAP trick works out of the box ... 
> > LAP keep-alives requires the apps to implement the keep alive timers and > detection, but sends the messages out-of-band. Why not send the > messages in-band? Sure, this can be done. But that'd need ULP support, in this case IPoIB protocol extension. Further, if remote is up, it's nice to get a CM message saying "connection was lost" directly rather than just a timeout. What real advantages are there for doing this "in-band" as you say? > Would it make more sense to implement the entire > keep-alive solution in the CM? I think it doesn't matter much. Let's keep it where it's needed: if more UC applications surface, we can rethink this decision, and factor the code out. > >I actually think a single working solution is enough. > >No need to explore all of them :). > > I'm not saying implement all of them, just make sure that we have the > best solution. I can't think of one that I like better than using LAP, > but it feels like the CM protocol / MADs are being hijacked. For > example, if there's only one path between two nodes, LAP doesn't really > make any sense, but it ends up being used. Should we instead look at > adding new CM messages for just this purpose? Sure, I agree, this would be nice. But I expect this will take a while to get the standartization rolling. So I think we'll start with the LAP hack and add support for the new CM message when/if it's there. > >>For > >>example, event registration could be used to detect that a remote node > >>has gone down. > >>We could use per node keep alive messages, rather than > >>per connection messages. > > > >No, these won't address cases such as DREQ timeout after remote > >decides to close connection, without reboot. > > Per node keep alive messages could. It depends on what data is carried > in the message (e.g. all currently connected QPs to the node in > question). I mentioned this because it may be more efficient under some > circumstances. Yes. And with multiple connections per node, all the more so. 
The CM message format does not seem like a good fit for this, though: maybe some new kind of MAD? -- MST From halr at voltaire.com Tue Jul 3 11:59:01 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Jul 2007 14:59:01 -0400 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <468A8D23.2020007@hp.com> References: <468A876F.2030208@hp.com> <1183483998.4377.254391.camel@hal.voltaire.com> <468A8D23.2020007@hp.com> Message-ID: <1183489140.4377.260402.camel@hal.voltaire.com> On Tue, 2007-07-03 at 13:53, Bob Kossey wrote: > Hal Rosenstock wrote: > > > >> > >> HP: Full ConnectX support, installation/packaging changes, > >> Perfmon independent of OpenSM. > >> > > > > What do you mean by "Perfmon independent of OpenSM" ? > > > > -- Hal > > > > > I interpreted from the exchange below that there is a dependency between > PerfMgr and OpenSM. If that is not accurate, or if it will be > eliminated, great. PerfMgr will support the ability to run without the SM/SA function in OpenSM but with a "third party" SM (meaning any standard (vendor) SM which is IBA compliant although this will need testing to confirm that aspect by those vendors/parties interested in this). PerfMgr will, however, be part of the OpenSM package. I hope this clarifies the current intent. -- Hal > On Thu, 2007-06-28 at 03:24, Eitan Zahavi wrote: > >/ > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > />/ > > In the last months it is the second time I hear people > />/ > complaining the > />/ > > current monitoring solution in OFA is integrated with OpenSM. > />/ > > />/ > I must have missed this both times (didn't see this in Mark's > />/ > post) and the statement itself is somewhat inaccurate as well. > / > >/ Private talks - I hope they will speak up for themselves now... > / > Please encourage them to do so. > > >/ > > These people do not use OpenSM but do use OFED. > />/ > > />/ > I'm not sure I'm following what you mean here. 
> />/ > > />/ > If you mean that some people want to run PerfMgr without the > />/ > SM/SA aspects (so that they can run a vendor based SM), that > />/ > is the next thing we are adding to the implementation. > />/ Exactly. OK when is that coming? > / > Should be part of OFED 1.3. From rdreier at cisco.com Tue Jul 3 12:05:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 03 Jul 2007 12:05:52 -0700 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> (Tziporet Koren's message of "Tue, 3 Jul 2007 18:03:50 +0300") References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Message-ID: > NFSoRDMA integration. I would like to see a status report on NFS/RDMA from the people who want it in OFED. As I understand it there are many core kernel changes required for this -- switchable transports and also mount option changes? As far as I can tell from the outside, the NFS/RDMA effort seems to have stalled -- whenever I talk to core NFS developers like Chuck Lever or Trond Myklebust, they say that they are just waiting for the NFS/RDMA developers to submit their changes for review. And I haven't seen any patches for a kernel newer that 2.6.18, so things look quite out-of-date. Without visible progress towards getting NFS/RDMA into mergeable form soon, I think putting it into OFED 1.3 as anything other than a technology preview that may be dropped from future releases would be a very risky think to do. Otherwise OFED risks getting stuck maintaining the whole NFS/RDMA stack, since the development effort outside of OFED really looks to me like it is fizzling out. - R. 
From tziporet at dev.mellanox.co.il Tue Jul 3 12:20:00 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 03 Jul 2007 22:20:00 +0300 Subject: [ewg] Re: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <468A8D9D.2040709@oracle.com> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> <468A8D9D.2040709@oracle.com> Message-ID: <468AA160.9020609@mellanox.co.il> Norman Woo wrote: > The proposed feature list for OFED 1.3 sent out on 6/26/2007 included > the Asynch I/O for SDP, is this feature now being drop for the 1.3 > release? Oracle has spent considerable effort to support SDP in our > products and Oracle proposes that Asynch I/O for SDP be included in > the OFED 1.3. What is required to include this feature for OFED 1.3? > SDP AIO is still on the list. Its just that non of you participated in the meeting yesterday and I only gathered the input from people that were on the meeting. I will publish the full features list once all companies will return with their input Tziporet From sean.hefty at intel.com Tue Jul 3 12:29:54 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 3 Jul 2007 12:29:54 -0700 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703183703.GG22937@mellanox.co.il> Message-ID: <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com> >What real advantages are there for doing this "in-band" as you say? Doing this in-band keeps the entire keep-alive protocol within the ULP. It can set the keep-alive message size and retry times. LAP messages are fixed at 256 bytes, add additional traffic on QP 1, and retries are limited by the CM protocol. (Of course, new CM messages would have these same limits, so it's not clear to me that creating new CM messages are a win. New CM messages would allow the CM itself to respond directly to keep-alives though.) 
A couple disadvantages are that broken connections take longer to detect if the remote node is able to respond to the LAP, and the connection must be able to send and receive. (The latter calls for a general solution being out-of-band.) >Sure, I agree, this would be nice. But I expect this will take a while >to get the standartization rolling. So I think we'll start with the LAP hack >and add support for the new CM message when/if it's there. Okay - is there any real drawback to using LAP other than it 'feels' like a mis-use of the CM protocol? - Sean From mst at dev.mellanox.co.il Tue Jul 3 12:49:42 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 3 Jul 2007 22:49:42 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com> References: <20070703183703.GG22937@mellanox.co.il> <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com> Message-ID: <20070703194942.GI22937@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] Re: Re: IPoIB-CM UC mode > > >What real advantages are there for doing this "in-band" as you say? > > Doing this in-band keeps the entire keep-alive protocol within the ULP. It can > set the keep-alive message size and retry times. > LAP messages are fixed at 256 > bytes, add additional traffic on QP 1, and retries are limited by the CM > protocol. BTW, I think we might want to avoid retries altogether: if LAP timed out, we can just re-create the connection. > (Of course, new CM messages would have these same limits, so it's not > clear to me that creating new CM messages are a win. New CM messages would > allow the CM itself to respond directly to keep-alives though.) OTOH, using QP1 makes it easier to separate rare keepalives from fast-path data packet receive path. 
-- MST From swise at opengridcomputing.com Tue Jul 3 13:35:13 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 03 Jul 2007 15:35:13 -0500 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Message-ID: <468AB301.4020807@opengridcomputing.com> Tom, can you update us on NFS-RDMA? Roland Dreier wrote: > > NFSoRDMA integration. > > I would like to see a status report on NFS/RDMA from the people who > want it in OFED. As I understand it there are many core kernel > changes required for this -- switchable transports and also mount > option changes? > > As far as I can tell from the outside, the NFS/RDMA effort seems to > have stalled -- whenever I talk to core NFS developers like Chuck > Lever or Trond Myklebust, they say that they are just waiting for the > NFS/RDMA developers to submit their changes for review. And I haven't > seen any patches for a kernel newer that 2.6.18, so things look quite > out-of-date. > > Without visible progress towards getting NFS/RDMA into mergeable form > soon, I think putting it into OFED 1.3 as anything other than a > technology preview that may be dropped from future releases would be a > very risky think to do. Otherwise OFED risks getting stuck > maintaining the whole NFS/RDMA stack, since the development effort > outside of OFED really looks to me like it is fizzling out. > > - R. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Tue Jul 3 13:54:29 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 03 Jul 2007 15:54:29 -0500 Subject: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <468AB301.4020807@opengridcomputing.com> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> <468AB301.4020807@opengridcomputing.com> Message-ID: <1183496069.5757.33.camel@trinity.ogc.int> Roland: On Tue, 2007-07-03 at 15:35 -0500, Steve Wise wrote: > Tom, can you update us on NFS-RDMA? > > > Roland Dreier wrote: > > > NFSoRDMA integration. > > > > I would like to see a status report on NFS/RDMA from the people who > > want it in OFED. As I understand it there are many core kernel > > changes required for this -- switchable transports and also mount > > option changes? You are correct about the scope of the changes, although many of them are already in the kernel. Chuck Lever just posted the mount changes and I have posted a second round of the NFS-RDMA patches. You can see these on nfs at lists.sourceforge.net. I would like to get them upstream in 2.6.23, but that's probably optimistic. > > > > As far as I can tell from the outside, the NFS/RDMA effort seems to > > have stalled -- whenever I talk to core NFS developers like Chuck > > Lever or Trond Myklebust, they say that they are just waiting for the > > NFS/RDMA developers to submit their changes for review. And I haven't > > seen any patches for a kernel newer that 2.6.18, so things look quite > > out-of-date. I'm not sure when you talked to those guys, but as I mentioned, this is round-two of the patch submission. There is also a git tree that has these submitted patches available for download and testing. 
These are on a 2.6.22-rc6 base and the git URL is git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git If you like, I can post the patchset here as well. > > > > Without visible progress towards getting NFS/RDMA into mergeable form > > soon, I think putting it into OFED 1.3 as anything other than a > > technology preview that may be dropped from future releases would be a > > very risky think to do. Otherwise OFED risks getting stuck > > maintaining the whole NFS/RDMA stack, since the development effort > > outside of OFED really looks to me like it is fizzling out. > > Perhaps the activity is not where you're used to looking. Both Trond and Neal reviewed the previous patchset and provided feedback that I addressed in the most recent patchset. That said, I'm sure there will be quite a bit more before it's mergeable. > > - R. > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at dev.mellanox.co.il Tue Jul 3 15:09:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 4 Jul 2007 01:09:03 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il> Message-ID: <20070703220903.GJ22937@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH RFC] sharing userspace IB objects > > > Could you please clarify how do you envision this done? > > Do we just create our own filesystem? > > > > Reason I ask, we'll need something like this for SRC domain too ... > > I don't have a really clear idea. "Look at spufs" is about as far as > I got. 
So, I guess we could create our own filesystem and then let processes create files there to represent the src domain. But, how to pass this file to the create_qp verb in kernel? It needs to be done on the common context fd since we are using a regular CQ, a UAR etc from the context. -- MST From elsen_david at hotmail.com Tue Jul 3 15:42:35 2007 From: elsen_david at hotmail.com (david elsen) Date: Tue, 03 Jul 2007 15:42:35 -0700 Subject: [ofa-general] Open Fabrics iWARP Driver for Chesio T3 card In-Reply-To: <4683BDDC.5010309@opengridcomputing.com> Message-ID: Thanks a lot for the good information. >From: Steve Wise >To: SEGERS Koen >CC: david elsen , general at lists.openfabrics.org >Subject: Re: [ofa-general] Open Fabrics iWARP Driver for Chesio T3 card >Date: Thu, 28 Jun 2007 08:55:40 -0500 > >SEGERS Koen wrote: >>What is the benefit of using the iWARP driver? Do you offload the traffic >>comming from the cluster directly to the chelsio card (RDMA directly to >>Chelsio)? >> > >iWARP is a suite of standard protocols that implement RDMA over a TCP or >SCTP connection. The devices that support iWARP usually implement all of >these protocols (including TCP/IP/ethernet) in hardware. The device >drivers for these devices plug into the Linux/OFA RDMA core and support the >Linux/OFA RDMA verbs which are mostly common between both IB and iWARP. > >So think of it as an RDMA transport that uses standard Ethernet and IP >technology. There is no wire-level interoperability between IB and iWARP: >They are different L1-L4 protocol stacks below the RDMA API. But _above_ >the RDMA API, you can have a single application use the Linux RDMA Verbs >interface and deploy that same application over both IB networks and IW >networks. > >Application/Middle-ware examples include MPI, iSCSI/iSER, and NFS-RDMA. > >>Would it be beneficial to have the iWARP driver installed on nodes that >>communicate with clients over IP and with other servers (of its cluster) >>over IB? 
We are now using SDP as an intercluster protocol, but in the >>future we are probably going to VERBS for it. >> > >I'm not sure how you would utilize it in your setup. But I don't >understand your cluster architecture to say for sure whether it might help >you or not. > >You might contact the iWARP providers directly to help understand if their >solutions can help you. Also, there are other technologies that these >devices typically support that might be helpful for you. > >>Can we read the documentation on a website somewhere? >> > >The iWARP Protocols are IETF IDs and RFCs that can be found at > >http://www.ietf.org/html.charters/rddp-charter.html > >There is other information on RDMA over TCP/IP at > >http://www.rdmaconsortium.org/home > >Hope this helps. > >Steve. > From mshefty at ichips.intel.com Tue Jul 3 15:45:47 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 03 Jul 2007 15:45:47 -0700 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <20070703194942.GI22937@mellanox.co.il> References: <20070703183703.GG22937@mellanox.co.il> <000101c7bda8$88e91300$3c98070a@amr.corp.intel.com> <20070703194942.GI22937@mellanox.co.il> Message-ID: <468AD19B.50004@ichips.intel.com> >>> What real advantages are there for doing this "in-band" as you say? >> Doing this in-band keeps the entire keep-alive protocol within the ULP. It can >> set the keep-alive message size and retry times. >> LAP messages are fixed at 256 >> bytes, add additional traffic on QP 1, and retries are limited by the CM >> protocol. > > BTW, I think we might want to avoid retries altogether: if LAP > timed out, we can just re-create the connection. The CM currently retries LAP messages based on the value of the REQ max CM retries, but I don't see why this couldn't change.
>> (Of course, new CM messages would have these same limits, so it's not >> clear to me that creating new CM messages are a win. New CM messages would >> allow the CM itself to respond directly to keep-alives though.) > > OTOH, using QP1 makes it easier to separate rare keepalives > from fast-path data packet receive path. I was thinking more along the lines of whether to use the CM LAP message or create a new CM message for handling keep-alive. The best argument I can come up with for creating a new message is that it 'seems' cleaner... Anyway, I agree that using LAP would be the best approach for now. - Sean From jsquyres at cisco.com Tue Jul 3 21:41:37 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 4 Jul 2007 06:41:37 +0200 Subject: [ofa-general] Re: [ewg] OFED July 2, meeting summary on next OFED plans In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> Message-ID: On Jul 3, 2007, at 5:03 PM, Tziporet Koren wrote: > MPI people (DK, Jeff, Labs) - update your plans for OFED 1.3 - is > there any specific requests toward SC07 The OMPI community is working on its plan for our next release (the OMPI v1.3 series). We're roughly targeting SC/year-end, but an exact timetable has not yet been set. I think that we'll probably do "the usual" -- take the latest stable drop of OMPI as we approach OFED v1.3. -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Tue Jul 3 21:44:20 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 4 Jul 2007 06:44:20 +0200 Subject: [ofa-general] Feedback on mpi-selector / mpi-selector-menu Message-ID: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com> Just curious: does anyone have any feedback on the mpi-selector-menu / mpi-selector app that is included in OFED v1.2? I showed it to a few users who were very happy with it, but then again, I'm somewhat biased. :-) Do HP MPI / Intel MPI plan to integrate with these tools?
I am pretty sure that I sent instructions to both groups, but if I didn't, or if those instructions got lost, let me know and I can re-send (or you can just read the man pages). Any feedback from out in the wild would be appreciated. Thanks. -- Jeff Squyres Cisco Systems From ogerlitz at voltaire.com Tue Jul 3 22:54:35 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Jul 2007 08:54:35 +0300 Subject: [ofa-general] Re: Re: IPoIB-CM UC mode In-Reply-To: <468A971B.5090507@mellanox.co.il> References: <20070702145328.GC17858@mellanox.co.il> <15ddcffd0707021222r12fa4490w7d8862ac5bdf43a2@mail.gmail.com> <20070702195314.GA31169@mellanox.co.il> <15ddcffd0707021313w6700f9f3n541137ecb345904e@mail.gmail.com> <20070703060049.GF1147@mellanox.co.il> <468A1A9A.5060208@voltaire.com> <468A971B.5090507@mellanox.co.il> Message-ID: <468B361B.9090808@voltaire.com> Tziporet Koren wrote: > Or Gerlitz wrote: >> Dror did not mention the HW, my understanding is that this aspect is >> fine... now, assuming the need for liveness protocol is behind us, and >> if not, it can be implemented as you suggested. the problem is >> narrowed to have the FW support SRQ/UC. > We still don't have a solid plan for this in the FW The current implementation of IPoIB-CM uses RC, where Michael has admitted that the RC ACKs are actually used as keep alives, no more. -> under high packet rate there are some ten (hundred?!) --thousands-- keep alive messages/second. This is --very-- poor architecture, the code must move to UC. The only actual barrier here is FW support for UC/SRQ, the IBTA signature can come later (and is on its way). Or. 
From ogerlitz at voltaire.com Tue Jul 3 23:22:03 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Jul 2007 09:22:03 +0300 Subject: [ofa-general] Re: [PATCH] ib/cm: fix handling of duplicate SIDR REQs In-Reply-To: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com> References: <000801c7bcfa$f4112a40$3c98070a@amr.corp.intel.com> Message-ID: <468B3C8B.7040409@voltaire.com> Sean Hefty wrote: > Fix handling to duplicate SIDR REQs to avoid sending a reject if > one is detected. Duplicates should simply be discarded. > > Signed-off-by: Sean Hefty Sean, I have applied the patches on top of OFED 1.2 and tested with both udaddy and the UD app; it works fine, thanks. So will you push this to 2.6.23? Or. From ogerlitz at voltaire.com Tue Jul 3 23:32:37 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Jul 2007 09:32:37 +0300 Subject: [ewg] Re: [ofa-general] OFED July 2, meeting summary on next OFED plans In-Reply-To: <1183496069.5757.33.camel@trinity.ogc.int> References: <6C2C79E72C305246B504CBA17B5500C901563827@mtlexch01.mtl.com> <468AB301.4020807@opengridcomputing.com> <1183496069.5757.33.camel@trinity.ogc.int> Message-ID: <468B3F05.4050000@voltaire.com> Tom Tucker wrote: > Perhaps the activity is not where you're used to looking. Both Trond and > Neal reviewed the previous patchset and provided feedback that I > addressed in the most recent patchset. That said, I'm sure there will be > quite a bit more before it's mergeable. Indeed, any other IB related kernel ULP that was submitted upstream was sent for review to the "openib" (open-fabrics general) list AND another mailing list (e.g. netdev, linux-scsi) AND lkml. As was commented here in the past, these ULPs typically involve two disciplines, in this case NFS and the RDMA stack. You are expected to ask --both-- communities to review the code before sending it to everyone (lkml) for another review, and only then merge it. Or.
From ogerlitz at voltaire.com Wed Jul 4 00:14:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 04 Jul 2007 10:14:07 +0300 Subject: [ofa-general] [PATCH 1/2] ib/sa: Add InformInfo/Notice support In-Reply-To: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com> References: <000001c7bd97$b216d2a0$3c98070a@amr.corp.intel.com> Message-ID: <468B48BF.40606@voltaire.com> Sean Hefty wrote: > IPoIB registers its event handler per port, so requires the extra check. Both > the multicast and inform info modules register their event handlers per device, > so the check isn't necessary. Got it, I was not fully understand the code from first read. Or. From vlad at lists.openfabrics.org Wed Jul 4 02:45:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 4 Jul 2007 02:45:32 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070704-0200 daily build status Message-ID: <20070704094533.40016E60830@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on 
ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From glebn at voltaire.com Wed Jul 4 05:11:16 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Wed, 4 Jul 2007 15:11:16 +0300 Subject: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070701121623.GD17699@minantech.com> <6C2C79E72C305246B504CBA17B5500C901CAE1B9@mtlexch01.mtl.com> <20070701190516.GB31673@minantech.com> <1183374715.4377.127455.camel@hal.voltaire.com> <4688F671.40408@dev.mellanox.co.il> <1183382948.4377.136789.camel@hal.voltaire.com> Message-ID: <20070704121116.GZ17699@minantech.com> On Mon, Jul 02, 2007 at 09:27:19AM -0700, Roland Dreier 
wrote: > > > > Correct. The number of messages in flight per EEC is 1 per IB spec. > > > The fact that IB requires SQ WQEs to complete in order, even if their > > > destination is different EECs, > > > > Where's this requirement in the spec (and could this be relaxed as it > > seems like it is overly "specified") ? Just wondering... > > I don't think we want to relax the requirement that work requests > complete in order. It's hard enough to get applications correct > without having to worry about out-of-order completions, and I think > specifying all the corner cases would be a nightmare. Eg do we allow > successful completions after a completion with error? and so on... I don't think it will be a problem (for MPI at least) if work requests to different destinations complete out of order. What does the spec say about completion with error? Should the RD QP move to the error state? -- Gleb. From hnguyen at linux.vnet.ibm.com Wed Jul 4 07:11:29 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 4 Jul 2007 16:11:29 +0200 Subject: [ofa-general] Re: idr_get_new_above() limitation? In-Reply-To: <1183422700.3130.27.camel@localhost.localdomain> References: <200707021919.27251.hnguyen@linux.vnet.ibm.com> <1183422700.3130.27.camel@localhost.localdomain> Message-ID: <200707041611.30056.hnguyen@linux.vnet.ibm.com> On Tuesday 03 July 2007 02:31, Jim Houston wrote: > The problem is in idr_get_new_above_int() in the loop which > adds new layers to the top of the radix tree. It is failing > the "layers < (MAX_LEVEL - 1)" test. It doesn't allocate the > new layer but still calls sub_alloc() which relies on having > the new layer properly constructed. I believe that it is > allocating the slot which corresponds to id = 0. Hi Jim, Thanks for your quick reply. Yes, I noticed that while-condition too and tried a tiny change, (layers < MAX_LEVEL), but idr_find() still failed, even though 6 layers were created and the object was added at the proper location.
After several debug cycles I think I have found the root cause: the if-condition in idr_find():

void *idr_find(struct idr *idp, int id)
{
	int n;
	struct idr_layer *p;

	n = idp->layers * IDR_BITS;
	p = idp->top;

	/* Mask off upper bits we don't use for the search. */
	id &= MAX_ID_MASK;

	if (id >= (1 << n))
		return NULL;
	...
}

Since idp->layers is now 6, n equals 36, i.e. beyond the 32-bit range, and therefore (1 << n) = (1 << 36) = 0, which makes that if-condition true, i.e. idr_find() fails. Replacing that if-line with

	if ((long)id >= (1L << n))

makes idr_find() work properly up to MAX_ID_MASK. Since other places need the same change, e.g. idr_replace(), and because you're creating a patch too, I'll wait for your comment first. Let me know if you prefer me to send a patch.

Regards
Nam

From dledford at redhat.com Wed Jul 4 09:28:04 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 04 Jul 2007 12:28:04 -0400 Subject: [ewg] RE: [ofa-general] Toward next OFED release (1.3) In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611929146A@EPEXCH2.qlogic.org> References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <4FB1BCCAE6CAED44A1DC005B1DE06119291460@EPEXCH2.qlogic.org> <4FB1BCCAE6CAED44A1DC005B1DE0611929146A@EPEXCH2.qlogic.org> Message-ID: <1183566484.16081.123.camel@firewall.xsintricity.com> On Tue, 2007-06-26 at 14:46 -0500, Lakshmanan, Madhu wrote: > > From: Roland Dreier [mailto:rdreier at cisco.com] > > Subject: Re: [ewg] RE: [ofa-general] Toward next OFED release (1.3) > > > > > VNIC: > > > - GA quality. Not a technology preview version anymore. > > > - Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet > > > gateway) - in GA > > > > I hope there will be some attempt to get these drivers merged upstream > too. > > > > - R. > > Agreed in principle. I would suggest you should agree in practice.
I couldn't care less about principle, and I'm heavily leaning towards yanking any drivers/ulps that don't get merged upstream from our future updates. > We hope to address that issue soon. > > Madhu > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Wed Jul 4 09:28:38 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 04 Jul 2007 12:28:38 -0400 Subject: [ewg] RE: [ofa-general] Toward next OFED release (1.3) In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <4FB1BCCAE6CAED44A1DC005B1DE06119291460@EPEXCH2.qlogic.org> Message-ID: <1183566518.16081.124.camel@firewall.xsintricity.com> On Tue, 2007-06-26 at 12:49 -0700, Scott Weitzenkamp (sweitzen) wrote: > > I hope there will be some attempt to get these drivers merged > > upstream too. > > How about SDP, are we ready to try to merge it upstream? I hope so. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Wed Jul 4 09:34:47 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 04 Jul 2007 12:34:47 -0400 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release In-Reply-To: <1183124231.28870.268894.camel@hal.voltaire.com> References: <1183124231.28870.268894.camel@hal.voltaire.com> Message-ID: <1183566887.16081.126.camel@firewall.xsintricity.com> On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote: > There is a new release of the management libraries which include the > ANSIfied header files available in: > > http://www.openfabrics.org/~halr/ > > md5sum > a5b884775ed069da09ca0b60bfda3239 libibcommon-1.0.4.tar.gz > 288b865a0015ac3251cffa011a7633eb libibumad-1.0.6.tar.gz > 04a5b6dcd2ee930f44d5715ee013f78b libibmad-1.0.6.tar.gz Hey Hal, I noticed you have release tarballs there for the libs, and one for the older named openib-diags. What would it take to get a release tarball for infiniband-diags and one for opensm? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL:
From kliteyn at dev.mellanox.co.il Thu Jul 5 00:43:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 05 Jul 2007 10:43:55 +0300 Subject: [ofa-general] [PATCH] osm: bug in dumping opensm.fdbs Message-ID: <468CA13B.2040900@dev.mellanox.co.il> Hi Hal, opensm.fdbs dump function adaptation to the recent changes in min hop tables broke fat-tree routing (or any other future routing that may not use the same min hop tables creation functions).
-- Yevgeny

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_ucast_mgr.c | 33 ++++++++++++++++++++++++---------
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 5bcb655..cab272e 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution(
 /**********************************************************************
  **********************************************************************/
+
 static void
 __osm_ucast_mgr_dump_ucast_routes(
   IN cl_map_item_t *p_map_item,
@@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes(
   uint8_t best_port;
   uint16_t max_lid_ho;
   uint16_t lid_ho, base_lid;
+  boolean_t direct_route_exists = FALSE;
   osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
   osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr;
   FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file;
@@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes(
     */
     if( p_port->p_node->sw )
     {
+      /* Target LID is switch.
+         Get its base lid and check hop count for this base LID only.*/
       base_lid = osm_node_get_base_lid(p_port->p_node, 0);
       base_lid = cl_ntoh16(base_lid);
       num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num );
     }
     else
     {
-      osm_physp_t *p_physp = p_port->p_physp;
-      if( !p_physp || !p_physp->p_remote_physp ||
-          !p_physp->p_remote_physp->p_node->sw )
-        num_hops = OSM_NO_PATH;
+      /* Target LID is not switch (CA or router).
+         Check if we have route to this target from current switch.*/
+      num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num );
+      if (num_hops != OSM_NO_PATH)
+      {
+        direct_route_exists = TRUE;
+        base_lid = lid_ho;
+      }
       else
       {
-        base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
-        base_lid = cl_ntoh16(base_lid);
-        num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
-                   0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
+        osm_physp_t *p_physp = p_port->p_physp;
+        if( !p_physp || !p_physp->p_remote_physp ||
+            !p_physp->p_remote_physp->p_node->sw )
+          num_hops = OSM_NO_PATH;
+        else
+        {
+          base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0);
+          base_lid = cl_ntoh16(base_lid);
+          num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ?
+                     0 : osm_switch_get_hop_count( p_sw, base_lid, port_num );
+        }
       }
     }
@@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes(
     }
     best_hops = osm_switch_get_least_hops( p_sw, base_lid );
-    if (!p_port->p_node->sw)
+    if (!p_port->p_node->sw && !direct_route_exists)
     {
       best_hops++;
       num_hops++;
--
1.5.1.4

From vlad at lists.openfabrics.org Thu Jul 5 02:44:48 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 5 Jul 2007 02:44:48 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070705-0200 daily build status Message-ID: <20070705094448.86E09E60843@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed
on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Failed: From swise at opengridcomputing.com Thu Jul 5 05:39:38 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 05 Jul 2007 07:39:38 -0500 Subject: [ofa-general] Feedback on mpi-selector / mpi-selector-menu In-Reply-To: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com> References: <7D3A0004-72BB-403E-B7D1-562AD2B184C4@cisco.com> Message-ID: <468CE68A.3050400@opengridcomputing.com> It works fine for me. I've used it specifically to add debug mvapich2 libs and easily switch between the debug and non-debug libs. Steve. 
Jeff Squyres wrote: > Just curious: does anyone have any feedback on the mpi-selector-menu / > mpi-selector app that is included in OFED v1.2? I showed it to a few > users who were very happy with it, but then again, I'm somewhat biased. > :-) > > Do HP MPI / Intel MPI plan to integrate with these tools? I am pretty > sure that I sent instructions to both groups, but if I didn't, or if > those instructions got lost, let me know and I can re-send (or you can > just read the man pages). > > Any feedback from out in the wild would be appreciated. Thanks. > From halr at voltaire.com Thu Jul 5 05:40:33 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jul 2007 08:40:33 -0400 Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs In-Reply-To: <468CA13B.2040900@dev.mellanox.co.il> References: <468CA13B.2040900@dev.mellanox.co.il> Message-ID: <1183639225.4377.435484.camel@hal.voltaire.com> Hi Yevgeny, On Thu, 2007-07-05 at 03:43, Yevgeny Kliteynik wrote: > Hi Hal, > > opensm.fdbs dump function adaptation to the recent changes in min hop tables > broke fat-tree routing (or any other future routing that may not use the same > min hop tables creation functions). > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied to master only (by hand again as patch rejected these changes :-( Please double check. -- Hal From halr at voltaire.com Thu Jul 5 05:57:31 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Jul 2007 08:57:31 -0400 Subject: [ofa-general] [PATCH] OpenSM handling of "Babbling" Ports Message-ID: <1183640246.4377.436639.camel@hal.voltaire.com> A "babbling" port is a port which causes traps to be generated frequently.
It may be "this" port itself that generates the traps, or the peer port that detects the issue, in which case the SMA on switch port 0 generates the traps. So far this has been observed only for trap 131, but it also applies to traps 129 and 130, which are similar urgent traps. Note that there appears to be a bug in Mellanox firmware, for both Anafa-2 and Tavor at a minimum, which causes the max trap rate not to be adhered to, and relief for this does not appear to be in sight in the short term.

Policy
When a babbling port is detected, OpenSM will disable the port or its peer switch port (depending on which trap), which should terminate the trap storm.

Detection
250 consecutive traps of this type will be used as the (initial) threshold. The threshold is this high so that a port is not detected as babbling, and disabled, prematurely.

Recovery
The admin would re-enable the port when it is OK again. (This usually involves rebooting the node causing the trap to be indicated.)

Signed-off-by: Hal Rosenstock

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index bedd63f..1150703 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt boolean_t honor_guid2lid_file; boolean_t daemon; boolean_t sm_inactive; + boolean_t babbling_port_policy; osm_qos_options_t qos_options; osm_qos_options_t qos_ca_options; osm_qos_options_t qos_sw0_options; @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt * * sm_inactive * OpenSM will start with SM in not active state. +* +* babbling_port_policy +* OpenSM will enforce its "babbling" port policy.
* * perfmgr * Enable or disable the performance manager diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 726b665..87b71e5 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -472,6 +472,7 @@ osm_subn_set_default_opt( p_opt->honor_guid2lid_file = FALSE; p_opt->daemon = FALSE; p_opt->sm_inactive = FALSE; + p_opt->babbling_port_policy = FALSE; #ifdef ENABLE_OSM_PERF_MGR p_opt->perfmgr = FALSE; p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file( "sm_inactive", p_key, p_val, &p_opts->sm_inactive); + __osm_subn_opts_unpack_boolean( + "babbling_port_policy", + p_key, p_val, &p_opts->babbling_port_policy); + #ifdef ENABLE_OSM_PERF_MGR __osm_subn_opts_unpack_boolean( "perfmgr", @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( "# Daemon mode\n" "daemon %s\n\n" "# SM Inactive\n" - "sm_inactive %s\n\n", + "sm_inactive %s\n\n" + "# Babbling Port Policy\n" + "babbling_port_policy %s\n\n", p_opts->daemon ? "TRUE" : "FALSE", - p_opts->sm_inactive ? "TRUE" : "FALSE" + p_opts->sm_inactive ? "TRUE" : "FALSE", + p_opts->babbling_port_policy ? "TRUE" : "FALSE" ); #ifdef ENABLE_OSM_PERF_MGR diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( } else { + /* When babbling port policy option is enabled and + Threshold for disabling a "babbling" port is exceeded */ + if ( p_rcv->p_subn->opt.babbling_port_policy && + num_received >= 250 ) + { + uint8_t payload[IB_SMP_DATA_SIZE]; + ib_port_info_t* p_pi = (ib_port_info_t*)payload; + const ib_port_info_t* p_old_pi; + osm_madw_context_t context; + + /* If trap 131, might want to disable peer port if available */ + /* but peer port has been observed not to respond to SM requests */ + + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3810: " + " Disabling physical port lid:0x%02X num:%u\n", + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), + p_ntci->data_details.ntc_129_131.port_num + ); + + p_old_pi = &p_physp->port_info; + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); + + /* Set port to disabled/down */ + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); + + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); + context.pi_context.set_method = TRUE; + context.pi_context.update_master_sm_base_lid = FALSE; + context.pi_context.light_sweep = FALSE; + context.pi_context.active_transition = FALSE; + + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, + osm_physp_get_dr_path_ptr( p_physp ), + payload, + sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32(osm_physp_get_port_num( p_physp )), + CL_DISP_MSGID_NONE, + &context ); + + if( status == IB_SUCCESS ) + { + goto Exit; + } + else + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3811: " + "Request to set PortInfo failed\n" ); + } + } + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_trap_rcv_process_request: " "Marking unhealthy physical port by lid:0x%02X num:%u\n", From kliteyn at dev.mellanox.co.il Thu Jul 
5 06:54:55 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 05 Jul 2007 16:54:55 +0300 Subject: [ofa-general] [PATCH] osm: cosmetics - removing trailing blanks Message-ID: <468CF82F.5030409@dev.mellanox.co.il> Hi Hal, Removing trailing white spaces in fat-tree -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 340 +++++++++++++++++++------------------- 1 files changed, 170 insertions(+), 170 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 1ead199..e91f3ed 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -63,12 +63,12 @@ * so no need to use FatTree routing. * - Why maximum rank is 8: * Each node (switch) is assigned a unique tuple. - * Switches are stored in two cl_qmaps - one is + * Switches are stored in two cl_qmaps - one is * ordered by guid, and the other by a key that is * generated from tuple. Since cl_qmap supports only * a 64-bit key, the maximal tuple lenght is 8 bytes. * which means that maximal tree rank is 8. - * Note that the above also implies that each switch + * Note that the above also implies that each switch * can have at max 255 up/down ports. 
*/ @@ -132,7 +132,7 @@ typedef uint8_t * ftree_fwd_tbl_t; ** ***************************************************/ -typedef struct ftree_port_t_ +typedef struct ftree_port_t_ { cl_map_item_t map_item; uint8_t port_num; /* port number on the current node */ @@ -170,7 +170,7 @@ typedef struct ftree_port_group_t_ ** ***************************************************/ -typedef struct ftree_sw_t_ +typedef struct ftree_sw_t_ { cl_map_item_t map_item; osm_switch_t * p_osm_sw; @@ -203,7 +203,7 @@ typedef struct ftree_hca_t_ { ** ***************************************************/ -typedef struct ftree_fabric_t_ +typedef struct ftree_fabric_t_ { osm_opensm_t * p_osm; cl_qmap_t hca_tbl; @@ -226,11 +226,11 @@ typedef struct ftree_fabric_t_ static int OSM_CDECL __osm_ftree_compare_switches_by_index( - IN const void * p1, + IN const void * p1, IN const void * p2) { - ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1; - ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2; + ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1; + ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2; uint16_t i; for (i = 0; i < FTREE_TUPLE_LEN; i++) @@ -247,13 +247,13 @@ __osm_ftree_compare_switches_by_index( static int OSM_CDECL __osm_ftree_compare_port_groups_by_remote_switch_index( - IN const void * p1, + IN const void * p1, IN const void * p2) { - ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1; - ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2; + ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1; + ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2; - return __osm_ftree_compare_switches_by_index( + return __osm_ftree_compare_switches_by_index( &((*pp_g1)->remote_hca_or_sw.remote_sw), &((*pp_g2)->remote_hca_or_sw.remote_sw) ); } @@ -290,7 +290,7 @@ __osm_ftree_sw_greater_by_index( ** ***************************************************/ -static void +static void __osm_ftree_tuple_init( IN ftree_tuple_t tuple) { @@ -310,7 +310,7 @@ __osm_ftree_tuple_assigned( #define FTREE_TUPLE_BUFFERS_NUM 6 -static char * 
+static char * __osm_ftree_tuple_to_str( IN ftree_tuple_t tuple) { @@ -340,7 +340,7 @@ __osm_ftree_tuple_to_str( /***************************************************/ -static inline ftree_tuple_key_t +static inline ftree_tuple_key_t __osm_ftree_tuple_to_key( IN ftree_tuple_t tuple) { @@ -351,9 +351,9 @@ __osm_ftree_tuple_to_key( /***************************************************/ -static inline void +static inline void __osm_ftree_tuple_from_key( - IN ftree_tuple_t tuple, + IN ftree_tuple_t tuple, IN ftree_tuple_key_t key) { memcpy(tuple, &key, FTREE_TUPLE_LEN); @@ -369,7 +369,7 @@ static ftree_sw_tbl_element_t * __osm_ftree_sw_tbl_element_create( IN ftree_sw_t * p_sw) { - ftree_sw_tbl_element_t * p_element = + ftree_sw_tbl_element_t * p_element = (ftree_sw_tbl_element_t *) malloc(sizeof(ftree_sw_tbl_element_t)); if (!p_element) return NULL; @@ -397,8 +397,8 @@ __osm_ftree_sw_tbl_element_destroy( ** ***************************************************/ -static ftree_port_t * -__osm_ftree_port_create( +static ftree_port_t * +__osm_ftree_port_create( IN uint8_t port_num, IN uint8_t remote_port_num) { @@ -415,7 +415,7 @@ __osm_ftree_port_create( /***************************************************/ -static void +static void __osm_ftree_port_destroy( IN ftree_port_t * p_port) { @@ -429,8 +429,8 @@ __osm_ftree_port_destroy( ** ***************************************************/ -static ftree_port_group_t * -__osm_ftree_port_group_create( +static ftree_port_group_t * +__osm_ftree_port_group_create( IN ib_net16_t base_lid, IN ib_net16_t remote_base_lid, IN ib_net64_t * p_port_guid, @@ -439,9 +439,9 @@ __osm_ftree_port_group_create( IN uint8_t remote_node_type, IN void * p_remote_hca_or_sw) { - ftree_port_group_t * p_group = + ftree_port_group_t * p_group = (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t)); - if (p_group == NULL) + if (p_group == NULL) return NULL; memset(p_group, 0, sizeof(ftree_port_group_t)); @@ -473,7 +473,7 @@ __osm_ftree_port_group_create( 
/***************************************************/ -static void +static void __osm_ftree_port_group_destroy( IN ftree_port_group_t * p_group) { @@ -497,7 +497,7 @@ __osm_ftree_port_group_destroy( /***************************************************/ -static void +static void __osm_ftree_port_group_dump( IN ftree_fabric_t *p_ftree, IN ftree_port_group_t * p_group, @@ -529,9 +529,9 @@ __osm_ftree_port_group_dump( osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_port_group_dump:" - " Port Group of size %u, port(s): %s, direction: %s\n" + " Port Group of size %u, port(s): %s, direction: %s\n" " Local <--> Remote GUID (LID):" - "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n", + "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n", size, buff, (direction == FTREE_DIRECTION_DOWN)? "DOWN" : "UP", @@ -570,7 +570,7 @@ __osm_ftree_port_group_add_port( ** ***************************************************/ -static ftree_sw_t * +static ftree_sw_t * __osm_ftree_sw_create( IN ftree_fabric_t * p_ftree, IN osm_switch_t * p_osm_sw) @@ -583,7 +583,7 @@ __osm_ftree_sw_create( return NULL; p_sw = (ftree_sw_t *)malloc(sizeof(ftree_sw_t)); - if (p_sw == NULL) + if (p_sw == NULL) return NULL; memset(p_sw, 0, sizeof(ftree_sw_t)); @@ -594,9 +594,9 @@ __osm_ftree_sw_create( p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0); ports_num = osm_node_get_num_physp(p_sw->p_osm_sw->p_node); - p_sw->down_port_groups = + p_sw->down_port_groups = (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *)); - p_sw->up_port_groups = + p_sw->up_port_groups = (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *)); if (!p_sw->down_port_groups || !p_sw->up_port_groups) return NULL; @@ -612,7 +612,7 @@ __osm_ftree_sw_create( /***************************************************/ -static void +static void __osm_ftree_sw_destroy( IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) @@ -640,7 +640,7 @@ __osm_ftree_sw_destroy( 
/***************************************************/ -static void +static void __osm_ftree_sw_dump( IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) @@ -658,7 +658,7 @@ __osm_ftree_sw_dump( "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n", __osm_ftree_tuple_to_str(p_sw->tuple), cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), - p_sw->down_port_groups_num, + p_sw->down_port_groups_num, p_sw->up_port_groups_num); for( i = 0; i < p_sw->down_port_groups_num; i++ ) @@ -678,7 +678,7 @@ static boolean_t __osm_ftree_sw_ranked( IN ftree_sw_t * p_sw) { - return (p_sw->rank != 0xFFFFFFFF); + return (p_sw->rank != 0xFFFFFFFF); } /***************************************************/ @@ -713,7 +713,7 @@ __osm_ftree_sw_get_port_group_by_remote_lid( /***************************************************/ -static void +static void __osm_ftree_sw_add_port( IN ftree_sw_t * p_sw, IN uint8_t port_num, @@ -727,7 +727,7 @@ __osm_ftree_sw_add_port( IN void * p_remote_hca_or_sw, IN ftree_direction_t direction) { - ftree_port_group_t * p_group = + ftree_port_group_t * p_group = __osm_ftree_sw_get_port_group_by_remote_lid(p_sw,remote_base_lid,direction); if (!p_group) @@ -756,7 +756,7 @@ __osm_ftree_sw_add_port( static inline void __osm_ftree_sw_set_fwd_table_block( IN ftree_sw_t * p_sw, - IN uint16_t lid_ho, + IN uint16_t lid_ho, IN uint8_t port_num) { p_sw->lft_buf[lid_ho] = port_num; @@ -795,17 +795,17 @@ __osm_ftree_sw_set_hops( ** ***************************************************/ -static ftree_hca_t * +static ftree_hca_t * __osm_ftree_hca_create( IN osm_node_t * p_osm_node) { ftree_hca_t * p_hca = (ftree_hca_t *)malloc(sizeof(ftree_hca_t)); - if (p_hca == NULL) + if (p_hca == NULL) return NULL; memset(p_hca,0,sizeof(ftree_hca_t)); p_hca->p_osm_node = p_osm_node; - p_hca->up_port_groups = (ftree_port_group_t **) + p_hca->up_port_groups = (ftree_port_group_t **) malloc(osm_node_get_num_physp(p_hca->p_osm_node) * sizeof (ftree_port_group_t *)); if 
(!p_hca->up_port_groups) return NULL; @@ -815,7 +815,7 @@ __osm_ftree_hca_create( /***************************************************/ -static void +static void __osm_ftree_hca_destroy( IN ftree_hca_t * p_hca) { @@ -835,7 +835,7 @@ __osm_ftree_hca_destroy( /***************************************************/ -static void +static void __osm_ftree_hca_dump( IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca) @@ -851,10 +851,10 @@ __osm_ftree_hca_dump( osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_hca_dump: " "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", - cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), p_hca->up_port_groups_num); - for( i = 0; i < p_hca->up_port_groups_num; i++ ) + for( i = 0; i < p_hca->up_port_groups_num; i++ ) __osm_ftree_port_group_dump(p_ftree, p_hca->up_port_groups[i], FTREE_DIRECTION_UP); @@ -877,7 +877,7 @@ __osm_ftree_hca_get_port_group_by_remote_lid( /***************************************************/ -static void +static void __osm_ftree_hca_add_port( IN ftree_hca_t * p_hca, IN uint8_t port_num, @@ -893,7 +893,7 @@ __osm_ftree_hca_add_port( ftree_port_group_t * p_group; /* this function is supposed to be called only for adding ports - in hca's that lead to switches */ + in hca's that lead to switches */ CL_ASSERT(remote_node_type == IB_NODE_TYPE_SWITCH); p_group = __osm_ftree_hca_get_port_group_by_remote_lid(p_hca,remote_base_lid); @@ -920,12 +920,12 @@ __osm_ftree_hca_add_port( ** ***************************************************/ -static ftree_fabric_t * +static ftree_fabric_t * __osm_ftree_fabric_create() { cl_status_t status; ftree_fabric_t * p_ftree = (ftree_fabric_t *)malloc(sizeof(ftree_fabric_t)); - if (p_ftree == NULL) + if (p_ftree == NULL) return NULL; memset(p_ftree,0,sizeof(ftree_fabric_t)); @@ -951,7 +951,7 @@ __osm_ftree_fabric_create() /***************************************************/ -static void +static void 
__osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) { ftree_hca_t * p_hca; @@ -988,13 +988,13 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) /* remove all the elements of sw_by_tuple_tbl */ - p_next_element = + p_next_element = (ftree_sw_tbl_element_t *)cl_qmap_head(&p_ftree->sw_by_tuple_tbl); - while( p_next_element != + while( p_next_element != (ftree_sw_tbl_element_t *)cl_qmap_end( &p_ftree->sw_by_tuple_tbl ) ) { p_element = p_next_element; - p_next_element = + p_next_element = (ftree_sw_tbl_element_t *)cl_qmap_next(&p_element->map_item); __osm_ftree_sw_tbl_element_destroy(p_element); } @@ -1012,7 +1012,7 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) /***************************************************/ -static void +static void __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree) { if (!p_ftree) @@ -1024,7 +1024,7 @@ __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree) /***************************************************/ -static void +static void __osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) { if (rank > p_ftree->tree_rank) @@ -1033,7 +1033,7 @@ __osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) /***************************************************/ -static uint8_t +static uint8_t __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree) { return p_ftree->tree_rank; @@ -1041,7 +1041,7 @@ __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree) /***************************************************/ -static void +static void __osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node) { ftree_hca_t * p_hca = __osm_ftree_hca_create(p_osm_node); @@ -1055,7 +1055,7 @@ __osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node) /***************************************************/ -static void +static void __osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw) { ftree_sw_t * p_sw = __osm_ftree_sw_create(p_ftree,p_osm_sw); @@ -1073,9 +1073,9 @@ 
__osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw) /***************************************************/ -static void +static void __osm_ftree_fabric_add_sw_by_tuple( - IN ftree_fabric_t * p_ftree, + IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) { CL_ASSERT(__osm_ftree_tuple_assigned(p_sw->tuple)); @@ -1087,9 +1087,9 @@ __osm_ftree_fabric_add_sw_by_tuple( /***************************************************/ -static ftree_sw_t * +static ftree_sw_t * __osm_ftree_fabric_get_sw_by_tuple( - IN ftree_fabric_t * p_ftree, + IN ftree_fabric_t * p_ftree, IN ftree_tuple_t tuple) { ftree_sw_tbl_element_t * p_element; @@ -1108,7 +1108,7 @@ __osm_ftree_fabric_get_sw_by_tuple( /***************************************************/ -static void +static void __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) { uint32_t i; @@ -1154,7 +1154,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) /***************************************************/ -static void +static void __osm_ftree_fabric_dump_general_info( IN ftree_fabric_t * p_ftree) { @@ -1190,7 +1190,7 @@ __osm_ftree_fabric_dump_general_info( } if (i == 0) addition_str = " (root) "; - else + else if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) addition_str = " (leaf) "; else @@ -1237,10 +1237,10 @@ __osm_ftree_fabric_dump_general_info( /***************************************************/ -static void +static void __osm_ftree_fabric_dump_hca_ordering( IN ftree_fabric_t * p_ftree) -{ +{ ftree_hca_t * p_hca; ftree_sw_t * p_sw; ftree_port_group_t * p_group; @@ -1251,10 +1251,10 @@ __osm_ftree_fabric_dump_hca_ordering( FILE * p_hca_ordering_file; char * filename = "opensm-ftree-ca-order.dump"; - snprintf(path, sizeof(path), "%s/%s", + snprintf(path, sizeof(path), "%s/%s", p_ftree->p_osm->subn.opt.dump_files_dir, filename); p_hca_ordering_file = fopen(path, "w"); - if (!p_hca_ordering_file) + if (!p_hca_ordering_file) { osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, 
"__osm_ftree_fabric_dump_hca_ordering: ERR AB01: " @@ -1263,7 +1263,7 @@ __osm_ftree_fabric_dump_hca_ordering( OSM_LOG_EXIT(&p_ftree->p_osm->log); return; } - + /* for each leaf switch (in indexing order) */ for(i = 0; i < p_ftree->leaf_switches_num; i++) { @@ -1274,7 +1274,7 @@ __osm_ftree_fabric_dump_hca_ordering( p_group = p_sw->down_port_groups[j]; p_hca = p_group->remote_hca_or_sw.remote_hca; - fprintf(p_hca_ordering_file,"0x%x\t%s\n", + fprintf(p_hca_ordering_file,"0x%x\t%s\n", cl_ntoh16(p_group->remote_base_lid), p_hca->p_osm_node->print_desc); } @@ -1293,7 +1293,7 @@ __osm_ftree_fabric_dump_hca_ordering( /***************************************************/ -static void +static void __osm_ftree_fabric_assign_tuple( IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, @@ -1305,7 +1305,7 @@ __osm_ftree_fabric_assign_tuple( /***************************************************/ -static void +static void __osm_ftree_fabric_assign_first_tuple( IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) @@ -1353,7 +1353,7 @@ __osm_ftree_fabric_get_new_tuple( { temp_tuple[var_index] = i; p_sw = __osm_ftree_fabric_get_sw_by_tuple(p_ftree,temp_tuple); - if (p_sw == NULL) /* found free tuple */ + if (p_sw == NULL) /* found free tuple */ break; } @@ -1444,7 +1444,7 @@ __osm_ftree_fabric_make_indexing( cl_ntoh16(p_sw->base_lid), cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node))); - /* + /* * Now run BFS and assign indexes to all switches * Pseudo code of the algorithm is as follows: * @@ -1482,7 +1482,7 @@ __osm_ftree_fabric_make_indexing( /* This is not the leaf switch, which means that all the ports that point down are taking us to another switches. 
No need to assign indexing to HCAs */ - for( i = 0; i < p_sw->down_port_groups_num; i++ ) + for( i = 0; i < p_sw->down_port_groups_num; i++ ) { p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw; if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) @@ -1502,11 +1502,11 @@ __osm_ftree_fabric_make_indexing( new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, + cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); } - /* Done assigning indexes to all the remote switches - that are pointed by the downgoing ports. + /* Done assigning indexes to all the remote switches + that are pointed by the downgoing ports. Now sort port groups according to remote index. */ qsort(p_sw->down_port_groups, /* array */ p_sw->down_port_groups_num, /* number of elements */ @@ -1521,7 +1521,7 @@ __osm_ftree_fabric_make_indexing( { /* This is not the root switch, which means that all the ports that are pointing up are taking us to another switches. */ - for( i = 0; i < p_sw->up_port_groups_num; i++ ) + for( i = 0; i < p_sw->up_port_groups_num; i++ ) { p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw; if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) @@ -1538,18 +1538,18 @@ __osm_ftree_fabric_make_indexing( p_remote_sw, new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, + cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); } - /* Done assigning indexes to all the remote switches - that are pointed by the upgoing ports. + /* Done assigning indexes to all the remote switches + that are pointed by the upgoing ports. Now sort port groups according to remote index. 
*/ qsort(p_sw->up_port_groups, /* array */ p_sw->up_port_groups_num, /* number of elements */ sizeof(ftree_port_group_t *), /* size of each element */ __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */ } - /* Done assigning indexes to all the switches that are directly connected + /* Done assigning indexes to all the switches that are directly connected to the current switch - go to the next switch in the BFS queue */ } cl_list_destroy(&bfs_list); @@ -1594,7 +1594,7 @@ __osm_ftree_fabric_validate_topology( memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *)); p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); - while( res && + while( res && p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) { p_sw = p_next_sw; @@ -1602,7 +1602,7 @@ __osm_ftree_fabric_validate_topology( if (!reference_sw_arr[p_sw->rank]) { - /* This is the first switch in the current level that + /* This is the first switch in the current level that we're checking - use it as a reference */ reference_sw_arr[p_sw->rank] = p_sw; } @@ -1726,19 +1726,19 @@ __osm_ftree_fabric_validate_topology( static void __osm_ftree_set_sw_fwd_table( - IN cl_map_item_t* const p_map_item, + IN cl_map_item_t* const p_map_item, IN void *context) { ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; - /* calculate lft length rounded up to a multiple of 64 (block length) */ + /* calculate lft length rounded up to a multiple of 64 (block length) */ uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64); p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; - memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, - p_sw->lft_buf, + memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, + p_sw->lft_buf, lft_len); osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } @@ -1746,10 +1746,10 @@ __osm_ftree_set_sw_fwd_table( /*************************************************** 
***************************************************/ -/* +/* * Function: assign-up-going-port-by-descending-down * Given : a switch and a LID - * Pseudo code: + * Pseudo code: * foreach down-going-port-group (in indexing order) * skip this group if the LFT(LID) port is part of this group * find the least loaded port of the group (scan in indexing order) @@ -1785,7 +1785,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down( CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)); /* if there is no down-going ports */ - if (p_sw->down_port_groups_num == 0) + if (p_sw->down_port_groups_num == 0) return; /* foreach down-going port group (in indexing order) */ @@ -1793,7 +1793,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down( { p_group = p_sw->down_port_groups[i]; - if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) ) + if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) ) { /* This port group has a port that was used when we entered this switch, which means that the current group points to the switch where we were @@ -1807,7 +1807,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down( ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports); /* ToDo: no need to select a least loaded port for non-main path. Think about optimization. */ - for (j = 0; j < ports_num; j++) + for (j = 0; j < ports_num; j++) { cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port); if (!p_min_port) @@ -1821,16 +1821,16 @@ __osm_ftree_fabric_route_upgoing_by_going_down( p_min_port = p_port; } } - /* At this point we have selected a port in this group with the + /* At this point we have selected a port in this group with the lowest load of upgoing routes. 
Set on the remote switch how to get to the target_lid - set LFT(target_lid) on the remote switch to the remote port */ p_remote_sw = p_group->remote_hca_or_sw.remote_sw; - if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw, + if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw, cl_ntoh16(target_lid)) != OSM_NO_PATH ) { - /* Loop in the fabric - we already routed the remote switch + /* Loop in the fabric - we already routed the remote switch on our way UP, and now we see it again on our way DOWN */ osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_upgoing_by_going_down: " @@ -1846,28 +1846,28 @@ __osm_ftree_fabric_route_upgoing_by_going_down( /* Four possible cases: * - * 1. is_real_lid == TRUE && is_main_path == TRUE: + * 1. is_real_lid == TRUE && is_main_path == TRUE: * - going DOWN(TRUE,TRUE) through ALL the groups * + promoting port counter * + setting path in remote switch fwd tbl * + setting hops in remote switch on all the ports of each group - * - * 2. is_real_lid == TRUE && is_main_path == FALSE: + * + * 2. is_real_lid == TRUE && is_main_path == FALSE: * - going DOWN(TRUE,FALSE) through ALL the groups but only if - * the remote (upper) switch hasn't been already configured + * the remote (upper) switch hasn't been already configured * for this target LID * + NOT promoting port counter * + setting path in remote switch fwd tbl if it hasn't been set yet * + setting hops in remote switch on all the ports of each group * if it hasn't been set yet * - * 3. is_real_lid == FALSE && is_main_path == TRUE: + * 3. is_real_lid == FALSE && is_main_path == TRUE: * - going DOWN(FALSE,TRUE) through ALL the groups * + promoting port counter * + NOT setting path in remote switch fwd tbl * + NOT setting hops in remote switch * - * 4. is_real_lid == FALSE && is_main_path == FALSE: + * 4. 
is_real_lid == FALSE && is_main_path == FALSE: * - illegal state - we shouldn't get here */ @@ -1908,8 +1908,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down( } - - /* The number of upgoing routes is tracked in the + + /* The number of upgoing routes is tracked in the p_port->counter_up counter of the port that belongs to the upper side of the link (on switch with lower rank). Counter is promoted only if we're routing LID on the main @@ -1939,10 +1939,10 @@ __osm_ftree_fabric_route_upgoing_by_going_down( /***************************************************/ -/* +/* * Function: assign-down-going-port-by-descending-up * Given : a switch and a LID - * Pseudo code: + * Pseudo code: * find the least loaded port of all the upgoing groups (scan in indexing order) * assign the LFT(LID) of remote switch to that port * track that port usage @@ -2011,7 +2011,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( p_min_port = p_port; } else - { + { if ( p_port->counter_down < p_min_port->counter_down ) { /* this port is less loaded - use it as min */ @@ -2022,7 +2022,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( } } - /* At this point we have selected a group and port with the + /* At this point we have selected a group and port with the lowest load of downgoing routes. Set on the remote switch how to get to the target_lid - set LFT(target_lid) on the remote switch to the remote port */ @@ -2030,7 +2030,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( /* Four possible cases: * - * 1. is_real_lid == TRUE && is_main_path == TRUE: + * 1. is_real_lid == TRUE && is_main_path == TRUE: * - going UP(TRUE,TRUE) on selected min_group and min_port * + promoting port counter * + setting path in remote switch fwd tbl @@ -2040,23 +2040,23 @@ __osm_ftree_fabric_route_downgoing_by_going_up( * + setting path in remote switch fwd tbl if it hasn't been set yet * + setting hops in remote switch on all the ports of each group * if it hasn't been set yet - * - * 2. 
is_real_lid == TRUE && is_main_path == FALSE: + * + * 2. is_real_lid == TRUE && is_main_path == FALSE: * - going UP(TRUE,FALSE) on ALL the groups, each time on port 0, - * but only if the remote (upper) switch hasn't been already + * but only if the remote (upper) switch hasn't been already * configured for this target LID * + NOT promoting port counter * + setting path in remote switch fwd tbl if it hasn't been set yet * + setting hops in remote switch on all the ports of each group * if it hasn't been set yet * - * 3. is_real_lid == FALSE && is_main_path == TRUE: + * 3. is_real_lid == FALSE && is_main_path == TRUE: * - going UP(FALSE,TRUE) ONLY on selected min_group and min_port * + promoting port counter * + NOT setting path in remote switch fwd tbl * + NOT setting hops in remote switch * - * 4. is_real_lid == FALSE && is_main_path == FALSE: + * 4. is_real_lid == FALSE && is_main_path == FALSE: * - illegal state - we shouldn't get here */ @@ -2073,7 +2073,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( __osm_ftree_tuple_to_str(p_sw->tuple), __osm_ftree_tuple_to_str(p_remote_sw->tuple)); } - /* The number of downgoing routes is tracked in the + /* The number of downgoing routes is tracked in the p_port->counter_down counter of the port that belongs to the lower side of the link (on switch with higher rank) */ p_min_port->counter_down++; @@ -2103,7 +2103,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( } } - /* Recursion step: + /* Recursion step: Assign downgoing ports by stepping up, starting on REMOTE switch. */ __osm_ftree_fabric_route_downgoing_by_going_up( p_ftree, @@ -2121,18 +2121,18 @@ __osm_ftree_fabric_route_downgoing_by_going_up( /* What's left to do at this point: * - * 1. is_real_lid == TRUE && is_main_path == TRUE: - * - going UP(TRUE,FALSE) on rest of the groups, each time on port 0, - * but only if the remote (upper) switch hasn't been already + * 1. 
is_real_lid == TRUE && is_main_path == TRUE: + * - going UP(TRUE,FALSE) on rest of the groups, each time on port 0, + * but only if the remote (upper) switch hasn't been already * configured for this target LID * + NOT promoting port counter * + setting path in remote switch fwd tbl if it hasn't been set yet * + setting hops in remote switch on all the ports of each group * if it hasn't been set yet - * - * 2. is_real_lid == TRUE && is_main_path == FALSE: + * + * 2. is_real_lid == TRUE && is_main_path == FALSE: * - going UP(TRUE,FALSE) on ALL the groups, each time on port 0, - * but only if the remote (upper) switch hasn't been already + * but only if the remote (upper) switch hasn't been already * configured for this target LID * + NOT promoting port counter * + setting path in remote switch fwd tbl if it hasn't been set yet @@ -2170,7 +2170,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( __osm_ftree_tuple_to_str(p_sw->tuple), __osm_ftree_tuple_to_str(p_remote_sw->tuple)); } - + cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port); __osm_ftree_sw_set_fwd_table_block(p_remote_sw, cl_ntoh16(target_lid), @@ -2191,7 +2191,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( target_rank - p_remote_sw->rank); } - /* Recursion step: + /* Recursion step: Assign downgoing ports by stepping up, starting on REMOTE switch. 
*/ __osm_ftree_fabric_route_downgoing_by_going_up( p_ftree, @@ -2207,8 +2207,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up( /***************************************************/ -/* - * Pseudo code: +/* + * Pseudo code: * foreach leaf switch (in indexing order) * for each compute node (in indexing order) * obtain the LID of the compute node @@ -2303,8 +2303,8 @@ __osm_ftree_fabric_route_to_hcas( /***************************************************/ -/* - * Pseudo code: +/* + * Pseudo code: * foreach switch in fabric * obtain its LID * set local LFT(LID) to port 0 @@ -2364,7 +2364,7 @@ __osm_ftree_fabric_route_to_switches( /*************************************************** ***************************************************/ -static int +static int __osm_ftree_fabric_populate_nodes( IN ftree_fabric_t * p_ftree) { @@ -2406,7 +2406,7 @@ __osm_ftree_fabric_populate_nodes( /*************************************************** ***************************************************/ -static boolean_t +static boolean_t __osm_ftree_sw_update_rank( IN ftree_sw_t * p_sw, IN uint32_t new_rank) @@ -2422,7 +2422,7 @@ __osm_ftree_sw_update_rank( static void __osm_ftree_rank_switches_from_leafs( - IN ftree_fabric_t * p_ftree, + IN ftree_fabric_t * p_ftree, IN cl_list_t * p_ranking_bfs_list) { ftree_sw_t * p_sw; @@ -2445,9 +2445,9 @@ __osm_ftree_rank_switches_from_leafs( for (i = 1; i < osm_node_get_num_physp(p_node); i++) { p_osm_port = osm_node_get_physp_ptr(p_node,i); - if (!osm_physp_is_valid(p_osm_port)) + if (!osm_physp_is_valid(p_osm_port)) continue; - if (!osm_link_is_healthy(p_osm_port)) + if (!osm_link_is_healthy(p_osm_port)) continue; p_remote_node = osm_node_get_remote_node(p_node,i,NULL); @@ -2466,7 +2466,7 @@ __osm_ftree_rank_switches_from_leafs( /* if needed, rank the remote switch and add it to the BFS list */ if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1)) - cl_list_insert_tail(p_ranking_bfs_list, + cl_list_insert_tail(p_ranking_bfs_list, 
&__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); } } @@ -2475,7 +2475,7 @@ __osm_ftree_rank_switches_from_leafs( /***************************************************/ -static int +static int __osm_ftree_rank_leaf_switches( IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca, @@ -2493,9 +2493,9 @@ __osm_ftree_rank_leaf_switches( for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++) { p_osm_port = osm_node_get_physp_ptr(p_osm_node,i); - if (!osm_physp_is_valid(p_osm_port)) + if (!osm_physp_is_valid(p_osm_port)) continue; - if (!osm_link_is_healthy(p_osm_port)) + if (!osm_link_is_healthy(p_osm_port)) continue; p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL); @@ -2551,7 +2551,7 @@ __osm_ftree_rank_leaf_switches( cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), cl_ntoh16(p_sw->base_lid)); - cl_list_insert_tail(p_ranking_bfs_list, + cl_list_insert_tail(p_ranking_bfs_list, &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); } @@ -2562,9 +2562,9 @@ __osm_ftree_rank_leaf_switches( /***************************************************/ -static void +static void __osm_ftree_sw_reverse_rank( - IN cl_map_item_t* const p_map_item, + IN cl_map_item_t* const p_map_item, IN void *context) { ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; @@ -2577,7 +2577,7 @@ __osm_ftree_sw_reverse_rank( static int __osm_ftree_fabric_construct_hca_ports( - IN ftree_fabric_t * p_ftree, + IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca) { ftree_sw_t * p_remote_sw; @@ -2594,9 +2594,9 @@ __osm_ftree_fabric_construct_hca_ports( { osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i); - if (!osm_physp_is_valid(p_osm_port)) + if (!osm_physp_is_valid(p_osm_port)) continue; - if (!osm_link_is_healthy(p_osm_port)) + if (!osm_link_is_healthy(p_osm_port)) continue; p_remote_osm_port = osm_physp_get_remote(p_osm_port); @@ -2665,9 +2665,9 @@ __osm_ftree_fabric_construct_hca_ports( 
/*************************************************** ***************************************************/ -static int +static int __osm_ftree_fabric_construct_sw_ports( - IN ftree_fabric_t * p_ftree, + IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) { ftree_hca_t * p_remote_hca; @@ -2690,9 +2690,9 @@ __osm_ftree_fabric_construct_sw_ports( { osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i); - if (!osm_physp_is_valid(p_osm_port)) + if (!osm_physp_is_valid(p_osm_port)) continue; - if (!osm_link_is_healthy(p_osm_port)) + if (!osm_link_is_healthy(p_osm_port)) continue; p_remote_osm_port = osm_physp_get_remote(p_osm_port); @@ -2770,16 +2770,16 @@ __osm_ftree_fabric_construct_sw_ports( goto Exit; } __osm_ftree_sw_add_port( - p_sw, /* local ftree_sw object */ - i, /* local port number */ - remote_port_num, /* remote port number */ - p_sw->base_lid, /* local lid */ - remote_base_lid, /* remote lid */ - osm_physp_get_port_guid(p_osm_port), /* local port guid */ - osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */ - remote_node_guid, /* remote node guid */ - remote_node_type, /* remote node type */ - p_remote_hca_or_sw, /* remote ftree_hca/sw object */ + p_sw, /* local ftree_sw object */ + i, /* local port number */ + remote_port_num, /* remote port number */ + p_sw->base_lid, /* local lid */ + remote_base_lid, /* remote lid */ + osm_physp_get_port_guid(p_osm_port), /* local port guid */ + osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */ + remote_node_guid, /* remote node guid */ + remote_node_type, /* remote node type */ + p_remote_hca_or_sw, /* remote ftree_hca/sw object */ direction); /* port direction (up or down) */ /* Track the max lid (in host order) that exists in the fabric */ @@ -2809,8 +2809,8 @@ __osm_ftree_fabric_perform_ranking( initially filled with the leaf switches */ cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); - /* Mark REVERSED rank of all the switches in the subnet. 
- Start from switches that are connected to hca's, and + /* Mark REVERSED rank of all the switches in the subnet. + Start from switches that are connected to hca's, and scan all the switches in the subnet. */ p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) @@ -2831,7 +2831,7 @@ __osm_ftree_fabric_perform_ranking( list already contains all the ranked leaf switches */ __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list); cl_list_destroy(&ranking_bfs_list); - + /* REVERSED ranking of all the switches completed. Calculate and set FatTree rank */ @@ -2839,14 +2839,14 @@ __osm_ftree_fabric_perform_ranking( osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_perform_ranking: " "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree)); - + /* fix ranking of the switches by reversing the ranking direction */ cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree); if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) { - osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: " "Tree rank is %u (should be between %u and %u)\n", __osm_ftree_fabric_get_rank(p_ftree), @@ -2907,7 +2907,7 @@ __osm_ftree_fabric_populate_ports( /*************************************************** ***************************************************/ -static int +static int __osm_ftree_construct_fabric( IN void * context) { @@ -2935,7 +2935,7 @@ __osm_ftree_construct_fabric( goto Exit; } - if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - + if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, @@ -3061,7 +3061,7 @@ __osm_ftree_construct_fabric( /*************************************************** 
 ***************************************************/
 
-static int 
+static int
 __osm_ftree_do_routing(
          IN  void * context)
 {
@@ -3104,7 +3104,7 @@ __osm_ftree_do_routing(
 /***************************************************
  ***************************************************/
 
-static void 
+static void
 __osm_ftree_delete(
          IN  void * context)
 {
-- 
1.5.1.4

From eli at mellanox.co.il  Thu Jul  5 06:55:22 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 05 Jul 2007 16:55:22 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
Message-ID: <1183643723.25031.262.camel@mtls03>

In UDP tests we have been running here, I noticed that when using a high
rate of UDP packets over ipoib, there are sometimes cases of packet
drop. Investigating further I found that the packets are dropped since
the socket buffer is exhausted and we fail in the following code:

net/core/sock.c

int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
	int err = 0;
	int skb_len;

	/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
	   number of warnings when compiling with -W --ANK
	 */
	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
	    (unsigned)sk->sk_rcvbuf) {
		err = -ENOMEM;
		goto out;
	}

In the condition above skb->truesize is about the same as the size
allocated for the skb; for small packets, this will charge the socket
far more than it actually consumed. I used the following patch to make
things better in this regard which passes up to the stack smaller skbs.
I am not saying this is the best way to handle this but I would like to
hear opinions as for how we should address this problem.
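
The effect of charging truesize rather than payload length can be
illustrated with a small user-space sketch. This is not kernel code:
struct fake_sock, would_drop() and packets_before_drop() are
hypothetical names introduced here, and only the comparison itself
mirrors the check in sock_queue_rcv_skb() quoted above.

```c
#include <assert.h>

/* Hypothetical stand-in for the two struct sock fields used by the check. */
struct fake_sock {
	unsigned int rmem_alloc;	/* bytes currently charged to the socket */
	int rcvbuf;			/* receive buffer limit in bytes */
};

/* Same comparison as in sock_queue_rcv_skb(): nonzero means "drop". */
static int would_drop(const struct fake_sock *sk, unsigned int truesize)
{
	return sk->rmem_alloc + truesize >= (unsigned int)sk->rcvbuf;
}

/*
 * Count how many packets, each charged 'truesize' bytes, can be queued
 * on an idle socket before the check above starts rejecting them.
 */
static unsigned int packets_before_drop(int rcvbuf, unsigned int truesize)
{
	struct fake_sock sk = { 0, rcvbuf };
	unsigned int queued = 0;

	assert(truesize > 0);	/* a zero charge would never fill the buffer */

	while (!would_drop(&sk, truesize)) {
		sk.rmem_alloc += truesize;
		queued++;
	}
	return queued;
}
```

With an assumed 65536-byte receive buffer, packets_before_drop(65536, 2304)
gives 28 while packets_before_drop(65536, 256) gives 255. The sizes are
made-up round numbers, not measurements from the tests described here; they
only show how quickly a large per-packet charge exhausts the buffer.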
Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c
===================================================================
--- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 16:54:56.000000000 +0300
+++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 17:10:32.000000000 +0300
@@ -50,6 +50,8 @@
 		 "Enable data path debug tracing if > 0");
 #endif
 
+#define SKB_LEN_THOLD 150
+
 static DEFINE_MUTEX(pkey_mutex);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
@@ -169,7 +171,7 @@
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
-	struct sk_buff *skb;
+	struct sk_buff *skb, *nskb;
 	u64 addr;
 
 	ipoib_dbg_data(priv, "recv completion: id %d, status: %d\n",
@@ -223,6 +225,19 @@
 		++priv->stats.rx_packets;
 		priv->stats.rx_bytes += skb->len;
 
+		if (skb->len < SKB_LEN_THOLD) {
+			nskb = dev_alloc_skb(skb->len);
+			if (!nskb) {
+				ipoib_warn(priv, "failed to allocate skb\n");
+				return;
+			}
+			memcpy(nskb->data, skb->data, skb->len);
+			skb_put(nskb, skb->len);
+			nskb->protocol = skb->protocol;
+			dev_kfree_skb_any(skb);
+			skb = nskb;
+		}
+
 		skb->dev = dev;
 		/* XXX get correct PACKET_ type here */
 		skb->pkt_type = PACKET_HOST;
@@ -350,7 +365,6 @@
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int n, i;
 
-	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	do {
 		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i) {
@@ -363,6 +377,7 @@
 				ipoib_ib_handle_tx_wc(dev, wc);
 		}
 	} while (n == IPOIB_NUM_WC);
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 }
 
 #endif

From ogerlitz at voltaire.com  Thu Jul  5 07:10:24 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 05 Jul 2007 17:10:24 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03>
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <468CFBD0.6040407@voltaire.com>

Eli Cohen wrote:
> I used the following patch to make things better in this regard which
> passes up to the
stack smaller skbs. I am not saying this is the best
> way to handle this but I would like to hear opinions as for how we
> should address this problem.
>
> Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> ===================================================================
> --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 16:54:56.000000000 +0300
> +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-05 17:10:32.000000000 +0300
> @@ -50,6 +50,8 @@
>  		 "Enable data path debug tracing if > 0");
>  #endif
>
> +#define SKB_LEN_THOLD 150
> +
>  static DEFINE_MUTEX(pkey_mutex);
>
>  struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
> @@ -169,7 +171,7 @@

can you resend the patch with function names appearing in each hunk (ie
after the @@ , use diff -p flag for that)

Or.

From mshefty at ichips.intel.com  Thu Jul  5 11:09:35 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 05 Jul 2007 11:09:35 -0700
Subject: [ofa-general] [GIT PULL] please pull rdma-dev.git for 2.6.23
In-Reply-To: <000801c7b9e2$03dfe220$3c98070a@amr.corp.intel.com>
References: <000801c7b9e2$03dfe220$3c98070a@amr.corp.intel.com>
Message-ID: <468D33DF.3030900@ichips.intel.com>

> Please pull:
>
>    git://git.openfabrics.org/~shefty/rdma-dev.git for-roland
>
> for 2.6.23.
This will pick up the following patches:

I'm guessing that you haven't gotten to these yet (no hurry), so I've
added two more patches that were posted to the list:

ib/cm: fix handling of duplicate SIDR REQs
http://lists.openfabrics.org/pipermail/general/2007-July/037677.html

ib/cm: send no match if SIDR REQ does not match a listen
http://lists.openfabrics.org/pipermail/general/2007-July/037678.html

At this point, I'm only anticipating one more patch for 2.6.23.

- Sean

From xhejtman at ics.muni.cz  Thu Jul  5 12:31:36 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Thu, 5 Jul 2007 21:31:36 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: 
References: <20070704125429.GL3885@ics.muni.cz>
Message-ID: <20070705193136.GQ3885@ics.muni.cz>

Hello,

On Thu, Jul 05, 2007 at 09:10:11AM -0700, Roland Dreier wrote:
> I don't personally have much use for it, but of course I would be
> happy to merge changes that make this work better.
>
> However, I would really prefer if we could have this discussion on
> general at lists.openfabrics.org instead of in private email; it's better
> for you too, because if I am too busy to answer then you may get an
> answer from someone else.  Anyway...

OK, I appended the address to the Cc.

> Are you getting these freezes when using Xen domU, or do you also see
> them with a normal kernel?  You said the card works "mostly OK" with
> dom0 -- what is not OK?

Well, in Dom0 the action:
modprobe ib_mthca
rmmod ib_mthca
modprobe ib_mthca

kills the machine. However, it is quite strange because it produces oops in
XFS (file system), for me, it looks like it does some memory corruption in the
kernel and basically I have the same problem in DomU where the same error is
induced by the first modprobe ib_mthca.

> How did you fix the device reset problem?

Xen in DomU does not let the device modify its address BARs, so after the
device reset the BARs are not restored. I've therefore modified the Xen
PCI backend to allow direct modification of the BARs if the device
operates in permissive mode.

Anyway, direct access to the PCI config space did not solve all the
problems. Modprobe ib_mthca does init_one up to (and including) init_hca.
In setup_hca it kills at least DomU, very often Dom0 as well, and
sometimes the whole machine, so that a physical power cycle is needed.

When it performs setup_hca, I can always see an oops in XFS in DomU.
Dmesg says that the driver could not write MTT.

Any thoughts?

-- 
Lukáš Hejtmánek

From eli at mellanox.co.il  Thu Jul  5 12:39:02 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 5 Jul 2007 22:39:02 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
References: <1183643723.25031.262.camel@mtls03> <468CFBD0.6040407@voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com>

> can you resend the patch with function names appearing in each hunk (ie after the @@ , use diff -p flag for that)
> Or.

Sure. It is attached now - sorry, but I am using Outlook from home :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: udp_drop.patch
Type: application/octet-stream
Size: 1638 bytes
Desc: udp_drop.patch
URL: 

From rdreier at cisco.com  Thu Jul  5 14:34:35 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 05 Jul 2007 14:34:35 -0700
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03> (Eli Cohen's message of "Thu, 05 Jul 2007 16:55:22 +0300")
References: <1183643723.25031.262.camel@mtls03>
Message-ID: 

Copying small packets into a new skb is actually a fairly
well-established optimization to avoid the overhead of allocating a new
skb.  For example look for RX_COPY_THRESHOLD in tg3 or copybreak in
e1000.  So this approach makes sense to me.
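
For readers unfamiliar with the copy-threshold idea, here is a minimal
user-space sketch of it. It is an illustration under stated assumptions,
not the tg3, e1000 or ipoib code: struct rx_buf, rx_buf_alloc(),
rx_buf_free(), rx_copybreak() and the COPY_THRESHOLD value are all
hypothetical names and numbers introduced for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COPY_THRESHOLD 256	/* assumed value; the right one is system-dependent */

/* Hypothetical stand-in for an skb: payload plus the allocation size. */
struct rx_buf {
	unsigned char *data;
	size_t len;		/* bytes of payload actually received */
	size_t truesize;	/* bytes allocated for the payload area */
};

static struct rx_buf *rx_buf_alloc(size_t size)
{
	struct rx_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(size ? size : 1);	/* avoid malloc(0) ambiguity */
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->len = 0;
	b->truesize = size;
	return b;
}

static void rx_buf_free(struct rx_buf *b)
{
	if (b) {
		free(b->data);
		free(b);
	}
}

/*
 * Returns the buffer to hand to the upper layer.  *repost is set to the
 * ring buffer when it can go straight back on the receive ring, or to
 * NULL when the ring buffer itself was passed up and the caller has to
 * allocate a replacement.
 */
static struct rx_buf *rx_copybreak(struct rx_buf *ring_buf,
				   struct rx_buf **repost)
{
	struct rx_buf *small;

	*repost = NULL;
	if (ring_buf->len >= COPY_THRESHOLD)
		return ring_buf;	/* large packet: pass it up as is */

	small = rx_buf_alloc(ring_buf->len);
	if (!small)
		return ring_buf;	/* allocation failed: fall back to no copy */

	memcpy(small->data, ring_buf->data, ring_buf->len);
	small->len = ring_buf->len;
	ring_buf->len = 0;
	*repost = ring_buf;		/* the large buffer is reused, not freed */
	return small;
}
```

In this sketch a 100-byte packet received into a 2048-byte buffer is
passed up in a 100-byte copy, so the receiver is charged for roughly what
it consumed, while the 2048-byte buffer is handed back for reuse.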
However, a few comments about your patch: > +#define SKB_LEN_THOLD 150 150 is probably not the right value; the cost of copying half a cacheline is probably nearly the same as a full cacheline, so this should probably be a multiple of 64 (or at least 32, since I don't know of any arch with smaller than 32-byte cachelines). With that said I don't know what the right value is here. 256 seems to be a popular choice; I guess it is system-dependent but I don't think it makes sense to add yet another knob to adjust this. > + if (skb->len < SKB_LEN_THOLD) { > + nskb = dev_alloc_skb(skb->len); > + if (!nskb) { > + ipoib_warn(priv, "failed to allocate skb\n"); > + return; > + } > + memcpy(nskb->data, skb->data, skb->len); should be skb_copy_from_linear_data() > + skb_put(nskb, skb->len); > + nskb->protocol = skb->protocol; > + dev_kfree_skb_any(skb); and there's no point in freeing the old skb... we should repost it to the receive queue instead. > + skb = nskb; > + } And I think we would want something similar for ipoib_cm.c too. Your patch also made me look again at how we handle packets the HCA replicates back to us... there's no reason to free the skb and allocate a new one; we could just repost the same skb again. So the patch below seems like it might help multicast senders. What do people think about putting this into 2.6.23? diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 8404f05..1094488 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -197,6 +197,13 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } /* + * Drop packets that this interface sent, ie multicast packets + * that the HCA has replicated. + */ + if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num) + goto repost; + + /* * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. 
*/ @@ -213,24 +220,18 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); - if (wc->slid != priv->local_lid || - wc->src_qp != priv->qp->qp_num) { - skb->protocol = ((struct ipoib_header *) skb->data)->proto; - skb_reset_mac_header(skb); - skb_pull(skb, IPOIB_ENCAP_LEN); + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); - dev->last_rx = jiffies; - ++priv->stats.rx_packets; - priv->stats.rx_bytes += skb->len; + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; - skb->dev = dev; - /* XXX get correct PACKET_ type here */ - skb->pkt_type = PACKET_HOST; - netif_receive_skb(skb); - } else { - ipoib_dbg_data(priv, "dropping loopback packet\n"); - dev_kfree_skb_any(skb); - } + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); repost: if (unlikely(ipoib_ib_post_receive(dev, wr_id))) From rdreier at cisco.com Thu Jul 5 14:43:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jul 2007 14:43:46 -0700 Subject: [ofa-general] Re: consumer data buffer ownership for inline sends In-Reply-To: (Or Gerlitz's message of "Tue, 3 Jul 2007 11:50:52 +0300 (IDT)") References: Message-ID: > Does this means that for inline sends, when ibv_post_send returns, > the consumer owns back the data buffer associated with this send? > > Can this be stated as the official policy of libibverbs? I guess that makes sense. I wonder if there's any conceivable interpretation of the inline send flag where the adapter might need to access the original buffer after the request is posted? - R. 
From changquing.tang at hp.com Thu Jul 5 14:48:10 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Thu, 5 Jul 2007 21:48:10 -0000 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <466E4168.2030206@mellanox.co.il> References: <466718AB.5050507@ichips.intel.com> <466E4168.2030206@mellanox.co.il> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net> Hi, uDAPL expert: We are testing OFED 1.2 uDAPL on a two IB cards system. All the cards are linked to the same fabric, IB Verbs works fine from one card to any other card. If we config all IPoIB-ib0 on the same network (172.200.0.x, 255.255.255) and IPoIB-ib1 on another network (172.200.1.x, 255.255.255.0), uDAPL works on all ib0, and works on all ib1 as well. However, if we config both ib0 and ib1 on the same network (172.200.0.x, 255.255.255.0), uDAPL works if all ranks use ib0, uDAPL fails if all ranks use ib1 with error code: DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after dat_connect() and dat_evd_wait()) The same error message if some ranks use ib0, some ranks use ib1. Thanks for providing solution for this issue, or any experience. 
--CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Tziporet Koren > Sent: Tuesday, June 12, 2007 1:47 AM > To: Arlin Davis > Cc: Vladimir Sokolovsky; OpenFabrics General > Subject: Re: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes > > Arlin Davis wrote: > > Vlad, please pull the latest OFED 1.2 release notes from uDAPL > > project (ofed_1_2 branch) > > > > dapl/doc/uDAPL_release_notes.txt > > > > Signed-off by: Arlin Davis ardavis at ichips.intel.com > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > done > Tziporet > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Jul 5 15:22:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 05 Jul 2007 15:22:12 -0700 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: <20070705193136.GQ3885@ics.muni.cz> (Lukas Hejtmanek's message of "Thu, 5 Jul 2007 21:31:36 +0200") References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> Message-ID: > Well, in Dom0 the action: > modprobe ib_mthca > rmmod ib_mthca > modprobe ib_mthca > > kills the machine. However, it is quite strange because it produces oops in > XFS (file system), for me, it looks like it does some memory corruption in the > kernel and basically I have the same problem in DomU where the same error is > induced by the first modprobe ib_mthca. Loading and unloading ib_mthca many times works fine on a non-Xen system. 
So there is something different about the Xen environment that is
causing a problem.  It could be a bug in mthca exposed by Xen (eg
improper use of the DMA mapping API or something like that).

Can you turn on all the memory debugging options like SLAB_DEBUG
etc. and see if it turns up anything?

Also I'd be curious to see the exact XFS oops you're getting, since it
might have a clue to what's going on.

 - R.

From mshefty at ichips.intel.com  Thu Jul  5 15:30:18 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 05 Jul 2007 15:30:18 -0700
Subject: [ofa-general] Re: running cmpost in libibcm.
In-Reply-To: <003b01c7bf51$901cef20$b056cd60$@com>
References: <003b01c7bf51$901cef20$b056cd60$@com>
Message-ID: <468D70FA.1050409@ichips.intel.com>

Copying general OFA mail list.

> We’ve been trying to run the ‘cmpost’ example under the libibcm project.
>
> Here’s the scenario:
>
> Device Under Test: Mellanox HCA with 2 ports. ofed-rc6 stack installed.
> SM is disabled on the HCA
>
> Tester Device: A test implemented on the Agilent Infiniband Generator.
> Test brings up the subnet (has SM), and then sends a basic CM Connect
> Request with Service ID = DTA Service ID (0x20)
>
> Observation:
>
> 1.  Even when cmpost is not running, the OFED stack appears to send
> a ConnectReject with reason: Invalid Service ID. So clearly it looks
> like there is some kind of CM service running just on device boot up.
> How do we disable this?

The REJ is sent automatically by the kernel ib_cm module when a REQ is
received that does not match a listen.  You can 'disable' this by
unloading the ib_cm module, but this will disable the IB CM.

> 2.  cmpost in client mode fails to send the REQ packet

Given that observation 1 is occurring, a REQ packet must be being sent
by someone.  (The generator?)  Note that even though cmpost connects
over the libibcm directly, it still uses the librdmacm to obtain path
record information.
What response do you eventually get back from the client side version of cmpost? > 3. cmpost in server mode fails to receive a REQ packet sent from > Agilent Generator. Cmpost seems to block on ‘ib_cm_get_event’ and does > not receive the incoming REQ. I would verify that the service ID generated by the tester and the one used by cmpost match (i.e. same endianness). - Sean From ardavis at ichips.intel.com Thu Jul 5 16:05:56 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 05 Jul 2007 16:05:56 -0700 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net> References: <466718AB.5050507@ichips.intel.com> <466E4168.2030206@mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net> Message-ID: <468D7954.3060303@ichips.intel.com> Tang, Changqing wrote: > However, if we config both ib0 and ib1 on the same network >(172.200.0.x, 255.255.255.0), uDAPL >works if all ranks use ib0, uDAPL fails if all ranks use ib1 with error >code: > DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after >dat_connect() and dat_evd_wait()) > >The same error message if some ranks use ib0, some ranks use ib1. > > What does your /etc/dat.conf look like? What is the listening port on each interface and what address/port are you using for each connection? Also, can you run ucmatose to verify rdma_cma is working correctly across each interface?
For example: start a server on both interfaces (I am assuming 172.200.0.1 and 172.200.0.2) ucmatose -b 172.200.0.1 ucmatose -b 172.200.0.2 start a client on each interface on the other system ucmatose -s 172.200.0.1 ucmatose -s 172.200.0.2 Thanks, -arlin From sean.hefty at intel.com Thu Jul 5 16:39:06 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jul 2007 16:39:06 -0700 Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm Message-ID: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com> I'm looking for input on different options for handling listeners in the rdma_cm that are slow to respond to connection requests. Some options (in no particular order): * Expose a call similar to ib_send_cm_mra (rdma_ack_connect?). * Adapt an existing call for this purpose (rdma_notify? rdma_listen?). * Have the rdma_cm always send an MRA. * Add code to the rdma_cm to queue MRA responses, which would be sent after a specific timeout has occurred, if the connection had not yet already be accepted or rejected. * Add a call to the ib_cm to send an MRA, but only if a duplicate REQ is received before the original REQ has been processed (ib_set_cm_mra?). * Make the CMA_CM_RESPONSE_TIMEOUT a module parameter. Any thoughts or comments? - Sean From swise at opengridcomputing.com Thu Jul 5 16:58:16 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 05 Jul 2007 18:58:16 -0500 Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm In-Reply-To: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com> References: <000301c7bf5d$ae0d00e0$9c98070a@amr.corp.intel.com> Message-ID: <468D8598.3070700@opengridcomputing.com> This is all IB-specific, correct? Sean Hefty wrote: > I'm looking for input on different options for handling listeners in the rdma_cm > that are slow to respond to connection requests. > > Some options (in no particular order): > > * Expose a call similar to ib_send_cm_mra (rdma_ack_connect?). 
> > * Adapt an existing call for this purpose (rdma_notify? rdma_listen?). > > * Have the rdma_cm always send an MRA. > > * Add code to the rdma_cm to queue MRA responses, which would be sent after a > specific timeout has occurred, if the connection had not yet already be accepted > or rejected. > > * Add a call to the ib_cm to send an MRA, but only if a duplicate REQ is > received before the original REQ has been processed (ib_set_cm_mra?). > > * Make the CMA_CM_RESPONSE_TIMEOUT a module parameter. > > Any thoughts or comments? > > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Jul 5 17:04:06 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 5 Jul 2007 17:04:06 -0700 Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm In-Reply-To: <468D8598.3070700@opengridcomputing.com> Message-ID: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com> >This is all IB-specific, correct? I don't think iWarp has this issue, so, yes. (With IB, a slow listener will cause connections to timeout waiting for an accept.) If this is the case, then I'd like to keep the solution within the IB related code. My preference is not to expose / modify the rdma_cm APIs. - Sean From swise at opengridcomputing.com Thu Jul 5 18:38:31 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 05 Jul 2007 20:38:31 -0500 Subject: [ofa-general] [RFC] handling slow listeners in rdma_cm In-Reply-To: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com> References: <000401c7bf61$2ba428a0$9c98070a@amr.corp.intel.com> Message-ID: <468D9D17.6@opengridcomputing.com> Sean Hefty wrote: >> This is all IB-specific, correct? > > I don't think iWarp has this issue, so, yes. 
(With IB, a slow listener will > cause connections to timeout waiting for an accept.) > There's no way to specify this at the server side for TCP. Its up to the client in TCP to wait "long enough". :-) > If this is the case, then I'd like to keep the solution within the IB related > code. My preference is not to expose / modify the rdma_cm APIs. > I agree. From changquing.tang at hp.com Thu Jul 5 19:10:48 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Fri, 6 Jul 2007 02:10:48 -0000 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <468D7954.3060303@ichips.intel.com> References: <466718AB.5050507@ichips.intel.com> <466E4168.2030206@mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA840301BC9A3C@G3W0634.americas.hpqcorp.net> <468D7954.3060303@ichips.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net> > > However, if we config both ib0 and ib1 on the same network > >(172.200.0.x, 255.255.255.0), uDAPL works if all ranks use > ib0, uDAPL > >fails if all ranks use ib1 with error > >code: > > DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after > >dat_connect() and dat_evd_wait()) > > > >The same error message if some ranks use ib0, some ranks use ib1. > > > > > > What does your /etc/dat.conf look like? What is the listening > port on each interface and what address/port are you using > for each connection? 
/etc/dat.conf is the default file after installation: OpenIB-cma u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" "" OpenIB-cma-1 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib1 0" "" OpenIB-cma-2 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib2 0" "" OpenIB-cma-3 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib3 0" "" OpenIB-bond u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" "" however, we only configure ib0 and ib1: mpixbl05:/nis.home/ctang:/sbin/ifconfig ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.200.0.5 Bcast:172.200.0.255 Mask:255.255.255.0 inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:2118 errors:0 dropped:0 overruns:0 frame:0 TX packets:84 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:217135 (212.0 KiB) TX bytes:10854 (10.5 KiB) ib1 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.200.0.11 Bcast:172.200.0.255 Mask:255.255.255.0 inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:2090 errors:0 dropped:0 overruns:0 frame:0 TX packets:57 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:215361 (210.3 KiB) TX bytes:9072 (8.8 KiB) The listening port (conn_qual) is 1024 for the first rank using first card (ib0), and 1025 for the second rank using second card (ib1). address is the "ia_attr->ia_address_ptr" Eventhough I force all ranks only using the first card (ib0), it works for a while and then fails with NON_PEER_REJECTED when one rank tries to connect to another rank (dat_connect() and dat_evd_wait()). 
(I run a simple MPI job in an infinite loop, it fails after hundreds runs); > > Also, can you run ucmatose to verify rdma_cma is working > correctly across each interface? It works on the first card (ib0), failed on the second card (ib1) on mpixbl05, ib0 is "net addr:172.200.0.5 Bcast:172.200.0.255 Mask:255.255.255.0" ib1 is "inet addr:172.200.0.11 Bcast:172.200.0.255 Mask:255.255.255.0 from mpixbl06, I can ping both IPs: mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.11 PING 172.200.0.11 (172.200.0.11) 56(84) bytes of data. 64 bytes from 172.200.0.11: icmp_seq=1 ttl=64 time=3.50 ms 64 bytes from 172.200.0.11: icmp_seq=2 ttl=64 time=0.034 ms 64 bytes from 172.200.0.11: icmp_seq=3 ttl=64 time=0.029 ms --- 172.200.0.11 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2001ms rtt min/avg/max/mdev = 0.029/1.189/3.504/1.636 ms mpixbl06:/net/mpixbl06/lscratch/ctang/test: mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.5 PING 172.200.0.5 (172.200.0.5) 56(84) bytes of data. 
64 bytes from 172.200.0.5: icmp_seq=1 ttl=64 time=0.772 ms 64 bytes from 172.200.0.5: icmp_seq=2 ttl=64 time=0.038 ms 64 bytes from 172.200.0.5: icmp_seq=3 ttl=64 time=0.030 ms --- 172.200.0.5 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.030/0.280/0.772/0.347 ms mpixbl06:/net/mpixbl06/lscratch/ctang/test: But ucmatose works on ib0: mpixbl05:/nis.home/ctang:ucmatose -b 172.200.0.5 cmatose: starting server initiating data transfers completing sends receiving data transfers data transfers complete cmatose: disconnecting disconnected test complete return status 0 mpixbl05:/nis.home/ctang: mpixbl06:/lscratch/ctang/mpi2251:ucmatose -s 172.200.0.5 cmatose: starting client cmatose: connecting receiving data transfers sending replies data transfers complete test complete return status 0 mpixbl06:/lscratch/ctang/mpi2251: It fails on ib1: mpixbl05:/net/mpixbl06/lscratch/ctang/test:ucmatose -b 172.200.0.11 cmatose: starting server mpixbl06:/net/mpixbl06/lscratch/ctang/test:ucmatose -s 172.200.0.11 cmatose: starting client cmatose: connecting cmatose: event: 8, error: 0 receiving data transfers sending replies data transfers complete test complete return status 0 mpixbl06:/net/mpixbl06/lscratch/ctang/test: --CQ > > For example: > > start a server on both interfaces (I am assuming 172.200.0.1 and > 172.200.0.2) > > ucmatose -b 172.200.0.1 > ucmatose -b 172.200.0.2 > > start a client on each interface on the other system > > ucmatose -s 172.200.0.1 > ucmatose -s 172.200.0.2 > > Thanks, > > -arlin >
From vlad at lists.openfabrics.org Fri Jul 6 02:46:13 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 6 Jul 2007 02:46:13 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070706-0200 daily build status Message-ID: <20070706094613.ECD5AE60881@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on
ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From sashak at voltaire.com Fri Jul 6 05:12:23 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 6 Jul 2007 15:12:23 +0300 Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs In-Reply-To: <468CA13B.2040900@dev.mellanox.co.il> References: <468CA13B.2040900@dev.mellanox.co.il> Message-ID: <20070706121223.GA7555@sashak.voltaire.com> Hi Yevgeny, On 10:43 Thu 05 Jul , Yevgeny Kliteynik wrote: > Hi Hal, > > opensm.fdbs dump function adaptation to the recent changes in min hop tables > broke fat-tree routing (or any other future routing that may not use the > same > min hop tables creation functions). Could you please explain how this dump function break the routing for fat-tree? Thanks. 
Sasha > > -- Yevgeny > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_ucast_mgr.c | 33 ++++++++++++++++++++++++--------- > 1 files changed, 24 insertions(+), 9 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 5bcb655..cab272e 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution( > > /********************************************************************** > **********************************************************************/ > + > static void > __osm_ucast_mgr_dump_ucast_routes( > IN cl_map_item_t *p_map_item, > @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes( > uint8_t best_port; > uint16_t max_lid_ho; > uint16_t lid_ho, base_lid; > + boolean_t direct_route_exists = FALSE; > osm_switch_t* p_sw = (osm_switch_t *)p_map_item; > osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; > FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; > @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes( > */ > if( p_port->p_node->sw ) > { > + /* Target LID is switch. > + Get its base lid and check hop count for this base LID only.*/ > base_lid = osm_node_get_base_lid(p_port->p_node, 0); > base_lid = cl_ntoh16(base_lid); > num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num ); > } > else > { > - osm_physp_t *p_physp = p_port->p_physp; > - if( !p_physp || !p_physp->p_remote_physp || > - !p_physp->p_remote_physp->p_node->sw ) > - num_hops = OSM_NO_PATH; > + /* Target LID is not switch (CA or router). 
> + Check if we have route to this target from current switch.*/ > + num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); > + if (num_hops != OSM_NO_PATH) > + { > + direct_route_exists = TRUE; > + base_lid = lid_ho; > + } > else > { > - base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, > 0); > - base_lid = cl_ntoh16(base_lid); > - num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? > - 0 : osm_switch_get_hop_count( p_sw, base_lid, port_num > ); > + osm_physp_t *p_physp = p_port->p_physp; > + if( !p_physp || !p_physp->p_remote_physp || > + !p_physp->p_remote_physp->p_node->sw ) > + num_hops = OSM_NO_PATH; > + else > + { > + base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, > 0); > + base_lid = cl_ntoh16(base_lid); > + num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? > + 0 : osm_switch_get_hop_count( p_sw, base_lid, port_num > ); > + } > } > } > > @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes( > } > > best_hops = osm_switch_get_least_hops( p_sw, base_lid ); > - if (!p_port->p_node->sw) > + if (!p_port->p_node->sw && !direct_route_exists) > { > best_hops++; > num_hops++; > -- > 1.5.1.4 > > From halr at voltaire.com Fri Jul 6 06:00:27 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jul 2007 09:00:27 -0400 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release In-Reply-To: <1183566887.16081.126.camel@firewall.xsintricity.com> References: <1183124231.28870.268894.camel@hal.voltaire.com> <1183566887.16081.126.camel@firewall.xsintricity.com> Message-ID: <1183726824.25217.69120.camel@hal.voltaire.com> On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote: > > There is a new release of the management libraries which include the > > ANSIfied header files available in: > > > > http://www.openfabrics.org/~halr/ > > > > md5sum > > a5b884775ed069da09ca0b60bfda3239 libibcommon-1.0.4.tar.gz > > 288b865a0015ac3251cffa011a7633eb 
libibumad-1.0.6.tar.gz > > 04a5b6dcd2ee930f44d5715ee013f78b libibmad-1.0.6.tar.gz > > Hey Hal, I noticed you have release tarballs there for the libs, and one > for the older named openib-diags. What would it take to get a release > tarball for infiniband-diags and one for opensm? We're not quite there yet; There are a couple of outstanding items: OpenSM (master) does not yet pass all the regressions, and I'd like libibumad to support the upcoming user_mad ABI change for partition support. After these are resolved, I think that a release of these would then be in order. Hopefully, this can be in the next few weeks. -- Hal From swise at opengridcomputing.com Fri Jul 6 07:53:28 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 06 Jul 2007 09:53:28 -0500 Subject: [ofa-general] [GIT PULL ofed_1_2] iw_cxgb3 - Don't allow interrupts while obtaining the ctrl-qp mutex. Message-ID: <468E5768.7090200@opengridcomputing.com> Vlad, Please pull from git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 This patch fixes bug 681. Below is the patch. Steve. -------- Original Message -------- Subject: [PATCH] Don't allow interrupts while obtaining the ctrl-qp mutex. Date: Fri, 06 Jul 2007 09:47:57 -0500 From: Steve Wise To: swise at opengridcomputing.com Don't allow interrupts while obtaining the ctrl-qp mutex. 
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c index 9746635..dc4a385 100644 --- a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -729,7 +729,7 @@ static int __cxio_tpt_op(struct cxio_rde } } - down_interruptible(&rdev_p->ctrl_qp.sem); + down(&rdev_p->ctrl_qp.sem); /* write PBL first if any - update pbl only if pbl list exist */ if (pbl) { From sean.hefty at intel.com Fri Jul 6 09:48:02 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 6 Jul 2007 09:48:02 -0700 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net> Message-ID: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> >Eventhough I force all ranks only using the first card (ib0), it works >for a while and >then fails with NON_PEER_REJECTED when one rank tries to connect to >another rank (dat_connect() >and dat_evd_wait()). (I run a simple MPI job in an infinite loop, it >fails after hundreds runs); This sounds like it could be a race condition as a result of running the test in a loop. If the client starts before the server is listening, it will receive this sort of reject event. >It works on the first card (ib0), failed on the second card (ib1) Please take a look at the following thread: http://lists.openfabrics.org/pipermail/general/2007-May/036559.html In particular, see Steve's message about this: http://lists.openfabrics.org/pipermail/general/2007-May/036571.html and let me know if his suggestion fixes your problem. I will update the librdmacm documentation with this information as well. 
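The race described above (a client that connects before the server is listening gets a reject) is often worked around by retrying the connect with a growing delay. A minimal sketch of that idea follows; it is not taken from uDAPL or librdmacm. The names connect_with_retry, connect_fn, fake_server and fake_connect are hypothetical, and the callback stands in for whatever connect-and-wait step the application uses (for example dat_connect() followed by dat_evd_wait()).

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict -std=c11 */
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback: returns true when the connection was accepted,
 * false when it was rejected (for instance because nobody is listening yet). */
typedef bool (*connect_fn)(void *ctx);

/* Try to connect up to max_attempts times, doubling the delay after each
 * rejection. Returns the 1-based attempt number that succeeded, or -1 if
 * every attempt was rejected. */
static int connect_with_retry(connect_fn try_connect, void *ctx,
                              int max_attempts, long delay_ms)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_connect(ctx))
            return attempt;
        if (attempt < max_attempts) {
            struct timespec ts = { delay_ms / 1000,
                                   (delay_ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);   /* give the server time to start listening */
            delay_ms *= 2;          /* simple exponential backoff */
        }
    }
    return -1;
}

/* Stand-in for a server that rejects the first few requests because it
 * has not started listening yet. Purely illustrative. */
struct fake_server { int rejects_left; };

static bool fake_connect(void *ctx)
{
    struct fake_server *s = ctx;
    if (s->rejects_left > 0) {
        s->rejects_left--;
        return false;
    }
    return true;
}
```

Whether a retry is appropriate depends on the application; a reject that persists across retries points at a configuration problem rather than a startup race.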
- Sean From changquing.tang at hp.com Fri Jul 6 10:38:10 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Fri, 6 Jul 2007 17:38:10 -0000 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net> <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA034@G3W0634.americas.hpqcorp.net> Sean: Thanks for the information. The interesting thing is that I ran OFED 1.2 uDAPL on another single-card system, and it works reliably (thousands of runs without error); both systems have the same OS bits and driver bits. --CQ > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Friday, July 06, 2007 11:48 AM > To: Tang, Changqing; Arlin Davis > Cc: Vladimir Sokolovsky; OpenFabrics General > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes > > >Eventhough I force all ranks only using the first card > (ib0), it works > >for a while and then fails with NON_PEER_REJECTED when one > rank tries > >to connect to another rank (dat_connect() and > dat_evd_wait()). (I run a > >simple MPI job in an infinite loop, it fails after hundreds runs); > > This sounds like it could be a race condition as a result of > running the test in a loop. If the client starts before the > server is listening, it will receive this sort of reject event. > > >It works on the first card (ib0), failed on the second card (ib1) > > Please take a look at the following thread: > > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html > > In particular, see Steve's message about this: > > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html > > and let me know if his suggestion fixes your problem. > > I will update the librdmacm documentation with this > information as well.
> > - Sean > From changquing.tang at hp.com Fri Jul 6 11:08:26 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Fri, 6 Jul 2007 18:08:26 -0000 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net> <000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net> Sean: Thanks, I think this solves our problem. Currently the two cards are on different subnets. Code on either subnet is working reliably. I have not yet tried with all cards on the same subnet. Do you recommend configuring a single subnet or two subnets? --CQ > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Friday, July 06, 2007 11:48 AM > To: Tang, Changqing; Arlin Davis > Cc: Vladimir Sokolovsky; OpenFabrics General > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes > > >Eventhough I force all ranks only using the first card > (ib0), it works > >for a while and then fails with NON_PEER_REJECTED when one > rank tries > >to connect to another rank (dat_connect() and > dat_evd_wait()). (I run a > >simple MPI job in an infinite loop, it fails after hundreds runs); > > This sounds like it could be a race condition as a result of > running the test in a loop. If the client starts before the > server is listening, it will receive this sort of reject event. > > >It works on the first card (ib0), failed on the second card (ib1) > > Please take a look at the following thread: > > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html > > In particular, see Steve's message about this: > > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html > > and let me know if his suggestion fixes your problem. > > I will update the librdmacm documentation with this > information as well.
> > - Sean > From dledford at redhat.com Fri Jul 6 11:33:30 2007 From: dledford at redhat.com (Doug Ledford) Date: Fri, 06 Jul 2007 14:33:30 -0400 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release In-Reply-To: <1183726824.25217.69120.camel@hal.voltaire.com> References: <1183124231.28870.268894.camel@hal.voltaire.com> <1183566887.16081.126.camel@firewall.xsintricity.com> <1183726824.25217.69120.camel@hal.voltaire.com> Message-ID: <1183746810.5165.37.camel@firewall.xsintricity.com> On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote: > On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: > > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote: > > > There is a new release of the management libraries which include the > > > ANSIfied header files available in: > > > > > > http://www.openfabrics.org/~halr/ > > > > > > md5sum > > > a5b884775ed069da09ca0b60bfda3239 libibcommon-1.0.4.tar.gz > > > 288b865a0015ac3251cffa011a7633eb libibumad-1.0.6.tar.gz > > > 04a5b6dcd2ee930f44d5715ee013f78b libibmad-1.0.6.tar.gz > > > > Hey Hal, I noticed you have release tarballs there for the libs, and one > > for the older named openib-diags. What would it take to get a release > > tarball for infiniband-diags and one for opensm? > > We're not quite there yet; There are a couple of outstanding items: > OpenSM (master) does not yet pass all the regressions, and I'd like > libibumad to support the upcoming user_mad ABI change for partition > support. After these are resolved, I think that a release of these would > then be in order. Hopefully, this can be in the next few weeks. It doesn't need to be a new release. Just a tarball from any previous stable release will work. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From changquing.tang at hp.com Fri Jul 6 12:00:30 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Fri, 6 Jul 2007 19:00:30 -0000 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net><000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net> Sean: I have 6 nodes with two IB cards on each node. If I configure the first card on all nodes as one subnet, the second card on all nodes as another subnet, Plus set arp_ignore=2, jobs on first subnet, or second subnet work fine. But when I configure all 12 cards into a single subnet, jobs on all first cards work fine, job on all second cards hangs. 
Here is one node IP info: ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.200.0.5 Bcast:172.200.0.255 Mask:255.255.255.0 inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:12375 errors:0 dropped:0 overruns:0 frame:0 TX packets:155 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:1293846 (1.2 MiB) TX bytes:16008 (15.6 KiB) ib1 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.200.0.11 Bcast:172.200.0.255 Mask:255.255.255.0 inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 RX packets:12299 errors:0 dropped:0 overruns:0 frame:0 TX packets:155 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:1280105 (1.2 MiB) TX bytes:25117 (24.5 KiB) Do you have any idea what's wrong ? Thanks. --CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Tang, Changqing > Sent: Friday, July 06, 2007 1:08 PM > To: Sean Hefty; Arlin Davis > Cc: Vladimir Sokolovsky; OpenFabrics General > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes > > > Sean: > Thanks, I think this solve our problem. Currently two > cards are on different subnet. Code on either subnet is > working reliablely. I have not tried if all cards are on the > same subnet. > > Do you recommend to config as a single subnet or two subnets ? 
> > > --CQ > > > -----Original Message----- > > From: Sean Hefty [mailto:sean.hefty at intel.com] > > Sent: Friday, July 06, 2007 11:48 AM > > To: Tang, Changqing; Arlin Davis > > Cc: Vladimir Sokolovsky; OpenFabrics General > > Subject: RE: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes > > > > >Eventhough I force all ranks only using the first card > > (ib0), it works > > >for a while and then fails with NON_PEER_REJECTED when one > > rank tries > > >to connect to another rank (dat_connect() and > > dat_evd_wait()). (I run a > > >simple MPI job in an infinite loop, it fails after hundreds runs); > > > > This sounds like it could be a race condition as a result > of running > > the test in a loop. If the client starts before the server is > > listening, it will receive this sort of reject event. > > > > >It works on the first card (ib0), failed on the second card (ib1) > > > > Please take a look at the following thread: > > > > http://lists.openfabrics.org/pipermail/general/2007-May/036559.html > > > > In particular, see Steve's message about this: > > > > http://lists.openfabrics.org/pipermail/general/2007-May/036571.html > > > > and let me know if his suggestion fixes your problem. > > > > I will update the librdmacm documentation with this information as > > well. > > > > - Sean > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From arthur.jones at qlogic.com Fri Jul 6 12:48:17 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:17 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for 2.6.23 Message-ID: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> hi roland, here is the latest set of patches for 2.6.23. 
this set should address all your comments except the ppc ioremap flags issue (which is still being worked on). the barrier patch now has comments and the bad code that benh pointed out has been eliminated by removing support for older non-production HTX cards. these patches are avail to pull from: git://git.qlogic.com/ipath-linux-2.6 for-roland nb: when i tried pulling into a for-2.6.23 branch in your repo, i got three trivial merge conflicts (take the new stuff). plz let me know if you would rather i re-base these to your tree... arthur From arthur.jones at qlogic.com Fri Jul 6 12:48:23 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:23 -0700 Subject: [ofa-general] [PATCH 1/8] IB/ipath - add barrier before updating WC head in shared memory In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194822.9093.32572.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell Add a barrier to make sure the CPU doesn't reorder writes to memory since user programs can be polling on the head index update and the entry should be written before that. 
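[Illustrative note, not part of the patch.] The ordering requirement described above can be sketched in user space with C11 atomics. This is only an analogue under stated assumptions: the driver itself uses the kernel's smp_wmb() and smp_rmb(), whereas the sketch swaps in release/acquire operations, and the names ring, ring_push, ring_poll and RING_SIZE are hypothetical and do not appear in the ipath code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical depth; one slot is always left empty */

struct ring {
    uint64_t entries[RING_SIZE];
    _Atomic uint32_t head; /* advanced only by the producer */
    _Atomic uint32_t tail; /* advanced only by the consumer */
};

/* Producer side: fill the slot first, then publish it by moving head. */
static bool ring_push(struct ring *r, uint64_t value)
{
    assert(r != NULL);

    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t next = (head + 1) % RING_SIZE;

    /* Acquire pairs with the consumer's release store to tail. */
    if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
        return false; /* ring is full */

    r->entries[head] = value;

    /*
     * Release store: the entry write above becomes visible before the
     * new head value. This plays the role of smp_wmb() followed by the
     * plain head update in the patch.
     */
    atomic_store_explicit(&r->head, next, memory_order_release);
    return true;
}

/* Consumer side: read head first, then read the entry it covers. */
static bool ring_poll(struct ring *r, uint64_t *out)
{
    assert(r != NULL && out != NULL);

    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

    /*
     * Acquire load: the entry read below cannot be moved ahead of this
     * head read. This plays the role of the head check followed by
     * smp_rmb() in the patch.
     */
    if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return false; /* ring is empty */

    *out = r->entries[tail];
    atomic_store_explicit(&r->tail, (tail + 1) % RING_SIZE,
                          memory_order_release);
    return true;
}
```

Without the release/acquire pairing (or the barriers in the kernel version), a poller could observe the advanced head and still read a stale entry, which is the failure the patch description is guarding against.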
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_cq.c | 5 ++++- drivers/infiniband/hw/ipath/ipath_ruc.c | 2 ++ drivers/infiniband/hw/ipath/ipath_srq.c | 2 ++ drivers/infiniband/hw/ipath/ipath_ud.c | 2 ++ drivers/infiniband/hw/ipath/ipath_verbs.c | 2 ++ 5 files changed, 12 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 9014ef6..a6f04d2 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -90,6 +90,8 @@ void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int solicited) wc->queue[head].sl = entry->sl; wc->queue[head].dlid_path_bits = entry->dlid_path_bits; wc->queue[head].port_num = entry->port_num; + /* Make sure queue entry is written before the head index. */ + smp_wmb(); wc->head = next; if (cq->notify == IB_CQ_NEXT_COMP || @@ -139,7 +141,8 @@ int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) if (tail == wc->head) break; - + /* Make sure entry is read after head index is read. */ + smp_rmb(); qp = ipath_lookup_qpn(&to_idev(cq->ibcq.device)->qp_table, wc->queue[tail].qp_num); entry->qp = &qp->ibqp; diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 854deb5..8525674 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -194,6 +194,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) ret = 0; goto bail; } + /* Make sure entry is read after head index is read. 
*/ + smp_rmb(); wqe = get_rwqe_ptr(rq, tail); if (++tail >= rq->size) tail = 0; diff --git a/drivers/infiniband/hw/ipath/ipath_srq.c b/drivers/infiniband/hw/ipath/ipath_srq.c index 14cbbd6..40c36ec 100644 --- a/drivers/infiniband/hw/ipath/ipath_srq.c +++ b/drivers/infiniband/hw/ipath/ipath_srq.c @@ -80,6 +80,8 @@ int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr, wqe->num_sge = wr->num_sge; for (i = 0; i < wr->num_sge; i++) wqe->sg_list[i] = wr->sg_list[i]; + /* Make sure queue entry is written before the head index. */ + smp_wmb(); wq->head = next; spin_unlock_irqrestore(&srq->rq.lock, flags); } diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 38ba771..f9a3338 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -176,6 +176,8 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, dev->n_pkt_drops++; goto bail_sge; } + /* Make sure entry is read after head index is read. */ + smp_rmb(); wqe = get_rwqe_ptr(rq, tail); if (++tail >= rq->size) tail = 0; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 5aa8866..65f7181 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -327,6 +327,8 @@ static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, wqe->num_sge = wr->num_sge; for (i = 0; i < wr->num_sge; i++) wqe->sg_list[i] = wr->sg_list[i]; + /* Make sure queue entry is written before the head index. 
*/ + smp_wmb(); wq->head = next; spin_unlock_irqrestore(&qp->r_rq.lock, flags); } From arthur.jones at qlogic.com Fri Jul 6 12:48:28 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:28 -0700 Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> Bryan is no longer with QLogic and we now have a public git server and a public email alias for infinipath driver patches. Signed-off-by: Arthur Jones --- MAINTAINERS | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 23a04f4..32f5701 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1989,9 +1989,10 @@ M: jjciarla at raiz.uncu.edu.ar S: Maintained IPATH DRIVER: -P: Bryan O'Sullivan -M: support at pathscale.com +P: Arthur Jones +M: infinipath at qlogic.com L: openib-general at openib.org +T: git git://git.qlogic.com/ipath-linux-2.6 S: Supported IPMI SUBSYSTEM From arthur.jones at qlogic.com Fri Jul 6 12:48:33 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:33 -0700 Subject: [ofa-general] [PATCH 3/8] IB/ipath - Further abstract coming out of freeze mode, and be even more cautious In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194833.9093.53640.stgit@eng-46.internal.keyresearch.com> From: Dave Olson We are more careful to be sure that we don't lose information about changes that occurred while we were in freeze mode, when the chip will not notify us, and try to avoid false error interrupts while doing cleanup. 
Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_iba6110.c | 11 ---- drivers/infiniband/hw/ipath/ipath_iba6120.c | 11 ---- drivers/infiniband/hw/ipath/ipath_intr.c | 77 +++++++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_kernel.h | 1 4 files changed, 80 insertions(+), 20 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index 87b18e9..fdfa95d 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -509,16 +509,7 @@ static void ipath_ht_handle_hwerrors(struct ipath_devdata *dd, char *msg, if (!hwerrs) { ipath_dbg("Clearing freezemode on ignored or " "recovered hardware error\n"); - /* - * clear all sends, becauase they have may been - * completed by usercode while in freeze mode, and - * therefore would not be sent, and eventually - * might cause the process to run out of bufs - */ - ipath_cancel_sends(dd); - ctrl &= ~INFINIPATH_C_FREEZEMODE; - ipath_write_kreg(dd, dd->ipath_kregs->kr_control, - ctrl); + ipath_clear_freeze(dd); } } diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index e67e4a8..9868ccd 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -435,16 +435,7 @@ static void ipath_pe_handle_hwerrors(struct ipath_devdata *dd, char *msg, freeze_cnt++; ipath_dbg("Clearing freezemode on ignored or recovered " "hardware error (%u)\n", freeze_cnt); - /* - * clear all sends, becauase they have may been - * completed by usercode while in freeze mode, and - * therefore would not be sent, and eventually - * might cause the process to run out of bufs - */ - ipath_cancel_sends(dd); - ctrl &= ~INFINIPATH_C_FREEZEMODE; - ipath_write_kreg(dd, dd->ipath_kregs->kr_control, - dd->ipath_control); + ipath_clear_freeze(dd); } } diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c 
index e86a23a..ce49023 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -133,6 +133,17 @@ void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) INFINIPATH_E_INVALIDADDR) /* + * this is similar to E_SUM_ERRS, but can't ignore armlaunch, don't ignore + * errors not related to freeze and cancelling buffers. Can't ignore + * armlaunch because could get more while still cleaning up, and need + * to cancel those as they happen. + */ +#define E_SPKT_ERRS_IGNORE \ + (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT | \ + INFINIPATH_E_SMAXPKTLEN | INFINIPATH_E_SMINPKTLEN | \ + INFINIPATH_E_SPKTLEN) + +/* * these are errors that can occur when the link changes state while * a packet is being sent or received. This doesn't cover things * like EBP or VCRC that can be the result of a sending having the @@ -760,6 +771,72 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) return chkerrpkts; } + +/* + * try to cleanup as much as possible for anything that might have gone + * wrong while in freeze mode, such as pio buffers being written by user + * processes (causing armlaunch), send errors due to going into freeze mode, + * etc., and try to avoid causing extra interrupts while doing so. + * Forcibly update the in-memory pioavail register copies after cleanup + * because the chip won't do it for anything changing while in freeze mode + * (we don't want to wait for the next pio buffer state change). + * Make sure that we don't lose any important interrupts by using the chip + * feature that says that writing 0 to a bit in *clear that is set in + * *status will cause an interrupt to be generated again (if allowed by + * the *mask value). 
+ */ +void ipath_clear_freeze(struct ipath_devdata *dd) +{ + int i, im; + __le64 val; + + /* disable error interrupts, to avoid confusion */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL); + + /* + * clear all sends, because they have may been + * completed by usercode while in freeze mode, and + * therefore would not be sent, and eventually + * might cause the process to run out of bufs + */ + ipath_cancel_sends(dd); + ipath_write_kreg(dd, dd->ipath_kregs->kr_control, + dd->ipath_control); + + /* ensure pio avail updates continue */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); + + /* + * We just enabled pioavailupdate, so dma copy is almost certainly + * not yet right, so read the registers directly. Similar to init + */ + for (i = 0; i < dd->ipath_pioavregs; i++) { + /* deal with 6110 chip bug */ + im = i > 3 ? ((i&1) ? i-1 : i+1) : i; + val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64))); + dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i] + = le64_to_cpu(val); + } + + /* + * force new interrupt if any hwerr, error or interrupt bits are + * still set, and clear "safe" send packet errors related to freeze + * and cancelling sends. Re-enable error interrupts before possible + * force of re-interrupt on pending interrupts. 
+ */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL); + ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, + E_SPKT_ERRS_IGNORE); + ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, + ~dd->ipath_maskederrs); + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL); +} + + /* this is separate to allow for better optimization of ipath_intr() */ static void ipath_bad_intr(struct ipath_devdata *dd, u32 * unexpectp) diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index f1f8127..8bad3e3 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -645,6 +645,7 @@ int ipath_enable_wc(struct ipath_devdata *dd); void ipath_disable_wc(struct ipath_devdata *dd); int ipath_count_units(int *npresentp, int *nupp, u32 *maxportsp); void ipath_shutdown_device(struct ipath_devdata *); +void ipath_clear_freeze(struct ipath_devdata *); struct file_operations; int ipath_cdev_init(int minor, char *name, const struct file_operations *fops, From arthur.jones at qlogic.com Fri Jul 6 12:48:38 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:38 -0700 Subject: [ofa-general] [PATCH 4/8] IB/ipath - Change default number of kernel send buffers In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194838.9093.48030.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell The default calculation for the number of send buffers to allocate to the kernel was too high for the PCIe version of the chip thus leaving fewer than desired send buffers for user MPI applications. 
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_init_chip.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 1b1af34..fa98aab 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -737,7 +737,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) uports = dd->ipath_cfgports ? dd->ipath_cfgports - 1 : 0; if (ipath_kpiobufs == 0) { /* not set by user (this is default) */ - if (piobufs >= (uports * IPATH_MIN_USER_PORT_BUFCNT) + 32) + if (piobufs > 144) kpiobufs = 32; else kpiobufs = 16; From arthur.jones at qlogic.com Fri Jul 6 12:48:43 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:43 -0700 Subject: [ofa-general] [PATCH 5/8] IB/ipath - Change version wording to be less confusing with release number In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194843.9093.36493.stgit@eng-46.internal.keyresearch.com> From: Dave Olson Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_init_chip.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index fa98aab..49951d5 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -656,7 +656,7 @@ static int init_housekeeping(struct ipath_devdata *dd, ret = dd->ipath_f_get_boardname(dd, boardn, sizeof boardn); snprintf(dd->ipath_boardversion, sizeof(dd->ipath_boardversion), - "Driver %u.%u, %s, InfiniPath%u %u.%u, PCI %u, " + "ChipABI %u.%u, %s, InfiniPath%u %u.%u, PCI %u, " "SW Compat %u\n", IPATH_CHIP_VERS_MAJ, IPATH_CHIP_VERS_MIN, boardn, (unsigned)(dd->ipath_revision >> 
INFINIPATH_R_ARCH_SHIFT) & From arthur.jones at qlogic.com Fri Jul 6 12:48:48 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:48 -0700 Subject: [ofa-general] [PATCH 6/8] IB/ipath - Remove support for old HTX InfiniPath cards In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194848.9093.92568.stgit@eng-46.internal.keyresearch.com> From: Ralph Campbell This patch removes support for some older pre-production HTX InfiniPath cards. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_driver.c | 10 +------ drivers/infiniband/hw/ipath/ipath_iba6110.c | 39 ++++++++------------------- drivers/infiniband/hw/ipath/ipath_kernel.h | 4 --- drivers/infiniband/hw/ipath/ipath_verbs.c | 7 ----- 4 files changed, 12 insertions(+), 48 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index c40a542..da4a2cf 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1021,14 +1021,10 @@ void ipath_kreceive(struct ipath_devdata *dd) goto bail; } - /* There is already a thread processing this queue. 
*/ - if (test_and_set_bit(0, &dd->ipath_rcv_pending)) - goto bail; - l = dd->ipath_port0head; hdrqtail = (u32) le64_to_cpu(*dd->ipath_hdrqtailptr); if (l == hdrqtail) - goto done; + goto bail; reloop: for (i = 0; l != hdrqtail; i++) { @@ -1163,10 +1159,6 @@ reloop: ipath_stats.sps_avgpkts_call = ipath_stats.sps_port0pkts / ++totcalls; -done: - clear_bit(0, &dd->ipath_rcv_pending); - smp_mb__after_clear_bit(); - bail:; } diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index fdfa95d..650745d 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -677,6 +677,12 @@ static int ipath_ht_boardname(struct ipath_devdata *dd, char *name, if (n) snprintf(name, namelen, "%s", n); + if (dd->ipath_boardrev != 6 && dd->ipath_boardrev != 7 && + dd->ipath_boardrev != 11) { + ipath_dev_err(dd, "Unsupported InfiniPath board %s!\n", name); + ret = 1; + goto bail; + } if (dd->ipath_majrev != 3 || (dd->ipath_minrev < 2 || dd->ipath_minrev > 4)) { /* @@ -694,36 +700,11 @@ static int ipath_ht_boardname(struct ipath_devdata *dd, char *name, * copies */ dd->ipath_flags |= IPATH_32BITCOUNTERS; + dd->ipath_flags |= IPATH_GPIO_INTR; if (dd->ipath_htspeed != 800) ipath_dev_err(dd, "Incorrectly configured for HT @ %uMHz\n", dd->ipath_htspeed); - if (dd->ipath_boardrev == 7 || dd->ipath_boardrev == 11 || - dd->ipath_boardrev == 6) - dd->ipath_flags |= IPATH_GPIO_INTR; - else - dd->ipath_flags |= IPATH_POLL_RX_INTR; - if (dd->ipath_boardrev == 8) { /* LS/X-1 */ - u64 val; - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_extstatus); - if (val & INFINIPATH_EXTS_SERDESSEL) { - /* - * hardware disabled - * - * This means that the chip is hardware disabled, - * and will not be able to bring up the link, - * in any case. We special case this and abort - * early, to avoid later messages. 
We also set - * the DISABLED status bit - */ - ipath_dbg("Unit %u is hardware-disabled\n", - dd->ipath_unit); - *dd->ipath_statusp |= IPATH_STATUS_DISABLED; - /* this value is handled differently */ - ret = 2; - goto bail; - } - } ret = 0; bail: @@ -1574,8 +1555,10 @@ static int ipath_ht_early_init(struct ipath_devdata *dd) * with 128, rather than 112. */ dd->ipath_flags |= IPATH_GPIO_INTR; - dd->ipath_flags &= ~IPATH_POLL_RX_INTR; - } + } else + ipath_dev_err(dd, "Unsupported InfiniPath serial " + "number %.16s!\n", dd->ipath_serial); + return 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 8bad3e3..a27e062 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -391,9 +391,6 @@ struct ipath_devdata { struct class_device *diag_class_dev; /* timer used to prevent stats overflow, error throttling, etc. */ struct timer_list ipath_stats_timer; - /* check for stale messages in rcv queue */ - /* only allow one intr at a time. 
*/ - unsigned long ipath_rcv_pending; void *ipath_dummy_hdrq; /* used after port close */ dma_addr_t ipath_dummy_hdrq_phys; @@ -740,7 +737,6 @@ int ipath_set_rx_pol_inv(struct ipath_devdata *dd, u8 new_pol_inv); * are 64bit */ #define IPATH_32BITCOUNTERS 0x20000 /* can miss port0 rx interrupts */ -#define IPATH_POLL_RX_INTR 0x40000 #define IPATH_DISABLED 0x80000 /* administratively disabled */ /* Use GPIO interrupts for new counters */ #define IPATH_GPIO_ERRINTRS 0x100000 diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 0aecded..5aa8866 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1373,13 +1373,6 @@ static void __verbs_timer(unsigned long arg) { struct ipath_devdata *dd = (struct ipath_devdata *) arg; - /* - * If port 0 receive packet interrupts are not available, or - * can be missed, poll the receive queue - */ - if (dd->ipath_flags & IPATH_POLL_RX_INTR) - ipath_kreceive(dd); - /* Handle verbs layer timeouts. */ ipath_ib_timer(dd->verbs_dev); From arthur.jones at qlogic.com Fri Jul 6 12:48:53 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:53 -0700 Subject: [ofa-general] [PATCH 7/8] IB/ipath - check for lack of interrupts on driver startup. In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194853.9093.97927.stgit@eng-46.internal.keyresearch.com> From: Arthur Jones All too often, interrupts do not get enabled for our card due to bios misconfiguration and other issues. This patch checks for that condition on startup and warns the user. This patch is based on work (check LID availability) by Robert Walsh. 
Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_driver.c | 23 +++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_intr.c | 3 +++ drivers/infiniband/hw/ipath/ipath_kernel.h | 5 +++++ 3 files changed, 31 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index da4a2cf..e397ec0 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -104,6 +104,9 @@ static int __devinit ipath_init_one(struct pci_dev *, #define PCI_DEVICE_ID_INFINIPATH_HT 0xd #define PCI_DEVICE_ID_INFINIPATH_PE800 0x10 +/* Number of seconds before our card status check... */ +#define STATUS_TIMEOUT 60 + static const struct pci_device_id ipath_pci_tbl[] = { { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_HT) }, { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_INFINIPATH_PE800) }, @@ -119,6 +122,18 @@ static struct pci_driver ipath_driver = { .id_table = ipath_pci_tbl, }; +static void ipath_check_status(struct work_struct *work) +{ + struct ipath_devdata *dd = container_of(work, struct ipath_devdata, + status_work.work); + + /* + * If we don't have any interrupts, let the user know and + * don't bother checking again. + */ + if (dd->ipath_int_counter == 0) + dev_err(&dd->pcidev->dev, "No interrupts detected.\n"); +} static inline void read_bars(struct ipath_devdata *dd, struct pci_dev *dev, u32 *bar0, u32 *bar1) @@ -187,6 +202,8 @@ static struct ipath_devdata *ipath_alloc_devdata(struct pci_dev *pdev) dd->pcidev = pdev; pci_set_drvdata(pdev, dd); + INIT_DELAYED_WORK(&dd->status_work, ipath_check_status); + list_add(&dd->ipath_list, &ipath_dev_list); bail_unlock: @@ -511,6 +528,9 @@ static int __devinit ipath_init_one(struct pci_dev *pdev, ipath_diag_add(dd); ipath_register_ib_device(dd); + /* Check that card status in STATUS_TIMEOUT seconds. 
*/ + schedule_delayed_work(&dd->status_work, HZ * STATUS_TIMEOUT); + goto bail; bail_irqsetup: @@ -638,6 +658,9 @@ static void __devexit ipath_remove_one(struct pci_dev *pdev) */ ipath_shutdown_device(dd); + cancel_delayed_work(&dd->status_work); + flush_scheduled_work(); + if (dd->verbs_dev) ipath_unregister_ib_device(dd->verbs_dev); diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index ce49023..47aa434 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1009,6 +1009,9 @@ irqreturn_t ipath_intr(int irq, void *data) ipath_stats.sps_ints++; + if (dd->ipath_int_counter != (u32) -1) + dd->ipath_int_counter++; + if (!(dd->ipath_flags & IPATH_PRESENT)) { /* * This return value is not great, but we do not want the diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index a27e062..3105005 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -297,6 +297,8 @@ struct ipath_devdata { u32 ipath_lastport_piobuf; /* is a stats timer active */ u32 ipath_stats_timer_active; + /* number of interrupts for this device -- saturates... */ + u32 ipath_int_counter; /* dwords sent read from counter */ u32 ipath_lastsword; /* dwords received read from counter */ @@ -571,6 +573,9 @@ struct ipath_devdata { u32 ipath_overrun_thresh_errs; u32 ipath_lli_errs; + /* status check work */ + struct delayed_work status_work; + /* * Not all devices managed by a driver instance are the same * type, so these fields must be per-device. 
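[Illustrative note, not part of the patch.] The saturating interrupt counter that patch 7/8 adds to ipath_intr() can be looked at in isolation. The sketch below is a user-space approximation, assuming nothing beyond what the diff shows: the counter stops at its maximum instead of wrapping, so a later "counter == 0" test can only mean that no interrupt was ever seen. The names dev_status, note_interrupt and interrupts_seen are hypothetical stand-ins for the driver's ipath_devdata field and its check in ipath_check_status().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field the status check looks at. */
struct dev_status {
    uint32_t int_counter; /* number of interrupts seen; saturates */
};

/* Called once per interrupt: count up, but never wrap back to zero. */
static void note_interrupt(struct dev_status *s)
{
    assert(s != NULL);
    if (s->int_counter != UINT32_MAX)
        s->int_counter++;
}

/* Called once by the delayed status check after the timeout expires. */
static bool interrupts_seen(const struct dev_status *s)
{
    assert(s != NULL);
    return s->int_counter != 0;
}
```

A plain wrapping counter would pass through zero again after 2^32 interrupts, and a check that happened to run at that moment would wrongly report that no interrupts had been detected. Saturating avoids that at the cost of one comparison per interrupt.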
From arthur.jones at qlogic.com Fri Jul 6 12:48:58 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 06 Jul 2007 12:48:58 -0700 Subject: [ofa-general] [PATCH 8/8] IB/ipath -- remove bogus RD_ATOMIC checks from modify_qp In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706194858.9093.40689.stgit@eng-46.internal.keyresearch.com> The changeset: commit 3859e39d75b72f35f7d38c618fbbacb39a440c22 Author: Ralph Campbell Date: Thu Mar 15 14:44:51 2007 -0700 IB/ipath: Support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC This patch adds support for multiple RDMA reads and atomics to be sent before an ACK is required to be seen by the requester. Signed-off-by: Bryan O'Sullivan Signed-off-by: Roland Dreier added support for the larger RD_ATOMICs, but it failed to take out the stricter checks that were before these and hence had no effect. this patch takes out the bogus checks... 
Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/ipath_qp.c | 8 -------- 1 files changed, 0 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index d317b81..1324b35 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -516,14 +516,6 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, if (attr->path_mtu > IB_MTU_2048) goto inval; - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) - if (attr->max_dest_rd_atomic > 1) - goto inval; - - if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) - if (attr->max_rd_atomic > 1) - goto inval; - if (attr_mask & IB_QP_PATH_MIG_STATE) if (attr->path_mig_state != IB_MIG_MIGRATED && attr->path_mig_state != IB_MIG_REARM) From arthur.jones at qlogic.com Fri Jul 6 12:56:38 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 6 Jul 2007 12:56:38 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for 2.6.23 In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070706195638.GA25384@bauxite.pathscale.com> hi roland, i had the wrong email for you when i sent this the first time to the list. i've since bounced these messages to you, but reply-to-all will need to get fixed up by others when/if they reply... sorry for the confusion... arthur On Fri, Jul 06, 2007 at 12:48:17PM -0700, Arthur Jones wrote: > hi roland, here is the latest set of patches > for 2.6.23. this set should address all your > comments except the ppc ioremap flags issue > (which is still being worked on). the barrier > patch now has comments and the bad code that benh > pointed out has been eliminated by removing support > for older non-production HTX cards. 
> > these patches are avail to pull from: > > git://git.qlogic.com/ipath-linux-2.6 for-roland > > nb: when i tried pulling into a for-2.6.23 branch > in your repo, i got three trivial merge conflicts > (take the new stuff). plz let me know if you would > rather i re-base these to your tree... > > arthur > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Jul 6 13:56:08 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jul 2007 16:56:08 -0400 Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS In-Reply-To: <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> Message-ID: <1183755367.25217.102865.camel@hal.voltaire.com> On Fri, 2007-07-06 at 15:48, Arthur Jones wrote: > Bryan is no longer with QLogic and we now > have a public git server and a public email > alias for infinipath driver patches. > > Signed-off-by: Arthur Jones > --- > > MAINTAINERS | 5 +++-- > 1 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 23a04f4..32f5701 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -1989,9 +1989,10 @@ M: jjciarla at raiz.uncu.edu.ar > S: Maintained > > IPATH DRIVER: > -P: Bryan O'Sullivan > -M: support at pathscale.com > +P: Arthur Jones > +M: infinipath at qlogic.com > L: openib-general at openib.org Shouldn't this now be general at lists.openfabrics.org ? 
> +T: git git://git.qlogic.com/ipath-linux-2.6 > S: Supported > > IPMI SUBSYSTEM > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Jul 6 14:01:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Jul 2007 17:01:15 -0400 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release In-Reply-To: <1183746810.5165.37.camel@firewall.xsintricity.com> References: <1183124231.28870.268894.camel@hal.voltaire.com> <1183566887.16081.126.camel@firewall.xsintricity.com> <1183726824.25217.69120.camel@hal.voltaire.com> <1183746810.5165.37.camel@firewall.xsintricity.com> Message-ID: <1183755672.25217.103223.camel@hal.voltaire.com> On Fri, 2007-07-06 at 14:33, Doug Ledford wrote: > On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote: > > On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: > > > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote: > > > > There is a new release of the management libraries which include the > > > > ANSIfied header files available in: > > > > > > > > http://www.openfabrics.org/~halr/ > > > > > > > > md5sum > > > > a5b884775ed069da09ca0b60bfda3239 libibcommon-1.0.4.tar.gz > > > > 288b865a0015ac3251cffa011a7633eb libibumad-1.0.6.tar.gz > > > > 04a5b6dcd2ee930f44d5715ee013f78b libibmad-1.0.6.tar.gz > > > > > > Hey Hal, I noticed you have release tarballs there for the libs, and one > > > for the older named openib-diags. What would it take to get a release > > > tarball for infiniband-diags and one for opensm? > > > > We're not quite there yet; There are a couple of outstanding items: > > OpenSM (master) does not yet pass all the regressions, and I'd like > > libibumad to support the upcoming user_mad ABI change for partition > > support. 
After these are resolved, I think that a release of these would > > then be in order. Hopefully, this can be in the next few weeks. > > It doesn't need to be a new release. Just a tarball from any previous > stable release will work. There were no previous stable releases on master since the name changes/etc. have been made. One can say to release before the pkey index changes (and then another release would cover this later) but I think the regressions should pass before we call this "stable". I'd like to understand the urgency of releasing these. I'm hoping we can get there in the next week or two. -- Hal From dledford at redhat.com Fri Jul 6 14:10:15 2007 From: dledford at redhat.com (Doug Ledford) Date: Fri, 06 Jul 2007 17:10:15 -0400 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] management libraries release In-Reply-To: <1183755672.25217.103223.camel@hal.voltaire.com> References: <1183124231.28870.268894.camel@hal.voltaire.com> <1183566887.16081.126.camel@firewall.xsintricity.com> <1183726824.25217.69120.camel@hal.voltaire.com> <1183746810.5165.37.camel@firewall.xsintricity.com> <1183755672.25217.103223.camel@hal.voltaire.com> Message-ID: <1183756215.5165.43.camel@firewall.xsintricity.com> On Fri, 2007-07-06 at 17:01 -0400, Hal Rosenstock wrote: > On Fri, 2007-07-06 at 14:33, Doug Ledford wrote: > > On Fri, 2007-07-06 at 09:00 -0400, Hal Rosenstock wrote: > > > On Wed, 2007-07-04 at 12:34, Doug Ledford wrote: > > > > On Fri, 2007-06-29 at 09:37 -0400, Hal Rosenstock wrote: > > > > > There is a new release of the management libraries which include the > > > > > ANSIfied header files available in: > > > > > > > > > > http://www.openfabrics.org/~halr/ > > > > > > > > > > md5sum > > > > > a5b884775ed069da09ca0b60bfda3239 libibcommon-1.0.4.tar.gz > > > > > 288b865a0015ac3251cffa011a7633eb libibumad-1.0.6.tar.gz > > > > > 04a5b6dcd2ee930f44d5715ee013f78b libibmad-1.0.6.tar.gz > > > > > > > > Hey Hal, I noticed you have release tarballs there for the libs, and one > > 
> > for the older named openib-diags. What would it take to get a release > > > > tarball for infiniband-diags and one for opensm? > > > > > > We're not quite there yet; There are a couple of outstanding items: > > > OpenSM (master) does not yet pass all the regressions, and I'd like > > > libibumad to support the upcoming user_mad ABI change for partition > > > support. After these are resolved, I think that a release of these would > > > then be in order. Hopefully, this can be in the next few weeks. > > > > It doesn't need to be a new release. Just a tarball from any previous > > stable release will work. > > There were no previous stable releases on master since the name > changes/etc. have been made. One can say to release before the pkey > index changes (and then another release would cover this later) but I > think the regressions should pass before we call this "stable". I'd like > to understand the urgency of releasing these. I'm hoping we can get > there in the next week or two. It's not a major urgency, I just figured it wouldn't be a difficult thing to do. I'm just working on getting the various packages from the management tree through the Fedora review process. For that, they want the package built from a release tarball, not from a git repo. You've got releases up there for the three libs, but opensm and infiniband-diags aren't there. Having something allows me to keep that process going. But, it's not a big deal either, it can wait. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From arthur.jones at qlogic.com Fri Jul 6 14:10:53 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 6 Jul 2007 14:10:53 -0700 Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS In-Reply-To: <1183755367.25217.102865.camel@hal.voltaire.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> <1183755367.25217.102865.camel@hal.voltaire.com> Message-ID: <20070706211053.GD24755@bauxite.pathscale.com> hi hal, ... On Fri, Jul 06, 2007 at 04:56:08PM -0400, Hal Rosenstock wrote: > On Fri, 2007-07-06 at 15:48, Arthur Jones wrote: > > IPATH DRIVER: > > -P: Bryan O'Sullivan > > -M: support at pathscale.com > > +P: Arthur Jones > > +M: infinipath at qlogic.com > > L: openib-general at openib.org > > Shouldn't this now be general at lists.openfabrics.org ? yes -- INFINIBAND entry needs to get fixed up as well... thanks! arthur From arthur.jones at qlogic.com Fri Jul 6 14:25:15 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Fri, 6 Jul 2007 14:25:15 -0700 Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS In-Reply-To: <20070706211053.GD24755@bauxite.pathscale.com> References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> <1183755367.25217.102865.camel@hal.voltaire.com> <20070706211053.GD24755@bauxite.pathscale.com> Message-ID: <20070706212515.GB25384@bauxite.pathscale.com> hi roland, updated MAINTAINERS patch is attached and pushed to public git server... arthur On Fri, Jul 06, 2007 at 02:10:53PM -0700, Arthur Jones wrote: > hi hal, ... 
> > On Fri, Jul 06, 2007 at 04:56:08PM -0400, Hal Rosenstock wrote: > > On Fri, 2007-07-06 at 15:48, Arthur Jones wrote: > > > IPATH DRIVER: > > > -P: Bryan O'Sullivan > > > -M: support at pathscale.com > > > +P: Arthur Jones > > > +M: infinipath at qlogic.com > > > L: openib-general at openib.org > > > > Shouldn't this now be general at lists.openfabrics.org ? > > yes -- INFINIBAND entry needs to get fixed up as well... > > thanks! > > arthur > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- IB/ipath -- update MAINTAINERS From: Arthur Jones Bryan is no longer with QLogic and we now have a public git server and a public email alias for infinipath driver patches. And, as pointed out by Hal Rosenstock, the mailing list has changed as well. Signed-off-by: Arthur Jones --- MAINTAINERS | 7 ++++--- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 23a04f4..b98ab7c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1989,9 +1989,10 @@ M: jjciarla at raiz.uncu.edu.ar S: Maintained IPATH DRIVER: -P: Bryan O'Sullivan -M: support at pathscale.com -L: openib-general at openib.org +P: Arthur Jones +M: infinipath at qlogic.com +L: general at lists.openfabrics.org +T: git git://git.qlogic.com/ipath-linux-2.6 S: Supported IPMI SUBSYSTEM From kliteyn at dev.mellanox.co.il Fri Jul 6 14:28:38 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 07 Jul 2007 00:28:38 +0300 Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs In-Reply-To: <20070706121223.GA7555@sashak.voltaire.com> References: <468CA13B.2040900@dev.mellanox.co.il> <20070706121223.GA7555@sashak.voltaire.com> Message-ID: <468EB406.9010905@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > 
On 10:43 Thu 05 Jul , Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> opensm.fdbs dump function adaptation to the recent changes in min hop tables
>> broke fat-tree routing (or any other future routing that may not use the
>> same
>> min hop tables creation functions).
>
> Could you please explain how this dump function break the routing for
> fat-tree? Thanks.

Example:
 - We're dumping the table for switch SW_A, and the target is CA.
 - To get to CA from SW_A, there are at least two options:
   1. SW_A->...->SW_X->...->SW_B->CA
   2. SW_A->...->SW_Y->...->SW_B->CA
 - Fat-tree may choose to go through SW_X when routing from SW_A to CA,
   and through SW_Y when routing from SW_A to SW_B, hence it might choose
   different ports on SW_A.

In the recent optimization for MinHop and Up/Dn, min hop tables creation is done
only for switches, and in order to go from SW_A to CA the algorithm checks which
switch is connected to CA (SW_B in this case), and chooses the port on SW_A that
routes to SW_B, hence routes to SW_B and CA have to be the same (except for the
last SW_B->CA hop).
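The lookup order described above can be sketched as a small standalone C
fragment. This is only an illustration under stated assumptions, not OpenSM
code: struct toy_switch, toy_switch_init(), toy_ca_hops() and the TOY_*
constants are hypothetical names, and the table layout is simplified to a
plain hops[lid][port] array. It also leaves out the case from the patch where
the CA hangs directly off the switch being dumped.

```c
#include <stdint.h>
#include <string.h>

#define TOY_NO_PATH  0xFF  /* stand-in for OSM_NO_PATH; the value is an assumption */
#define TOY_MAX_LID  8
#define TOY_MAX_PORT 4

/* Hypothetical per-switch table: hops[lid][port] = hop count via that port. */
struct toy_switch {
    uint8_t hops[TOY_MAX_LID][TOY_MAX_PORT];
};

/* Mark every (lid, port) pair as unreachable. */
static void toy_switch_init(struct toy_switch *sw)
{
    memset(sw->hops, TOY_NO_PATH, sizeof(sw->hops));
}

/*
 * Hop count to print for a CA when dumping 'sw' through 'port'.
 * 1. If the routing engine stored an entry for the CA's own LID, use it.
 * 2. Otherwise fall back to the switch the CA is attached to and add the
 *    final switch->CA hop.
 */
static uint8_t toy_ca_hops(const struct toy_switch *sw, uint16_t ca_lid,
                           uint16_t attached_sw_lid, uint8_t port)
{
    uint8_t hops = sw->hops[ca_lid][port];

    if (hops != TOY_NO_PATH)
        return hops;            /* direct route recorded for the CA itself */

    hops = sw->hops[attached_sw_lid][port];
    return hops == TOY_NO_PATH ? TOY_NO_PATH : (uint8_t)(hops + 1);
}
```

With a table where the CA's LID is reachable through one port and its attached
switch through another, the first branch reports the CA's own entry, so the
dump follows the port the routing engine actually picked for the CA instead of
the port that leads to SW_B.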
-- Yevgeny > Sasha > >> -- Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/opensm/osm_ucast_mgr.c | 33 ++++++++++++++++++++++++--------- >> 1 files changed, 24 insertions(+), 9 deletions(-) >> >> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c >> index 5bcb655..cab272e 100644 >> --- a/opensm/opensm/osm_ucast_mgr.c >> +++ b/opensm/opensm/osm_ucast_mgr.c >> @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution( >> >> /********************************************************************** >> **********************************************************************/ >> + >> static void >> __osm_ucast_mgr_dump_ucast_routes( >> IN cl_map_item_t *p_map_item, >> @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes( >> uint8_t best_port; >> uint16_t max_lid_ho; >> uint16_t lid_ho, base_lid; >> + boolean_t direct_route_exists = FALSE; >> osm_switch_t* p_sw = (osm_switch_t *)p_map_item; >> osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; >> FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; >> @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes( >> */ >> if( p_port->p_node->sw ) >> { >> + /* Target LID is switch. >> + Get its base lid and check hop count for this base LID only.*/ >> base_lid = osm_node_get_base_lid(p_port->p_node, 0); >> base_lid = cl_ntoh16(base_lid); >> num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num ); >> } >> else >> { >> - osm_physp_t *p_physp = p_port->p_physp; >> - if( !p_physp || !p_physp->p_remote_physp || >> - !p_physp->p_remote_physp->p_node->sw ) >> - num_hops = OSM_NO_PATH; >> + /* Target LID is not switch (CA or router). 
>> + Check if we have route to this target from current switch.*/ >> + num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); >> + if (num_hops != OSM_NO_PATH) >> + { >> + direct_route_exists = TRUE; >> + base_lid = lid_ho; >> + } >> else >> { >> - base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, >> 0); >> - base_lid = cl_ntoh16(base_lid); >> - num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? >> - 0 : osm_switch_get_hop_count( p_sw, base_lid, port_num >> ); >> + osm_physp_t *p_physp = p_port->p_physp; >> + if( !p_physp || !p_physp->p_remote_physp || >> + !p_physp->p_remote_physp->p_node->sw ) >> + num_hops = OSM_NO_PATH; >> + else >> + { >> + base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, >> 0); >> + base_lid = cl_ntoh16(base_lid); >> + num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? >> + 0 : osm_switch_get_hop_count( p_sw, base_lid, port_num >> ); >> + } >> } >> } >> >> @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes( >> } >> >> best_hops = osm_switch_get_least_hops( p_sw, base_lid ); >> - if (!p_port->p_node->sw) >> + if (!p_port->p_node->sw && !direct_route_exists) >> { >> best_hops++; >> num_hops++; >> -- >> 1.5.1.4 >> >> >
From xhejtman at ics.muni.cz Sat Jul 7 01:53:03 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Sat, 7 Jul 2007 10:53:03 +0200 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> Message-ID: <20070707085303.GS3885@ics.muni.cz> On Thu, Jul 05, 2007 at 03:22:12PM -0700, Roland Dreier wrote: > Loading and unloading ib_mthca many times works fine on a non-Xen > system. So there is something different about the Xen environment > that is causing a problem. It could be a bug in mthca exposed by Xen > (eg improper use of of the DMA mapping API or something like that). > > Can you turn on all the memory debugging options like SLAB_DEBUG > etc. and see if it turns up anything? Well, I turned on slab debug, vm debug and mthca debug. The output is below. Anything interesting in it? # insmod ib_mthca.ko debug_level=1 ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0000:08:00.0 PCI: Enabling device 0000:08:00.0 (0000 -> 0002) Slab corruption: start=ffff880098f513b8, len=256 Redzone: 0x1600000016/0x1700000017.
Last user: <0000001800000018>(0x1800000018)
000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00
Prev obj: start=0000000398f5120b, len=256
Unable to handle kernel paging request at 0000000398f5130b RIP:
print_objinfo+0x22/0xde
PGD 9b0a1067 PUD 0
Oops: 0000 1 SMP
CPU 0
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2193, comm: insmod Not tainted 2.6.18-xen31-smp #6
RIP: e030: print_objinfo+0x22/0xde
RSP: e02b:ffff88009acfd8c8 EFLAGS: 00010206
RAX: 0000000398f5130b RBX: 00000000008bd8c1 RCX: ffffffffff57c000
RDX: 0000000000000002 RSI: 0000000398f51203 RDI: ffff8800015f20c0
RBP: ffff8800015f20c0 R08: ffff88009ae9e3c8 R09: 00000000000035eb
R10: ffff88009acfd818 R11: ffffffff802fd0b5 R12: 0000000398f51203
R13: 0000000000000002 R14: ffff880098f513b0 R15: ffff880098f51000
FS: 00002aaaaadedb00(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process insmod (pid: 2193, threadinfo ffff88009acfc000, task ffff88009c3a1080)
Stack: 00000000008bd8c1 ffff8800015f20c0 0000000398f51203 0000000000000100
 ffff880098f513b0 ffffffff80277521 ffff8800015f20c0 0000000000000000
 ffff8800015f20c0 ffff880098f513b0 ffffffff88318ece 00000000000000d0
Call Trace:
 check_poison_obj+0x152/0x1ae
 :ib_mthca:mthca_alloc_icm+0xff/0x35c
 :ib_mthca:mthca_alloc_icm+0xff/0x35c
 cache_alloc_debugcheck_after+0x34/0x1b0
 kmem_cache_alloc+0xf2/0x102
 :ib_mthca:mthca_alloc_icm+0xff/0x35c
 :ib_mthca:mthca_alloc_icm_table+0x138/0x227
 :ib_mthca:mthca_init_hca+0x5ee/0xde7
 sysfs_add_file+0x77/0x86
device_create_file+0x31/0x39 :ib_mthca:__mthca_init_one+0x52f/0xb50 poison_obj+0x24/0x2d :ib_mthca:mthca_init_one+0x76/0x8b pci_device_probe+0x4a/0x70 driver_probe_device+0x52/0xa8 __driver_attach+0x6b/0xa9 __driver_attach+0x0/0xa9 bus_for_each_dev+0x43/0x6e bus_add_driver+0x73/0x10f __pci_register_driver+0x57/0x7e :ib_mthca:mthca_init+0x135/0x148 sys_init_module+0x16e1/0x180a system_call+0x86/0x8b system_call+0x0/0x8b Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e -- Lukáš Hejtmánek From vlad at lists.openfabrics.org Sat Jul 7 02:44:40 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 7 Jul 2007 02:44:40 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070707-0200 daily build status Message-ID: <20070707094440.94F65E6082B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on
x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From sashak at voltaire.com Sat Jul 7 05:56:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 7 Jul 2007 15:56:53 +0300 Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs In-Reply-To: <468EB406.9010905@dev.mellanox.co.il> References: <468CA13B.2040900@dev.mellanox.co.il> <20070706121223.GA7555@sashak.voltaire.com> <468EB406.9010905@dev.mellanox.co.il> Message-ID: <20070707125653.GI8061@sashak.voltaire.com> On 00:28 Sat 07 Jul , Yevgeny Kliteynik wrote: > Hi Sasha, > > 
Sasha Khapyorsky wrote:
> > Hi Yevgeny,
> > On 10:43 Thu 05 Jul , Yevgeny Kliteynik wrote:
> >> Hi Hal,
> >>
> >> opensm.fdbs dump function adaptation to the recent changes in min hop
> >> tables
> >> broke fat-tree routing (or any other future routing that may not use the
> >> same
> >> min hop tables creation functions).
> > Could you please explain how this dump function break the routing for
> > fat-tree? Thanks.
>
> Example:
> - We're dumping table for switch SW_A, and the target is CA.
> - To get to CA from SW_A, there are at leas two options:
> 1. SW_A->...->SW_X->...->SW_B->CA
> 2. SW_A->...->SW_Y->...->SW_B->CA
> - Fat-tree may chose to go through SW_X when routing from SW_A to CA,
> and through SW_Y when routing from SW_A to SW_B, hence it might chose
> different ports on SW_A

OK, so you are referring to incorrect dump info, and not to the routing
itself?

> In the recent optimization for MinHop and Up/Dn, min hop tables creation is
> done
> only for switches,

BTW, is such an optimization suitable for the fat-tree engine?

Sasha

> and in order to go from SW_A to CA the algorithm checks
> which
> switch is connected to CA (SW_B in this case), and choses the port on SW_A
> that
> routes to SW_B, hence routes to SW_B and CA have to be the same (except for
> the
> last SW_B->CA hop).
> > -- Yevgeny > > > Sasha > >> -- Yevgeny > >> > >> Signed-off-by: Yevgeny Kliteynik > >> --- > >> opensm/opensm/osm_ucast_mgr.c | 33 ++++++++++++++++++++++++--------- > >> 1 files changed, 24 insertions(+), 9 deletions(-) > >> > >> diff --git a/opensm/opensm/osm_ucast_mgr.c > >> b/opensm/opensm/osm_ucast_mgr.c > >> index 5bcb655..cab272e 100644 > >> --- a/opensm/opensm/osm_ucast_mgr.c > >> +++ b/opensm/opensm/osm_ucast_mgr.c > >> @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution( > >> > >> /********************************************************************** > >> **********************************************************************/ > >> + > >> static void > >> __osm_ucast_mgr_dump_ucast_routes( > >> IN cl_map_item_t *p_map_item, > >> @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes( > >> uint8_t best_port; > >> uint16_t max_lid_ho; > >> uint16_t lid_ho, base_lid; > >> + boolean_t direct_route_exists = FALSE; > >> osm_switch_t* p_sw = (osm_switch_t *)p_map_item; > >> osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context > >> *)cxt)->p_mgr; > >> FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; > >> @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes( > >> */ > >> if( p_port->p_node->sw ) > >> { > >> + /* Target LID is switch. > >> + Get its base lid and check hop count for this base LID only.*/ > >> base_lid = osm_node_get_base_lid(p_port->p_node, 0); > >> base_lid = cl_ntoh16(base_lid); > >> num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num ); > >> } > >> else > >> { > >> - osm_physp_t *p_physp = p_port->p_physp; > >> - if( !p_physp || !p_physp->p_remote_physp || > >> - !p_physp->p_remote_physp->p_node->sw ) > >> - num_hops = OSM_NO_PATH; > >> + /* Target LID is not switch (CA or router). 
> >> + Check if we have route to this target from current switch.*/ > >> + num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); > >> + if (num_hops != OSM_NO_PATH) > >> + { > >> + direct_route_exists = TRUE; > >> + base_lid = lid_ho; > >> + } > >> else > >> { > >> - base_lid = > >> osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0); > >> - base_lid = cl_ntoh16(base_lid); > >> - num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? > >> - 0 : osm_switch_get_hop_count( p_sw, base_lid, > >> port_num ); > >> + osm_physp_t *p_physp = p_port->p_physp; > >> + if( !p_physp || !p_physp->p_remote_physp || > >> + !p_physp->p_remote_physp->p_node->sw ) > >> + num_hops = OSM_NO_PATH; > >> + else > >> + { > >> + base_lid = > >> osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0); > >> + base_lid = cl_ntoh16(base_lid); > >> + num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? > >> + 0 : osm_switch_get_hop_count( p_sw, base_lid, > >> port_num ); > >> + } > >> } > >> } > >> > >> @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes( > >> } > >> > >> best_hops = osm_switch_get_least_hops( p_sw, base_lid ); > >> - if (!p_port->p_node->sw) > >> + if (!p_port->p_node->sw && !direct_route_exists) > >> { > >> best_hops++; > >> num_hops++; > >> -- 1.5.1.4 > >> > >> >
From kliteyn at dev.mellanox.co.il Sat Jul 7 12:18:23 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 07 Jul 2007 22:18:23 +0300 Subject: [ofa-general] Re: [PATCH] osm: bug in dumping opensm.fdbs In-Reply-To: <20070707125653.GI8061@sashak.voltaire.com> References: <468CA13B.2040900@dev.mellanox.co.il> <20070706121223.GA7555@sashak.voltaire.com> <468EB406.9010905@dev.mellanox.co.il> <20070707125653.GI8061@sashak.voltaire.com> Message-ID: <468FE6FF.3020508@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 00:28 Sat 07 Jul , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> Sasha Khapyorsky wrote: >>> Hi Yevgeny, >>> On 10:43 Thu 05 Jul , Yevgeny Kliteynik wrote: >>>> Hi Hal, >>>> >>>> opensm.fdbs dump function adaptation to the recent changes in min hop >>>> tables >>>> broke fat-tree routing (or any other future routing that may not use the >>>> same >>>> min hop tables creation functions). >>> Could you please explain how this dump function break the routing for >>> fat-tree? Thanks. >> Example: >> - We're dumping table for switch SW_A, and the target is CA. >> - To get to CA from SW_A, there are at leas two options: >> 1. SW_A->...->SW_X->...->SW_B->CA >> 2. SW_A->...->SW_Y->...->SW_B->CA >> - Fat-tree may chose to go through SW_X when routing from SW_A to CA, >> and through SW_Y when routing from SW_A to SW_B, hence it might chose >> different ports on SW_A > > Ok, so your are refering incorrect dumping info, and not the routing > itself? Yes, sorry if I wasn't clear on this. >> In the recent optimization for MinHop and Up/Dn, min hop tables creation is >> done >> only for switches, > > BTW is such optimization is suitable for fat-tree engine? No.
In fact, fat-tree doesn't use these tables for routing - it creates them as a routing by-product w/o any additional complexity. -- Yevgeny > Sasha > >> and in order to go from SW_A to CA the algorithm checks >> which >> switch is connected to CA (SW_B in this case), and choses the port on SW_A >> that >> routes to SW_B, hence routes to SW_B and CA have to be the same (except for >> the >> last SW_B->CA hop). >> >> -- Yevgeny >> >>> Sasha >>>> -- Yevgeny >>>> >>>> Signed-off-by: Yevgeny Kliteynik >>>> --- >>>> opensm/opensm/osm_ucast_mgr.c | 33 ++++++++++++++++++++++++--------- >>>> 1 files changed, 24 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/opensm/opensm/osm_ucast_mgr.c >>>> b/opensm/opensm/osm_ucast_mgr.c >>>> index 5bcb655..cab272e 100644 >>>> --- a/opensm/opensm/osm_ucast_mgr.c >>>> +++ b/opensm/opensm/osm_ucast_mgr.c >>>> @@ -242,6 +242,7 @@ __osm_ucast_mgr_dump_path_distribution( >>>> >>>> /********************************************************************** >>>> **********************************************************************/ >>>> + >>>> static void >>>> __osm_ucast_mgr_dump_ucast_routes( >>>> IN cl_map_item_t *p_map_item, >>>> @@ -255,6 +256,7 @@ __osm_ucast_mgr_dump_ucast_routes( >>>> uint8_t best_port; >>>> uint16_t max_lid_ho; >>>> uint16_t lid_ho, base_lid; >>>> + boolean_t direct_route_exists = FALSE; >>>> osm_switch_t* p_sw = (osm_switch_t *)p_map_item; >>>> osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context >>>> *)cxt)->p_mgr; >>>> FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; >>>> @@ -300,22 +302,35 @@ __osm_ucast_mgr_dump_ucast_routes( >>>> */ >>>> if( p_port->p_node->sw ) >>>> { >>>> + /* Target LID is switch. 
>>>> + Get its base lid and check hop count for this base LID only.*/ >>>> base_lid = osm_node_get_base_lid(p_port->p_node, 0); >>>> base_lid = cl_ntoh16(base_lid); >>>> num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num ); >>>> } >>>> else >>>> { >>>> - osm_physp_t *p_physp = p_port->p_physp; >>>> - if( !p_physp || !p_physp->p_remote_physp || >>>> - !p_physp->p_remote_physp->p_node->sw ) >>>> - num_hops = OSM_NO_PATH; >>>> + /* Target LID is not switch (CA or router). >>>> + Check if we have route to this target from current switch.*/ >>>> + num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); >>>> + if (num_hops != OSM_NO_PATH) >>>> + { >>>> + direct_route_exists = TRUE; >>>> + base_lid = lid_ho; >>>> + } >>>> else >>>> { >>>> - base_lid = >>>> osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0); >>>> - base_lid = cl_ntoh16(base_lid); >>>> - num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? >>>> - 0 : osm_switch_get_hop_count( p_sw, base_lid, >>>> port_num ); >>>> + osm_physp_t *p_physp = p_port->p_physp; >>>> + if( !p_physp || !p_physp->p_remote_physp || >>>> + !p_physp->p_remote_physp->p_node->sw ) >>>> + num_hops = OSM_NO_PATH; >>>> + else >>>> + { >>>> + base_lid = >>>> osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0); >>>> + base_lid = cl_ntoh16(base_lid); >>>> + num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? 
>>>> + 0 : osm_switch_get_hop_count( p_sw, base_lid, >>>> port_num ); >>>> + } >>>> } >>>> } >>>> >>>> @@ -326,7 +341,7 @@ __osm_ucast_mgr_dump_ucast_routes( >>>> } >>>> >>>> best_hops = osm_switch_get_least_hops( p_sw, base_lid ); >>>> - if (!p_port->p_node->sw) >>>> + if (!p_port->p_node->sw && !direct_route_exists) >>>> { >>>> best_hops++; >>>> num_hops++; >>>> -- 1.5.1.4 >>>> >>>> > From rdreier at cisco.com Sat Jul 7 16:24:16 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 07 Jul 2007 16:24:16 -0700 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: <20070707085303.GS3885@ics.muni.cz> (Lukas Hejtmanek's message of "Sat, 7 Jul 2007 10:53:03 +0200") References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> Message-ID: > Slab corruption: start=ffff880098f513b8, len=256 > Redzone: 0x1600000016/0x1700000017. > Last user: <0000001800000018>(0x1800000018) OK, CONFIG_DEBUG_SLAB is catching a slab getting corrupted with a really strange pattern of incrementing values up to 1f. Somehow running under Xen is triggering this, since I run mthca with CONFIG_DEBUG_SLAB set all the time and I've never seen anything like this happen. > Call Trace: > check_poison_obj+0x152/0x1ae > :ib_mthca:mthca_alloc_icm+0xff/0x35c > :ib_mthca:mthca_alloc_icm+0xff/0x35c > cache_alloc_debugcheck_after+0x34/0x1b0 > kmem_cache_alloc+0xf2/0x102 > :ib_mthca:mthca_alloc_icm+0xff/0x35c > :ib_mthca:mthca_alloc_icm_table+0x138/0x227 > :ib_mthca:mthca_init_hca+0x5ee/0xde7 seems something bad is happening in mthca_alloc_icm, although the corruption may have been earlier. But I don't understand how we could have reached mthca_alloc_icm() without getting through mthca_QUERY_FW and printing the FW version first... are you sure you're getting all the trace messages? How are you collecting them? Can you make sure that your console level is set so that you see messages printed with KERN_DEBUG? - R. 
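The "strange pattern of incrementing values" can be made explicit by reading
the dumped bytes as little-endian 32-bit words. The fragment below is only a
sketch under stated assumptions: the byte array is copied from the first six
rows of the slab dump quoted earlier in the thread, le32() and
is_paired_counter() are hypothetical helper names, and the wrap at 0x20 is
inferred from those six rows alone. It says nothing about where the values
come from.

```c
#include <stddef.h>
#include <stdint.h>

/* Rows 000..050 of the corrupted object, copied from the slab dump. */
static const uint8_t dump[] = {
    0x17, 0, 0, 0, 0x17, 0, 0, 0, 0x18, 0, 0, 0, 0x18, 0, 0, 0,
    0x19, 0, 0, 0, 0x19, 0, 0, 0, 0x1a, 0, 0, 0, 0x1a, 0, 0, 0,
    0x1b, 0, 0, 0, 0x1b, 0, 0, 0, 0x1c, 0, 0, 0, 0x1c, 0, 0, 0,
    0x1d, 0, 0, 0, 0x1d, 0, 0, 0, 0x1e, 0, 0, 0, 0x1e, 0, 0, 0,
    0x1f, 0, 0, 0, 0x1f, 0, 0, 0, 0x00, 0, 0, 0, 0x00, 0, 0, 0,
    0x01, 0, 0, 0, 0x01, 0, 0, 0, 0x02, 0, 0, 0, 0x02, 0, 0, 0,
};

/* Assemble one little-endian 32-bit word from four bytes. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return 1 if 'buf' consists of 32-bit words that come in equal pairs,
 * with each pair one larger than the previous pair modulo 'wrap'.
 */
static int is_paired_counter(const uint8_t *buf, size_t len, uint32_t wrap)
{
    size_t nwords = len / 4;
    size_t i;

    if (nwords < 2 || nwords % 2 != 0)
        return 0;

    for (i = 0; i < nwords; i += 2) {
        uint32_t a = le32(buf + 4 * i);
        uint32_t b = le32(buf + 4 * (i + 1));

        if (a != b)
            return 0;           /* the two words of a pair differ */
        if (i + 2 < nwords && le32(buf + 4 * (i + 2)) != (a + 1) % wrap)
            return 0;           /* next pair is not the successor */
    }
    return 1;
}
```

Calling is_paired_counter(dump, sizeof(dump), 0x20) checks exactly the
property visible in the quoted rows: every value appears twice, and the pairs
count 0x17 through 0x1f and then continue from 0.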
From stanleysufficool at roadrunner.com Sat Jul 7 17:00:53 2007 From: stanleysufficool at roadrunner.com (Stanley Sufficool) Date: Sat, 07 Jul 2007 17:00:53 -0700 Subject: [ofa-general] Compiling SRPT Message-ID: <1183852853.6008.11.camel@gentoo-linux.localdomain> Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch Got the latest srpt from the git repository on OpenFabrics and had the following issues. ib_srpt.c Line 1997, missing second argument, should be? sdev->scst_tgt = scst_register(tp, NULL); SCST was built successfully after fixing an issue in scst_vdisk.c (missing #include ) Just thought this would be nice to have documented, took me half a day to track down as a novice in C programming. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhejtman at ics.muni.cz Sat Jul 7 17:15:32 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Sun, 8 Jul 2007 02:15:32 +0200 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> Message-ID: <20070708001531.GT3885@ics.muni.cz> On Sat, Jul 07, 2007 at 04:24:16PM -0700, Roland Dreier wrote: > But I don't understand how we could have reached mthca_alloc_icm() > without getting through mthca_QUERY_FW and printing the FW version > first... are you sure you're getting all the trace messages? How are > you collecting them? Can you make sure that your console level is set > so that you see messages printed with KERN_DEBUG?
You are right, the console did not receive debug messages, so I changed mthca_dbg to spam with KERN_ERR priority instead. (This time, it looks like the corruption gets triggered at another place and the driver complains about not receiving an IRQ.) Here is the result:

# insmod ib_mthca.ko debug_level=1
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:08:00.0
PCI: Enabling device 0000:08:00.0 (0000 -> 0002)
ib_mthca 0000:08:00.0: FW version 000100020000, max commands 16
ib_mthca 0000:08:00.0: Catastrophic error buffer at 0xb9382a50, size 0x10
ib_mthca 0000:08:00.0: FW supports commands through doorbells
ib_mthca 0000:08:00.0: Mapped doorbell page for posting FW commands
ib_mthca 0000:08:00.0: FW size 5136 KB
ib_mthca 0000:08:00.0: Clear int @ b93f00d8, EQ arm @ b9361748, EQ set CI @ b9372000
ib_mthca 0000:08:00.0: No HCA-attached memory (running in MemFree mode)
ib_mthca 0000:08:00.0: Mapped 1284 chunks/5136 KB for FW.
ib_mthca 0000:08:00.0: Base MM extensions: no
ib_mthca 0000:08:00.0: Max ICM size 523264 MB
ib_mthca 0000:08:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256
ib_mthca 0000:08:00.0: Max SRQs: 1024, reserved SRQs: 64, entry size: 32
ib_mthca 0000:08:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:08:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:08:00.0: reserved MPTs: 16, reserved MTTs: 2
ib_mthca 0000:08:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1
ib_mthca 0000:08:00.0: Max QP/MCG: 8388608, reserved MGMs: 0
ib_mthca 0000:08:00.0: Max CQEs: 131072, max WQEs: 16384, max SRQ WQEs: 16384
ib_mthca 0000:08:00.0: Flags: 00370347
ib_mthca 0000:08:00.0: profile 0--13/11 @ 0x 0 (size 0x20000000)
ib_mthca 0000:08:00.0: profile 1--10/20 @ 0x 20000000 (size 0x 4000000)
ib_mthca 0000:08:00.0: profile 2-- 0/16 @ 0x 24000000 (size 0x 1000000)
ib_mthca 0000:08:00.0: profile 3-- 7/18 @ 0x 25000000 (size 0x 800000)
ib_mthca 0000:08:00.0: profile 4-- 9/17 @ 0x 25800000 (size 0x 800000)
ib_mthca 0000:08:00.0: profile 5-- 3/16 @ 0x 26000000 (size 0x 400000)
ib_mthca 0000:08:00.0: profile 6-- 4/16 @ 0x 26400000 (size 0x 400000)
ib_mthca 0000:08:00.0: profile 7-- 8/13 @ 0x 26800000 (size 0x 80000)
ib_mthca 0000:08:00.0: profile 8--11/11 @ 0x 26880000 (size 0x 10000)
ib_mthca 0000:08:00.0: profile 9-- 2/10 @ 0x 26890000 (size 0x 8000)
ib_mthca 0000:08:00.0: profile10-- 1/ 0 @ 0x 26898000 (size 0x 1000)
ib_mthca 0000:08:00.0: profile11-- 5/ 0 @ 0x 26899000 (size 0x 1000)
ib_mthca 0000:08:00.0: profile12-- 6/ 5 @ 0x 2689a000 (size 0x 1000)
ib_mthca 0000:08:00.0: profile13--12/ 0 @ 0x 2689b000 (size 0x 1000)
ib_mthca 0000:08:00.0: HCA context memory: reserving 631408 KB
ib_mthca 0000:08:00.0: 631408 KB of HCA context requires 1244 KB aux memory.
ib_mthca 0000:08:00.0: Mapped 311 chunks/1244 KB for ICM aux.
ib_mthca 0000:08:00.0: Mapped page at 24d8b000 to 2689a000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 20000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 1 chunks/256 KB at 25800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 24000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26400000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 8 chunks/32 KB at 26890000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26840000 for ICM.
Unable to handle kernel paging request at 0000001100000019 RIP: datagram_poll+0xcc/0xd6 PGD 0 Oops: 0002 1 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6 RIP: e030: datagram_poll+0xcc/0xd6 RSP: e02b:ffff880095e87a88 EFLAGS: 00010246 RAX: 0000001100000011 RBX: ffff8800971e2ac8 RCX: 000000000000000b RDX: 0000000000000000 RSI: 0000000000000049 RDI: 0000000000000002 RBP: ffff88009c8ad390 R08: ffff880095e86000 R09: ffff880095e87760 R10: ffffffff803a492f R11: ffffffff803a492f R12: 0000000000000005 R13: 0000000000000020 R14: ffff880095e87ef8 R15: 0000000000000008 FS: 00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0) Stack: ffff8800973547b0 ffffffff803a4942 ffffffff803a492f 0000000000000300 ffff88009c8ad390 0000000000000005 0000000000000020 ffffffff8028e1d3 ffff880095e87f40 0000000000000000 ffff880095e87e10 ffff880095e87e18 Call Trace: udp_poll+0x13/0xf3 udp_poll+0x0/0xf3 do_select+0x2aa/0x464 __pollwait+0x0/0xdd default_wake_function+0x0/0xe default_wake_function+0x0/0xe default_wake_function+0x0/0xe default_wake_function+0x0/0xe sock_common_recvmsg+0x2d/0x43 sock_recvmsg+0x101/0x120 poison_obj+0x24/0x2d cache_free_debugcheck+0x1f9/0x209 udp_poll+0x0/0xf3 kmem_cache_free+0xd0/0x140 sys_select+0x273/0x3e5 init_fpu+0x62/0x7f math_state_restore+0x21/0x4a error_exit+0x0/0x71 sys_rt_sigreturn+0x251/0x301 system_call+0x86/0x8b system_call+0x0/0x8b Code: f0 0f ba 68 08 00 5b 89 f0 c3 41 57 41 89 f7 41 56 41 55 41 RIP datagram_poll+0xcc/0xd6 RSP CR2: 0000001100000019 <1>Unable to handle kernel paging request at 0000000d0000001d RIP: :xfs:xfs_file_close+0x1c/0x28 PGD 0 Oops: 
0000 2 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6 RIP: e030: :xfs:xfs_file_close+0x1c/0x28 RSP: e02b:ffff880095e87828 EFLAGS: 00010246 RAX: ffff8800971e4078 RBX: ffff88009cbe0bd0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000d0000000d RBP: ffff880000b1e998 R08: ffff880000c39280 R09: 0000000000000298 R10: ffff880097eb1860 R11: 0000000000000298 R12: ffff880000b1e9a8 R13: 0000000000000009 R14: 0000000000000000 R15: 0000000000000001 FS: 00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0) Stack: ffffffff881484c7 ffffffff8027aa6a ffff880000b1e998 0000000000000001 ffff880000b1e9a8 ffffffff8022c9f1 000009c000000a00 ffff880000b1e998 ffff88009c0977e0 0000000000000001 0000000000000009 ffff880095e879d8 Call Trace: :xfs:xfs_file_close+0x0/0x28 filp_close+0x36/0x64 put_files_struct+0x6c/0xbf do_exit+0x2ae/0x929 hypercall_page+0x22a/0x1000 do_page_fault+0x119e/0x1253 monotonic_clock+0x3e/0x86 thread_return+0x0/0x13d error_exit+0x0/0x71 udp_poll+0x0/0xf3 udp_poll+0x0/0xf3 datagram_poll+0xcc/0xd6 datagram_poll+0x21/0xd6 udp_poll+0x13/0xf3 udp_poll+0x0/0xf3 do_select+0x2aa/0x464 __pollwait+0x0/0xdd default_wake_function+0x0/0xe default_wake_function+0x0/0xe default_wake_function+0x0/0xe default_wake_function+0x0/0xe sock_common_recvmsg+0x2d/0x43 sock_recvmsg+0x101/0x120 poison_obj+0x24/0x2d cache_free_debugcheck+0x1f9/0x209 udp_poll+0x0/0xf3 kmem_cache_free+0xd0/0x140 sys_select+0x273/0x3e5 init_fpu+0x62/0x7f math_state_restore+0x21/0x4a error_exit+0x0/0x71 sys_rt_sigreturn+0x251/0x301 system_call+0x86/0x8b system_call+0x0/0x8b Code: 48 8b 47 10 
ff 50 10 41 5b f7 d8 c3 31 c0 48 83 ff 28 51 74 RIP :xfs:xfs_file_close+0x1c/0x28 RSP CR2: 0000000d0000001d <1>Fixing recursive fault but reboot is needed! syslog-ng1981: segfault at ffffffff80808080 rip 000055555555bc28 rsp 00007fffffffd868 error 6 Slab corruption: start=ffff880096c743b8, len=256 Redzone: 0x1600000016/0x1700000017. Last user: <0000001800000018>(0x1800000018) 000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00 010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00 020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00 030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00 040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00 050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00 Prev obj: start=0000000396c7420b, len=256 Unable to handle kernel paging request at 0000000396c7430b RIP: print_objinfo+0x22/0xde PGD 0 Oops: 0000 3 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 1981, comm: syslog-ng Not tainted 2.6.18-xen31-smp #6 RIP: e030: print_objinfo+0x22/0xde RSP: e02b:ffff8800935cdb48 EFLAGS: 00010206 RAX: 0000000396c7430b RBX: 000000000089dac1 RCX: ffffffffff57c000 RDX: 0000000000000002 RSI: 0000000396c74203 RDI: ffff8800015f20c0 RBP: ffff8800015f20c0 R08: ffff880000cbc788 R09: 000000000000d64e R10: ffff8800935cda98 R11: ffffffff802fd0b5 R12: 0000000396c74203 R13: 0000000000000002 R14: ffff880096c743b0 R15: ffff880096c74000 FS: 00002aaaab0186e0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process syslog-ng (pid: 1981, threadinfo ffff8800935cc000, task ffff88009c0e0860) Stack: 000000000089dac1 ffff8800015f20c0 0000000396c74203 0000000000000100 ffff880096c743b0 ffffffff80277521 ffff8800015f20c0 0000000000000000 ffff8800015f20c0 ffff880096c743b0 ffffffff802ab3b1 00000000000000d0 Call Trace: 
check_poison_obj+0x152/0x1ae elf_core_dump+0xe2/0xc2d elf_core_dump+0xe2/0xc2d cache_alloc_debugcheck_after+0x34/0x1b0 kmem_cache_alloc+0xf2/0x102 elf_core_dump+0xe2/0xc2d do_truncate+0x60/0x69 do_coredump+0x5a0/0x601 kmem_cache_free+0xd0/0x140 __dequeue_signal+0x18b/0x19a get_signal_to_deliver+0x4ee/0x549 do_signal+0x55/0x6d8 do_page_fault+0x11f6/0x1253 :xfs:xfs_iunlock+0x4f/0x7a :xfs:xfs_fsync+0x157/0x1a9 __filemap_fdatawrite_range+0x51/0x5b retint_signal+0x5d/0xb8 Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e RIP print_objinfo+0x22/0xde RSP CR2: 0000000396c7430b <3>ib_mthca 0000:08:00.0: Memory key throughput optimization activated. ib_mthca 0000:08:00.0: Allocated EQ 1 with 131072 entries ib_mthca 0000:08:00.0: Allocated EQ 2 with 512 entries ib_mthca 0000:08:00.0: Setting mask 00000000001f47fe for eqn 2 ib_mthca 0000:08:00.0: NOP command failed to generate interrupt (IRQ 16), aborting. ib_mthca 0000:08:00.0: BIOS or ACPI interrupt routing problem? ib_mthca 0000:08:00.0: Clearing mask 00000000001f47fe for eqn 2 ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: Unmapping 64 pages at 26800000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26840000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26890000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26000000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26400000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 24000000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 25800000 from ICM. 
-- Lukáš Hejtmánek
From landman at scalableinformatics.com Sat Jul 7 19:58:26 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 07 Jul 2007 22:58:26 -0400 Subject: [ofa-general] found a simple fix for OFED-1.2 builds on OpenSuSE 10.2 Message-ID: <469052D2.8030203@scalableinformatics.com> Hi folks: I found a "simple" fix for OFED-1.2 builds on OpenSuSE. I was hoping for some advise on how to implement the fix, as I see a few options. Basically the problem is that OpenSuSE (and I assume future versions of SuSE) mask the HZ macro from non-kernel builds.
The fix is to replace every instance of HZ usage with a system call, by defining

  -DHZ='sysconf(_SC_CLK_TCK)'

I have verified that if I go into the build directory and rerun the command that failed in the original build.sh, inserting

  CC="gcc -DHZ='sysconf(_SC_CLK_TCK)'" \
  CFLAGS="-DHZ='sysconf(_SC_CLK_TCK)'"

(continued on a second line due to email client wrapping) immediately in front of the rpmbuild, this is sufficient for correct and complete building of the ofa_user-1.2 rpms on an unmodified OpenSuSE 10.2 distribution.

Ok. So now we know how to fix (hack) this, and why it breaks. The real fix is to seek out the uses of HZ and replace them with the system call as indicated. I would be happy to work on this if you would point me to whom I should send patches.

But the question I really have is this. How can I (at least temporarily) inject these environment variables (which ostensibly just alleviate manual patching) into the build process? Specifically, I looked in build.sh, and all the rpmbuild commands are of the form ex rpmbuild ... where ... are options. Rpmbuild presumes that you will pass any needed environment variables in, as I had done. So is this the right place to inject this environment variable change in the absence of a formal patch?

I could work up some additional hacked methods, but they are only temporary at best (such as using an rpmbuild.sh to force the issue). Thoughts, guidance, pointers, and clues are sought. I am not looking to formalize a hack, but I also need to get this build working. Longer term (next few weeks) I would prefer to get fixes back to the maintainer(s).

Thanks.
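For reference, a minimal sketch of what that define amounts to in user-space code, assuming a POSIX system. The helper names clock_ticks_per_second and ticks_to_ms are hypothetical and do not come from the OFED sources; they only show the kind of expression HZ would expand to.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical stand-in for the kernel HZ macro in user space:
 * asks the C library for the number of clock ticks per second
 * (the unit used by times()). */
long clock_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* Convert a tick count to milliseconds for a given tick rate.
 * Very large tick counts can overflow where long is 32 bits. */
long ticks_to_ms(long ticks, long hz)
{
    assert(hz > 0);
    return ticks * 1000 / hz;
}
```

One consequence worth keeping in mind: with -DHZ='sysconf(_SC_CLK_TCK)', HZ becomes a run-time function call rather than a compile-time constant, so any use of HZ in a static initializer, a file-scope array size, or an #if expression would no longer compile. Those uses, if there are any, would need a real patch rather than the define.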
Joe -- landman at scalableinformatics.com From ogerlitz at voltaire.com Sat Jul 7 23:38:30 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 08 Jul 2007 09:38:30 +0300 Subject: [ofa-general] socket buffer accounting with UDP/ipoib In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com> References: <1183643723.25031.262.camel@mtls03> <468CFBD0.6040407@voltaire.com> <6C2C79E72C305246B504CBA17B5500C901D362E7@mtlexch01.mtl.com> Message-ID: <46908666.8090908@voltaire.com> Eli Cohen wrote: >> can you resend the patch with function named appearing in each hunk > (ie after the @@ , use diff -p flag for that) >> Or. > > Sure. It is attached now - sorry but I using outlook from home :) nope, the attachment was also without the functions names, anyway, please see below some comments. > Index: ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c > =================================================================== > --- ofa_kernel-1.2.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-06-28 13:48:51.000000000 +0300 > +++ ofa_kernel-1.2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-08 09:52:29.000000000 +0300 > @@ -50,6 +50,8 @@ MODULE_PARM_DESC(data_debug_level, > "Enable data path debug tracing if > 0"); > #endif > > +#define SKB_LEN_THOLD 150 > + > static DEFINE_MUTEX(pkey_mutex); > > struct ipoib_ah *ipoib_create_ah(struct net_device *dev, > @@ -169,7 +171,7 @@ static void ipoib_ib_handle_rx_wc(struct > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV; > - struct sk_buff *skb; > + struct sk_buff *skb, *nskb; > u64 addr; > > ipoib_dbg_data(priv, "recv completion: id %d, op %d, status: %d\n", > @@ -223,6 +225,19 @@ static void ipoib_ib_handle_rx_wc(struct > ++priv->stats.rx_packets; > priv->stats.rx_bytes += skb->len; > > + if (skb->len < SKB_LEN_THOLD) { > + nskb = dev_alloc_skb(skb->len); > + if (!nskb) { > + ipoib_warn(priv, "failed to allocate skb\n"); > + return; > + } > + 
memcpy(nskb->data, skb->data, skb->len); > + skb_put(nskb, skb->len); > + nskb->protocol = skb->protocol; > + dev_kfree_skb_any(skb); > + skb = nskb; > + } > + > skb->dev = dev; > /* XXX get correct PACKET_ type here */ > skb->pkt_type = PACKET_HOST; > @@ -296,12 +311,12 @@ void ipoib_ib_completion(struct ib_cq *c > struct ipoib_dev_priv *priv = netdev_priv(dev); > int n, i; > > - ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); > do { > n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); > for (i = 0; i < n; ++i) > ipoib_ib_handle_wc(dev, priv->ibwc + i); > } while (n == IPOIB_NUM_WC); > + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); > } It seems that the change to ipoib_ib_completion() entered this patch by mistake, am I correct? Or. From ogerlitz at voltaire.com Sat Jul 7 23:53:49 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 08 Jul 2007 09:53:49 +0300 Subject: [ofa-general] Re: consumer data buffer ownership for inline sends In-Reply-To: References: Message-ID: <469089FD.10908@voltaire.com> Roland Dreier wrote: > > Does this means that for inline sends, when ibv_post_send returns, > > the consumer owns back the data buffer associated with this send? > > > > Can this be stated as the official policy of libibverbs? > > I guess that makes sense. I wonder if there's any conceivable > interpretation of the inline send flag where the adapter might need to > access the original buffer after the request is posted? Thinking about it a little, such an adapter would have too much logic/state implemented in its HW/DMA engine... Assuming all this is beyond the scope of the IB spec, can we take the liberty of turning it into an official libibverbs policy that provider libraries must conform to? Or. From vlad at dev.mellanox.co.il Sun Jul 8 01:17:03 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 08 Jul 2007 11:17:03 +0300 Subject: [ofa-general] [GIT PULL ofed_1_2] iw_cxgb3 - Don't allow interrupts while obtaining the ctrl-qp mutex.
In-Reply-To: <468E5768.7090200@opengridcomputing.com> References: <468E5768.7090200@opengridcomputing.com> Message-ID: <46909D7F.1040006@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull from > > git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 > > This patch fixes bug 681. > > Below is the patch. > > Steve. Done, Regards, Vladimir From erezz at voltaire.com Sun Jul 8 01:34:21 2007 From: erezz at voltaire.com (Erez Zilber) Date: Sun, 08 Jul 2007 11:34:21 +0300 Subject: [ofa-general] Will SLES 10 sp2 contain the RDMA-CM? Message-ID: <4690A18D.40709@voltaire.com> All, I've noticed that SLES 10 sp1 doesn't contain the RDMA-CM. We would like to add iSER for sp2, but without the RDMA-CM we cannot add it. Does Novell plan to add it to sp2? I guess that this should be very easy with the backport patches from OFED 1.2. Thanks, -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Team Voltaire – _The Grid Backbone_ __ www.voltaire.com From vlad at lists.openfabrics.org Sun Jul 8 02:44:15 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 8 Jul 2007 02:44:15 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070708-0200 daily build status Message-ID: <20070708094415.81B28E60824@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/~vlad/ofed_kernel.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18
Passed on powerpc with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.20
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.12
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18-8.el5
Passed on powerpc with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7

From kliteyn at dev.mellanox.co.il Sun Jul 8 06:55:41 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 08 Jul 2007 16:55:41 +0300 Subject: [ofa-general] [PATCH] osm: enhancing fat-tree routing for non-pure trees
Message-ID: <4690ECDD.7030106@dev.mellanox.co.il>

Hi Hal.

This patch handles the two new options for fat-tree routing: the root guid file and the compute node guid file. With these, fat-tree routing is able to handle trees that are not pure fat-trees, or that are not even symmetrical. But the routing "quality" depends on the tree "correctness": the more the topology looks like a pure fat-tree, the better the routing.

All the changes are in one file, osm_ucast_ftree.c. As much as I tried to divide this patch into separate stages, I found myself going back and fixing things too many times, so at this point it doesn't make sense to send it in parts: the earlier patches would contain too much wrong code that was only fixed later. Bottom line: sorry, but this thing has to go in a single patch.

Here's what this patch does:

1. Some modifications to ftree data structures and functions
   - Added guid getters for CAs and switches
   - Added node type and guid for each port group
   - Some naming changes
   - Added get_sw_by_guid and get_hca_by_guid functions
2. Reading roots and compute nodes from guid files
   - Marking CAs with the number of CNs on the node
   - Marking port groups if they belong to CN
3. Ranking rewritten to support root guids
   - ftree.tree_rank replaced by two ranks: ftree.max_switch_rank and ftree.leaf_switch_rank.
   - Tree rank for routing is considered as (ftree.leaf_switch_rank + 1)
4. Created leaf switch array that contains all the leafs with CNs and possibly leafs between them, according to the fabric indexing.
5. Checking new "lighter" topology constraint - all the leafs with real CNs should be at the same tree rank.
6. Implemented the routing itself:
   - routing to all the CNs first
   - routing dummy targets for all the missing nodes or non-CNs that are connected to leaf switches
   - routing to all the non-CN CAs in the fabric (routing them as real targets on secondary path)
   - routing to all the switch-to-switch paths (left the same)
7.
Updated ordering file dump function - Treating non-compute nodes as dummies -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 1348 ++++++++++++++++++++++++++++++++------- 1 files changed, 1109 insertions(+), 239 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index e91f3ed..6e62276 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -119,6 +119,17 @@ typedef struct { /*************************************************** ** + ** ftree_guid_tbl_element_t definition + ** + ***************************************************/ + +typedef struct { + cl_map_item_t map_item; + uint64_t guid_ho; +} ftree_guid_tbl_element_t; + +/*************************************************** + ** ** ftree_fwd_tbl_t definition ** ***************************************************/ @@ -147,21 +158,27 @@ typedef struct ftree_port_t_ ** ***************************************************/ +typedef union ftree_hca_or_sw_ +{ + struct ftree_hca_t_ * p_hca; + struct ftree_sw_t_ * p_sw; +} ftree_hca_or_sw; + typedef struct ftree_port_group_t_ { cl_map_item_t map_item; ib_net16_t base_lid; /* base lid of the current node */ ib_net16_t remote_base_lid; /* base lid of the remote node */ ib_net64_t port_guid; /* port guid of this port */ + ib_net64_t node_guid; /* this node's guid */ + uint8_t node_type; /* this node's type */ ib_net64_t remote_port_guid; /* port guid of the remote port */ ib_net64_t remote_node_guid; /* node guid of the remote node */ uint8_t remote_node_type; /* IB_NODE_TYPE_{CA,SWITCH,ROUTER,...} */ - union remote_hca_or_sw_ - { - struct ftree_hca_t_ * remote_hca; - struct ftree_sw_t_ * remote_sw; - } remote_hca_or_sw; /* pointer to remote hca/switch */ + ftree_hca_or_sw hca_or_sw; /* pointer to this hca/switch */ + ftree_hca_or_sw remote_hca_or_sw; /* pointer to remote hca/switch */ cl_ptr_vector_t ports; /* vector of ports to the same lid */ + boolean_t is_cn; /* 
whether this port is a compute node */ } ftree_port_group_t; /*************************************************** @@ -182,6 +199,7 @@ typedef struct ftree_sw_t_ ftree_port_group_t ** up_port_groups; uint8_t up_port_groups_num; ftree_fwd_tbl_t lft_buf; + boolean_t is_leaf; } ftree_sw_t; /*************************************************** @@ -195,6 +213,7 @@ typedef struct ftree_hca_t_ { osm_node_t * p_osm_node; ftree_port_group_t ** up_port_groups; uint16_t up_port_groups_num; + unsigned cn_num; } ftree_hca_t; /*************************************************** @@ -209,10 +228,14 @@ typedef struct ftree_fabric_t_ cl_qmap_t hca_tbl; cl_qmap_t sw_tbl; cl_qmap_t sw_by_tuple_tbl; - uint8_t tree_rank; + cl_list_t root_guid_list; + cl_qmap_t cn_guid_tbl; + unsigned cn_num; + uint8_t leaf_switch_rank; + uint8_t max_switch_rank; ftree_sw_t ** leaf_switches; uint32_t leaf_switches_num; - uint16_t max_hcas_per_leaf; + uint16_t max_cn_per_leaf; cl_pool_t sw_fwd_tbl_pool; uint16_t lft_max_lid_ho; boolean_t fabric_built; @@ -254,8 +277,8 @@ __osm_ftree_compare_port_groups_by_remote_switch_index( ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2; return __osm_ftree_compare_switches_by_index( - &((*pp_g1)->remote_hca_or_sw.remote_sw), - &((*pp_g2)->remote_hca_or_sw.remote_sw) ); + &((*pp_g1)->remote_hca_or_sw.p_sw), + &((*pp_g2)->remote_hca_or_sw.p_sw) ); } /***************************************************/ @@ -393,6 +416,37 @@ __osm_ftree_sw_tbl_element_destroy( /*************************************************** ** + ** ftree_guid_tbl_element_t functions + ** + ***************************************************/ + +static ftree_guid_tbl_element_t * +__osm_ftree_guid_tbl_element_create( + IN uint64_t guid) +{ + ftree_guid_tbl_element_t * p_element = + (ftree_guid_tbl_element_t *) malloc(sizeof(ftree_guid_tbl_element_t)); + if (!p_element) + return NULL; + memset(p_element, 0,sizeof(ftree_guid_tbl_element_t)); + + memcpy(&p_element->guid_ho, &guid, sizeof(uint64_t)); 
+ return p_element; +} + +/***************************************************/ + +static void +__osm_ftree_guid_tbl_element_destroy( + IN ftree_guid_tbl_element_t * p_element) +{ + if (!p_element) + return; + free(p_element); +} + +/*************************************************** + ** ** ftree_port_t functions ** ***************************************************/ @@ -433,11 +487,15 @@ static ftree_port_group_t * __osm_ftree_port_group_create( IN ib_net16_t base_lid, IN ib_net16_t remote_base_lid, - IN ib_net64_t * p_port_guid, - IN ib_net64_t * p_remote_port_guid, - IN ib_net64_t * p_remote_node_guid, + IN ib_net64_t port_guid, + IN ib_net64_t node_guid, + IN uint8_t node_type, + IN void * p_hca_or_sw, + IN ib_net64_t remote_port_guid, + IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, - IN void * p_remote_hca_or_sw) + IN void * p_remote_hca_or_sw, + IN boolean_t is_cn) { ftree_port_group_t * p_group = (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t)); @@ -447,18 +505,33 @@ __osm_ftree_port_group_create( p_group->base_lid = base_lid; p_group->remote_base_lid = remote_base_lid; - memcpy(&p_group->port_guid, p_port_guid, sizeof(ib_net64_t)); - memcpy(&p_group->remote_port_guid, p_remote_port_guid, sizeof(ib_net64_t)); - memcpy(&p_group->remote_node_guid, p_remote_node_guid, sizeof(ib_net64_t)); + memcpy(&p_group->port_guid, &port_guid, sizeof(ib_net64_t)); + memcpy(&p_group->node_guid, &node_guid, sizeof(ib_net64_t)); + memcpy(&p_group->remote_port_guid, &remote_port_guid, sizeof(ib_net64_t)); + memcpy(&p_group->remote_node_guid, &remote_node_guid, sizeof(ib_net64_t)); + + p_group->node_type = node_type; + switch (node_type) + { + case IB_NODE_TYPE_CA: + p_group->hca_or_sw.p_hca = (ftree_hca_t *)p_hca_or_sw; + break; + case IB_NODE_TYPE_SWITCH: + p_group->hca_or_sw.p_sw = (ftree_sw_t *)p_hca_or_sw; + break; + default: + /* we shouldn't get here - port is created only in hca or switch */ + CL_ASSERT(0); + } p_group->remote_node_type = 
remote_node_type; switch (remote_node_type) { case IB_NODE_TYPE_CA: - p_group->remote_hca_or_sw.remote_hca = (ftree_hca_t *)p_remote_hca_or_sw; + p_group->remote_hca_or_sw.p_hca = (ftree_hca_t *)p_remote_hca_or_sw; break; case IB_NODE_TYPE_SWITCH: - p_group->remote_hca_or_sw.remote_sw = (ftree_sw_t *)p_remote_hca_or_sw; + p_group->remote_hca_or_sw.p_sw = (ftree_sw_t *)p_remote_hca_or_sw; break; default: /* we shouldn't get here - port is created only in hca or switch */ @@ -468,6 +541,7 @@ __osm_ftree_port_group_create( cl_ptr_vector_init(&p_group->ports, 0, /* min size */ 8); /* grow size */ + p_group->is_cn = is_cn; return p_group; } /* __osm_ftree_port_group_create() */ @@ -640,6 +714,26 @@ __osm_ftree_sw_destroy( /***************************************************/ +static uint64_t +__osm_ftree_sw_get_guid_no( + IN ftree_sw_t * p_sw) +{ + if (!p_sw) + return 0; + return osm_node_get_node_guid(p_sw->p_osm_sw->p_node); +} + +/***************************************************/ + +static uint64_t +__osm_ftree_sw_get_guid_ho( + IN ftree_sw_t * p_sw) +{ + return cl_ntoh64(__osm_ftree_sw_get_guid_no(p_sw)); +} + +/***************************************************/ + static void __osm_ftree_sw_dump( IN ftree_fabric_t * p_ftree, @@ -657,7 +751,7 @@ __osm_ftree_sw_dump( "__osm_ftree_sw_dump: " "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n", __osm_ftree_tuple_to_str(p_sw->tuple), - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), p_sw->down_port_groups_num, p_sw->up_port_groups_num); @@ -735,11 +829,15 @@ __osm_ftree_sw_add_port( p_group = __osm_ftree_port_group_create( base_lid, remote_base_lid, - &port_guid, - &remote_port_guid, - &remote_node_guid, + port_guid, + __osm_ftree_sw_get_guid_no(p_sw), + IB_NODE_TYPE_SWITCH, + p_sw, + remote_port_guid, + remote_node_guid, remote_node_type, - p_remote_hca_or_sw); + p_remote_hca_or_sw, + FALSE); CL_ASSERT(p_group); if (direction == FTREE_DIRECTION_UP) @@ 
-835,6 +933,26 @@ __osm_ftree_hca_destroy( /***************************************************/ +static uint64_t +__osm_ftree_hca_get_guid_no( + IN ftree_hca_t * p_hca) +{ + if (!p_hca) + return 0; + return osm_node_get_node_guid(p_hca->p_osm_node); +} + +/***************************************************/ + +static uint64_t +__osm_ftree_hca_get_guid_ho( + IN ftree_hca_t * p_hca) +{ + return cl_ntoh64(__osm_ftree_hca_get_guid_no(p_hca)); +} + +/***************************************************/ + static void __osm_ftree_hca_dump( IN ftree_fabric_t * p_ftree, @@ -851,7 +969,7 @@ __osm_ftree_hca_dump( osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_hca_dump: " "CA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", - cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + __osm_ftree_hca_get_guid_ho(p_hca), p_hca->up_port_groups_num); for( i = 0; i < p_hca->up_port_groups_num; i++ ) @@ -888,7 +1006,8 @@ __osm_ftree_hca_add_port( IN ib_net64_t remote_port_guid, IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, - IN void * p_remote_hca_or_sw) + IN void * p_remote_hca_or_sw, + IN boolean_t is_cn) { ftree_port_group_t * p_group; @@ -903,11 +1022,15 @@ __osm_ftree_hca_add_port( p_group = __osm_ftree_port_group_create( base_lid, remote_base_lid, - &port_guid, - &remote_port_guid, - &remote_node_guid, + port_guid, + __osm_ftree_hca_get_guid_no(p_hca), + IB_NODE_TYPE_CA, + p_hca, + remote_port_guid, + remote_node_guid, remote_node_type, - p_remote_hca_or_sw); + p_remote_hca_or_sw, + is_cn); p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group; } __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num); @@ -933,6 +1056,10 @@ __osm_ftree_fabric_create() cl_qmap_init(&p_ftree->hca_tbl); cl_qmap_init(&p_ftree->sw_tbl); cl_qmap_init(&p_ftree->sw_by_tuple_tbl); + cl_qmap_init(&p_ftree->cn_guid_tbl); + + cl_list_construct( &p_ftree->root_guid_list ); + cl_list_init( &p_ftree->root_guid_list, 10 ); status = cl_pool_init( &p_ftree->sw_fwd_tbl_pool, 
8, /* min pool size */ @@ -945,7 +1072,6 @@ __osm_ftree_fabric_create() if (status != CL_SUCCESS) return NULL; - p_ftree->tree_rank = 1; return p_ftree; } @@ -960,6 +1086,9 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) ftree_sw_t * p_next_sw; ftree_sw_tbl_element_t * p_element; ftree_sw_tbl_element_t * p_next_element; + ftree_guid_tbl_element_t * p_guid_element; + ftree_guid_tbl_element_t * p_next_guid_element; + uint64_t * p_guid; if (!p_ftree) return; @@ -1000,6 +1129,26 @@ __osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) } cl_qmap_remove_all(&p_ftree->sw_by_tuple_tbl); + /* remove all the elements of cn_guid_tbl */ + + p_next_guid_element = + (ftree_guid_tbl_element_t *)cl_qmap_head(&p_ftree->cn_guid_tbl); + while( p_next_guid_element != + (ftree_guid_tbl_element_t *)cl_qmap_end(&p_ftree->cn_guid_tbl) ) + { + p_guid_element = p_next_guid_element; + p_next_guid_element = + (ftree_guid_tbl_element_t *)cl_qmap_next(&p_guid_element->map_item); + __osm_ftree_guid_tbl_element_destroy(p_guid_element); + } + cl_qmap_remove_all(&p_ftree->cn_guid_tbl); + + /* remove all the elements of root_guid_list*/ + + while ( (p_guid = (uint64_t*)cl_list_remove_head(&p_ftree->root_guid_list)) ) + free(p_guid); + cl_list_destroy(&p_ftree->root_guid_list); + /* free the leaf switches array */ if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches)) free(p_ftree->leaf_switches); @@ -1024,19 +1173,10 @@ __osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree) /***************************************************/ -static void -__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint32_t rank) -{ - if (rank > p_ftree->tree_rank) - p_ftree->tree_rank = rank; -} - -/***************************************************/ - static uint8_t __osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree) { - return p_ftree->tree_rank; + return p_ftree->leaf_switch_rank + 1; } /***************************************************/ @@ -1108,6 +1248,34 @@ __osm_ftree_fabric_get_sw_by_tuple( 
/***************************************************/ +static ftree_sw_t * +__osm_ftree_fabric_get_sw_by_guid( + IN ftree_fabric_t * p_ftree, + IN uint64_t guid) +{ + ftree_sw_t * p_sw; + p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,guid); + if (p_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)) + return NULL; + return p_sw; +} + +/***************************************************/ + +static ftree_hca_t * +__osm_ftree_fabric_get_hca_by_guid( + IN ftree_fabric_t * p_ftree, + IN uint64_t guid) +{ + ftree_hca_t * p_hca; + p_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,guid); + if (p_hca == (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl)) + return NULL; + return p_hca; +} + +/***************************************************/ + static void __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) { @@ -1133,7 +1301,7 @@ __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) __osm_ftree_hca_dump(p_ftree, p_hca); } - for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++) + for (i = 0; i < p_ftree->max_switch_rank; i++) { osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_dump: -- Rank %u switches\n", i); @@ -1160,7 +1328,6 @@ __osm_ftree_fabric_dump_general_info( { uint32_t i,j; ftree_sw_t * p_sw; - char * addition_str; osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_dump_general_info: " @@ -1170,15 +1337,20 @@ __osm_ftree_fabric_dump_general_info( osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_dump_general_info: " - " - FatTree rank (switches only): %u\n", - p_ftree->tree_rank); + " - FatTree rank (roots to leaf switches): %u\n", + p_ftree->leaf_switch_rank + 1); + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " + " - FatTree max switch rank: %u\n", + p_ftree->max_switch_rank); osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_dump_general_info: " - " - Fabric has %u CAs, %u switches\n", + " - Fabric has %u CAs (%u of them CNs), %u switches\n", 
cl_qmap_count(&p_ftree->hca_tbl), + p_ftree->cn_num, cl_qmap_count(&p_ftree->sw_tbl)); - for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++) + for (i = 0; i <= p_ftree->max_switch_rank; i++) { j = 0; for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); @@ -1189,16 +1361,20 @@ __osm_ftree_fabric_dump_general_info( j++; } if (i == 0) - addition_str = " (root) "; + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " + " - Fabric has %u switches at rank %u (roots)\n", + j, i); + else if (i == p_ftree->leaf_switch_rank) + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " + " - Fabric has %u switches at rank %u (%u of them leafs)\n", + j, i, p_ftree->leaf_switches_num); else - if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) - addition_str = " (leaf) "; - else - addition_str = " "; osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, "__osm_ftree_fabric_dump_general_info: " - " - Fabric has %u rank %u%s switches\n", - j, i, addition_str); + " - Fabric has %u switches at rank %u\n", + j, i); } if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE)) @@ -1214,7 +1390,7 @@ __osm_ftree_fabric_dump_general_info( osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_dump_general_info: " " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n", - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), __osm_ftree_tuple_to_str(p_sw->tuple)); } @@ -1227,8 +1403,7 @@ __osm_ftree_fabric_dump_general_info( osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_dump_general_info: " " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n", - cl_ntoh64(osm_node_get_node_guid( - p_ftree->leaf_switches[i]->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_ftree->leaf_switches[i]), cl_ntoh16(p_ftree->leaf_switches[i]->base_lid), __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple)); } @@ -1243,9 +1418,11 @@ 
__osm_ftree_fabric_dump_hca_ordering( { ftree_hca_t * p_hca; ftree_sw_t * p_sw; - ftree_port_group_t * p_group; + ftree_port_group_t * p_group_on_sw; + ftree_port_group_t * p_group_on_hca; uint32_t i; uint32_t j; + unsigned printed_hcas_on_leaf; char path[1024]; FILE * p_hca_ordering_file; @@ -1268,22 +1445,34 @@ __osm_ftree_fabric_dump_hca_ordering( for(i = 0; i < p_ftree->leaf_switches_num; i++) { p_sw = p_ftree->leaf_switches[i]; - /* for each real HCA connected to this switch */ + printed_hcas_on_leaf = 0; + + /* for each real CA (CNs and not) connected to this switch */ for (j = 0; j < p_sw->down_port_groups_num; j++) { - p_group = p_sw->down_port_groups[j]; - p_hca = p_group->remote_hca_or_sw.remote_hca; + p_group_on_sw = p_sw->down_port_groups[j]; + + if (p_group_on_sw->remote_node_type != IB_NODE_TYPE_CA) + continue; + + p_hca = p_group_on_sw->remote_hca_or_sw.p_hca; + p_group_on_hca = __osm_ftree_hca_get_port_group_by_remote_lid( + p_hca, p_group_on_sw->base_lid); + + /* treat non-compute nodes as dummies */ + if (!p_group_on_hca->is_cn) + continue; fprintf(p_hca_ordering_file,"0x%x\t%s\n", - cl_ntoh16(p_group->remote_base_lid), + cl_ntoh16(p_group_on_hca->base_lid), p_hca->p_osm_node->print_desc); + + printed_hcas_on_leaf++; } - /* now print dummy HCAs */ - for (j = p_sw->down_port_groups_num; j < p_ftree->max_hcas_per_leaf; j++) - { + /* now print missing HCAs */ + for (j = 0; j < (p_ftree->max_cn_per_leaf - printed_hcas_on_leaf); j++) fprintf(p_hca_ordering_file,"0xFFFF\tDUMMY\n"); - } } /* done going through all the leaf switches */ @@ -1368,28 +1557,88 @@ __osm_ftree_fabric_get_new_tuple( /***************************************************/ -static void -__osm_ftree_fabric_calculate_rank( +static inline boolean_t +__osm_ftree_fabric_roots_provided( IN ftree_fabric_t * p_ftree) { - ftree_sw_t * p_sw; - ftree_sw_t * p_next_sw; - uint32_t max_rank = 0; + return (p_ftree->p_osm->subn.opt.root_guid_file != NULL); +} - /* go over all the switches and find 
maximal switch rank */ +/***************************************************/ - p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); - while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) +static inline boolean_t +__osm_ftree_fabric_cns_provided( + IN ftree_fabric_t * p_ftree) +{ + return (p_ftree->p_osm->subn.opt.cn_guid_file != NULL); +} + +/***************************************************/ + +static int +__osm_ftree_fabric_mark_leaf_switches( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + unsigned i; + int res = 0; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_mark_leaf_switches); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_mark_leaf_switches: " + "Marking leaf switches in fabric\n"); + + /* Scan all the CAs, if they have CNs - find CN port and mark switch + that is connected to this port as leaf switch. + Also, ensure that this marked leaf has rank of p_ftree->leaf_switch_rank.*/ + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) ) { - p_sw = p_next_sw; - if(p_sw->rank > max_rank) - max_rank = p_sw->rank; - p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + p_hca = p_next_hca; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item); + if (!p_hca->cn_num) + continue; + + for( i = 0; i < p_hca->up_port_groups_num; i++ ) + { + if (!p_hca->up_port_groups[i]->is_cn) + continue; + + /* In CAs, port group alway has one port, and since this + port group is CN, we know that this port is compute node */ + CL_ASSERT(p_hca->up_port_groups[i]->remote_node_type == IB_NODE_TYPE_SWITCH); + p_sw = p_hca->up_port_groups[i]->remote_hca_or_sw.p_sw; + + /* check if this switch was already processed */ + if (p_sw->is_leaf) + continue; + p_sw->is_leaf = TRUE; + + /* ensure that this leaf switch is at the correct tree level */ + if (p_sw->rank != 
p_ftree->leaf_switch_rank) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_mark_leaf_switches: ERR AB26: " + "CN port 0x%" PRIx64 " is connected to switch 0x%" PRIx64 " with rank %u, " + "while FatTree leaf rank is %u\n", + cl_ntoh64(p_hca->up_port_groups[i]->port_guid), + __osm_ftree_sw_get_guid_ho(p_sw), + p_sw->rank, + p_ftree->leaf_switch_rank); + res = -1; + goto Exit; + + } + } } - /* set FatTree rank */ - __osm_ftree_fabric_set_rank(p_ftree, max_rank + 1); -} + Exit: + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return res; +} /* __osm_ftree_fabric_mark_leaf_switches() */ /***************************************************/ @@ -1410,20 +1659,14 @@ __osm_ftree_fabric_make_indexing( osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " "Starting FatTree indexing\n"); - /* create array of leaf switches */ - p_ftree->leaf_switches = (ftree_sw_t **) - malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *)); - - /* Looking for a leaf switch - the one that has rank equal to (tree_rank - 1). - This switch will be used as a starting point for indexing algorithm. */ - + /* using the first leaf switch as a starting point for indexing algorithm. 
*/ p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); - while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) + while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) { p_sw = p_next_sw; - if(p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + if (p_sw->is_leaf) break; - p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item); } CL_ASSERT(p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); @@ -1442,7 +1685,7 @@ __osm_ftree_fabric_make_indexing( p_sw->rank, __osm_ftree_tuple_to_str(p_sw->tuple), cl_ntoh16(p_sw->base_lid), - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node))); + __osm_ftree_sw_get_guid_ho(p_sw)); /* * Now run BFS and assign indexes to all switches @@ -1469,22 +1712,23 @@ __osm_ftree_fabric_make_indexing( /* Discover all the nodes from ports that are pointing down */ - if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + if (p_sw->rank >= p_ftree->leaf_switch_rank) { - /* add switch to leaf switches array */ - p_ftree->leaf_switches[p_ftree->leaf_switches_num++] = p_sw; - /* update the max_hcas_per_leaf value */ - if (p_sw->down_port_groups_num > p_ftree->max_hcas_per_leaf) - p_ftree->max_hcas_per_leaf = p_sw->down_port_groups_num; + /* whether downward ports are pointing to CAs or switches, + we don't assign indexes to switches that are located + lower than leaf switches */ } else { - /* This is not the leaf switch, which means that all the - ports that point down are taking us to another switches. - No need to assign indexing to HCAs */ + /* This is not the leaf switch */ for( i = 0; i < p_sw->down_port_groups_num; i++ ) { - p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw; + /* Work with port groups that are pointing to switches only. 
+ No need to assign indexing to HCAs */ + if (p_sw->down_port_groups[i]->remote_node_type != IB_NODE_TYPE_SWITCH) + continue; + + p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.p_sw; if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) { /* this switch has been already indexed */ @@ -1523,7 +1767,7 @@ __osm_ftree_fabric_make_indexing( that are pointing up are taking us to another switches. */ for( i = 0; i < p_sw->up_port_groups_num; i++ ) { - p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw; + p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.p_sw; if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) continue; /* allocate new tuple */ @@ -1554,14 +1798,138 @@ __osm_ftree_fabric_make_indexing( } cl_list_destroy(&bfs_list); - /* sort array of leaf switches by index */ - qsort(p_ftree->leaf_switches, /* array */ - p_ftree->leaf_switches_num, /* number of elements */ - sizeof(ftree_sw_t *), /* size of each element */ + OSM_LOG_EXIT(&p_ftree->p_osm->log); +} /* __osm_ftree_fabric_make_indexing() */ + +/***************************************************/ + +static int +__osm_ftree_fabric_create_leaf_switch_array( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + ftree_sw_t ** all_switches_at_leaf_level; + unsigned i; + unsigned all_leaf_idx = 0; + unsigned first_leaf_idx; + unsigned last_leaf_idx; + int res = 0; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_create_leaf_switch_array); + + /* create array of ALL the switches that have leaf rank */ + all_switches_at_leaf_level = (ftree_sw_t **) + malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *)); + if (!all_switches_at_leaf_level) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fat-tree routing: Memory allocation failed\n"); + res = -1; + goto Exit; + } + memset(all_switches_at_leaf_level,0,cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *)); + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != 
(ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) + { + p_sw = p_next_sw; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item); + if (p_sw->rank == p_ftree->leaf_switch_rank) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_create_leaf_switch_array: " + "Adding switch 0x%" PRIx64 " to full leaf switch array\n", + __osm_ftree_sw_get_guid_ho(p_sw)); + all_switches_at_leaf_level[all_leaf_idx++] = p_sw; + + } + } + + /* quick-sort array of leaf switches by index */ + qsort(all_switches_at_leaf_level, /* array */ + all_leaf_idx, /* number of elements */ + sizeof(ftree_sw_t *), /* size of each element */ __osm_ftree_compare_switches_by_index); /* comparator */ + /* check the first and the last REAL leaf (the one + that has CNs) in the array of all the leafs */ + + first_leaf_idx = all_leaf_idx; + last_leaf_idx = 0; + for ( i = 0; i < all_leaf_idx; i++ ) + { + if (all_switches_at_leaf_level[i]->is_leaf) + { + if (i < first_leaf_idx) + first_leaf_idx = i; + last_leaf_idx = i; + } + } + CL_ASSERT(first_leaf_idx < last_leaf_idx); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_create_leaf_switch_array: " + "Full leaf array info: first_leaf_idx = %u, last_leaf_idx = %u\n", + first_leaf_idx, last_leaf_idx); + + /* Create array of REAL leaf switches, sorted by index. 
+ This array may contain siwtches at the same rank w/o CNs, + in case this is the order of indexing.*/ + p_ftree->leaf_switches_num = last_leaf_idx - first_leaf_idx + 1; + p_ftree->leaf_switches = (ftree_sw_t **) + malloc(p_ftree->leaf_switches_num * sizeof(ftree_sw_t *)); + if (!p_ftree->leaf_switches) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fat-tree routing: Memory allocation failed\n"); + res = -1; + goto Exit; + } + + memcpy(p_ftree->leaf_switches, + &(all_switches_at_leaf_level[first_leaf_idx]), + p_ftree->leaf_switches_num * sizeof(ftree_sw_t *)); + + free(all_switches_at_leaf_level); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_create_leaf_switch_array: " + "Created array of %u leaf switches\n", + p_ftree->leaf_switches_num); + + Exit: OSM_LOG_EXIT(&p_ftree->p_osm->log); -} /* __osm_ftree_fabric_make_indexing() */ + return res; +} /* __osm_ftree_fabric_create_leaf_switch_array() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_set_max_cn_per_leaf( + IN ftree_fabric_t * p_ftree) +{ + unsigned i; + unsigned j; + unsigned cns_on_this_leaf; + ftree_sw_t * p_sw; + ftree_port_group_t * p_group; + + for (i = 0; i < p_ftree->leaf_switches_num; i++) + { + p_sw = p_ftree->leaf_switches[i]; + cns_on_this_leaf = 0; + for (j = 0; j < p_sw->down_port_groups_num; j++) + { + p_group = p_sw->down_port_groups[j]; + if (p_group->remote_node_type != IB_NODE_TYPE_CA) + continue; + cns_on_this_leaf += p_group->remote_hca_or_sw.p_hca->cn_num; + } + if (cns_on_this_leaf > p_ftree->max_cn_per_leaf) + p_ftree->max_cn_per_leaf = cns_on_this_leaf; + } +} /* __osm_ftree_fabric_set_max_cn_per_leaf() */ /***************************************************/ @@ -1617,11 +1985,11 @@ __osm_ftree_fabric_validate_topology( "ERR AB09: Different number of upward port groups on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n", - 
cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]), cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), reference_sw_arr[p_sw->rank]->up_port_groups_num, - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), __osm_ftree_tuple_to_str(p_sw->tuple), p_sw->up_port_groups_num); @@ -1629,7 +1997,7 @@ __osm_ftree_fabric_validate_topology( break; } - if ( p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1) && + if ( p_sw->rank != (tree_rank - 1) && reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num ) { /* we're allowing some hca's to be missing */ @@ -1638,11 +2006,11 @@ __osm_ftree_fabric_validate_topology( "ERR AB0A: Different number of downward port groups on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n", - cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]), cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), reference_sw_arr[p_sw->rank]->down_port_groups_num, - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), __osm_ftree_tuple_to_str(p_sw->tuple), p_sw->down_port_groups_num); @@ -1663,11 +2031,11 @@ __osm_ftree_fabric_validate_topology( "ERR AB0B: Different number of ports in an upward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", - cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]), 
cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), cl_ptr_vector_get_size(&p_ref_group->ports), - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), __osm_ftree_tuple_to_str(p_sw->tuple), cl_ptr_vector_get_size(&p_group->ports)); @@ -1691,11 +2059,11 @@ __osm_ftree_fabric_validate_topology( "ERR AB0C: Different number of ports in an downward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", - cl_ntoh64(osm_node_get_node_guid(reference_sw_arr[p_sw->rank]->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(reference_sw_arr[p_sw->rank]), cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), cl_ptr_vector_get_size(&p_ref_group->ports), - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), __osm_ftree_tuple_to_str(p_sw->tuple), cl_ptr_vector_get_size(&p_group->ports)); @@ -1781,9 +2149,6 @@ __osm_ftree_fabric_route_upgoing_by_going_down( /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); - /* can't be here for leaf switch, */ - CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)); - /* if there is no down-going ports */ if (p_sw->down_port_groups_num == 0) return; @@ -1793,6 +2158,10 @@ __osm_ftree_fabric_route_upgoing_by_going_down( { p_group = p_sw->down_port_groups[i]; + /* Skip this port group unless it points to a switch */ + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) + continue; + if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) ) { /* This port group has a port that was used when we entered this switch, @@ -1825,7 +2194,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down( lowest load of upgoing routes. 
Set on the remote switch how to get to the target_lid - set LFT(target_lid) on the remote switch to the remote port */ - p_remote_sw = p_group->remote_hca_or_sw.remote_sw; + p_remote_sw = p_group->remote_hca_or_sw.p_sw; if ( osm_switch_get_least_hops(p_remote_sw->p_osm_sw, cl_ntoh16(target_lid)) != OSM_NO_PATH ) @@ -1918,11 +2287,8 @@ __osm_ftree_fabric_route_upgoing_by_going_down( p_min_port->counter_up++; /* Recursion step: - Assign upgoing ports by stepping down, starting on REMOTE switch. - Recursion stop condition - if the REMOTE switch is a leaf switch. */ - if (p_remote_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) - { - __osm_ftree_fabric_route_upgoing_by_going_down( + Assign upgoing ports by stepping down, starting on REMOTE switch */ + __osm_ftree_fabric_route_upgoing_by_going_down( p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ NULL, /* prev. position - NULL to mark that we went down and not up */ @@ -1931,7 +2297,6 @@ __osm_ftree_fabric_route_upgoing_by_going_down( is_real_lid, /* whether the target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ highest_rank_in_route); /* highest visited point in the tree before going down */ - } } /* done scanning all the down-going port groups */ @@ -1972,11 +2337,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up( /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); - /* If this switch isn't a leaf switch: - Assign upgoing ports by stepping down, starting on THIS switch. */ - if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) - { - __osm_ftree_fabric_route_upgoing_by_going_down( + /* Assign upgoing ports by stepping down, starting on THIS switch */ + __osm_ftree_fabric_route_upgoing_by_going_down( p_ftree, p_sw, /* local switch - used as a route-upgoing alg. 
start point */ p_prev_sw, /* switch that we went up from (NULL means that we went down) */ @@ -1985,7 +2347,6 @@ __osm_ftree_fabric_route_downgoing_by_going_up( is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this path to HCA should by tracked by counters */ p_sw->rank); /* the highest visited point in the tree before going down */ - } /* recursion stop condition - if it's a root switch, */ if (p_sw->rank == 0) @@ -2026,7 +2387,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( lowest load of downgoing routes. Set on the remote switch how to get to the target_lid - set LFT(target_lid) on the remote switch to the remote port */ - p_remote_sw = p_min_group->remote_hca_or_sw.remote_sw; + p_remote_sw = p_min_group->remote_hca_or_sw.p_sw; /* Four possible cases: * @@ -2063,7 +2424,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up( /* covering first half of case 1, and case 3 */ if (is_main_path) { - if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + if (p_sw->is_leaf) { osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " @@ -2154,14 +2515,14 @@ __osm_ftree_fabric_route_downgoing_by_going_up( for (i = 0; i < p_sw->up_port_groups_num; i++) { p_group = p_sw->up_port_groups[i]; - p_remote_sw = p_group->remote_hca_or_sw.remote_sw; + p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl */ if (__osm_ftree_sw_get_fwd_table_block( p_remote_sw,cl_ntoh16(target_lid)) != OSM_NO_PATH) continue; - if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + if (p_sw->is_leaf) { osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " @@ -2219,70 +2580,99 @@ __osm_ftree_fabric_route_downgoing_by_going_up( */ static void -__osm_ftree_fabric_route_to_hcas( +__osm_ftree_fabric_route_to_cns( IN ftree_fabric_t * p_ftree) { ftree_sw_t * p_sw; - ftree_port_group_t * p_group; + ftree_hca_t * 
p_hca; + ftree_port_group_t * p_leaf_port_group; + ftree_port_group_t * p_hca_port_group; ftree_port_t * p_port; uint32_t i; uint32_t j; - ib_net16_t remote_lid; + ib_net16_t hca_lid; + unsigned routed_targets_on_leaf; - OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_cns); /* for each leaf switch (in indexing order) */ for(i = 0; i < p_ftree->leaf_switches_num; i++) { p_sw = p_ftree->leaf_switches[i]; + routed_targets_on_leaf = 0; /* for each HCA connected to this switch */ for (j = 0; j < p_sw->down_port_groups_num; j++) { + p_leaf_port_group = p_sw->down_port_groups[j]; + + /* work with this port group only if the remote node is CA */ + if (p_leaf_port_group->remote_node_type != IB_NODE_TYPE_CA) + continue; + + p_hca = p_leaf_port_group->remote_hca_or_sw.p_hca; + + /* work with this port group only if remote HCA has CNs */ + if (!p_hca->cn_num) + continue; + + p_hca_port_group = __osm_ftree_hca_get_port_group_by_remote_lid( + p_hca, p_leaf_port_group->base_lid); + CL_ASSERT(p_hca_port_group); + + /* work with this port group only if remote port is CN */ + if (!p_hca_port_group->is_cn) + continue; + /* obtain the LID of HCA port */ - p_group = p_sw->down_port_groups[j]; - remote_lid = p_group->remote_base_lid; + hca_lid = p_leaf_port_group->remote_base_lid; /* set local LFT(LID) to the port that is connected to HCA */ - cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port); + cl_ptr_vector_at(&p_leaf_port_group->ports, 0, (void **)&p_port); __osm_ftree_sw_set_fwd_table_block(p_sw, - cl_ntoh16(remote_lid), + cl_ntoh16(hca_lid), p_port->port_num); osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, - "__osm_ftree_fabric_route_to_hcas: " - "Switch %s: set path to CA LID 0x%x through port %u\n", + "__osm_ftree_fabric_route_to_cns: " + "Switch %s: set path to CN LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_sw->tuple), - cl_ntoh16(remote_lid), + cl_ntoh16(hca_lid), 
p_port->port_num); /* set local min hop table(LID) to route to the CA */ __osm_ftree_sw_set_hops(p_sw, p_ftree->lft_max_lid_ho, - cl_ntoh16(remote_lid), + cl_ntoh16(hca_lid), p_port->port_num, 1); - /* assign downgoing ports by stepping up */ + /* Assign downgoing ports by stepping up. + Since we're routing here only CNs, we're routing it as REAL + LID and updating fat-tree balancing counters.*/ __osm_ftree_fabric_route_downgoing_by_going_up( p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. position switch */ - remote_lid, /* LID that we're routing to */ - __osm_ftree_fabric_get_rank(p_ftree), /* rank of the LID that we're routing to */ + hca_lid, /* LID that we're routing to */ + p_sw->rank+1,/* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ TRUE); /* whether this path to HCA should by tracked by counters */ + + /* count how many real targets have been routed from this leaf switch */ + routed_targets_on_leaf++; } - /* We're done with the real HCAs. Now route the dummy HCAs that are missing. + /* We're done with the real targets (all CNs) of this leaf switch. + Now route the dummy HCAs that are missing or that are non-CNs. When routing to dummy HCAs we don't fill lid matrices.
*/ - if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) + if (p_ftree->max_cn_per_leaf > routed_targets_on_leaf) { - osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_cns: " "Routing %u dummy CAs\n", - p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); + p_ftree->max_cn_per_leaf - routed_targets_on_leaf); for ( j = 0; - ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); + ((int)j) < (p_ftree->max_cn_per_leaf - routed_targets_on_leaf); j++) { /* assign downgoing ports by stepping up */ @@ -2299,7 +2689,99 @@ __osm_ftree_fabric_route_to_hcas( } /* done going through all the leaf switches */ OSM_LOG_EXIT(&p_ftree->p_osm->log); -} /* __osm_ftree_fabric_route_to_hcas() */ +} /* __osm_ftree_fabric_route_to_cns() */ + +/***************************************************/ + +/* + * Pseudo code: + * foreach HCA non-CN port in fabric + * obtain the LID of the HCA port + * get switch that is connected to this HCA port + * set switch LFT(LID) to the port connecting to compute node + * call assign-down-going-port-by-descending-up(TRUE,FALSE) on CURRENT switch + * + * Routing to these HCAs is routing a REAL hca lid on SECONDARY path: + * - we should set fwd tables + * - we should NOT update port counters + */ + +static void +__osm_ftree_fabric_route_to_non_cns( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + ftree_port_t * p_hca_port; + ftree_port_group_t * p_hca_port_group; + ib_net16_t hca_lid; + unsigned port_num_on_switch; + unsigned i; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_non_cns); + + + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) ) + { + p_hca = p_next_hca; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item ); + + for (i = 0; i < 
p_hca->up_port_groups_num; i++) + { + p_hca_port_group = p_hca->up_port_groups[i]; + + /* skip this port if it's CN, in which case it has been already routed */ + if (p_hca_port_group->is_cn) + continue; + + /* skip this port if it is not connected to switch */ + if (p_hca_port_group->remote_node_type != IB_NODE_TYPE_SWITCH) + continue; + + p_sw = p_hca_port_group->remote_hca_or_sw.p_sw; + hca_lid = p_hca_port_group->base_lid; + + /* set switches LFT(LID) to the port that is connected to HCA */ + cl_ptr_vector_at(&p_hca_port_group->ports, 0, (void **)&p_hca_port); + port_num_on_switch = p_hca_port->remote_port_num; + __osm_ftree_sw_set_fwd_table_block(p_sw, + cl_ntoh16(hca_lid), + port_num_on_switch); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_to_non_cns: " + "Switch %s: set path to non-CN HCA LID 0x%x through port %u\n", + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ntoh16(hca_lid), + port_num_on_switch); + + /* set local min hop table(LID) to route to the CA */ + __osm_ftree_sw_set_hops(p_sw, + p_ftree->lft_max_lid_ho, + cl_ntoh16(hca_lid), + port_num_on_switch, /* port num */ + 1); /* hops */ + + /* Assign downgoing ports by stepping up. + We're routing REAL targets, but since they are not CNs and not + included in the leafs array, treat them as SECONDARY path, which + means that the counters won't be updated.*/ + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_sw, /* local switch - used as a route-downgoing alg. start point */ + NULL, /* prev. 
position switch */ + hca_lid, /* LID that we're routing to */ + p_sw->rank+1,/* rank of the LID that we're routing to */ + TRUE, /* whether this HCA LID is real or dummy */ + FALSE); /* whether this path to HCA should by tracked by counters */ + } + /* done with all the port groups of this HCA - go to next HCA */ + } + + OSM_LOG_EXIT(&p_ftree->p_osm->log); +} /* __osm_ftree_fabric_route_to_non_cns() */ /***************************************************/ @@ -2431,14 +2913,11 @@ __osm_ftree_rank_switches_from_leafs( osm_node_t * p_remote_node; osm_physp_t * p_osm_port; uint8_t i; - ftree_sw_tbl_element_t * p_sw_tbl_element = NULL; + unsigned max_rank = 0; while (!cl_is_list_empty(p_ranking_bfs_list)) { - p_sw_tbl_element = (ftree_sw_tbl_element_t *) cl_list_remove_head(p_ranking_bfs_list); - p_sw = p_sw_tbl_element->p_sw; - __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); - + p_sw = (ftree_sw_t *) cl_list_remove_head(p_ranking_bfs_list); p_node = p_sw->p_osm_sw->p_node; /* note: skipping port 0 on switches */ @@ -2456,9 +2935,9 @@ __osm_ftree_rank_switches_from_leafs( if (osm_node_get_type(p_remote_node) != IB_NODE_TYPE_SWITCH) continue; - p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl, - osm_node_get_node_guid(p_remote_node)); - if (p_remote_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)) + p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree, + osm_node_get_node_guid(p_remote_node)); + if (!p_remote_sw) { /* remote node is not a switch */ continue; @@ -2466,11 +2945,16 @@ __osm_ftree_rank_switches_from_leafs( /* if needed, rank the remote switch and add it to the BFS list */ if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1)) - cl_list_insert_tail(p_ranking_bfs_list, - &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); + { + max_rank = p_remote_sw->rank; + cl_list_insert_tail(p_ranking_bfs_list, p_remote_sw); + } } } + /* set FatTree maximal switch rank */ + p_ftree->max_switch_rank = max_rank; + } /* 
__osm_ftree_rank_switches_from_leafs() */ /***************************************************/ @@ -2508,7 +2992,7 @@ __osm_ftree_rank_leaf_switches( "__osm_ftree_rank_leaf_switches: ERR AB0F: " "CA conected directly to another CA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", - cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + __osm_ftree_hca_get_guid_ho(p_hca), cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node))); res = -1; goto Exit; @@ -2533,26 +3017,24 @@ __osm_ftree_rank_leaf_switches( /* remote node is switch */ - p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl, - p_osm_port->p_remote_physp->p_node->node_info.node_guid); + p_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree, + osm_node_get_node_guid(p_osm_port->p_remote_physp->p_node)); + CL_ASSERT(p_sw); - CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); + /* if needed, rank the remote switch and add it to the BFS list */ if ( !__osm_ftree_sw_update_rank(p_sw, 0) ) continue; - - /* if needed, rank the remote switch and add it to the BFS list */ osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_rank_leaf_switches: " "Marking rank of switch that is directly connected to CA:\n" " - CA guid : 0x%016" PRIx64 "\n" " - Switch guid: 0x%016" PRIx64 "\n" " - Switch LID : 0x%x\n", - cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_hca_get_guid_ho(p_hca), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid)); - cl_list_insert_tail(p_ranking_bfs_list, - &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); + cl_list_insert_tail(p_ranking_bfs_list, p_sw); } Exit: @@ -2569,7 +3051,7 @@ __osm_ftree_sw_reverse_rank( { ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; - p_sw->rank = __osm_ftree_fabric_get_rank(p_ftree) - p_sw->rank - 1; + p_sw->rank = p_ftree->max_switch_rank - p_sw->rank; } /*************************************************** @@ 
-2588,6 +3070,7 @@ __osm_ftree_fabric_construct_hca_ports( osm_physp_t * p_remote_osm_port; uint8_t i; uint8_t remote_port_num; + boolean_t is_cn = FALSE; int res = 0; for (i = 0; i < osm_node_get_num_physp(p_node); i++) @@ -2641,9 +3124,41 @@ __osm_ftree_fabric_construct_hca_ports( /* remote node is switch */ - p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid); - CL_ASSERT( p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ); - CL_ASSERT( (p_remote_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree) ); + p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,remote_node_guid); + CL_ASSERT( p_remote_sw ); + + /* If CN file is not supplied, then all the CAs considered as Compute Nodes. + Otherwise all the CAs are not CNs, and only guids that are present in the + CN file will be marked as compute nodes. */ + if ( !__osm_ftree_fabric_cns_provided(p_ftree) ) + { + is_cn = TRUE; + } + else + { + ftree_guid_tbl_element_t * p_elem = + (ftree_guid_tbl_element_t *)cl_qmap_get(&p_ftree->cn_guid_tbl, + osm_physp_get_port_guid(p_osm_port)); + if (p_elem != (ftree_guid_tbl_element_t *)cl_qmap_end(&p_ftree->cn_guid_tbl)) + is_cn = TRUE; + } + + if (is_cn) + { + p_ftree->cn_num++; + p_hca->cn_num++; + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_construct_hca_ports: " + "Marking CN port GUID 0x%016" PRIx64 "\n", + cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); + } + else + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_construct_hca_ports: " + "Marking non-CN port GUID 0x%016" PRIx64 "\n", + cl_ntoh64(osm_physp_get_port_guid(p_osm_port))); + } __osm_ftree_hca_add_port( p_hca, /* local ftree_hca object */ @@ -2655,7 +3170,8 @@ __osm_ftree_fabric_construct_hca_ports( osm_physp_get_port_guid(p_remote_osm_port),/* remote port guid */ remote_node_guid, /* remote node guid */ remote_node_type, /* remote node type */ - (void *) p_remote_sw); /* remote ftree_hca/sw object */ + (void *) p_remote_sw, /* 
remote ftree_hca/sw object */ + is_cn ); /* whether this port is compute node */ } Exit: @@ -2713,10 +3229,8 @@ __osm_ftree_fabric_construct_sw_ports( case IB_NODE_TYPE_CA: /* switch connected to hca */ - CL_ASSERT((p_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree)); - - p_remote_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,remote_node_guid); - CL_ASSERT(p_remote_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl)); + p_remote_hca = __osm_ftree_fabric_get_hca_by_guid(p_ftree,remote_node_guid); + CL_ASSERT(p_remote_hca); p_remote_hca_or_sw = (void *)p_remote_hca; direction = FTREE_DIRECTION_DOWN; @@ -2727,8 +3241,8 @@ __osm_ftree_fabric_construct_sw_ports( case IB_NODE_TYPE_SWITCH: /* switch connected to another switch */ - p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid); - CL_ASSERT(p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); + p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree,remote_node_guid); + CL_ASSERT(p_remote_sw); p_remote_hca_or_sw = (void *)p_remote_sw; if (abs(p_sw->rank - p_remote_sw->rank) != 1) @@ -2740,10 +3254,10 @@ __osm_ftree_fabric_construct_sw_ports( " GUID 0x%016" PRIx64 ", LID 0x%x, rank %u\n", p_sw->rank, p_remote_sw->rank, - cl_ntoh64(osm_node_get_node_guid(p_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid), p_sw->rank, - cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_osm_sw->p_node)), + __osm_ftree_sw_get_guid_ho(p_remote_sw), cl_ntoh16(p_remote_sw->base_lid), p_remote_sw->rank); res = -1; @@ -2795,7 +3309,126 @@ __osm_ftree_fabric_construct_sw_ports( ***************************************************/ static int -__osm_ftree_fabric_perform_ranking( +__osm_ftree_fabric_rank_from_roots( + IN ftree_fabric_t * p_ftree) +{ + osm_node_t * p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_physp; + ftree_sw_t * p_sw; + ftree_sw_t * p_remote_sw; + cl_list_t ranking_bfs_list; + uint64_t * p_guid; + int res = 0; + unsigned num_roots; 
+ unsigned max_rank = 0; + unsigned i; + cl_list_iterator_t guid_iterator; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank_from_roots); + cl_list_init(&ranking_bfs_list,10); + + /* Rank all the roots and add them to list */ + + guid_iterator = cl_list_head(&p_ftree->root_guid_list); + while( guid_iterator != cl_list_end(&p_ftree->root_guid_list) ) + { + p_guid = (uint64_t*)cl_list_obj(guid_iterator); + guid_iterator = cl_list_next(guid_iterator); + + p_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree, cl_hton64(*p_guid)); + if (!p_sw) + { + /* the specified root guid wasn't found in the fabric */ + osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_rank_from_roots: ERR AB24: " + "Root switch GUID 0x%" PRIx64 " not found\n", *p_guid ); + continue; + } + + osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_rank_from_roots: " + "Ranking root switch with GUID 0x%" PRIx64 "\n", *p_guid ); + p_sw->rank = 0; + cl_list_insert_tail(&ranking_bfs_list, p_sw); + } + + num_roots = cl_list_count(&ranking_bfs_list); + if (!num_roots) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_rank_from_roots: ERR AB25: " + "No valid roots supplied\n"); + res = -1; + goto Exit; + } + + osm_log( &p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_rank_from_roots: " + "Ranked %u valid root switches\n", num_roots); + + /* Now the list has all the roots. + BFS the subnet and update rank on all the switches. 
*/ + + while (!cl_is_list_empty(&ranking_bfs_list)) + { + p_sw = (ftree_sw_t *)cl_list_remove_head(&ranking_bfs_list); + p_osm_node = p_sw->p_osm_sw->p_node; + + /* note: skipping port 0 on switches */ + for (i = 1; i < osm_node_get_num_physp(p_osm_node); i++) + { + p_osm_physp = osm_node_get_physp_ptr(p_osm_node,i); + if (!osm_physp_is_valid(p_osm_physp)) + continue; + if (!osm_link_is_healthy(p_osm_physp)) + continue; + + p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL); + if (!p_remote_osm_node) + continue; + + if (osm_node_get_type(p_remote_osm_node) != IB_NODE_TYPE_SWITCH) + continue; + + p_remote_sw = __osm_ftree_fabric_get_sw_by_guid(p_ftree, + osm_node_get_node_guid(p_remote_osm_node)); + CL_ASSERT(p_remote_sw); + + /* if needed, rank the remote switch and add it to the BFS list */ + if (__osm_ftree_sw_update_rank(p_remote_sw, p_sw->rank + 1)) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_rank_from_roots: " + "Ranking switch 0x%" PRIx64 " with rank %u\n", + __osm_ftree_sw_get_guid_ho(p_remote_sw), + p_remote_sw->rank); + max_rank = p_remote_sw->rank; + cl_list_insert_tail(&ranking_bfs_list,p_remote_sw); + } + } + /* done with ports of this switch - go to the next switch in the list */ + } + + osm_log( &p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_rank_from_roots: " + "Subnet ranking completed. 
Max Node Rank = %u\n", + max_rank ); + + /* set FatTree maximal switch rank */ + p_ftree->max_switch_rank = max_rank; + + Exit: + cl_list_destroy(&ranking_bfs_list); + OSM_LOG_EXIT( &p_ftree->p_osm->log ); + return res; +} /* __osm_ftree_fabric_rank_from_roots() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_rank_from_hcas( IN ftree_fabric_t * p_ftree) { ftree_hca_t * p_hca; @@ -2803,11 +3436,9 @@ __osm_ftree_fabric_perform_ranking( cl_list_t ranking_bfs_list; int res = 0; - OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank_from_hcas); - /* Init the bfs list - the list of the switches that will be - initially filled with the leaf switches */ - cl_list_init(&ranking_bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); + cl_list_init(&ranking_bfs_list,10); /* Mark REVERSED rank of all the switches in the subnet. Start from switches that are connected to hca's, and @@ -2821,7 +3452,7 @@ __osm_ftree_fabric_perform_ranking( { res = -1; osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, - "__osm_ftree_fabric_perform_ranking: ERR AB14: " + "__osm_ftree_fabric_rank_from_hcas: ERR AB14: " "Subnet ranking failed - subnet is not FatTree"); goto Exit; } @@ -2830,36 +3461,106 @@ __osm_ftree_fabric_perform_ranking( /* Now rank rest of the switches in the fabric, while the list already contains all the ranked leaf switches */ __osm_ftree_rank_switches_from_leafs(p_ftree, &ranking_bfs_list); + + /* fix ranking of the switches by reversing the ranking direction */ + cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree); + + Exit: cl_list_destroy(&ranking_bfs_list); + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return res; +} /* __osm_ftree_fabric_rank_from_hcas() */ - /* REVERSED ranking of all the switches completed. 
- Calculate and set FatTree rank */ +/*************************************************** + ***************************************************/ - __osm_ftree_fabric_calculate_rank(p_ftree); - osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, - "__osm_ftree_fabric_perform_ranking: " - "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree)); +static int +__osm_ftree_fabric_rank( + IN ftree_fabric_t * p_ftree) +{ + int res = 0; - /* fix ranking of the switches by reversing the ranking direction */ - cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_rank); - if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || - __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) - { - osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, - "__osm_ftree_fabric_perform_ranking: ERR AB15: " - "Tree rank is %u (should be between %u and %u)\n", - __osm_ftree_fabric_get_rank(p_ftree), - FAT_TREE_MIN_RANK, - FAT_TREE_MAX_RANK); - res = -1; + if ( __osm_ftree_fabric_roots_provided(p_ftree) ) + res = __osm_ftree_fabric_rank_from_roots(p_ftree); + else + res = __osm_ftree_fabric_rank_from_hcas(p_ftree); + + if (res) goto Exit; - } + + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_rank: " + "FatTree max switch rank is %u\n", p_ftree->max_switch_rank); Exit: OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; -} /* __osm_ftree_fabric_perform_ranking() */ +} /* __osm_ftree_fabric_rank() */ + +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_fabric_set_leaf_rank( + IN ftree_fabric_t * p_ftree) +{ + unsigned i; + ftree_sw_t * p_sw; + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_set_leaf_rank); + + if ( !__osm_ftree_fabric_roots_provided(p_ftree) ) + { + /* If root file is not provided, the fabric has to be pure fat-tree + in 
terms of ranking. Thus, leaf switches rank is the max rank.*/ + p_ftree->leaf_switch_rank = p_ftree->max_switch_rank; + } + else + { + /* Find the first CN and set the leaf_switch_rank to the rank + of the switch that is connected to this CN. Later we will + ensure that all the leaf switches have the same rank. */ + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) ) + { + p_hca = p_next_hca; + if (p_hca->cn_num) + break; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item); + } + /* we know that there are CNs in the fabric, so just to be sure...*/ + CL_ASSERT( p_next_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl) ); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_set_leaf_rank: " + "Selected CN port GUID 0x%" PRIx64 "\n", + __osm_ftree_hca_get_guid_ho(p_hca)); + + for( i = 0; + (i < p_hca->up_port_groups_num) && (!p_hca->up_port_groups[i]->is_cn); + i++ ) + ; + CL_ASSERT( i < p_hca->up_port_groups_num ); + CL_ASSERT( p_hca->up_port_groups[i]->remote_node_type == IB_NODE_TYPE_SWITCH ); + + p_sw = p_hca->up_port_groups[i]->remote_hca_or_sw.p_sw; + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_set_leaf_rank: " + "Selected leaf switch GUID 0x%" PRIx64 ", rank %u\n", + __osm_ftree_sw_get_guid_ho(p_sw), + p_sw->rank); + p_ftree->leaf_switch_rank = p_sw->rank; + } + + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_set_leaf_rank: " + "FatTree leaf switch rank is %u\n", p_ftree->leaf_switch_rank); + OSM_LOG_EXIT(&p_ftree->p_osm->log); +} /* __osm_ftree_fabric_set_leaf_rank() */ /*************************************************** ***************************************************/ @@ -2907,6 +3608,104 @@ __osm_ftree_fabric_populate_ports( /*************************************************** ***************************************************/ +static void +__osm_ftree_convert_list2qmap( + cl_list_t * p_guid_list, + 
cl_qmap_t * p_map ) +{ + uint64_t * p_guid; + CL_ASSERT(p_map); + if ( !p_guid_list || !cl_list_count(p_guid_list) ) + return; + + while ( (p_guid = (uint64_t*)cl_list_remove_head(p_guid_list)) ) + { + /* object key is guid in network order */ + cl_qmap_insert( p_map, cl_hton64(*p_guid), + &((__osm_ftree_guid_tbl_element_create(*p_guid))->map_item) ); + free(p_guid); + } + CL_ASSERT(cl_is_list_empty(p_guid_list)); + +} /* __osm_ftree_convert_list2qmap() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_read_guid_files( + IN ftree_fabric_t * p_ftree) +{ + int status = 0; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_read_guid_files); + + if ( __osm_ftree_fabric_roots_provided(p_ftree) ) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_read_guid_files: " + "Fetching root nodes from file %s\n", + p_ftree->p_osm->subn.opt.root_guid_file ); + + if ( osm_ucast_mgr_read_guid_file(&p_ftree->p_osm->sm.ucast_mgr, + p_ftree->p_osm->subn.opt.root_guid_file, + &p_ftree->root_guid_list ) ) + { + status = -1; + goto Exit; + } + + if ( !cl_list_count(&p_ftree->root_guid_list) ) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_read_guid_files: ERR AB22: " + "Root guids file has no valid guids\n"); + status = -1; + goto Exit; + } + } + + if ( __osm_ftree_fabric_cns_provided(p_ftree) ) + { + cl_list_t cn_guid_list; + cl_list_construct(&cn_guid_list); + cl_list_init(&cn_guid_list, 10); + + osm_log( &p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_read_guid_files: " + "Fetching compute nodes from file %s\n", + p_ftree->p_osm->subn.opt.cn_guid_file ); + + if ( osm_ucast_mgr_read_guid_file(&p_ftree->p_osm->sm.ucast_mgr, + p_ftree->p_osm->subn.opt.cn_guid_file, + &cn_guid_list) ) + { + status = -1; + goto Exit; + } + + if ( !cl_list_count(&cn_guid_list) ) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_ERROR, + 
"__osm_ftree_fabric_read_guid_files: ERR AB23: " + "Compute node guids file has no valid guids\n"); + status = -1; + goto Exit; + } + + __osm_ftree_convert_list2qmap(&cn_guid_list, &p_ftree->cn_guid_tbl); + cl_list_destroy(&cn_guid_list); + CL_ASSERT(cl_qmap_count(&p_ftree->cn_guid_tbl)); + } + + Exit: + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return status; +} /*__osm_ftree_fabric_read_guid_files() */ + +/*************************************************** + ***************************************************/ + static int __osm_ftree_construct_fabric( IN void * context) @@ -2964,6 +3763,18 @@ __osm_ftree_construct_fabric( goto Exit; } + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " + "Reading guid files provided by user\n"); + if (__osm_ftree_fabric_read_guid_files(p_ftree) != 0) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Failed reading guid files - " + "falling back to default routing\n"); + status = -1; + goto Exit; + } + if (cl_qmap_count(&p_ftree->hca_tbl) < 2) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, @@ -2974,28 +3785,26 @@ __osm_ftree_construct_fabric( goto Exit; } + /* Rank all the switches in the fabric. + After that we will know only fabric max switch rank. 
+ We will be able to check leaf switches rank and the + whole tree rank after filling ports and marking CNs.*/ osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_construct_fabric: Ranking FatTree\n"); - - if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0) - { - if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK) - osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, - "Fabric rank is %u (>%u) - " - "fat-tree routing falls back to default routing\n", - __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK); - else if (__osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK) - osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, - "Fabric rank is %u (<%u) - " - "fat-tree routing falls back to default routing\n", - __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK); + if (__osm_ftree_fabric_rank(p_ftree) != 0) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Failed ranking the tree - " + "fat-tree routing falls back to default routing\n"); status = -1; goto Exit; } /* For each hca and switch, construct array of ports. 
- This is done after the whole FatTree data structure is ready, because - we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ + This is done after the whole FatTree data structure is ready, + because we want the ports to have pointers to ftree_{sw,hca}_t + objects, and we need the switches to be already ranked because + that's how the port direction is determined.*/ osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_construct_fabric: " "Populating CA & switch ports\n"); @@ -3007,14 +3816,68 @@ __osm_ftree_construct_fabric( status = -1; goto Exit; } + else if (p_ftree->cn_num == 0) + { + osm_log( &p_ftree->p_osm->log, OSM_LOG_SYS, + "Fabric has no valid compute nodes - " + "routing falls back to default routing\n"); + status = -1; + goto Exit; + } + + /* Now that the CA ports have been created and CNs were marked, + we can complete the fabric ranking - set leaf switches rank.*/ + __osm_ftree_fabric_set_leaf_rank(p_ftree); + + if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || + __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fabric rank is %u (should be between %u and %u) - " + "fat-tree routing falls back to default routing\n", + __osm_ftree_fabric_get_rank(p_ftree), + FAT_TREE_MIN_RANK, + FAT_TREE_MAX_RANK); + status = -1; + goto Exit; + } + + /* Mark all the switches in the fabric with rank equal to + p_ftree->leaf_switch_rank and that are also connected to CNs. + As a by-product, this function also runs basic topology + validation - it checks that all the CNs are at the same rank.*/ + if (__osm_ftree_fabric_mark_leaf_switches(p_ftree)) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fabric topology is not a fat-tree - " + "routing falls back to default routing\n"); + status = -1; + goto Exit; + } - /* Assign index to all the switches and hca's in the fabric. 
- This function also sorts all the port arrays of the switches - by the remote switch index, creates a leaf switch array - sorted by the switch index, and tracks the maximal number of - hcas per leaf switch. */ + /* Assign index to all the switches in the fabric. + This function also sorts leaf switch array by the switch index, + sorts all the port arrays of the indexed switches by remote + switch index, and creates switch-by-tuple table (sw_by_tuple_tbl) */ __osm_ftree_fabric_make_indexing(p_ftree); + /* Create leaf switch array sorted by index. + This array contains switches with rank equal to p_ftree->leaf_switch_rank + and that are also connected to CNs (REAL leafs), and it may contain + switches at the same leaf rank w/o CNs, if this is the order of indexing. + In any case, the first and the last switches in the array are REAL leafs.*/ + if (__osm_ftree_fabric_create_leaf_switch_array(p_ftree)) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fabric topology is not a fat-tree - " + "routing falls back to default routing\n"); + status = -1; + goto Exit; + } + + /* calculate and set ftree.max_cn_per_leaf field */ + __osm_ftree_fabric_set_max_cn_per_leaf(p_ftree); + /* print general info about fabric topology */ __osm_ftree_fabric_dump_general_info(p_ftree); @@ -3022,7 +3885,10 @@ __osm_ftree_construct_fabric( if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG)) __osm_ftree_fabric_dump(p_ftree); - if (! __osm_ftree_fabric_validate_topology(p_ftree)) + /* the fabric is required to be PURE fat-tree only if the root + guid file hasn't been provided by user */ + if ( ! __osm_ftree_fabric_roots_provided(p_ftree) && + ! 
__osm_ftree_fabric_validate_topology(p_ftree) ) { osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not a fat-tree - " @@ -3080,8 +3946,12 @@ __osm_ftree_do_routing( "Starting FatTree routing\n"); osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " - "Filling switch forwarding tables for routes to CAs\n"); - __osm_ftree_fabric_route_to_hcas(p_ftree); + "Filling switch forwarding tables for Compute Nodes\n"); + __osm_ftree_fabric_route_to_cns(p_ftree); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Filling switch forwarding tables for non-CN targets\n"); + __osm_ftree_fabric_route_to_non_cns(p_ftree); osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " "Filling switch forwarding tables for switch-to-switch pathes\n"); -- 1.5.1.4 From landman at scalableinformatics.com Sun Jul 8 08:37:30 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 08 Jul 2007 11:37:30 -0400 Subject: [ofa-general] problem with rdma_ucm in OpenSuSE 10.2 default kernel Message-ID: <469104BA.4080408@scalableinformatics.com> After getting it to build correctly, installing it, and configuring it, I am getting a crash in rdma_ucm. That and for some reason, there is a dependency upon ipv6.ko which depmod doesn't pick up. The latter is solvable easily, but the former is troubling. 
Here is the snippet from the messages file > Jul 8 11:08:30 jackrabbit kernel: ----------- [cut here ] --------- [please bite here ] --------- > Jul 8 11:08:30 jackrabbit kernel: Kernel BUG at fs/sysfs/file.c:473 > Jul 8 11:08:30 jackrabbit kernel: invalid opcode: 0000 [1] SMP > Jul 8 11:08:30 jackrabbit kernel: last sysfs file: /class/net/ib0/mode > Jul 8 11:08:30 jackrabbit kernel: CPU 3 > Jul 8 11:08:30 jackrabbit kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_local_sa ib_ipoib ipv6 snd_pcm_oss s > nd_mixer_oss ib_uverbs snd_seq ib_umad snd_seq_device ib_cm ib_sa cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_p > owersave powernow_k8 freq_table button battery ac ipmi_si ipmi_devintf ipmi_msghandler apparmor aamatch_pcre ext3 jbd mbcache loop > dm_mod usbhid usb_storage snd_hda_intel snd_hda_codec snd_pcm snd_timer ib_mthca snd shpchp ehci_hcd ib_mad ohci_hcd ohci1394 ib_co > re soundcore pci_hotplug ide_cd i2c_nforce2 ieee1394 forcedeth cdrom snd_page_alloc usbcore i2c_core xfs edd fan sg arcmsr sata_nv > libata amd74xx thermal processor sd_mod scsi_mod ide_disk ide_core > Jul 8 11:08:30 jackrabbit kernel: Pid: 5464, comm: modprobe Tainted: G U 2.6.18.2-34-default #1 > Jul 8 11:08:30 jackrabbit kernel: RIP: 0010:[] [] sysfs_create_file+0x19/0x31 > Jul 8 11:08:30 jackrabbit kernel: RSP: 0000:ffff81042171de50 EFLAGS: 00010202 > Jul 8 11:08:30 jackrabbit kernel: RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffff803eddf8 > Jul 8 11:08:30 jackrabbit kernel: RDX: 0000000000000000 RSI: ffffffff8856d720 RDI: ffff8104274f3810 > Jul 8 11:08:30 jackrabbit kernel: RBP: ffff810423e8c000 R08: ffffffff804d83b8 R09: ffff810424bb7b80 > Jul 8 11:08:30 jackrabbit kernel: R10: 0000000000000022 R11: ffff810424bb7b80 R12: ffff810423e8c5c0 > Jul 8 11:08:30 jackrabbit kernel: R13: ffffffff8856d900 R14: ffff810423e8c558 R15: ffffc20000a87e48 > Jul 8 11:08:30 jackrabbit kernel: FS: 00002b5c9772f6f0(0000) GS:ffff810428f7a9c0(0000) 
knlGS:0000000000000000 > Jul 8 11:08:30 jackrabbit kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Jul 8 11:08:30 jackrabbit kernel: CR2: 000000000062f007 CR3: 0000000226d4a000 CR4: 00000000000006e0 > Jul 8 11:08:30 jackrabbit kernel: Process modprobe (pid: 5464, threadinfo ffff81042171c000, task ffff8104288e3830) > Jul 8 11:08:30 jackrabbit kernel: Stack: ffffffff881a1026 ffffffff8856d900 ffffffff80299bcc 0000000000000019 > Jul 8 11:08:30 jackrabbit kernel: 0000000000000000 000000002171de78 0000000000000000 0000000000000000 > Jul 8 11:08:30 jackrabbit kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > Jul 8 11:08:30 jackrabbit kernel: Call Trace: > Jul 8 11:08:30 jackrabbit kernel: [] :rdma_ucm:ucma_init+0x26/0x4a > Jul 8 11:08:30 jackrabbit kernel: [] sys_init_module+0x172f/0x18e5 > Jul 8 11:08:30 jackrabbit kernel: [] system_call+0x7e/0x83 > Jul 8 11:08:30 jackrabbit kernel: > Jul 8 11:08:30 jackrabbit kernel: > Jul 8 11:08:30 jackrabbit kernel: Code: 0f 0b 68 b8 75 40 80 c2 d9 01 48 8b 7f 48 ba 04 00 00 00 e9 > Jul 8 11:08:30 jackrabbit kernel: RIP [] sysfs_create_file+0x19/0x31 > Jul 8 11:08:30 jackrabbit kernel: RSP > Jul 8 11:08:30 jackrabbit kernel: <6>ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready > Jul 8 11:08:36 jackrabbit kernel: eth0: no IPv6 routers present > Jul 8 11:08:40 jackrabbit kernel: ib0: no IPv6 routers present I bring ipoib for testing (pinging) hosts, as well as having some of the ssh traffic cross it. Sometimes quite useful. Is the above a known problem? Should I file a bug report? The tainted kernel is likely due to the arcmsr driver, though it is open source, so I am not sure what is "tainted" about it. 
Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From rdreier at cisco.com Sun Jul 8 08:54:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jul 2007 08:54:53 -0700 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: <20070708001531.GT3885@ics.muni.cz> (Lukas Hejtmanek's message of "Sun, 8 Jul 2007 02:15:32 +0200") References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> <20070708001531.GT3885@ics.muni.cz> Message-ID: > 000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00 > 010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00 > 020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00 > 030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00 > 040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00 > 050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00 OK, my guess right now would be that when the driver is trying to give memory to the HCA to use for its internal hardware data structures, the bus addresses given to the HCA end up being wrong for some reason. There could be a bug in mthca, but since this code is working fine on lots of non-Xen systems (and not just i386/x86-64 but also ppc and ia64 at least) right now I would be more suspicious of a bug in the Xen domU's pci_map_sg() or something like that. You can look in mthca_memfree.c, specifically mthca_alloc_icm() to see how the memory to give to the HCA is allocated and mapped. I gave it a quick look over and the way the DMA mapping API is used looks OK to me, but perhaps there is a subtle problem that is exposed by Xen. Although as I said before, right now I think it's more likely that we are hitting a bug in the Xen domU implementation of DMA mapping. 
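For anyone who wants to look for the same corruption in another buffer: read as little-endian 32-bit words, the dump quoted at the top of this message is a counter in which each value appears twice and wraps after 0x1f. Below is a minimal userspace sketch of a check for that shape. The function names are hypothetical and not part of mthca or any kernel API, and `sample_row` is simply the first row of the dump above.

```c
#include <stddef.h>
#include <stdint.h>

/* First row (offset 000) of the dump quoted above. */
static const uint8_t sample_row[16] = {
    0x17, 0x00, 0x00, 0x00, 0x17, 0x00, 0x00, 0x00,
    0x18, 0x00, 0x00, 0x00, 0x18, 0x00, 0x00, 0x00,
};

/* Decode a little-endian 32-bit word from four bytes. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: return 1 if buf (len bytes, a non-zero multiple
 * of 8) consists of pairs of equal 32-bit words whose value goes up by
 * one per pair and wraps from 0x1f back to 0, as in the dump; else 0.
 */
static int looks_like_counter_pattern(const uint8_t *buf, size_t len)
{
    uint32_t expect;
    size_t off;

    if (len < 8 || len % 8 != 0)
        return 0;

    expect = le32(buf);
    if (expect > 0x1f)
        return 0;

    for (off = 0; off < len; off += 8) {
        if (le32(buf + off) != expect || le32(buf + off + 4) != expect)
            return 0;
        expect = (expect + 1) & 0x1f;
    }
    return 1;
}
```

Calling `looks_like_counter_pattern(sample_row, sizeof sample_row)` accepts the quoted row. Whether a match really means the HCA wrote the data is a separate question that only the firmware people can answer.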
Michael, does my guess about the source of corruption make sense? Is
that pattern of every fourth byte counting up 00 ... 1f something the
HCA would write during initialization of ICM?

 - R.

From mst at dev.mellanox.co.il Sun Jul 8 11:17:15 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 8 Jul 2007 21:17:15 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To:
References: <20070704125429.GL3885@ics.muni.cz>
Message-ID: <20070708181715.GB32518@mellanox.co.il>

> Michael, does my guess about the source of corruption make sense? Is
> that pattern of every fourth byte counting up 00 ... 1f something the
> HCA would write during initialization of ICM?

Yes.

--
MST

From halr at voltaire.com Sun Jul 8 15:15:17 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Jul 2007 18:15:17 -0400
Subject: [ofa-general] Re: [PATCH] osm: enhancing fat-tree routing for non-pure trees
In-Reply-To: <4690ECDD.7030106@dev.mellanox.co.il>
References: <4690ECDD.7030106@dev.mellanox.co.il>
Message-ID: <1183932907.25217.312602.camel@hal.voltaire.com>

Hi Yevgeny,

On Sun, 2007-07-08 at 09:55, Yevgeny Kliteynik wrote:
> Hi Hal.
>
> This patch handles the two new options for fat-tree routing:
> root guid file and compute node guid files. With these options,
> fat-tree routing is able to handle trees that are not pure
> fat-trees, or even not symmetrical.
> But the routing "quality" depends on the tree "correctness" -
> the more the topology looks like a pure fat-tree, the better
> the routing.
>
> All the changes are in one file - osm_ucast_ftree.c, so as
> much as I've tried to divide this patch into separate stages,
> I found myself going back and fixing things too many times, so
> at this point it won't make sense to send this patch in parts,
> as earlier patches would have too much wrong code that was fixed
> later.
>
> Bottom line: sorry, but this thing has to go in a single patch.
>
> Here's what this patch does:
>
> 1. Some modifications to ftree data structures and functions
>    - Added guid getters for CAs and switches
>    - Added node type and guid for each port group
>    - Some naming changes
>    - Added get_sw_by_guid and get_hca_by_guid functions
>
> 2. Reading roots and compute nodes from guid files
>    - Marking CAs with the number of CNs on the node
>    - Marking port groups if they belong to CN
>
> 3. Ranking rewritten to support root guids
>    - ftree.tree_rank replaced by two ranks:
>      ftree.max_switch_rank and ftree.leaf_switch_rank.
>    - Tree rank for routing is considered as (ftree.leaf_switch_rank + 1)
>
> 4. Created leaf switch array that contains all the leafs
>    with CNs and possibly leafs between them, according to
>    the fabric indexing.
>
> 5. Checking new "lighter" topology constraint
>    - all the leafs with real CNs should be at the same tree rank.
>
> 6. Implemented the routing itself:
>    - routing to all the CNs first
>    - routing dummy targets for all the missing nodes
>      or non-CNs that are connected to leaf switches
>    - routing to all the non-CN CAs in the fabric
>      (routing them as real targets on secondary path)
>    - routing to all the switch-to-switch paths (left the same)
>
> 7. Updated ordering file dump function
>    - Treating non-compute nodes as dummies
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied.
-- Hal From rdreier at cisco.com Sun Jul 8 20:21:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jul 2007 20:21:01 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- more changes in for-roland for 2.6.23 In-Reply-To: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Fri, 06 Jul 2007 12:48:17 -0700") References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> Message-ID: thanks, applied 1-8 From rdreier at cisco.com Sun Jul 8 20:21:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 08 Jul 2007 20:21:28 -0700 Subject: [ofa-general] [PATCH 2/8] IB/ipath -- update MAINTAINERS In-Reply-To: <1183755367.25217.102865.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Jul 2007 16:56:08 -0400") References: <20070706194817.9093.43798.stgit@eng-46.internal.keyresearch.com> <20070706194828.9093.10451.stgit@eng-46.internal.keyresearch.com> <1183755367.25217.102865.camel@hal.voltaire.com> Message-ID: I also added this patch into my queue: commit c4c9e9a665495480ba88f0f7a7649b8dcbbdeaa6 Author: Roland Dreier Date: Sun Jul 8 20:20:48 2007 -0700 IB: Update mailing list address The InfiniBand / RDMA discussion list has moved. 
Signed-off-by: Roland Dreier

diff --git a/MAINTAINERS b/MAINTAINERS
index 57ebf1e..0e0aac4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -371,7 +371,7 @@ P: Tom Tucker
 M: tom at opengridcomputing.com
 P: Steve Wise
 M: swise at opengridcomputing.com
-L: openib-general at openib.org
+L: general at lists.openfabrics.org
 S: Maintained
 
 AOA (Apple Onboard Audio) ALSA DRIVER
@@ -1396,7 +1396,7 @@ P: Hoang-Nam Nguyen
 M: hnguyen at de.ibm.com
 P: Christoph Raisch
 M: raisch at de.ibm.com
-L: openib-general at openib.org
+L: general at lists.openfabrics.org
 S: Supported
 
 EMU10K1 SOUND DRIVER
@@ -1851,7 +1851,7 @@ P: Sean Hefty
 M: mshefty at ichips.intel.com
 P: Hal Rosenstock
 M: halr at voltaire.com
-L: openib-general at openib.org
+L: general at lists.openfabrics.org
 W: http://www.openib.org/
 T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git
 S: Supported

From liu_jf at neusoft.com Sun Jul 8 20:35:34 2007
From: liu_jf at neusoft.com (liu_jf at neusoft.com)
Date: Mon, 09 Jul 2007 11:35:34 +0800
Subject: [ofa-general] Generate ib_srpt.ko Failed!
Message-ID: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com>

Dear all,

I used OFED-1.2 to build the SCSI target modules, but when I run the
command "./configure --with-srp-target-mod", many errors occur. Most
are kernel patch failures. My OS is CentOS 5.0, with kernel version
2.6.18-8.el5. Can anyone give me some suggestions?

Any help would be greatly appreciated. Thank you!

Yours,
ljf

----------------------------------------------------------------------------------------------
Confidentiality Notice: The information contained in this e-mail and any
accompanying attachment(s) is intended only for the use of the intended
recipient and may be confidential and/or privileged of Neusoft Group Ltd.,
its subsidiaries and/or its affiliates. If any reader of this communication
is not the intended recipient, unauthorized use, forwarding, printing,
storing, disclosure or copying is strictly prohibited, and may be unlawful.
If you have received this communication in error, please immediately notify
the sender by return e-mail, and delete the original message and all copies
from your system. Thank you.
-----------------------------------------------------------------------------------------------

From rdreier at cisco.com Sun Jul 8 22:03:16 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 08 Jul 2007 22:03:16 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070708001531.GT3885@ics.muni.cz> (Lukas Hejtmanek's message of "Sun, 8 Jul 2007 02:15:32 +0200")
References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> <20070708001531.GT3885@ics.muni.cz>
Message-ID:

I don't know much about how Xen works, especially the PCI stuff in Xen
3.1. So this may be a stupid idea, but anyway....

Is the memory given to a domU always physically contiguous? If not,
what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
and allocate 256 KB or something like that? Let's assume that the domU
kernel has enough guest contiguous pages to satisfy the allocation --
is there any guarantee that the pages are really physically
contiguous?

If not, what happens if the domU kernel does pci_map_sg() on an sglist
with >0 order pages in it that are not physically contiguous? The DMA
mapping API only allows one bus address to be returned for each page,
even if they are order >0 and hence more than 4 KB. So if the pages
are guest contiguous but not physically contiguous on the host, it
seems we could end up with the problem you see, where the domU mthca
driver tries to pass memory to the HCA but the HCA ends up writing to
different memory.

 - R.
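To make the concern in the message above concrete, here is a small userspace model, assuming 4 KB pages. It is only a sketch and not Xen or kernel code: the guest-to-machine frame table `example_g2m`, its values, and all function names are invented for illustration. It compares the address a device would compute from a single bus address for a multi-page chunk with the machine address that really backs the same byte.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Invented guest-frame -> machine-frame table: guest frames 0..3 are
 * consecutive for the guest, but the machine frames behind them jump
 * from 0x101 to 0x205.
 */
static const uint64_t example_g2m[4] = { 0x100, 0x101, 0x205, 0x206 };

/* Return 1 if guest frames [first, first + npages) map to consecutive
 * machine frames, 0 otherwise. */
static int machine_contiguous(const uint64_t *g2m, size_t first, size_t npages)
{
    size_t i;

    for (i = 1; i < npages; i++)
        if (g2m[first + i] != g2m[first] + i)
            return 0;
    return 1;
}

/* Address a device would use for byte `off` of the chunk if it was
 * handed only the bus address of the chunk's first page. */
static uint64_t device_view(const uint64_t *g2m, size_t first, uint64_t off)
{
    return (g2m[first] << SKETCH_PAGE_SHIFT) + off;
}

/* Machine address that actually backs byte `off` of the chunk. */
static uint64_t actual_backing(const uint64_t *g2m, size_t first, uint64_t off)
{
    size_t page = (size_t)(off >> SKETCH_PAGE_SHIFT);

    return (g2m[first + page] << SKETCH_PAGE_SHIFT) +
           (off & (SKETCH_PAGE_SIZE - 1));
}
```

With the invented table, the first two pages are machine contiguous, so `device_view()` and `actual_backing()` agree for offsets below 8 KB. From the third page on they diverge, which is the kind of mismatch that would have the HCA writing to memory the driver never gave it.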
From jackm at dev.mellanox.co.il Mon Jul 9 00:12:52 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 9 Jul 2007 10:12:52 +0300 Subject: [ofa-general] [PATCH] mlx4: add device reset to Internal Error handling mechanism Message-ID: <200707091012.52418.jackm@dev.mellanox.co.il> Add device reset to mlx4 Internal Error handling. Also, detect errors via polling the device error buffer (rather than via interrupt), because this provides better coverage. This patch also disables the detection of Internal Errors via a device interrupt, because we wish to avoid the complexity of supporting two independent detection mechanisms. Signed-off-by: Jack Morgenstein diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c index 1bb088a..94bc784 100644 --- a/drivers/net/mlx4/catas.c +++ b/drivers/net/mlx4/catas.c @@ -30,15 +30,32 @@ * SOFTWARE. */ +#include +#include +#include #include "mlx4.h" +enum { + MLX4_CATAS_POLL_INTERVAL = 5 * HZ, +}; + +static DEFINE_SPINLOCK(catas_lock); + +static LIST_HEAD(catas_list); +static struct workqueue_struct *catas_wq; +static struct work_struct catas_work; + +static int ierr_reset_disable; +module_param_named(ierr_reset_disable, ierr_reset_disable, int, 0644); +MODULE_PARM_DESC(ierr_reset_disable, "disable reset on Internal Error event if nonzero"); + void mlx4_handle_catas_err(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); int i; - mlx4_err(dev, "Catastrophic error detected:\n"); + mlx4_err(dev, "Internal error detected:\n"); for (i = 0; i < priv->fw.catas_size; ++i) mlx4_err(dev, " buf[%02x]: %08x\n", i, swab32(readl(priv->catas_err.map + i))); @@ -46,25 +63,118 @@ void mlx4_handle_catas_err(struct mlx4_dev *dev) mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0); } -void mlx4_map_catas_buf(struct mlx4_dev *dev) +static void catas_reset(struct work_struct *work) +{ + struct mlx4_priv *priv, *tmppriv; + struct mlx4_dev *dev; + + LIST_HEAD(tlist); + int ret; + + 
spin_lock_irq(&catas_lock); + list_splice_init(&catas_list, &tlist); + spin_unlock_irq(&catas_lock); + + list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) { + ret = mlx4_restart_one(priv->dev.pdev); + dev = &priv->dev; + if (ret) + mlx4_err(dev, "Reset failed (%d)\n", ret); + else + mlx4_dbg(dev, "Reset succeeded\n"); + } +} + +static void handle_catas(struct mlx4_dev *dev) +{ + unsigned long flags; + struct mlx4_priv *priv = mlx4_priv(dev); + + mlx4_handle_catas_err(dev); + + if (ierr_reset_disable) + return; + + spin_lock_irqsave(&catas_lock, flags); + list_add(&priv->catas_err.list, &catas_list); + queue_work(catas_wq, &catas_work); + spin_unlock_irqrestore(&catas_lock, flags); +} + +static void poll_catas(unsigned long dev_ptr) +{ + struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr; + struct mlx4_priv *priv = mlx4_priv(dev); + unsigned long flags; + + if (readl(priv->catas_err.map)) { + handle_catas(&priv->dev); + return; + } + + spin_lock_irqsave(&catas_lock, flags); + if (!priv->catas_err.stop) + mod_timer(&priv->catas_err.timer, + jiffies + MLX4_CATAS_POLL_INTERVAL); + spin_unlock_irqrestore(&catas_lock, flags); + + return; +} + +void mlx4_start_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); unsigned long addr; + init_timer(&priv->catas_err.timer); + priv->catas_err.stop = 0; + priv->catas_err.map = NULL; + addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) + priv->fw.catas_offset; priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4); if (!priv->catas_err.map) - mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n", + mlx4_warn(dev, "Failed to map Internal Error buffer at 0x%lx\n", addr); + priv->catas_err.timer.data = (unsigned long) dev; + priv->catas_err.timer.function = poll_catas; + priv->catas_err.timer.expires = jiffies + MLX4_CATAS_POLL_INTERVAL; + INIT_LIST_HEAD(&priv->catas_err.list); + add_timer(&priv->catas_err.timer); } -void mlx4_unmap_catas_buf(struct mlx4_dev *dev) +void 
mlx4_stop_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + spin_lock_irq(&catas_lock); + priv->catas_err.stop = 1; + spin_unlock_irq(&catas_lock); + + del_timer_sync(&priv->catas_err.timer); + if (priv->catas_err.map) iounmap(priv->catas_err.map); + + spin_lock_irq(&catas_lock); + list_del(&priv->catas_err.list); + spin_unlock_irq(&catas_lock); +} + +int __init mlx4_catas_init(void) +{ + INIT_WORK(&catas_work, catas_reset); + + catas_wq = create_singlethread_workqueue("mlx4_err"); + if (!catas_wq) + return -ENOMEM; + + return 0; +} + +void mlx4_catas_cleanup(void) +{ + destroy_workqueue(catas_wq); } diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index 27a82ce..a9841c6 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -283,7 +283,9 @@ static irqreturn_t mlx4_msi_x_interrupt(int irq, void *eq_ptr) static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr) { - mlx4_handle_catas_err(dev_ptr); + /* disable handling catas errors via interrupt. */ + /* We now handle them via polling. 
*/ + /* mlx4_handle_catas_err(dev_ptr); */ /* MSI-X vectors always belong to us */ return IRQ_HANDLED; diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c index 9ae951b..be5d9e9 100644 --- a/drivers/net/mlx4/intf.c +++ b/drivers/net/mlx4/intf.c @@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev *dev) mlx4_add_device(intf, priv); mutex_unlock(&intf_mutex); + mlx4_start_catas_poll(dev); return 0; } @@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_interface *intf; + mlx4_stop_catas_poll(dev); mutex_lock(&intf_mutex); list_for_each_entry(intf, &intf_list, list) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 41eafeb..297fe41 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -582,8 +582,6 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev) goto err_pd_table_free; } - mlx4_map_catas_buf(dev); - err = mlx4_init_eq_table(dev); if (err) { mlx4_err(dev, "Failed to initialize " @@ -659,7 +657,6 @@ err_eq_table_free: mlx4_cleanup_eq_table(dev); err_catas_buf: - mlx4_unmap_catas_buf(dev); mlx4_cleanup_mr_table(dev); err_pd_table_free: @@ -835,9 +832,6 @@ err_cleanup: mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); mlx4_cleanup_uar_table(dev); @@ -884,9 +878,6 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev) mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); @@ -907,6 +898,12 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev) } } +int mlx4_restart_one(struct pci_dev *pdev) +{ + mlx4_remove_one(pdev); + return mlx4_init_one(pdev, NULL); +} + static struct pci_device_id mlx4_pci_table[] = { { PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */ { 
PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */ @@ -927,6 +924,10 @@ static int __init mlx4_init(void) { int ret; + ret = mlx4_catas_init(); + if (ret) + return ret; + ret = pci_register_driver(&mlx4_driver); return ret < 0 ? ret : 0; } @@ -934,6 +935,7 @@ static int __init mlx4_init(void) static void __exit mlx4_cleanup(void) { pci_unregister_driver(&mlx4_driver); + mlx4_catas_cleanup(); } module_init(mlx4_init); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 3d3b6d2..d4e9111 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -247,7 +247,9 @@ struct mlx4_mcg_table { struct mlx4_catas_err { u32 __iomem *map; - int size; + u32 stop; + struct timer_list timer; + struct list_head list; }; struct mlx4_priv { @@ -310,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_dev *dev); void mlx4_cleanup_srq_table(struct mlx4_dev *dev); void mlx4_cleanup_mcg_table(struct mlx4_dev *dev); -void mlx4_map_catas_buf(struct mlx4_dev *dev); -void mlx4_unmap_catas_buf(struct mlx4_dev *dev); - +void mlx4_start_catas_poll(struct mlx4_dev *dev); +void mlx4_stop_catas_poll(struct mlx4_dev *dev); +int mlx4_catas_init(void); +void mlx4_catas_cleanup(void); +int mlx4_restart_one(struct pci_dev *pdev); int mlx4_register_device(struct mlx4_dev *dev); void mlx4_unregister_device(struct mlx4_dev *dev); void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type, From kliteyn at dev.mellanox.co.il Mon Jul 9 01:31:34 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Jul 2007 11:31:34 +0300 Subject: [ofa-general] [PATCH 1/2] osm: updating doc with root and compute nodes options for fat-tree Message-ID: <4691F266.9000505@dev.mellanox.co.il> Hi Hal This patch has only cosmetics - removing trailing blanks in doc files. 
Signed-off-by: Yevgeny Kliteynik --- opensm/doc/current-routing.txt | 104 ++++++++++++++++++++-------------------- opensm/man/opensm.8 | 36 +++++++------- 2 files changed, 70 insertions(+), 70 deletions(-) diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 737949e..9852ef0 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -3,17 +3,17 @@ Current OpenSM Routing OpenSM offers four routing engines: -1. Min Hop Algorithm - based on the minimum hops to each node where the +1. Min Hop Algorithm - based on the minimum hops to each node where the path length is optimized. -2. UPDN Unicast routing algorithm - also based on the minimum hops to each -node, but it is constrained to ranking rules. This algorithm should be chosen -if the subnet is not a pure Fat Tree, and deadlock may occur due to a +2. UPDN Unicast routing algorithm - also based on the minimum hops to each +node, but it is constrained to ranking rules. This algorithm should be chosen +if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet. 3. Fat-tree Unicast routing algorithm - this algorithm optimizes routing -of fat-trees for congestion-free "shift" communication pattern. -It should be chosen if a subnet is a symmetrical fat-tree. +of fat-trees for congestion-free "shift" communication pattern. +It should be chosen if a subnet is a symmetrical fat-tree. Similar to UPDN routing, Fat-tree routing is credit-loop-free. 4. LASH unicast routing algorithm - uses Infiniband virtual layers @@ -22,7 +22,7 @@ distributing the paths between layers. LASH is an alternative deadlock-free topology-agnostic routing algorithm to the non-minimal UPDN algorithm avoiding the use of a potentially congested root node. -OpenSM also supports a file method which can load routes from a table. See +OpenSM also supports a file method which can load routes from a table. See modular-routing.txt for more information on this. 
The basic routing algorithm is comprised of two stages: @@ -41,10 +41,10 @@ a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with same MinHop to a LID, -the one with less previously assigned ports is selected. - If LMC > 0, more checks are added: Within each group of LIDs assigned to -same target port, - a. use only ports which have same MinHop +the one with less previously assigned ports is selected. + If LMC > 0, more checks are added: Within each group of LIDs assigned to +same target port, + a. use only ports which have same MinHop b. first prefer the ones that go to different systemImageGuid (then the previous LID of the same LMC group) c. if none - prefer those which go through another NodeGuid @@ -65,15 +65,15 @@ the fabric switches unless the -r (--reassign_lids) option is specified. LID assignments resolving multiple use of same LID. If a link is added or removed, OpenSM does not recalculate -the routes that do not have to change. A route has to change -if the port is no longer UP or no longer the MinHop. When routing changes +the routes that do not have to change. A route has to change +if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked. In the case of using the file based routing, any topology changes are -currently ignored The 'file' routing engine just loads the LFTs from the file -specified, with no reaction to real topology. Obviously, this will not be able -to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent -switches will be skipped. Multicast is not affected by 'file' routing engine +currently ignored The 'file' routing engine just loads the LFTs from the file +specified, with no reaction to real topology. 
Obviously, this will not be able +to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent +switches will be skipped. Multicast is not affected by 'file' routing engine (this uses min hop tables). @@ -82,7 +82,7 @@ Min Hop Algorithm The Min Hop algorithm is invoked when neither UPDN or the file method are specified. - + The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch and LFT output port assignment. Link subscription is also equalized with the ability to override based on @@ -102,39 +102,39 @@ UPDN Routing Algorithm Purpose of UPDN Algorithm -The UPDN algorithm is designed to prevent deadlocks from occurring in loops -of the subnet. A loop-deadlock is a situation in which it is no longer -possible to send data between any two hosts connected through the loop. As -such, the UPDN routing algorithm should be used if the subnet is not a pure -Fat Tree, and one of its loops may experience a deadlock (due, for example, +The UPDN algorithm is designed to prevent deadlocks from occurring in loops +of the subnet. A loop-deadlock is a situation in which it is no longer +possible to send data between any two hosts connected through the loop. As +such, the UPDN routing algorithm should be used if the subnet is not a pure +Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure). The UPDN algorithm is based on the following main stages: -1. Auto-detect root nodes - based on the CA hop length from any switch in -the subnet, a statistical histogram is built for each switch (hop num vs +1. Auto-detect root nodes - based on the CA hop length from any switch in +the subnet, a statistical histogram is built for each switch (hop num vs number of occurrences). If the histogram reflects a specific column (higher -than others) for a certain node, then it is marked as a root node. Since -the algorithm is statistical, it may not find any root nodes. 
The list of -the root nodes found by this auto-detect stage is used by the ranking +than others) for a certain node, then it is marked as a root node. Since +the algorithm is statistical, it may not find any root nodes. The list of +the root nodes found by this auto-detect stage is used by the ranking process stage. Note 1: The user can override the node list manually. - Note 2: If this stage cannot find any root nodes, and the user did not - specify a guid list file, OpenSM defaults back to the Min Hop + Note 2: If this stage cannot find any root nodes, and the user did not + specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm. -2. Ranking process - All root switch nodes (found in stage 1) are assigned -a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the -subnet are ranked incrementally. This ranking aids in the process of enforcing +2. Ranking process - All root switch nodes (found in stage 1) are assigned +a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the +subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths. -3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from -each (CA or switch) node in the subnet. During the BFS process, the FDB table -of each switch node traversed by BFS is updated, in reference to the starting +3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from +each (CA or switch) node in the subnet. During the BFS process, the FDB table +of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values. -At the end of the process, the updated FDB tables ensure loop-free paths +At the end of the process, the updated FDB tables ensure loop-free paths through the subnet. 
Note: Up/Down routing does not allow LID routing communication between @@ -150,21 +150,21 @@ UPDN Algorithm Usage Activation through OpenSM Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm. -Use `-a ' for adding an UPDN guid file that contains the +Use `-a ' for adding an UPDN guid file that contains the root nodes for ranking. -If the `-a' option is not used, OpenSM uses its auto-detect root nodes +If the `-a' option is not used, OpenSM uses its auto-detect root nodes algorithm. Notes on the guid list file: -1. A valid guid file specifies one guid in each line. Lines with an invalid +1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded. -2. The user should specify the root switch guids. However, it is also -possible to specify CA guids; OpenSM will use the guid of the switch (if +2. The user should specify the root switch guids. However, it is also +possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node. -To learn more about deadlock-free routing, see the article -"Deadlock Free Message Routing in Multiprocessor Interconnection Networks" +To learn more about deadlock-free routing, see the article +"Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J Dally and Charles L Seitz (1985). @@ -173,9 +173,9 @@ Fat-tree Routing Algorithm Purpose: -The fat-tree algorithm optimizes routing for "shift" communication pattern. +The fat-tree algorithm optimizes routing for "shift" communication pattern. It should be chosen if a subnet is a symmetrical fat-tree of various types. -It supports not just K-ary-N-Trees, by handling for non-constant K, +It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any CBB ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks. 
Fat-tree algorithm supports topologies that comply with the following rules: @@ -190,16 +190,16 @@ Fat-tree algorithm supports topologies that comply with the following rules: - Switches of the same rank should have the same number of ports in each DOWN-going port group. *ports that are connected to the same remote switch are referenced as -'port group'. +'port group'. -Note that although fat-tree algorithm supports trees with non-integer CBB +Note that although fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in case of integer CBB ratio. -In addition to this, although the algorithm allows leaf switches to have any +In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the -same directory where the OpenSM log resides. This ordering file provides the +same directory where the OpenSM log resides. This ordering file provides the CA order that may be used to create efficient communication pattern, that will match the routing tables. @@ -223,7 +223,7 @@ agnostic deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources / destinations and groups these paths into virtual layers in such a way -as to avoid deadlock. +as to avoid deadlock. Note LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between and switch does not need virtual @@ -254,7 +254,7 @@ available. In general LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology -agnostic and fares well in the face of faults. +agnostic and fares well in the face of faults. 
It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 index 00c7bbb..5f34cd1 100644 --- a/opensm/man/opensm.8 +++ b/opensm/man/opensm.8 @@ -1,7 +1,7 @@ .TH OPENSM 8 "June 22, 2007" "OpenIB" "OpenIB Management" .SH NAME -opensm \- InfiniBand subnet manager and administration (SM/SA) +opensm \- InfiniBand subnet manager and administration (SM/SA) .SH SYNOPSIS .B opensm @@ -20,10 +20,10 @@ InfiniBand subnet). opensm also now contains an experimental version of a performance manager as well. -opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB +opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. -opensm attaches to a specific IB port on the local machine and configures only +opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports). If no port is specified, it will select the first "best" available port. @@ -33,7 +33,7 @@ attach to. By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second -will include details of reported errors. All errors reported in this second +will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to @@ -75,7 +75,7 @@ one path between any two ports. 
\fB\-p\fR, \fB\-\-priority\fR This option specifies the SM\'s PRIORITY. This will effect the handover cases, where master -is chosen by priority and GUID. Range goes from 0 +is chosen by priority and GUID. Range goes from 0 (default and lowest priority) to 15 (highest). .TP \fB\-smkey\fR @@ -276,7 +276,7 @@ Display this usage info then exit. .PP The following environment variables control opensm behavior: -OSM_TMP_DIR - controls the directory in which the temporary files generated by +OSM_TMP_DIR - controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log. @@ -350,11 +350,11 @@ defined in the IBTA specification (for example, mtu=4 for 2048). PortGUIDs list: - PortGUID - GUID of partition member EndPort. Hexadecimal - numbers should start from 0x, decimal numbers + PortGUID - GUID of partition member EndPort. Hexadecimal + numbers should start from 0x, decimal numbers are accepted too. - full or limited - indicates full or limited membership for this - port. When omitted (or unrecognized) limited + full or limited - indicates full or limited membership for this + port. When omitted (or unrecognized) limited membership is assumed. There are two useful keywords for PortGUID definition: @@ -419,7 +419,7 @@ list of these parameters: template Both VL arbitration templates are pairs of VL and weight - qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is + qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15 (Note that VL15 used here means drop this SL) @@ -462,7 +462,7 @@ node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet. -3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing +3. 
Fat Tree Unicast routing algorithm - this algorithm optimizes routing for congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Trees of various types, not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio. @@ -660,7 +660,7 @@ Activation through OpenSM Use '-R ftree' option to activate the fat-tree algorithm. -Note: LMC > 0 is not supported by fat-tree routing. If this is +Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead. @@ -673,7 +673,7 @@ agnostic deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources / destinations and groups these paths into virtual layers in such a way -as to avoid deadlock. +as to avoid deadlock. Note LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between and switch does not need virtual @@ -704,7 +704,7 @@ available. In general LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology -agnostic and fares well in the face of faults. +agnostic and fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the @@ -729,7 +729,7 @@ To learn more about deadlock-free routing, see the article "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J Dally and Charles L Seitz (1985). -To learn more about the up/down algorithm, see the article +To learn more about the up/down algorithm, see the article "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad Politécnica de Valencia. 
@@ -786,7 +786,7 @@ To activate file based routing module, use: opensm -R file -U /path/to/dump_file -If the dump_file is not found or is in error, the default routing +If the dump_file is not found or is in error, the default routing algorithm is utilized. The ability to dump switch lid matrices (aka min hops tables) to file and @@ -816,7 +816,7 @@ Both or one of options -U and -M can be specified together with \'-R file\'. Hal Rosenstock .RI < halr at voltaire.com > .TP -Sasha Khapyorsky +Sasha Khapyorsky .RI < sashak at voltaire.com > .TP Eitan Zahavi -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Mon Jul 9 01:32:49 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 09 Jul 2007 11:32:49 +0300 Subject: [ofa-general] [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree Message-ID: <4691F2B1.4000803@dev.mellanox.co.il> Hi Hal. Updating doc and osm manpage with the recent enhancement of fat-tree routing. Signed-off-by: Yevgeny Kliteynik --- opensm/doc/current-routing.txt | 28 ++++++++++++++++++++++------ opensm/man/opensm.8 | 33 ++++++++++++++++++++++++++------- 2 files changed, 48 insertions(+), 13 deletions(-) diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 9852ef0..76f91ba 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -174,11 +174,14 @@ Fat-tree Routing Algorithm Purpose: The fat-tree algorithm optimizes routing for "shift" communication pattern. -It should be chosen if a subnet is a symmetrical fat-tree of various types. +It should be chosen if a subnet is a symmetrical or almost symmetrical +fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any CBB ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks. 
-Fat-tree algorithm supports topologies that comply with the following rules: + +If the root guid file is not provided ('-a' or '--root_guid_file' options), +the topology has to be pure fat-tree that complies with the following rules: - Tree rank should be between two and eight (inclusively) - Switches of the same rank should have the same number of UP-going port groups*, unless they are root switches, @@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules: of ports in each UP-going port group. - Switches of the same rank should have the same number of ports in each DOWN-going port group. -*ports that are connected to the same remote switch are referenced as + - All the CAs have to be at the same tree level (rank). + +If the root guid file is provided, the topology doesn't have to be pure +fat-tree, and it should only comply with the following rules: + - Tree rank should be between two and eight (inclusively) + - All the Compute Nodes** have to be at the same tree level (rank). + Note that non-compute node CAs are allowed here to be at different + tree ranks. + +* ports that are connected to the same remote switch are referenced as 'port group'. +** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file' +OpenSM options. Note that although fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in case of integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. +In general, even if the root list is provided, the closer the topology to a +pure and symmetrical fat-tree, the more optimal the routing will be. -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the -same directory where the OpenSM log resides. 
This ordering file provides the -CA order that may be used to create efficient communication pattern, that +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) +in the same directory where the OpenSM log resides. This ordering file provides +the CN order that may be used to create efficient communication pattern, that will match the routing tables. diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 index 5f34cd1..5472faf 100644 --- a/opensm/man/opensm.8 +++ b/opensm/man/opensm.8 @@ -603,7 +603,7 @@ UPDN Algorithm Usage Activation through OpenSM Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm. -Use '-a ' for adding an UPDN guid file that contains the +Use '-a ' for adding an UPDN guid file that contains the root nodes for ranking. If the `-a' option is not used, OpenSM uses its auto-detect root nodes algorithm. @@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node. Fat-tree Routing Algorithm The fat-tree algorithm optimizes routing for "shift" communication pattern. -It should be chosen if a subnet is a symmetrical fat-tree of various types. +It should be chosen if a subnet is a symmetrical or almost symmetrical +fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any CBB ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks. -The Fat-tree algorithm supports topologies that comply with the following rules: +If the root guid file is not provided ('-a' or '--root_guid_file' options), +the topology has to be pure fat-tree that complies with the following rules: - Tree rank should be between two and eight (inclusively) - Switches of the same rank should have the same number of UP-going port groups*, unless they are root switches, @@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules: of ports in each UP-going port group. 
- Switches of the same rank should have the same number of ports in each DOWN-going port group. + - All the CAs have to be at the same tree level (rank). -Note: ports that are connected to the same remote switch are referenced as +If the root guid file is provided, the topology doesn't have to be pure +fat-tree, and it should only comply with the following rules: + - Tree rank should be between two and eight (inclusively) + - All the Compute Nodes** have to be at the same tree level (rank). + Note that non-compute node CAs are allowed here to be at different + tree ranks. + +* ports that are connected to the same remote switch are referenced as \'port group\'. +** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\' +OpenSM options. + Topologies that do not comply cause a fallback to min hop routing. Note that this can also occur on link failures which cause the topology to no longer be "pure" fat-tree. @@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. +In general, even if the root list is provided, the closer the topology to a +pure and symmetrical fat-tree, the more optimal the routing will be. -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the -same directory where the OpenSM log resides. This ordering file provides the -CA order that may be used to create efficient communication pattern, that +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) +in the same directory where the OpenSM log resides. This ordering file provides +the CN order that may be used to create efficient communication pattern, that will match the routing tables. Activation through OpenSM Use '-R ftree' option to activate the fat-tree algorithm. 
+Use '-a ' to provide root nodes for ranking. If the `-a' option +is not used, routing algorithm will detect roots automatically. +Use '-u ' to provide the list of compute nodes. If the `-u' option +is not used, all the CAs are considered as compute nodes. Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead. -- 1.5.1.4

From vlad at lists.openfabrics.org Mon Jul 9 02:00:50 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Mon, 9 Jul 2007 02:00:50 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070709-0200 daily build status
Message-ID: <20070709090100.16430E60844@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/~vlad/ofed_1_2/.git
git_branch: ofed_1_2

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:

Failed:
Build failed on i686 with 2.6.15-23-server
Log:
--with-cxgb3-mod make CONFIG_INFINIBAND_CXGB3=m [no]
--without-cxgb3-mod [yes]
--with-cxgb3_debug-mod make CONFIG_INFINIBAND_CXGB3_DEBUG=y [no]
--without-cxgb3_debug-mod [yes]
--help - print out options
----------------------------------------------------------------------------------
Build failed on i686 with linux-2.6.12
Build failed on i686 with linux-2.6.18
Build failed on i686 with linux-2.6.17
Build failed on i686 with linux-2.6.22-rc7
Build failed on i686 with linux-2.6.19
Build failed on i686 with linux-2.6.21.1
Build failed on i686 with linux-2.6.13
Build failed on i686 with linux-2.6.14
Build failed on i686 with linux-2.6.16
Build failed on i686 with linux-2.6.15
Build failed on powerpc with linux-2.6.19
Build failed on x86_64 with linux-2.6.20
Build failed on x86_64 with linux-2.6.18-8.el5
Build failed on ia64 with linux-2.6.12
Build failed on x86_64 with linux-2.6.9-42.ELsmp
Build failed on x86_64 with linux-2.6.9-55.ELsmp
Build failed on ppc64 with linux-2.6.18-8.el5
Build failed on x86_64 with linux-2.6.16
Build failed on x86_64 with linux-2.6.5-7.244-smp
Build failed on ia64 with linux-2.6.21.1
Build failed on x86_64 with linux-2.6.9-34.ELsmp
Build failed on x86_64 with linux-2.6.13
Build failed on x86_64 with linux-2.6.15
Build failed on ppc64 with linux-2.6.19
Build failed on x86_64 with linux-2.6.12
Build failed on x86_64 with linux-2.6.9-22.ELsmp
Build failed on x86_64 with linux-2.6.16.21-0.8-smp
Build failed on x86_64 with linux-2.6.19
Build failed on x86_64 with linux-2.6.17
Build failed on x86_64 with linux-2.6.16.43-0.3-smp
Build failed on ppc64 with linux-2.6.14
Build failed on ia64 with linux-2.6.13
Build failed on powerpc with linux-2.6.13
Build failed on ia64 with linux-2.6.16.21-0.8-default
Build failed on ppc64 with linux-2.6.16
Build failed on powerpc with linux-2.6.16
Build failed on x86_64 with linux-2.6.18
Build failed on powerpc with linux-2.6.18
Build failed on x86_64 with linux-2.6.18-1.2798.fc6
Build failed on ppc64 with linux-2.6.17
Build failed on powerpc with linux-2.6.15
Build failed on x86_64 with linux-2.6.14
Build failed on ppc64 with linux-2.6.13
Build failed on powerpc with linux-2.6.17
Build failed on powerpc with linux-2.6.12
Build failed on ppc64 with linux-2.6.15
Build failed on x86_64 with linux-2.6.21.1
Build failed on ia64 with linux-2.6.14
Build failed on ia64 with linux-2.6.15
Build failed on powerpc with linux-2.6.14
Build failed on ia64 with linux-2.6.16
Build failed on ia64 with linux-2.6.19
Build failed on ia64 with linux-2.6.18
Build failed on ia64 with linux-2.6.17
Build failed on ppc64 with linux-2.6.18
Build failed on ppc64 with linux-2.6.12

From xhejtman at ics.muni.cz Mon Jul 9 02:08:02 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 11:08:02 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To:
References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> <20070708001531.GT3885@ics.muni.cz>
Message-ID: <20070709090802.GA3885@ics.muni.cz>

On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote:
> Is the memory given to a domU always physically contiguous? If not,
> what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
> and allocate 256 KB or something like that.
> Let's assume that the
> domU kernel has enough guest contiguous pages to satisfy the
> allocation -- is there any guarantee that the pages are really
> physically contiguous?

according to Xen-dev alloc_pages does *not* guarantee contiguous pages. They
say that the pci_alloc_consistent should be used instead. The question is
whether non-Xen kernel *usually* allocates contiguous pages and so far it has
been working and whether it should be fixed in the mainline of the driver.

I do some tests (and also try to figure out how to change alloc_pages to
pci_alloc_consistent) to verify contiguous pages.

Anyway, thanks a lot!!

-- 
Lukáš Hejtmánek

From muli at il.ibm.com  Mon Jul  9 02:12:10 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 9 Jul 2007 12:12:10 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709090802.GA3885@ics.muni.cz>
References: <20070704125429.GL3885@ics.muni.cz>
	<20070705193136.GQ3885@ics.muni.cz>
	<20070707085303.GS3885@ics.muni.cz>
	<20070708001531.GT3885@ics.muni.cz>
	<20070709090802.GA3885@ics.muni.cz>
Message-ID: <20070709091210.GP3182@rhun.haifa.ibm.com>

On Mon, Jul 09, 2007 at 11:08:02AM +0200, Lukas Hejtmanek wrote:
> On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote:
> > Is the memory given to a domU always physically contiguous? If not,
> > what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try
> > and allocate 256 KB or something like that. Let's assume that the
> > domU kernel has enough guest contiguous pages to satisfy the
> > allocation -- is there any guarantee that the pages are really
> > physically contiguous?
>
> according to Xen-dev alloc_pages does *not* guarantee contiguous
> pages. They say that the pci_alloc_consistent should be used
> instead. The question is whether non-Xen kernel *usually* allocates
> contiguous pages and so far it has been working and whether it
> should be fixed in the mainline of the driver.
>
> I do some tests (and also try to figure out how to change
> alloc_pages to pci_alloc_consistent) to verify contiguous pages.

You missed an important bit of Keir's response---it's perfectly fine
to use alloc_pages provided you then use the dma_map_single API, which
for Xen dom0 will take care of bounce-buffering to a
machine-contiguous buffer if necessary. I am not sure if the same
holds for a domU kernel.

Cheers,
Muli

From xhejtman at ics.muni.cz  Mon Jul  9 02:22:57 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 11:22:57 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709091210.GP3182@rhun.haifa.ibm.com>
References: <20070704125429.GL3885@ics.muni.cz>
	<20070705193136.GQ3885@ics.muni.cz>
	<20070707085303.GS3885@ics.muni.cz>
	<20070708001531.GT3885@ics.muni.cz>
	<20070709090802.GA3885@ics.muni.cz>
	<20070709091210.GP3182@rhun.haifa.ibm.com>
Message-ID: <20070709092257.GB3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 12:12:10PM +0300, Muli Ben-Yehuda wrote:
> You missed an important bit of Keir's response---it's perfectly fine
> to use alloc_pages provided you then use the dma_map_single API, which
> for Xen dom0 will take care of bounce-buffering to a
> machine-contiguous buffer if necessary. I am not sure if the same
> holds for a domU kernel.

I'm not familiar with this stuff but dma_map_single is invoked via
pci_map_page, isn't it?

so alloc_pages and then pci_map_page is ok. But in mthca_memfree.c
is alloc_pages and then pci_map_sg is used. Is it still OK?
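
For readers following the contiguity question above, here is a minimal
user-space sketch of what is being asked. It assumes a hypothetical table
p2m[] that maps each guest (pseudo-physical) page frame to the machine frame
backing it. The table and both helper names are invented for illustration;
this is not Xen or mthca code, and it says nothing about how the DMA API
decides to bounce.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration only: p2m[i] is the machine frame number that
 * backs guest frame i. On bare metal this would be the identity plus an
 * offset; under a hypervisor it can be an arbitrary permutation.
 */

/* Return 1 if guest frames [start, start + count) are backed by
 * consecutive machine frames, 0 otherwise. */
int machine_contiguous(const unsigned long *p2m, size_t nframes,
		       size_t start, size_t count)
{
	size_t i;

	assert(count > 0 && start < nframes && count <= nframes - start);

	for (i = 1; i < count; i++)
		if (p2m[start + i] != p2m[start] + i)
			return 0;
	return 1;
}

/* Count how many machine-contiguous chunks the guest range splits into. */
size_t machine_chunks(const unsigned long *p2m, size_t nframes,
		      size_t start, size_t count)
{
	size_t i, chunks = 1;

	assert(count > 0 && start < nframes && count <= nframes - start);

	for (i = 1; i < count; i++)
		if (p2m[start + i] != p2m[start + i - 1] + 1)
			chunks++;
	return chunks;
}
```

With a made-up table of {100, 101, 350, 351}, four guest-contiguous frames
split into two machine-contiguous chunks, which is the situation the thread
is worried about for a high-order alloc_pages() in a guest.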

-- 
Lukáš Hejtmánek

From muli at il.ibm.com  Mon Jul  9 02:30:59 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 9 Jul 2007 12:30:59 +0300
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709092257.GB3885@ics.muni.cz>
References: <20070705193136.GQ3885@ics.muni.cz>
	<20070707085303.GS3885@ics.muni.cz>
	<20070708001531.GT3885@ics.muni.cz>
	<20070709090802.GA3885@ics.muni.cz>
	<20070709091210.GP3182@rhun.haifa.ibm.com>
	<20070709092257.GB3885@ics.muni.cz>
Message-ID: <20070709093059.GS3182@rhun.haifa.ibm.com>

On Mon, Jul 09, 2007 at 11:22:57AM +0200, Lukas Hejtmanek wrote:
> On Mon, Jul 09, 2007 at 12:12:10PM +0300, Muli Ben-Yehuda wrote:
> > You missed an important bit of Keir's response---it's perfectly fine
> > to use alloc_pages provided you then use the dma_map_single API, which
> > for Xen dom0 will take care of bounce-buffering to a
> > machine-contiguous buffer if necessary. I am not sure if the same
> > holds for a domU kernel.
>
> I'm not familiar with this stuff but dma_map_single is invoked via
> pci_map_page, isn't it?

Depends on the specifics, but in general dma_map_single and
pci_map_page are both implemented in terms of the DMA-API.

> so alloc_pages and then pci_map_page is ok. But in mthca_memfree.c
> is alloc_pages and then pci_map_sg is used. Is it still OK?

Yes, same thing.

Cheers,
Muli

From halr at voltaire.com  Mon Jul  9 04:01:23 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 07:01:23 -0400
Subject: [ofa-general] Re: [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree
In-Reply-To: <4691F2B1.4000803@dev.mellanox.co.il>
References: <4691F2B1.4000803@dev.mellanox.co.il>
Message-ID: <1183978786.25217.366108.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2007-07-09 at 04:32, Yevgeny Kliteynik wrote:
> Hi Hal.
>
> Updating doc and osm manpage with the
> recent enhancement of fat-tree routing.
>
> Signed-off-by: Yevgeny Kliteynik

Thanks. Applied.

-- Hal

From fenkes at de.ibm.com  Mon Jul  9 06:02:21 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:02:21 +0200
Subject: [ofa-general] [PATCH 00/13] IB/ehca: eHCA2 enablement & some fixes
Message-ID: <200707091502.22407.fenkes@de.ibm.com>

This patch series enables the eHCA device driver to support new functions of
the eHCA2 chip. In addition, there are some bug fixes, code optimizations and
general new features included. Another set of patches will follow.

The patches, in detail, are:

[01/13] fixes a wrong parameter description
[02/13] adds HW capabilities autodetection
[03/13] restructures the QP code, preparing for Share Receive Queues (SRQ)
[04/13] adds SRQ support
[05/13] adds support for UD low latency QPs
[06/13] sets a flag that needs to be set on eHCA2
[07/13] adds RDMA atomic attributes to the data returned by query_qp()
[08/13] straightens out lock flag naming and adds static initializers
[09/13] refactors synchronization between completions and destroy_cq()
[10/13] changes the global idr spinlocks into rwlocks
[11/13] returns the QP pointer in poll_cq() instead of NULL
[12/13] adds notifications in case the SM LID etc. changes
[13/13] adds a slight latency improvement

The patches should apply cleanly, in order, against Roland's git. Please
review the changes and apply the patches for 2.6.23 if they are okay.

Regards,
  Joachim

-- 
Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer
IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2)
Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany
eMail: fenkes at de.ibm.com

From fenkes at de.ibm.com  Mon Jul  9 06:20:55 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:20:55 +0200
Subject: [ofa-general] [PATCH 01/13] IB/ehca: change scaling_code parameter description to match default value
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091520.56294.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen

Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/ehca_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index c3f99f3..fea199f 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -94,7 +94,7 @@ MODULE_PARM_DESC(poll_all_eqs,
 MODULE_PARM_DESC(static_rate,
 		 "set permanent static rate (default: disabled)");
 MODULE_PARM_DESC(scaling_code,
-		 "set scaling code (0: disabled, 1: enabled/default)");
+		 "set scaling code (0: disabled/default, 1: enabled)");

 spinlock_t ehca_qp_idr_lock;
 spinlock_t ehca_cq_idr_lock;
-- 
1.5.2

From fenkes at de.ibm.com  Mon Jul  9 06:21:45 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:21:45 +0200
Subject: [ofa-general] [PATCH 02/13] IB/ehca: HW level, HW caps and MTU autodetection
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091521.46883.fenkes@de.ibm.com>

In preparation for support of new eHCA2 features, change adapter probing:

- Hardware level is changed to encode major and minor chip version
- Hardware capabilities are queried from the firmware
- The maximum MTU is queried from the firmware instead of assuming a
  fixed value

Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/ehca_av.c      |    6 ++-
 drivers/infiniband/hw/ehca/ehca_classes.h |    2 +
 drivers/infiniband/hw/ehca/ehca_hca.c     |   27 +++++++++++-
 drivers/infiniband/hw/ehca/ehca_main.c    |   62 ++++++++++++++++++++++++++---
 drivers/infiniband/hw/ehca/hipz_hw.h      |   18 ++++++++
 5 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
index 0d6e2c4..3cd6bf3 100644
--- a/drivers/infiniband/hw/ehca/ehca_av.c
+++ b/drivers/infiniband/hw/ehca/ehca_av.c
@@ -118,7 +118,7 @@ struct ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr)
 		}
 		memcpy(&av->av.grh.word_1, &gid, sizeof(gid));
 	}
-	av->av.pmtu = EHCA_MAX_MTU;
+	av->av.pmtu = shca->max_mtu;

 	/* dgid comes in grh.word_3 */
 	memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid,
@@ -137,6 +137,8 @@ int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
 	struct ehca_av *av;
 	struct ehca_ud_av new_ehca_av;
 	struct ehca_pd *my_pd = container_of(ah->pd, struct ehca_pd, ib_pd);
+	struct ehca_shca *shca = container_of(ah->pd->device, struct ehca_shca,
+					      ib_device);
 	u32 cur_pid = current->tgid;

 	if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context &&
@@ -192,7 +194,7 @@ int ehca_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr)
 		memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid));
 	}

-	new_ehca_av.pmtu = EHCA_MAX_MTU;
+	new_ehca_av.pmtu = shca->max_mtu;

 	memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid,
 	       sizeof(ah_attr->grh.dgid));
diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 1d286d3..35d948f 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -107,6 +107,8 @@ struct ehca_shca {
 	struct ehca_pd *pd;
 	struct h_galpas galpas;
 	struct mutex modify_mutex;
+	u64 hca_cap;
+	int max_mtu;
 };

 struct ehca_pd {
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 32b55a4..b310de5 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -45,11 +45,25 @@

 int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 {
-	int ret = 0;
+	int i, ret = 0;
 	struct ehca_shca *shca = container_of(ibdev, struct ehca_shca,
 					      ib_device);
 	struct hipz_query_hca *rblock;

+	static const u32 cap_mapping[] = {
+		IB_DEVICE_RESIZE_MAX_WR, HCA_CAP_WQE_RESIZE,
+		IB_DEVICE_BAD_PKEY_CNTR, HCA_CAP_BAD_P_KEY_CTR,
+		IB_DEVICE_BAD_QKEY_CNTR, HCA_CAP_Q_KEY_VIOL_CTR,
+		IB_DEVICE_RAW_MULTI, HCA_CAP_RAW_PACKET_MCAST,
+		IB_DEVICE_AUTO_PATH_MIG, HCA_CAP_AUTO_PATH_MIG,
+		IB_DEVICE_CHANGE_PHY_PORT, HCA_CAP_SQD_RTS_PORT_CHANGE,
+		IB_DEVICE_UD_AV_PORT_ENFORCE, HCA_CAP_AH_PORT_NR_CHECK,
+		IB_DEVICE_CURR_QP_STATE_MOD, HCA_CAP_CUR_QP_STATE_MOD,
+		IB_DEVICE_SHUTDOWN_PORT, HCA_CAP_SHUTDOWN_PORT,
+		IB_DEVICE_INIT_TYPE, HCA_CAP_INIT_TYPE,
+		IB_DEVICE_PORT_ACTIVE_EVENT, HCA_CAP_PORT_ACTIVE_EVENT,
+	};
+
 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
 		ehca_err(&shca->ib_device, "Can't allocate rblock memory.");
@@ -96,6 +110,13 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props)
 	props->max_total_mcast_qp_attach
 		= min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX);

+	/* translate device capabilities */
+	props->device_cap_flags = IB_DEVICE_SYS_IMAGE_GUID |
+		IB_DEVICE_RC_RNR_NAK_GEN | IB_DEVICE_N_NOTIFY_CQ;
+	for (i = 0; i < ARRAY_SIZE(cap_mapping); i += 2)
+		if (rblock->hca_cap_indicators & cap_mapping[i + 1])
+			props->device_cap_flags |= cap_mapping[i];
+
 query_device1:
 	ehca_free_fw_ctrlblock(rblock);

@@ -261,7 +282,7 @@ int ehca_modify_port(struct ib_device *ibdev,
 	}

 	if (mutex_lock_interruptible(&shca->modify_mutex))
-		return -ERESTARTSYS;
+		return -ERESTARTSYS;

 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
@@ -290,7 +311,7 @@ modify_port2:
 	ehca_free_fw_ctrlblock(rblock);

 modify_port1:
-	mutex_unlock(&shca->modify_mutex);
+	mutex_unlock(&shca->modify_mutex);

 	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index fea199f..befbb9c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -205,11 +205,35 @@ static void ehca_destroy_slab_caches(void)
 #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39)
 #define EHCA_REVID EHCA_BMASK_IBM(40,63)

+static struct cap_descr {
+	u64 mask;
+	char *descr;
+} hca_cap_descr[] = {
+	{ HCA_CAP_AH_PORT_NR_CHECK, "HCA_CAP_AH_PORT_NR_CHECK" },
+	{ HCA_CAP_ATOMIC, "HCA_CAP_ATOMIC" },
+	{ HCA_CAP_AUTO_PATH_MIG, "HCA_CAP_AUTO_PATH_MIG" },
+	{ HCA_CAP_BAD_P_KEY_CTR, "HCA_CAP_BAD_P_KEY_CTR" },
+	{ HCA_CAP_SQD_RTS_PORT_CHANGE, "HCA_CAP_SQD_RTS_PORT_CHANGE" },
+	{ HCA_CAP_CUR_QP_STATE_MOD, "HCA_CAP_CUR_QP_STATE_MOD" },
+	{ HCA_CAP_INIT_TYPE, "HCA_CAP_INIT_TYPE" },
+	{ HCA_CAP_PORT_ACTIVE_EVENT, "HCA_CAP_PORT_ACTIVE_EVENT" },
+	{ HCA_CAP_Q_KEY_VIOL_CTR, "HCA_CAP_Q_KEY_VIOL_CTR" },
+	{ HCA_CAP_WQE_RESIZE, "HCA_CAP_WQE_RESIZE" },
+	{ HCA_CAP_RAW_PACKET_MCAST, "HCA_CAP_RAW_PACKET_MCAST" },
+	{ HCA_CAP_SHUTDOWN_PORT, "HCA_CAP_SHUTDOWN_PORT" },
+	{ HCA_CAP_RC_LL_QP, "HCA_CAP_RC_LL_QP" },
+	{ HCA_CAP_SRQ, "HCA_CAP_SRQ" },
+	{ HCA_CAP_UD_LL_QP, "HCA_CAP_UD_LL_QP" },
+	{ HCA_CAP_RESIZE_MR, "HCA_CAP_RESIZE_MR" },
+	{ HCA_CAP_MINI_QP, "HCA_CAP_MINI_QP" },
+};
+
 int ehca_sense_attributes(struct ehca_shca *shca)
 {
-	int ret = 0;
+	int i, ret = 0;
 	u64 h_ret;
 	struct hipz_query_hca *rblock;
+	struct hipz_query_port *port;

 	rblock = ehca_alloc_fw_ctrlblock(GFP_KERNEL);
 	if (!rblock) {
@@ -222,7 +246,7 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		ehca_gen_err("Cannot query device properties. h_ret=%lx",
 			     h_ret);
 		ret = -EPERM;
-		goto num_ports1;
+		goto sense_attributes1;
 	}

 	if (ehca_nr_ports == 1)
@@ -242,18 +266,44 @@ int ehca_sense_attributes(struct ehca_shca *shca)
 		ehca_gen_dbg(" ... hardware version=%x:%x", hcaaver, revid);

 		if ((hcaaver == 1) && (revid == 0))
-			shca->hw_level = 0;
+			shca->hw_level = 0x11;
 		else if ((hcaaver == 1) && (revid == 1))
-			shca->hw_level = 1;
+			shca->hw_level = 0x12;
 		else if ((hcaaver == 1) && (revid == 2))
-			shca->hw_level = 2;
+			shca->hw_level = 0x13;
+		else if ((hcaaver == 2) && (revid == 0))
+			shca->hw_level = 0x21;
+		else if ((hcaaver == 2) && (revid == 0x10))
+			shca->hw_level = 0x22;
+		else {
+			ehca_gen_warn("unknown hardware version"
+				      " - assuming default level");
+			shca->hw_level = 0x22;
+		}
 	}
 	ehca_gen_dbg(" ... hardware level=%x", shca->hw_level);

 	shca->sport[0].rate = IB_RATE_30_GBPS;
 	shca->sport[1].rate = IB_RATE_30_GBPS;

-num_ports1:
+	shca->hca_cap = rblock->hca_cap_indicators;
+	ehca_gen_dbg(" ... HCA capabilities:");
+	for (i = 0; i < ARRAY_SIZE(hca_cap_descr); i++)
+		if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap))
+			ehca_gen_dbg(" %s", hca_cap_descr[i].descr);
+
+	port = (struct hipz_query_port*)rblock;
+	h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port);
+	if (h_ret != H_SUCCESS) {
+		ehca_gen_err("Cannot query port properties. h_ret=%lx",
+			     h_ret);
+		ret = -EPERM;
+		goto sense_attributes1;
+	}
+
+	shca->max_mtu = port->max_mtu;
+
+sense_attributes1:
 	ehca_free_fw_ctrlblock(rblock);
 	return ret;
 }
diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h
index fad9136..9fe8367 100644
--- a/drivers/infiniband/hw/ehca/hipz_hw.h
+++ b/drivers/infiniband/hw/ehca/hipz_hw.h
@@ -360,6 +360,24 @@ struct hipz_query_hca {
 	u32 max_neq;
 } __attribute__ ((packed));

+#define HCA_CAP_AH_PORT_NR_CHECK EHCA_BMASK_IBM(0,0)
+#define HCA_CAP_ATOMIC EHCA_BMASK_IBM(1,1)
+#define HCA_CAP_AUTO_PATH_MIG EHCA_BMASK_IBM(2,2)
+#define HCA_CAP_BAD_P_KEY_CTR EHCA_BMASK_IBM(3,3)
+#define HCA_CAP_SQD_RTS_PORT_CHANGE EHCA_BMASK_IBM(4,4)
+#define HCA_CAP_CUR_QP_STATE_MOD EHCA_BMASK_IBM(5,5)
+#define HCA_CAP_INIT_TYPE EHCA_BMASK_IBM(6,6)
+#define HCA_CAP_PORT_ACTIVE_EVENT EHCA_BMASK_IBM(7,7)
+#define HCA_CAP_Q_KEY_VIOL_CTR EHCA_BMASK_IBM(8,8)
+#define HCA_CAP_WQE_RESIZE EHCA_BMASK_IBM(9,9)
+#define HCA_CAP_RAW_PACKET_MCAST EHCA_BMASK_IBM(10,10)
+#define HCA_CAP_SHUTDOWN_PORT EHCA_BMASK_IBM(11,11)
+#define HCA_CAP_RC_LL_QP EHCA_BMASK_IBM(12,12)
+#define HCA_CAP_SRQ EHCA_BMASK_IBM(13,13)
+#define HCA_CAP_UD_LL_QP EHCA_BMASK_IBM(16,16)
+#define HCA_CAP_RESIZE_MR EHCA_BMASK_IBM(17,17)
+#define HCA_CAP_MINI_QP EHCA_BMASK_IBM(18,18)
+
 /* query port response block */
 struct hipz_query_port {
 	u32 state;
-- 
1.5.2

From fenkes at de.ibm.com  Mon Jul  9 06:23:15 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:23:15 +0200
Subject: [ofa-general] [PATCH 03/13] IB/ehca: QP code restructuring in preparation for SRQ
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091523.16498.fenkes@de.ibm.com>

- Replace init_qp_queues() by a shorter init_qp_queue(), eliminating
  duplicate code.
- hipz_h_alloc_resource_qp() doesn't need a pointer to struct ehca_qp any
  longer.
  All input and output data is transferred through the parms parameter.
- Change the interface to also support SRQ.

Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/ehca_classes.h |   46 +++++-
 drivers/infiniband/hw/ehca/ehca_qp.c      |  254 +++++++++++++----------------
 drivers/infiniband/hw/ehca/hcp_if.c       |   35 ++---
 drivers/infiniband/hw/ehca/hcp_if.h       |    1 -
 4 files changed, 166 insertions(+), 170 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 35d948f..6e75db6 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -322,14 +322,49 @@ struct ehca_alloc_cq_parms {
 	struct ipz_eq_handle eq_handle;
 };

+enum ehca_service_type {
+	ST_RC = 0,
+	ST_UC = 1,
+	ST_RD = 2,
+	ST_UD = 3,
+};
+
+enum ehca_ext_qp_type {
+	EQPT_NORMAL = 0,
+	EQPT_LLQP = 1,
+	EQPT_SRQBASE = 2,
+	EQPT_SRQ = 3,
+};
+
+enum ehca_ll_comp_flags {
+	LLQP_SEND_COMP = 0x20,
+	LLQP_RECV_COMP = 0x40,
+	LLQP_COMP_MASK = 0x60,
+};
+
 struct ehca_alloc_qp_parms {
-	int servicetype;
+/* input parameters */
+	enum ehca_service_type servicetype;
 	int sigtype;
-	int daqp_ctrl;
-	int max_send_sge;
-	int max_recv_sge;
+	enum ehca_ext_qp_type ext_type;
+	enum ehca_ll_comp_flags ll_comp_flags;
+
+	int max_send_wr, max_recv_wr;
+	int max_send_sge, max_recv_sge;
 	int ud_av_l_key_ctl;

+	u32 token;
+	struct ipz_eq_handle eq_handle;
+	struct ipz_pd pd;
+	struct ipz_cq_handle send_cq_handle, recv_cq_handle;
+
+	u32 srq_qpn, srq_token, srq_limit;
+
+/* output parameters */
+	u32 real_qp_num;
+	struct ipz_qp_handle qp_handle;
+	struct h_galpas galpas;
+
 	u16 act_nr_send_wqes;
 	u16 act_nr_recv_wqes;
 	u8 act_nr_recv_sges;
@@ -337,9 +372,6 @@ struct ehca_alloc_qp_parms {

 	u32 nr_rq_pages;
 	u32 nr_sq_pages;
-
-	struct ipz_eq_handle ipz_eq_handle;
-	struct ipz_pd pd;
 };

 int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index b5bc787..ec1d555 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -234,13 +234,6 @@ static inline enum ib_qp_statetrans get_modqp_statetrans(int ib_fromstate,
 	return index;
 }

-enum ehca_service_type {
-	ST_RC = 0,
-	ST_UC = 1,
-	ST_RD = 2,
-	ST_UD = 3
-};
-
 /*
  * ibqptype2servicetype returns hcp service type corresponding to given
  * ib qp type used by create_qp()
@@ -268,15 +261,16 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype)
 }

 /*
- * init_qp_queues initializes/constructs r/squeue and registers queue pages.
+ * init_qp_queue initializes/constructs r/squeue and registers queue pages.
  */
-static inline int init_qp_queues(struct ehca_shca *shca,
-				 struct ehca_qp *my_qp,
-				 int nr_sq_pages,
-				 int nr_rq_pages,
-				 int swqe_size,
-				 int rwqe_size,
-				 int nr_send_sges, int nr_receive_sges)
+static inline int init_qp_queue(struct ehca_shca *shca,
+				struct ehca_qp *my_qp,
+				struct ipz_queue *queue,
+				int q_type,
+				u64 expected_hret,
+				int nr_q_pages,
+				int wqe_size,
+				int nr_sges)
 {
 	int ret, cnt, ipz_rc;
 	void *vpage;
@@ -284,104 +278,63 @@ static inline int init_qp_queues(struct ehca_shca *shca,
 	struct ib_device *ib_dev = &shca->ib_device;
 	struct ipz_adapter_handle ipz_hca_handle = shca->ipz_hca_handle;

-	ipz_rc = ipz_queue_ctor(&my_qp->ipz_squeue,
-				nr_sq_pages,
-				EHCA_PAGESIZE, swqe_size, nr_send_sges);
+	if (!nr_q_pages)
+		return 0;
+
+	ipz_rc = ipz_queue_ctor(queue, nr_q_pages, EHCA_PAGESIZE,
+				wqe_size, nr_sges);
 	if (!ipz_rc) {
-		ehca_err(ib_dev,"Cannot allocate page for squeue. ipz_rc=%x",
+		ehca_err(ib_dev,"Cannot allocate page for queue. ipz_rc=%x",
 			 ipz_rc);
 		return -EBUSY;
 	}

-	ipz_rc = ipz_queue_ctor(&my_qp->ipz_rqueue,
-				nr_rq_pages,
-				EHCA_PAGESIZE, rwqe_size, nr_receive_sges);
-	if (!ipz_rc) {
-		ehca_err(ib_dev, "Cannot allocate page for rqueue. ipz_rc=%x",
-			 ipz_rc);
-		ret = -EBUSY;
-		goto init_qp_queues0;
-	}
-	/* register SQ pages */
-	for (cnt = 0; cnt < nr_sq_pages; cnt++) {
-		vpage = ipz_qpageit_get_inc(&my_qp->ipz_squeue);
+	/* register queue pages */
+	for (cnt = 0; cnt < nr_q_pages; cnt++) {
+		vpage = ipz_qpageit_get_inc(queue);
 		if (!vpage) {
-			ehca_err(ib_dev, "SQ ipz_qpageit_get_inc() "
+			ehca_err(ib_dev, "ipz_qpageit_get_inc() "
 				 "failed p_vpage= %p", vpage);
 			ret = -EINVAL;
-			goto init_qp_queues1;
+			goto init_qp_queue1;
 		}
 		rpage = virt_to_abs(vpage);

 		h_ret = hipz_h_register_rpage_qp(ipz_hca_handle,
 						 my_qp->ipz_qp_handle,
-						 &my_qp->pf, 0, 0,
+						 NULL, 0, q_type,
 						 rpage, 1,
 						 my_qp->galpas.kernel);
-		if (h_ret < H_SUCCESS) {
-			ehca_err(ib_dev, "SQ hipz_qp_register_rpage()"
-				 " failed rc=%lx", h_ret);
-			ret = ehca2ib_return_code(h_ret);
-			goto init_qp_queues1;
-		}
-	}
-
-	ipz_qeit_reset(&my_qp->ipz_squeue);
-
-	/* register RQ pages */
-	for (cnt = 0; cnt < nr_rq_pages; cnt++) {
-		vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue);
-		if (!vpage) {
-			ehca_err(ib_dev, "RQ ipz_qpageit_get_inc() "
-				 "failed p_vpage = %p", vpage);
-			ret = -EINVAL;
-			goto init_qp_queues1;
-		}
-
-		rpage = virt_to_abs(vpage);
-
-		h_ret = hipz_h_register_rpage_qp(ipz_hca_handle,
-						 my_qp->ipz_qp_handle,
-						 &my_qp->pf, 0, 1,
-						 rpage, 1,my_qp->galpas.kernel);
-		if (h_ret < H_SUCCESS) {
-			ehca_err(ib_dev, "RQ hipz_qp_register_rpage() failed "
-				 "rc=%lx", h_ret);
-			ret = ehca2ib_return_code(h_ret);
-			goto init_qp_queues1;
-		}
-		if (cnt == (nr_rq_pages - 1)) {	/* last page! */
-			if (h_ret != H_SUCCESS) {
-				ehca_err(ib_dev, "RQ hipz_qp_register_rpage() "
+		if (cnt == (nr_q_pages - 1)) {	/* last page! */
+			if (h_ret != expected_hret) {
+				ehca_err(ib_dev, "hipz_qp_register_rpage() "
 					 "h_ret= %lx ", h_ret);
 				ret = ehca2ib_return_code(h_ret);
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 			vpage = ipz_qpageit_get_inc(&my_qp->ipz_rqueue);
 			if (vpage) {
 				ehca_err(ib_dev, "ipz_qpageit_get_inc() "
 					 "should not succeed vpage=%p", vpage);
 				ret = -EINVAL;
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 		} else {
 			if (h_ret != H_PAGE_REGISTERED) {
-				ehca_err(ib_dev, "RQ hipz_qp_register_rpage() "
+				ehca_err(ib_dev, "hipz_qp_register_rpage() "
 					 "h_ret= %lx ", h_ret);
 				ret = ehca2ib_return_code(h_ret);
-				goto init_qp_queues1;
+				goto init_qp_queue1;
 			}
 		}
 	}

-	ipz_qeit_reset(&my_qp->ipz_rqueue);
+	ipz_qeit_reset(queue);

 	return 0;

-init_qp_queues1:
-	ipz_queue_dtor(&my_qp->ipz_rqueue);
-init_qp_queues0:
-	ipz_queue_dtor(&my_qp->ipz_squeue);
+init_qp_queue1:
+	ipz_queue_dtor(queue);
 	return ret;
 }

@@ -397,14 +350,17 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 					      ib_device);
 	struct ib_ucontext *context = NULL;
 	u64 h_ret;
-	int max_send_sge, max_recv_sge, ret;
+	int is_llqp = 0, has_srq = 0;
+	int qp_type, max_send_sge, max_recv_sge, ret;

 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
 	u32 swqe_size = 0, rwqe_size = 0;
-	u8 daqp_completion, isdaqp;
 	unsigned long flags;

+	memset(&parms, 0, sizeof(parms));
+	qp_type = init_attr->qp_type;
+
 	if (init_attr->sq_sig_type != IB_SIGNAL_REQ_WR &&
 		init_attr->sq_sig_type != IB_SIGNAL_ALL_WR) {
 		ehca_err(pd->device, "init_attr->sg_sig_type=%x not allowed",
@@ -412,38 +368,47 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		return ERR_PTR(-EINVAL);
 	}

-	/* save daqp completion bits */
-	daqp_completion = init_attr->qp_type & 0x60;
-	/* save daqp bit */
-	isdaqp = (init_attr->qp_type & 0x80) ? 1 : 0;
-	init_attr->qp_type = init_attr->qp_type & 0x1F;
+	/* save LLQP info */
+	if (qp_type & 0x80) {
+		is_llqp = 1;
+		parms.ext_type = EQPT_LLQP;
+		parms.ll_comp_flags = qp_type & LLQP_COMP_MASK;
+	}
+	qp_type &= 0x1F;
+
+	/* check for SRQ */
+	has_srq = !!(init_attr->srq);
+	if (is_llqp && has_srq) {
+		ehca_err(pd->device, "LLQPs can't have an SRQ");
+		return ERR_PTR(-EINVAL);
+	}

-	if (init_attr->qp_type != IB_QPT_UD &&
-	    init_attr->qp_type != IB_QPT_SMI &&
-	    init_attr->qp_type != IB_QPT_GSI &&
-	    init_attr->qp_type != IB_QPT_UC &&
-	    init_attr->qp_type != IB_QPT_RC) {
-		ehca_err(pd->device, "wrong QP Type=%x", init_attr->qp_type);
+	/* check QP type */
+	if (qp_type != IB_QPT_UD &&
+	    qp_type != IB_QPT_UC &&
+	    qp_type != IB_QPT_RC &&
+	    qp_type != IB_QPT_SMI &&
+	    qp_type != IB_QPT_GSI) {
+		ehca_err(pd->device, "wrong QP Type=%x", qp_type);
 		return ERR_PTR(-EINVAL);
 	}
-	if ((init_attr->qp_type != IB_QPT_RC && init_attr->qp_type != IB_QPT_UD)
-	    && isdaqp) {
-		ehca_err(pd->device, "unsupported LL QP Type=%x",
-			 init_attr->qp_type);
+
+	if (is_llqp && (qp_type != IB_QPT_RC && qp_type != IB_QPT_UD)) {
+		ehca_err(pd->device, "unsupported LL QP Type=%x", qp_type);
 		return ERR_PTR(-EINVAL);
-	} else if (init_attr->qp_type == IB_QPT_RC && isdaqp &&
+	} else if (is_llqp && qp_type == IB_QPT_RC &&
 		   (init_attr->cap.max_send_wr > 255 ||
 		    init_attr->cap.max_recv_wr > 255 )) {
-		ehca_err(pd->device, "Invalid Number of max_sq_wr =%x "
-			 "or max_rq_wr=%x for QP Type=%x",
-			 init_attr->cap.max_send_wr,
-			 init_attr->cap.max_recv_wr,init_attr->qp_type);
-		return ERR_PTR(-EINVAL);
-	} else if (init_attr->qp_type == IB_QPT_UD && isdaqp &&
-		   init_attr->cap.max_send_wr > 255) {
+		ehca_err(pd->device, "Invalid Number of max_sq_wr=%x "
+			 "or max_rq_wr=%x for RC LLQP",
+			 init_attr->cap.max_send_wr,
+			 init_attr->cap.max_recv_wr);
+		return ERR_PTR(-EINVAL);
+	} else if (is_llqp && qp_type == IB_QPT_UD &&
+		   init_attr->cap.max_send_wr > 255) {
 		ehca_err(pd->device,
 			 "Invalid Number of max_send_wr=%x for UD QP_TYPE=%x",
-			 init_attr->cap.max_send_wr, init_attr->qp_type);
+			 init_attr->cap.max_send_wr, qp_type);
 		return ERR_PTR(-EINVAL);
 	}

@@ -456,7 +421,6 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		return ERR_PTR(-ENOMEM);
 	}

-	memset (&parms, 0, sizeof(struct ehca_alloc_qp_parms));
 	spin_lock_init(&my_qp->spinlock_s);
 	spin_lock_init(&my_qp->spinlock_r);

@@ -465,8 +429,6 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	my_qp->send_cq =
 		container_of(init_attr->send_cq, struct ehca_cq, ib_cq);

-	my_qp->init_attr = *init_attr;
-
 	do {
 		if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) {
 			ret = -ENOMEM;
@@ -486,10 +448,10 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		goto create_qp_exit0;
 	}

-	parms.servicetype = ibqptype2servicetype(init_attr->qp_type);
+	parms.servicetype = ibqptype2servicetype(qp_type);
 	if (parms.servicetype < 0) {
 		ret = -EINVAL;
-		ehca_err(pd->device, "Invalid qp_type=%x", init_attr->qp_type);
+		ehca_err(pd->device, "Invalid qp_type=%x", qp_type);
 		goto create_qp_exit0;
 	}

@@ -501,21 +463,23 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	/* UD_AV CIRCUMVENTION */
 	max_send_sge = init_attr->cap.max_send_sge;
 	max_recv_sge = init_attr->cap.max_recv_sge;
-	if (IB_QPT_UD == init_attr->qp_type ||
-	    IB_QPT_GSI == init_attr->qp_type ||
-	    IB_QPT_SMI == init_attr->qp_type) {
+	if (parms.servicetype == ST_UD) {
 		max_send_sge += 2;
 		max_recv_sge += 2;
 	}

-	parms.ipz_eq_handle = shca->eq.ipz_eq_handle;
-	parms.daqp_ctrl = isdaqp | daqp_completion;
+	parms.token = my_qp->token;
+	parms.eq_handle = shca->eq.ipz_eq_handle;
 	parms.pd = my_pd->fw_pd;
-	parms.max_recv_sge = max_recv_sge;
-	parms.max_send_sge = max_send_sge;
+	parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle;
+	parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle;

-	h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, my_qp, &parms);
+	parms.max_send_wr = init_attr->cap.max_send_wr;
+	parms.max_recv_wr = init_attr->cap.max_recv_wr;
+	parms.max_send_sge = max_send_sge;
+	parms.max_recv_sge = max_recv_sge;

+	h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms);
 	if (h_ret != H_SUCCESS) {
 		ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%lx",
 			 h_ret);
@@ -523,16 +487,18 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		goto create_qp_exit1;
 	}

-	my_qp->ib_qp.qp_num = my_qp->real_qp_num;
+	my_qp->ib_qp.qp_num = my_qp->real_qp_num = parms.real_qp_num;
+	my_qp->ipz_qp_handle = parms.qp_handle;
+	my_qp->galpas = parms.galpas;

-	switch (init_attr->qp_type) {
+	switch (qp_type) {
 	case IB_QPT_RC:
-		if (isdaqp == 0) {
+		if (!is_llqp) {
 			swqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
 					     (parms.act_nr_send_sges)]);
 			rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[
 					     (parms.act_nr_recv_sges)]);
-		} else { /* for daqp we need to use msg size, not wqe size */
+		} else { /* for LLQP we need to use msg size, not wqe size */
 			swqe_size = da_rc_msg_size[max_send_sge];
 			rwqe_size = da_rc_msg_size[max_recv_sge];
 			parms.act_nr_send_sges = 1;
@@ -552,7 +518,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		/* UD circumvention */
 		parms.act_nr_recv_sges -= 2;
 		parms.act_nr_send_sges -= 2;
-		if (isdaqp) {
+		if (is_llqp) {
 			swqe_size = da_ud_sq_msg_size[max_send_sge];
 			rwqe_size = da_rc_msg_size[max_recv_sge];
 			parms.act_nr_send_sges = 1;
@@ -564,14 +530,12 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 					     u.ud_av.sg_list[parms.act_nr_recv_sges]);
 		}

-		if (IB_QPT_GSI == init_attr->qp_type ||
-		    IB_QPT_SMI == init_attr->qp_type) {
+		if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) {
 			parms.act_nr_send_wqes = init_attr->cap.max_send_wr;
 			parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr;
 			parms.act_nr_send_sges = init_attr->cap.max_send_sge;
 			parms.act_nr_recv_sges = init_attr->cap.max_recv_sge;
-			my_qp->ib_qp.qp_num =
-				(init_attr->qp_type == IB_QPT_SMI) ? 0 : 1;
+			my_qp->ib_qp.qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1;
 		}

 		break;
@@ -580,26 +544,33 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 		break;
 	}

-	/* initializes r/squeue and registers queue pages */
-	ret = init_qp_queues(shca, my_qp,
-			     parms.nr_sq_pages, parms.nr_rq_pages,
-			     swqe_size, rwqe_size,
-			     parms.act_nr_send_sges, parms.act_nr_recv_sges);
+	/* initialize r/squeue and register queue pages */
+	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_squeue, 0,
+			    has_srq ? H_SUCCESS : H_PAGE_REGISTERED,
+			    parms.nr_sq_pages, swqe_size,
+			    parms.act_nr_send_sges);
 	if (ret) {
 		ehca_err(pd->device,
-			 "Couldn't initialize r/squeue and pages ret=%x", ret);
+			 "Couldn't initialize squeue and pages ret=%x", ret);
 		goto create_qp_exit2;
 	}

+	ret = init_qp_queue(shca, my_qp, &my_qp->ipz_rqueue, 1, H_SUCCESS,
+			    parms.nr_rq_pages, rwqe_size,
+			    parms.act_nr_recv_sges);
+	if (ret) {
+		ehca_err(pd->device,
+			 "Couldn't initialize rqueue and pages ret=%x", ret);
+		goto create_qp_exit3;
+	}
+
 	my_qp->ib_qp.pd = &my_pd->ib_pd;
 	my_qp->ib_qp.device = my_pd->ib_pd.device;

 	my_qp->ib_qp.recv_cq = init_attr->recv_cq;
 	my_qp->ib_qp.send_cq = init_attr->send_cq;

-	my_qp->ib_qp.qp_type = init_attr->qp_type;
-
-	my_qp->qp_type = init_attr->qp_type;
+	my_qp->ib_qp.qp_type = my_qp->qp_type = qp_type;
 	my_qp->ib_qp.srq = init_attr->srq;

 	my_qp->ib_qp.qp_context = init_attr->qp_context;
@@ -610,15 +581,16 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 	init_attr->cap.max_recv_wr = parms.act_nr_recv_wqes;
 	init_attr->cap.max_send_sge = parms.act_nr_send_sges;
 	init_attr->cap.max_send_wr = parms.act_nr_send_wqes;
+	my_qp->init_attr = *init_attr;

 	/* NOTE: define_apq0() not supported yet */
-	if (init_attr->qp_type == IB_QPT_GSI) {
+	if (qp_type == IB_QPT_GSI) {
 		h_ret = ehca_define_sqp(shca, my_qp, init_attr);
 		if (h_ret != H_SUCCESS) {
 			ehca_err(pd->device, "ehca_define_sqp() failed rc=%lx",
 				 h_ret);
 			ret = ehca2ib_return_code(h_ret);
-			goto create_qp_exit3;
+			goto create_qp_exit4;
 		}
 	}
 	if (init_attr->send_cq) {
@@ -628,7 +600,7 @@ struct ib_qp *ehca_create_qp(struct
ib_pd *pd, if (ret) { ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x", ret); - goto create_qp_exit3; + goto create_qp_exit4; } my_qp->send_cq = cq; } @@ -659,14 +631,16 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, if (ib_copy_to_udata(udata, &resp, sizeof resp)) { ehca_err(pd->device, "Copy to udata failed"); ret = -EINVAL; - goto create_qp_exit3; + goto create_qp_exit4; } } return &my_qp->ib_qp; -create_qp_exit3: +create_qp_exit4: ipz_queue_dtor(&my_qp->ipz_rqueue); + +create_qp_exit3: ipz_queue_dtor(&my_qp->ipz_squeue); create_qp_exit2: diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 5766ae3..7efc4a2 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -74,11 +74,6 @@ #define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48) #define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49) -/* direct access qp controls */ -#define DAQP_CTRL_ENABLE 0x01 -#define DAQP_CTRL_SEND_COMP 0x20 -#define DAQP_CTRL_RECV_COMP 0x40 - static u32 get_longbusy_msecs(int longbusy_rc) { switch (longbusy_rc) { @@ -284,36 +279,31 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, } u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, - struct ehca_qp *qp, struct ehca_alloc_qp_parms *parms) { u64 ret; u64 allocate_controls; u64 max_r10_reg; u64 outs[PLPAR_HCALL9_BUFSIZE]; - u16 max_nr_receive_wqes = qp->init_attr.cap.max_recv_wr + 1; - u16 max_nr_send_wqes = qp->init_attr.cap.max_send_wr + 1; - int daqp_ctrl = parms->daqp_ctrl; allocate_controls = - EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS, - (daqp_ctrl & DAQP_CTRL_ENABLE) ? 1 : 0) + EHCA_BMASK_SET(H_ALL_RES_QP_ENHANCED_OPS, parms->ext_type) | EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0) | EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, parms->servicetype) | EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, parms->sigtype) | EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING, - (daqp_ctrl & DAQP_CTRL_RECV_COMP) ? 
1 : 0) + !!(parms->ll_comp_flags & LLQP_RECV_COMP)) | EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING, - (daqp_ctrl & DAQP_CTRL_SEND_COMP) ? 1 : 0) + !!(parms->ll_comp_flags & LLQP_SEND_COMP)) | EHCA_BMASK_SET(H_ALL_RES_QP_UD_AV_LKEY_CTRL, parms->ud_av_l_key_ctl) | EHCA_BMASK_SET(H_ALL_RES_QP_RESOURCE_TYPE, 1); max_r10_reg = EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR, - max_nr_send_wqes) + parms->max_send_wr + 1) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR, - max_nr_receive_wqes) + parms->max_recv_wr + 1) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE, parms->max_send_sge) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, @@ -322,15 +312,16 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs, adapter_handle.handle, /* r4 */ allocate_controls, /* r5 */ - qp->send_cq->ipz_cq_handle.handle, - qp->recv_cq->ipz_cq_handle.handle, - parms->ipz_eq_handle.handle, - ((u64)qp->token << 32) | parms->pd.value, + parms->send_cq_handle.handle, + parms->recv_cq_handle.handle, + parms->eq_handle.handle, + ((u64)parms->token << 32) | parms->pd.value, max_r10_reg, /* r10 */ parms->ud_av_l_key_ctl, /* r11 */ 0); - qp->ipz_qp_handle.handle = outs[0]; - qp->real_qp_num = (u32)outs[1]; + + parms->qp_handle.handle = outs[0]; + parms->real_qp_num = (u32)outs[1]; parms->act_nr_send_wqes = (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]); parms->act_nr_recv_wqes = @@ -345,7 +336,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, (u32)EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, outs[4]); if (ret == H_SUCCESS) - hcp_galpas_ctor(&qp->galpas, outs[6], outs[6]); + hcp_galpas_ctor(&parms->galpas, outs[6], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) ehca_gen_err("Not enough resources. 
ret=%lx", ret);
diff --git a/drivers/infiniband/hw/ehca/hcp_if.h b/drivers/infiniband/hw/ehca/hcp_if.h
index 2869f7d..60ce02b 100644
--- a/drivers/infiniband/hw/ehca/hcp_if.h
+++ b/drivers/infiniband/hw/ehca/hcp_if.h
@@ -78,7 +78,6 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle,
  * initialize resources, create empty QPPTs (2 rings).
  */
 u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle,
-			     struct ehca_qp *qp,
 			     struct ehca_alloc_qp_parms *parms);
 
 u64 hipz_h_query_port(const struct ipz_adapter_handle adapter_handle,
-- 
1.5.2

From fenkes at de.ibm.com Mon Jul 9 06:25:10 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Mon, 9 Jul 2007 15:25:10 +0200
Subject: [ofa-general] [PATCH 04/13] IB/ehca: add Shared Receive Queue support
In-Reply-To: <200707091502.22407.fenkes@de.ibm.com>
References: <200707091502.22407.fenkes@de.ibm.com>
Message-ID: <200707091525.11777.fenkes@de.ibm.com>

Support SRQs on eHCA2. Since an SRQ is a QP for eHCA2, a lot of code
(structures, create, destroy, post_recv) can be shared between QP and SRQ.
Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/ehca_classes.h         |   26 +-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |    4 +-
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |   15 +
 drivers/infiniband/hw/ehca/ehca_main.c            |   16 +-
 drivers/infiniband/hw/ehca/ehca_qp.c              |  451 +++++++++++++++++----
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   47 ++-
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |    4 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |   23 +-
 drivers/infiniband/hw/ehca/hipz_hw.h              |    1 +
 9 files changed, 480 insertions(+), 107 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h
index 6e75db6..9d689ae 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes.h
@@ -5,6 +5,7 @@
  *
  * Authors: Heiko J Schick
  *          Christoph Raisch
+ *          Joachim Fenkes
  *
  * Copyright (c) 2005 IBM Corporation
  *
@@ -117,9 +118,20 @@ struct ehca_pd {
 	u32 ownpid;
 };
 
+enum ehca_ext_qp_type {
+	EQPT_NORMAL = 0,
+	EQPT_LLQP = 1,
+	EQPT_SRQBASE = 2,
+	EQPT_SRQ = 3,
+};
+
 struct ehca_qp {
-	struct ib_qp ib_qp;
+	union {
+		struct ib_qp ib_qp;
+		struct ib_srq ib_srq;
+	};
 	u32 qp_type;
+	enum ehca_ext_qp_type ext_type;
 	struct ipz_queue ipz_squeue;
 	struct ipz_queue ipz_rqueue;
 	struct h_galpas galpas;
@@ -142,6 +154,10 @@ struct ehca_qp {
 	u32 mm_count_galpa;
 };
 
+#define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ)
+#define HAS_SQ(qp) (qp->ext_type != EQPT_SRQ)
+#define HAS_RQ(qp) (qp->ext_type != EQPT_SRQBASE)
+
 /* must be power of 2 */
 #define QP_HASHTAB_LEN 8
 
@@ -307,6 +323,7 @@ struct ehca_create_qp_resp {
 	u32 qp_num;
 	u32 token;
 	u32 qp_type;
+	u32 ext_type;
 	u32 qkey;
 	/* qp_num assigned by ehca: sqp0/1 may have got different numbers */
 	u32 real_qp_num;
@@ -329,13 +346,6 @@ enum ehca_service_type {
 	ST_UD = 3,
 };
 
-enum ehca_ext_qp_type {
-	EQPT_NORMAL = 0,
-	EQPT_LLQP = 1,
-	EQPT_SRQBASE = 2,
-	EQPT_SRQ = 3,
-};
-
 enum ehca_ll_comp_flags {
 	LLQP_SEND_COMP = 0x20,
 	LLQP_RECV_COMP = 0x40,
diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
index 5665f21..fb3df5c 100644
--- a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
+++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h
@@ -228,8 +228,8 @@ struct hcp_modify_qp_control_block {
 #define MQPCB_QP_NUMBER EHCA_BMASK_IBM(8,31)
 #define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48,48)
 #define MQPCB_QP_ENABLE EHCA_BMASK_IBM(31,31)
-#define MQPCB_MASK_CURR_SQR_LIMIT EHCA_BMASK_IBM(49,49)
-#define MQPCB_CURR_SQR_LIMIT EHCA_BMASK_IBM(15,31)
+#define MQPCB_MASK_CURR_SRQ_LIMIT EHCA_BMASK_IBM(49,49)
+#define MQPCB_CURR_SRQ_LIMIT EHCA_BMASK_IBM(16,31)
 #define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50,50)
 #define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51,51)
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 37e7fe0..fd84a80 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -154,6 +154,21 @@ int ehca_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr,
 int ehca_post_recv(struct ib_qp *qp, struct ib_recv_wr *recv_wr,
 		   struct ib_recv_wr **bad_recv_wr);
 
+int ehca_post_srq_recv(struct ib_srq *srq,
+		       struct ib_recv_wr *recv_wr,
+		       struct ib_recv_wr **bad_recv_wr);
+
+struct ib_srq *ehca_create_srq(struct ib_pd *pd,
+			       struct ib_srq_init_attr *init_attr,
+			       struct ib_udata *udata);
+
+int ehca_modify_srq(struct ib_srq *srq, struct ib_srq_attr *attr,
+		    enum ib_srq_attr_mask attr_mask, struct ib_udata *udata);
+
+int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr);
+
+int ehca_destroy_srq(struct ib_srq *srq);
+
 u64 ehca_define_sqp(struct ehca_shca *shca, struct ehca_qp *ibqp,
 		    struct ib_qp_init_attr *qp_init_attr);
 
diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index befbb9c..9bd749c 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -343,7 +343,7
@@ int ehca_init_device(struct ehca_shca *shca) strlcpy(shca->ib_device.name, "ehca%d", IB_DEVICE_NAME_MAX); shca->ib_device.owner = THIS_MODULE; - shca->ib_device.uverbs_abi_ver = 6; + shca->ib_device.uverbs_abi_ver = 7; shca->ib_device.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | @@ -411,6 +411,20 @@ int ehca_init_device(struct ehca_shca *shca) /* shca->ib_device.process_mad = ehca_process_mad; */ shca->ib_device.mmap = ehca_mmap; + if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) { + shca->ib_device.uverbs_cmd_mask |= + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ); + + shca->ib_device.create_srq = ehca_create_srq; + shca->ib_device.modify_srq = ehca_modify_srq; + shca->ib_device.query_srq = ehca_query_srq; + shca->ib_device.destroy_srq = ehca_destroy_srq; + shca->ib_device.post_srq_recv = ehca_post_srq_recv; + } + return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index ec1d555..9486a44 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -3,7 +3,9 @@ * * QP functions * - * Authors: Waleri Fomin + * Authors: Joachim Fenkes + * Stefan Roscher + * Waleri Fomin * Hoang-Nam Nguyen * Reinhard Ernst * Heiko J Schick @@ -261,6 +263,19 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype) } /* + * init userspace queue info from ipz_queue data + */ +static inline void queue2resp(struct ipzu_queue_resp *resp, + struct ipz_queue *queue) +{ + resp->qe_size = queue->qe_size; + resp->act_nr_of_sg = queue->act_nr_of_sg; + resp->queue_length = queue->queue_length; + resp->pagesize = queue->pagesize; + resp->toggle_state = queue->toggle_state; +} + +/* * init_qp_queue initializes/constructs r/squeue and registers queue pages. 
*/ static inline int init_qp_queue(struct ehca_shca *shca, @@ -338,11 +353,17 @@ init_qp_queue1: return ret; } -struct ib_qp *ehca_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *init_attr, - struct ib_udata *udata) +/* + * Create an ib_qp struct that is either a QP or an SRQ, depending on + * the value of the is_srq parameter. If init_attr and srq_init_attr share + * fields, the field out of init_attr is used. + */ +struct ehca_qp *internal_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata, int is_srq) { - static int da_rc_msg_size[]={ 128, 256, 512, 1024, 2048, 4096 }; + static int da_rc_msg_size[] = { 128, 256, 512, 1024, 2048, 4096 }; static int da_ud_sq_msg_size[]={ 128, 384, 896, 1920, 3968 }; struct ehca_qp *my_qp; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); @@ -355,7 +376,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, /* h_call's out parameters */ struct ehca_alloc_qp_parms parms; - u32 swqe_size = 0, rwqe_size = 0; + u32 swqe_size = 0, rwqe_size = 0, ib_qp_num; unsigned long flags; memset(&parms, 0, sizeof(parms)); @@ -376,13 +397,34 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, } qp_type &= 0x1F; - /* check for SRQ */ - has_srq = !!(init_attr->srq); + /* handle SRQ base QPs */ + if (init_attr->srq) { + struct ehca_qp *my_srq = + container_of(init_attr->srq, struct ehca_qp, ib_srq); + + has_srq = 1; + parms.ext_type = EQPT_SRQBASE; + parms.srq_qpn = my_srq->real_qp_num; + parms.srq_token = my_srq->token; + } + if (is_llqp && has_srq) { ehca_err(pd->device, "LLQPs can't have an SRQ"); return ERR_PTR(-EINVAL); } + /* handle SRQs */ + if (is_srq) { + parms.ext_type = EQPT_SRQ; + parms.srq_limit = srq_init_attr->attr.srq_limit; + if (init_attr->cap.max_recv_sge > 3) { + ehca_err(pd->device, "no more than three SGEs " + "supported for SRQ pd=%p max_sge=%x", + pd, init_attr->cap.max_recv_sge); + return ERR_PTR(-EINVAL); + } + } + /* check QP type 
*/ if (qp_type != IB_QPT_UD && qp_type != IB_QPT_UC && @@ -423,11 +465,15 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, spin_lock_init(&my_qp->spinlock_s); spin_lock_init(&my_qp->spinlock_r); + my_qp->qp_type = qp_type; + my_qp->ext_type = parms.ext_type; - my_qp->recv_cq = - container_of(init_attr->recv_cq, struct ehca_cq, ib_cq); - my_qp->send_cq = - container_of(init_attr->send_cq, struct ehca_cq, ib_cq); + if (init_attr->recv_cq) + my_qp->recv_cq = + container_of(init_attr->recv_cq, struct ehca_cq, ib_cq); + if (init_attr->send_cq) + my_qp->send_cq = + container_of(init_attr->send_cq, struct ehca_cq, ib_cq); do { if (!idr_pre_get(&ehca_qp_idr, GFP_KERNEL)) { @@ -471,8 +517,10 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, parms.token = my_qp->token; parms.eq_handle = shca->eq.ipz_eq_handle; parms.pd = my_pd->fw_pd; - parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle; - parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle; + if (my_qp->send_cq) + parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle; + if (my_qp->recv_cq) + parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle; parms.max_send_wr = init_attr->cap.max_send_wr; parms.max_recv_wr = init_attr->cap.max_recv_wr; @@ -487,7 +535,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, goto create_qp_exit1; } - my_qp->ib_qp.qp_num = my_qp->real_qp_num = parms.real_qp_num; + ib_qp_num = my_qp->real_qp_num = parms.real_qp_num; my_qp->ipz_qp_handle = parms.qp_handle; my_qp->galpas = parms.galpas; @@ -535,7 +583,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr; parms.act_nr_send_sges = init_attr->cap.max_send_sge; parms.act_nr_recv_sges = init_attr->cap.max_recv_sge; - my_qp->ib_qp.qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1; + ib_qp_num = (qp_type == IB_QPT_SMI) ? 
0 : 1; } break; @@ -545,36 +593,51 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, } /* initialize r/squeue and register queue pages */ - ret = init_qp_queue(shca, my_qp, &my_qp->ipz_squeue, 0, - has_srq ? H_SUCCESS : H_PAGE_REGISTERED, - parms.nr_sq_pages, swqe_size, - parms.act_nr_send_sges); - if (ret) { - ehca_err(pd->device, - "Couldn't initialize squeue and pages ret=%x", ret); - goto create_qp_exit2; + if (HAS_SQ(my_qp)) { + ret = init_qp_queue( + shca, my_qp, &my_qp->ipz_squeue, 0, + HAS_RQ(my_qp) ? H_PAGE_REGISTERED : H_SUCCESS, + parms.nr_sq_pages, swqe_size, + parms.act_nr_send_sges); + if (ret) { + ehca_err(pd->device, "Couldn't initialize squeue " + "and pages ret=%x", ret); + goto create_qp_exit2; + } } - ret = init_qp_queue(shca, my_qp, &my_qp->ipz_rqueue, 1, H_SUCCESS, - parms.nr_rq_pages, rwqe_size, - parms.act_nr_recv_sges); - if (ret) { - ehca_err(pd->device, - "Couldn't initialize rqueue and pages ret=%x", ret); - goto create_qp_exit3; + if (HAS_RQ(my_qp)) { + ret = init_qp_queue( + shca, my_qp, &my_qp->ipz_rqueue, 1, + H_SUCCESS, parms.nr_rq_pages, rwqe_size, + parms.act_nr_recv_sges); + if (ret) { + ehca_err(pd->device, "Couldn't initialize rqueue " + "and pages ret=%x", ret); + goto create_qp_exit3; + } } - my_qp->ib_qp.pd = &my_pd->ib_pd; - my_qp->ib_qp.device = my_pd->ib_pd.device; + if (is_srq) { + my_qp->ib_srq.pd = &my_pd->ib_pd; + my_qp->ib_srq.device = my_pd->ib_pd.device; - my_qp->ib_qp.recv_cq = init_attr->recv_cq; - my_qp->ib_qp.send_cq = init_attr->send_cq; + my_qp->ib_srq.srq_context = init_attr->qp_context; + my_qp->ib_srq.event_handler = init_attr->event_handler; + } else { + my_qp->ib_qp.qp_num = ib_qp_num; + my_qp->ib_qp.pd = &my_pd->ib_pd; + my_qp->ib_qp.device = my_pd->ib_pd.device; + + my_qp->ib_qp.recv_cq = init_attr->recv_cq; + my_qp->ib_qp.send_cq = init_attr->send_cq; - my_qp->ib_qp.qp_type = my_qp->qp_type = qp_type; - my_qp->ib_qp.srq = init_attr->srq; + my_qp->ib_qp.qp_type = qp_type; + my_qp->ib_qp.srq = 
init_attr->srq; - my_qp->ib_qp.qp_context = init_attr->qp_context; - my_qp->ib_qp.event_handler = init_attr->event_handler; + my_qp->ib_qp.qp_context = init_attr->qp_context; + my_qp->ib_qp.event_handler = init_attr->event_handler; + } init_attr->cap.max_inline_data = 0; /* not supported yet */ init_attr->cap.max_recv_sge = parms.act_nr_recv_sges; @@ -593,41 +656,32 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, goto create_qp_exit4; } } - if (init_attr->send_cq) { - struct ehca_cq *cq = container_of(init_attr->send_cq, - struct ehca_cq, ib_cq); - ret = ehca_cq_assign_qp(cq, my_qp); + + if (my_qp->send_cq) { + ret = ehca_cq_assign_qp(my_qp->send_cq, my_qp); if (ret) { ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x", ret); goto create_qp_exit4; } - my_qp->send_cq = cq; } + /* copy queues, galpa data to user space */ if (context && udata) { - struct ipz_queue *ipz_rqueue = &my_qp->ipz_rqueue; - struct ipz_queue *ipz_squeue = &my_qp->ipz_squeue; struct ehca_create_qp_resp resp; memset(&resp, 0, sizeof(resp)); resp.qp_num = my_qp->real_qp_num; resp.token = my_qp->token; resp.qp_type = my_qp->qp_type; + resp.ext_type = my_qp->ext_type; resp.qkey = my_qp->qkey; resp.real_qp_num = my_qp->real_qp_num; - /* rqueue properties */ - resp.ipz_rqueue.qe_size = ipz_rqueue->qe_size; - resp.ipz_rqueue.act_nr_of_sg = ipz_rqueue->act_nr_of_sg; - resp.ipz_rqueue.queue_length = ipz_rqueue->queue_length; - resp.ipz_rqueue.pagesize = ipz_rqueue->pagesize; - resp.ipz_rqueue.toggle_state = ipz_rqueue->toggle_state; - /* squeue properties */ - resp.ipz_squeue.qe_size = ipz_squeue->qe_size; - resp.ipz_squeue.act_nr_of_sg = ipz_squeue->act_nr_of_sg; - resp.ipz_squeue.queue_length = ipz_squeue->queue_length; - resp.ipz_squeue.pagesize = ipz_squeue->pagesize; - resp.ipz_squeue.toggle_state = ipz_squeue->toggle_state; + if (HAS_SQ(my_qp)) + queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue); + if (HAS_RQ(my_qp)) + queue2resp(&resp.ipz_rqueue, &my_qp->ipz_rqueue); + if 
(ib_copy_to_udata(udata, &resp, sizeof resp)) { ehca_err(pd->device, "Copy to udata failed"); ret = -EINVAL; @@ -635,13 +689,15 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, } } - return &my_qp->ib_qp; + return my_qp; create_qp_exit4: - ipz_queue_dtor(&my_qp->ipz_rqueue); + if (HAS_RQ(my_qp)) + ipz_queue_dtor(&my_qp->ipz_rqueue); create_qp_exit3: - ipz_queue_dtor(&my_qp->ipz_squeue); + if (HAS_SQ(my_qp)) + ipz_queue_dtor(&my_qp->ipz_squeue); create_qp_exit2: hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); @@ -656,6 +712,114 @@ create_qp_exit0: return ERR_PTR(ret); } +struct ib_qp *ehca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr, + struct ib_udata *udata) +{ + struct ehca_qp *ret; + + ret = internal_create_qp(pd, qp_init_attr, NULL, udata, 0); + return IS_ERR(ret) ? (struct ib_qp*)ret : &ret->ib_qp; +} + +int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, + struct ib_uobject *uobject); + +struct ib_srq *ehca_create_srq(struct ib_pd *pd, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata) +{ + struct ib_qp_init_attr qp_init_attr; + struct ehca_qp *my_qp; + struct ib_srq *ret; + struct ehca_shca *shca = container_of(pd->device, struct ehca_shca, + ib_device); + struct hcp_modify_qp_control_block *mqpcb; + u64 hret, update_mask; + + /* For common attributes, internal_create_qp() takes its info + * out of qp_init_attr, so copy all common attrs there. 
+ */ + memset(&qp_init_attr, 0, sizeof(qp_init_attr)); + qp_init_attr.event_handler = srq_init_attr->event_handler; + qp_init_attr.qp_context = srq_init_attr->srq_context; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.qp_type = IB_QPT_RC; + qp_init_attr.cap.max_recv_wr = srq_init_attr->attr.max_wr; + qp_init_attr.cap.max_recv_sge = srq_init_attr->attr.max_sge; + + my_qp = internal_create_qp(pd, &qp_init_attr, srq_init_attr, udata, 1); + if (IS_ERR(my_qp)) + return (struct ib_srq*)my_qp; + + /* copy back return values */ + srq_init_attr->attr.max_wr = qp_init_attr.cap.max_recv_wr; + srq_init_attr->attr.max_sge = qp_init_attr.cap.max_recv_sge; + + /* drive SRQ into RTR state */ + mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL); + if (!mqpcb) { + ehca_err(pd->device, "Could not get zeroed page for mqpcb " + "ehca_qp=%p qp_num=%x ", my_qp, my_qp->real_qp_num); + ret = ERR_PTR(-ENOMEM); + goto create_srq1; + } + + mqpcb->qp_state = EHCA_QPS_INIT; + mqpcb->prim_phys_port = 1; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); + hret = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->galpas.kernel); + if (hret != H_SUCCESS) { + ehca_err(pd->device, "Could not modify SRQ to INIT" + "ehca_qp=%p qp_num=%x hret=%lx", + my_qp, my_qp->real_qp_num, hret); + goto create_srq2; + } + + mqpcb->qp_enable = 1; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_ENABLE, 1); + hret = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->galpas.kernel); + if (hret != H_SUCCESS) { + ehca_err(pd->device, "Could not enable SRQ" + "ehca_qp=%p qp_num=%x hret=%lx", + my_qp, my_qp->real_qp_num, hret); + goto create_srq2; + } + + mqpcb->qp_state = EHCA_QPS_RTR; + update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); + hret = hipz_h_modify_qp(shca->ipz_hca_handle, + my_qp->ipz_qp_handle, + &my_qp->pf, + update_mask, + mqpcb, my_qp->galpas.kernel); + if (hret != H_SUCCESS) { + 
ehca_err(pd->device, "Could not modify SRQ to RTR" + "ehca_qp=%p qp_num=%x hret=%lx", + my_qp, my_qp->real_qp_num, hret); + goto create_srq2; + } + + return &my_qp->ib_srq; + +create_srq2: + ret = ERR_PTR(ehca2ib_return_code(hret)); + ehca_free_fw_ctrlblock(mqpcb); + +create_srq1: + internal_destroy_qp(pd->device, my_qp, my_qp->ib_srq.uobject); + + return ret; +} + /* * prepare_sqe_rts called by internal_modify_qp() at trans sqe -> rts * set purge bit of bad wqe and subsequent wqes to avoid reentering sqe @@ -1341,42 +1505,159 @@ query_qp_exit1: return ret; } -int ehca_destroy_qp(struct ib_qp *ibqp) +int ehca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, + enum ib_srq_attr_mask attr_mask, struct ib_udata *udata) { - struct ehca_qp *my_qp = container_of(ibqp, struct ehca_qp, ib_qp); - struct ehca_shca *shca = container_of(ibqp->device, struct ehca_shca, + struct ehca_qp *my_qp = + container_of(ibsrq, struct ehca_qp, ib_srq); + struct ehca_pd *my_pd = + container_of(ibsrq->pd, struct ehca_pd, ib_pd); + struct ehca_shca *shca = + container_of(ibsrq->pd->device, struct ehca_shca, ib_device); + struct hcp_modify_qp_control_block *mqpcb; + u64 update_mask; + u64 h_ret; + int ret = 0; + + u32 cur_pid = current->tgid; + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + ehca_err(ibsrq->pd->device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL); + if (!mqpcb) { + ehca_err(ibsrq->device, "Could not get zeroed page for mqpcb " + "ehca_qp=%p qp_num=%x ", my_qp, my_qp->real_qp_num); + return -ENOMEM; + } + + update_mask = 0; + if (attr_mask & IB_SRQ_LIMIT) { + attr_mask &= ~IB_SRQ_LIMIT; + update_mask |= + EHCA_BMASK_SET(MQPCB_MASK_CURR_SRQ_LIMIT, 1) + | EHCA_BMASK_SET(MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG, 1); + mqpcb->curr_srq_limit = + EHCA_BMASK_SET(MQPCB_CURR_SRQ_LIMIT, attr->srq_limit); + mqpcb->qp_aff_asyn_ev_log_reg = + 
EHCA_BMASK_SET(QPX_AAELOG_RESET_SRQ_LIMIT, 1); + } + + /* by now, all bits in attr_mask should have been cleared */ + if (attr_mask) { + ehca_err(ibsrq->device, "invalid attribute mask bits set " + "attr_mask=%x", attr_mask); + ret = -EINVAL; + goto modify_srq_exit0; + } + + if (ehca_debug_level) + ehca_dmp(mqpcb, 4*70, "qp_num=%x", my_qp->real_qp_num); + + h_ret = hipz_h_modify_qp(shca->ipz_hca_handle, my_qp->ipz_qp_handle, + NULL, update_mask, mqpcb, + my_qp->galpas.kernel); + + if (h_ret != H_SUCCESS) { + ret = ehca2ib_return_code(h_ret); + ehca_err(ibsrq->device, "hipz_h_modify_qp() failed rc=%lx " + "ehca_qp=%p qp_num=%x", + h_ret, my_qp, my_qp->real_qp_num); + } + +modify_srq_exit0: + ehca_free_fw_ctrlblock(mqpcb); + + return ret; +} + +int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr) +{ + struct ehca_qp *my_qp = container_of(srq, struct ehca_qp, ib_srq); + struct ehca_pd *my_pd = container_of(srq->pd, struct ehca_pd, ib_pd); + struct ehca_shca *shca = container_of(srq->device, struct ehca_shca, ib_device); + struct ipz_adapter_handle adapter_handle = shca->ipz_hca_handle; + struct hcp_modify_qp_control_block *qpcb; + u32 cur_pid = current->tgid; + int ret = 0; + u64 h_ret; + + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + ehca_err(srq->device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; + } + + qpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL); + if (!qpcb) { + ehca_err(srq->device,"Out of memory for qpcb " + "ehca_qp=%p qp_num=%x", my_qp, my_qp->real_qp_num); + return -ENOMEM; + } + + h_ret = hipz_h_query_qp(adapter_handle, my_qp->ipz_qp_handle, + NULL, qpcb, my_qp->galpas.kernel); + + if (h_ret != H_SUCCESS) { + ret = ehca2ib_return_code(h_ret); + ehca_err(srq->device,"hipz_h_query_qp() failed " + "ehca_qp=%p qp_num=%x h_ret=%lx", + my_qp, my_qp->real_qp_num, h_ret); + goto query_srq_exit1; + } + + srq_attr->max_wr = qpcb->max_nr_outst_recv_wr - 1; + 
srq_attr->srq_limit = EHCA_BMASK_GET( + MQPCB_CURR_SRQ_LIMIT, qpcb->curr_srq_limit); + + if (ehca_debug_level) + ehca_dmp(qpcb, 4*70, "qp_num=%x", my_qp->real_qp_num); + +query_srq_exit1: + ehca_free_fw_ctrlblock(qpcb); + + return ret; +} + +int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, + struct ib_uobject *uobject) +{ + struct ehca_shca *shca = container_of(dev, struct ehca_shca, ib_device); struct ehca_pd *my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, ib_pd); u32 cur_pid = current->tgid; - u32 qp_num = ibqp->qp_num; + u32 qp_num = my_qp->real_qp_num; int ret; u64 h_ret; u8 port_num; enum ib_qp_type qp_type; unsigned long flags; - if (ibqp->uobject) { + if (uobject) { if (my_qp->mm_count_galpa || my_qp->mm_count_rqueue || my_qp->mm_count_squeue) { - ehca_err(ibqp->device, "Resources still referenced in " - "user space qp_num=%x", ibqp->qp_num); + ehca_err(dev, "Resources still referenced in " + "user space qp_num=%x", qp_num); return -EINVAL; } if (my_pd->ownpid != cur_pid) { - ehca_err(ibqp->device, "Invalid caller pid=%x ownpid=%x", + ehca_err(dev, "Invalid caller pid=%x ownpid=%x", cur_pid, my_pd->ownpid); return -EINVAL; } } if (my_qp->send_cq) { - ret = ehca_cq_unassign_qp(my_qp->send_cq, - my_qp->real_qp_num); + ret = ehca_cq_unassign_qp(my_qp->send_cq, qp_num); if (ret) { - ehca_err(ibqp->device, "Couldn't unassign qp from " + ehca_err(dev, "Couldn't unassign qp from " "send_cq ret=%x qp_num=%x cq_num=%x", ret, - my_qp->ib_qp.qp_num, my_qp->send_cq->cq_number); + qp_num, my_qp->send_cq->cq_number); return ret; } } @@ -1387,7 +1668,7 @@ int ehca_destroy_qp(struct ib_qp *ibqp) h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { - ehca_err(ibqp->device, "hipz_h_destroy_qp() failed rc=%lx " + ehca_err(dev, "hipz_h_destroy_qp() failed rc=%lx " "ehca_qp=%p qp_num=%x", h_ret, my_qp, qp_num); return ehca2ib_return_code(h_ret); } @@ -1398,7 +1679,7 @@ int ehca_destroy_qp(struct ib_qp *ibqp) /* no support 
for IB_QPT_SMI yet */ if (qp_type == IB_QPT_GSI) { struct ib_event event; - ehca_info(ibqp->device, "device %s: port %x is inactive.", + ehca_info(dev, "device %s: port %x is inactive.", shca->ib_device.name, port_num); event.device = &shca->ib_device; event.event = IB_EVENT_PORT_ERR; @@ -1407,12 +1688,28 @@ int ehca_destroy_qp(struct ib_qp *ibqp) ib_dispatch_event(&event); } - ipz_queue_dtor(&my_qp->ipz_rqueue); - ipz_queue_dtor(&my_qp->ipz_squeue); + if (HAS_RQ(my_qp)) + ipz_queue_dtor(&my_qp->ipz_rqueue); + if (HAS_SQ(my_qp)) + ipz_queue_dtor(&my_qp->ipz_squeue); kmem_cache_free(qp_cache, my_qp); return 0; } +int ehca_destroy_qp(struct ib_qp *qp) +{ + return internal_destroy_qp(qp->device, + container_of(qp, struct ehca_qp, ib_qp), + qp->uobject); +} + +int ehca_destroy_srq(struct ib_srq *srq) +{ + return internal_destroy_qp(srq->device, + container_of(srq, struct ehca_qp, ib_srq), + srq->uobject); +} + int ehca_init_qp_cache(void) { qp_cache = kmem_cache_create("ehca_cache_qp", diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 56c4527..b5664fa 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -3,8 +3,9 @@ * * post_send/recv, poll_cq, req_notify * - * Authors: Waleri Fomin - * Hoang-Nam Nguyen + * Authors: Hoang-Nam Nguyen + * Waleri Fomin + * Joachim Fenkes * Reinhard Ernst * * Copyright (c) 2005 IBM Corporation @@ -413,17 +414,23 @@ post_send_exit0: return ret; } -int ehca_post_recv(struct ib_qp *qp, - struct ib_recv_wr *recv_wr, - struct ib_recv_wr **bad_recv_wr) +static int internal_post_recv(struct ehca_qp *my_qp, + struct ib_device *dev, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) { - struct ehca_qp *my_qp = container_of(qp, struct ehca_qp, ib_qp); struct ib_recv_wr *cur_recv_wr; struct ehca_wqe *wqe_p; int wqe_cnt = 0; int ret = 0; unsigned long spl_flags; + if (unlikely(!HAS_RQ(my_qp))) { + ehca_err(dev, "QP has no RQ ehca_qp=%p 
qp_num=%x ext_type=%d", + my_qp, my_qp->real_qp_num, my_qp->ext_type); + return -ENODEV; + } + /* LOCK the QUEUE */ spin_lock_irqsave(&my_qp->spinlock_r, spl_flags); @@ -439,8 +446,8 @@ int ehca_post_recv(struct ib_qp *qp, *bad_recv_wr = cur_recv_wr; if (wqe_cnt == 0) { ret = -ENOMEM; - ehca_err(qp->device, "Too many posted WQEs " - "qp_num=%x", qp->qp_num); + ehca_err(dev, "Too many posted WQEs " + "qp_num=%x", my_qp->real_qp_num); } goto post_recv_exit0; } @@ -455,14 +462,14 @@ int ehca_post_recv(struct ib_qp *qp, *bad_recv_wr = cur_recv_wr; if (wqe_cnt == 0) { ret = -EINVAL; - ehca_err(qp->device, "Could not write WQE " - "qp_num=%x", qp->qp_num); + ehca_err(dev, "Could not write WQE " + "qp_num=%x", my_qp->real_qp_num); } goto post_recv_exit0; } wqe_cnt++; - ehca_gen_dbg("ehca_qp=%p qp_num=%x wqe_cnt=%d", - my_qp, qp->qp_num, wqe_cnt); + ehca_dbg(dev, "ehca_qp=%p qp_num=%x wqe_cnt=%d", + my_qp, my_qp->real_qp_num, wqe_cnt); } /* eof for cur_recv_wr */ post_recv_exit0: @@ -472,6 +479,22 @@ post_recv_exit0: return ret; } +int ehca_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return internal_post_recv(container_of(qp, struct ehca_qp, ib_qp), + qp->device, recv_wr, bad_recv_wr); +} + +int ehca_post_srq_recv(struct ib_srq *srq, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return internal_post_recv(container_of(srq, struct ehca_qp, ib_srq), + srq->device, recv_wr, bad_recv_wr); +} + /* * ib_wc_opcode table converts ehca wc opcode to ib * Since we use zero to indicate invalid opcode, the actual ib opcode must diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 73db920..d8fe37d 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -257,6 +257,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) struct ehca_cq *cq; struct ehca_qp *qp; struct ehca_pd *pd; + struct 
ib_uobject *uobject; switch (q_type) { case 1: /* CQ */ @@ -304,7 +305,8 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) return -ENOMEM; } - if (!qp->ib_qp.uobject || qp->ib_qp.uobject->context != context) + uobject = IS_SRQ(qp) ? qp->ib_srq.uobject : qp->ib_qp.uobject; + if (!uobject || uobject->context != context) return -EINVAL; ret = ehca_mmap_qp(vma, qp, rsrc_type); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 7efc4a2..b078377 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -5,6 +5,7 @@ * * Authors: Christoph Raisch * Hoang-Nam Nguyen + * Joachim Fenkes * Gerd Bayer * Waleri Fomin * @@ -62,6 +63,12 @@ #define H_ALL_RES_QP_MAX_SEND_SGE EHCA_BMASK_IBM(32, 39) #define H_ALL_RES_QP_MAX_RECV_SGE EHCA_BMASK_IBM(40, 47) +#define H_ALL_RES_QP_UD_AV_LKEY EHCA_BMASK_IBM(32, 63) +#define H_ALL_RES_QP_SRQ_QP_TOKEN EHCA_BMASK_IBM(0, 31) +#define H_ALL_RES_QP_SRQ_QP_HANDLE EHCA_BMASK_IBM(0, 64) +#define H_ALL_RES_QP_SRQ_LIMIT EHCA_BMASK_IBM(48, 63) +#define H_ALL_RES_QP_SRQ_QPN EHCA_BMASK_IBM(40, 63) + #define H_ALL_RES_QP_ACT_OUTST_SEND_WR EHCA_BMASK_IBM(16, 31) #define H_ALL_RES_QP_ACT_OUTST_RECV_WR EHCA_BMASK_IBM(48, 63) #define H_ALL_RES_QP_ACT_SEND_SGE EHCA_BMASK_IBM(8, 15) @@ -150,7 +157,7 @@ static long ehca_plpar_hcall9(unsigned long opcode, { long ret; int i, sleep_msecs, lock_is_set = 0; - unsigned long flags; + unsigned long flags = 0; ehca_gen_dbg("opcode=%lx arg1=%lx arg2=%lx arg3=%lx arg4=%lx " "arg5=%lx arg6=%lx arg7=%lx arg8=%lx arg9=%lx", @@ -282,8 +289,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, struct ehca_alloc_qp_parms *parms) { u64 ret; - u64 allocate_controls; - u64 max_r10_reg; + u64 allocate_controls, max_r10_reg, r11, r12; u64 outs[PLPAR_HCALL9_BUFSIZE]; allocate_controls = @@ -309,6 +315,13 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, | 
EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, parms->max_recv_sge); + r11 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QP_TOKEN, parms->srq_token); + + if (parms->ext_type == EQPT_SRQ) + r12 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_LIMIT, parms->srq_limit); + else + r12 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QPN, parms->srq_qpn); + ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs, adapter_handle.handle, /* r4 */ allocate_controls, /* r5 */ @@ -316,9 +329,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, parms->recv_cq_handle.handle, parms->eq_handle.handle, ((u64)parms->token << 32) | parms->pd.value, - max_r10_reg, /* r10 */ - parms->ud_av_l_key_ctl, /* r11 */ - 0); + max_r10_reg, r11, r12); parms->qp_handle.handle = outs[0]; parms->real_qp_num = (u32)outs[1]; diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h index 9fe8367..d46b18c 100644 --- a/drivers/infiniband/hw/ehca/hipz_hw.h +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -163,6 +163,7 @@ struct hipz_qptemm { #define QPX_SQADDER EHCA_BMASK_IBM(48,63) #define QPX_RQADDER EHCA_BMASK_IBM(48,63) +#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3,3) #define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x) -- 1.5.2 From amitk at mellanox.co.il Mon Jul 9 06:27:35 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Mon, 9 Jul 2007 16:27:35 +0300 Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports References: <1183640246.4377.436639.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> Hi Hal, In such a case OpenSM should first check that the OPVL fields of the ports (the one that sends the traps and its peer) are identical. If there is a mismatch in the OPVL field, the link watchdog mechanism will retrain the logical link at a high rate. Amit -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, July 05, 2007 3:58 PM To: general at lists.openfabrics.org Cc: Eitan Zahavi; Yevgeny 
Kliteynik Subject: [PATCH] OpenSM handling of "Babbling" Ports A "babbling" port is a port which causes traps to be generated frequently. The traps may be generated directly by "this" port, or by the SMA on switch port 0 when the peer port detects the issue. So far this has only been observed for trap 131, but it also applies to traps 129 and 130, which are similar urgent traps. Note that there appears to be a bug in Mellanox firmware, for at least Anafa-2 and Tavor, which causes the max trap rate not to be adhered to, and relief for this does not appear to be in sight in the short term. Policy When a babbling port is detected, OpenSM will disable the port or its peer switch port (depending on which trap), which should terminate the trap storm. Detection 250 consecutive traps of this type will be used as the (initial) threshold. The reason for this threshold is to avoid prematurely detecting a babbling port and disabling it. Recovery The admin would re-enable the port when it is OK again. (This usually involves rebooting the node causing the trap to be indicated.) Signed-off-by: Hal Rosenstock diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index bedd63f..1150703 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt boolean_t honor_guid2lid_file; boolean_t daemon; boolean_t sm_inactive; + boolean_t babbling_port_policy; osm_qos_options_t qos_options; osm_qos_options_t qos_ca_options; osm_qos_options_t qos_sw0_options; @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt * * sm_inactive * OpenSM will start with SM in not active state. +* +* babbling_port_policy +* OpenSM will enforce its "babbling" port policy. 
* * perfmgr * Enable or disable the performance manager diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 726b665..87b71e5 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -472,6 +472,7 @@ osm_subn_set_default_opt( p_opt->honor_guid2lid_file = FALSE; p_opt->daemon = FALSE; p_opt->sm_inactive = FALSE; + p_opt->babbling_port_policy = FALSE; #ifdef ENABLE_OSM_PERF_MGR p_opt->perfmgr = FALSE; p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ -1358,6 +1359,10 @@ osm_subn_parse_conf_file( "sm_inactive", p_key, p_val, &p_opts->sm_inactive); + __osm_subn_opts_unpack_boolean( + "babbling_port_policy", + p_key, p_val, &p_opts->babbling_port_policy); + #ifdef ENABLE_OSM_PERF_MGR __osm_subn_opts_unpack_boolean( "perfmgr", @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( "# Daemon mode\n" "daemon %s\n\n" "# SM Inactive\n" - "sm_inactive %s\n\n", + "sm_inactive %s\n\n" + "# Babbling Port Policy\n" + "babbling_port_policy %s\n\n", p_opts->daemon ? "TRUE" : "FALSE", - p_opts->sm_inactive ? "TRUE" : "FALSE" + p_opts->sm_inactive ? "TRUE" : "FALSE", + p_opts->babbling_port_policy ? "TRUE" : "FALSE" ); #ifdef ENABLE_OSM_PERF_MGR diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( } else { + /* When babbling port policy option is enabled and + Threshold for disabling a "babbling" port is exceeded */ + if ( p_rcv->p_subn->opt.babbling_port_policy && + num_received >= 250 ) + { + uint8_t payload[IB_SMP_DATA_SIZE]; + ib_port_info_t* p_pi = (ib_port_info_t*)payload; + const ib_port_info_t* p_old_pi; + osm_madw_context_t context; + + /* If trap 131, might want to disable peer port if available */ + /* but peer port has been observed not to respond to SM + requests */ + + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3810: " + " Disabling physical port lid:0x%02X num:%u\n", + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), + p_ntci->data_details.ntc_129_131.port_num + ); + + p_old_pi = &p_physp->port_info; + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); + + /* Set port to disabled/down */ + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); + ib_port_info_set_port_phys_state( + IB_PORT_PHYS_STATE_DISABLED, p_pi ); + + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); + context.pi_context.set_method = TRUE; + context.pi_context.update_master_sm_base_lid = FALSE; + context.pi_context.light_sweep = FALSE; + context.pi_context.active_transition = FALSE; + + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, + osm_physp_get_dr_path_ptr( p_physp ), + payload, + sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32(osm_physp_get_port_num( p_physp )), + CL_DISP_MSGID_NONE, + &context ); + + if( status == IB_SUCCESS ) + { + goto Exit; + } + else + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3811: " + "Request to set PortInfo failed\n" ); + } + } + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_trap_rcv_process_request: " "Marking unhealthy physical port by lid:0x%02X num:%u\n", From fenkes at de.ibm.com Mon Jul 9 
06:26:31 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:26:31 +0200 Subject: [ofa-general] [PATCH 05/13] IB/ehca: Support UD low latency QPs In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091526.31709.fenkes@de.ibm.com> From: Stefan Roscher Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 84 +++++++++++++++++++++++----------- 1 files changed, 57 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 9486a44..ffd1ce9 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -275,6 +275,11 @@ static inline void queue2resp(struct ipzu_queue_resp *resp, resp->toggle_state = queue->toggle_state; } +static inline int ll_qp_msg_size(int nr_sge) +{ + return 128 << nr_sge; +} + /* * init_qp_queue initializes/constructs r/squeue and registers queue pages. */ @@ -363,8 +368,6 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, struct ib_srq_init_attr *srq_init_attr, struct ib_udata *udata, int is_srq) { - static int da_rc_msg_size[] = { 128, 256, 512, 1024, 2048, 4096 }; - static int da_ud_sq_msg_size[]={ 128, 384, 896, 1920, 3968 }; struct ehca_qp *my_qp; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); struct ehca_shca *shca = container_of(pd->device, struct ehca_shca, @@ -396,6 +399,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, parms.ll_comp_flags = qp_type & LLQP_COMP_MASK; } qp_type &= 0x1F; + init_attr->qp_type &= 0x1F; /* handle SRQ base QPs */ if (init_attr->srq) { @@ -435,23 +439,49 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, return ERR_PTR(-EINVAL); } - if (is_llqp && (qp_type != IB_QPT_RC && qp_type != IB_QPT_UD)) { - ehca_err(pd->device, "unsupported LL QP Type=%x", qp_type); - return ERR_PTR(-EINVAL); - } else if (is_llqp && qp_type == IB_QPT_RC && - (init_attr->cap.max_send_wr > 255 || - 
init_attr->cap.max_recv_wr > 255 )) { - ehca_err(pd->device, "Invalid Number of max_sq_wr=%x " - "or max_rq_wr=%x for RC LLQP", - init_attr->cap.max_send_wr, - init_attr->cap.max_recv_wr); - return ERR_PTR(-EINVAL); - } else if (is_llqp && qp_type == IB_QPT_UD && - init_attr->cap.max_send_wr > 255) { - ehca_err(pd->device, - "Invalid Number of max_send_wr=%x for UD QP_TYPE=%x", - init_attr->cap.max_send_wr, qp_type); - return ERR_PTR(-EINVAL); + if (is_llqp) { + switch (qp_type) { + case IB_QPT_RC: + if ((init_attr->cap.max_send_wr > 255) || + (init_attr->cap.max_recv_wr > 255)) { + ehca_err(pd->device, + "Invalid Number of max_sq_wr=%x " + "or max_rq_wr=%x for RC LLQP", + init_attr->cap.max_send_wr, + init_attr->cap.max_recv_wr); + return ERR_PTR(-EINVAL); + } + break; + case IB_QPT_UD: + if (!EHCA_BMASK_GET(HCA_CAP_UD_LL_QP, shca->hca_cap)) { + ehca_err(pd->device, "UD LLQP not supported " + "by this adapter"); + return ERR_PTR(-ENOSYS); + } + if (!(init_attr->cap.max_send_sge <= 5 + && init_attr->cap.max_send_sge >= 1 + && init_attr->cap.max_recv_sge <= 5 + && init_attr->cap.max_recv_sge >= 1)) { + ehca_err(pd->device, + "Invalid Number of max_send_sge=%x " + "or max_recv_sge=%x for UD LLQP", + init_attr->cap.max_send_sge, + init_attr->cap.max_recv_sge); + return ERR_PTR(-EINVAL); + } else if (init_attr->cap.max_send_wr > 255) { + ehca_err(pd->device, + "Invalid Number of " + "max_send_wr=%x for UD QP_TYPE=%x", + init_attr->cap.max_send_wr, qp_type); + return ERR_PTR(-EINVAL); + } + break; + default: + ehca_err(pd->device, "unsupported LL QP Type=%x", + qp_type); + return ERR_PTR(-EINVAL); + break; + } } if (pd->uobject && udata) @@ -509,7 +539,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, /* UD_AV CIRCUMVENTION */ max_send_sge = init_attr->cap.max_send_sge; max_recv_sge = init_attr->cap.max_recv_sge; - if (parms.servicetype == ST_UD) { + if (parms.servicetype == ST_UD && !is_llqp) { max_send_sge += 2; max_recv_sge += 2; } @@ -547,8 +577,8 @@ struct 
ehca_qp *internal_create_qp(struct ib_pd *pd, rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[ (parms.act_nr_recv_sges)]); } else { /* for LLQP we need to use msg size, not wqe size */ - swqe_size = da_rc_msg_size[max_send_sge]; - rwqe_size = da_rc_msg_size[max_recv_sge]; + swqe_size = ll_qp_msg_size(max_send_sge); + rwqe_size = ll_qp_msg_size(max_recv_sge); parms.act_nr_send_sges = 1; parms.act_nr_recv_sges = 1; } @@ -563,15 +593,15 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, case IB_QPT_UD: case IB_QPT_GSI: case IB_QPT_SMI: - /* UD circumvention */ - parms.act_nr_recv_sges -= 2; - parms.act_nr_send_sges -= 2; if (is_llqp) { - swqe_size = da_ud_sq_msg_size[max_send_sge]; - rwqe_size = da_rc_msg_size[max_recv_sge]; + swqe_size = ll_qp_msg_size(parms.act_nr_send_sges); + rwqe_size = ll_qp_msg_size(parms.act_nr_recv_sges); parms.act_nr_send_sges = 1; parms.act_nr_recv_sges = 1; } else { + /* UD circumvention */ + parms.act_nr_send_sges -= 2; + parms.act_nr_recv_sges -= 2; swqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[parms.act_nr_send_sges]); rwqe_size = offsetof(struct ehca_wqe, -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:27:13 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:27:13 +0200 Subject: [ofa-general] [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2 In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091527.14272.fenkes@de.ibm.com> From: Stefan Roscher Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 11 +++++++++++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index ffd1ce9..cbb8b5b 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -1054,6 +1054,17 @@ static int internal_modify_qp(struct ib_qp *ibqp, "ehca_qp=%p qp_num=%x qp_state_xsit=%x", my_qp, 
ibqp->qp_num, statetrans); + /* eHCA2 rev2 and higher require the SEND_GRH_FLAG to be set + * in non-LL UD QPs. + */ + if ((my_qp->qp_type == IB_QPT_UD) && + (my_qp->ext_type != EQPT_LLQP) && + (statetrans == IB_QPST_INIT2RTR) && + (shca->hw_level >= 0x22)){ + update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); + mqpcb->send_grh_flag = 1; + } + /* sqe -> rts: set purge bit of bad wqe before actual trans */ if ((my_qp->qp_type == IB_QPT_UD || my_qp->qp_type == IB_QPT_GSI || -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:29:03 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:29:03 +0200 Subject: [ofa-general] [PATCH 08/13] IB/ehca: Lock renaming, static initializers In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091529.04073.fenkes@de.ibm.com> - Renamed all spinlock flags to "flags", matching the vast majority of kernel code. - Moved hcall_lock into the only module it's used in. - Replaced spin_lock_init() and friends with static initializers for global variables. 
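As a rough illustration of the last point (a userspace analogy, not part of this patch): a static initializer makes a global lock valid from program start, so no init call has to run before first use. The sketch assumes POSIX threads, with PTHREAD_MUTEX_INITIALIZER standing in for the kernel's DEFINE_SPINLOCK(); registry_lock, registry_count and registry_add() are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Statically initialized lock (hypothetical example state): it is
 * valid from program start, so no separate init call is required.
 */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static int registry_count;

/* Adds one entry under the lock and returns the new entry count. */
int registry_add(void)
{
	int count;
	int rc;

	rc = pthread_mutex_lock(&registry_lock);
	assert(rc == 0);	/* locking a valid default mutex should succeed */
	(void)rc;		/* keeps builds with NDEBUG warning-free */

	count = ++registry_count;
	pthread_mutex_unlock(&registry_lock);

	return count;
}
```

In the driver, the same idea is what allows ehca_module_init() to drop its spin_lock_init() and INIT_LIST_HEAD() calls.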
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 1 - drivers/infiniband/hw/ehca/ehca_cq.c | 12 ++++++------ drivers/infiniband/hw/ehca/ehca_main.c | 18 ++++-------------- drivers/infiniband/hw/ehca/ehca_qp.c | 6 +++--- drivers/infiniband/hw/ehca/ehca_reqs.c | 24 ++++++++++++------------ drivers/infiniband/hw/ehca/hcp_if.c | 2 ++ 6 files changed, 27 insertions(+), 36 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 9d689ae..3550047 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -295,7 +295,6 @@ void ehca_cleanup_mrmw_cache(void); extern spinlock_t ehca_qp_idr_lock; extern spinlock_t ehca_cq_idr_lock; -extern spinlock_t hcall_lock; extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 67f0670..94bad27 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -56,11 +56,11 @@ int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp) { unsigned int qp_num = qp->real_qp_num; unsigned int key = qp_num & (QP_HASHTAB_LEN-1); - unsigned long spl_flags; + unsigned long flags; - spin_lock_irqsave(&cq->spinlock, spl_flags); + spin_lock_irqsave(&cq->spinlock, flags); hlist_add_head(&qp->list_entries, &cq->qp_hashtab[key]); - spin_unlock_irqrestore(&cq->spinlock, spl_flags); + spin_unlock_irqrestore(&cq->spinlock, flags); ehca_dbg(cq->ib_cq.device, "cq_num=%x real_qp_num=%x", cq->cq_number, qp_num); @@ -74,9 +74,9 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); struct hlist_node *iter; struct ehca_qp *qp; - unsigned long spl_flags; + unsigned long flags; - spin_lock_irqsave(&cq->spinlock, spl_flags); + spin_lock_irqsave(&cq->spinlock, flags); hlist_for_each(iter, &cq->qp_hashtab[key]) { qp = 
hlist_entry(iter, struct ehca_qp, list_entries); if (qp->real_qp_num == real_qp_num) { @@ -88,7 +88,7 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) break; } } - spin_unlock_irqrestore(&cq->spinlock, spl_flags); + spin_unlock_irqrestore(&cq->spinlock, flags); if (ret) ehca_err(cq->ib_cq.device, "qp not found cq_num=%x real_qp_num=%x", diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 9bd749c..77db890 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -96,15 +96,13 @@ MODULE_PARM_DESC(static_rate, MODULE_PARM_DESC(scaling_code, "set scaling code (0: disabled/default, 1: enabled)"); -spinlock_t ehca_qp_idr_lock; -spinlock_t ehca_cq_idr_lock; -spinlock_t hcall_lock; +DEFINE_SPINLOCK(ehca_qp_idr_lock); +DEFINE_SPINLOCK(ehca_cq_idr_lock); DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); - -static struct list_head shca_list; /* list of all registered ehcas */ -static spinlock_t shca_list_lock; +static LIST_HEAD(shca_list); /* list of all registered ehcas */ +static DEFINE_SPINLOCK(shca_list_lock); static struct timer_list poll_eqs_timer; @@ -864,14 +862,6 @@ int __init ehca_module_init(void) printk(KERN_INFO "eHCA Infiniband Device Driver " "(Rel.: SVNEHCA_0023)\n"); - idr_init(&ehca_qp_idr); - idr_init(&ehca_cq_idr); - spin_lock_init(&ehca_qp_idr_lock); - spin_lock_init(&ehca_cq_idr_lock); - spin_lock_init(&hcall_lock); - - INIT_LIST_HEAD(&shca_list); - spin_lock_init(&shca_list_lock); if ((ret = ehca_create_comp_pool())) { ehca_gen_err("Cannot create comp pool."); diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 989f75e..ac4ff26 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -933,7 +933,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, u64 h_ret; int bad_wqe_cnt = 0; int squeue_locked = 0; - unsigned long spl_flags = 0; + unsigned long flags = 0; 
/* do query_qp to obtain current attr values */ mqpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL); @@ -1074,7 +1074,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, if (!ibqp->uobject) { struct ehca_wqe *wqe; /* lock send queue */ - spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + spin_lock_irqsave(&my_qp->spinlock_s, flags); squeue_locked = 1; /* mark next free wqe */ wqe = (struct ehca_wqe*) @@ -1360,7 +1360,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, modify_qp_exit2: if (squeue_locked) { /* this means: sqe -> rts */ - spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + spin_unlock_irqrestore(&my_qp->spinlock_s, flags); my_qp->sqerr_purgeflag = 1; } diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index b5664fa..73f0c06 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -363,10 +363,10 @@ int ehca_post_send(struct ib_qp *qp, struct ehca_wqe *wqe_p; int wqe_cnt = 0; int ret = 0; - unsigned long spl_flags; + unsigned long flags; /* LOCK the QUEUE */ - spin_lock_irqsave(&my_qp->spinlock_s, spl_flags); + spin_lock_irqsave(&my_qp->spinlock_s, flags); /* loop processes list of send reqs */ for (cur_send_wr = send_wr; cur_send_wr != NULL; @@ -408,7 +408,7 @@ int ehca_post_send(struct ib_qp *qp, post_send_exit0: /* UNLOCK the QUEUE */ - spin_unlock_irqrestore(&my_qp->spinlock_s, spl_flags); + spin_unlock_irqrestore(&my_qp->spinlock_s, flags); iosync(); /* serialize GAL register access */ hipz_update_sqa(my_qp, wqe_cnt); return ret; @@ -423,7 +423,7 @@ static int internal_post_recv(struct ehca_qp *my_qp, struct ehca_wqe *wqe_p; int wqe_cnt = 0; int ret = 0; - unsigned long spl_flags; + unsigned long flags; if (unlikely(!HAS_RQ(my_qp))) { ehca_err(dev, "QP has no RQ ehca_qp=%p qp_num=%x ext_type=%d", @@ -432,7 +432,7 @@ static int internal_post_recv(struct ehca_qp *my_qp, } /* LOCK the QUEUE */ - spin_lock_irqsave(&my_qp->spinlock_r, spl_flags); + 
spin_lock_irqsave(&my_qp->spinlock_r, flags); /* loop processes list of send reqs */ for (cur_recv_wr = recv_wr; cur_recv_wr != NULL; @@ -473,7 +473,7 @@ static int internal_post_recv(struct ehca_qp *my_qp, } /* eof for cur_recv_wr */ post_recv_exit0: - spin_unlock_irqrestore(&my_qp->spinlock_r, spl_flags); + spin_unlock_irqrestore(&my_qp->spinlock_r, flags); iosync(); /* serialize GAL register access */ hipz_update_rqa(my_qp, wqe_cnt); return ret; @@ -536,7 +536,7 @@ poll_cq_one_read_cqe: if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) { struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number); int purgeflag; - unsigned long spl_flags; + unsigned long flags; if (!qp) { ehca_err(cq->device, "cq_num=%x qp_num=%x " "could not find qp -> ignore cqe", @@ -546,9 +546,9 @@ poll_cq_one_read_cqe: /* ignore this purged cqe */ goto poll_cq_one_read_cqe; } - spin_lock_irqsave(&qp->spinlock_s, spl_flags); + spin_lock_irqsave(&qp->spinlock_s, flags); purgeflag = qp->sqerr_purgeflag; - spin_unlock_irqrestore(&qp->spinlock_s, spl_flags); + spin_unlock_irqrestore(&qp->spinlock_s, flags); if (purgeflag) { ehca_dbg(cq->device, "Got CQE with purged bit qp_num=%x " @@ -633,7 +633,7 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc) int nr; struct ib_wc *current_wc = wc; int ret = 0; - unsigned long spl_flags; + unsigned long flags; if (num_entries < 1) { ehca_err(cq->device, "Invalid num_entries=%d ehca_cq=%p " @@ -642,14 +642,14 @@ int ehca_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc) goto poll_cq_exit0; } - spin_lock_irqsave(&my_cq->spinlock, spl_flags); + spin_lock_irqsave(&my_cq->spinlock, flags); for (nr = 0; nr < num_entries; nr++) { ret = ehca_poll_cq_one(cq, current_wc); if (ret) break; current_wc++; } /* eof for nr */ - spin_unlock_irqrestore(&my_cq->spinlock, spl_flags); + spin_unlock_irqrestore(&my_cq->spinlock, flags); if (ret == -EAGAIN || !ret) ret = nr; diff --git a/drivers/infiniband/hw/ehca/hcp_if.c 
b/drivers/infiniband/hw/ehca/hcp_if.c index b078377..5b927a6 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -81,6 +81,8 @@ #define H_MP_SHUTDOWN EHCA_BMASK_IBM(48, 48) #define H_MP_RESET_QKEY_CTR EHCA_BMASK_IBM(49, 49) +DEFINE_SPINLOCK(hcall_lock); + static u32 get_longbusy_msecs(int longbusy_rc) { switch (longbusy_rc) { -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:28:18 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:28:18 +0200 Subject: [ofa-general] [PATCH 07/13] IB/ehca: Report RDMA atomic attributes in query_qp() In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091528.19168.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index cbb8b5b..989f75e 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -1491,6 +1491,9 @@ int ehca_query_qp(struct ib_qp *qp, qp_attr->alt_port_num = qpcb->alt_phys_port; qp_attr->alt_timeout = qpcb->timeout_al; + qp_attr->max_dest_rd_atomic = qpcb->rdma_nr_atomic_resp_res; + qp_attr->max_rd_atomic = qpcb->rdma_atomic_outst_dest_qp; + /* primary av */ qp_attr->ah_attr.sl = qpcb->service_level; -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:30:39 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:30:39 +0200 Subject: [ofa-general] [PATCH 09/13] IB/ehca: Refactor synchronization between completions and destroy_cq using atomic_t In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091530.40581.fenkes@de.ibm.com> - ehca_cq.nr_events is made an atomic_t, eliminating a lot of locking. 
- The CQ is removed from the CQ idr first now to make sure no more completions are scheduled on that CQ. The "wait for all completions to end" code becomes much simpler this way. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 4 +- drivers/infiniband/hw/ehca/ehca_cq.c | 26 +++++++------------- drivers/infiniband/hw/ehca/ehca_irq.c | 36 +++++++++++++--------------- drivers/infiniband/hw/ehca/ehca_irq.h | 1 - drivers/infiniband/hw/ehca/ehca_tools.h | 1 + 5 files changed, 29 insertions(+), 39 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 3550047..8580f2a 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -174,8 +174,8 @@ struct ehca_cq { spinlock_t cb_lock; struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; struct list_head entry; - u32 nr_callbacks; /* #events assigned to cpu by scaling code */ - u32 nr_events; /* #events seen */ + u32 nr_callbacks; /* #events assigned to cpu by scaling code */ + atomic_t nr_events; /* #events seen */ wait_queue_head_t wait_completion; spinlock_t task_lock; u32 ownpid; diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 94bad27..3729997 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -146,6 +146,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, spin_lock_init(&my_cq->spinlock); spin_lock_init(&my_cq->cb_lock); spin_lock_init(&my_cq->task_lock); + atomic_set(&my_cq->nr_events, 0); init_waitqueue_head(&my_cq->wait_completion); my_cq->ownpid = current->tgid; @@ -303,16 +304,6 @@ create_cq_exit1: return cq; } -static int get_cq_nr_events(struct ehca_cq *my_cq) -{ - int ret; - unsigned long flags; - spin_lock_irqsave(&ehca_cq_idr_lock, flags); - ret = my_cq->nr_events; - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); - return ret; -} - int ehca_destroy_cq(struct 
ib_cq *cq) { u64 h_ret; @@ -339,17 +330,18 @@ int ehca_destroy_cq(struct ib_cq *cq) } } + /* + * remove the CQ from the idr first to make sure + * no more interrupt tasklets will touch this CQ + */ spin_lock_irqsave(&ehca_cq_idr_lock, flags); - while (my_cq->nr_events) { - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); - wait_event(my_cq->wait_completion, !get_cq_nr_events(my_cq)); - spin_lock_irqsave(&ehca_cq_idr_lock, flags); - /* recheck nr_events to assure no cqe has just arrived */ - } - idr_remove(&ehca_cq_idr, my_cq->token); spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + /* now wait until all pending events have completed */ + wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events)); + + /* nobody's using our CQ any longer -- we can destroy it */ h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 100329b..3e790a3 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -5,6 +5,8 @@ * * Authors: Heiko J Schick * Khadija Souissi + * Hoang-Nam Nguyen + * Joachim Fenkes * * Copyright (c) 2005 IBM Corporation * @@ -212,6 +214,8 @@ static void cq_event_callback(struct ehca_shca *shca, spin_lock_irqsave(&ehca_cq_idr_lock, flags); cq = idr_find(&ehca_cq_idr, token); + if (cq) + atomic_inc(&cq->nr_events); spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); if (!cq) @@ -219,6 +223,9 @@ static void cq_event_callback(struct ehca_shca *shca, ehca_error_data(shca, cq, cq->ipz_cq_handle.handle); + if (atomic_dec_and_test(&cq->nr_events)) + wake_up(&cq->wait_completion); + return; } @@ -414,25 +421,22 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); spin_lock_irqsave(&ehca_cq_idr_lock, flags); cq = idr_find(&ehca_cq_idr, token); + if (cq) + 
atomic_inc(&cq->nr_events); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); if (cq == NULL) { - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); ehca_err(&shca->ib_device, "Invalid eqe for non-existing cq token=%x", token); return; } reset_eq_pending(cq); - cq->nr_events++; - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); if (ehca_scaling_code) queue_comp_task(cq); else { comp_event_callback(cq); - spin_lock_irqsave(&ehca_cq_idr_lock, flags); - cq->nr_events--; - if (!cq->nr_events) + if (atomic_dec_and_test(&cq->nr_events)) wake_up(&cq->wait_completion); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); } } else { ehca_dbg(&shca->ib_device, "Got non completion event"); @@ -478,15 +482,15 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); spin_lock(&ehca_cq_idr_lock); eqe_cache[eqe_cnt].cq = idr_find(&ehca_cq_idr, token); + if (eqe_cache[eqe_cnt].cq) + atomic_inc(&eqe_cache[eqe_cnt].cq->nr_events); + spin_unlock(&ehca_cq_idr_lock); if (!eqe_cache[eqe_cnt].cq) { - spin_unlock(&ehca_cq_idr_lock); ehca_err(&shca->ib_device, "Invalid eqe for non-existing cq " "token=%x", token); continue; } - eqe_cache[eqe_cnt].cq->nr_events++; - spin_unlock(&ehca_cq_idr_lock); } else eqe_cache[eqe_cnt].cq = NULL; eqe_cnt++; @@ -517,11 +521,8 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) else { struct ehca_cq *cq = eq->eqe_cache[i].cq; comp_event_callback(cq); - spin_lock(&ehca_cq_idr_lock); - cq->nr_events--; - if (!cq->nr_events) + if (atomic_dec_and_test(&cq->nr_events)) wake_up(&cq->wait_completion); - spin_unlock(&ehca_cq_idr_lock); } } else { ehca_dbg(&shca->ib_device, "Got non completion event"); @@ -621,13 +622,10 @@ static void run_comp_task(struct ehca_cpu_comp_task* cct) while (!list_empty(&cct->cq_list)) { cq = list_entry(cct->cq_list.next, struct ehca_cq, entry); spin_unlock_irqrestore(&cct->task_lock, flags); - comp_event_callback(cq); - spin_lock_irqsave(&ehca_cq_idr_lock, flags); - 
cq->nr_events--; - if (!cq->nr_events) + comp_event_callback(cq); + if (atomic_dec_and_test(&cq->nr_events)) wake_up(&cq->wait_completion); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); spin_lock_irqsave(&cct->task_lock, flags); spin_lock(&cq->task_lock); diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h index 6ed06ee..3346cb0 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.h +++ b/drivers/infiniband/hw/ehca/ehca_irq.h @@ -47,7 +47,6 @@ struct ehca_shca; #include #include -#include int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource); diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 973c4b5..03b185f 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -59,6 +59,7 @@ #include #include +#include #include #include #include -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:31:10 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:31:10 +0200 Subject: [ofa-general] [PATCH 10/13] IB/ehca: Change idr spinlocks into rwlocks In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091531.11544.fenkes@de.ibm.com> This eliminates lock contention among IRQs as well as the need to disable IRQs around idr_find, because there are no IRQ writers. 
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 4 ++-- drivers/infiniband/hw/ehca/ehca_cq.c | 12 ++++++------ drivers/infiniband/hw/ehca/ehca_irq.c | 19 ++++++++----------- drivers/infiniband/hw/ehca/ehca_main.c | 4 ++-- drivers/infiniband/hw/ehca/ehca_qp.c | 12 ++++++------ drivers/infiniband/hw/ehca/ehca_uverbs.c | 9 ++++----- 6 files changed, 28 insertions(+), 32 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 8580f2a..f1e0db2 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -293,8 +293,8 @@ void ehca_cleanup_av_cache(void); int ehca_init_mrmw_cache(void); void ehca_cleanup_mrmw_cache(void); -extern spinlock_t ehca_qp_idr_lock; -extern spinlock_t ehca_cq_idr_lock; +extern rwlock_t ehca_qp_idr_lock; +extern rwlock_t ehca_cq_idr_lock; extern struct idr ehca_qp_idr; extern struct idr ehca_cq_idr; diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 3729997..01d4a14 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -163,9 +163,9 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, goto create_cq_exit1; } - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + write_lock_irqsave(&ehca_cq_idr_lock, flags); ret = idr_get_new(&ehca_cq_idr, my_cq, &my_cq->token); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + write_unlock_irqrestore(&ehca_cq_idr_lock, flags); } while (ret == -EAGAIN); @@ -294,9 +294,9 @@ create_cq_exit3: "cq_num=%x h_ret=%lx", my_cq, my_cq->cq_number, h_ret); create_cq_exit2: - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + write_lock_irqsave(&ehca_cq_idr_lock, flags); idr_remove(&ehca_cq_idr, my_cq->token); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + write_unlock_irqrestore(&ehca_cq_idr_lock, flags); create_cq_exit1: kmem_cache_free(cq_cache, my_cq); @@ -334,9 +334,9 @@ 
int ehca_destroy_cq(struct ib_cq *cq) * remove the CQ from the idr first to make sure * no more interrupt tasklets will touch this CQ */ - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + write_lock_irqsave(&ehca_cq_idr_lock, flags); idr_remove(&ehca_cq_idr, my_cq->token); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + write_unlock_irqrestore(&ehca_cq_idr_lock, flags); /* now wait until all pending events have completed */ wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events)); diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 3e790a3..02b73c8 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -180,12 +180,11 @@ static void qp_event_callback(struct ehca_shca *shca, { struct ib_event event; struct ehca_qp *qp; - unsigned long flags; u32 token = EHCA_BMASK_GET(EQE_QP_TOKEN, eqe); - spin_lock_irqsave(&ehca_qp_idr_lock, flags); + read_lock(&ehca_qp_idr_lock); qp = idr_find(&ehca_qp_idr, token); - spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + read_unlock(&ehca_qp_idr_lock); if (!qp) @@ -209,14 +208,13 @@ static void cq_event_callback(struct ehca_shca *shca, u64 eqe) { struct ehca_cq *cq; - unsigned long flags; u32 token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe); - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + read_lock(&ehca_cq_idr_lock); cq = idr_find(&ehca_cq_idr, token); if (cq) atomic_inc(&cq->nr_events); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + read_unlock(&ehca_cq_idr_lock); if (!cq) return; @@ -411,7 +409,6 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) { u64 eqe_value; u32 token; - unsigned long flags; struct ehca_cq *cq; eqe_value = eqe->entry; @@ -419,11 +416,11 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) { ehca_dbg(&shca->ib_device, "Got completion event"); token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); - 
spin_lock_irqsave(&ehca_cq_idr_lock, flags); + read_lock(&ehca_cq_idr_lock); cq = idr_find(&ehca_cq_idr, token); if (cq) atomic_inc(&cq->nr_events); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + read_unlock(&ehca_cq_idr_lock); if (cq == NULL) { ehca_err(&shca->ib_device, "Invalid eqe for non-existing cq token=%x", @@ -480,11 +477,11 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) eqe_value = eqe_cache[eqe_cnt].eqe->entry; if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) { token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); - spin_lock(&ehca_cq_idr_lock); + read_lock(&ehca_cq_idr_lock); eqe_cache[eqe_cnt].cq = idr_find(&ehca_cq_idr, token); if (eqe_cache[eqe_cnt].cq) atomic_inc(&eqe_cache[eqe_cnt].cq->nr_events); - spin_unlock(&ehca_cq_idr_lock); + read_unlock(&ehca_cq_idr_lock); if (!eqe_cache[eqe_cnt].cq) { ehca_err(&shca->ib_device, "Invalid eqe for non-existing cq " diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 77db890..e58e821 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -96,8 +96,8 @@ MODULE_PARM_DESC(static_rate, MODULE_PARM_DESC(scaling_code, "set scaling code (0: disabled/default, 1: enabled)"); -DEFINE_SPINLOCK(ehca_qp_idr_lock); -DEFINE_SPINLOCK(ehca_cq_idr_lock); +DEFINE_RWLOCK(ehca_qp_idr_lock); +DEFINE_RWLOCK(ehca_cq_idr_lock); DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index ac4ff26..7452ef4 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -512,9 +512,9 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, goto create_qp_exit0; } - spin_lock_irqsave(&ehca_qp_idr_lock, flags); + write_lock_irqsave(&ehca_qp_idr_lock, flags); ret = idr_get_new(&ehca_qp_idr, my_qp, &my_qp->token); - spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); } 
while (ret == -EAGAIN); @@ -733,9 +733,9 @@ create_qp_exit2: hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); create_qp_exit1: - spin_lock_irqsave(&ehca_qp_idr_lock, flags); + write_lock_irqsave(&ehca_qp_idr_lock, flags); idr_remove(&ehca_qp_idr, my_qp->token); - spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); create_qp_exit0: kmem_cache_free(qp_cache, my_qp); @@ -1706,9 +1706,9 @@ int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, } } - spin_lock_irqsave(&ehca_qp_idr_lock, flags); + write_lock_irqsave(&ehca_qp_idr_lock, flags); idr_remove(&ehca_qp_idr, my_qp->token); - spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index d8fe37d..3031b3b 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -253,7 +253,6 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ u32 cur_pid = current->tgid; u32 ret; - unsigned long flags; struct ehca_cq *cq; struct ehca_qp *qp; struct ehca_pd *pd; @@ -261,9 +260,9 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) switch (q_type) { case 1: /* CQ */ - spin_lock_irqsave(&ehca_cq_idr_lock, flags); + read_lock(&ehca_cq_idr_lock); cq = idr_find(&ehca_cq_idr, idr_handle); - spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); + read_unlock(&ehca_cq_idr_lock); /* make sure this mmap really belongs to the authorized user */ if (!cq) @@ -289,9 +288,9 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) break; case 2: /* QP */ - spin_lock_irqsave(&ehca_qp_idr_lock, flags); + read_lock(&ehca_qp_idr_lock); qp = idr_find(&ehca_qp_idr, idr_handle); - 
spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); + read_unlock(&ehca_qp_idr_lock); /* make sure this mmap really belongs to the authorized user */ if (!qp) -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:31:53 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:31:53 +0200 Subject: [ofa-general] [PATCH 11/13] IB/ehca: return QP pointer in poll_cq(), add two unlikely() statements In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091531.54219.fenkes@de.ibm.com> Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_reqs.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 73f0c06..fd3ba22 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -517,6 +517,7 @@ static inline int ehca_poll_cq_one(struct ib_cq *cq, struct ib_wc *wc) int ret = 0; struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); struct ehca_cqe *cqe; + struct ehca_qp *my_qp; int cqe_count = 0; poll_cq_one_read_cqe: @@ -568,7 +569,7 @@ poll_cq_one_read_cqe: } /* tracing cqe */ - if (ehca_debug_level) { + if (unlikely(ehca_debug_level)) { ehca_dbg(cq->device, "Received COMPLETION ehca_cq=%p cq_num=%x -----", my_cq, my_cq->cq_number); @@ -602,7 +603,11 @@ poll_cq_one_read_cqe: } else wc->status = IB_WC_SUCCESS; - wc->qp = NULL; + read_lock(&ehca_qp_idr_lock); + my_qp = idr_find(&ehca_qp_idr, cqe->qp_token); + wc->qp = &my_qp->ib_qp; + read_unlock(&ehca_qp_idr_lock); + wc->byte_len = cqe->nr_bytes_transferred; wc->pkey_index = cqe->pkey_index; wc->slid = cqe->rlid; @@ -612,7 +617,7 @@ poll_cq_one_read_cqe: wc->imm_data = cpu_to_be32(cqe->immediate_data); wc->sl = cqe->service_level; - if (wc->status != IB_WC_SUCCESS) + if (unlikely(wc->status != IB_WC_SUCCESS)) ehca_dbg(cq->device, "ehca_cq=%p cq_num=%x WARNING unsuccessful cqe " 
"OPType=%x status=%x qp_num=%x src_qp=%x wr_id=%lx " -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:32:22 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:32:22 +0200 Subject: [ofa-general] [PATCH 12/13] IB/ehca: notify consumers of LID/PKEY/SM changes after nondisruptive events In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091532.23189.fenkes@de.ibm.com> When firmware reports a nondisruptive port configuration change event, previous versions of the eHCA driver didn't forward the event to consumers like IPoIB. Add code that determines the type of configuration change by comparing old and new port attributes and reports it. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 6 ++ drivers/infiniband/hw/ehca/ehca_hca.c | 34 +++++++++++ drivers/infiniband/hw/ehca/ehca_irq.c | 89 +++++++++++++++++++---------- drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 + 4 files changed, 101 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index f1e0db2..daf823e 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -87,11 +87,17 @@ struct ehca_eq { struct ehca_eqe_cache_entry eqe_cache[EHCA_EQE_CACHE_SIZE]; }; +struct ehca_sma_attr { + u16 lid, lmc, sm_sl, sm_lid; + u16 pkey_tbl_len, pkeys[16]; +}; + struct ehca_sport { struct ib_cq *ibcq_aqp1; struct ib_qp *ibqp_aqp1; enum ib_rate rate; enum ib_port_state port_state; + struct ehca_sma_attr saved_attr; }; struct ehca_shca { diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index b310de5..bbd3c6a 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -193,6 +193,40 @@ query_port1: return ret; } +int ehca_query_sma_attr(struct ehca_shca *shca, + u8 port, struct ehca_sma_attr *attr) +{ + int ret = 0; 
+ struct hipz_query_port *rblock; + + rblock = ehca_alloc_fw_ctrlblock(GFP_ATOMIC); + if (!rblock) { + ehca_err(&shca->ib_device, "Can't allocate rblock memory."); + return -ENOMEM; + } + + if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + ehca_err(&shca->ib_device, "Can't query port properties"); + ret = -EINVAL; + goto query_sma_attr1; + } + + memset(attr, 0, sizeof(struct ehca_sma_attr)); + + attr->lid = rblock->lid; + attr->lmc = rblock->lmc; + attr->sm_sl = rblock->sm_sl; + attr->sm_lid = rblock->sm_lid; + + attr->pkey_tbl_len = rblock->pkey_tbl_len; + memcpy(attr->pkeys, rblock->pkey_entries, sizeof(attr->pkeys)); + +query_sma_attr1: + ehca_free_fw_ctrlblock(rblock); + + return ret; +} + int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { int ret = 0; diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 02b73c8..96eba38 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -61,6 +61,7 @@ #define NEQE_EVENT_CODE EHCA_BMASK_IBM(2,7) #define NEQE_PORT_NUMBER EHCA_BMASK_IBM(8,15) #define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16,16) +#define NEQE_DISRUPTIVE EHCA_BMASK_IBM(16,16) #define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52,63) #define ERROR_DATA_TYPE EHCA_BMASK_IBM(0,7) @@ -286,30 +287,61 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe) return; } -static void parse_ec(struct ehca_shca *shca, u64 eqe) +static void dispatch_port_event(struct ehca_shca *shca, int port_num, + enum ib_event_type type, const char *msg) { struct ib_event event; + + ehca_info(&shca->ib_device, "port %d %s.", port_num, msg); + event.device = &shca->ib_device; + event.event = type; + event.element.port_num = port_num; + ib_dispatch_event(&event); +} + +static void notify_port_conf_change(struct ehca_shca *shca, int port_num) +{ + struct ehca_sma_attr new_attr; + struct ehca_sma_attr *old_attr = &shca->sport[port_num - 1].saved_attr; + + 
ehca_query_sma_attr(shca, port_num, &new_attr); + + if (new_attr.sm_sl != old_attr->sm_sl || + new_attr.sm_lid != old_attr->sm_lid) + dispatch_port_event(shca, port_num, IB_EVENT_SM_CHANGE, + "SM changed"); + + if (new_attr.lid != old_attr->lid || + new_attr.lmc != old_attr->lmc) + dispatch_port_event(shca, port_num, IB_EVENT_LID_CHANGE, + "LID changed"); + + if (new_attr.pkey_tbl_len != old_attr->pkey_tbl_len || + memcmp(new_attr.pkeys, old_attr->pkeys, + sizeof(u16) * new_attr.pkey_tbl_len)) + dispatch_port_event(shca, port_num, IB_EVENT_PKEY_CHANGE, + "P_Key changed"); + + *old_attr = new_attr; +} + +static void parse_ec(struct ehca_shca *shca, u64 eqe) +{ u8 ec = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe); u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe); switch (ec) { case 0x30: /* port availability change */ if (EHCA_BMASK_GET(NEQE_PORT_AVAILABILITY, eqe)) { - ehca_info(&shca->ib_device, - "port %x is active.", port); - event.device = &shca->ib_device; - event.event = IB_EVENT_PORT_ACTIVE; - event.element.port_num = port; shca->sport[port - 1].port_state = IB_PORT_ACTIVE; - ib_dispatch_event(&event); + dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE, + "is active"); + ehca_query_sma_attr(shca, port, + &shca->sport[port - 1].saved_attr); } else { - ehca_info(&shca->ib_device, - "port %x is inactive.", port); - event.device = &shca->ib_device; - event.event = IB_EVENT_PORT_ERR; - event.element.port_num = port; shca->sport[port - 1].port_state = IB_PORT_DOWN; - ib_dispatch_event(&event); + dispatch_port_event(shca, port, IB_EVENT_PORT_ERR, + "is inactive"); } break; case 0x31: @@ -317,24 +349,19 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe) * disruptive change is caused by * LID, PKEY or SM change */ - ehca_warn(&shca->ib_device, - "disruptive port %x configuration change", port); - - ehca_info(&shca->ib_device, - "port %x is inactive.", port); - event.device = &shca->ib_device; - event.event = IB_EVENT_PORT_ERR; - event.element.port_num = port; - 
shca->sport[port - 1].port_state = IB_PORT_DOWN; - ib_dispatch_event(&event); - - ehca_info(&shca->ib_device, - "port %x is active.", port); - event.device = &shca->ib_device; - event.event = IB_EVENT_PORT_ACTIVE; - event.element.port_num = port; - shca->sport[port - 1].port_state = IB_PORT_ACTIVE; - ib_dispatch_event(&event); + if (EHCA_BMASK_GET(NEQE_DISRUPTIVE, eqe)) { + ehca_warn(&shca->ib_device, "disruptive port " + "%d configuration change", port); + + shca->sport[port - 1].port_state = IB_PORT_DOWN; + dispatch_port_event(shca, port, IB_EVENT_PORT_ERR, + "is inactive"); + + shca->sport[port - 1].port_state = IB_PORT_ACTIVE; + dispatch_port_event(shca, port, IB_EVENT_PORT_ACTIVE, + "is active"); + } else + notify_port_conf_change(shca, port); break; case 0x32: /* adapter malfunction */ ehca_err(&shca->ib_device, "Adapter malfunction."); diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index fd84a80..77aeca6 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -49,6 +49,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props); int ehca_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props); +int ehca_query_sma_attr(struct ehca_shca *shca, u8 port, + struct ehca_sma_attr *attr); + int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 * pkey); int ehca_query_gid(struct ib_device *ibdev, u8 port, int index, -- 1.5.2 From fenkes at de.ibm.com Mon Jul 9 06:33:52 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Mon, 9 Jul 2007 15:33:52 +0200 Subject: [ofa-general] [PATCH 13/13] IB/ehca: Improve latency by unlocking the SQ/RQ after triggering the hardware In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: <200707091533.53383.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_reqs.c | 5 
++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index fd3ba22..61da65e 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -407,10 +407,9 @@ int ehca_post_send(struct ib_qp *qp, } /* eof for cur_send_wr */ post_send_exit0: - /* UNLOCK the QUEUE */ - spin_unlock_irqrestore(&my_qp->spinlock_s, flags); iosync(); /* serialize GAL register access */ hipz_update_sqa(my_qp, wqe_cnt); + spin_unlock_irqrestore(&my_qp->spinlock_s, flags); return ret; } @@ -473,9 +472,9 @@ static int internal_post_recv(struct ehca_qp *my_qp, } /* eof for cur_recv_wr */ post_recv_exit0: - spin_unlock_irqrestore(&my_qp->spinlock_r, flags); iosync(); /* serialize GAL register access */ hipz_update_rqa(my_qp, wqe_cnt); + spin_unlock_irqrestore(&my_qp->spinlock_r, flags); return ret; } -- 1.5.2 From xhejtman at ics.muni.cz Mon Jul 9 06:37:43 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Mon, 9 Jul 2007 15:37:43 +0200 Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux In-Reply-To: References: <20070704125429.GL3885@ics.muni.cz> <20070705193136.GQ3885@ics.muni.cz> <20070707085303.GS3885@ics.muni.cz> <20070708001531.GT3885@ics.muni.cz> Message-ID: <20070709133743.GK3885@ics.muni.cz> On Sun, Jul 08, 2007 at 10:03:16PM -0700, Roland Dreier wrote: > Is the memory given to a domU always physically contiguous? If not, > what happens when a domU kernel does alloc_pages(GFP_KERNEL, 6) to try > and allocate 256 KB or something like that. Let's assume that the > domU kernel has enough guest contiguous pages to satisfy the > allocation -- is there any guarantee that the pages are really > physically contiguous? The driver in Dom0 started to work fine. Do not know why. 
In DomU, using some debug prints, I found that dma_coherent memory is OK (contiguous pages), but alloc_pages returns contiguous pages but in the reverse order: ib_mthca 0000:08:00.0: Alloc pages starts ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026693000, virt ffff880098b02000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026692000, virt ffff880098b03000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026691000, virt ffff880098b04000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026690000, virt ffff880098b05000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668f000, virt ffff880098b06000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668e000, virt ffff880098b07000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668d000, virt ffff880098b08000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668c000, virt ffff880098b09000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668b000, virt ffff880098b0a000 ib_mthca 0000:08:00.0: Page phys. addr 000000002668a000, virt ffff880098b0b000 ib_mthca 0000:08:00.0: Page phys. addr 0000000026689000, virt ffff880098b0c000 ib_mthca 0000:08:00.0: Page phys. 
addr 0000000026688000, virt ffff880098b0d000 -- Lukáš Hejtmánek From halr at voltaire.com Mon Jul 9 06:42:53 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jul 2007 09:42:53 -0400 Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> References: <1183640246.4377.436639.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> Message-ID: <1183988571.25217.377395.camel@hal.voltaire.com> Hi Amit, On Mon, 2007-07-09 at 09:27, Amit Krig wrote: > Hi Hal, > > In such case OpenSM should first check that the OPVL fields of the ports > (the one that sends the traps and its peer) are identical, > If you have a mismatch in the OPVL field, the link watchdog mechanism > will retrain the logical link in high rate If OpVLs is set after the link goes active, it only takes "effect" if the link is bounced (not if it stays active). Also, and more significantly for this specific issue, the peer SMA is often nonresponsive or shortly becomes nonresponsive, so the peer OpVLs cannot readily be verified once this is detected. -- Hal > Amit > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, July 05, 2007 3:58 PM > To: general at lists.openfabrics.org > Cc: Eitan Zahavi; Yevgeny Kliteynik > Subject: [PATCH] OpenSM handling of "Babbling" Ports > > A "babbling" port is a port which causes traps to be generated > frequently. > It may directly be "this" port which generates the traps or the peer > port detecting the issue and that the SMA on switch port 0 generates the > traps. > This has only currently been observed for trap 131 but will also apply > for traps 129 and 130 as well which are other urgent and similar traps.
> > Note that there appears to be a bug in Mellanox firmware for both > Anafa-2 and Tavor at a minimum which causes the max trap rate not to be > adhered to and relief for this does not appear to be in short term > sight. > > Policy > When a babbling port is detected, OpenSM will disable the port or its > peer switch port (depending on which trap) which should terminate the > trap storm. > > Detection > 250 consecutive traps of this type will be used as the (initial) > threshold. The reason for this is so as to not prematurely detect this > and disable a port. > > Recovery > Admin would reenable port when OK again. (This usually involves > rebooting the node causing the trap to be indicated.) > > Signed-off-by: Hal Rosenstock > > diff --git a/opensm/include/opensm/osm_subnet.h > b/opensm/include/opensm/osm_subnet.h > index bedd63f..1150703 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt > boolean_t honor_guid2lid_file; > boolean_t daemon; > boolean_t sm_inactive; > + boolean_t babbling_port_policy; > osm_qos_options_t qos_options; > osm_qos_options_t qos_ca_options; > osm_qos_options_t qos_sw0_options; > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt > * > * sm_inactive > * OpenSM will start with SM in not active state. > +* > +* babbling_port_policy > +* OpenSM will enforce its "babbling" port policy.
> * > * perfmgr > * Enable or disable the performance manager > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 726b665..87b71e5 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -472,6 +472,7 @@ osm_subn_set_default_opt( > p_opt->honor_guid2lid_file = FALSE; > p_opt->daemon = FALSE; > p_opt->sm_inactive = FALSE; > + p_opt->babbling_port_policy = FALSE; > #ifdef ENABLE_OSM_PERF_MGR > p_opt->perfmgr = FALSE; > p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ > -1358,6 +1359,10 @@ osm_subn_parse_conf_file( > "sm_inactive", > p_key, p_val, &p_opts->sm_inactive); > > + __osm_subn_opts_unpack_boolean( > + "babbling_port_policy", > + p_key, p_val, &p_opts->babbling_port_policy); > + > #ifdef ENABLE_OSM_PERF_MGR > __osm_subn_opts_unpack_boolean( > "perfmgr", > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( > "# Daemon mode\n" > "daemon %s\n\n" > "# SM Inactive\n" > - "sm_inactive %s\n\n", > + "sm_inactive %s\n\n" > + "# Babbling Port Policy\n" > + "babbling_port_policy %s\n\n", > p_opts->daemon ? "TRUE" : "FALSE", > - p_opts->sm_inactive ? "TRUE" : "FALSE" > + p_opts->sm_inactive ? "TRUE" : "FALSE", > + p_opts->babbling_port_policy ? "TRUE" : "FALSE" > ); > > #ifdef ENABLE_OSM_PERF_MGR > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > index 5900c51..fbb6dac 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( > } > else > { > + /* When babbling port policy option is enabled and > + Threshold for disabling a "babbling" port is exceeded */ > + if ( p_rcv->p_subn->opt.babbling_port_policy && > + num_received >= 250 ) > + { > + uint8_t payload[IB_SMP_DATA_SIZE]; > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > + const ib_port_info_t* p_old_pi; > + osm_madw_context_t context; > + > + /* If trap 131, might want to disable peer port if > available */ > + /* but peer port has been observed not to respond to SM > + requests */ > + > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3810: " > + " Disabling physical port lid:0x%02X num:%u\n", > + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), > + p_ntci->data_details.ntc_129_131.port_num > + ); > + > + p_old_pi = &p_physp->port_info; > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > + > + /* Set port to disabled/down */ > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > + ib_port_info_set_port_phys_state( > + IB_PORT_PHYS_STATE_DISABLED, p_pi ); > + > + context.pi_context.node_guid = osm_node_get_node_guid( > osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.port_guid = osm_physp_get_port_guid( > p_physp ); > + context.pi_context.set_method = TRUE; > + context.pi_context.update_master_sm_base_lid = FALSE; > + context.pi_context.light_sweep = FALSE; > + context.pi_context.active_transition = FALSE; > + > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > + osm_physp_get_dr_path_ptr( p_physp > ), > + payload, > + sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32(osm_physp_get_port_num( > p_physp )), > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status == IB_SUCCESS ) > + { > + goto Exit; > + } > + else > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3811: " > + "Request to set PortInfo failed\n" ); > + } > + } > + > osm_log( p_rcv->p_log, 
OSM_LOG_VERBOSE, > "__osm_trap_rcv_process_request: " > "Marking unhealthy physical port by lid:0x%02X > num:%u\n", > > > > From Don.Kerr at Sun.COM Mon Jul 9 07:27:08 2007 From: Don.Kerr at Sun.COM (Don Kerr) Date: Mon, 09 Jul 2007 10:27:08 -0400 Subject: [ofa-general] uDAPL Question Message-ID: <469245BC.8040108@Sun.COM> (not sure if this is the proper alias but giving it a try) Question: Is it possible to determine if an HCA is down intentionally from the uDAPL API? Situation: My node has two HCAs but only one is "UP". A call to dat_registry_list_providers gives me everything in dat.conf, including the interface that is down. I then proceed to call dat_ia_open on each entry, but I don't know how to determine whether the error I get back is because the interface is in error or because it is down on purpose. Thanks -DON From eaburns at iol.unh.edu Mon Jul 9 07:47:02 2007 From: eaburns at iol.unh.edu (Ethan Burns) Date: Mon, 9 Jul 2007 10:47:02 -0400 Subject: [ofa-general] iSER header Message-ID: <20070709144702.GB24125@postal.iol.unh.edu> Hello, I have been looking over the latest Linus git repo and I stumbled upon what appears to be an inconsistency between the iSER header used in the kernel and the latest iSER draft (draft-ietf-ips-iser-06.txt): struct iser_hdr { u8 flags; u8 rsvd[3]; __be32 write_stag; /* write rkey */ __be64 write_va; <------------------------------ __be32 read_stag; /* read rkey */ __be64 read_va; <------------------------------ } __attribute__((packed)); The two fields `write_va' and `read_va' seem to be extra fields that are not defined by the draft. Won't these fields present interoperability issues with conformant iSER implementations? Any information would be greatly appreciated.
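[Editorial sketch] To make the size of the two extra fields concrete, the struct quoted above can be mirrored in user space. This is only a sketch under stated assumptions: `struct iser_hdr_sketch` and `iser_hdr_size_without_va()` are hypothetical names, the fixed-width `stdint.h` types stand in for the kernel's `u8`/`__be32`/`__be64`, byte order is ignored, and the `packed` attribute assumes GCC or Clang. It says nothing about what the draft itself specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * User-space stand-in for the struct quoted in the message above.
 * The name and the stdint types are assumptions made for illustration;
 * only the field order and widths are taken from the quoted code.
 */
struct iser_hdr_sketch {
	uint8_t  flags;
	uint8_t  rsvd[3];
	uint32_t write_stag;  /* write rkey */
	uint64_t write_va;    /* field questioned in the message */
	uint32_t read_stag;   /* read rkey */
	uint64_t read_va;     /* field questioned in the message */
} __attribute__((packed));

/* With packing, the fields add up to 1 + 3 + 4 + 8 + 4 + 8 bytes. */
_Static_assert(sizeof(struct iser_hdr_sketch) == 28,
	       "packed header sketch should occupy 28 bytes");

/* Hypothetical helper: size the header would have without the two VA fields. */
static size_t iser_hdr_size_without_va(void)
{
	return sizeof(struct iser_hdr_sketch) - 2 * sizeof(uint64_t);
}
```

Under these assumptions the packed layout is 28 bytes, and it would be 12 bytes without `write_va` and `read_va`, so a peer expecting the shorter form would find the read STag at a different offset (16 here, not 8).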
Ethan Burns From rdreier at cisco.com Mon Jul 9 07:48:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 07:48:13 -0700 Subject: [ofa-general] uDAPL Question In-Reply-To: <469245BC.8040108@Sun.COM> (Don Kerr's message of "Mon, 09 Jul 2007 10:27:08 -0400") References: <469245BC.8040108@Sun.COM> Message-ID: > Question: Is it possible to determine if an HCA is down intentionally > from the uDAPL API? What do you mean by an HCA being "up" or "down"? And what would it mean for it to be down "intentionally"? - R. From Don.Kerr at Sun.COM Mon Jul 9 08:04:16 2007 From: Don.Kerr at Sun.COM (Don Kerr) Date: Mon, 09 Jul 2007 11:04:16 -0400 Subject: [ofa-general] uDAPL Question In-Reply-To: References: <469245BC.8040108@Sun.COM> Message-ID: <46924E70.2040205@Sun.COM> Sorry. I was wrongly lumping port and HCA together. I have 2 HCA cards, each with 2 ports, but only one port on one card is operational; by that I mean it can be pinged or is seen as "UP" when you run ifconfig. But both are still listed in the dat.conf. -DON Roland Dreier wrote: > > Question: Is it possible to determine if an HCA is down intentionally > > from the uDAPL API? > >What do you mean by an HCA being "up" or "down"? > >And what would it mean for it to be down "intentionally"? > > - R.
From rdreier at cisco.com Mon Jul 9 08:28:55 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:28:55 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709091210.GP3182@rhun.haifa.ibm.com> (Muli Ben-Yehuda's message of "Mon, 9 Jul 2007 12:12:10 +0300")
References: <20070704125429.GL3885@ics.muni.cz>
 <20070705193136.GQ3885@ics.muni.cz>
 <20070707085303.GS3885@ics.muni.cz>
 <20070708001531.GT3885@ics.muni.cz>
 <20070709090802.GA3885@ics.muni.cz>
 <20070709091210.GP3182@rhun.haifa.ibm.com>
Message-ID:

> > according to Xen-dev alloc_pages does *not* guarantee contiguous
> > pages. They say that the pci_alloc_consistent should be used
> > instead. The question is whether non-Xen kernel *usually* allocates
> > contiguous pages and so far it has been working and whether it
> > should be fixed in the mainline of the driver.
> >
> > I do some tests (and also try to figure out how to change
> > alloc_pages to pci_alloc_consistent) to verify contiguous pages.
>
> You missed an important bit of Keir's response---it's perfectly fine
> to use alloc_pages provided you then use the dma_map_single API, which
> for Xen dom0 will take care of bounce-buffering to a
> machine-contiguous buffer if necessary. I am not sure if the same
> holds for a domU kernel.

I guess there was a mail thread that I wasn't copied on (I don't read
any Xen mailing lists).

Anyway, what mthca does is the following. It wants to give a bunch of
system memory (megabytes) to the hardware for the hardware to use for
its internal context. The hardware accesses this memory via PCI DMA
of course.
So what mthca does is:

 - Allocate large chunks of system memory using
   alloc_pages(GFP_HIGHUSER, order) with order > 0
 - Build up an array of struct scatterlist where each entry is one of
   the order >0 pages allocated as above
 - Map that scatterlist with pci_map_sg(..., PCI_DMA_BIDIRECTIONAL)
 - Pass the DMA addresses returned from that to the hardware

As far as I can see, what mthca is doing is perfectly fine as far as
the DMA mapping API is concerned. If Xen is returning non-contiguous
memory from alloc_pages() and then allocating bounce buffers in
pci_map_sg() then that should work (although it will be somewhat
inefficient, since the original memory will never actually be used).
However, I would confirm that this is what Xen is really trying to do,
and also that the code is working as intended when the scatterlist has
entries with pages of order >0.

As a side note, mthca could use dma_alloc_coherent() to allocate this
hardware memory, but that would be inefficient on 32-bit systems,
because it would use up kernel address space for memory that will only
be touched by the hardware. So that's why it allocates pages with
GFP_HIGHUSER instead.

 - R.

From rdreier at cisco.com Mon Jul 9 08:30:06 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:30:06 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709133743.GK3885@ics.muni.cz> (Lukas Hejtmanek's message of "Mon, 9 Jul 2007 15:37:43 +0200")
References: <20070704125429.GL3885@ics.muni.cz>
 <20070705193136.GQ3885@ics.muni.cz>
 <20070707085303.GS3885@ics.muni.cz>
 <20070708001531.GT3885@ics.muni.cz>
 <20070709133743.GK3885@ics.muni.cz>
Message-ID:

> ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000
> ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000

And what do you get back from pci_map_sg() for this order >0 page?
Unfortunately I guess there's no way to see if the pci_map_sg()
implementation has taken into account the full size of the scatterlist
entry.

 - R.

From xhejtman at ics.muni.cz Mon Jul 9 08:37:15 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 17:37:15 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To:
References: <20070704125429.GL3885@ics.muni.cz>
 <20070705193136.GQ3885@ics.muni.cz>
 <20070707085303.GS3885@ics.muni.cz>
 <20070708001531.GT3885@ics.muni.cz>
 <20070709133743.GK3885@ics.muni.cz>
Message-ID: <20070709153715.GA6496@ics.muni.cz>

On Mon, Jul 09, 2007 at 08:30:06AM -0700, Roland Dreier wrote:
> > ib_mthca 0000:08:00.0: Page phys. addr 0000000026695000, virt ffff880098b00000
> > ib_mthca 0000:08:00.0: Page phys. addr 0000000026694000, virt ffff880098b01000
>
> And what do you get back from pci_map_sg() for this order >0 page?
>
> Unfortunately I guess there's no way to see if the pci_map_sg()
> implementation has taken into account the full size of the scatterlist
> entry.

Well, using swiotlb=force (which turns on the bounce buffers) I do not get
oops any more. On the other hand, I got some oopses in memcpy of the bounce
buffers, which I am trying to solve with the Xen developers.

--
Lukáš Hejtmánek

From rdreier at cisco.com Mon Jul 9 08:55:07 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 08:55:07 -0700
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To: <20070709153715.GA6496@ics.muni.cz> (Lukas Hejtmanek's message of "Mon, 9 Jul 2007 17:37:15 +0200")
References: <20070704125429.GL3885@ics.muni.cz>
 <20070705193136.GQ3885@ics.muni.cz>
 <20070707085303.GS3885@ics.muni.cz>
 <20070708001531.GT3885@ics.muni.cz>
 <20070709133743.GK3885@ics.muni.cz>
 <20070709153715.GA6496@ics.muni.cz>
Message-ID:

> Well, using swiotlb=force (which turns on the bounce buffers) I do not get
> oops any more.
On the other hand, I got some oopses in memcpy of the bounce
> buffers, which I am trying to solve with the Xen developers.

So it seems there is a problem with the normal Xen PCI mapping API
then. It would be better to avoid bounce buffers for this if
possible, because as I said that would double the memory consumption
and potentially exhaust your swiotlb space (because this hardware
context memory is not used for "in-flight" IOs, it is essentially
given to the hardware permanently).

Also, could you please CC me on any threads with the Xen developers?
It's kind of annoying to only get half of the story about what's going
on with debugging this.

Thanks,
  Roland

From rdreier at cisco.com Mon Jul 9 09:10:47 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 09:10:47 -0700
Subject: [ofa-general] Re: [PATCH] mlx4: add device reset to Internal Error handling mechanism
In-Reply-To: <200707091012.52418.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 9 Jul 2007 10:12:52 +0300")
References: <200707091012.52418.jackm@dev.mellanox.co.il>
Message-ID:

> This patch also disables the detection of Internal Errors via a device
> interrupt, because we wish to avoid the complexity of supporting
> two independent detection mechanisms.

OK, but...

> static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr)
> {
> - mlx4_handle_catas_err(dev_ptr);
> + /* disable handling catas errors via interrupt. */
> + /* We now handle them via polling. */
> + /* mlx4_handle_catas_err(dev_ptr); */

Why not just delete all the interrupt stuff completely?

For

> + mod_timer(&priv->catas_err.timer,
> + jiffies + MLX4_CATAS_POLL_INTERVAL);

and

> + priv->catas_err.timer.expires = jiffies + MLX4_CATAS_POLL_INTERVAL;

how about round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL) instead?

 - R.
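
For readers unfamiliar with the helper suggested above: round_jiffies_relative() adjusts a relative timeout so that the timer expires on a whole-second boundary, which lets timers that do not need precise expiry fire together. What follows is a simplified userspace sketch of that rounding, not the kernel implementation. It assumes HZ is 1000, takes the current tick count as a parameter instead of reading the kernel's jiffies, and leaves out the per-CPU skew the kernel applies. The name sketch_round_relative and the SKETCH_HZ constant are hypothetical.

```c
#include <stdint.h>

#define SKETCH_HZ 1000UL	/* assumed ticks per second */

/*
 * Return a relative timeout close to 'delta' such that now + result
 * falls on a whole-second boundary. Targets in the first quarter of a
 * second are rounded down, all others are rounded up.
 */
unsigned long sketch_round_relative(unsigned long now, unsigned long delta)
{
	unsigned long target = now + delta;
	unsigned long rem = target % SKETCH_HZ;
	unsigned long rounded;

	if (rem < SKETCH_HZ / 4)
		rounded = target - rem;			/* round down */
	else
		rounded = target - rem + SKETCH_HZ;	/* round up */

	/* If rounding down consumed the whole timeout, keep the original. */
	if (rounded <= now)
		return delta;

	return rounded - now;
}
```

With a polling interval of several seconds, as in a catastrophic-error poll, the adjustment is under one second either way, which is why rounding is attractive for timers of this kind.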
From halr at voltaire.com Mon Jul 9 09:18:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Jul 2007 12:18:15 -0400 Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports In-Reply-To: <1183988571.25217.377395.camel@hal.voltaire.com> References: <1183640246.4377.436639.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> <1183988571.25217.377395.camel@hal.voltaire.com> Message-ID: <1183997893.25217.388186.camel@hal.voltaire.com> Hi again Amit, On Mon, 2007-07-09 at 09:42, Hal Rosenstock wrote: > Hi Amit, > > On Mon, 2007-07-09 at 09:27, Amit Krig wrote: > > Hi Hal, > > > > In such case OpenSM should first check that the OPVL fields of the ports > > (the one that sends the traps and its peer) are identical, > > If you have a mismatch in the OPVL field, the link watchdog mechanism > > will retrain the logical link in high rate > > OpVLs only takes "effect" if set after link active only if the link is > bounced (not if it stays active). Not sure about what I wrote above. p.829 states that in certain PortStates this may cause flow control update errors (and initiate Link/Phy retraining). > Also and more significantly, in terms of the specific issue, the peer > SMA is often non responsive or shortly becomes non responsive so the > peer OpVLs cannot readily be verified post this being detected. This as well as the trap rate are the issues, perhaps second level but none the less issues. -- Hal > -- Hal > > > Amit > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, July 05, 2007 3:58 PM > > To: general at lists.openfabrics.org > > Cc: Eitan Zahavi; Yevgeny Kliteynik > > Subject: [PATCH] OpenSM handling of "Babbling" Ports > > > > A "babbling" port is a port which causes traps to be generated > > frequently. 
> > It may directly be "this" port which generates the traps or the peer > > port detecting the issue and that the SMA on switch port 0 generates the > > traps. > > This has only currently been observed for trap 131 but will also apply > > for traps 129 and 130 as well which are other urgent and similar traps. > > > > Note that there appears to be a bug in Mellanox firmware for both > > Anafa-2 and Tavor at a minimum which causes the max trap rate not to be > > adhered to and relief for this does not appear to be in short term > > sight. > > > > Policy > > When a babbling port is detected, OpenSM will disable the port or its > > peer switch port (depending on which trap) which should terminate the > > trap storm. > > > > Detection > > 250 consecutive traps of this type will be used as the (initial) > > threshold. The reason for this is so as to not prematurely detect this > > and disable a port. > > > > Recovery > > Admin would reenable port when OK again. (This usually involves > > rebooting the node causing the trap to be indicated.) > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/opensm/include/opensm/osm_subnet.h > > b/opensm/include/opensm/osm_subnet.h > > index bedd63f..1150703 100644 > > --- a/opensm/include/opensm/osm_subnet.h > > +++ b/opensm/include/opensm/osm_subnet.h > > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt > > boolean_t honor_guid2lid_file; > > boolean_t daemon; > > boolean_t sm_inactive; > > + boolean_t babbling_port_policy; > > osm_qos_options_t qos_options; > > osm_qos_options_t qos_ca_options; > > osm_qos_options_t qos_sw0_options; > > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt > > * > > * sm_inactive > > * OpenSM will start with SM in not active state. > > +* > > +* babbling_port_policy > > +* OpenSM will enforce its "babbling" port policy.
> > * > > * perfmgr > > * Enable or disable the performance manager > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index 726b665..87b71e5 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -472,6 +472,7 @@ osm_subn_set_default_opt( > > p_opt->honor_guid2lid_file = FALSE; > > p_opt->daemon = FALSE; > > p_opt->sm_inactive = FALSE; > > + p_opt->babbling_port_policy = FALSE; > > #ifdef ENABLE_OSM_PERF_MGR > > p_opt->perfmgr = FALSE; > > p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ > > -1358,6 +1359,10 @@ osm_subn_parse_conf_file( > > "sm_inactive", > > p_key, p_val, &p_opts->sm_inactive); > > > > + __osm_subn_opts_unpack_boolean( > > + "babbling_port_policy", > > + p_key, p_val, &p_opts->babbling_port_policy); > > + > > #ifdef ENABLE_OSM_PERF_MGR > > __osm_subn_opts_unpack_boolean( > > "perfmgr", > > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( > > "# Daemon mode\n" > > "daemon %s\n\n" > > "# SM Inactive\n" > > - "sm_inactive %s\n\n", > > + "sm_inactive %s\n\n" > > + "# Babbling Port Policy\n" > > + "babbling_port_policy %s\n\n", > > p_opts->daemon ? "TRUE" : "FALSE", > > - p_opts->sm_inactive ? "TRUE" : "FALSE" > > + p_opts->sm_inactive ? "TRUE" : "FALSE", > > + p_opts->babbling_port_policy ? "TRUE" : "FALSE" > > ); > > > > #ifdef ENABLE_OSM_PERF_MGR > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > > index 5900c51..fbb6dac 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -1,5 +1,5 @@ > > /* > > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> > * > > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( > > } > > else > > { > > + /* When babbling port policy option is enabled and > > + Threshold for disabling a "babbling" port is exceeded */ > > + if ( p_rcv->p_subn->opt.babbling_port_policy && > > + num_received >= 250 ) > > + { > > + uint8_t payload[IB_SMP_DATA_SIZE]; > > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > + const ib_port_info_t* p_old_pi; > > + osm_madw_context_t context; > > + > > + /* If trap 131, might want to disable peer port if > > available */ > > + /* but peer port has been observed not to respond to SM > > + requests */ > > + > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3810: " > > + " Disabling physical port lid:0x%02X num:%u\n", > > + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), > > + p_ntci->data_details.ntc_129_131.port_num > > + ); > > + > > + p_old_pi = &p_physp->port_info; > > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > + > > + /* Set port to disabled/down */ > > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > + ib_port_info_set_port_phys_state( > > + IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > + > > + context.pi_context.node_guid = osm_node_get_node_guid( > > osm_physp_get_node_ptr( p_physp ) ); > > + context.pi_context.port_guid = osm_physp_get_port_guid( > > p_physp ); > > + context.pi_context.set_method = TRUE; > > + context.pi_context.update_master_sm_base_lid = FALSE; > > + context.pi_context.light_sweep = FALSE; > > + context.pi_context.active_transition = FALSE; > > + > > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > > + osm_physp_get_dr_path_ptr( p_physp > > ), > > + payload, > > + sizeof(payload), > > + IB_MAD_ATTR_PORT_INFO, > > + cl_hton32(osm_physp_get_port_num( > > p_physp )), > > + CL_DISP_MSGID_NONE, > > + &context ); > > + > > + if( status == IB_SUCCESS ) > > + { > > + goto Exit; > > + } > > + else > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + 
"__osm_trap_rcv_process_request: ERR 3811: " > > + "Request to set PortInfo failed\n" ); > > + } > > + } > > + > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "__osm_trap_rcv_process_request: " > > "Marking unhealthy physical port by lid:0x%02X > > num:%u\n", > > > > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From amitk at mellanox.co.il Mon Jul 9 09:40:14 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Mon, 9 Jul 2007 19:40:14 +0300 Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports References: <1183640246.4377.436639.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> <1183988571.25217.377395.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com> Hi Hal I was only talking on logical link == Active state. In this state the watchdog can bring the physical link to recovery state while the logical link will bounce between Active and ActiveDefer. 
Regarding the responsive issue, OpenSM in this scenario should move the logical link in the responsive side to Init state that way the watchdog will stop bringing down the link and then do the checks Amit -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, July 09, 2007 4:43 PM To: Amit Krig Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports Hi Amit, On Mon, 2007-07-09 at 09:27, Amit Krig wrote: > Hi Hal, > > In such case OpenSM should first check that the OPVL fields of the > ports (the one that sends the traps and its peer) are identical, If > you have a mismatch in the OPVL field, the link watchdog mechanism > will retrain the logical link in high rate OpVLs only takes "effect" if set after link active only if the link is bounced (not if it stays active). Also and more significantly, in terms of the specific issue, the peer SMA is often non responsive or shortly becomes non responsive so the peer OpVLs cannot readily be verified post this being detected. -- Hal > Amit > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, July 05, 2007 3:58 PM > To: general at lists.openfabrics.org > Cc: Eitan Zahavi; Yevgeny Kliteynik > Subject: [PATCH] OpenSM handling of "Babbling" Ports > > A "babbling" port is a port which causes traps to be generated > frequently. > It may directly be "this" port which generates the traps or the peer > port detecting the issue and that the SMA on switch port 0 generates > the traps. > This has only currently been observed for trap 131 but will also apply > for traps 129 and 130 as well which are other urgent and similar traps. > > Note that there appears to be a bug in Mellanox firmware for both > Anafa-2 and Tavor at a minimum which causes the max trap rate not to > be adhered to and relief for this does not appear to be in short term > sight. 
> > Policy > When a babbling port is detected, OpenSM will disable the port or its > peer switch port (depending on which trap) which should terminate the > trap storm. > > Detection > 250 consecutive traps of this type will be used as the (initial) > threshold. The reason for this is so as to not prematurely detect this > and disable a port. > > Recovery > Admin would reenable port when OK again. (This usually involves > rebooting the node causing the trap to be indicated.) > > Signed-off-by: Hal Rosenstock > > diff --git a/opensm/include/opensm/osm_subnet.h > b/opensm/include/opensm/osm_subnet.h > index bedd63f..1150703 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt > boolean_t honor_guid2lid_file; > boolean_t daemon; > boolean_t sm_inactive; > + boolean_t babbling_port_policy; > osm_qos_options_t qos_options; > osm_qos_options_t qos_ca_options; > osm_qos_options_t qos_sw0_options; > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt > * > * sm_inactive > * OpenSM will start with SM in not active state. > +* > +* babbling_port_policy > +* OpenSM will enforce its "babbling" port policy.
> * > * perfmgr > * Enable or disable the performance manager > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 726b665..87b71e5 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -472,6 +472,7 @@ osm_subn_set_default_opt( > p_opt->honor_guid2lid_file = FALSE; > p_opt->daemon = FALSE; > p_opt->sm_inactive = FALSE; > + p_opt->babbling_port_policy = FALSE; > #ifdef ENABLE_OSM_PERF_MGR > p_opt->perfmgr = FALSE; > p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ > -1358,6 +1359,10 @@ osm_subn_parse_conf_file( > "sm_inactive", > p_key, p_val, &p_opts->sm_inactive); > > + __osm_subn_opts_unpack_boolean( > + "babbling_port_policy", > + p_key, p_val, &p_opts->babbling_port_policy); > + > #ifdef ENABLE_OSM_PERF_MGR > __osm_subn_opts_unpack_boolean( > "perfmgr", > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( > "# Daemon mode\n" > "daemon %s\n\n" > "# SM Inactive\n" > - "sm_inactive %s\n\n", > + "sm_inactive %s\n\n" > + "# Babbling Port Policy\n" > + "babbling_port_policy %s\n\n", > p_opts->daemon ? "TRUE" : "FALSE", > - p_opts->sm_inactive ? "TRUE" : "FALSE" > + p_opts->sm_inactive ? "TRUE" : "FALSE", > + p_opts->babbling_port_policy ? "TRUE" : "FALSE" > ); > > #ifdef ENABLE_OSM_PERF_MGR > diff --git a/opensm/opensm/osm_trap_rcv.c > b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( > } > else > { > + /* When babbling port policy option is enabled and > + Threshold for disabling a "babbling" port is exceeded */ > + if ( p_rcv->p_subn->opt.babbling_port_policy && > + num_received >= 250 ) > + { > + uint8_t payload[IB_SMP_DATA_SIZE]; > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > + const ib_port_info_t* p_old_pi; > + osm_madw_context_t context; > + > + /* If trap 131, might want to disable peer port if > available */ > + /* but peer port has been observed not to respond to SM > + requests */ > + > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3810: " > + " Disabling physical port lid:0x%02X num:%u\n", > + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), > + p_ntci->data_details.ntc_129_131.port_num > + ); > + > + p_old_pi = &p_physp->port_info; > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > + > + /* Set port to disabled/down */ > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > + ib_port_info_set_port_phys_state( > + IB_PORT_PHYS_STATE_DISABLED, p_pi ); > + > + context.pi_context.node_guid = osm_node_get_node_guid( > osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.port_guid = osm_physp_get_port_guid( > p_physp ); > + context.pi_context.set_method = TRUE; > + context.pi_context.update_master_sm_base_lid = FALSE; > + context.pi_context.light_sweep = FALSE; > + context.pi_context.active_transition = FALSE; > + > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > + osm_physp_get_dr_path_ptr( p_physp > ), > + payload, > + sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32(osm_physp_get_port_num( > p_physp )), > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status == IB_SUCCESS ) > + { > + goto Exit; > + } > + else > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3811: " > + "Request to set PortInfo failed\n" ); > + } > + } > + > osm_log( p_rcv->p_log, 
OSM_LOG_VERBOSE,
> "__osm_trap_rcv_process_request: "
> "Marking unhealthy physical port by lid:0x%02X
> num:%u\n",
>
>
>
>

From xhejtman at ics.muni.cz Mon Jul 9 09:53:23 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 9 Jul 2007 18:53:23 +0200
Subject: [ofa-general] Re: InfiniBand card (mthca) in Linux
In-Reply-To:
References: <20070705193136.GQ3885@ics.muni.cz>
 <20070707085303.GS3885@ics.muni.cz>
 <20070708001531.GT3885@ics.muni.cz>
 <20070709133743.GK3885@ics.muni.cz>
 <20070709153715.GA6496@ics.muni.cz>
Message-ID: <20070709165323.GN3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 08:55:07AM -0700, Roland Dreier wrote:
> So it seems there is a problem with the normal Xen PCI mapping API
> then. It would be better to avoid bounce buffers for this if
> possible, because as I said that would double the memory consumption
> and potentially exhaust your swiotlb space (because this hardware
> context memory is not used for "in-flight" IOs, it is essentially
> given to the hardware permanently).
>
> Also, could you please CC me on any threads with the Xen developers?
> It's kind of annoying to only get half of the story about what's going
> on with debugging this.

Sorry for that. The beginning of the thread is archived here:
http://lists.xensource.com/archives/html/xen-devel/2007-07/msg00209.html

The last two posts are missing there, though. If you would like the
whole thread bounced to you, I can do that.

Right now, I have a problem in :ib_mthca:mthca_arbel_write_mtt_seg, where
dma_sync_single is called and swiotlb does not have a corresponding mapping.
--
Lukáš Hejtmánek

From vuhuong at mellanox.com Mon Jul 9 09:55:04 2007
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 09 Jul 2007 09:55:04 -0700
Subject: [ofa-general] Compiling SRPT
In-Reply-To: <1183852853.6008.11.camel@gentoo-linux.localdomain>
References: <1183852853.6008.11.camel@gentoo-linux.localdomain>
Message-ID: <46926868.8000704@mellanox.com>

Stanley Sufficool wrote:
> Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch
>
> Got the latest srpt from the git repository on OpenFabrics and had the
> following issues.
>
> ib_srpt.c Line 1997, missing second argument, should be?
> sdev->scst_tgt = scst_register(tp, NULL);
>

Yes. You need the change if you test with the top of the scst svn trunk
(or any version from 0.9.6-pre2 on).

If you test with scst before 0.9.6-pre2 (i.e. version <= 0.9.6-pre1),
you don't need the second argument for scst_register().

> SCST was built successfully after fixing an issue in scst_vdisk.c
> (missing #include )

I tested with 2.6.20.x; I have not tested with 2.6.21-rcXX. You should
send the patch to scst devel.

>
> Just thought this would be nice to have documented, took me half a day
> to track down as a novice in C programming.
>

There is a *lean and mean* srpt README in srpt_inc.
SCST also has some documentation.

You can add wiki notes about these problems on the OpenFabrics wiki page
https://wiki.openfabrics.org/tiki-index.php

-vu

From halr at voltaire.com Mon Jul 9 09:57:47 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Jul 2007 12:57:47 -0400
Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>
References: <1183640246.4377.436639.camel@hal.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com>
 <1183988571.25217.377395.camel@hal.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com>
Message-ID: <1184000266.25217.390914.camel@hal.voltaire.com>

Hi Amit,

On Mon, 2007-07-09 at 12:40, Amit Krig wrote:
> Hi Hal
>
> I was only talking on logical link == Active state.
> In this state the watchdog can bring the physical link to recovery state
> while the logical link will bounce between Active and ActiveDefer.

OK; I follow this but I'm not sure what you are saying about "applying"
it to the patch in question.

> Regarding the responsive issue, OpenSM in this scenario should move the
> logical link in the responsive side to Init state

rather than disabling it on some threshold. What about the other similar
traps 129 and 130? How should they be handled?

> that way the watchdog will stop bringing down the link and then do the checks

I think the checks will still fail but this seems like it would stop the
traps from being generated (so fast).
-- Hal > Amit > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, July 09, 2007 4:43 PM > To: Amit Krig > Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik > Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports > > Hi Amit, > > On Mon, 2007-07-09 at 09:27, Amit Krig wrote: > > Hi Hal, > > > > In such case OpenSM should first check that the OPVL fields of the > > ports (the one that sends the traps and its peer) are identical, If > > you have a mismatch in the OPVL field, the link watchdog mechanism > > will retrain the logical link in high rate > > OpVLs only takes "effect" if set after link active only if the link is > bounced (not if it stays active). > > Also and more significantly, in terms of the specific issue, the peer > SMA is often non responsive or shortly becomes non responsive so the > peer OpVLs cannot readily be verified post this being detected. > > -- Hal > > > Amit > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, July 05, 2007 3:58 PM > > To: general at lists.openfabrics.org > > Cc: Eitan Zahavi; Yevgeny Kliteynik > > Subject: [PATCH] OpenSM handling of "Babbling" Ports > > > > A "babbling" port is a port which causes traps to be generated > > frequently. > > It may directly be "this" port which generates the traps or the peer > > port detecting the issue and that the SMA on switch port 0 generates > > the traps. > > This has only currently been observed for trap 131 but will also apply > > > for traps 129 and 130 as well which are other urgent and similar > traps. > > > > Note that there appears to be a bug in Mellanox firmware for both > > Anafa-2 and Tavor at a minimum which causes the max trap rate not to > > be adhered to and relief for this does not appear to be in short term > > sight. 
> > > > Policy > > When a babbling port is detected, OpenSM will disable the port or its > > peer switch port (depending on which trap) which should terminate the > > trap storm. > > > > Detection > > 250 consecutive traps of this type will be used as the (initial) > > threshold. The reason for this is so as to not prematurely detect this > > > and disable a port. > > > > Recovery > > Admin would reenable port when OK again. (This usually involves > > rebooting the node causing the trap to be indicated.) > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/opensm/include/opensm/osm_subnet.h > > b/opensm/include/opensm/osm_subnet.h > > index bedd63f..1150703 100644 > > --- a/opensm/include/opensm/osm_subnet.h > > +++ b/opensm/include/opensm/osm_subnet.h > > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt > > boolean_t honor_guid2lid_file; > > boolean_t daemon; > > boolean_t sm_inactive; > > + boolean_t babbling_port_policy; > > osm_qos_options_t qos_options; > > osm_qos_options_t qos_ca_options; > > osm_qos_options_t qos_sw0_options; > > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt > > * > > * sm_inactive > > * OpenSM will start with SM in not active state. > > +* > > +* babbling_port_policy > > +* OpenSM will enforce its "babbling" port policy.
> > * > > * perfmgr > > * Enable or disable the performance manager > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index 726b665..87b71e5 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -472,6 +472,7 @@ osm_subn_set_default_opt( > > p_opt->honor_guid2lid_file = FALSE; > > p_opt->daemon = FALSE; > > p_opt->sm_inactive = FALSE; > > + p_opt->babbling_port_policy = FALSE; > > #ifdef ENABLE_OSM_PERF_MGR > > p_opt->perfmgr = FALSE; > > p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; @@ > > -1358,6 +1359,10 @@ osm_subn_parse_conf_file( > > "sm_inactive", > > p_key, p_val, &p_opts->sm_inactive); > > > > + __osm_subn_opts_unpack_boolean( > > + "babbling_port_policy", > > + p_key, p_val, &p_opts->babbling_port_policy); > > + > > #ifdef ENABLE_OSM_PERF_MGR > > __osm_subn_opts_unpack_boolean( > > "perfmgr", > > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( > > "# Daemon mode\n" > > "daemon %s\n\n" > > "# SM Inactive\n" > > - "sm_inactive %s\n\n", > > + "sm_inactive %s\n\n" > > + "# Babbling Port Policy\n" > > + "babbling_port_policy %s\n\n", > > p_opts->daemon ? "TRUE" : "FALSE", > > - p_opts->sm_inactive ? "TRUE" : "FALSE" > > + p_opts->sm_inactive ? "TRUE" : "FALSE", > > + p_opts->babbling_port_policy ? "TRUE" : "FALSE" > > ); > > > > #ifdef ENABLE_OSM_PERF_MGR > > diff --git a/opensm/opensm/osm_trap_rcv.c > > b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -1,5 +1,5 @@ > > /* > > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> > * > > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( > > } > > else > > { > > + /* When babbling port policy option is enabled and > > + Threshold for disabling a "babbling" port is exceeded */ > > + if ( p_rcv->p_subn->opt.babbling_port_policy && > > + num_received >= 250 ) > > + { > > + uint8_t payload[IB_SMP_DATA_SIZE]; > > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > + const ib_port_info_t* p_old_pi; > > + osm_madw_context_t context; > > + > > + /* If trap 131, might want to disable peer port if > > available */ > > + /* but peer port has been observed not to respond to SM > > + requests */ > > + > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3810: " > > + " Disabling physical port lid:0x%02X num:%u\n", > > + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), > > + p_ntci->data_details.ntc_129_131.port_num > > + ); > > + > > + p_old_pi = &p_physp->port_info; > > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > + > > + /* Set port to disabled/down */ > > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > + ib_port_info_set_port_phys_state( > > + IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > + > > + context.pi_context.node_guid = osm_node_get_node_guid( > > osm_physp_get_node_ptr( p_physp ) ); > > + context.pi_context.port_guid = osm_physp_get_port_guid( > > p_physp ); > > + context.pi_context.set_method = TRUE; > > + context.pi_context.update_master_sm_base_lid = FALSE; > > + context.pi_context.light_sweep = FALSE; > > + context.pi_context.active_transition = FALSE; > > + > > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > > + osm_physp_get_dr_path_ptr( p_physp > > ), > > + payload, > > + sizeof(payload), > > + IB_MAD_ATTR_PORT_INFO, > > + cl_hton32(osm_physp_get_port_num( > > p_physp )), > > + CL_DISP_MSGID_NONE, > > + &context ); > > + > > + if( status == IB_SUCCESS ) > > + { > > + goto Exit; > > + } > > + else > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + 
"__osm_trap_rcv_process_request: ERR 3811: " > > + "Request to set PortInfo failed\n" ); > > + } > > + } > > + > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "__osm_trap_rcv_process_request: " > > "Marking unhealthy physical port by lid:0x%02X > > num:%u\n", > > > > > > > > > From amitk at mellanox.co.il Mon Jul 9 10:07:06 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Mon, 9 Jul 2007 20:07:06 +0300 Subject: [ofa-general] RE: [PATCH] OpenSM handling of "Babbling" Ports References: <1183640246.4377.436639.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE111F@mtlexch01.mtl.com> <1183988571.25217.377395.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE1121@mtlexch01.mtl.com> <1184000266.25217.390914.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE112A@mtlexch01.mtl.com> I mean that if you still get the traps in high rate (After verifying the OPVL) than you should consider disabling the link -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, July 09, 2007 7:58 PM To: Amit Krig Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports Hi Amit, On Mon, 2007-07-09 at 12:40, Amit Krig wrote: > Hi Hal > > I was only talking on logical link == Active state. > In this state the watchdog can bring the physical link to recovery > state while the logical link will bounce between Active and ActiveDefer. OK; I follow this but I'm not sure what you are saying about "applying" it to the patch in question. > Regarding the responsive issue, OpenSM in this scenario should move > the logical link in the responsive side to Init state rather than disabling it on some threshold. What about the other similar traps 129 and 130 ? How should they be handled ? 
> that way the watchdog will stop bringing down the link and then do the > checks I think the checks will still fail but this seems like it would stop the traps from being generated (so fast). -- Hal > Amit > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, July 09, 2007 4:43 PM > To: Amit Krig > Cc: general at lists.openfabrics.org; Eitan Zahavi; Yevgeny Kliteynik > Subject: RE: [PATCH] OpenSM handling of "Babbling" Ports > > Hi Amit, > > On Mon, 2007-07-09 at 09:27, Amit Krig wrote: > > Hi Hal, > > > > In such case OpenSM should first check that the OPVL fields of the > > ports (the one that sends the traps and its peer) are identical, If > > you have a mismatch in the OPVL field, the link watchdog mechanism > > will retrain the logical link in high rate > > OpVLs only takes "effect" if set after link active only if the link is > bounced (not if it stays active). > > Also and more significantly, in terms of the specific issue, the peer > SMA is often non responsive or shortly becomes non responsive so the > peer OpVLs cannot readily be verified post this being detected. > > -- Hal > > > Amit > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, July 05, 2007 3:58 PM > > To: general at lists.openfabrics.org > > Cc: Eitan Zahavi; Yevgeny Kliteynik > > Subject: [PATCH] OpenSM handling of "Babbling" Ports > > > > A "babbling" port is a port which causes traps to be generated > > frequently. > > It may directly be "this" port which generates the traps or the peer > > port detecting the issue and that the SMA on switch port 0 generates > > the traps. > > This has only currently been observed for trap 131 but will also > > apply > > > for traps 129 and 130 as well which are other urgent and similar > traps. 
> > > > Note that there appears to be a bug in Mellanox firmware for both > > Anafa-2 and Tavor at a minimum which causes the max trap rate not to > > be adhered to and relief for this does not appear to be in short > > term sight. > > > > Policy > > When a bablbing port is detected, OpenSM will disable the port or > > its peer switch port (depending on which trap) which should > > terminate the trap storm. > > > > Detection > > 250 consecutive traps of this type will be used as the (initial) > > threshold. The reason for this is so as to not prematurely detect > > this > > > and disable a port. > > > > Recovery > > Admin would reenable port when OK again. (This usually involves > > rebooting the node causing the trap to be indicated.) > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/opensm/include/opensm/osm_subnet.h > > b/opensm/include/opensm/osm_subnet.h > > index bedd63f..1150703 100644 > > --- a/opensm/include/opensm/osm_subnet.h > > +++ b/opensm/include/opensm/osm_subnet.h > > @@ -286,6 +286,7 @@ typedef struct _osm_subn_opt > > boolean_t honor_guid2lid_file; > > boolean_t daemon; > > boolean_t sm_inactive; > > + boolean_t babbling_port_policy; > > osm_qos_options_t qos_options; > > osm_qos_options_t qos_ca_options; > > osm_qos_options_t qos_sw0_options; > > @@ -487,6 +488,9 @@ typedef struct _osm_subn_opt > > * > > * sm_inactive > > * OpenSM will start with SM in not active state. > > +* > > +* babbling_port_policy > > +* OpenSM will enforce its "babbling" port policy. 
> > * > > * perfmgr > > * Enable or disable the performance manager > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index 726b665..87b71e5 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -472,6 +472,7 @@ osm_subn_set_default_opt( > > p_opt->honor_guid2lid_file = FALSE; > > p_opt->daemon = FALSE; > > p_opt->sm_inactive = FALSE; > > + p_opt->babbling_port_policy = FALSE; > > #ifdef ENABLE_OSM_PERF_MGR > > p_opt->perfmgr = FALSE; > > p_opt->perfmgr_sweep_time_s = OSM_PERFMGR_DEFAULT_SWEEP_TIME_S; > > @@ > > -1358,6 +1359,10 @@ osm_subn_parse_conf_file( > > "sm_inactive", > > p_key, p_val, &p_opts->sm_inactive); > > > > + __osm_subn_opts_unpack_boolean( > > + "babbling_port_policy", > > + p_key, p_val, &p_opts->babbling_port_policy); > > + > > #ifdef ENABLE_OSM_PERF_MGR > > __osm_subn_opts_unpack_boolean( > > "perfmgr", > > @@ -1631,9 +1636,12 @@ osm_subn_write_conf_file( > > "# Daemon mode\n" > > "daemon %s\n\n" > > "# SM Inactive\n" > > - "sm_inactive %s\n\n", > > + "sm_inactive %s\n\n" > > + "# Babbling Port Policy\n" > > + "babbling_port_policy %s\n\n", > > p_opts->daemon ? "TRUE" : "FALSE", > > - p_opts->sm_inactive ? "TRUE" : "FALSE" > > + p_opts->sm_inactive ? "TRUE" : "FALSE", > > + p_opts->babbling_port_policy ? "TRUE" : "FALSE" > > ); > > > > #ifdef ENABLE_OSM_PERF_MGR > > diff --git a/opensm/opensm/osm_trap_rcv.c > > b/opensm/opensm/osm_trap_rcv.c index 5900c51..fbb6dac 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -1,5 +1,5 @@ > > /* > > - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. > > + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > > * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights > > reserved. > > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> > * > > @@ -548,6 +548,61 @@ __osm_trap_rcv_process_request( > > } > > else > > { > > + /* When babbling port policy option is enabled and > > + Threshold for disabling a "babbling" port is exceeded */ > > + if ( p_rcv->p_subn->opt.babbling_port_policy && > > + num_received >= 250 ) > > + { > > + uint8_t payload[IB_SMP_DATA_SIZE]; > > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > + const ib_port_info_t* p_old_pi; > > + osm_madw_context_t context; > > + > > + /* If trap 131, might want to disable peer port if > > available */ > > + /* but peer port has been observed not to respond to SM > > + requests */ > > + > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3810: " > > + " Disabling physical port lid:0x%02X num:%u\n", > > + cl_ntoh16(p_ntci->data_details.ntc_129_131.lid), > > + p_ntci->data_details.ntc_129_131.port_num > > + ); > > + > > + p_old_pi = &p_physp->port_info; > > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > + > > + /* Set port to disabled/down */ > > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > + ib_port_info_set_port_phys_state( > > + IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > + > > + context.pi_context.node_guid = osm_node_get_node_guid( > > osm_physp_get_node_ptr( p_physp ) ); > > + context.pi_context.port_guid = osm_physp_get_port_guid( > > p_physp ); > > + context.pi_context.set_method = TRUE; > > + context.pi_context.update_master_sm_base_lid = FALSE; > > + context.pi_context.light_sweep = FALSE; > > + context.pi_context.active_transition = FALSE; > > + > > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > > + osm_physp_get_dr_path_ptr( > > + p_physp > > ), > > + payload, > > + sizeof(payload), > > + IB_MAD_ATTR_PORT_INFO, > > + > > + cl_hton32(osm_physp_get_port_num( > > p_physp )), > > + CL_DISP_MSGID_NONE, > > + &context ); > > + > > + if( status == IB_SUCCESS ) > > + { > > + goto Exit; > > + } > > + else > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + 
"__osm_trap_rcv_process_request: ERR 3811: " > > + "Request to set PortInfo failed\n" ); > > + } > > + } > > + > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "__osm_trap_rcv_process_request: " > > "Marking unhealthy physical port by lid:0x%02X > > num:%u\n", > > > > > > > > > From ardavis at ichips.intel.com Mon Jul 9 10:09:00 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 09 Jul 2007 10:09:00 -0700 Subject: [ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net> References: <349DCDA352EACF42A0C49FA6DCEA840301BC9BA3@G3W0634.americas.hpqcorp.net><000001c7bfed$6b4225c0$3c98070a@amr.corp.intel.com> <349DCDA352EACF42A0C49FA6DCEA840301BCA09B@G3W0634.americas.hpqcorp.net> <349DCDA352EACF42A0C49FA6DCEA840301BCA139@G3W0634.americas.hpqcorp.net> Message-ID: <46926BAC.5080208@ichips.intel.com> Tang, Changqing wrote: >Sean: > I have 6 nodes with two IB cards on each node. If I configure >the first card on all nodes as one subnet, the second card on all nodes >as another subnet, Plus set arp_ignore=2, jobs on first subnet, or >second subnet work fine. > > But when I configure all 12 cards into a single subnet, jobs on >all first cards work fine, job on all second cards hangs. > > > Can you give us more information regarding your hang? Are you waiting for a connect request or reply? Does the server see a connect request? -arlin From vuhuong at mellanox.com Mon Jul 9 10:21:24 2007 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 09 Jul 2007 10:21:24 -0700 Subject: [ofa-general] Generate ib_srpt.ko Failed! In-Reply-To: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com> References: <5da1d75d6b4c.5d6b4c5da1d7@neusoft.com> Message-ID: <46926E94.80907@mellanox.com> ljf, > Dear, > I used OFED-1.2 to generate the SCSI Target modules,but when I > enter the command "./configure --with-srp-target-mod",many faults > occur. Most are kernel patch failure. 
My OS is CentOS 5.0, with kernel
> version 2.6.18-8.el5. Can anyone give me some suggestion? Great
> appreciation with any help!

Did you follow the instructions in this page - http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt - before the ./configure step?

thanks,
-vu

> Thank you!
>
>
> yours,
>
> ljf

From caitlinb at broadcom.com Mon Jul 9 10:26:16 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Mon, 9 Jul 2007 10:26:16 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <46924E70.2040205@Sun.COM>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Don Kerr
> Sent: Monday, July 09, 2007 8:04 AM
> To: Roland Dreier
> Cc: general
> Subject: Re: [ofa-general] uDAPL Question
>
> Sorry.
I was wrongly lumping port and HCA together.
>
> 2 HCA cards each with 2 ports but only one port on one card
> is operational and by that I mean can be pinged or seen as
> "UP" when you run ifconfig. But both are still listed in the dat.conf.
>
> -DON
>

The DAT Registry allows for a provider to deregister itself, but there are no guidelines as to when it should do so for indefinite but non-permanent unavailability. I have always presumed that Host OS standards for temporarily unavailable devices should be applied.

From Don.Kerr at Sun.COM Mon Jul 9 10:55:54 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Mon, 09 Jul 2007 13:55:54 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <469276AA.3070606@Sun.COM>

OK, so no good way to determine this from uDAPL alone, it's expected that the provider will register/deregister with the file as needed.

Next question. Is there a way to get the entire dat.conf entry from the uDAPL API?

Example: Typical dat.conf entry might look something like:
OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""

I can find the first field, in this example "OpenIB-cma", from the ia attribute name but what if I wanted to correlate say the 6th field, "ib0 0", with the first field?

Thanks
-DON

Caitlin Bestler wrote:

>
>>-----Original Message-----
>>From: general-bounces at lists.openfabrics.org
>>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Don Kerr
>>Sent: Monday, July 09, 2007 8:04 AM
>>To: Roland Dreier
>>Cc: general
>>Subject: Re: [ofa-general] uDAPL Question
>>
>>Sorry. I was wrongly lumping port and HCA together.
>>
>>2 HCA cards each with 2 ports but only one port on one card
>>is operational and by that I mean can be pinged or seen as
>>"UP" when you run ifconfig.
But both are still listed in the dat.conf.
>>
>>-DON
>>
>>
>The DAT Registry allows for a provider to deregister itself, but
>there are no guidelines as to when it should do so for indefinite
>but non-permanent unavailability. I have always presumed that Host
>OS standards for temporarily unavailable devices should be applied.
>
>

From ardavis at ichips.intel.com Mon Jul 9 11:37:17 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 09 Jul 2007 11:37:17 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <46924E70.2040205@Sun.COM>
References: <469245BC.8040108@Sun.COM> <46924E70.2040205@Sun.COM>
Message-ID: <4692805D.2000001@ichips.intel.com>

Don Kerr wrote:

> Sorry. I was wrongly lumping port and HCA together.
>
> 2 HCA cards each with 2 ports but only one port on one card is
> operational and by that I mean can be pinged or seen as "UP" when you
> run ifconfig. But both are still listed in the dat.conf.
>

dat.conf is simply a means of static device registration for providers. The default dat.conf provided with OFED includes examples for up to 4 ports as well as a bonding example. It is up to the administrator to modify accordingly. The device is valid and configured properly if the open returns DAT_SUCCESS.

-arlin

From ardavis at ichips.intel.com Mon Jul 9 12:36:09 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 09 Jul 2007 12:36:09 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <469276AA.3070606@Sun.COM>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com> <469276AA.3070606@Sun.COM>
Message-ID: <46928E29.8020903@ichips.intel.com>

Don Kerr wrote:

> OK, so no good way to determine this from uDAPL alone, it's expected
> that the provider will register/deregister with the file as needed.
>
> Next question. Is there a way to get the entire dat.conf entry from
> the uDAPL API?
> > Example: Typical dat.conf entry might look something like: > OpenIB-cma u1.2 nonthreadsafe default > /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" "" > > I can find the first field, in this example "OpenIB-cma", from the ia > attribute name but what if I wanted to correlate say the 6th field, > "ib0 0", with the first field? > What are you trying to determine from this parsing? Do you need to actually know the netdev name or can you get by with the address of the device? If you are using dat_registry_list_providers(), just walk the list, use the device name for the dat_ia_open and if it returns DAT_SUCCESS the device is active and configured. You can then call dat_ia_query to get the IP address. -arlin From Don.Kerr at Sun.COM Mon Jul 9 12:47:17 2007 From: Don.Kerr at Sun.COM (Don Kerr) Date: Mon, 09 Jul 2007 15:47:17 -0400 Subject: [ofa-general] uDAPL Question In-Reply-To: <46928E29.8020903@ichips.intel.com> References: <1EF1E44200D82B47BD5BA61171E8CE9D0475C9A7@NT-IRVA-0750.brcm.ad.broadcom.com> <469276AA.3070606@Sun.COM> <46928E29.8020903@ichips.intel.com> Message-ID: <469290C5.6010709@Sun.COM> I am working on a uDAPL layer for Open MPI. The situation is if I have more than one port/HCA my users may want to be selective in what is used and to do this they would need to provide some information regarding which port/HCA to use. So my thought is that the users are more familar with the output from "ifconfig", for example ib0, ib1, etc, and I was trying to find a way to correlate that to what is available from the uDAPL API. Maybe I need to reprogram them to look at dat.conf. -DON Arlin Davis wrote: > Don Kerr wrote: > >> OK, so no good way to determine this from uDAPL alone, its expected >> that the provider will register/deregister with the file as needed. >> >> Next question. is there a way to get the entire dat.conf entry from >> the uDAPL API? 
>> >> Example: Typical dat.conf entry might look something like: >> OpenIB-cma u1.2 nonthreadsafe default >> /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" "" >> >> I can find the first field, in this example "OpenIB-cma", from the ia >> attribute name but what if I wanted to correlate say the 6th field, >> "ib0 0", with the first field? >> > What are you trying to determine from this parsing? Do you need to > actually know the netdev name or can you get by with the address of > the device? If you are using dat_registry_list_providers(), just walk > the list, use the device name for the dat_ia_open and if it returns > DAT_SUCCESS the device is active and configured. You can then call > dat_ia_query to get the IP address. > > -arlin From ralph.campbell at qlogic.com Mon Jul 9 13:29:29 2007 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 09 Jul 2007 13:29:29 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address In-Reply-To: References: <1183142276.18911.337.camel@brick.pathscale.com> Message-ID: <1184012969.20509.0.camel@brick.pathscale.com> I was on vacation last week, just going through emails today. On Mon, 2007-07-02 at 09:43 -0700, Roland Dreier wrote: > ralph -- how did you find this bug? Hit it in practice or just code review? > > I'm trying to decide whether to get this into 2.6.22, or whether it > can wait for 2.6.23. > > - R. I found it via code inspection. From rdreier at cisco.com Mon Jul 9 13:34:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 13:34:12 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib - partial error clean up unmaps wrong address In-Reply-To: <1184012969.20509.0.camel@brick.pathscale.com> (Ralph Campbell's message of "Mon, 09 Jul 2007 13:29:29 -0700") References: <1183142276.18911.337.camel@brick.pathscale.com> <1184012969.20509.0.camel@brick.pathscale.com> Message-ID: OK, thanks... I stuck it in 2.6.22 anyway since mst thought he saw a related crash. 
From rdreier at cisco.com Mon Jul 9 14:16:19 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 14:16:19 -0700 Subject: [ofa-general] mthca use of dma_sync_single is bogus Message-ID: It seems the problems running mthca in a Xen domU have uncovered a bug in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg() and mthca_arbel_map_phys_fmr() to sync the MTTs that get written. However, Documentation/DMA-API.txt says: void dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction) synchronise a single contiguous or scatter/gather mapping. All the parameters must be the same as those passed into the single mapping API. and mthca is *not* following this clear rule: it is trying to sync only a subrange of the mapping. Later on in the document, there is: void dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, enum dma_data_direction direction) does a partial sync. starting at offset and continuing for size. You must be careful to observe the cache alignment and width when doing anything like this. You must also be extra careful about accessing memory you intend to sync partially. but that is in a section dealing with non-consistent memory so it's not entirely clear to me whether it's kosher to use this as mthca wants. The other alternative is to put the MTT table in coherent memory just like the MPT table. That might be the best solution I suppose... Michael, anyone else, thoughts on this? - R. From keir at xensource.com Mon Jul 9 14:31:49 2007 From: keir at xensource.com (Keir Fraser) Date: Mon, 09 Jul 2007 22:31:49 +0100 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: Message-ID: One thought is that if you *do* move to dma_sync_single_range() then lib/swiotlb.c still needs fixing. 
It's buggy in that swiotlb_sync_single_range(dma_addr, offset) calls swiotlb_sync_single(dma_addr+offset), and this will fail if the offset is large enough that it ends up dereferencing a different slot index in io_tlb_orig_addr. So, I should be able to get my swiotlb workaround fixes accepted upstream as a genuine bug fix. :-) dma_sync_single_range() looks to me to be the right thing for you to be using. But I'm not a DMA-API expert. -- Keir On 9/7/07 22:16, "Roland Dreier" wrote: > It seems the problems running mthca in a Xen domU have uncovered a bug > in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg() > and mthca_arbel_map_phys_fmr() to sync the MTTs that get written. > However, Documentation/DMA-API.txt says: > > void > dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, > enum dma_data_direction direction) > > synchronise a single contiguous or scatter/gather mapping. All the > parameters must be the same as those passed into the single mapping > API. > > and mthca is *not* following this clear rule: it is trying to sync > only a subrange of the mapping. Later on in the document, there is: > > void > dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, > unsigned long offset, size_t size, > enum dma_data_direction direction) > > does a partial sync. starting at offset and continuing for size. You > must be careful to observe the cache alignment and width when doing > anything like this. You must also be extra careful about accessing > memory you intend to sync partially. > > but that is in a section dealing with non-consistent memory so it's > not entirely clear to me whether it's kosher to use this as mthca > wants. > > The other alternative is to put the MTT table in coherent memory just > like the MPT table. That might be the best solution I suppose... > > Michael, anyone else, thoughts on this? > > - R. 
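
The slot-index problem Keir describes can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the names (`model_map`, `model_orig`, `model_range_lookup_buggy`, `model_range_lookup_fixed`), the 2 KiB slot size, and the rule that only the first slot of a mapping records the original address are inventions of the model, not code taken from lib/swiotlb.c. The "fixed" variant shows one possible approach and is not necessarily the fix that was proposed upstream.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of a slot-based bounce buffer.
 * All names and sizes are illustrative, not the kernel's. */
#define MODEL_SLOT_SHIFT 11                        /* 2 KiB slots (assumed) */
#define MODEL_SLOT_SIZE  ((size_t)1 << MODEL_SLOT_SHIFT)
#define MODEL_NSLOTS     8

static char  model_pool[MODEL_NSLOTS * MODEL_SLOT_SIZE]; /* bounce memory */
static char *model_orig[MODEL_NSLOTS];             /* original buffer per slot */

/* Map `buf` starting at slot `first`. In this model only the first slot
 * of a mapping records the original address (an assumption). */
static char *model_map(char *buf, int first)
{
    model_orig[first] = buf;
    return &model_pool[(size_t)first * MODEL_SLOT_SIZE];
}

/* Whole-mapping sync: derive the slot index from the handle and look up
 * the original buffer recorded for that slot. */
static char *model_sync_lookup(char *dma_addr)
{
    size_t index = (size_t)(dma_addr - model_pool) >> MODEL_SLOT_SHIFT;
    return model_orig[index];
}

/* Buggy range sync: folds the offset into the handle before the lookup,
 * so an offset of one slot or more selects a different slot index. */
static char *model_range_lookup_buggy(char *dma_addr, size_t offset)
{
    return model_sync_lookup(dma_addr + offset);
}

/* One possible fix: look up the slot from the unmodified handle, then
 * apply the offset to the original buffer address. */
static char *model_range_lookup_fixed(char *dma_addr, size_t offset)
{
    char *orig = model_sync_lookup(dma_addr);
    return orig ? orig + offset : NULL;
}
```

With a three-slot mapping, the buggy lookup still finds the original buffer for small offsets, but an offset of `MODEL_SLOT_SIZE` lands on a slot that recorded nothing, while the fixed lookup resolves to the expected position inside the original buffer.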
From rdreier at cisco.com Mon Jul 9 14:29:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 14:29:40 -0700 Subject: [ofa-general] mthca use of dma_sync_single is bogus In-Reply-To: (Roland Dreier's message of "Mon, 09 Jul 2007 14:16:19 -0700") References: Message-ID: > void > dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, > unsigned long offset, size_t size, > enum dma_data_direction direction) It seems the document has bitrotted a little, since dma_sync_single_range() doesn't actually exist for most architectures; what is really implemented is dma_sync_single_range_for_cpu() and dma_sync_single_range_for_device(). But assuming those are usable in our situation, they seem to be exactly what we want. I'll try to get clarification from the DMA API experts (and also fix the documentation in the kernel). Unfortunately it seems like the kernel's swiotlb does not implement the full DMA API so this won't actually fix Xen :(. - R. From rdreier at cisco.com Mon Jul 9 14:31:32 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 14:31:32 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: (Keir Fraser's message of "Mon, 09 Jul 2007 22:31:49 +0100") References: Message-ID: > One thought is that if you *do* move to dma_sync_single_range() then > lib/swiotlb.c still needs fixing. It's buggy in that > swiotlb_sync_single_range(dma_addr, offset) calls > swiotlb_sync_single(dma_addr+offset), and this will fail if the offset is > large enough that it ends up dereferencing a different slot index in > io_tlb_orig_addr. Yes, I realized the same thing (our emails crossed). > So, I should be able to get my swiotlb workaround fixes accepted upstream as > a genuine bug fix. :-) Yeah, seems so. > dma_sync_single_range() looks to me to be the right thing for you to be > using. But I'm not a DMA-API expert. 
Yes, I'll try to get confirmation from James Bottomley and/or Dave Miller that it is the right thing to do (and also fix the documentation to match what the kernel actually implements).

 - R.

From keir at xensource.com Mon Jul 9 14:36:42 2007
From: keir at xensource.com (Keir Fraser)
Date: Mon, 09 Jul 2007 22:36:42 +0100
Subject: [ofa-general] mthca use of dma_sync_single is bogus
In-Reply-To:
Message-ID:

On 9/7/07 22:29, "Roland Dreier" wrote:

> Unfortunately it seems like the kernel's swiotlb does not implement
> the full DMA API so this won't actually fix Xen :(.

It implements the sync_single_range_for_{cpu,device} functions. But we use our own swiotlb implementation anyway. arch/i386/kernel/swiotlb.c in a Xen-patched tree is used by both i386/xen and x64/xen. We haven't yet merged with main lib/swiotlb.c.

 -- Keir

From rdreier at cisco.com Mon Jul 9 14:35:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:35:31 -0700
Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2
In-Reply-To: <200707091527.14272.fenkes@de.ibm.com> (Joachim Fenkes's message of "Mon, 9 Jul 2007 15:27:13 +0200")
References: <200707091502.22407.fenkes@de.ibm.com> <200707091527.14272.fenkes@de.ibm.com>
Message-ID:

Out of curiosity, does this mean that a GRH will be sent on all UD messages (for non-LL QPs)? What decides if a QP is LL or not?

 - R.

From rdreier at cisco.com Mon Jul 9 14:38:03 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 09 Jul 2007 14:38:03 -0700
Subject: [ofa-general] Re: [PATCH 08/13] IB/ehca: Lock renaming, static initializers
In-Reply-To: <200707091529.04073.fenkes@de.ibm.com> (Joachim Fenkes's message of "Mon, 9 Jul 2007 15:29:03 +0200")
References: <200707091502.22407.fenkes@de.ibm.com> <200707091529.04073.fenkes@de.ibm.com>
Message-ID:

> +DEFINE_SPINLOCK(hcall_lock);

This can be static.
(I fixed it up when I applied the patch) From mst at dev.mellanox.co.il Mon Jul 9 14:39:13 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jul 2007 00:39:13 +0300 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: Message-ID: <20070709213913.GB20052@mellanox.co.il> > Quoting Roland Dreier : > Subject: mthca use of dma_sync_single is bogus > > It seems the problems running mthca in a Xen domU have uncovered a bug > in mthca: mthca uses dma_sync_single in mthca_arbel_write_mtt_seg() > and mthca_arbel_map_phys_fmr() to sync the MTTs that get written. > However, Documentation/DMA-API.txt says: > > void > dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size, > enum dma_data_direction direction) > > synchronise a single contiguous or scatter/gather mapping. All the > parameters must be the same as those passed into the single mapping > API. > > and mthca is *not* following this clear rule: it is trying to sync > only a subrange of the mapping. Yes, this looks like a bug. > Later on in the document, there is: > > void > dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, > unsigned long offset, size_t size, > enum dma_data_direction direction) > > does a partial sync. starting at offset and continuing for size. You > must be careful to observe the cache alignment and width when doing > anything like this. You must also be extra careful about accessing > memory you intend to sync partially. > > but that is in a section dealing with non-consistent memory so it's > not entirely clear to me whether it's kosher to use this as mthca > wants. This is under Part II - Advanced dma_ usage - I don't think it's dealing with non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this looks like a good fit. Most functions here work for both consistent and non-consistent memory... What makes you suspicious? 
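
The caution quoted above about observing cache alignment and width during a partial sync can be sketched as a small helper. This is a userspace illustration only, under stated assumptions: `align_sync_range`, `struct sync_range`, and the 64-byte `MODEL_CACHE_LINE` are hypothetical names and values chosen for the example, not part of the kernel DMA API or of mthca. It shows one way to widen a requested sub-range to whole cache lines while staying inside the mapping.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CACHE_LINE 64u   /* assumed line size; real hardware varies */

struct sync_range {
    size_t offset;   /* start of the range, relative to the mapping */
    size_t size;     /* length of the range in bytes */
};

/*
 * Widen [offset, offset + size) to whole cache lines and clamp the result
 * to the mapping length, so that a partial sync never covers a fraction
 * of a line and never runs past the end of the mapping.
 */
static struct sync_range align_sync_range(size_t offset, size_t size,
                                          size_t map_len)
{
    const size_t mask = ~(size_t)(MODEL_CACHE_LINE - 1);
    size_t start = offset & mask;                               /* round down */
    size_t end = (offset + size + MODEL_CACHE_LINE - 1) & mask; /* round up */
    struct sync_range r;

    if (end > map_len)
        end = map_len;
    if (start > end)
        start = end;

    r.offset = start;
    r.size = end - start;
    return r;
}
```

In this model a request for 8 bytes at offset 100 of a 4096-byte mapping becomes the single line starting at offset 64, and a request that straddles a line boundary is widened to cover both lines.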
> The other alternative is to put the MTT table in coherent memory just > like the MPT table. That might be the best solution I suppose... > > Michael, anyone else, thoughts on this? Certainly easy ... I'm concerned that MTTs need a fair amount of memory, while the amount of coherent memory might be limited. Not that non-coherent memory systems are widespread ... -- MST From rdreier at cisco.com Mon Jul 9 15:11:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 15:11:42 -0700 Subject: [ofa-general] Re: [PATCH 00/13] IB/ehca: eHCA2 enablement & some fixes In-Reply-To: <200707091502.22407.fenkes@de.ibm.com> (Joachim Fenkes's message of "Mon, 9 Jul 2007 15:02:21 +0200") References: <200707091502.22407.fenkes@de.ibm.com> Message-ID: thanks, I applied these for 2.6.23 and fixed a bunch of minor things that scripts/checkpatch.pl complained about (since I was in a mood to do mindless things). In the future please run that yourself and clean up the obvious things. I generally don't worry about the 80 column stuff, but it will catch most whitespace problems and tell you that foo(x,y) should be foo(x, y) etc. So you don't have to completely silence the script but at least take a look at the output. From rick.jones2 at hp.com Mon Jul 9 15:36:12 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 09 Jul 2007 15:36:12 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next? Message-ID: <4692B85C.6020209@hp.com> I've gotten around to loading-up the GQ OFED 1.2 bits on a pair of RHEL5 systems and was going to reproduce the tests I ran with the OFED 1.1 (?) bits which shipped with RHEL5. However I've run into a little snag. I've no idea which CPU ib_mthca will interrupt next. ISTR (but could be wrong) that as I repeated a test with the 1.1 bits that the same CPU would be interrupted, but with 1.2 it seems that the card/firmware/whatever is deciding to migrate interrupts around. 
I don't mind especially, I just want to know when/how it is going to do it, because I want to take measurements from when netperf/netserver is running on the CPU taking interrupts and when it is not. That presupposes I know which CPU will take the interrupts. I suppose I could just hit smp_affinity with a single CPU assignment, but I would like to avoid that if I can. Bits of clue or pointers to fine manuals would be most appreciated, rick jones [root at hpcpc107 ~]# cat /proc/interrupts | grep ib 77: 1331747 803705 80 732093 PCI-MSI-X ib_mthca (comp) 78: 1506 172 42 123 PCI-MSI-X ib_mthca (async) From rdreier at cisco.com Mon Jul 9 15:40:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 15:40:53 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next? In-Reply-To: <4692B85C.6020209@hp.com> (Rick Jones's message of "Mon, 09 Jul 2007 15:36:12 -0700") References: <4692B85C.6020209@hp.com> Message-ID: > I've no idea which CPU ib_mthca will interrupt next. ISTR (but could > be wrong) that as I repeated a test with the 1.1 bits that the same > CPU would be interrupted, but with 1.2 it seems that the > card/firmware/whatever is deciding to migrate interrupts around. I don't think this is an OFED change but rather a kernel change. Anyway, first make sure you don't have a userspace irq balancer running. (irqbalanced or something like that). Then you can set IRQ affinity through /proc/irq/77/smp_affinity The file takes a bitmap of allowed CPUs. (where 77 is your real IRQ number of course). - R. From rick.jones2 at hp.com Mon Jul 9 15:41:31 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 09 Jul 2007 15:41:31 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next?
In-Reply-To: <4692B85C.6020209@hp.com> References: <4692B85C.6020209@hp.com> Message-ID: <4692B99B.9050001@hp.com> > [root at hpcpc107 ~]# cat /proc/interrupts | grep ib > 77: 1331747 803705 80 732093 PCI-MSI-X > ib_mthca (comp) > 78: 1506 172 42 123 PCI-MSI-X > ib_mthca (async) and it seems all the more strange when I was looking at the smp_affinity, and it said the mask was "8" - for all four cores taking interrupts (well mostly) I would have expected a mask of "f" rick jones From rdreier at cisco.com Mon Jul 9 15:42:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 15:42:49 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next? In-Reply-To: <4692B99B.9050001@hp.com> (Rick Jones's message of "Mon, 09 Jul 2007 15:41:31 -0700") References: <4692B85C.6020209@hp.com> <4692B99B.9050001@hp.com> Message-ID: > and it seems all the more strange when I was looking at the > smp_affinity, and it said the mask was "8" - for all four cores taking > interrupts (well mostly) I would have expected a mask of "f" Is your distro running irqbalanced or whatever the userspace irq balancer is called? From rick.jones2 at hp.com Mon Jul 9 15:46:06 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 09 Jul 2007 15:46:06 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next? In-Reply-To: References: <4692B85C.6020209@hp.com> Message-ID: <4692BAAE.3080601@hp.com> Roland Dreier wrote: > > I've no idea which CPU ib_mthca will interrupt next. ISTR (but could > > be wrong) that as I repeated a test with the 1.1 bits that the same > > CPU would be interrupted, but with 1.2 it seems that the > > card/firmware/whatever is deciding to migrate interrupts around. > > I don't think this is an OFED change but rather a kernel change. > > Anyway, first make sure you don't have a userspace irq balancer > running. (irqbalanced or something like that). Grrr - indeed that is what was happening, the blessed irqbalancer was running. 
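A minimal sketch of how the smp_affinity values discussed in this thread can be read, assuming the usual convention that the file holds a hexadecimal bitmap in which bit N stands for CPU N. Under that reading a mask of "8" allows only CPU 3, while "f" allows CPUs 0 through 3. The helper names below are hypothetical, invented for this illustration; they only convert between mask strings and CPU lists and never touch /proc.

```python
def cpus_in_mask(mask_text):
    """Return the CPU numbers enabled in an smp_affinity hex mask string."""
    # Wide masks are shown as comma-separated 32-bit hex groups; joining
    # the groups gives one hexadecimal number.
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def mask_for_cpus(cpus):
    """Build the hex string for a set of allowed CPUs (hypothetical helper)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    print(cpus_in_mask("8"))    # the mask Rick observed
    print(cpus_in_mask("f"))    # the mask he expected on four cores
    print(mask_for_cpus([0]))   # value that would pin the IRQ to CPU 0
```

Writing such a value into /proc/irq/<N>/smp_affinity needs root, and a running userspace IRQ balancer may overwrite it again, which matches what was seen here.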
I run into that from time to time, then go run to/in an environment blissfully free from it and forget about its evil ways :( It seems to have been entirely too aggressive here - changing the interrupt assignments between successive netperf runs. I have decided to terminate it with extreme prejudice. > > Then you can set IRQ affinity through > > /proc/irq/77/smp_affinity > > The file takes a bitmap of allowed CPUs. > (where 77 is your real IRQ number of course). Yep - once the wicked-irq-witch is dead does a: 03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) naturally want to interrupt more than one CPU at a time? thanks, rick jones From rdreier at cisco.com Mon Jul 9 16:16:54 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 16:16:54 -0700 Subject: [ofa-general] which CPU will ib_mthca interrupt next? In-Reply-To: <4692BAAE.3080601@hp.com> (Rick Jones's message of "Mon, 09 Jul 2007 15:46:06 -0700") References: <4692B85C.6020209@hp.com> <4692BAAE.3080601@hp.com> Message-ID: > 03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex > (Tavor compatibility mode) (rev 20) > > naturally want to interrupt more than one CPU at a time? Not at the moment -- it only allocates one data-path MSI-X interrupt for now, although in the future we may use more than one interrupt for different queues etc. - R. From rick.jones2 at hp.com Mon Jul 9 17:02:02 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 09 Jul 2007 17:02:02 -0700 Subject: [ofa-general] minor usability nit with 1.2GA? Message-ID: <4692CC7A.2050704@hp.com> So I was blithely running my netperf tests after resolving the problem with the existence of irqbalance. I finished my TCP tests and was about to run the SDP tests. I'd not modprobe'd the ib_sdp module, so my netperf tests died. I then did the modprobe and it complained about symbol versions.
Turns-out - or at least it seems that way - that my selection of just "basic" software didn't include SDP. That's fine I suppose, but what happened then was I was left with a system with a hybrid of the previous OFED whatever bits (probably an RC for 1.2) and OFED GA bits. Perhaps this is simply "caveat emptor" but shouldn't there be some sort of warning/check that in only doing the partial install there would be some incompatible modules left laying around? Or should I just do the "give me everything" option, shut-up and benchmark?-) rick jones From stanleysufficool at roadrunner.com Mon Jul 9 21:37:32 2007 From: stanleysufficool at roadrunner.com (Stanley Sufficool) Date: Mon, 09 Jul 2007 21:37:32 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <46926868.8000704@mellanox.com> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> Message-ID: <1184042252.15067.8.camel@gentoo-linux.localdomain> Added a new wiki page based on Vu Pham's readme and issues with recent kernels. I hope to keep it current as I get our targets up and running. http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation WinIB initiators --> Gentoo Linux SRP Target. Anything wrong with the above approach, I would be interested in a best practices if there is one. I saw a CentOS target post, is this more stable or better performing? Thanks. On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: > Stanley Sufficool wrote: > > Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch > > > > Got the latest srpt from the git repository on OpenFabrics and had the > > following issues. > > > > ib_srpt.c Line 1997, missing second argument, should be? > > sdev->scst_tgt = scst_register(tp, NULL); > > > > Yes. You need the change if you test with top of scst svn > trunk (or from version 0.9.6-pre2) > If you test with scst before 0.9.6-pre2 (ie. 
version <= > 0.9.6-pre1) you don't need the second argument for > scst_register() > > > > SCST was built successfully after fixing an issue in scst_vdisk.c > > (missing #include ) > > > I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX > - you should send the patch to scst devel > > > > > Just thought this would be nice to have documented, took me half a day > > to track down as a novice in C programming. > > > > there is *lean and mean* srpt's README in srpt_inc > SCST also has some document > You can add some wiki/notes for the problems in openfabrics > wiki page https://wiki.openfabrics.org/tiki-index.php > > -vu > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Jul 9 23:48:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 09 Jul 2007 23:48:06 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070709213913.GB20052@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jul 2007 00:39:13 +0300") References: <20070709213913.GB20052@mellanox.co.il> Message-ID: > > void > > dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, > > unsigned long offset, size_t size, > > enum dma_data_direction direction) > This is under Part II - Advanced dma_ usage - I don't think it's dealing with > non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this > looks like a good fit. Most functions here work for both consistent and > non-consistent memory... What makes you suspicious? I was suspicious because it is described between the main noncoherent API stuff and dma_cache_sync().
But I think it is probably OK. Unfortunately it is not that good a fit for our current code, since we use pci_map_sg() to do the DMA mapping on the MTT memory instead of dma_map_single(). > I'm concerned that MTTs need a fair amount of memory, > while the amount of coherent memory might be limited. > Not that non-coherent memory systems are widespread ... Yes, for example on ppc 4xx the amount of coherent memory is quite small by default (address space for non-cached mappings is actually what is limited, but it amounts to the same thing). Maybe the least bad solution is to change to using dma_map_single() instead of pci_map_sg() in mthca_memfree.c. - R. From erezz at voltaire.com Tue Jul 10 00:11:44 2007 From: erezz at voltaire.com (Erez Zilber) Date: Tue, 10 Jul 2007 10:11:44 +0300 Subject: [ofa-general] iSER header In-Reply-To: <20070709144702.GB24125@postal.iol.unh.edu> References: <20070709144702.GB24125@postal.iol.unh.edu> Message-ID: <46933130.6040100@voltaire.com> Ethan Burns wrote: > Hello, > I have been looking over the latest Linus git repo and I > stumbled upon, what appears to be, an inconsistency between the iSER > header used in the kernel and the latest iSER draft > (draft-ietf-ips-iser-06.txt): > > struct iser_hdr { > u8 flags; > u8 rsvd[3]; > __be32 write_stag; /* write rkey */ > __be64 write_va; <------------------------------ > __be32 read_stag; /* read rkey */ > __be64 read_va; <------------------------------ > } __attribute__((packed)); > > > The two fields `write_va' and `read_va' seem to be extra fields that are > not defined by the draft. Won't these fields present interoperability > issues with conformant iSER implementations? > > Any information would be greatly appreciated. 
> > Ethan Burns > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > The iSER header issue was discussed in the open-iscsi list: http://groups.google.com/group/open-iscsi/browse_thread/thread/23ee18054e8412e6/fd4182f0b141c2da?lnk=gst&q=iSER%2FiWARP+Support+in+version+2.6.20&rnum=1#fd4182f0b141c2da For some reason, another answer given by Mike Ko does not appear in this thread. Here it is: For Infiniband, if both the initiator and the target support Zero-Based Virtual Address, then the iSER header as defined in the IETF draft will be used. (Zero-based Virtual Address is used in iWARP but optional to implement in Infiniband.) However, if either the initiator or the target in an Infiniband environment does not support Zero-Based Virtual Address, then the expanded iSER header as defined in the Infiniband annex is used. This expanded iSER header is only used in Infiniband. There is no intention to provide a link in the IETF draft since this is purely an Infiniband issue. I hope this helps. BTW - do you plan to use the current iSER initiator code for iWARP? -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Team Voltaire – _The Grid Backbone_ __ www.voltaire.com From mst at dev.mellanox.co.il Tue Jul 10 00:15:47 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Jul 2007 10:15:47 +0300 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: <20070709213913.GB20052@mellanox.co.il> Message-ID: <20070710071547.GA3814@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: mthca use of dma_sync_single is bogus > > > > void > > > dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, > > > unsigned long offset, size_t size, > > > enum dma_data_direction direction) > > > This is under Part II - Advanced dma_ usage - I don't think it's dealing with > > non-consistent memory only (e.g. dma_declare_coherent_memory is there), and this > > looks like a good fit. Most functions here work for both consistent and > > non-consistent memory... What makes you suspicious? > > I was suspicious because it is described between the main noncoherent > API stuff and dma_cache_sync(). But I think it is probably OK. > > Unfortunately it is not that good a fit for our current code, since we > use pci_map_sg() to do the DMA mapping on the MTT memory instead of > dma_map_single(). > > > I'm concerned that MTTs need a fair amount of memory, > > while the amount of coherent memory might be limited. > > Not that non-coherent memory systems are widespread ... > > Yes, for example on ppc 4xx the amount of coherent memory is quite > small by default (address space for non-cached mappings is actually > what is limited, but it amounts to the same thing). > > Maybe the least bad solution is to change to using dma_map_single() > instead of pci_map_sg() in mthca_memfree.c. Hmm. What makes you think dma_sync_single_range can't be used on memory mapped by pci_map_sg/dma_map_sg? -- MST From mst at dev.mellanox.co.il Tue Jul 10 00:19:12 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 10 Jul 2007 10:19:12 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il> Message-ID: <20070710071912.GB3814@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH RFC] sharing userspace IB objects > > > Could you please clarify how do you envision this done? > > Do we just create our own filesystem? > > > > Reason I ask, we'll need something like this for SRC domain too ... > > I don't have a really clear idea. "Look at spufs" is about as far as > I got. That one is actually not very different from sysfs: there just seems to be a set of pre-defined files. The special nature of your suggested filesystem would be that we actually let users create files there, but then files need to disappear when the last user closes the file. Any more hints? -- MST From mst at dev.mellanox.co.il Tue Jul 10 00:32:09 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 10 Jul 2007 10:32:09 +0300 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il> Message-ID: <20070710073209.GC3814@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH RFC] sharing userspace IB objects > > > Could you please clarify how do you envision this done? > > Do we just create our own filesystem? > > > > Reason I ask, we'll need something like this for SRC domain too ... > > I don't have a really clear idea. "Look at spufs" is about as far as > I got. 
OK, here's a very simple idea, I'll demonstrate it with the SRC domain object - make it possible to map an src domain to an open fd - verify that all processes map a specific src domain to the same inode This way we don't need our own filesystem, any file can be used to share src domains, applications just need to pass some kind of unique domain handle around: one way to do this would be for the app to use a real file, and actually write the handle value in this file. How does this sound? -- MST
From vlad at lists.openfabrics.org Tue Jul 10 02:46:20 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 10 Jul 2007 02:46:20 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070710-0200 daily build status Message-ID: <20070710094620.28C9AE60830@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with
linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From FENKES at de.ibm.com Tue Jul 10 04:26:10 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Tue, 10 Jul 2007 13:26:10 +0200 Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2 In-Reply-To: Message-ID: Roland Dreier wrote on 09.07.2007 23:35:31: > Out of curiousity, does this mean that a GRH will be sent on all UD > messages (for non-LL QPs)? No - the bit instructs the hardware to fetch the GRH parts of the QP context. The GRH will only be used if the WQE says so. Joachim From halr at voltaire.com Tue Jul 10 04:27:32 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 07:27:32 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling Message-ID: <1184066851.25217.468533.camel@hal.voltaire.com> OpenSM/osm_trap_rcv.c: Better trap 131 handling When trap 131 occurs, check operational VLs and set port state to INIT if needed. 
I think this is what Amit was saying should be done in his emails yesterday on the list. Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index f912dcd..f79c62f 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( } else { - /* When babbling port policy option is enabled and - Threshold for disabling a "babbling" port is exceeded */ + uint8_t payload[IB_SMP_DATA_SIZE]; + ib_port_info_t* p_pi = (ib_port_info_t*)payload; + const ib_port_info_t* p_old_pi; + osm_madw_context_t context; + + p_old_pi = &p_physp->port_info; + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); + + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) + { + uint8_t port_state, cur_opvls, opvls; + + port_state = ib_port_info_get_port_state(p_old_pi); + if (port_state != IB_LINK_DOWN) + { + /* First, validate OperationalVLs */ + cur_opvls = ib_port_info_get_op_vls(p_old_pi); + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp); + if (opvls != cur_opvls) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3809: " + "Current OP_VLs %d New OP_VLs %d\n", + cur_opvls, opvls); + ib_port_info_set_op_vls(p_pi, opvls); + } + + /* Now, set port to INIT if not already in INIT */ + if (port_state != IB_LINK_INIT) + { + ib_port_info_set_port_state( p_pi, IB_LINK_INIT ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + else + { + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + + /* Now, issue set of PortInfo */ + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); + context.pi_context.set_method = TRUE; + context.pi_context.update_master_sm_base_lid = FALSE; + context.pi_context.light_sweep = FALSE; 
+ context.pi_context.active_transition = FALSE; + + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, + osm_physp_get_dr_path_ptr( p_physp ), + payload, + sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32(osm_physp_get_port_num( p_physp )), + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3812: " + "Request to set PortInfo failed\n" ); + } + } + } + + /* When babbling port policy option is enabled and + Threshold for disabling a "babbling" port is exceeded */ if ( p_rcv->p_subn->opt.babbling_port_policy && num_received >= 250 ) { - uint8_t payload[IB_SMP_DATA_SIZE]; - ib_port_info_t* p_pi = (ib_port_info_t*)payload; - const ib_port_info_t* p_old_pi; - osm_madw_context_t context; - /* If trap 131, might want to disable peer port if available */ /* but peer port has been observed not to respond to SM requests */ @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( p_ntci->data_details.ntc_129_131.port_num ); - p_old_pi = &p_physp->port_info; - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); - /* Set port to disabled/down */ ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); From vlad at dev.mellanox.co.il Tue Jul 10 04:51:40 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 10 Jul 2007 14:51:40 +0300 Subject: [ofa-general] minor usability nit with 1.2GA? In-Reply-To: <4692CC7A.2050704@hp.com> References: <4692CC7A.2050704@hp.com> Message-ID: <469372CC.5060207@dev.mellanox.co.il> Rick Jones wrote: > So I was blythly running my netperf tests after resolving the problem > with the existence of irqbalance. I finished my TCP tests and was about > to run the SDP tests. I'd not modprobe'd the ib_sdp module, so my > netperf tests died. I then did the modprobe and it complained about > symbol versions. 
> > Turns-out - or at least it seems that way - that my selection of just > "basic" software didn't include SDP. That's fine I suppose, but what > happened then was I was left with a system with a hybrid of the previous > OFED whatever bits (probably an RC for 1.2) and OFED GA bits. > > Perhaps this is simply "caveat emptor" but shouldn't there be some sort > of warning/check that in only doing the partial install there would be > some incompatible modules left laying around? Or should I just do the > "give me everything" option, shut-up and benchmark?-) > Hi, OFED removes the previous software before installing the new one. So, there shouldn't be a mix of different OFED versions on the same machine. Can you send me the output of the following commands: # modinfo ib_sdp # rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the previous command) # rpm -q kernel-ib # ofed_info Thanks, Vladimir From FENKES at de.ibm.com Tue Jul 10 06:20:08 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Tue, 10 Jul 2007 15:20:08 +0200 Subject: [ofa-general] Re: [PATCH 00/13] IB/ehca: eHCA2 enablement & some fixes In-Reply-To: Message-ID: Roland Dreier wrote on 10.07.2007 00:11:42: > thanks, I applied these for 2.6.23 and fixed a bunch of minor things > that scripts/checkpatch.pl complained about (since I was in a mood to > do mindless things). Thanks! Both for the quick merge and for the fixes! > In the future please run that yourself and clean > up the obvious things. I generally don't worry about the 80 column > stuff, but it will catch most whitespace problems and tell you that > foo(x,y) should be foo(x, y) etc. So you don't have to completely > silence the script but at least take a look at the output. Didn't know about that script before, so thanks for the pointer! I'll be sure to pass the next set of patches through it. 
Joachim From RAISCH at de.ibm.com Tue Jul 10 09:35:49 2007 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Tue, 10 Jul 2007 18:35:49 +0200 Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2 In-Reply-To: Message-ID: > What decides if a QP is LL or not? > > - R. Currently we use a high bit in the QP type, which is not how we want to keep it permanently. What would you suggest, add two additional LL QP types, or change something more fundamental in libibverbs and kernel ib core? We think we can get along quite well with the existing parameters in the current create QP. The current user-kernel interface is ok for these new QPs for post_send + post_recv, but unfortunately the libibverbs userspace calls don't match exactly how the LL queues are to be used. We would need something like the LL QP interface in libehca in libibverbs to keep that interface generic. We didn't see a usage yet for LL QP in kernel, so maybe we should continue that discussion on general at openfabrics only. We could provide example code in libehca/samples if needed. Gruss / Regards Christoph + Nam From dotanb at dev.mellanox.co.il Tue Jul 10 06:55:57 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 10 Jul 2007 16:55:57 +0300 Subject: [ofa-general] [PATCH] IB/core: Fix the used pointer when calling to kmalloc Message-ID: <200707101655.58041.dotanb@dev.mellanox.co.il> Fix the used pointer when calling to kmalloc. It is true that today the type of in_mad and out_mad are the same, but this patch will give us a cleaner code. 
Signed-off-by: Dotan Barak

---

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 08c299e..6265a3f 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -311,7 +311,7 @@ static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
 		return sprintf(buf, "N/A (no PMA)\n");
 
 	in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL);
-	out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL);
+	out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
 	if (!in_mad || !out_mad) {
 		ret = -ENOMEM;
 		goto out;

From xhejtman at ics.muni.cz Tue Jul 10 07:14:09 2007
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Tue, 10 Jul 2007 16:14:09 +0200
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: 
References: <20070709213913.GB20052@mellanox.co.il>
Message-ID: <20070710141409.GH3885@ics.muni.cz>

On Mon, Jul 09, 2007 at 11:48:06PM -0700, Roland Dreier wrote:
> Yes, for example on ppc 4xx the amount of coherent memory is quite
> small by default (address space for non-cached mappings is actually
> what is limited, but it amounts to the same thing).
>
> Maybe the least bad solution is to change to using dma_map_single()
> instead of pci_map_sg() in mthca_memfree.c.

And what about the attached patch to mthca_memfree? It changes alloc_pages for
pci_alloc_consistent. Using it, I can enable FMR and the driver runs fine.

Indeed, it does not solve problem with dma_sync_single() per se, on the other
hand, with pci_alloc_consistent() swiotlb is not needed thus dma_sync_single()
does nothing. But I agree it is not conceptual.

-- Lukáš Hejtmánek -------------- next part -------------- --- mthca_memfree.c.orig 2007-07-07 01:19:35.988558442 +0200 +++ mthca_memfree.c 2007-07-10 16:00:10.200488265 +0200 @@ -70,36 +70,27 @@ return; list_for_each_entry_safe(chunk, tmp, &icm->chunk_list, list) { - if (coherent) - for (i = 0; i < chunk->npages; ++i) { - buf = lowmem_page_address(chunk->mem[i].page); + for (i = 0; i < chunk->npages; ++i) { + buf = lowmem_page_address(chunk->mem[i].page); + if(coherent) dma_free_coherent(&dev->pdev->dev, chunk->mem[i].length, buf, sg_dma_address(&chunk->mem[i])); - } - else { - if (chunk->nsg > 0) - pci_unmap_sg(dev->pdev, chunk->mem, chunk->npages, - PCI_DMA_BIDIRECTIONAL); - - for (i = 0; i < chunk->npages; ++i) - __free_pages(chunk->mem[i].page, - get_order(chunk->mem[i].length)); + else + pci_free_consistent(dev->pdev, chunk->mem[i].length, buf, sg_dma_address(&chunk->mem[i])); } - kfree(chunk); } kfree(icm); } -static int mthca_alloc_icm_pages(struct scatterlist *mem, int order, gfp_t gfp_mask) +static int mthca_alloc_icm_pages(struct pci_dev *pdev, struct scatterlist *mem, int order, gfp_t gfp_mask) { - mem->page = alloc_pages(gfp_mask, order); - if (!mem->page) + void *buf = pci_alloc_consistent(pdev, PAGE_SIZE << order, &sg_dma_address(mem)); + if (!buf) return -ENOMEM; - - mem->length = PAGE_SIZE << order; - mem->offset = 0; + sg_set_buf(mem, buf, PAGE_SIZE << order); + sg_dma_len(mem) = PAGE_SIZE << order; return 0; } @@ -157,21 +148,13 @@ &chunk->mem[chunk->npages], cur_order, gfp_mask); else - ret = mthca_alloc_icm_pages(&chunk->mem[chunk->npages], + ret = mthca_alloc_icm_pages(dev->pdev, + &chunk->mem[chunk->npages], cur_order, gfp_mask); if (!ret) { ++chunk->npages; - - if (!coherent && chunk->npages == MTHCA_ICM_CHUNK_LEN) { - chunk->nsg = pci_map_sg(dev->pdev, chunk->mem, - chunk->npages, - PCI_DMA_BIDIRECTIONAL); - - if (chunk->nsg <= 0) - goto fail; - } - + ++chunk->nsg; if (chunk->npages == MTHCA_ICM_CHUNK_LEN) chunk = NULL; @@ -183,15 +166,6 @@ 
} } - if (!coherent && chunk) { - chunk->nsg = pci_map_sg(dev->pdev, chunk->mem, - chunk->npages, - PCI_DMA_BIDIRECTIONAL); - - if (chunk->nsg <= 0) - goto fail; - } - return icm; fail: From suri at baymicrosystems.com Tue Jul 10 07:24:11 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Tue, 10 Jul 2007 10:24:11 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131Handling In-Reply-To: <1184066851.25217.468533.camel@hal.voltaire.com> References: <1184066851.25217.468533.camel@hal.voltaire.com> Message-ID: <05b901c7c2fe$0006dd00$1914a8c0@surioffice> Hal: Shouldn't the port be set to "down", I did not think you could set the portstate to "init". Thanks, Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Hal Rosenstock > Sent: Tuesday, July 10, 2007 7:28 AM > To: general at lists.openfabrics.org > Cc: Yevgeny Kliteynik > Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131Handling > > OpenSM/osm_trap_rcv.c: Better trap 131 handling > > When trap 131 occurs, check operational VLs and set port state to INIT > if needed. > > I think this is what Amit was saying should be done in his emails > yesterday on the list. 
> > Signed-off-by: Hal Rosenstock > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > index f912dcd..f79c62f 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( > } > else > { > - /* When babbling port policy option is enabled and > - Threshold for disabling a "babbling" port is exceeded */ > + uint8_t payload[IB_SMP_DATA_SIZE]; > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > + const ib_port_info_t* p_old_pi; > + osm_madw_context_t context; > + > + p_old_pi = &p_physp->port_info; > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > + > + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) > + { > + uint8_t port_state, cur_opvls, opvls; > + > + port_state = ib_port_info_get_port_state(p_old_pi); > + if (port_state != IB_LINK_DOWN) > + { > + /* First, validate OperationalVLs */ > + cur_opvls = ib_port_info_get_op_vls(p_old_pi); > + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp); > + if (opvls != cur_opvls) > + { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3809: " > + "Current OP_VLs %d New OP_VLs %d\n", > + cur_opvls, opvls); > + ib_port_info_set_op_vls(p_pi, opvls); > + } > + > + /* Now, set port to INIT if not already in INIT */ > + if (port_state != IB_LINK_INIT) > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_INIT ); > + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + else > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + > + /* Now, issue set of PortInfo */ > + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp > ) ); > + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > + context.pi_context.set_method = TRUE; > + context.pi_context.update_master_sm_base_lid = FALSE; > + 
context.pi_context.light_sweep = FALSE; > + context.pi_context.active_transition = FALSE; > + > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > + osm_physp_get_dr_path_ptr( p_physp ), > + payload, > + sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32(osm_physp_get_port_num( p_physp )), > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3812: " > + "Request to set PortInfo failed\n" ); > + } > + } > + } > + > + /* When babbling port policy option is enabled and > + Threshold for disabling a "babbling" port is exceeded */ > if ( p_rcv->p_subn->opt.babbling_port_policy && > num_received >= 250 ) > { > - uint8_t payload[IB_SMP_DATA_SIZE]; > - ib_port_info_t* p_pi = (ib_port_info_t*)payload; > - const ib_port_info_t* p_old_pi; > - osm_madw_context_t context; > - > /* If trap 131, might want to disable peer port if available */ > /* but peer port has been observed not to respond to SM requests */ > > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( > p_ntci->data_details.ntc_129_131.port_num > ); > > - p_old_pi = &p_physp->port_info; > - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > - > /* Set port to disabled/down */ > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Jul 10 07:31:15 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 10:31:15 -0400 Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131Handling In-Reply-To: <05b901c7c2fe$0006dd00$1914a8c0@surioffice> References: <1184066851.25217.468533.camel@hal.voltaire.com> 
<05b901c7c2fe$0006dd00$1914a8c0@surioffice> Message-ID: <1184077871.25217.481040.camel@hal.voltaire.com> Suri, On Tue, 2007-07-10 at 10:24, Suresh Shelvapille wrote: > Hal: > > Shouldn't the port be set to "down", I did not think you could set the portstate to "init". Gak.. You are right; I forgot about the valid link state transitions. I will reissue the patch. -- Hal > Thanks, > Suri > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > > Of Hal Rosenstock > > Sent: Tuesday, July 10, 2007 7:28 AM > > To: general at lists.openfabrics.org > > Cc: Yevgeny Kliteynik > > Subject: [ofa-general] [PATCH] OpenSM/osm_trap_rcv.c: Better Trap 131Handling > > > > OpenSM/osm_trap_rcv.c: Better trap 131 handling > > > > When trap 131 occurs, check operational VLs and set port state to INIT > > if needed. > > > > I think this is what Amit was saying should be done in his emails > > yesterday on the list. > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > > index f912dcd..f79c62f 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( > > } > > else > > { > > - /* When babbling port policy option is enabled and > > - Threshold for disabling a "babbling" port is exceeded */ > > + uint8_t payload[IB_SMP_DATA_SIZE]; > > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > + const ib_port_info_t* p_old_pi; > > + osm_madw_context_t context; > > + > > + p_old_pi = &p_physp->port_info; > > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > + > > + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) > > + { > > + uint8_t port_state, cur_opvls, opvls; > > + > > + port_state = ib_port_info_get_port_state(p_old_pi); > > + if (port_state != IB_LINK_DOWN) > > + { > > + /* First, validate OperationalVLs */ > > + cur_opvls = 
ib_port_info_get_op_vls(p_old_pi); > > + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp); > > + if (opvls != cur_opvls) > > + { > > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3809: " > > + "Current OP_VLs %d New OP_VLs %d\n", > > + cur_opvls, opvls); > > + ib_port_info_set_op_vls(p_pi, opvls); > > + } > > + > > + /* Now, set port to INIT if not already in INIT */ > > + if (port_state != IB_LINK_INIT) > > + { > > + ib_port_info_set_port_state( p_pi, IB_LINK_INIT ); > > + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > > + } > > + else > > + { > > + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > > + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > > + } > > + > > + /* Now, issue set of PortInfo */ > > + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp > > ) ); > > + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > > + context.pi_context.set_method = TRUE; > > + context.pi_context.update_master_sm_base_lid = FALSE; > > + context.pi_context.light_sweep = FALSE; > > + context.pi_context.active_transition = FALSE; > > + > > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > > + osm_physp_get_dr_path_ptr( p_physp ), > > + payload, > > + sizeof(payload), > > + IB_MAD_ATTR_PORT_INFO, > > + cl_hton32(osm_physp_get_port_num( p_physp )), > > + CL_DISP_MSGID_NONE, > > + &context ); > > + > > + if( status != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3812: " > > + "Request to set PortInfo failed\n" ); > > + } > > + } > > + } > > + > > + /* When babbling port policy option is enabled and > > + Threshold for disabling a "babbling" port is exceeded */ > > if ( p_rcv->p_subn->opt.babbling_port_policy && > > num_received >= 250 ) > > { > > - uint8_t payload[IB_SMP_DATA_SIZE]; > > - ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > 
- const ib_port_info_t* p_old_pi; > > - osm_madw_context_t context; > > - > > /* If trap 131, might want to disable peer port if available */ > > /* but peer port has been observed not to respond to SM requests */ > > > > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( > > p_ntci->data_details.ntc_129_131.port_num > > ); > > > > - p_old_pi = &p_physp->port_info; > > - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > - > > /* Set port to disabled/down */ > > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Tue Jul 10 07:39:13 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 10:39:13 -0400 Subject: [ofa-general] [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling Message-ID: <1184078350.25217.481568.camel@hal.voltaire.com> OpenSM/osm_trap_rcv.c: Better trap 131 handling When trap 131 occurs, check operational VLs and set port state to DOWN if needed. I think this is what Amit was saying should be done in his emails yesterday on the list (modified by Suri's comment). 
Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index f912dcd..3f60f3d 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( } else { - /* When babbling port policy option is enabled and - Threshold for disabling a "babbling" port is exceeded */ + uint8_t payload[IB_SMP_DATA_SIZE]; + ib_port_info_t* p_pi = (ib_port_info_t*)payload; + const ib_port_info_t* p_old_pi; + osm_madw_context_t context; + + p_old_pi = &p_physp->port_info; + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); + + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) + { + uint8_t port_state, cur_opvls, opvls; + + port_state = ib_port_info_get_port_state(p_old_pi); + if (port_state != IB_LINK_DOWN) + { + /* First, validate OperationalVLs */ + cur_opvls = ib_port_info_get_op_vls(p_old_pi); + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp); + if (opvls != cur_opvls) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3809: " + "Current OP_VLs %d New OP_VLs %d\n", + cur_opvls, opvls); + ib_port_info_set_op_vls(p_pi, opvls); + } + + /* Now, set port to DOWN if not already in INIT */ + if (port_state != IB_LINK_INIT) + { + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + else + { + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + + /* Now, issue set of PortInfo */ + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); + context.pi_context.set_method = TRUE; + context.pi_context.update_master_sm_base_lid = FALSE; + context.pi_context.light_sweep = FALSE; + context.pi_context.active_transition = FALSE; + + status = osm_req_set( 
&p_rcv->p_subn->p_osm->sm.req, + osm_physp_get_dr_path_ptr( p_physp ), + payload, + sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32(osm_physp_get_port_num( p_physp )), + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3812: " + "Request to set PortInfo failed\n" ); + } + } + } + + /* When babbling port policy option is enabled and + Threshold for disabling a "babbling" port is exceeded */ if ( p_rcv->p_subn->opt.babbling_port_policy && num_received >= 250 ) { - uint8_t payload[IB_SMP_DATA_SIZE]; - ib_port_info_t* p_pi = (ib_port_info_t*)payload; - const ib_port_info_t* p_old_pi; - osm_madw_context_t context; - /* If trap 131, might want to disable peer port if available */ /* but peer port has been observed not to respond to SM requests */ @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( p_ntci->data_details.ntc_129_131.port_num ); - p_old_pi = &p_physp->port_info; - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); - /* Set port to disabled/down */ ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); From amitk at mellanox.co.il Tue Jul 10 08:30:22 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Tue, 10 Jul 2007 18:30:22 +0300 Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling References: <1184078350.25217.481568.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com> Hi Hal, One comment, If one of the port is not responsive for some reason, need to move its peer port to DOWN and then check the OPVL, Amit -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, July 10, 2007 5:39 PM To: general at lists.openfabrics.org Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling OpenSM/osm_trap_rcv.c: 
Better trap 131 handling When trap 131 occurs, check operational VLs and set port state to DOWN if needed. I think this is what Amit was saying should be done in his emails yesterday on the list (modified by Suri's comment). Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c index f912dcd..3f60f3d 100644 --- a/opensm/opensm/osm_trap_rcv.c +++ b/opensm/opensm/osm_trap_rcv.c @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( } else { - /* When babbling port policy option is enabled and - Threshold for disabling a "babbling" port is exceeded */ + uint8_t payload[IB_SMP_DATA_SIZE]; + ib_port_info_t* p_pi = (ib_port_info_t*)payload; + const ib_port_info_t* p_old_pi; + osm_madw_context_t context; + + p_old_pi = &p_physp->port_info; + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); + + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) + { + uint8_t port_state, cur_opvls, opvls; + + port_state = ib_port_info_get_port_state(p_old_pi); + if (port_state != IB_LINK_DOWN) + { + /* First, validate OperationalVLs */ + cur_opvls = ib_port_info_get_op_vls(p_old_pi); + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, p_rcv->p_subn, p_physp); + if (opvls != cur_opvls) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3809: " + "Current OP_VLs %d New OP_VLs %d\n", + cur_opvls, opvls); + ib_port_info_set_op_vls(p_pi, opvls); + } + + /* Now, set port to DOWN if not already in INIT */ + if (port_state != IB_LINK_INIT) + { + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + else + { + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); + ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); + } + + /* Now, issue set of PortInfo */ + context.pi_context.node_guid = osm_node_get_node_guid( osm_physp_get_node_ptr( p_physp ) ); + context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); + 
context.pi_context.set_method = TRUE; + context.pi_context.update_master_sm_base_lid = FALSE; + context.pi_context.light_sweep = FALSE; + context.pi_context.active_transition = FALSE; + + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, + osm_physp_get_dr_path_ptr( p_physp ), + payload, + sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32(osm_physp_get_port_num( p_physp )), + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_trap_rcv_process_request: ERR 3812: " + "Request to set PortInfo failed\n" ); + } + } + } + + /* When babbling port policy option is enabled and + Threshold for disabling a "babbling" port is exceeded */ if ( p_rcv->p_subn->opt.babbling_port_policy && num_received >= 250 ) { - uint8_t payload[IB_SMP_DATA_SIZE]; - ib_port_info_t* p_pi = (ib_port_info_t*)payload; - const ib_port_info_t* p_old_pi; - osm_madw_context_t context; - /* If trap 131, might want to disable peer port if available */ /* but peer port has been observed not to respond to SM requests */ @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( p_ntci->data_details.ntc_129_131.port_num ); - p_old_pi = &p_physp->port_info; - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); - /* Set port to disabled/down */ ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); ib_port_info_set_port_phys_state( IB_PORT_PHYS_STATE_DISABLED, p_pi ); From rdreier at cisco.com Tue Jul 10 08:33:01 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 08:33:01 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070710071547.GA3814@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jul 2007 10:15:47 +0300") References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> Message-ID: > What makes you think dma_sync_single_range can't be used on memory mapped > by pci_map_sg/dma_map_sg? 
The fact that it's dma_sync_*SINGLE*_range, and that there's a
separate dma_sync_sg() function defined in DMA-API.txt.

 - R.

From swise at opengridcomputing.com Tue Jul 10 08:50:53 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 10 Jul 2007 10:50:53 -0500
Subject: [ofa-general] 2.6.22 nightly build failure
Message-ID: <4693AADD.5090506@opengridcomputing.com>

Vlad,

Do you know what's failing in the nightly build for 2.6.22?

Steve.

From vlad at mellanox.co.il Tue Jul 10 08:51:49 2007
From: vlad at mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 10 Jul 2007 18:51:49 +0300
Subject: [ofa-general] RE: 2.6.22 nightly build failure
References: <4693AADD.5090506@opengridcomputing.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com>

SDP and RDS

Regards,
Vladimir

> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Tuesday, July 10, 2007 6:51 PM
> To: Vladimir Sokolovsky
> Cc: OpenFabrics General
> Subject: 2.6.22 nightly build failure
>
> Vlad,
>
> Do you know what's failing in the nightly build for 2.6.22?
>
>
> Steve.

From halr at voltaire.com Tue Jul 10 09:23:51 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 12:23:51 -0400
Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
References: <1184078350.25217.481568.camel@hal.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com>
Message-ID: <1184084630.17622.3622.camel@hal.voltaire.com>

Hi Amit,

On Tue, 2007-07-10 at 11:30, Amit Krig wrote:
> Hi Hal,
>
> One comment,
> If one of the port is not responsive for some reason, need to move its
> peer port to DOWN and then check the OPVL,

Guess I'm still not following you exactly yet.
The code here is not determining the port responsiveness.
It is merely triggering off the trap 131, recalculating and resetting OperationalVLs if needed, and taking the port down at the link level which should start it back to active, hopefully now with the proper OperationalVLs. If it is still flooded with trap 131s, it disables the port. -- Hal > > Amit > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, July 10, 2007 5:39 PM > To: general at lists.openfabrics.org > Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi > Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling > > OpenSM/osm_trap_rcv.c: Better trap 131 handling > > When trap 131 occurs, check operational VLs and set port state to DOWN > if needed. > > I think this is what Amit was saying should be done in his emails > yesterday on the list (modified by Suri's comment). > > Signed-off-by: Hal Rosenstock > > diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c > index f912dcd..3f60f3d 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( > } > else > { > - /* When babbling port policy option is enabled and > - Threshold for disabling a "babbling" port is exceeded */ > + uint8_t payload[IB_SMP_DATA_SIZE]; > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > + const ib_port_info_t* p_old_pi; > + osm_madw_context_t context; > + > + p_old_pi = &p_physp->port_info; > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > + > + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) > + { > + uint8_t port_state, cur_opvls, opvls; > + > + port_state = ib_port_info_get_port_state(p_old_pi); > + if (port_state != IB_LINK_DOWN) > + { > + /* First, validate OperationalVLs */ > + cur_opvls = ib_port_info_get_op_vls(p_old_pi); > + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, > p_rcv->p_subn, p_physp); > + if (opvls != cur_opvls) > + { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + 
"__osm_trap_rcv_process_request: ERR 3809: " > + "Current OP_VLs %d New OP_VLs %d\n", > + cur_opvls, opvls); > + ib_port_info_set_op_vls(p_pi, opvls); > + } > + > + /* Now, set port to DOWN if not already in INIT */ > + if (port_state != IB_LINK_INIT) > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > + ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + else > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > + ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + > + /* Now, issue set of PortInfo */ > + context.pi_context.node_guid = osm_node_get_node_guid( > osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.port_guid = osm_physp_get_port_guid( > p_physp ); > + context.pi_context.set_method = TRUE; > + context.pi_context.update_master_sm_base_lid = FALSE; > + context.pi_context.light_sweep = FALSE; > + context.pi_context.active_transition = FALSE; > + > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > + osm_physp_get_dr_path_ptr( p_physp > ), > + payload, > + sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32(osm_physp_get_port_num( > p_physp )), > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3812: " > + "Request to set PortInfo failed\n" ); > + } > + } > + } > + > + /* When babbling port policy option is enabled and > + Threshold for disabling a "babbling" port is exceeded */ > if ( p_rcv->p_subn->opt.babbling_port_policy && > num_received >= 250 ) > { > - uint8_t payload[IB_SMP_DATA_SIZE]; > - ib_port_info_t* p_pi = (ib_port_info_t*)payload; > - const ib_port_info_t* p_old_pi; > - osm_madw_context_t context; > - > /* If trap 131, might want to disable peer port if > available */ > /* but peer port has been observed not to respond to SM > requests */ > > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( > 
p_ntci->data_details.ntc_129_131.port_num > ); > > - p_old_pi = &p_physp->port_info; > - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > - > /* Set port to disabled/down */ > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > > From tziporet at dev.mellanox.co.il Tue Jul 10 09:37:50 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 10 Jul 2007 19:37:50 +0300 Subject: [ofa-general] RE: 2.6.22 nightly build failure In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com> References: <4693AADD.5090506@opengridcomputing.com> <6C2C79E72C305246B504CBA17B5500C901DA0B33@mtlexch01.mtl.com> Message-ID: <4693B5DE.9050500@mellanox.co.il> Vladimir Sokolovsky wrote: > SDP and RDS - are faling on 2.6.22 kernel > > > Regards, > Vladimir > > > Vlad - please fix RDS Jim - please fix SDP Thanks, Tziporet From amitk at mellanox.co.il Tue Jul 10 09:39:56 2007 From: amitk at mellanox.co.il (Amit Krig) Date: Tue, 10 Jul 2007 19:39:56 +0300 Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling References: <1184078350.25217.481568.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com> <1184084630.17622.3622.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com> Hi Hal, The watchdog mechanism may cause some hard time to communicate with the end node, that is the reason I suggest to bring down its peer port and by that stop the physical link from retraining all the time. 
Amit -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, July 10, 2007 7:24 PM To: Amit Krig Cc: general at lists.openfabrics.org; Suresh Shelvapille; Yevgeny Kliteynik; Eitan Zahavi Subject: RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling Hi Amit, On Tue, 2007-07-10 at 11:30, Amit Krig wrote: > Hi Hal, > > One comment, > If one of the port is not responsive for some reason, need to move its > peer port to DOWN and then check the OPVL, Guess I'm still not following you exactly yet. The code here is not determining the port responsiveness. It is merely triggering off the trap 131, recalculating and resetting OperationalVLs if needed, and taking the port down at the link level which should start it back to active, hopefully now with the proper OperationalVLs. If it is still flooded with trap 131s, it disables the port. -- Hal > > Amit > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, July 10, 2007 5:39 PM > To: general at lists.openfabrics.org > Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi > Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling > > OpenSM/osm_trap_rcv.c: Better trap 131 handling > > When trap 131 occurs, check operational VLs and set port state to DOWN > if needed. > > I think this is what Amit was saying should be done in his emails > yesterday on the list (modified by Suri's comment). 
> > Signed-off-by: Hal Rosenstock > > diff --git a/opensm/opensm/osm_trap_rcv.c > b/opensm/opensm/osm_trap_rcv.c index f912dcd..3f60f3d 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( > } > else > { > - /* When babbling port policy option is enabled and > - Threshold for disabling a "babbling" port is exceeded */ > + uint8_t payload[IB_SMP_DATA_SIZE]; > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > + const ib_port_info_t* p_old_pi; > + osm_madw_context_t context; > + > + p_old_pi = &p_physp->port_info; > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > + > + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) > + { > + uint8_t port_state, cur_opvls, opvls; > + > + port_state = ib_port_info_get_port_state(p_old_pi); > + if (port_state != IB_LINK_DOWN) > + { > + /* First, validate OperationalVLs */ > + cur_opvls = ib_port_info_get_op_vls(p_old_pi); > + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, > p_rcv->p_subn, p_physp); > + if (opvls != cur_opvls) > + { > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3809: " > + "Current OP_VLs %d New OP_VLs %d\n", > + cur_opvls, opvls); > + ib_port_info_set_op_vls(p_pi, opvls); > + } > + > + /* Now, set port to DOWN if not already in INIT */ > + if (port_state != IB_LINK_INIT) > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > + ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + else > + { > + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE ); > + ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > + } > + > + /* Now, issue set of PortInfo */ > + context.pi_context.node_guid = osm_node_get_node_guid( > osm_physp_get_node_ptr( p_physp ) ); > + context.pi_context.port_guid = osm_physp_get_port_guid( > p_physp ); > + context.pi_context.set_method = TRUE; > + context.pi_context.update_master_sm_base_lid = FALSE; > + 
context.pi_context.light_sweep = FALSE; > + context.pi_context.active_transition = FALSE; > + > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > + osm_physp_get_dr_path_ptr( > + p_physp > ), > + payload, > + sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + > + cl_hton32(osm_physp_get_port_num( > p_physp )), > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status != IB_SUCCESS ) > + { > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > + "__osm_trap_rcv_process_request: ERR 3812: " > + "Request to set PortInfo failed\n" ); > + } > + } > + } > + > + /* When babbling port policy option is enabled and > + Threshold for disabling a "babbling" port is exceeded */ > if ( p_rcv->p_subn->opt.babbling_port_policy && > num_received >= 250 ) > { > - uint8_t payload[IB_SMP_DATA_SIZE]; > - ib_port_info_t* p_pi = (ib_port_info_t*)payload; > - const ib_port_info_t* p_old_pi; > - osm_madw_context_t context; > - > /* If trap 131, might want to disable peer port if > available */ > /* but peer port has been observed not to respond to SM > requests */ > > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( > p_ntci->data_details.ntc_129_131.port_num > ); > > - p_old_pi = &p_physp->port_info; > - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > - > /* Set port to disabled/down */ > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > ib_port_info_set_port_phys_state( > IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > > From weiny2 at llnl.gov Tue Jul 10 09:46:59 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 10 Jul 2007 09:46:59 -0700 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com> <46826FB8.10904@hp.com> <46827BA0.6070008@hp.com> <1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com> <1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> 
<1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> Message-ID: <20070710094659.50df9b39.weiny2@llnl.gov> On Thu, 28 Jun 2007 10:24:59 +0300 "Eitan Zahavi" wrote: > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > > > In the last months it is the second time I hear people > > complaining the > > > current monitoring solution in OFA is integrated with OpenSM. > > > > I must have missed this both times (didn't see this in Mark's > > post) and the statement itself is somewhat inaccurate as well. > Private talks - I hope they will speak up for themselves now... > > > > > These people do not use OpenSM but do use OFED. > > > > I'm not sure I'm following what you mean here. > > > > If you mean that some people want to run PerfMgr without the > > SM/SA aspects (so that they can run a vendor based SM), that > > is the next thing we are adding to the implementation. > Exactly. OK when is that coming? There is very little which ties the current PerfMgr to OpenSM. Basically it just gets the current fabric topology. As Hal has said changes are coming. > > > > > > Another drawback if that > > > no naming is provided and the reporting uses GUIDs. > > > > Naming is provided via NodeDescription. > This might be good for hosts but is not covering switches ... It does include switches. However, since most systems have the same name for multiple switches this becomes ineffective. I have queried Voltaire for a way to change the NodeDescription for switches, but at the time I asked, there was no way to do it. Perhaps there is now? What about other vendors? This is why ibnetdiscover and other diags have "switch map" support. (A GUID->name mapping to override the default NodeDescription.) Nothing would please me more than to be able to remove that for a more "automatic" solution. 
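The "switch map" idea described above, a GUID-to-name table that overrides the default NodeDescription, can be sketched roughly as follows. This is only an illustration and not the ibnetdiscover implementation: the line format (0x<guid> "<name>") and the names switch_map_entry, parse_map_line and node_name are assumptions made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NAME_MAX_LEN 64

/* One GUID -> name override (hypothetical layout, for illustration only). */
struct switch_map_entry {
    uint64_t guid;
    char name[NAME_MAX_LEN];
};

/* Parse one map line of the ASSUMED form:  0x<guid-hex> "<name>"
 * Returns 1 on success, 0 if the line does not match. */
static int parse_map_line(const char *line, struct switch_map_entry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%" SCNx64 " \"%63[^\"]\"",
                  &out->guid, out->name) == 2;
}

/* Return the mapped name for guid, or fall back to the NodeDescription
 * reported by the node when no override exists. */
static const char *node_name(const struct switch_map_entry *map, size_t n,
                             uint64_t guid, const char *node_desc)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].guid == guid)
            return map[i].name;
    return node_desc;
}
```

With a table like this, several switches that all report the same NodeDescription can still be told apart in reports, at the cost of maintaining the map by hand, which is the burden an "automatic" solution would remove.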
> > > > > I also can't hold myself from saying again I think you are going to > > > hit the wall with the concept of doing the PMA from a single node. > > > > If you are referring to the fact the PerMgr is currently not > > distributed, that will be done as has been stated before. > Good. When is it expected? Will it be OFED 1.3? When Hal first sent out the PerfMgr design I thought we should jump right to the distributed model as well. But now I am glad we have gone the way we did. First off, we have something which "works" and from which we can expand. Second, I have run some tests querying the fabric of our large clusters here (~500 nodes) and the results were promising for a single node implementation. I don't recall the numbers as this was a while ago but it was on the order of <2 sec and I think <1 but I don't want to be misquoted. For sure, a distributed model offers many advantages and we will get there. But for many the current single node approach should work just fine. Thanks, Ira > > Thanks > > > > -- Hal > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect Mellanox > > Technologies > > > LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > From: general-bounces at lists.openfabrics.org > > > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal > > > > Rosenstock > > > > Sent: Wednesday, June 27, 2007 8:12 PM > > > > To: Mark Seger > > > > Cc: Finn, Ed; general at lists.openfabrics.org > > > > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > > > > > >The performance managers deal with the counter stickiness (by > > > > > >resetting them when they think they need to). They > > > > typically export > > > > > >their data although this is not specified by IBA so it is > > > > in a vendor > > > > > >proprietary manner. 
> > > > > > > > > > > > > > > > > so I guess these guys are poor citizens as well... > > > > > > > > Not sure what you mean. > > > > > > > > > the real issue as I see it then means nobody can trust > > the data if > > > > > randon tools randomly reset the counters. a real shame... > > > > > > > > I consider this to be a real rather than random app for this. > > > > Guess it depends on what one considers random. > > > > > > > > -- Hal > > > > > > > > > -mark > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vuhuong at mellanox.com Tue Jul 10 09:55:00 2007 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 10 Jul 2007 09:55:00 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <1184042252.15067.8.camel@gentoo-linux.localdomain> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> <1184042252.15067.8.camel@gentoo-linux.localdomain> Message-ID: <4693B9E4.1070001@mellanox.com> > Added a new wiki page based on Vu Pham's readme and issues with recent > kernels. I hope to keep it current as I get our targets up and running. > Thanks for doing this. Please use the latest readme from this link - http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt > http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation > > > WinIB initiators --> Gentoo Linux SRP Target. > I mainly test linux initiators with gen2 srp-target. I have not tested win srp initiator with the target. 
> Anything wrong with the above approach, I would be interested in a best > practices if there is one. I saw a CentOS target post, is this more > stable or better performing? There is no difference when you run the same srp target / scst codes in CentOS or RH/SuSe linux distributions. The storage back-end will determine the performance -vu > > Thanks. > > On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: >> Stanley Sufficool wrote: >> > Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch >> > >> > Got the latest srpt from the git repository on OpenFabrics and had the >> > following issues. >> > >> > ib_srpt.c Line 1997, missing second argument, should be? >> > sdev->scst_tgt = scst_register(tp, NULL); >> > >> >> Yes. You need the change if you test with top of scst svn >> trunk (or from version 0.9.6-pre2) >> If you test with scst before 0.9.6-pre2 (ie. version <= >> 0.9.6-pre1) you don't need the second argument for >> scst_register() >> >> >> > SCST was built successfully after fixing an issue in scst_vdisk.c >> > (missing #include ) >> >> >> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX >> - you should send the patch to scst devel >> >> > >> > Just thought this would be nice to have documented, took me half a day >> > to track down as a novice in C programming. 
>> > >> >> there is *lean and mean* srpt's README in srpt_inc >> SCST also has some document >> You can add some wiki/notes for the problems in openfabrics >> wiki page https://wiki.openfabrics.org/tiki-index.php >> >> -vu >> >> > >> > ------------------------------------------------------------------------ >> > >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From halr at voltaire.com Tue Jul 10 10:04:57 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 13:04:57 -0400 Subject: [ofa-general] RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com> References: <1184078350.25217.481568.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE113F@mtlexch01.mtl.com> <1184084630.17622.3622.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901BE114A@mtlexch01.mtl.com> Message-ID: <1184087090.17622.6458.camel@hal.voltaire.com> Hi Amit, On Tue, 2007-07-10 at 12:39, Amit Krig wrote: > Hi Hal, > > The watchdog mechanism may cause some hard time to communicate with the > end node, that is the reason I suggest to bring down its peer port and > by that stop the physical link from retraining all the time. The patch uses the port indicated in the trap. Are you saying sometimes that port will not be responsive to SMA requests (and in those cases the peer should be used or at least tried) ? 
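The trap 131 handling under discussion in this thread (recalculate OperationalVLs, take the link down so it retrains, and disable the port if the traps keep flooding in) can be modeled as a small decision function. This is a standalone sketch, not OpenSM code: trap131_action and trap131_decide are hypothetical names, and only the threshold of 250 and the state transitions are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* PortInfo link-state codes (0 means "no change"); local to this sketch. */
enum link_state {
    LINK_NO_CHANGE = 0, LINK_DOWN = 1, LINK_INIT = 2,
    LINK_ARMED = 3, LINK_ACTIVE = 4
};

#define BABBLING_THRESHOLD 250

struct trap131_action {
    int send_reset;        /* issue the PortInfo Set that bounces the link */
    uint8_t op_vls;        /* OperationalVLs value to write */
    enum link_state link;  /* link state to request in that Set */
    int disable_port;      /* babbling threshold exceeded: disable the port */
};

static struct trap131_action
trap131_decide(enum link_state cur_state, uint8_t cur_opvls,
               uint8_t calc_opvls, int babbling_policy,
               unsigned num_received)
{
    struct trap131_action a = { 0, cur_opvls, LINK_NO_CHANGE, 0 };

    /* OperationalVLs is a 4-bit field in PortInfo. */
    assert(cur_opvls <= 15 && calc_opvls <= 15);

    if (cur_state != LINK_DOWN) {
        a.send_reset = 1;
        /* Writing the recalculated value is a no-op when it already matches. */
        a.op_vls = calc_opvls;
        /* A port already in INIT is left alone; anything else goes DOWN. */
        a.link = (cur_state != LINK_INIT) ? LINK_DOWN : LINK_NO_CHANGE;
    }

    /* Independently of the reset, a port that keeps trapping is disabled. */
    a.disable_port = babbling_policy && num_received >= BABBLING_THRESHOLD;
    return a;
}
```

Note that the sketch only ever acts on the port named in the trap. Whether the peer port should be used instead when that port does not answer SMA requests is exactly the open question here.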
-- Hal > Amit > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, July 10, 2007 7:24 PM > To: Amit Krig > Cc: general at lists.openfabrics.org; Suresh Shelvapille; Yevgeny > Kliteynik; Eitan Zahavi > Subject: RE: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling > > Hi Amit, > > On Tue, 2007-07-10 at 11:30, Amit Krig wrote: > > Hi Hal, > > > > One comment, > > If one of the port is not responsive for some reason, need to move its > > > peer port to DOWN and then check the OPVL, > > Guess I'm still not following you exactly yet. > > The code here is not determining the port responsiveness. It is merely > triggering off the trap 131, recalculating and resetting OperationalVLs > if needed, and taking the port down at the link level which should start > it back to active, hopefully now with the proper OperationalVLs. If it > is still flooded with trap 131s, it disables the port. > > -- Hal > > > > > Amit > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, July 10, 2007 5:39 PM > > To: general at lists.openfabrics.org > > Cc: Suresh Shelvapille; Amit Krig; Yevgeny Kliteynik; Eitan Zahavi > > Subject: [PATCHv2] OpenSM/osm_trap_rcv.c: Better Trap 131 Handling > > > > OpenSM/osm_trap_rcv.c: Better trap 131 handling > > > > When trap 131 occurs, check operational VLs and set port state to DOWN > > > if needed. > > > > I think this is what Amit was saying should be done in his emails > > yesterday on the list (modified by Suri's comment). 
> > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/opensm/opensm/osm_trap_rcv.c > > b/opensm/opensm/osm_trap_rcv.c index f912dcd..3f60f3d 100644 > > --- a/opensm/opensm/osm_trap_rcv.c > > +++ b/opensm/opensm/osm_trap_rcv.c > > @@ -550,16 +550,76 @@ __osm_trap_rcv_process_request( > > } > > else > > { > > - /* When babbling port policy option is enabled and > > - Threshold for disabling a "babbling" port is exceeded */ > > + uint8_t payload[IB_SMP_DATA_SIZE]; > > + ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > + const ib_port_info_t* p_old_pi; > > + osm_madw_context_t context; > > + > > + p_old_pi = &p_physp->port_info; > > + memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > + > > + if (p_ntci->g_or_v.generic.trap_num == CL_HTON16(131)) > > + { > > + uint8_t port_state, cur_opvls, opvls; > > + > > + port_state = ib_port_info_get_port_state(p_old_pi); > > + if (port_state != IB_LINK_DOWN) > > + { > > + /* First, validate OperationalVLs */ > > + cur_opvls = ib_port_info_get_op_vls(p_old_pi); > > + opvls = osm_physp_calc_link_op_vls(p_rcv->p_log, > > p_rcv->p_subn, p_physp); > > + if (opvls != cur_opvls) > > + { > > + osm_log(p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3809: " > > + "Current OP_VLs %d New OP_VLs %d\n", > > + cur_opvls, opvls); > > + ib_port_info_set_op_vls(p_pi, opvls); > > + } > > + > > + /* Now, set port to DOWN if not already in INIT */ > > + if (port_state != IB_LINK_INIT) > > + { > > + ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > + ib_port_info_set_port_phys_state( > > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > > + } > > + else > > + { > > + ib_port_info_set_port_state( p_pi, IB_LINK_NO_CHANGE > ); > > + ib_port_info_set_port_phys_state( > > IB_PORT_PHYS_STATE_NO_CHANGE, p_pi ); > > + } > > + > > + /* Now, issue set of PortInfo */ > > + context.pi_context.node_guid = osm_node_get_node_guid( > > osm_physp_get_node_ptr( p_physp ) ); > > + context.pi_context.port_guid = osm_physp_get_port_guid( 
> > p_physp ); > > + context.pi_context.set_method = TRUE; > > + context.pi_context.update_master_sm_base_lid = FALSE; > > + context.pi_context.light_sweep = FALSE; > > + context.pi_context.active_transition = FALSE; > > + > > + status = osm_req_set( &p_rcv->p_subn->p_osm->sm.req, > > + osm_physp_get_dr_path_ptr( > > + p_physp > > ), > > + payload, > > + sizeof(payload), > > + IB_MAD_ATTR_PORT_INFO, > > + > > + cl_hton32(osm_physp_get_port_num( > > p_physp )), > > + CL_DISP_MSGID_NONE, > > + &context ); > > + > > + if( status != IB_SUCCESS ) > > + { > > + osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > + "__osm_trap_rcv_process_request: ERR 3812: > " > > + "Request to set PortInfo failed\n" ); > > + } > > + } > > + } > > + > > + /* When babbling port policy option is enabled and > > + Threshold for disabling a "babbling" port is exceeded */ > > if ( p_rcv->p_subn->opt.babbling_port_policy && > > num_received >= 250 ) > > { > > - uint8_t payload[IB_SMP_DATA_SIZE]; > > - ib_port_info_t* p_pi = (ib_port_info_t*)payload; > > - const ib_port_info_t* p_old_pi; > > - osm_madw_context_t context; > > - > > /* If trap 131, might want to disable peer port if > > available */ > > /* but peer port has been observed not to respond to SM > > requests */ > > > > @@ -570,9 +630,6 @@ __osm_trap_rcv_process_request( > > p_ntci->data_details.ntc_129_131.port_num > > ); > > > > - p_old_pi = &p_physp->port_info; > > - memcpy( payload, p_old_pi, sizeof(ib_port_info_t) ); > > - > > /* Set port to disabled/down */ > > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > ib_port_info_set_port_phys_state( > > IB_PORT_PHYS_STATE_DISABLED, p_pi ); > > > > > > > From mst at dev.mellanox.co.il Tue Jul 10 10:11:42 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 10 Jul 2007 20:11:42 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To:
References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il>
Message-ID: <20070710171142.GC11320@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: mthca use of dma_sync_single is bogus
>
> > What makes you think dma_sync_single_range can't be used on memory mapped
> > by pci_map_sg/dma_map_sg?
>
> The fact that it's dma_sync_*SINGLE*_range, and that there's a
> separate dma_sync_sg() function defined in DMA-API.txt.

Aha. I looked at the code a bit.
Basically it seems that some architectures use the DMA handle and some the virtual address to flush the cache; that's where the requirement that the same parameters are used for sync single as for map single comes from.

So it seems that this requirement does not apply to s/g, and that we can just build a scatterlist structure and do dma_sync_sg?

-- MST

From rick.jones2 at hp.com Tue Jul 10 10:13:39 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Tue, 10 Jul 2007 10:13:39 -0700
Subject: [ofa-general] minor usability nit with 1.2GA?
In-Reply-To: <469372CC.5060207@dev.mellanox.co.il>
References: <4692CC7A.2050704@hp.com> <469372CC.5060207@dev.mellanox.co.il>
Message-ID: <4693BE43.8070905@hp.com>

Vladimir Sokolovsky wrote:
> Rick Jones wrote:
>
>> So I was blithely running my netperf tests after resolving the problem
>> with the existence of irqbalance. I finished my TCP tests and was
>> about to run the SDP tests. I'd not modprobe'd the ib_sdp module, so
>> my netperf tests died. I then did the modprobe and it complained
>> about symbol versions.
>>
>> Turns out - or at least it seems that way - that my selection of just
>> "basic" software didn't include SDP. That's fine I suppose, but what
>> happened then was I was left with a system with a hybrid of the
>> previous OFED whatever bits (probably an RC for 1.2) and OFED GA bits.
>>
>> Perhaps this is simply "caveat emptor" but shouldn't there be some
>> sort of warning/check that in only doing the partial install there
>> would be some incompatible modules left lying around? Or should I
>> just do the "give me everything" option, shut up and benchmark?-)
>>
>
> Hi,
> OFED removes the previous software before installing the new one.
> So, there shouldn't be a mix of different OFED versions on the same
> machine.
>
> Can you send me the output of the following commands:
> # modinfo ib_sdp
> # rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the
> previous command)
> # rpm -q kernel-ib
> # ofed_info

I can, but at this point I'm not sure what it would show since I went back and did a "build me one with everything" install on both my systems. If you still want to see it I can do that though.

rick

>
>
> Thanks,
> Vladimir

From tziporet at dev.mellanox.co.il Tue Jul 10 10:17:59 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 10 Jul 2007 20:17:59 +0300
Subject: [ofa-general] OFED 1.3 timeline
Message-ID: <4693BF47.8070700@mellanox.co.il>

Hi All,

Based on the requests to have the OFED 1.3 release this year, the release schedule is as follows:

* Feature freeze - Sep 4
* Alpha release - Sep 10
* Beta release - Sep 25
* RC1 - Oct 16
* RC2 - Oct 30
* RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
* RC4 - Nov 22
* GA release - Nov 30 (or first week of Dec)

To make this schedule we must implement all major changes for the package during July, so that we have a stable package by the middle of August.
We must also keep the new features under control and not insert unnecessary changes that are not in the features list.

The full features list will be published in a separate mail.

Tziporet.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Tue Jul 10 11:04:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:04:40 -0700
Subject: [ofa-general] Re: [PATCH] IB/core: Fix the used pointer when calling to kmalloc
In-Reply-To: <200707101655.58041.dotanb@dev.mellanox.co.il> (Dotan Barak's message of "Tue, 10 Jul 2007 16:55:57 +0300")
References: <200707101655.58041.dotanb@dev.mellanox.co.il>
Message-ID:

thanks, applied.

From rdreier at cisco.com Tue Jul 10 11:06:29 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:06:29 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710141409.GH3885@ics.muni.cz> (Lukas Hejtmanek's message of "Tue, 10 Jul 2007 16:14:09 +0200")
References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz>
Message-ID:

 > And what about the attached patch to mthca_memfree? It changes alloc_pages for
 > pci_alloc_consistent. Using it, I can enable FMR and the driver runs fine.

As Michael said, this uses a lot of consistent memory. Probably too much on some systems.

 - R.

From rdreier at cisco.com Tue Jul 10 11:09:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:09:01 -0700
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To: <20070710171142.GC11320@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jul 2007 20:11:42 +0300")
References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> <20070710171142.GC11320@mellanox.co.il>
Message-ID:

 > Aha. I looked at the code a bit.
 > Basically it seems that some architectures use the dma handle
 > and some the virtual address to flush the cache, that's
 > where the requirement that same parameters are used for
 > sync single as for map single comes from.
 >
 > So it seems that this requirement does not apply to s/g, and that we can just
 > build a scatterlist structure and do dma_sync_sg?

The statement

    synchronise a single contiguous or scatter/gather mapping. All the
    parameters must be the same as those passed into the single mapping
    API.

in DMA-API.txt also is clearly attached to dma_sync_sg(). So I don't think it's a good idea to rely on being able to sync a different scatterlist than the one that was originally mapped.

It actually doesn't look too bad to replace our use of pci_map_sg() with dma_map_single(), at least at first glance. I'll try to write a patch later.

 - R.

From halr at voltaire.com Tue Jul 10 11:23:53 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Jul 2007 14:23:53 -0400
Subject: [ofa-general] Toward next OFED release (1.3)
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com>
Message-ID: <1184091830.17622.12007.camel@hal.voltaire.com>

On Tue, 2007-06-26 at 10:27, Tziporet Koren wrote:
> Hi All,
>
> On next Monday we will have the first meeting to close OFED 1.3
> features and schedule.
> As a preparation I send here the list of features we already reviewed
> in Sonoma, and other features I see in progress on the general list
> discussions.
>
> I know this is a long mail :-( but I ask each of the
> maintainers/customers to review this list and send comments and other
> requests.
>
> There are some ULPs that I placed "?" and the owner should review and
> reply with the plans.
>
> Thanks,
> Tziporet
>
>
> Main New Features
> ==============
> Base kernel: 2.6.23 (we will start with 2.6.22 but will move to
> 2.6.23)
> Install:
> * Minimize integration effort into OS distribution
> * Break the packages RPMs (work with Novell and Redhat)
>
> Package:
> * Sources arrangement for the end user (for the labs)
> * Reduce compilation warnings
>
> QoS:
> * OSM
> * CM & CMA
> * ULPs: SDP, SRP, IPoIB, RDS?
>
> Core:
> * Updated SA cache
> * User space events registration
> * Preparations for IB routers
>
> libibverbs:
> * New verbs:
>     * Scalable Reliable Connected Transport (with Mellanox
>       ConnectX)
>     * Shared Send Queue
>     * Reliable Multicast ?
>
> Management:
> * Multiple partitions
> * OpenSM
> * More routing performance improvements
> * Even more speedups
> * Better packaging/installation
> * “Native” daemon mode
> * Performance management
> * Quality of Service manager: Based on IBTA annex

enhancements for fat tree routing (non pure tree support)
more console commands and telnet access to console

> * More diagnostics - Hal please update

ibsim - IB management simulator which can be used without OpenSM recompilation and supports the diag tools
ibidsverify.pl: validate LIDs and GUIDs in subnet
Updated ibnetdiscover format with link width and speed, and GUIDs
ibnetdiscover grouping support for new Voltaire chassis
diag updates for IB router support
iblinkinfo.pl: Support peer port link width and speed validation
ibdatacounters: Add script and man page for subnet wide data counters
saquery enhancements

> ULPs:
> * IPoIB: NAPI; CM in GA; Bonding in GA
> * NFS over RDMA integration
> * RDS: RDMA API (using FMRs); GA quality with Oracle 11
> * SDP: Keepalive; Asynch IO (Zero Copy)
> * SRP: HA in GA
> * VNIC: ? Qlogic - please update
> * iSER: ? Voltaire - please update
> * uDAPL - ? Arlin please update
>
> iWARP: (Steve please update if needed)
> * iwarp-specific verbs
> * iwarp-specific async events
> * API for MPA options (CRC/Markers)
> * API for streaming mode IO (needed for compliant iSER)
> * Possibly other ULPs (RDS, SDP, iSER)
>
> MPIs:
> Integrate the new MPI releases that are on time for OFED 1.3
> * Jeff - please update about Open MPI
> * DK: Please update regarding MVAPICH and MVAPICH2
>
> OFED 1.3 System Matrix
> * CPU Arch: X86, x86_64, PPC64, ia64
> * kernel.org: kernel 2.6.23
> * Novell: SLES 10; SLES 10 SP1
> * Redhat: RHEL 4 (up4 and up5); RHEL 5 (can we drop RHEL4up4
>   since up6 will probably be out till this release is out?)
> * Free distros (Fedora, SuSE Pro, Ubuntu) - basic testing only
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380

From rdreier at cisco.com Tue Jul 10 11:29:38 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:29:38 -0700
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To: <20070710071912.GB3814@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 10 Jul 2007 10:19:12 +0300")
References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il> <20070710071912.GB3814@mellanox.co.il>
Message-ID:

 > That one is actually not very different from sysfs:
 > there just seems to be a set of pre-defined files.

I thought there was a special system call to create stuff or something. Anyway I haven't looked in a long time.

 > The special nature of your suggested filesystem would be
 > that we actually let users create files there,
 > but then files need to disappear when the last user
 > closes the file.

Yes, that's true. Phrased that way it does seem tricky.

From mst at dev.mellanox.co.il Tue Jul 10 11:30:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 21:30:06 +0300
Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus
In-Reply-To:
References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> <20070710171142.GC11320@mellanox.co.il>
Message-ID: <20070710183006.GE11320@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [ofa-general] Re: mthca use of dma_sync_single is bogus
>
> > Aha. I looked at the code a bit.
> > Basically it seems that some architectures use the dma handle
> > and some the virtual address to flush the cache, that's
> > where the requirement that same parameters are used for
> > sync single as for map single comes from.
> >
> > So it seems that this requirement does not apply to s/g, and that we can just
> > build a scatterlist structure and do dma_sync_sg?
>
> The statement
>
>     synchronise a single contiguous or scatter/gather mapping. All the
>     parameters must be the same as those passed into the single mapping
>     API.
>
> in DMA-API.txt also is clearly attached to dma_sync_sg(). So I don't
> think it's a good idea to rely on being able to sync a different
> scatterlist than the one that was originally mapped.

Hmm. This means there's no way to sync a range within a mapping created with map_sg?

> It actually doesn't look too bad to replace our use of pci_map_sg()
> with dma_map_single(), at least at first glance. I'll try to write a
> patch later.

Well, the reason map_sg is there is presumably because on some architectures it's worth it to try and make the region contiguous in DMA space. But I agree this seems the lesser evil at this point ...

-- MST

From rdreier at cisco.com Tue Jul 10 11:31:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 10 Jul 2007 11:31:36 -0700
Subject: [ofa-general] Re: consumer data buffer ownership for inline sends
In-Reply-To: (Or Gerlitz's message of "Tue, 3 Jul 2007 11:50:52 +0300 (IDT)")
References:
Message-ID:

 > Does this mean that for inline sends, when ibv_post_send returns,
 > the consumer owns back the data buffer associated with this send?
 >
 > Can this be stated as the official policy of libibverbs?

This does seem fine. Can you send a documentation patch stating this?

From mst at dev.mellanox.co.il Tue Jul 10 11:37:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 10 Jul 2007 21:37:05 +0300
Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects
In-Reply-To:
References: <20070625130604.GH15343@mellanox.co.il> <20070626070641.GM15343@mellanox.co.il> <20070630222419.GE7554@mellanox.co.il> <20070702195927.GB31169@mellanox.co.il> <20070710071912.GB3814@mellanox.co.il>
Message-ID: <20070710183705.GF11320@mellanox.co.il>

> > The special nature of your suggested filesystem would be
> > that we actually let users create files there,
> > but then files need to disappear when the last user
> > closes the file.
>
> Yes, that's true. Phrased that way it does seem tricky.

OK, so how about the idea to just pass in *any* fd, and just create a mapping between an inode and src domain (or other shared object), by means of a radix tree or something like this. We can then use the mapping to check permissions when new processes want to attach to an existing object.

Hmm?

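The "pass in any fd" idea can be illustrated with a small userspace toy. It is only a sketch of the concept and not a proposed kernel interface: share_object, attach_object and the fixed-size table are hypothetical, a linear table stands in for the radix tree, and the lifetime of entries (removal when the last user goes away) is not modeled. The point it shows is that any descriptor referring to the same file resolves to the same shared object, so the ability to open that file acts as the permission check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_SHARED 16

/* Hypothetical table entry: (device, inode) -> shared object id. */
struct shared_entry {
    dev_t dev;
    ino_t ino;
    int object_id;   /* stand-in for a shared domain or other IB object */
    int in_use;
};

static struct shared_entry table[MAX_SHARED];

/* Bind object_id to the inode behind fd. Returns 0 on success, -1 on error. */
static int share_object(int fd, int object_id)
{
    struct stat st;

    assert(object_id >= 0);   /* -1 is reserved as the "not found" result */
    if (fstat(fd, &st) != 0)
        return -1;
    for (size_t i = 0; i < MAX_SHARED; i++) {
        if (!table[i].in_use) {
            table[i] = (struct shared_entry){ st.st_dev, st.st_ino,
                                              object_id, 1 };
            return 0;
        }
    }
    return -1;   /* table full */
}

/* Look up the object bound to the inode behind fd; -1 if there is none. */
static int attach_object(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    for (size_t i = 0; i < MAX_SHARED; i++)
        if (table[i].in_use && table[i].dev == st.st_dev &&
            table[i].ino == st.st_ino)
            return table[i].object_id;
    return -1;
}

/* Usage: two descriptors for one file resolve to the same object,
 * a descriptor for an unrelated file does not. Returns 0 if so. */
static int demo(void)
{
    FILE *a = tmpfile(), *b = tmpfile();
    if (!a || !b)
        return -1;

    int fd2 = dup(fileno(a));
    int ok = fd2 >= 0
          && share_object(fileno(a), 42) == 0
          && attach_object(fd2) == 42          /* same inode, different fd */
          && attach_object(fileno(b)) == -1;   /* unrelated file */

    if (fd2 >= 0)
        close(fd2);
    fclose(a);
    fclose(b);
    return ok ? 0 : -1;
}
```

Keying on the inode rather than on a path means no special filesystem is needed, which sidesteps the "files need to disappear when the last user closes the file" problem raised earlier.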
-- MST From sashak at voltaire.com Tue Jul 10 11:52:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 10 Jul 2007 21:52:36 +0300 Subject: [ofa-general] [PATCH] opensm/updn: up/down root switches detector fix Message-ID: <20070710185236.GW25653@sashak.voltaire.com> This problem was triggered by min hops generator optimizations where min hop matrices are created for switches only. The up/down root switches auto detector code which uses those tables is outdated, this still try to count hops to CAs directly and now it gets 0xff (no path) only value, as result all fabric switches are considered to be roots. This patch updates root auto detector code according to recent min hops optimizations and fixes the issue. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_updn.c | 70 ++++++++++++++++++++-------------------- 1 files changed, 35 insertions(+), 35 deletions(-) diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c index db8e60a..11f6eb5 100644 --- a/opensm/opensm/osm_ucast_updn.c +++ b/opensm/opensm/osm_ucast_updn.c @@ -718,8 +718,8 @@ __osm_updn_find_root_nodes_by_min_hop( uint8_t maxHops = 0; /* contain the max histogram index */ uint64_t *p_guid; cl_list_t *p_root_nodes_list = p_updn->p_root_nodes; - cl_map_t ca_by_lid_map; /* map holding all CA lids */ - uint16_t self_lid_ho; + unsigned *cas_per_sw; + uint16_t sw_lid_ho; OSM_LOG_ENTER( &p_osm->log, osm_updn_find_root_nodes_by_min_hop ); @@ -729,8 +729,15 @@ __osm_updn_find_root_nodes_by_min_hop( cl_qmap_count(&p_osm->subn.port_guid_tbl) ); /* Init the required vars */ cl_qmap_init( &min_hop_hist ); - cl_map_construct( &ca_by_lid_map ); - cl_map_init( &ca_by_lid_map, 10 ); + + cas_per_sw = malloc((IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw)); + if (!cas_per_sw) { + osm_log( &p_osm->log, OSM_LOG_ERROR, + "__osm_updn_find_root_nodes_by_min_hop: " + "cannot alloc mem for CAs per switch counter array.\n"); + goto _exit; + } + memset(cas_per_sw, 0, (IB_LID_UCAST_END_HO 
+ 1)*sizeof(*cas_per_sw)); /* EZ: p_ca_list = (cl_list_t*)malloc(sizeof(cl_list_t)); @@ -752,21 +759,19 @@ __osm_updn_find_root_nodes_by_min_hop( while( p_next_port != (osm_port_t*)cl_qmap_end( &p_osm->subn.port_guid_tbl ) ) { p_port = p_next_port; p_next_port = (osm_port_t*)cl_qmap_next( &p_next_port->map_item ); - if ( osm_node_get_type(p_port->p_node) != IB_NODE_TYPE_SWITCH ) + if ( !p_port->p_node->sw ) { - p_physp = p_port->p_physp; - self_lid_ho = cl_ntoh16( osm_physp_get_base_lid(p_physp) ); - numCas++; - /* EZ: - self = malloc(sizeof(uint16_t)); - *self = self_lid_ho; - cl_list_insert_tail(p_ca_list, self); - */ - cl_map_insert( &ca_by_lid_map, self_lid_ho, (void *)0x1); + p_physp = p_port->p_physp->p_remote_physp; + if (!p_physp || !p_physp->p_node->sw) + continue; + sw_lid_ho = osm_node_get_base_lid(p_physp->p_node, 0); + sw_lid_ho = cl_ntoh16(sw_lid_ho); osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " - "Inserting GUID 0x%" PRIx64 ", Lid: 0x%X into array\n", - cl_ntoh64(osm_port_get_guid(p_port)), self_lid_ho ); + "Inserting GUID 0x%" PRIx64 ", sw lid: 0x%X into array\n", + cl_ntoh64(osm_port_get_guid(p_port)), sw_lid_ho ); + cas_per_sw[sw_lid_ho]++; + numCas++; } } osm_log( &p_osm->log, OSM_LOG_DEBUG, @@ -792,10 +797,10 @@ __osm_updn_find_root_nodes_by_min_hop( rebuild its FWD tables, post setting Min Hop Tables */ max_lid_ho = p_sw->max_lid_ho; /* Get base lid of switch by retrieving port 0 lid of node pointer */ - self_lid_ho = cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) ); + sw_lid_ho = cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) ); osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " - "Passing through switch lid 0x%X\n", self_lid_ho ); + "Passing through switch lid 0x%X\n", sw_lid_ho ); for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) { /* Skip lids which are not CAs or RTRs - @@ -816,7 +821,7 @@ __osm_updn_find_root_nodes_by_min_hop( } if ( LidFound ) */ - if (cl_map_get( 
&ca_by_lid_map, lid_ho )) + if (cas_per_sw[lid_ho]) { hop_val = osm_switch_get_least_hops( p_sw, lid_ho ); if (hop_val > maxHops) @@ -828,22 +833,19 @@ __osm_updn_find_root_nodes_by_min_hop( /* New entry in the histogram, first create it */ p_updn_hist = (updn_hist_t*) malloc(sizeof(updn_hist_t)); CL_ASSERT(p_updn_hist); - p_updn_hist->bar_value = 1; + p_updn_hist->bar_value = 0; cl_qmap_insert(&min_hop_hist, (uint64_t)hop_val, &p_updn_hist->map_item); osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " - "Creating new entry in histogram %u with bar value 1\n", + "Creating new entry in histogram %u\n", hop_val ); } - else - { - /* Entry exists in the table, just increment the value */ - p_updn_hist->bar_value++; - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Updating entry in histogram %u with bar value %d\n", - hop_val, p_updn_hist->bar_value ); - } + /* Entry exists in the table, just increment the value */ + p_updn_hist->bar_value += cas_per_sw[lid_ho]; + osm_log( &p_osm->log, OSM_LOG_DEBUG, + "__osm_updn_find_root_nodes_by_min_hop: " + "Updating entry in histogram %u with bar value %d\n", + hop_val, p_updn_hist->bar_value ); } } @@ -908,13 +910,11 @@ __osm_updn_find_root_nodes_by_min_hop( } } - /* destroy the map of CA and RTR lids */ - cl_map_remove_all( &ca_by_lid_map ); - cl_map_destroy( &ca_by_lid_map ); - + free(cas_per_sw); /* Now convert the cl_list to array */ __osm_updn_convert_list2array(p_updn); - + + _exit: OSM_LOG_EXIT( &p_osm->log ); return; } -- 1.5.3.rc0.93.ga0f53 From xhejtman at ics.muni.cz Tue Jul 10 12:00:19 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 10 Jul 2007 21:00:19 +0200 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz> Message-ID: <20070710190018.GK3885@ics.muni.cz> On Tue, Jul 10, 2007 at 11:06:29AM -0700, Roland Dreier wrote: > > And 
what about the attached patch to mthca_memfree? It changes alloc_pages > > for pci_alloc_consistent. Using it, I can enable FMR and the driver > > runs fine. > > As Michael said, this uses a lot of consistent memory. Probably too > much on some systems. I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and coherent are the same but on some architectures they are not and I think that using consistent (in particular pci_alloc_consistent) is exactly what should be used. Keir also recommended to use this one. And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right, Keir? -- Lukáš Hejtmánek From rdreier at cisco.com Tue Jul 10 12:08:43 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 12:08:43 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070710190018.GK3885@ics.muni.cz> (Lukas Hejtmanek's message of "Tue, 10 Jul 2007 21:00:19 +0200") References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz> <20070710190018.GK3885@ics.muni.cz> Message-ID: > I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and > coherent are the same but on some architectures they are not and I think that > using consistent (in particular pci_alloc_consistent) is exactly what should > be used. Keir also recommended to use this one. coherent and consistent are synonyms. It's confusing because there is pci_alloc_consistent(), which is in general just a wrapper for dma_alloc_coherent(). > And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right, > Keir? Yes, but I'm not really willing to make things worse for standard i386 just to make Xen work a little better. - R. 
From halr at voltaire.com Tue Jul 10 12:07:28 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 15:07:28 -0400 Subject: [ofa-general] Re: [PATCH] opensm/updn: up/down root switches detector fix In-Reply-To: <20070710185236.GW25653@sashak.voltaire.com> References: <20070710185236.GW25653@sashak.voltaire.com> Message-ID: <1184094445.17622.15024.camel@hal.voltaire.com> On Tue, 2007-07-10 at 14:52, Sasha Khapyorsky wrote: > This problem was triggered by min hops generator optimizations where > min hop matrices are created for switches only. The up/down root > switches auto detector code which uses those tables is outdated, this > still try to count hops to CAs directly and now it gets 0xff (no path) > only value, as result all fabric switches are considered to be roots. > This patch updates root auto detector code according to recent min hops > optimizations and fixes the issue. > > Signed-off-by: Sasha Khapyorsky Thanks! Applied. -- Hal From cap at nsc.liu.se Tue Jul 10 12:11:18 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Tue, 10 Jul 2007 21:11:18 +0200 Subject: [ofa-general] Toward next OFED release (1.3) In-Reply-To: <1184091830.17622.12007.camel@hal.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <1184091830.17622.12007.camel@hal.voltaire.com> Message-ID: <200707102111.28374.cap@nsc.liu.se> On Tuesday 10 July 2007, Hal Rosenstock wrote: ... > > Management: > > * Multiple partitions > > * OpenSM > > * More routing performance improvements > > * Even more speedups > > * Better packaging/installation > > * “Native” daemon mode > > * Performance management > > * Quality of Service manager: Based on IBTA annex > > enhancements for fat tree routing (non pure tree support) > more console commands and telnet access to console Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and in which way OFED-1.2 opensm performs badly for these? 
Or maybe there are some nice docs for me to sink my teeth into... /Peter From halr at voltaire.com Tue Jul 10 12:12:41 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Jul 2007 15:12:41 -0400 Subject: [ofa-general] Toward next OFED release (1.3) In-Reply-To: <200707102111.28374.cap@nsc.liu.se> References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <1184091830.17622.12007.camel@hal.voltaire.com> <200707102111.28374.cap@nsc.liu.se> Message-ID: <1184094759.17622.15371.camel@hal.voltaire.com> On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote: > On Tuesday 10 July 2007, Hal Rosenstock wrote: > ... > > > Management: > > > * Multiple partitions > > > * OpenSM > > > * More routing performance improvements > > > * Even more speedups > > > * Better packaging/installation > > > * “Native” daemon mode > > > * Performance management > > > * Quality of Service manager: Based on IBTA annex > > > > enhancements for fat tree routing (non pure tree support) > > more console commands and telnet access to console > > Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and > in which way OFED-1.2 opensm performs badly for these? Yevgeny, Could you elaborate on this ? Thanks. -- Hal > Or maybe there are some nice docs for me to sink my teeth into...
> > /Peter From xhejtman at ics.muni.cz Tue Jul 10 12:16:39 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 10 Jul 2007 21:16:39 +0200 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz> <20070710190018.GK3885@ics.muni.cz> Message-ID: <20070710191639.GL3885@ics.muni.cz> On Tue, Jul 10, 2007 at 12:08:43PM -0700, Roland Dreier wrote: > > I think he spoke about coherent, didn't he? On i386/x86_64, the consistent and > > coherent are the same but on some architectures they are not and I think that > > using consistent (in particular pci_alloc_consistent) is exactly what should > > be used. Keir also recommended to use this one. > > coherent and consistent are synonyms. It's confusing because there is > pci_alloc_consistent(), which is in general just a wrapper for > dma_alloc_coherent(). According to DMA-mapping.txt they are not. Alpha, M68000 without MMU, PPC, Sparc, Sparc64 and V850 have their own implementations of pci_alloc_consistent(). Yes, on i386, pci_alloc_consistent() is just a wrapper for dma_alloc_coherent(). > > And moreover, it avoids using swiotlb and bounce buffers, I think. Am I right, > > Keir? > > Yes, but I'm not really willing to make things worse for standard i386 > just to make Xen work a little better. So, what about some #ifdefs? E.g., add a config option for Xen optimizations?
-- Lukáš Hejtmánek From caitlinb at broadcom.com Tue Jul 10 12:21:45 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 10 Jul 2007 12:21:45 -0700 Subject: [ofa-general] Re: [PATCH RFC] sharing userspace IB objects In-Reply-To: <20070626070641.GM15343@mellanox.co.il> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475CEE4@NT-IRVA-0750.brcm.ad.broadcom.com> general-bounces at lists.openfabrics.org wrote: >> Quoting Roland Dreier : >> Subject: Re: [PATCH RFC] sharing userspace IB objects >> >> Some initial reaction, in no particular order: >> >> - Having to allocate everything in memory that the library mmap()s >> adds a lot of yucky stuff -- basically we need to implement our >> own allocator for the shared memory offets. > > Right. > >> I guess we could wrap this >> in libibverbs and only implement it once but still we're basically >> reimplementing malloc(). > > Right. > >> Is there really a strong use case for making every type of object >> shareable? Can we handle the SRC stuff without going to this >> extreme of complexity? > > This is not directly related to SRC: this is an effort to > make it possible to share QPs, CQ etc across processes in the > same way as they can be currently shared across threads. > So assuming that we want multiple processes to post to the > same QP, how can we support this? > Sharing QPs and CQs ultimately means sharing Protection Domains and Memory Regions across processes. So obviously this would never be a default. Basically you would be enabling a group of processes to share Memory Regions, QPs, etc all created with a single PD. The easiest way to support this in the hardware is to simply not be aware that it is happening, that is, to treat all the processes as though they were just threads. I suspect that makes the prospective pool of users quite small. But there are lesser sharings that could be of value: Passing Connection Requests to other processes, and allowing them to accept the connection. 
Passing an empty QP to another process, which could then re-attach it to its Protection Domain and supply new memory for the SQ and RQ. From rdreier at cisco.com Tue Jul 10 12:24:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 12:24:02 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070710191639.GL3885@ics.muni.cz> (Lukas Hejtmanek's message of "Tue, 10 Jul 2007 21:16:39 +0200") References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz> <20070710190018.GK3885@ics.muni.cz> <20070710191639.GL3885@ics.muni.cz> Message-ID: > > coherent and consistent are synonyms. It's confusing because there is > > pci_alloc_consistent(), which is in general just a wrapper for > > dma_alloc_coherent(). > > According to DMA-mapping.txt they are not. Alpha, M68000 wihtout MMU, PPC, > Sparc, Sparc64, V850 have own implementation of pci_alloc_consistent(). > > Yes, on i386, the pci_alloc_consistent() is just wrapper for > dma_alloc_coherent(). Sorry, I was a little confusing. The implementations may be different but in general there is no real difference between consistent and coherent memory. Using either pci_alloc_consistent() or dma_alloc_coherent() will exhaust the same small pool of address space on powerpc 4xx for example. > So, what about some #ifdefs ? E.g., allow config option - Xen optimizations? Seems pretty ugly, especially given that Xen is not upstream. I think the Xen tree should just carry such patches, at least until Xen is merged. Even then I'm quite dubious about having two code paths for this. - R. From rdreier at cisco.com Tue Jul 10 12:25:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 12:25:59 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070710183006.GE11320@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 10 Jul 2007 21:30:06 +0300") References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> <20070710171142.GC11320@mellanox.co.il> <20070710183006.GE11320@mellanox.co.il> Message-ID: > Hmm. This means there's no way to sync a range within > mapping created with map_sg? It doesn't seem that there is one right now at least. > > It actually doesn't look too bad to replace our use of pci_map_sg() > > with dma_map_single(), at least at first glance. I'll try to write a > > patch later. > > Well, the reason map_sg is there is presumably because on some > architectures it's worth it to try and make the region contigious in DMA space. > But I agree this seems the lesser evil at this point ... Given that we're already trying to allocate big chunks of physically contiguous memory, I think that any virtual merging we get is likely to be of very small benefit. It is kind of a shame to give this up though. - R. From rdreier at cisco.com Tue Jul 10 12:28:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 12:28:44 -0700 Subject: [ofa-general] Re: [KJ PATCH] Replacing memset(, 0, PAGE_SIZE) with clear_page() in drivers/infiniband/hw/mthca/mthca_eq.c In-Reply-To: <1182136980.9020.13.camel@shani-win> (Shani Moideen's message of "Mon, 18 Jun 2007 08:53:00 +0530") References: <1182136980.9020.13.camel@shani-win> Message-ID: thanks, I applied both mthca patches as one commit. From xhejtman at ics.muni.cz Tue Jul 10 12:36:26 2007 From: xhejtman at ics.muni.cz (Lukas Hejtmanek) Date: Tue, 10 Jul 2007 21:36:26 +0200 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: <20070709213913.GB20052@mellanox.co.il> <20070710141409.GH3885@ics.muni.cz> <20070710190018.GK3885@ics.muni.cz> <20070710191639.GL3885@ics.muni.cz> Message-ID: <20070710193626.GM3885@ics.muni.cz> On Tue, Jul 10, 2007 at 12:24:02PM -0700, Roland Dreier wrote: > Sorry, I was a little confusing. 
The implementations may be different > but in general there is no real difference between consistent and > coherent memory. Using either pci_alloc_consistent() or > dma_alloc_coherent() will exhaust the same small pool of address space > on powerpc 4xx for example. I thought that consistent only refers to a physically contiguous area whereas coherent refers to memory where no barriers need to be used. But I may be wrong. Anyway, with my patch, I can turn off swiotlb and I'm still able to load ib_mthca cleanly in DomU. On the other hand, Xen bugs me about a DMA bug in ib_ipoib, so there may be another problem with dma_sync_single(). > Seems pretty ugly, especially given that Xen is not upstream. I think > the Xen tree should just carry such patches, at least until Xen is > merged. Even then I'm quite dubious about having two code paths for this. OK, I will keep it for myself. -- Lukáš Hejtmánek From jim.houston at ccur.com Tue Jul 10 13:05:31 2007 From: jim.houston at ccur.com (Jim Houston) Date: Tue, 10 Jul 2007 16:05:31 -0400 Subject: [ofa-general] [PATCH] fix idr_get_new_above id alias bugs In-Reply-To: <200707041611.30056.hnguyen@linux.vnet.ibm.com> References: <200707021919.27251.hnguyen@linux.vnet.ibm.com> <1183422700.3130.27.camel@localhost.localdomain> <200707041611.30056.hnguyen@linux.vnet.ibm.com> Message-ID: <1184097931.3020.73.camel@localhost.localdomain> Hi Everyone, Hoang-Nam Nguyen reported a bug in idr_get_new_above() which occurred with a starting id value like 0x3ffffffc. His test module easily reproduced the problem. Thanks. The test revealed the following bugs: 1. Relying on shift operations which have undefined results, e.g.: 1 << n where n >= word size. On i386 an integer shift only uses the low 5 bits of the shift count. 2. An off-by-one error which prevented the topmost layer of the radix tree from being allocated. This meant that sub_alloc() would allocate an entry in the existing portion of the radix tree which aliased the requested address.
When it tried to allocate id 0x40000000, it might use the slot belonging to id 0. 3. There was also a failure in the code which walked back up the tree if an allocation failed. The normal case is to descend the tree checking the starting id value against the bitmap at each level. If the bit is set, we know that the entire sub-tree is full and we can short cut the search. We may still descend to the lowest level and find that the portion of the id space we want is full. In this case we need to walk back up the tree and continue the search. The existing code just returned to the previous level and continued. This resulted in an attempt to allocate an id above 0x3ffffffc using the slot for id 0x3ffffc00 instead of 0x40000000 which it then claimed to have allocated. The same problem occurs with 0x3ff as the requested id value if it is already in use. With this patch, idr.c should work as advertised allocating id values in the range 0...0x7fffffff. Andrew had speculated that it should allow the full range 0...0xffffffff to be used. I was tempted to make changes to allow this, but it would require changes to API, e.g. making the starting id value and the return value unsigned. Signed-off-by: Jim Houston -- Index: linux-2.6.22-rc7/include/linux/idr.h =================================================================== --- linux-2.6.22-rc7.orig/include/linux/idr.h 2007-04-25 23:08:32.000000000 -0400 +++ linux-2.6.22-rc7/include/linux/idr.h 2007-07-06 16:46:31.000000000 -0400 @@ -18,17 +18,9 @@ #if BITS_PER_LONG == 32 # define IDR_BITS 5 # define IDR_FULL 0xfffffffful -/* We can only use two of the bits in the top level because there is - only one possible bit in the top level (5 bits * 7 levels = 35 - bits, but you only use 31 bits in the id). 
*/ -# define TOP_LEVEL_FULL (IDR_FULL >> 30) #elif BITS_PER_LONG == 64 # define IDR_BITS 6 # define IDR_FULL 0xfffffffffffffffful -/* We can only use two of the bits in the top level because there is - only one possible bit in the top level (6 bits * 6 levels = 36 - bits, but you only use 31 bits in the id). */ -# define TOP_LEVEL_FULL (IDR_FULL >> 62) #else # error "BITS_PER_LONG is not 32 or 64" #endif Index: linux-2.6.22-rc7/lib/idr.c =================================================================== --- linux-2.6.22-rc7.orig/lib/idr.c 2007-04-25 23:08:32.000000000 -0400 +++ linux-2.6.22-rc7/lib/idr.c 2007-07-10 11:05:19.000000000 -0400 @@ -105,8 +105,8 @@ id = *starting_id; p = idp->top; - l = idp->layers; - pa[l--] = NULL; + l = idp->layers - 1; + pa[l] = NULL; while (1) { /* * We run around this while until we reach the leaf node... @@ -117,8 +117,14 @@ if (m == IDR_SIZE) { /* no space available go back to previous layer. */ l++; - id = (id | ((1 << (IDR_BITS * l)) - 1)) + 1; - if (!(p = pa[l])) { + id = (id | ((1 << (IDR_BITS * l)) - 1)); + while (((id >> (IDR_BITS * l)) & IDR_MASK) == IDR_MASK) + l++; + id++; + p = pa[l-1]; + if ((id >= MAX_ID_BIT) || (id < 0)) + return -3; + if (!p) { *starting_id = id; return -2; } @@ -141,7 +147,7 @@ p->ary[m] = new; p->count++; } - pa[l--] = p; + pa[--l] = p; p = p->ary[m]; } /* @@ -159,7 +165,7 @@ */ n = id; while (p->bitmap == IDR_FULL) { - if (!(p = pa[++l])) + if (!(p = pa[l++])) break; n = n >> IDR_BITS; __set_bit((n & IDR_MASK), &p->bitmap); @@ -186,7 +192,7 @@ * Add a new layer to the top of the tree if the requested * id is larger than the currently allocated space. 
*/ - while ((layers < (MAX_LEVEL - 1)) && (id >= (1 << (layers*IDR_BITS)))) { + while ((layers < MAX_LEVEL) && (id & ((~0) << (layers*IDR_BITS)))) { layers++; if (!p->count) continue; @@ -299,7 +305,7 @@ static void sub_remove(struct idr *idp, int shift, int id) { struct idr_layer *p = idp->top; - struct idr_layer **pa[MAX_LEVEL]; + struct idr_layer **pa[MAX_LEVEL+1]; struct idr_layer ***paa = &pa[0]; int n; @@ -392,7 +398,7 @@ /* Mask off upper bits we don't use for the search. */ id &= MAX_ID_MASK; - if (id >= (1 << n)) + if ((n <= MAX_ID_SHIFT) && (id & ((~0) << n))) return NULL; while (n > 0 && p) { @@ -425,7 +431,7 @@ id &= MAX_ID_MASK; - if (id >= (1 << n)) + if ((n <= MAX_ID_SHIFT) && (id & ((~0) << n))) return ERR_PTR(-EINVAL); n -= IDR_BITS; From rdreier at cisco.com Tue Jul 10 13:49:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 10 Jul 2007 13:49:09 -0700 Subject: [ofa-general] Re: [PATCH] libmlx4: make BF available for RDMA_READ work requests In-Reply-To: <200706211201.58440.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 12:01:58 +0300") References: <200706211201.58440.jackm@dev.mellanox.co.il> Message-ID: thanks, applied at last From rick.jones2 at hp.com Tue Jul 10 15:12:29 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Tue, 10 Jul 2007 15:12:29 -0700 Subject: [ofa-general] should it be possible to run SDP over a T320? Message-ID: <4694044D.8010208@hp.com> Hi - I was talking to someone about the numbers I'd gathered for IPoIB with OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520 did some non-trivial things to bulk transfer performance. This person suggested it should be possible to run SDP over a Chelsio T320, which I happen to have in my systems at present. 
However, my initial simplistic attempt was unsuccessful: [root at hpcpc106 OFED-1.2-20070626-0917]# netperf -t SDP_STREAM -c -C -H 192.168.2.107 -l 30 SDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.107 (192.168.2.107) port 0 AF_INET netperf: send_sdp_stream: data socket connect failed: Network is unreachable This is with: [root at hpcpc106 OFED-1.2-20070626-0917]# ethtool -i eth2 driver: cxgb3 version: 1.0.094 firmware-version: T 4.1.0 bus-info: 0000:08:00.0 and the "native" SDP netperf tests rather than any LD_PRELOADed library. Am I on a wild goose chase, or should it be possible to do SDP over the T320 with OFED 1.2 bits on the system? thanks, rick jones From stanleysufficool at roadrunner.com Tue Jul 10 18:28:44 2007 From: stanleysufficool at roadrunner.com (Stanley Sufficool) Date: Tue, 10 Jul 2007 18:28:44 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <4693B9E4.1070001@mellanox.com> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> <1184042252.15067.8.camel@gentoo-linux.localdomain> <4693B9E4.1070001@mellanox.com> Message-ID: <1184117324.22408.0.camel@gentoo-linux.localdomain> Is this the same as the README in the srpt_inc branch? That is the document I based the Wiki on (with a few embellishments). On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote: > > Added a new wiki page based on Vu Pham's readme and issues with recent > > kernels. I hope to keep it current as I get our targets up and running. > > > > Thanks for doing this. > Please use the latest readme from this link - > http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt > > > > http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation > > > > > > WinIB initiators --> Gentoo Linux SRP Target. > > > > I mainly test linux initiators with gen2 srp-target. I have > not tested win srp initiator with the target. 
> > > Anything wrong with the above approach, I would be interested in a best > > practices if there is one. I saw a CentOS target post, is this more > > stable or better performing? > > There is no difference when you run the same srp target / > scst codes in CentOS or RH/SuSe linux distributions. The > storage back-end will determine the performance > > -vu > > > > > Thanks. > > > > On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: > >> Stanley Sufficool wrote: > >> > Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch > >> > > >> > Got the latest srpt from the git repository on OpenFabrics and had the > >> > following issues. > >> > > >> > ib_srpt.c Line 1997, missing second argument, should be? > >> > sdev->scst_tgt = scst_register(tp, NULL); > >> > > >> > >> Yes. You need the change if you test with top of scst svn > >> trunk (or from version 0.9.6-pre2) > >> If you test with scst before 0.9.6-pre2 (ie. version <= > >> 0.9.6-pre1) you don't need the second argument for > >> scst_register() > >> > >> > >> > SCST was built successfully after fixing an issue in scst_vdisk.c > >> > (missing #include ) > >> > >> > >> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX > >> - you should send the patch to scst devel > >> > >> > > >> > Just thought this would be nice to have documented, took me half a day > >> > to track down as a novice in C programming. 
> >> > > >> > >> there is *lean and mean* srpt's README in srpt_inc > >> SCST also has some document > >> You can add some wiki/notes for the problems in openfabrics > >> wiki page https://wiki.openfabrics.org/tiki-index.php > >> > >> -vu > >> > >> > From stanleysufficool at roadrunner.com Tue Jul 10 22:02:21 2007 From: stanleysufficool at roadrunner.com (Stanley Sufficool) Date: Tue, 10 Jul 2007 22:02:21 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <4693B9E4.1070001@mellanox.com> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> <1184042252.15067.8.camel@gentoo-linux.localdomain> <4693B9E4.1070001@mellanox.com> Message-ID: <1184130141.22408.7.camel@gentoo-linux.localdomain> Do you have any reservations that the WinIB (Mellanox) SRP initiators will not work with SRPT? If there is any doubt, I need to know so that I can fall back to iSCSI over IPoIB (iSIPIB??? ;) ) . This has lots more overhead, but it's a sure bet until this can be worked out. On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote: > > Added a new wiki page based on Vu Pham's readme and issues with recent > > kernels. I hope to keep it current as I get our targets up and running. > > > > Thanks for doing this.
> Please use the latest readme from this link - > http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt > > > > http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation > > > > > > WinIB initiators --> Gentoo Linux SRP Target. > > > > I mainly test linux initiators with gen2 srp-target. I have > not tested win srp initiator with the target. > > > Anything wrong with the above approach, I would be interested in a best > > practices if there is one. I saw a CentOS target post, is this more > > stable or better performing? > > There is no difference when you run the same srp target / > scst codes in CentOS or RH/SuSe linux distributions. The > storage back-end will determine the performance > > -vu > > > > > Thanks. > > > > On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: > >> Stanley Sufficool wrote: > >> > Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch > >> > > >> > Got the latest srpt from the git repository on OpenFabrics and had the > >> > following issues. > >> > > >> > ib_srpt.c Line 1997, missing second argument, should be? > >> > sdev->scst_tgt = scst_register(tp, NULL); > >> > > >> > >> Yes. You need the change if you test with top of scst svn > >> trunk (or from version 0.9.6-pre2) > >> If you test with scst before 0.9.6-pre2 (ie. version <= > >> 0.9.6-pre1) you don't need the second argument for > >> scst_register() > >> > >> > >> > SCST was built successfully after fixing an issue in scst_vdisk.c > >> > (missing #include ) > >> > >> > >> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX > >> - you should send the patch to scst devel > >> > >> > > >> > Just thought this would be nice to have documented, took me half a day > >> > to track down as a novice in C programming. 
> >> > > >> > >> there is *lean and mean* srpt's README in srpt_inc > >> SCST also has some document > >> You can add some wiki/notes for the problems in openfabrics > >> wiki page https://wiki.openfabrics.org/tiki-index.php > >> > >> -vu > >> > >> > From ogerlitz at voltaire.com Tue Jul 10 22:57:05 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 11 Jul 2007 08:57:05 +0300 (IDT) Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name Message-ID: Roland, This is the best I could come up with; it's still a problem if you have multiple devices from different providers or more than ten devices from the same provider... any other ideas? -------------------------------------------------------------- The mad module creates a thread per active port, where the thread name is derived from the port number. This causes different threads to have the same name when there are multiple devices. Fix that by using both the device and the port numbers to derive the name.
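The naming scheme described above can be tried out in user space. What follows is only a minimal sketch, not the code from the patch: the helpers trailing_index() and make_wq_name() and the device names "mthca0" and "mthca1" are hypothetical, and it assumes the device name ends in a single decimal digit. One detail worth double-checking in any scheme like this: in C, name[strlen(name)] is the terminating NUL, so the last visible character of the name sits at index strlen(name) - 1.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the posted patch): return the trailing
 * decimal digit of a device name such as "mthca0", or -1 if the name is
 * empty or does not end in a digit.
 */
static int trailing_index(const char *dev_name)
{
    size_t len = strlen(dev_name);

    /* dev_name[len] is the terminating NUL; the last character is at len - 1. */
    if (len == 0 || !isdigit((unsigned char)dev_name[len - 1]))
        return -1;
    return dev_name[len - 1] - '0';
}

/*
 * Hypothetical helper: build "ib_mad<dev>_<port>" into buf.
 * Returns whatever snprintf() returns.
 */
static int make_wq_name(char *buf, size_t size, const char *dev_name, int port_num)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "ib_mad%d_%d", trailing_index(dev_name), port_num);
}
```

Calling make_wq_name(name, sizeof name, "mthca1", 2) fills name with "ib_mad1_2". As the message itself notes, a single trailing digit cannot tell apart devices from different providers or more than ten devices from one provider.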
Signed-off-by: Or Gerlitz

Index: linux-2.6.22-rc2/drivers/infiniband/core/mad.c
===================================================================
--- linux-2.6.22-rc2.orig/drivers/infiniband/core/mad.c 2007-05-20 09:37:29.000000000 +0300
+++ linux-2.6.22-rc2/drivers/infiniband/core/mad.c 2007-07-11 08:38:59.000000000 +0300
@@ -2799,7 +2799,7 @@ static int ib_mad_port_open(struct ib_de
         if (ret)
                 goto error7;
 
-        snprintf(name, sizeof name, "ib_mad%d", port_num);
+        snprintf(name, sizeof name, "ib_mad%c_%d", device->name[strlen(device->name) - 1], port_num);
         port_priv->wq = create_singlethread_workqueue(name);
         if (!port_priv->wq) {
                 ret = -ENOMEM;

From mst at dev.mellanox.co.il Tue Jul 10 23:14:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 11 Jul 2007 09:14:44 +0300
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4694044D.8010208@hp.com>
References: <4694044D.8010208@hp.com>
Message-ID: <20070711061444.GG11320@mellanox.co.il>

> Quoting Rick Jones :
> Subject: should it be possible to run SDP over a T320?
>
> Hi -
>
> I was talking to someone about the numbers I'd gathered for IPoIB with
> OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520
> did some non-trivial things to bulk transfer performance.

Was this data posted on-list? I didn't see it.

--
MST

From vlad at dev.mellanox.co.il Tue Jul 10 23:21:21 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 11 Jul 2007 09:21:21 +0300
Subject: [ofa-general] minor usability nit with 1.2GA?
In-Reply-To: <4693BE43.8070905@hp.com>
References: <4692CC7A.2050704@hp.com> <469372CC.5060207@dev.mellanox.co.il> <4693BE43.8070905@hp.com>
Message-ID: <469476E1.9060301@dev.mellanox.co.il>

>>
>> Hi,
>> OFED removes the previous software before installing the new one.
>> So, there shouldn't be a mix of different OFED versions on the same
>> machine.
>>
>> Can you send me the output of the following commands:
>> # modinfo ib_sdp
>> # rpm -qf /lib/modules/.../ib_sdp.ko (take the correct path from the
>> previous command)
>> # rpm -q kernel-ib
>> # ofed_info
>
> I can, but at this point I'm not sure what it would show since I went
> back and did a "build me one with everything" install on both my
> systems. If you still want to see it I can do that though.
>
> rick
>

No, it doesn't make sense any more.

Thanks,
Vladimir

From ogerlitz at voltaire.com Tue Jul 10 23:22:43 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 11 Jul 2007 09:22:43 +0300 (IDT)
Subject: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation
In-Reply-To:
References:
Message-ID:

If the IBV_SEND_INLINE flag is set in the WR provided to ibv_post_send,
the data buffers can be reused immediately after the call returns.
Document this.

Signed-off-by: Or Gerlitz

Index: libibverbs/include/infiniband/verbs.h
===================================================================
--- libibverbs.orig/include/infiniband/verbs.h
+++ libibverbs/include/infiniband/verbs.h
@@ -989,6 +989,9 @@ int ibv_destroy_qp(struct ibv_qp *qp);
 
 /**
  * ibv_post_send - Post a list of work requests to a send queue.
+ *
+ * If IBV_SEND_INLINE flag is set, the data buffers can be reused immediately
+ * after the call returns - low level libraries must conform to this rule.
  */
 static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                 struct ibv_send_wr **bad_wr)
Index: libibverbs/man/ibv_post_send.3
===================================================================
--- libibverbs.orig/man/ibv_post_send.3
+++ libibverbs/man/ibv_post_send.3
@@ -109,7 +109,9 @@ behavior.
 .PP
 The buffers used by a WR can only be safely reused after WR the request is
 fully executed and a work completion has been retrieved
-from the corresponding completion queue (CQ).
+from the corresponding completion queue (CQ). However, if the
+IBV_SEND_INLINE flag was set, the buffer can be reused immediately
+after the call returns.
 .SH "SEE ALSO"
 .BR ibv_create_qp (3),
 .BR ibv_create_ah (3),

From tziporet at mellanox.co.il Wed Jul 11 01:37:28 2007
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 11 Jul 2007 11:37:28 +0300
Subject: [ofa-general] Re: [ewg] OFED 1.3 timeline
In-Reply-To: <4693BF47.8070700@mellanox.co.il>
References: <4693BF47.8070700@mellanox.co.il>
Message-ID: <469496C8.9030005@mellanox.co.il>

Tziporet Koren wrote:

Fix Nov dates due to Thanksgiving holiday

> Hi All,
> Based on the requests to have OFED 1.3 release this year the release
> schedule is the following:
>
> * Feature freeze - Sep 4
> * Alpha release - Sep 10
> * Beta release - Sep 25
> * RC1 - Oct 16
> * RC2 - Oct 30
> * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
> * RC4 - Nov 20
> * GA release - Nov 30 (or first week of Dec)
>
>
> To make this schedule we must implement all major changes for the
> package during July so we have a stable package till middle of Aug.
> Also we must keep the new features in control and not insert
> unnecessary changes that are not in the features list.
>
> Full features list will be published in a different mail
>
> Tziporet.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vlad at lists.openfabrics.org Wed Jul 11 02:45:38 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Wed, 11 Jul 2007 02:45:38 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_c_kernel 20070711-0200 daily build status
Message-ID: <20070711094539.126FAE6086B@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.16
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.16
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7

From eitan at mellanox.co.il Wed Jul 11 03:51:16 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 11 Jul 2007 13:51:16 +0300
Subject: [ofa-general] IB performance stats (revisited)
References: <46826370.4090602@hp.com><1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com><46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com><4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com><6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com><1182978496.28870.106214.camel@hal.voltaire.com><6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com>

Hi Ira,

> Second, I have run some tests querying the fabric of our
> large clusters here (~500 nodes) and the results were
> promising for a single node implementation.
> I don't recall the numbers as this was a while ago but it was
> on the order of
> <2 sec and I think <1 but I don't want to be misquoted.

Does PerfMgr query switch ports ?
If it does I am surprised by the short sweep time you got.

Does it have >1 query on the wire at a given time?
If not then I am even more surprised.
Was the cluster running a job at the time of the query ? Thanks Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Ira Weiny [mailto:weiny2 at llnl.gov] > Sent: Tuesday, July 10, 2007 7:47 PM > To: Eitan Zahavi > Cc: halr at voltaire.com; Mark.Seger at hp.com; > general at lists.openfabrics.org; Ed.Finn at FMR.COM > Subject: Re: [ofa-general] IB performance stats (revisited) > > On Thu, 28 Jun 2007 10:24:59 +0300 > "Eitan Zahavi" wrote: > > > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > > > > In the last months it is the second time I hear people > > > complaining the > > > > current monitoring solution in OFA is integrated with OpenSM. > > > > > > I must have missed this both times (didn't see this in Mark's > > > post) and the statement itself is somewhat inaccurate as well. > > Private talks - I hope they will speak up for themselves now... > > > > > > > These people do not use OpenSM but do use OFED. > > > > > > I'm not sure I'm following what you mean here. > > > > > > If you mean that some people want to run PerfMgr without > the SM/SA > > > aspects (so that they can run a vendor based SM), that is > the next > > > thing we are adding to the implementation. > > Exactly. OK when is that coming? > > There is very little which ties the current PerfMgr to > OpenSM. Basically it just gets the current fabric topology. > As Hal has said changes are coming. > > > > > > > > > > Another drawback if that > > > > no naming is provided and the reporting uses GUIDs. > > > > > > Naming is provided via NodeDescription. > > This might be good for hosts but is not covering switches ... > > It does include switches. However, since most systems have > the same name for multiple switches this becomes ineffective. 
> I have queried Voltaire for a way to change the > NodeDescription for switches, but at the time I asked, there > was no way to do it. Perhaps there is now? What about other > vendors? This is why ibnetdiscover and other diags have > "switch map" support. (A GUID->name mapping to override the > default NodeDescription.) Nothing would please me more than > to be able to remove that for a more "automatic" solution. > > > > > > > > I also can't hold myself from saying again I think you > are going > > > > to hit the wall with the concept of doing the PMA from > a single node. > > > > > > If you are referring to the fact the PerMgr is currently not > > > distributed, that will be done as has been stated before. > > Good. When is it expected? Will it be OFED 1.3? > > When Hal first sent out the PerfMgr design I thought we > should jump right to the distributed model as well. But now > I am glad we have gone the way we did. > First off, we have something which "works" and from which we > can expand. > Second, I have run some tests querying the fabric of our > large clusters here (~500 nodes) and the results were > promising for a single node implementation. > I don't recall the numbers as this was a while ago but it was > on the order of > <2 sec and I think <1 but I don't want to be misquoted. > > For sure, a distributed model offers many advantages and we > will get there. But for many the current single node > approach should work just fine. > > Thanks, > Ira > > > > > Thanks > > > > > > -- Hal > > > > > > > Eitan Zahavi > > > > Senior Engineering Director, Software Architect Mellanox > > > Technologies > > > > LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: general-bounces at lists.openfabrics.org > > > > > [mailto:general-bounces at lists.openfabrics.org] On > Behalf Of Hal > > > > > Rosenstock > > > > > Sent: Wednesday, June 27, 2007 8:12 PM > > > > > To: Mark Seger > > > > > Cc: Finn, Ed; general at lists.openfabrics.org > > > > > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > > > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > > > > > > >The performance managers deal with the counter > stickiness (by > > > > > > >resetting them when they think they need to). They > > > > > typically export > > > > > > >their data although this is not specified by IBA so it is > > > > > in a vendor > > > > > > >proprietary manner. > > > > > > > > > > > > > > > > > > > > so I guess these guys are poor citizens as well... > > > > > > > > > > Not sure what you mean. > > > > > > > > > > > the real issue as I see it then means nobody can trust > > > the data if > > > > > > randon tools randomly reset the counters. a real shame... > > > > > > > > > > I consider this to be a real rather than random app for this. > > > > > Guess it depends on what one considers random. 
> > > > >
> > > > > -- Hal
> > > > >
> > > > > > -mark

From Thomas.Talpey at netapp.com Wed Jul 11 04:50:37 2007
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 11 Jul 2007 07:50:37 -0400
Subject: [ofa-general] What should a ULP pass as ib_create_cq(..., comp_vector) ?
Message-ID:

I notice the ib_create_cq() comp_vector support is merged in 2.6.22. I don't
completely understand what a ULP needs to pass as the argument. I'm currently
passing 0 in the NFS/RDMA client, what in general should I consider using
as a value?

Or put another way, why is this exposed to the ULP? Isn't this the
MSI-X vector table index, a rather low-level thing to hand to the
ULP to manage?

Thanks,
Tom.

From jackm at dev.mellanox.co.il Wed Jul 11 04:58:46 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Wed, 11 Jul 2007 14:58:46 +0300
Subject: [ofa-general] Re: [PATCH] mlx4: add device reset to Internal Error handling mechanism
In-Reply-To:
References: <200707091012.52418.jackm@dev.mellanox.co.il>
Message-ID: <200707111458.46564.jackm@dev.mellanox.co.il>

On Monday 09 July 2007 19:10, Roland Dreier wrote:
> > Why not just delete all the interrupt stuff completely?

I did this patch very quickly -- I'll delete all the interrupt stuff.

> how about round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL) instead?

OK.

From halr at voltaire.com Wed Jul 11 06:31:14 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 09:31:14 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> Message-ID: <1184160670.17622.92728.camel@hal.voltaire.com> Hi Eitan, On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote: > Hi Ira, > > > Second, I have run some tests querying the fabric of our > > large clusters here (~500 nodes) and the results were > > promising for a single node implementation. > > I don't recall the numbers as this was a while ago but it was > > on the order of > > <2 sec and I think <1 but I don't want to be misquoted. > > Does PerfMgr query switch ports ? Yes (of course it does). > If it does I am surprised by the short sweep time you got. > > Does it have >1 query on the wire at a given time? Yes, Default appears to be 500 currently (maybe that needs dialing back a bit) but is settable via perfmgr_max_outstanding_queries in options file. > If not then I am even more surprised. > > Was the cluster running a job at the time of the query ? Is this question related to VL0 contention ? -- Hal > Thanks > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Ira Weiny [mailto:weiny2 at llnl.gov] > > Sent: Tuesday, July 10, 2007 7:47 PM > > To: Eitan Zahavi > > Cc: halr at voltaire.com; Mark.Seger at hp.com; > > general at lists.openfabrics.org; Ed.Finn at FMR.COM > > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > On Thu, 28 Jun 2007 10:24:59 +0300 > > "Eitan Zahavi" wrote: > > > > > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > > > > > In the last months it is the second time I hear people > > > > complaining the > > > > > current monitoring solution in OFA is integrated with OpenSM. > > > > > > > > I must have missed this both times (didn't see this in Mark's > > > > post) and the statement itself is somewhat inaccurate as well. > > > Private talks - I hope they will speak up for themselves now... > > > > > > > > > These people do not use OpenSM but do use OFED. > > > > > > > > I'm not sure I'm following what you mean here. > > > > > > > > If you mean that some people want to run PerfMgr without > > the SM/SA > > > > aspects (so that they can run a vendor based SM), that is > > the next > > > > thing we are adding to the implementation. > > > Exactly. OK when is that coming? > > > > There is very little which ties the current PerfMgr to > > OpenSM. Basically it just gets the current fabric topology. > > As Hal has said changes are coming. > > > > > > > > > > > > > > Another drawback if that > > > > > no naming is provided and the reporting uses GUIDs. > > > > > > > > Naming is provided via NodeDescription. > > > This might be good for hosts but is not covering switches ... > > > > It does include switches. However, since most systems have > > the same name for multiple switches this becomes ineffective. > > I have queried Voltaire for a way to change the > > NodeDescription for switches, but at the time I asked, there > > was no way to do it. Perhaps there is now? What about other > > vendors? 
This is why ibnetdiscover and other diags have > > "switch map" support. (A GUID->name mapping to override the > > default NodeDescription.) Nothing would please me more than > > to be able to remove that for a more "automatic" solution. > > > > > > > > > > > I also can't hold myself from saying again I think you > > are going > > > > > to hit the wall with the concept of doing the PMA from > > a single node. > > > > > > > > If you are referring to the fact the PerMgr is currently not > > > > distributed, that will be done as has been stated before. > > > Good. When is it expected? Will it be OFED 1.3? > > > > When Hal first sent out the PerfMgr design I thought we > > should jump right to the distributed model as well. But now > > I am glad we have gone the way we did. > > First off, we have something which "works" and from which we > > can expand. > > Second, I have run some tests querying the fabric of our > > large clusters here (~500 nodes) and the results were > > promising for a single node implementation. > > I don't recall the numbers as this was a while ago but it was > > on the order of > > <2 sec and I think <1 but I don't want to be misquoted. > > > > For sure, a distributed model offers many advantages and we > > will get there. But for many the current single node > > approach should work just fine. > > > > Thanks, > > Ira > > > > > > > > Thanks > > > > > > > > -- Hal > > > > > > > > > Eitan Zahavi > > > > > Senior Engineering Director, Software Architect Mellanox > > > > Technologies > > > > > LTD > > > > > Tel:+972-4-9097208 > > > > > Fax:+972-4-9593245 > > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: general-bounces at lists.openfabrics.org > > > > > > [mailto:general-bounces at lists.openfabrics.org] On > > Behalf Of Hal > > > > > > Rosenstock > > > > > > Sent: Wednesday, June 27, 2007 8:12 PM > > > > > > To: Mark Seger > > > > > > Cc: Finn, Ed; general at lists.openfabrics.org > > > > > > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > > > > > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > > > > > > > >The performance managers deal with the counter > > stickiness (by > > > > > > > >resetting them when they think they need to). They > > > > > > typically export > > > > > > > >their data although this is not specified by IBA so it is > > > > > > in a vendor > > > > > > > >proprietary manner. > > > > > > > > > > > > > > > > > > > > > > > so I guess these guys are poor citizens as well... > > > > > > > > > > > > Not sure what you mean. > > > > > > > > > > > > > the real issue as I see it then means nobody can trust > > > > the data if > > > > > > > randon tools randomly reset the counters. a real shame... > > > > > > > > > > > > I consider this to be a real rather than random app for this. > > > > > > Guess it depends on what one considers random. 
> > > > > > > > > > > > -- Hal > > > > > > > > > > > > > -mark > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > general mailing list > > > > > > general at lists.openfabrics.org > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > > > > > To unsubscribe, please visit > > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > > From eitan at mellanox.co.il Wed Jul 11 07:03:35 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 11 Jul 2007 17:03:35 +0300 Subject: [ofa-general] IB performance stats (revisited) References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com> Hi Hal, > > > > > Second, I have run some tests querying the fabric of our large > > > clusters here (~500 nodes) and the results were promising for a > > > single node implementation. > > > I don't recall the numbers as this was a while ago but it > was on the > > > order of > > > <2 sec and I think <1 but I don't want to be misquoted. > > > > Does PerfMgr query switch ports ? 
> > Yes (of course it does). > > > If it does I am surprised by the short sweep time you got. > > > > Does it have >1 query on the wire at a given time? > > Yes, Default appears to be 500 currently (maybe that needs > dialing back a bit) but is settable via > perfmgr_max_outstanding_queries in options file. This explains some. > > > If not then I am even more surprised. > > > > Was the cluster running a job at the time of the query ? > > Is this question related to VL0 contention ? Yes > > -- Hal > > > Thanks > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect Mellanox > Technologies > > LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: Ira Weiny [mailto:weiny2 at llnl.gov] > > > Sent: Tuesday, July 10, 2007 7:47 PM > > > To: Eitan Zahavi > > > Cc: halr at voltaire.com; Mark.Seger at hp.com; > > > general at lists.openfabrics.org; Ed.Finn at FMR.COM > > > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > > > On Thu, 28 Jun 2007 10:24:59 +0300 > > > "Eitan Zahavi" wrote: > > > > > > > > On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > > > > > > In the last months it is the second time I hear people > > > > > complaining the > > > > > > current monitoring solution in OFA is integrated > with OpenSM. > > > > > > > > > > I must have missed this both times (didn't see this in Mark's > > > > > post) and the statement itself is somewhat inaccurate as well. > > > > Private talks - I hope they will speak up for themselves now... > > > > > > > > > > > These people do not use OpenSM but do use OFED. > > > > > > > > > > I'm not sure I'm following what you mean here. > > > > > > > > > > If you mean that some people want to run PerfMgr without > > > the SM/SA > > > > > aspects (so that they can run a vendor based SM), that is > > > the next > > > > > thing we are adding to the implementation. > > > > Exactly. OK when is that coming? 
> > > > > > There is very little which ties the current PerfMgr to OpenSM. > > > Basically it just gets the current fabric topology. > > > As Hal has said changes are coming. > > > > > > > > > > > > > > > > > > Another drawback if that > > > > > > no naming is provided and the reporting uses GUIDs. > > > > > > > > > > Naming is provided via NodeDescription. > > > > This might be good for hosts but is not covering switches ... > > > > > > It does include switches. However, since most systems > have the same > > > name for multiple switches this becomes ineffective. > > > I have queried Voltaire for a way to change the > NodeDescription for > > > switches, but at the time I asked, there was no way to do it. > > > Perhaps there is now? What about other vendors? This is why > > > ibnetdiscover and other diags have "switch map" support. (A > > > GUID->name mapping to override the default > NodeDescription.) Nothing > > > would please me more than to be able to remove that for a more > > > "automatic" solution. > > > > > > > > > > > > > > I also can't hold myself from saying again I think you > > > are going > > > > > > to hit the wall with the concept of doing the PMA from > > > a single node. > > > > > > > > > > If you are referring to the fact the PerMgr is currently not > > > > > distributed, that will be done as has been stated before. > > > > Good. When is it expected? Will it be OFED 1.3? > > > > > > When Hal first sent out the PerfMgr design I thought we > should jump > > > right to the distributed model as well. But now I am > glad we have > > > gone the way we did. > > > First off, we have something which "works" and from which we can > > > expand. > > > Second, I have run some tests querying the fabric of our large > > > clusters here (~500 nodes) and the results were promising for a > > > single node implementation. 
> > > I don't recall the numbers as this was a while ago but it > was on the > > > order of > > > <2 sec and I think <1 but I don't want to be misquoted. > > > > > > For sure, a distributed model offers many advantages and > we will get > > > there. But for many the current single node approach should work > > > just fine. > > > > > > Thanks, > > > Ira > > > > > > > > > > > Thanks > > > > > > > > > > -- Hal > > > > > > > > > > > Eitan Zahavi > > > > > > Senior Engineering Director, Software Architect Mellanox > > > > > Technologies > > > > > > LTD > > > > > > Tel:+972-4-9097208 > > > > > > Fax:+972-4-9593245 > > > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: general-bounces at lists.openfabrics.org > > > > > > > [mailto:general-bounces at lists.openfabrics.org] On > > > Behalf Of Hal > > > > > > > Rosenstock > > > > > > > Sent: Wednesday, June 27, 2007 8:12 PM > > > > > > > To: Mark Seger > > > > > > > Cc: Finn, Ed; general at lists.openfabrics.org > > > > > > > Subject: Re: [ofa-general] IB performance stats > (revisited) > > > > > > > > > > > > > > On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > > > > > > > > >The performance managers deal with the counter > > > stickiness (by > > > > > > > > >resetting them when they think they need to). They > > > > > > > typically export > > > > > > > > >their data although this is not specified by > IBA so it is > > > > > > > in a vendor > > > > > > > > >proprietary manner. > > > > > > > > > > > > > > > > > > > > > > > > > > so I guess these guys are poor citizens as well... > > > > > > > > > > > > > > Not sure what you mean. > > > > > > > > > > > > > > > the real issue as I see it then means nobody can trust > > > > > the data if > > > > > > > > randon tools randomly reset the counters. a > real shame... > > > > > > > > > > > > > > I consider this to be a real rather than random > app for this. 
> > > > > > > Guess it depends on what one considers random. > > > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > -mark > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > general mailing list > > > > > > > general at lists.openfabrics.org > > > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/genera > > > > > > > l > > > > > > > > > > > > > > To unsubscribe, please visit > > > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > From Mark.Seger at hp.com Wed Jul 11 07:15:59 2007 From: Mark.Seger at hp.com (Mark Seger) Date: Wed, 11 Jul 2007 10:15:59 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <1184160670.17622.92728.camel@hal.voltaire.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> Message-ID: <4694E61F.8000502@hp.com> My basic philosophy, and I suspect there are those who might disagree, is that you can't use the network to monitor the network, at least not in times of trouble. 
That's why I insist on having to query the HCAs directly since I can't always be sure the network is there and/or reliable. If you are willing to concede that this can indeed happen than the question becomes one of how do you reliably get data from an HCA and that's the basis for my (re)starting this discussion. As for querying the switch for counters, what do you do on a very large network, say 10s of thousands of nodes if you want to get performance data every second? I also realize this is an extreme situation today (the node count not the frequency of monitoring) but I'm sure everyone would agree systems of these sizes are not that far off. -mark Hal Rosenstock wrote: >Hi Eitan, > >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote: > > >>Hi Ira, >> >> >> >>>Second, I have run some tests querying the fabric of our >>>large clusters here (~500 nodes) and the results were >>>promising for a single node implementation. >>>I don't recall the numbers as this was a while ago but it was >>>on the order of >>><2 sec and I think <1 but I don't want to be misquoted. >>> >>> >>Does PerfMgr query switch ports ? >> >> > >Yes (of course it does). > > > >>If it does I am surprised by the short sweep time you got. >> >>Does it have >1 query on the wire at a given time? >> >> > >Yes, Default appears to be 500 currently (maybe that needs dialing back >a bit) but is settable via perfmgr_max_outstanding_queries in options >file. > > > >>If not then I am even more surprised. >> >>Was the cluster running a job at the time of the query ? >> >> > >Is this question related to VL0 contention ? > >-- Hal > > > >>Thanks >> >>Eitan Zahavi >>Senior Engineering Director, Software Architect >>Mellanox Technologies LTD >>Tel:+972-4-9097208 >>Fax:+972-4-9593245 >>P.O. 
Box 586 Yokneam 20692 ISRAEL >> >> >> >> >> >>>-----Original Message----- >>>From: Ira Weiny [mailto:weiny2 at llnl.gov] >>>Sent: Tuesday, July 10, 2007 7:47 PM >>>To: Eitan Zahavi >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM >>>Subject: Re: [ofa-general] IB performance stats (revisited) >>> >>>On Thu, 28 Jun 2007 10:24:59 +0300 >>>"Eitan Zahavi" wrote: >>> >>> >>> >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: >>>>> >>>>> >>>>>>In the last months it is the second time I hear people >>>>>> >>>>>> >>>>>complaining the >>>>> >>>>> >>>>>>current monitoring solution in OFA is integrated with OpenSM. >>>>>> >>>>>> >>>>>I must have missed this both times (didn't see this in Mark's >>>>>post) and the statement itself is somewhat inaccurate as well. >>>>> >>>>> >>>>Private talks - I hope they will speak up for themselves now... >>>> >>>> >>>>>>These people do not use OpenSM but do use OFED. >>>>>> >>>>>> >>>>>I'm not sure I'm following what you mean here. >>>>> >>>>>If you mean that some people want to run PerfMgr without >>>>> >>>>> >>>the SM/SA >>> >>> >>>>>aspects (so that they can run a vendor based SM), that is >>>>> >>>>> >>>the next >>> >>> >>>>>thing we are adding to the implementation. >>>>> >>>>> >>>>Exactly. OK when is that coming? >>>> >>>> >>>There is very little which ties the current PerfMgr to >>>OpenSM. Basically it just gets the current fabric topology. >>>As Hal has said changes are coming. >>> >>> >>> >>>>>> Another drawback if that >>>>>>no naming is provided and the reporting uses GUIDs. >>>>>> >>>>>> >>>>>Naming is provided via NodeDescription. >>>>> >>>>> >>>>This might be good for hosts but is not covering switches ... >>>> >>>> >>>It does include switches. However, since most systems have >>>the same name for multiple switches this becomes ineffective. 
>>> I have queried Voltaire for a way to change the >>>NodeDescription for switches, but at the time I asked, there >>>was no way to do it. Perhaps there is now? What about other >>>vendors? This is why ibnetdiscover and other diags have >>>"switch map" support. (A GUID->name mapping to override the >>>default NodeDescription.) Nothing would please me more than >>>to be able to remove that for a more "automatic" solution. >>> >>> >>> >>>>>>I also can't hold myself from saying again I think you >>>>>> >>>>>> >>>are going >>> >>> >>>>>>to hit the wall with the concept of doing the PMA from >>>>>> >>>>>> >>>a single node. >>> >>> >>>>>If you are referring to the fact the PerMgr is currently not >>>>>distributed, that will be done as has been stated before. >>>>> >>>>> >>>>Good. When is it expected? Will it be OFED 1.3? >>>> >>>> >>>When Hal first sent out the PerfMgr design I thought we >>>should jump right to the distributed model as well. But now >>>I am glad we have gone the way we did. >>>First off, we have something which "works" and from which we >>>can expand. >>>Second, I have run some tests querying the fabric of our >>>large clusters here (~500 nodes) and the results were >>>promising for a single node implementation. >>>I don't recall the numbers as this was a while ago but it was >>>on the order of >>><2 sec and I think <1 but I don't want to be misquoted. >>> >>>For sure, a distributed model offers many advantages and we >>>will get there. But for many the current single node >>>approach should work just fine. >>> >>>Thanks, >>>Ira >>> >>> >>> >>>>Thanks >>>> >>>> >>>>>-- Hal >>>>> >>>>> >>>>> >>>>>>Eitan Zahavi >>>>>>Senior Engineering Director, Software Architect Mellanox >>>>>> >>>>>> >>>>>Technologies >>>>> >>>>> >>>>>>LTD >>>>>>Tel:+972-4-9097208 >>>>>>Fax:+972-4-9593245 >>>>>>P.O. 
Box 586 Yokneam 20692 ISRAEL >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>-----Original Message----- >>>>>>>From: general-bounces at lists.openfabrics.org >>>>>>>[mailto:general-bounces at lists.openfabrics.org] On >>>>>>> >>>>>>> >>>Behalf Of Hal >>> >>> >>>>>>>Rosenstock >>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM >>>>>>>To: Mark Seger >>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org >>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited) >>>>>>> >>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote: >>>>>>> >>>>>>> >>>>>>>>>The performance managers deal with the counter >>>>>>>>> >>>>>>>>> >>>stickiness (by >>> >>> >>>>>>>>>resetting them when they think they need to). They >>>>>>>>> >>>>>>>>> >>>>>>>typically export >>>>>>> >>>>>>> >>>>>>>>>their data although this is not specified by IBA so it is >>>>>>>>> >>>>>>>>> >>>>>>>in a vendor >>>>>>> >>>>>>> >>>>>>>>>proprietary manner. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>so I guess these guys are poor citizens as well... >>>>>>>> >>>>>>>> >>>>>>>Not sure what you mean. >>>>>>> >>>>>>> >>>>>>> >>>>>>>>the real issue as I see it then means nobody can trust >>>>>>>> >>>>>>>> >>>>>the data if >>>>> >>>>> >>>>>>>>randon tools randomly reset the counters. a real shame... >>>>>>>> >>>>>>>> >>>>>>>I consider this to be a real rather than random app for this. >>>>>>>Guess it depends on what one considers random. 
>>>>>>> >>>>>>>-- Hal >>>>>>> >>>>>>> >>>>>>> >>>>>>>>-mark >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>_______________________________________________ >>>>>>>general mailing list >>>>>>>general at lists.openfabrics.org >>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>>> >>>>>>>To unsubscribe, please visit >>>>>>>http://openib.org/mailman/listinfo/openib-general >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>>>_______________________________________________ >>>>general mailing list >>>>general at lists.openfabrics.org >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>>To unsubscribe, please visit >>>>http://openib.org/mailman/listinfo/openib-general >>>> >>>> >>>> From halr at voltaire.com Wed Jul 11 07:22:30 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 10:22:30 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <4694E61F.8000502@hp.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> Message-ID: <1184163750.17622.96256.camel@hal.voltaire.com> On Wed, 2007-07-11 at 10:15, Mark Seger wrote: > My basic philosophy, and I suspect there are those who might disagree, > is that you can't use the network to monitor the network, at least not > in times of trouble. Right, in times of certain troubles. > That's why I insist on having to query the HCAs > directly since I can't always be sure the network is there and/or > reliable. 
If you are willing to concede that this can indeed happen
> than the question becomes one of how do you reliably get data from an
> HCA and that's the basis for my (re)starting this discussion.

The reliability comes from timeout/retry mechanisms. If performance data
cannot be obtained on an IB network, it needs to be troubleshot at a
lower level (by SMPs).

In any case, a rearchitecture of the PMA was proposed and seems
reasonable to me in that it can accommodate either approach. All that is
needed now is for someone to step up and champion an implementation of
this. Unfortunately, I do not have time to do so.

> As for querying the switch for counters, what do you do on a very large
> network, say 10s of thousands of nodes if you want to get performance
> data every second? I also realize this is an extreme situation today
> (the node count not the frequency of monitoring) but I'm sure everyone
> would agree systems of these sizes are not that far off.

You have a distributed performance manager to handle this. A hierarchy
of performance managers has been discussed on the list before.

-- Hal

> -mark
>
> Hal Rosenstock wrote:
>
> >Hi Eitan,
> >
> >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
> >
> >
> >>Hi Ira,
> >>
> >>
> >>
> >>>Second, I have run some tests querying the fabric of our
> >>>large clusters here (~500 nodes) and the results were
> >>>promising for a single node implementation.
> >>>I don't recall the numbers as this was a while ago but it was
> >>>on the order of
> >>><2 sec and I think <1 but I don't want to be misquoted.
> >>>
> >>>
> >>Does PerfMgr query switch ports ?
> >>
> >>
> >
> >Yes (of course it does).
> >
> >
> >
> >>If it does I am surprised by the short sweep time you got.
> >>
> >>Does it have >1 query on the wire at a given time?
> >>
> >>
> >
> >Yes, Default appears to be 500 currently (maybe that needs dialing back
> >a bit) but is settable via perfmgr_max_outstanding_queries in options
> >file.
> > > > > > > >>If not then I am even more surprised. > >> > >>Was the cluster running a job at the time of the query ? > >> > >> > > > >Is this question related to VL0 contention ? > > > >-- Hal > > > > > > > >>Thanks > >> > >>Eitan Zahavi > >>Senior Engineering Director, Software Architect > >>Mellanox Technologies LTD > >>Tel:+972-4-9097208 > >>Fax:+972-4-9593245 > >>P.O. Box 586 Yokneam 20692 ISRAEL > >> > >> > >> > >> > >> > >>>-----Original Message----- > >>>From: Ira Weiny [mailto:weiny2 at llnl.gov] > >>>Sent: Tuesday, July 10, 2007 7:47 PM > >>>To: Eitan Zahavi > >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; > >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM > >>>Subject: Re: [ofa-general] IB performance stats (revisited) > >>> > >>>On Thu, 28 Jun 2007 10:24:59 +0300 > >>>"Eitan Zahavi" wrote: > >>> > >>> > >>> > >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > >>>>> > >>>>> > >>>>>>In the last months it is the second time I hear people > >>>>>> > >>>>>> > >>>>>complaining the > >>>>> > >>>>> > >>>>>>current monitoring solution in OFA is integrated with OpenSM. > >>>>>> > >>>>>> > >>>>>I must have missed this both times (didn't see this in Mark's > >>>>>post) and the statement itself is somewhat inaccurate as well. > >>>>> > >>>>> > >>>>Private talks - I hope they will speak up for themselves now... > >>>> > >>>> > >>>>>>These people do not use OpenSM but do use OFED. > >>>>>> > >>>>>> > >>>>>I'm not sure I'm following what you mean here. > >>>>> > >>>>>If you mean that some people want to run PerfMgr without > >>>>> > >>>>> > >>>the SM/SA > >>> > >>> > >>>>>aspects (so that they can run a vendor based SM), that is > >>>>> > >>>>> > >>>the next > >>> > >>> > >>>>>thing we are adding to the implementation. > >>>>> > >>>>> > >>>>Exactly. OK when is that coming? > >>>> > >>>> > >>>There is very little which ties the current PerfMgr to > >>>OpenSM. Basically it just gets the current fabric topology. > >>>As Hal has said changes are coming. 
> >>> > >>> > >>> > >>>>>> Another drawback if that > >>>>>>no naming is provided and the reporting uses GUIDs. > >>>>>> > >>>>>> > >>>>>Naming is provided via NodeDescription. > >>>>> > >>>>> > >>>>This might be good for hosts but is not covering switches ... > >>>> > >>>> > >>>It does include switches. However, since most systems have > >>>the same name for multiple switches this becomes ineffective. > >>> I have queried Voltaire for a way to change the > >>>NodeDescription for switches, but at the time I asked, there > >>>was no way to do it. Perhaps there is now? What about other > >>>vendors? This is why ibnetdiscover and other diags have > >>>"switch map" support. (A GUID->name mapping to override the > >>>default NodeDescription.) Nothing would please me more than > >>>to be able to remove that for a more "automatic" solution. > >>> > >>> > >>> > >>>>>>I also can't hold myself from saying again I think you > >>>>>> > >>>>>> > >>>are going > >>> > >>> > >>>>>>to hit the wall with the concept of doing the PMA from > >>>>>> > >>>>>> > >>>a single node. > >>> > >>> > >>>>>If you are referring to the fact the PerMgr is currently not > >>>>>distributed, that will be done as has been stated before. > >>>>> > >>>>> > >>>>Good. When is it expected? Will it be OFED 1.3? > >>>> > >>>> > >>>When Hal first sent out the PerfMgr design I thought we > >>>should jump right to the distributed model as well. But now > >>>I am glad we have gone the way we did. > >>>First off, we have something which "works" and from which we > >>>can expand. > >>>Second, I have run some tests querying the fabric of our > >>>large clusters here (~500 nodes) and the results were > >>>promising for a single node implementation. > >>>I don't recall the numbers as this was a while ago but it was > >>>on the order of > >>><2 sec and I think <1 but I don't want to be misquoted. > >>> > >>>For sure, a distributed model offers many advantages and we > >>>will get there. 
But for many the current single node > >>>approach should work just fine. > >>> > >>>Thanks, > >>>Ira > >>> > >>> > >>> > >>>>Thanks > >>>> > >>>> > >>>>>-- Hal > >>>>> > >>>>> > >>>>> > >>>>>>Eitan Zahavi > >>>>>>Senior Engineering Director, Software Architect Mellanox > >>>>>> > >>>>>> > >>>>>Technologies > >>>>> > >>>>> > >>>>>>LTD > >>>>>>Tel:+972-4-9097208 > >>>>>>Fax:+972-4-9593245 > >>>>>>P.O. Box 586 Yokneam 20692 ISRAEL > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>>-----Original Message----- > >>>>>>>From: general-bounces at lists.openfabrics.org > >>>>>>>[mailto:general-bounces at lists.openfabrics.org] On > >>>>>>> > >>>>>>> > >>>Behalf Of Hal > >>> > >>> > >>>>>>>Rosenstock > >>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM > >>>>>>>To: Mark Seger > >>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org > >>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited) > >>>>>>> > >>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > >>>>>>> > >>>>>>> > >>>>>>>>>The performance managers deal with the counter > >>>>>>>>> > >>>>>>>>> > >>>stickiness (by > >>> > >>> > >>>>>>>>>resetting them when they think they need to). They > >>>>>>>>> > >>>>>>>>> > >>>>>>>typically export > >>>>>>> > >>>>>>> > >>>>>>>>>their data although this is not specified by IBA so it is > >>>>>>>>> > >>>>>>>>> > >>>>>>>in a vendor > >>>>>>> > >>>>>>> > >>>>>>>>>proprietary manner. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>so I guess these guys are poor citizens as well... > >>>>>>>> > >>>>>>>> > >>>>>>>Not sure what you mean. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>the real issue as I see it then means nobody can trust > >>>>>>>> > >>>>>>>> > >>>>>the data if > >>>>> > >>>>> > >>>>>>>>randon tools randomly reset the counters. a real shame... > >>>>>>>> > >>>>>>>> > >>>>>>>I consider this to be a real rather than random app for this. > >>>>>>>Guess it depends on what one considers random. 
> >>>>>>> > >>>>>>>-- Hal > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>-mark > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>_______________________________________________ > >>>>>>>general mailing list > >>>>>>>general at lists.openfabrics.org > >>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>>>> > >>>>>>>To unsubscribe, please visit > >>>>>>>http://openib.org/mailman/listinfo/openib-general > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>>_______________________________________________ > >>>>general mailing list > >>>>general at lists.openfabrics.org > >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>> > >>>>To unsubscribe, please visit > >>>>http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> > >>>> > From eitan at mellanox.co.il Wed Jul 11 07:29:56 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 11 Jul 2007 17:29:56 +0300 Subject: [ofa-general] IB performance stats (revisited) References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com> Hi Marc, I published an RFC and later had discussions regarding the distribution of query ownership of switch counters. Making this ownership purely dynamic, semi-dynamic or even static is an implementation tradeoff. 
However, it can be shown that the maximal number of switches a single
compute node would be responsible for is <= the number of switch levels.
So it is no problem to get counters every second...

The issue is: what do you do with the volume of data collected?
This is only relevant if monitoring is run in "profiling mode"; otherwise
only link health errors should be reported.

My proposal is to have a reporting algorithm that reports only a "change
of data rate", with "change" being defined "adaptively".
In other words: a node should report a change of port activity upstream
only if the data rate changed by more than a factor of X.
Assuming we want a logarithmic scale, X == 2 would work like this:
At the first sample there is no traffic. All counters will need to make
their way to the "master" node.
When traffic starts, the change of data rate is infinite, which will
cause all new rates (call each one R) to be sent.
From that moment on, only ports whose data rate reaches 2R or 0.5R will
be reported.
The integration period should be configurable.

I wish I had time to implement it ...

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Mark Seger [mailto:Mark.Seger at hp.com]
> Sent: Wednesday, July 11, 2007 5:16 PM
> To: Hal Rosenstock
> Cc: Eitan Zahavi; Ira Weiny; general at lists.openfabrics.org;
> Ed.Finn at FMR.COM
> Subject: Re: [ofa-general] IB performance stats (revisited)
>
> My basic philosophy, and I suspect there are those who might
> disagree, is that you can't use the network to monitor the
> network, at least not in times of trouble. That's why I
> insist on having to query the HCAs directly since I can't
> always be sure the network is there and/or reliable. If you
> are willing to concede that this can indeed happen than the
> question becomes one of how do you reliably get data from an
> HCA and that's the basis for my (re)starting this discussion.
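A minimal sketch of the reporting rule proposed above, in Python. The class and method names are hypothetical and nothing here is taken from PerfMgr or OpenSM. It assumes a rate is sent upstream on the first sample, again when traffic starts from zero, and afterwards only when it reaches `factor` times (or 1/`factor` of) the last rate that was reported.

```python
class RateChangeReporter:
    """Decide, per port, whether a newly measured data rate is sent upstream.

    Hypothetical helper: 'factor' plays the role of X in the proposal.
    """

    def __init__(self, factor=2.0):
        if factor <= 1.0:
            raise ValueError("factor must be greater than 1")
        self.factor = factor
        self.last_reported = {}  # port id -> last rate sent upstream

    def should_report(self, port, rate):
        last = self.last_reported.get(port)
        if last is None:
            # First sample: every counter makes its way to the master.
            changed = True
        elif last == 0:
            # Traffic starting from zero is an "infinite" change.
            changed = rate > 0
        else:
            # Report only when the rate reaches factor*last or last/factor.
            changed = (rate >= last * self.factor
                       or rate <= last / self.factor)
        if changed:
            self.last_reported[port] = rate
        return changed


if __name__ == "__main__":
    reporter = RateChangeReporter(factor=2.0)
    for rate in [0, 0, 100, 150, 199, 200, 90, 100, 0]:
        print(rate, reporter.should_report("sw1/p1", rate))
```

Comparing against the last reported rate, rather than the previous sample, is a design choice made here: a slow drift still gets reported once it has accumulated to a factor of X.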
> > As for querying the switch for counters, what do you do on a > very large network, say 10s of thousands of nodes if you want > to get performance data every second? I also realize this is > an extreme situation today (the node count not the frequency > of monitoring) but I'm sure everyone would agree systems of > these sizes are not that far off. > > -mark > > Hal Rosenstock wrote: > > >Hi Eitan, > > > >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote: > > > > > >>Hi Ira, > >> > >> > >> > >>>Second, I have run some tests querying the fabric of our large > >>>clusters here (~500 nodes) and the results were promising for a > >>>single node implementation. > >>>I don't recall the numbers as this was a while ago but it > was on the > >>>order of > >>><2 sec and I think <1 but I don't want to be misquoted. > >>> > >>> > >>Does PerfMgr query switch ports ? > >> > >> > > > >Yes (of course it does). > > > > > > > >>If it does I am surprised by the short sweep time you got. > >> > >>Does it have >1 query on the wire at a given time? > >> > >> > > > >Yes, Default appears to be 500 currently (maybe that needs > dialing back > >a bit) but is settable via perfmgr_max_outstanding_queries > in options > >file. > > > > > > > >>If not then I am even more surprised. > >> > >>Was the cluster running a job at the time of the query ? > >> > >> > > > >Is this question related to VL0 contention ? > > > >-- Hal > > > > > > > >>Thanks > >> > >>Eitan Zahavi > >>Senior Engineering Director, Software Architect Mellanox > Technologies > >>LTD > >>Tel:+972-4-9097208 > >>Fax:+972-4-9593245 > >>P.O. 
Box 586 Yokneam 20692 ISRAEL > >> > >> > >> > >> > >> > >>>-----Original Message----- > >>>From: Ira Weiny [mailto:weiny2 at llnl.gov] > >>>Sent: Tuesday, July 10, 2007 7:47 PM > >>>To: Eitan Zahavi > >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; > >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM > >>>Subject: Re: [ofa-general] IB performance stats (revisited) > >>> > >>>On Thu, 28 Jun 2007 10:24:59 +0300 > >>>"Eitan Zahavi" wrote: > >>> > >>> > >>> > >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote: > >>>>> > >>>>> > >>>>>>In the last months it is the second time I hear people > >>>>>> > >>>>>> > >>>>>complaining the > >>>>> > >>>>> > >>>>>>current monitoring solution in OFA is integrated with OpenSM. > >>>>>> > >>>>>> > >>>>>I must have missed this both times (didn't see this in Mark's > >>>>>post) and the statement itself is somewhat inaccurate as well. > >>>>> > >>>>> > >>>>Private talks - I hope they will speak up for themselves now... > >>>> > >>>> > >>>>>>These people do not use OpenSM but do use OFED. > >>>>>> > >>>>>> > >>>>>I'm not sure I'm following what you mean here. > >>>>> > >>>>>If you mean that some people want to run PerfMgr without > >>>>> > >>>>> > >>>the SM/SA > >>> > >>> > >>>>>aspects (so that they can run a vendor based SM), that is > >>>>> > >>>>> > >>>the next > >>> > >>> > >>>>>thing we are adding to the implementation. > >>>>> > >>>>> > >>>>Exactly. OK when is that coming? > >>>> > >>>> > >>>There is very little which ties the current PerfMgr to OpenSM. > >>>Basically it just gets the current fabric topology. > >>>As Hal has said changes are coming. > >>> > >>> > >>> > >>>>>> Another drawback if that > >>>>>>no naming is provided and the reporting uses GUIDs. > >>>>>> > >>>>>> > >>>>>Naming is provided via NodeDescription. > >>>>> > >>>>> > >>>>This might be good for hosts but is not covering switches ... > >>>> > >>>> > >>>It does include switches. 
However, since most systems > have the same > >>>name for multiple switches this becomes ineffective. > >>> I have queried Voltaire for a way to change the > NodeDescription for > >>>switches, but at the time I asked, there was no way to do it. > >>>Perhaps there is now? What about other vendors? This is why > >>>ibnetdiscover and other diags have "switch map" support. (A > >>>GUID->name mapping to override the default > NodeDescription.) Nothing > >>>would please me more than to be able to remove that for a more > >>>"automatic" solution. > >>> > >>> > >>> > >>>>>>I also can't hold myself from saying again I think you > >>>>>> > >>>>>> > >>>are going > >>> > >>> > >>>>>>to hit the wall with the concept of doing the PMA from > >>>>>> > >>>>>> > >>>a single node. > >>> > >>> > >>>>>If you are referring to the fact the PerMgr is currently not > >>>>>distributed, that will be done as has been stated before. > >>>>> > >>>>> > >>>>Good. When is it expected? Will it be OFED 1.3? > >>>> > >>>> > >>>When Hal first sent out the PerfMgr design I thought we > should jump > >>>right to the distributed model as well. But now I am glad we have > >>>gone the way we did. > >>>First off, we have something which "works" and from which we can > >>>expand. > >>>Second, I have run some tests querying the fabric of our large > >>>clusters here (~500 nodes) and the results were promising for a > >>>single node implementation. > >>>I don't recall the numbers as this was a while ago but it > was on the > >>>order of > >>><2 sec and I think <1 but I don't want to be misquoted. > >>> > >>>For sure, a distributed model offers many advantages and > we will get > >>>there. But for many the current single node approach should work > >>>just fine. 
> >>> > >>>Thanks, > >>>Ira > >>> > >>> > >>> > >>>>Thanks > >>>> > >>>> > >>>>>-- Hal > >>>>> > >>>>> > >>>>> > >>>>>>Eitan Zahavi > >>>>>>Senior Engineering Director, Software Architect Mellanox > >>>>>> > >>>>>> > >>>>>Technologies > >>>>> > >>>>> > >>>>>>LTD > >>>>>>Tel:+972-4-9097208 > >>>>>>Fax:+972-4-9593245 > >>>>>>P.O. Box 586 Yokneam 20692 ISRAEL > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>>-----Original Message----- > >>>>>>>From: general-bounces at lists.openfabrics.org > >>>>>>>[mailto:general-bounces at lists.openfabrics.org] On > >>>>>>> > >>>>>>> > >>>Behalf Of Hal > >>> > >>> > >>>>>>>Rosenstock > >>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM > >>>>>>>To: Mark Seger > >>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org > >>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited) > >>>>>>> > >>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote: > >>>>>>> > >>>>>>> > >>>>>>>>>The performance managers deal with the counter > >>>>>>>>> > >>>>>>>>> > >>>stickiness (by > >>> > >>> > >>>>>>>>>resetting them when they think they need to). They > >>>>>>>>> > >>>>>>>>> > >>>>>>>typically export > >>>>>>> > >>>>>>> > >>>>>>>>>their data although this is not specified by IBA so it is > >>>>>>>>> > >>>>>>>>> > >>>>>>>in a vendor > >>>>>>> > >>>>>>> > >>>>>>>>>proprietary manner. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>so I guess these guys are poor citizens as well... > >>>>>>>> > >>>>>>>> > >>>>>>>Not sure what you mean. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>the real issue as I see it then means nobody can trust > >>>>>>>> > >>>>>>>> > >>>>>the data if > >>>>> > >>>>> > >>>>>>>>randon tools randomly reset the counters. a real shame... > >>>>>>>> > >>>>>>>> > >>>>>>>I consider this to be a real rather than random app for this. > >>>>>>>Guess it depends on what one considers random. 
> >>>>>>> > >>>>>>>-- Hal > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>-mark > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>_______________________________________________ > >>>>>>>general mailing list > >>>>>>>general at lists.openfabrics.org > >>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>>>> > >>>>>>>To unsubscribe, please visit > >>>>>>>http://openib.org/mailman/listinfo/openib-general > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>>_______________________________________________ > >>>>general mailing list > >>>>general at lists.openfabrics.org > >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>> > >>>>To unsubscribe, please visit > >>>>http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> > >>>> > > From Mark.Seger at hp.com Wed Jul 11 07:51:01 2007 From: Mark.Seger at hp.com (Mark Seger) Date: Wed, 11 Jul 2007 10:51:01 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com> Message-ID: <4694EE55.6050107@hp.com> Eitan Zahavi wrote: >Hi Marc, > >I published an RFC and later had discussions regarding the distribution >of query ownership of switch counters. >Making this ownership purely dynamic, semi-dynamic or even static is an >implementation tradeoff. 
>However, it can be shown that the maximal number of switches a single
>compute node would be responsible for is <= number of switch levels. So
>no problem to get counters every second...
>
>The issue is: what do you do with the size of data collected?
>This is only relevant if monitoring is run in "profiling mode" otherwise
>only link health errors should be reported.
>
>

I typically use IB data as performance data for system/application
diagnostics. I run a tool I wrote (see
http://sourceforge.net/projects/collectl/) as a service on most systems
and it gathers hundreds of performance metrics/counters on everything
from CPU load, memory, network, InfiniBand, disk, etc. The philosophy
here is that if something goes wrong, it may be too late to then run
some diagnostic. Rather, you need to have already collected the data,
especially if this is an intermittent problem. When there is no need to
look at the data, it just gets purged after a week.

There have been situations where someone reports that a batch program
they ran the other day was really slow and they didn't change anything.
Being able to pull up a monitoring log and see what the system was doing
at the time of the run might reveal that their network was saturated and
therefore their MPI job was impacted. You can't very well turn on
diagnostics and rerun the application because system conditions have
probably changed.

Does that help? Why don't you try installing collectl and see what it
does...
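The "collect always, purge after a week, look back when something goes wrong" approach can be sketched roughly as follows in Python. This is not how collectl is implemented; the `RollingLog` class, its methods and the one-week default are illustrative assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque

WEEK_SECONDS = 7 * 24 * 3600  # assumed retention period


class RollingLog:
    """Keep locally collected samples and drop those older than 'retention'."""

    def __init__(self, retention=WEEK_SECONDS):
        self.retention = retention
        self.samples = deque()  # (timestamp, counters), oldest first

    def record(self, timestamp, counters):
        # Store a copy so later changes by the caller do not alter history.
        self.samples.append((timestamp, dict(counters)))
        cutoff = timestamp - self.retention
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def window(self, start, end):
        """Return the samples taken between start and end (inclusive)."""
        return [(t, c) for t, c in self.samples if start <= t <= end]


if __name__ == "__main__":
    log = RollingLog(retention=10)
    log.record(0, {"ib_rx_kb": 0})
    log.record(5, {"ib_rx_kb": 900})
    log.record(12, {"ib_rx_kb": 50})  # the sample from t=0 is purged here
    print(log.window(4, 6))
```

When a user reports that yesterday's job was slow, `window()` is the lookup that pulls up what the node saw during that run, without having to reproduce the conditions.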
-mark

From Mark.Seger at hp.com Wed Jul 11 08:00:21 2007
From: Mark.Seger at hp.com (Mark Seger)
Date: Wed, 11 Jul 2007 11:00:21 -0400
Subject: [ofa-general] IB performance stats (revisited)
In-Reply-To: <1184163750.17622.96256.camel@hal.voltaire.com>
References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <1184163750.17622.96256.camel@hal.voltaire.com>
Message-ID: <4694F085.4010502@hp.com>

Hal Rosenstock wrote:

>On Wed, 2007-07-11 at 10:15, Mark Seger wrote:
>
>
>>My basic philosophy, and I suspect there are those who might disagree,
>>is that you can't use the network to monitor the network, at least not
>>in times of trouble.
>>
>>
>
>Right, in times of certain troubles.
>
>

And that is the key. Since you can't know a priori when you're about to
have troubles, you need to be collecting the data locally before they occur.

>>That's why I insist on having to query the HCAs
>>directly since I can't always be sure the network is there and/or
>>reliable. If you are willing to concede that this can indeed happen
>>than the question becomes one of how do you reliably get data from an
>>HCA and that's the basis for my (re)starting this discussion.
>>
>>
>
>The reliability comes from timeout/retry mechanisms. If performance data
>cannot be obtained on an IB network, it needs to be trouble shooted at a
>lower level (by SMPs).
>
>In any case, a rearchitecture of the PMA was proposed and seems
>reasonable to me in that it can accommodate either approach. All that is
>needed now is for someone to step up and champion an implementation of
>this. Unfortunately, I do not have time to do so.
>

I don't know if what I've been proposing requires any rearchitecting, as I see it as something local to each node. Specifically, and there is already an implementation of this in an earlier Voltaire stack, the idea is to export wrapping HCA counters to /proc. The module that does this reads and clears the counters on every access, but since no local applications are accessing the counters directly, clearing them doesn't hurt anyone. Alas, anyone else who wants to query the counters will find them reset.

The other side benefit of exporting these counters in such a way is that lots of others can now collect/report this info. In other words, if someone chose to add IB stats to sar, it would become very easy to do!

If this is the type of thing people are interested in, I might be able to supply some code to do it.

>>As for querying the switch for counters, what do you do on a very large
>>network, say 10s of thousands of nodes if you want to get performance
>>data every second? I also realize this is an extreme situation today
>>(the node count not the frequency of monitoring) but I'm sure everyone
>>would agree systems of these sizes are not that far off.
>>
>>
>
>You have a distributed performance manager to handle this. A hierarchy
>of performance managers has been discussed on the list before.
>
>

ahh, I see.
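
The read-and-clear scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Voltaire module: the class and function names are hypothetical, the 64-bit width of the exported counter is assumed, and the hardware read is replaced by a plain integer argument. Each cleared hardware sample is folded into a wider running total that wraps, and a consumer takes wrap-safe differences between two readings of that total.

```python
# Hypothetical sketch: fold read-and-clear hardware samples into a wider
# wrapping counter, as a /proc-style export might do. The 64-bit width is
# an assumption made for this example.
WIDTH_BITS = 64
MASK = (1 << WIDTH_BITS) - 1


class WrappingCounter:
    """Running total that wraps at 2**WIDTH_BITS instead of saturating."""

    def __init__(self, start=0):
        self.total = start & MASK

    def absorb(self, sample):
        # 'sample' is the value read from the hardware counter just before
        # it was cleared, so it only covers the interval since the last read.
        self.total = (self.total + sample) & MASK
        return self.total


def delta(prev, cur):
    """Wrap-safe difference between two readings of the exported total."""
    return (cur - prev) & MASK


if __name__ == "__main__":
    counter = WrappingCounter()
    before = counter.total
    for sample in (1200, 800, 0, 4500):  # made-up per-interval samples
        counter.absorb(sample)
    print("bytes since first reading:", delta(before, counter.total))
```

Because consumers only ever look at differences, several local tools could read the exported total without clearing anything themselves; a remote reader of the raw hardware counters would still see them reset.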
-mark From halr at voltaire.com Wed Jul 11 08:16:26 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 11:16:26 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <4694F085.4010502@hp.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <1184163750.17622.96256.camel@hal.voltaire.com> <4694F085.4010502@hp.com> Message-ID: <1184166984.17622.100081.camel@hal.voltaire.com> On Wed, 2007-07-11 at 11:00, Mark Seger wrote: > Hal Rosenstock wrote: > > >On Wed, 2007-07-11 at 10:15, Mark Seger wrote: > > > > > >>My basic philosophy, and I suspect there are those who might disagree, > >>is that you can't use the network to monitor the network, at least not > >>in times of trouble. > >> > >> > > > >Right, in times of certain troubles. > > > > > and that is the key. since you can't know apriori when you're about to > have troubles, you need to be collecting the data locally before they occur. > > >>That's why I insist on having to query the HCAs > >>directly since I can't always be sure the network is there and/or > >>reliable. If you are willing to concede that this can indeed happen > >>than the question becomes one of how do you reliably get data from an > >>HCA and that's the basis for my (re)starting this discussion. > >> > >> > > > >The reliability comes from timeout/retry mechanisms. 
If performance data > >cannot be obtained on an IB network, it needs to be trouble shooted at a > >lower level (by SMPs). > > > >In any case, a rearchitecture of the PMA was proposed and seems > >reasonable to me in that it can accomodate either approach. All that is > >needed now is for someone to step up and champion an implementation of > >this. Unfortunately, I do not have time to do so. > > > > > I don't know if what I've been proposing requires any rearchitecting as > I see is as something local to each node. There was some rearchitecting to make it meet the needs to what you have proposed in addition to that of the IB performance manager. I think Jason had a good proposal for this. -- Hal > Specificially, and there is > already an implementation of this in an earlier voltaire stack, is to > export wrapping HCA counters to /proc. The module that does this > read/clears the counters on every access but since no local applications > are accessing the counters directly, clearing them doesn't hurt anyone. > Alas, anyone else who wants to query the counters will find them reset. > > The other side benefit of exporting these counters is such a way is now > lots of others can collect/report this info. In other words is someone > chose to add IB stats to sar, it would become very easy to do! > > If this is the type of thing people are interested in, I might be able > to supply some code to do it. > > >>As for querying the switch for counters, what do you do on a very large > >>network, say 10s of thousands of nodes if you want to get performance > >>data every second? I also realize this is an extreme situation today > >>(the node count not the frequency of monitoring) but I'm sure everyone > >>would agree systems of these sizes are not that far off. > >> > >> > > > >You have a distributed performance manager to handle this. A hierarchy > >of performance managers has been discussed on the list before. > > > > > ahh, I see. 
> -mark > > From eitan at mellanox.co.il Wed Jul 11 08:30:04 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 11 Jul 2007 18:30:04 +0300 Subject: [ofa-general] IB performance stats (revisited) References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com> <4694EE55.6050107@hp.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com> Hi Marc, I wish I had a large enough fabric worth testing collectl on... I did the math for how much data would be collected for 10Knodes cluster. It is ~7MB for each iteration: 10K ports * 6 (3 level fabric * 2 ports on each link) * 32 byte (data/pkts tx/rx) + 22byte (err counters) + 64byte (cong counters) = 116bytes Seems reasonable - but adds up to large amount of data over a day period assuming a collect every second: 24*60*60 *116*10000*6 = 6.01344e+11 Bytes of storage Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Mark Seger [mailto:Mark.Seger at hp.com] > Sent: Wednesday, July 11, 2007 5:51 PM > To: Eitan Zahavi > Cc: Hal Rosenstock; Ira Weiny; general at lists.openfabrics.org; > Ed.Finn at FMR.COM > Subject: Re: [ofa-general] IB performance stats (revisited) > > > > Eitan Zahavi wrote: > > >Hi Marc, > > > >I published an RFC and later had discussions regarding the > distribution > >of query ownership of switch counters. > >Making this ownership purely dynamic, semi-dynamic or even > static is an > >implementation tradeoff. > >However, it can be shown that the maximal number of switches > a single > >compute node would be responsible for is <= number of switch > levels. So > >no problem to get counters every second... > > > >The issue is: what do you do with the size of data collected? > >This is only relevant if monitoring is run in "profiling mode" > >otherwise only link health errors should be reported. > > > > > I use IB data for performance data typically for > system/application diagnostics. I run a tool I wrote (see > http://sourceforge.net/projects/collectl/) as a service on > most systems and it gathers well over hundreds of performance > metrics/counters on everything from cpu load, memory, > network, infiniband, disk, etc. The philosophy here is that > if something goes wrong, it may be too late to then run some > diagnostic. Rather you need to have already collected the > data, especially if this is an intemittent problem. When > there is no need to look at the data, it just gets purged > away after a week. > > There have been situation where someone reports a batch > program they ran the other day was really slow and they > didn't change anything. By being able to pull up a > monitoring log and seeing what the system was doing at the > time of the run might reveal their network was saturated and > therefore their MPI job was impacted. 
You can't very well > turn on diagnostics and rerun the application because system > conditions have probably changed. > > Does that help? Why don't you try installing collectl and > see what it does... > > -mark > > > From halr at voltaire.com Wed Jul 11 08:54:06 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 11:54:06 -0400 Subject: [ofa-general] Toward next OFED release (1.3) In-Reply-To: <1184094759.17622.15371.camel@hal.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <1184091830.17622.12007.camel@hal.voltaire.com> <200707102111.28374.cap@nsc.liu.se> <1184094759.17622.15371.camel@hal.voltaire.com> Message-ID: <1184169244.17622.102683.camel@hal.voltaire.com> On Tue, 2007-07-10 at 15:12, Hal Rosenstock wrote: > On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote: > > On Tuesday 10 July 2007, Hal Rosenstock wrote: > > ... > > > > Management: > > > > * Multiple partitions > > > > * OpenSM > > > > * More routing performance improvements > > > > * Even more speedups > > > > * Better packaging/installation > > > > * “Native” daemon mode > > > > * Performance management > > > > * Quality of Service manager: Based on IBTA annex > > > > > > enhancements for fat tree routing (non pure tree support) > > > more console commands and telnet access to console > > > > Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and > > in which way OFED-1.2 opensm performs badly for these? The following patch contains some of the answers to the above: -----Forwarded Message----- From: Yevgeny Kliteynik To: Hal Rosenstock Cc: OpenIB Subject: [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree Date: 09 Jul 2007 11:32:49 +0300 Hi Hal. Updating doc and osm manpage with the recent enhancement of fat-tree routing. 
Signed-off-by: Yevgeny Kliteynik --- opensm/doc/current-routing.txt | 28 ++++++++++++++++++++++------ opensm/man/opensm.8 | 33 ++++++++++++++++++++++++++------- 2 files changed, 48 insertions(+), 13 deletions(-) diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 9852ef0..76f91ba 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -174,11 +174,14 @@ Fat-tree Routing Algorithm Purpose: The fat-tree algorithm optimizes routing for "shift" communication pattern. -It should be chosen if a subnet is a symmetrical fat-tree of various types. +It should be chosen if a subnet is a symmetrical or almost symmetrical +fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any CBB ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks. -Fat-tree algorithm supports topologies that comply with the following rules: + +If the root guid file is not provided ('-a' or '--root_guid_file' options), +the topology has to be pure fat-tree that complies with the following rules: - Tree rank should be between two and eight (inclusively) - Switches of the same rank should have the same number of UP-going port groups*, unless they are root switches, @@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules: of ports in each UP-going port group. - Switches of the same rank should have the same number of ports in each DOWN-going port group. -*ports that are connected to the same remote switch are referenced as + - All the CAs have to be at the same tree level (rank). + +If the root guid file is provided, the topology doesn't have to be pure +fat-tree, and it should only comply with the following rules: + - Tree rank should be between two and eight (inclusively) + - All the Compute Nodes** have to be at the same tree level (rank). 
+ Note that non-compute node CAs are allowed here to be at different + tree ranks. + +* ports that are connected to the same remote switch are referenced as 'port group'. +** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file' +OpenSM options. Note that although fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in case of integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. +In general, even if the root list is provided, the closer the topology to a +pure and symmetrical fat-tree, the more optimal the routing will be. -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the -same directory where the OpenSM log resides. This ordering file provides the -CA order that may be used to create efficient communication pattern, that +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) +in the same directory where the OpenSM log resides. This ordering file provides +the CN order that may be used to create efficient communication pattern, that will match the routing tables. diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 index 5f34cd1..5472faf 100644 --- a/opensm/man/opensm.8 +++ b/opensm/man/opensm.8 @@ -603,7 +603,7 @@ UPDN Algorithm Usage Activation through OpenSM Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm. -Use '-a ' for adding an UPDN guid file that contains the +Use '-a ' for adding an UPDN guid file that contains the root nodes for ranking. If the `-a' option is not used, OpenSM uses its auto-detect root nodes algorithm. @@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node. Fat-tree Routing Algorithm The fat-tree algorithm optimizes routing for "shift" communication pattern. 
-It should be chosen if a subnet is a symmetrical fat-tree of various types. +It should be chosen if a subnet is a symmetrical or almost symmetrical +fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any CBB ratio. As in UPDN, fat-tree also prevents credit-loop-deadlocks. -The Fat-tree algorithm supports topologies that comply with the following rules: +If the root guid file is not provided ('-a' or '--root_guid_file' options), +the topology has to be pure fat-tree that complies with the following rules: - Tree rank should be between two and eight (inclusively) - Switches of the same rank should have the same number of UP-going port groups*, unless they are root switches, @@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules: of ports in each UP-going port group. - Switches of the same rank should have the same number of ports in each DOWN-going port group. + - All the CAs have to be at the same tree level (rank). -Note: ports that are connected to the same remote switch are referenced as +If the root guid file is provided, the topology doesn't have to be pure +fat-tree, and it should only comply with the following rules: + - Tree rank should be between two and eight (inclusively) + - All the Compute Nodes** have to be at the same tree level (rank). + Note that non-compute node CAs are allowed here to be at different + tree ranks. + +* ports that are connected to the same remote switch are referenced as \'port group\'. +** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\' +OpenSM options. + Topologies that do not comply cause a fallback to min hop routing. Note that this can also occur on link failures which cause the topology to no longer be "pure" fat-tree. @@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio. 
In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. +In general, even if the root list is provided, the closer the topology to a +pure and symmetrical fat-tree, the more optimal the routing will be. -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the -same directory where the OpenSM log resides. This ordering file provides the -CA order that may be used to create efficient communication pattern, that +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) +in the same directory where the OpenSM log resides. This ordering file provides +the CN order that may be used to create efficient communication pattern, that will match the routing tables. Activation through OpenSM Use '-R ftree' option to activate the fat-tree algorithm. +Use '-a ' to provide root nodes for ranking. If the `-a' option +is not used, routing algorithm will detect roots automatically. +Use '-u ' to provide the list of compute nodes. If the `-u' option +is not used, all the CAs are considered as compute nodes. Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead. -- 1.5.1.4 > Yevgeny, > > Could you elaborate on this ? Thanks. > > -- Hal > > > Or maybe there are some nice docs for me to sink my teeth into... 
> > > > /Peter > > > > ______________________________________________________________________ > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Mark.Seger at hp.com Wed Jul 11 08:56:31 2007 From: Mark.Seger at hp.com (Mark Seger) Date: Wed, 11 Jul 2007 11:56:31 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <6C2C79E72C305246B504CBA17B5500C901E049E6@mtlexch01.mtl.com> <4694EE55.6050107@hp.com> <6C2C79E72C305246B504CBA17B5500C901E04A3B@mtlexch01.mtl.com> Message-ID: <4694FDAF.2080001@hp.com> >Hi Marc, > >I wish I had a large enough fabric worth testing collectl on... > > there may be a disconnect here as collectl collects data locally. 
on a typical system, taking 10 second samples for all the different subsystems it supports (though you can certainly turn up the frequency if you like) takes about 2MB/day, and the data is retained for a week. It supports OFED out of the box, using perfquery to read/clear the counters. Just install it and type:

collectl -scmx -oTm

(lots of other combinations of choices) and you'll see cpu, memory and interconnect data with millisec timestamps as follows:

#             <--------CPU--------><-----------Memory----------><----------InfiniBand---------->
#Time         cpu sys inter  ctxsw free buff cach inac slab  map   KBin  pktIn  KBOut pktOut Errs
11:55:06.004    0   0   261     44   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:07.004    0   0   275     61   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:08.004    0   0   251     18   7G  46M 268M 151M 249M  21M      0      0      0      0    0
11:55:09.004    0   0   251     23   7G  46M 268M 151M 249M  21M      0      0      0      0    0

>I did the math for how much data would be collected for 10Knodes
>cluster. It is ~7MB for each iteration:
>10K ports
>* 6 (3 level fabric * 2 ports on each link)
>* 32 byte (data/pkts tx/rx) + 22byte (err counters) + 64byte (cong
>counters) = 116bytes
>
>Seems reasonable - but adds up to large amount of data over a day period
>assuming a collect every second:
>24*60*60 *116*10000*6 = 6.01344e+11 Bytes of storage
>
>

no disagreement. that's why I chose NOT to try to solve the distributed data collection problem. collectl runs locally with <0.1% cpu overhead.

-mark

>Eitan Zahavi
>Senior Engineering Director, Software Architect
>Mellanox Technologies LTD
>Tel:+972-4-9097208
>Fax:+92-4-9593245
>P.O.
Box 586 Yokneam 20692 ISRAEL > > > > > >>-----Original Message----- >>From: Mark Seger [mailto:Mark.Seger at hp.com] >>Sent: Wednesday, July 11, 2007 5:51 PM >>To: Eitan Zahavi >>Cc: Hal Rosenstock; Ira Weiny; general at lists.openfabrics.org; >>Ed.Finn at FMR.COM >>Subject: Re: [ofa-general] IB performance stats (revisited) >> >> >> >>Eitan Zahavi wrote: >> >> >> >>>Hi Marc, >>> >>>I published an RFC and later had discussions regarding the >>> >>> >>distribution >> >> >>>of query ownership of switch counters. >>>Making this ownership purely dynamic, semi-dynamic or even >>> >>> >>static is an >> >> >>>implementation tradeoff. >>>However, it can be shown that the maximal number of switches >>> >>> >>a single >> >> >>>compute node would be responsible for is <= number of switch >>> >>> >>levels. So >> >> >>>no problem to get counters every second... >>> >>>The issue is: what do you do with the size of data collected? >>>This is only relevant if monitoring is run in "profiling mode" >>>otherwise only link health errors should be reported. >>> >>> >>> >>> >>I use IB data for performance data typically for >>system/application diagnostics. I run a tool I wrote (see >>http://sourceforge.net/projects/collectl/) as a service on >>most systems and it gathers well over hundreds of performance >>metrics/counters on everything from cpu load, memory, >>network, infiniband, disk, etc. The philosophy here is that >>if something goes wrong, it may be too late to then run some >>diagnostic. Rather you need to have already collected the >>data, especially if this is an intemittent problem. When >>there is no need to look at the data, it just gets purged >>away after a week. >> >>There have been situation where someone reports a batch >>program they ran the other day was really slow and they >>didn't change anything. 
By being able to pull up a >>monitoring log and seeing what the system was doing at the >>time of the run might reveal their network was saturated and >>therefore their MPI job was impacted. You can't very well >>turn on diagnostics and rerun the application because system >>conditions have probably changed. >> >>Does that help? Why don't you try installing collectl and >>see what it does... >> >>-mark >> >> >> >> >> From cap at nsc.liu.se Wed Jul 11 09:17:09 2007 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 11 Jul 2007 18:17:09 +0200 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <4694F085.4010502@hp.com> References: <46826370.4090602@hp.com> <1184163750.17622.96256.camel@hal.voltaire.com> <4694F085.4010502@hp.com> Message-ID: <200707111817.09205.cap@nsc.liu.se> On Wednesday 11 July 2007, Mark Seger wrote: > I don't know if what I've been proposing requires any rearchitecting as > I see is as something local to each node.  Specificially, and there is > already an implementation of this in an earlier voltaire stack, is to > export wrapping HCA counters to /proc.  The module that does this > read/clears the counters on every access but since no local applications > are accessing the counters directly, clearing them doesn't hurt anyone.   > Alas, anyone else who wants to query the counters will find them reset. > > The other side benefit of exporting these counters is such a way is now > lots of others can collect/report this info.  In other words is someone > chose to add IB stats to sar, it would become very easy to do! I for one would be very happy to have this option. To be able to get simple but real-time data on a specific node. I'm amazed that the counters are non-wrapping or even that they are 32-bit... /Peter > If this is the type of thing people are interested in, I might be able > to supply some code to do it. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From weiny2 at llnl.gov Wed Jul 11 09:19:21 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 11 Jul 2007 09:19:21 -0700 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com> <46826FB8.10904@hp.com> <46827BA0.6070008@hp.com> <1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com> <1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E049B7@mtlexch01.mtl.com> Message-ID: <20070711091921.2ef4ef2e.weiny2@llnl.gov> On Wed, 11 Jul 2007 17:03:35 +0300 "Eitan Zahavi" wrote: > > > > > > Was the cluster running a job at the time of the query ? No, that testing was not completed. 
Ira From halr at voltaire.com Wed Jul 11 09:21:51 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 12:21:51 -0400 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <4694F085.4010502@hp.com> References: <46826370.4090602@hp.com> <1182951169.28870.75880.camel@hal.voltaire.com><46826FB8.10904@hp.com> <46827BA0.6070008@hp.com><1182957688.28870.83013.camel@hal.voltaire.com> <4682994E.1020209@hp.com><1182964334.28870.90291.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD7B7@mtlexch01.mtl.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <1184163750.17622.96256.camel@hal.voltaire.com> <4694F085.4010502@hp.com> Message-ID: <1184170906.17622.104663.camel@hal.voltaire.com> On Wed, 2007-07-11 at 11:00, Mark Seger wrote: > Hal Rosenstock wrote: > > >On Wed, 2007-07-11 at 10:15, Mark Seger wrote: > > > > > >>My basic philosophy, and I suspect there are those who might disagree, > >>is that you can't use the network to monitor the network, at least not > >>in times of trouble. > >> > >> > > > >Right, in times of certain troubles. > > > > > and that is the key. since you can't know apriori when you're about to > have troubles, you need to be collecting the data locally before they occur. > > >>That's why I insist on having to query the HCAs > >>directly since I can't always be sure the network is there and/or > >>reliable. If you are willing to concede that this can indeed happen > >>than the question becomes one of how do you reliably get data from an > >>HCA and that's the basis for my (re)starting this discussion. > >> > >> > > > >The reliability comes from timeout/retry mechanisms. 
If performance data > >cannot be obtained on an IB network, it needs to be trouble shooted at a > >lower level (by SMPs). > > > >In any case, a rearchitecture of the PMA was proposed and seems > >reasonable to me in that it can accomodate either approach. All that is > >needed now is for someone to step up and champion an implementation of > >this. Unfortunately, I do not have time to do so. > > > > > I don't know if what I've been proposing requires any rearchitecting as > I see is as something local to each node. Specificially, and there is > already an implementation of this in an earlier voltaire stack, is to > export wrapping HCA counters to /proc. The module that does this > read/clears the counters on every access but since no local applications > are accessing the counters directly, clearing them doesn't hurt anyone. > Alas, anyone else who wants to query the counters will find them reset. No local application but perhaps a remote one. This is the reason for the proposed rearchitecture (along with synthesizing the wider counters). -- Hal > The other side benefit of exporting these counters is such a way is now > lots of others can collect/report this info. In other words is someone > chose to add IB stats to sar, it would become very easy to do! > > If this is the type of thing people are interested in, I might be able > to supply some code to do it. > > >>As for querying the switch for counters, what do you do on a very large > >>network, say 10s of thousands of nodes if you want to get performance > >>data every second? I also realize this is an extreme situation today > >>(the node count not the frequency of monitoring) but I'm sure everyone > >>would agree systems of these sizes are not that far off. > >> > >> > > > >You have a distributed performance manager to handle this. A hierarchy > >of performance managers has been discussed on the list before. > > > > > ahh, I see. 
> -mark > > From rick.jones2 at hp.com Wed Jul 11 09:37:42 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 11 Jul 2007 09:37:42 -0700 Subject: [ofa-general] Re: should it be possible to run SDP over a T320? In-Reply-To: <20070711061444.GG11320@mellanox.co.il> References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il> Message-ID: <46950756.5090501@hp.com> Michael S. Tsirkin wrote: >>Quoting Rick Jones : >>Subject: should it be possible to run SDP over a T320? >> >>Hi - >> >>I was talking to someone about the numbers I'd gathered for IPoIB with >>OFED 1.2 and a Mellanox HCA, and how the MTU increase from 2044 to 65520 >>did some non-trivial things to bulk transfer performance. > > > Was this data these posted on-list? I didn't see it. > Hasn't been. I presume that folks are curious?-) rick From rdreier at cisco.com Wed Jul 11 09:57:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jul 2007 09:57:53 -0700 Subject: [ofa-general] What should a ULP pass as ib_create_cq(..., comp_vector) ? In-Reply-To: (Thomas Talpey's message of "Wed, 11 Jul 2007 07:50:37 -0400") References: Message-ID: > I notice the ib_create_cq() comp_vector support is merged in 2.6.22. > I don't completely understand what a ULP needs to pass as the argument. > > I'm currently passing 0 in the NFS/RDMA client, what in general should I > consider using as a value? Or put another way, why is this exposed to > the ULP? Isn't this the MSI-X vector table index, a rather low-level thing > to hand to the ULP to manage? You need to pass a value in the range 0 ... num_comp_vectors-1. Since every driver currently sets num_comp_vectors to 1, hard-coding your value to 0 is a reasonable thing to do -- it's what every other ULP does at the moment. This value is *NOT* the MSI-X vector table index. It's basically the "completion event handler identifier" that the IB spec v. 1.2 talks about. 
It would be perfectly valid for a non-PCI device such as ehca (for which the concept of MSI-X does not apply at all) to support multiple completion vectors. And the consumer is really the only entity that can make a good choice of how to divide up CQs, since only the consumer really knows which CQ event handlers might want to run in parallel. However on another level your question gets to the reason why we haven't implemented support for multiple completion event vectors. Namely, it's not clear how consumers, kernel or userspace, can make a good choice of which vector to assign a given CQ to. For example an MPI implementation would probably want one vector per CPU so that it can direct events for a given process to the CPU that the process is running on; but there's no simple way to implement that policy. - R. From caitlinb at broadcom.com Wed Jul 11 10:41:53 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 11 Jul 2007 10:41:53 -0700 Subject: [ofa-general] uDAPL Question In-Reply-To: <469290C5.6010709@Sun.COM> Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com> Don.Kerr at Sun.COM wrote: > I am working on a uDAPL layer for Open MPI. The situation is > if I have more than one port/HCA my users may want to be > selective in what is used and to do this they would need to > provide some information regarding which port/HCA to use. So > my thought is that the users are more familar with the output > from "ifconfig", for example ib0, ib1, etc, and I was trying > to find a way to correlate that to what is available from the > uDAPL API. Maybe I need to reprogram them to look at dat.conf. > > -DON > You definitely do not want to parse dat.conf, you want to see what the dat_registry has loaded. dat.conf is static, Providers are allowed to dynamically adapt how they register themselves. 
I don't believe that is an active concern, but it's simpler to take advantage of the existing code and be safe in case somebody comes along later and decides to do dynamic registration only. But you hit the nail on the head in terms of needing to correlate devices as reported by "ifconfig" and the Interface Adapter that you try to open. Basically, the intent has always been that the correlation between an Interface Adapter and an "ifconfig" entry should be so obvious that a complete idiot could figure out which went with which. Once that linkage is clear then you merely use the RDMA device/port implied by the routing of the device listed by ifconfig. To the best of my knowledge, for every DAPL provider ever created the correlation with the IP layer device has indeed been so obvious that any idiot could figure it out -- unfortunately software can only hope to someday reach that degree of intelligence, and other than configuring the links there really isn't much that can be done. Once there is a link between the RDMA device and the IP layer device, you could use the routing tables to determine which port a connection request could be received on, which ports could originate a packet with a given IP address and which ports could send a packet to a given IP destination. Given that, you want the matching RDMA device. Such a linkage would allow the application to correctly determine the exact DAPL Provider that needed to be opened, and only that one. Without it the application has to scan the registry list and essentially do a serial search. The good news is that it won't be a very long serial search and it doesn't have to be performed that often. From swise at opengridcomputing.com Wed Jul 11 11:04:35 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 11 Jul 2007 13:04:35 -0500 Subject: [ofa-general] [PATCH 2.6.23] iw_cxgb3: remove the cm_id reference on listen failures.
Message-ID: <20070711180435.11665.71117.stgit@dell3.ogc.int> iw_cxgb3: remove the cm_id reference on listen failures. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 3b41dc0..5dc68cd 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1914,6 +1914,7 @@ int iwch_create_listen(struct iw_cm_id * fail3: cxgb3_free_stid(ep->com.tdev, ep->stid); fail2: + cm_id->rem_ref(cm_id); put_ep(&ep->com); fail1: out: From swise at opengridcomputing.com Wed Jul 11 11:11:43 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 11 Jul 2007 13:11:43 -0500 Subject: [ofa-general] GIT PULL ofed_1_2] iw_cxgb3: remove the cm_id reference on listen failures. Message-ID: <46951D5F.3090208@opengridcomputing.com> Vlad, Please pull the fix for bug 686 from git://git.openfabrics.org/~swise/ofed_1_2 ofed_1_2 Thanks, Steve. ----- iw_cxgb3: remove the cm_id reference on listen failures. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 4175991..08986fb 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1912,6 +1912,7 @@ int iwch_create_listen(struct iw_cm_id * fail3: cxgb3_free_stid(ep->com.tdev, ep->stid); fail2: + cm_id->rem_ref(cm_id); put_ep(&ep->com); fail1: out: From Thomas.Talpey at netapp.com Wed Jul 11 11:57:09 2007 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 11 Jul 2007 14:57:09 -0400 Subject: [ofa-general] What should a ULP pass as ib_create_cq(..., comp_vector) ? 
In-Reply-To: References: Message-ID: At 12:57 PM 7/11/2007, Roland Dreier wrote: >However on another level your question gets to the reason why we >haven't implemented support for multiple completion event vectors. >Namely, it's not clear how consumers, kernel or userspace, can make a >good choice of which vector to assign a given CQ to. Got it, thanks. But aren't the vectors shared across all consumers on an HCA? As such, it seems problematic to expect consumers to make optimal choices, since they have no way of knowing what other consumers are doing. In any case, all NFS/RDMA does is to check the completion status, queue the event and schedule a tasklet, so there is little or no parallelism to be gained in the upcall. I'd prefer to not have to wait for other ULPs on the same vector, of course. Tom. From rick.jones2 at hp.com Wed Jul 11 12:01:35 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Wed, 11 Jul 2007 12:01:35 -0700 Subject: [ofa-general] Re: should it be possible to run SDP over a T320? In-Reply-To: <46950756.5090501@hp.com> References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il> <46950756.5090501@hp.com> Message-ID: <4695290F.7090005@hp.com> >> Was this data these posted on-list? I didn't see it. >> > > Hasn't been. 
I presume that folks are curious?-)

RedHat Enterprise Linux 5 Single-Stream Performance

                            Bulk Transfer                  "Latency"
                       Unidir             Bidir
Card               Mbit/s   SDx   SDr  Mbit/s  SDx  SDr  Tran/s   SDx   SDr
-------------------------------------------------------------------------
AD313A IPoIB 1.1     2970 4.418 4.544    3530 3.59 3.95   19290   n/a   n/a
AD313A SDP 1.1       7810 0.453 1.048   12820 0.69 0.68   38030 26.29 26.29
AD313A SDP p0        7810 0.346 0.527   12670 0.42 0.43   19380   n/a   n/a
AD313A IPoIB 1.2     5510 0.426 1.593    5730  n/a  n/a   18990   n/a   n/a
AD313A SDP 1.2       7820 0.409 1.047   12890 0.64 0.68   41988 25.89 26.32
AD313A SDP p0 1.2    7820 0.309 0.517   12760 0.36 0.36   19800 15.47 15.72

netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM, SDP_STREAM),
-s 1M -S 1M -r 64K -b 12 for the bidirectional [SDP|TCP]_RR test,
-r 1 for the [TCP|SDP]_RR test.

1.1 - OFED 1.1 bits
1.2 - OFED 1.2 bits
p0  - send_poll and recv_poll set to 0
SD  - service demand in microseconds of CPU time consumed per unit of
      work - per KB transferred for the bulk tests, per transaction on
      the latency test. 'x' is transmit, 'r' is receive

lspci for the AD313A shows:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)

rick jones

From HNGUYEN at de.ibm.com Wed Jul 11 12:27:26 2007
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Wed, 11 Jul 2007 21:27:26 +0200
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <1184097931.3020.73.camel@localhost.localdomain>
Message-ID:

> With this patch, idr.c should work as advertised allocating id
> values in the range 0...0x7fffffff. Andrew had speculated that
> it should allow the full range 0...0xffffffff to be used. I was
> tempted to make changes to allow this, but it would require changes
> to API, e.g. making the starting id value and the return value
> unsigned.

Hi Jim, thanks much for this patch. It should work fine as far as I can read. Will give it a try in next couple of days.
Nam From kliteyn at mellanox.co.il Wed Jul 11 12:56:22 2007 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 11 Jul 2007 22:56:22 +0300 Subject: [ofa-general] Toward next OFED release (1.3) In-Reply-To: <1184169244.17622.102683.camel@hal.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C90156379B@mtlexch01.mtl.com> <1184091830.17622.12007.camel@hal.voltaire.com> <200707102111.28374.cap@nsc.liu.se> <1184094759.17622.15371.camel@hal.voltaire.com> <1184169244.17622.102683.camel@hal.voltaire.com> Message-ID: <469535E6.8080705@mellanox.co.il> Hal Rosenstock wrote: > On Tue, 2007-07-10 at 15:12, Hal Rosenstock wrote: > >> On Tue, 2007-07-10 at 15:11, Peter Kjellstrom wrote: >> >>> On Tuesday 10 July 2007, Hal Rosenstock wrote: >>> ... >>> >>>>> Management: >>>>> * Multiple partitions >>>>> * OpenSM >>>>> * More routing performance improvements >>>>> * Even more speedups >>>>> * Better packaging/installation >>>>> * “Native” daemon mode >>>>> * Performance management >>>>> * Quality of Service manager: Based on IBTA annex >>>>> >>>> enhancements for fat tree routing (non pure tree support) >>>> more console commands and telnet access to console >>>> >>> Pardon my ignorance, but could you elaborate on what a "non-pure tree" is and >>> in which way OFED-1.2 opensm performs badly for these? >>> > > The following patch contains some of the answers to the above: > Hi guys. Sorry for the delay. Anyway, the patch does answer the question, but I'll add my two cents anyway. The fat-tree algorithm optimizes routing for "shift" communication pattern. Before the latest change, the topology that the fat-tree routing engine could handle had to be a pure fat-tree, and by "pure" I mean completely symmetrical tree that complies with the following rules: - Switches of the same rank should have the same number of UP-going port groups*, unless they are root switches, in which case the shouldn't have UP-going ports at all. 
- Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches. - Switches of the same rank should have the same number of ports in each UP-going port group. - Switches of the same rank should have the same number of ports in each DOWN-going port group. - *All* the CAs have to be at the same tree level (rank), doesn't matter if they are compute nodes or management nodes. Any other topology will cause fat-tree routing to fail and OpenSM would fall back to default routing. Note that this also means that in a symmetrical fat-tree any link failure (except for the links between CAs and leaf switches) will break the fabric symmetry and the routing will fall back to default. With the recent changes, the user can supply list of roots and compute node guids, and then fat-tree routing is able to handle trees that are not symmetrical, and the topology has to comply with this (very) reduced set of constraints: - All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks. But of course, the less the tree is symmetrical, the worse the routing results will be. -- Yevgeny > -----Forwarded Message----- > > From: Yevgeny Kliteynik > To: Hal Rosenstock > Cc: OpenIB > Subject: [PATCH 2/2] osm: updating doc with root and compute nodes options for fat-tree > Date: 09 Jul 2007 11:32:49 +0300 > > Hi Hal. > > Updating doc and osm manpage with the > recent enhancement of fat-tree routing. 
> > Signed-off-by: Yevgeny Kliteynik > --- > opensm/doc/current-routing.txt | 28 ++++++++++++++++++++++------ > opensm/man/opensm.8 | 33 ++++++++++++++++++++++++++------- > 2 files changed, 48 insertions(+), 13 deletions(-) > > diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt > index 9852ef0..76f91ba 100644 > --- a/opensm/doc/current-routing.txt > +++ b/opensm/doc/current-routing.txt > @@ -174,11 +174,14 @@ Fat-tree Routing Algorithm > Purpose: > > The fat-tree algorithm optimizes routing for "shift" communication pattern. > -It should be chosen if a subnet is a symmetrical fat-tree of various types. > +It should be chosen if a subnet is a symmetrical or almost symmetrical > +fat-tree of various types. > It supports not just K-ary-N-Trees, by handling for non-constant K, > cases where not all leafs (CAs) are present, any CBB ratio. > As in UPDN, fat-tree also prevents credit-loop-deadlocks. > -Fat-tree algorithm supports topologies that comply with the following rules: > + > +If the root guid file is not provided ('-a' or '--root_guid_file' options), > +the topology has to be pure fat-tree that complies with the following rules: > - Tree rank should be between two and eight (inclusively) > - Switches of the same rank should have the same number > of UP-going port groups*, unless they are root switches, > @@ -189,18 +192,31 @@ Fat-tree algorithm supports topologies that comply with the following rules: > of ports in each UP-going port group. > - Switches of the same rank should have the same number > of ports in each DOWN-going port group. > -*ports that are connected to the same remote switch are referenced as > + - All the CAs have to be at the same tree level (rank). 
> + > +If the root guid file is provided, the topology doesn't have to be pure > +fat-tree, and it should only comply with the following rules: > + - Tree rank should be between two and eight (inclusively) > + - All the Compute Nodes** have to be at the same tree level (rank). > + Note that non-compute node CAs are allowed here to be at different > + tree ranks. > + > +* ports that are connected to the same remote switch are referenced as > 'port group'. > +** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file' > +OpenSM options. > > Note that although fat-tree algorithm supports trees with non-integer CBB > ratio, the routing will not be as balanced as in case of integer CBB ratio. > In addition to this, although the algorithm allows leaf switches to have any > number of CAs, the closer the tree is to be fully populated, the more effective > the "shift" communication pattern will be. > +In general, even if the root list is provided, the closer the topology to a > +pure and symmetrical fat-tree, the more optimal the routing will be. > > -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the > -same directory where the OpenSM log resides. This ordering file provides the > -CA order that may be used to create efficient communication pattern, that > +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) > +in the same directory where the OpenSM log resides. This ordering file provides > +the CN order that may be used to create efficient communication pattern, that > will match the routing tables. > > > diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 > index 5f34cd1..5472faf 100644 > --- a/opensm/man/opensm.8 > +++ b/opensm/man/opensm.8 > @@ -603,7 +603,7 @@ UPDN Algorithm Usage > Activation through OpenSM > > Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm. 
> -Use '-a ' for adding an UPDN guid file that contains the > +Use '-a ' for adding an UPDN guid file that contains the > root nodes for ranking. > If the `-a' option is not used, OpenSM uses its auto-detect root nodes > algorithm. > @@ -621,12 +621,14 @@ it exists) that connects the CA to the subnet as a root node. > Fat-tree Routing Algorithm > > The fat-tree algorithm optimizes routing for "shift" communication pattern. > -It should be chosen if a subnet is a symmetrical fat-tree of various types. > +It should be chosen if a subnet is a symmetrical or almost symmetrical > +fat-tree of various types. > It supports not just K-ary-N-Trees, by handling for non-constant K, > cases where not all leafs (CAs) are present, any CBB ratio. > As in UPDN, fat-tree also prevents credit-loop-deadlocks. > > -The Fat-tree algorithm supports topologies that comply with the following rules: > +If the root guid file is not provided ('-a' or '--root_guid_file' options), > +the topology has to be pure fat-tree that complies with the following rules: > - Tree rank should be between two and eight (inclusively) > - Switches of the same rank should have the same number > of UP-going port groups*, unless they are root switches, > @@ -637,10 +639,21 @@ The Fat-tree algorithm supports topologies that comply with the following rules: > of ports in each UP-going port group. > - Switches of the same rank should have the same number > of ports in each DOWN-going port group. > + - All the CAs have to be at the same tree level (rank). > > -Note: ports that are connected to the same remote switch are referenced as > +If the root guid file is provided, the topology doesn't have to be pure > +fat-tree, and it should only comply with the following rules: > + - Tree rank should be between two and eight (inclusively) > + - All the Compute Nodes** have to be at the same tree level (rank). > + Note that non-compute node CAs are allowed here to be at different > + tree ranks. 
> + > +* ports that are connected to the same remote switch are referenced as > \'port group\'. > > +** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\' > +OpenSM options. > + > Topologies that do not comply cause a fallback to min hop routing. > Note that this can also occur on link failures which cause the topology > to no longer be "pure" fat-tree. > @@ -650,15 +663,21 @@ ratio, the routing will not be as balanced as in case of integer CBB ratio. > In addition to this, although the algorithm allows leaf switches to have any > number of CAs, the closer the tree is to be fully populated, the more > effective the "shift" communication pattern will be. > +In general, even if the root list is provided, the closer the topology to a > +pure and symmetrical fat-tree, the more optimal the routing will be. > > -The algorithm also dumps CA ordering file (opensm-ftree-ca-order.dump) in the > -same directory where the OpenSM log resides. This ordering file provides the > -CA order that may be used to create efficient communication pattern, that > +The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) > +in the same directory where the OpenSM log resides. This ordering file provides > +the CN order that may be used to create efficient communication pattern, that > will match the routing tables. > > Activation through OpenSM > > Use '-R ftree' option to activate the fat-tree algorithm. > +Use '-a ' to provide root nodes for ranking. If the `-a' option > +is not used, routing algorithm will detect roots automatically. > +Use '-u ' to provide the list of compute nodes. If the `-u' option > +is not used, all the CAs are considered as compute nodes. > > Note: LMC > 0 is not supported by fat-tree routing. If this is > specified, the default routing algorithm is invoked instead. 
> From caitlinb at broadcom.com Wed Jul 11 13:48:49 2007 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 11 Jul 2007 13:48:49 -0700 Subject: [ofa-general] What should a ULP pass as ib_create_cq(..., comp_vector) ? In-Reply-To: Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D0475D35E@NT-IRVA-0750.brcm.ad.broadcom.com> general-bounces at lists.openfabrics.org wrote: > At 12:57 PM 7/11/2007, Roland Dreier wrote: >> However on another level your question gets to the reason why we >> haven't implemented support for multiple completion event vectors. >> Namely, it's not clear how consumers, kernel or userspace, can make a >> good choice of which vector to assign a given CQ to. > > Got it, thanks. But aren't the vectors shared across all > consumers on an HCA? As such, it seems problematic to expect > consumers to make optimal choices, since they have no way of > knowing what other consumers are doing. > > In any case, all NFS/RDMA does is to check the completion > status, queue the event and schedule a tasklet, so there is > little or no parallelism to be gained in the upcall. I'd > prefer to not have to wait for other ULPs on the same vector, of > course. > What a single Consumer could do is to clump as many of their CQs as possible into a single "bag" where serialization of notifications for these CQs would have little detrimental impact on the application. As you point out, for most applications this is all of their CQs. This would presume that when the Consumer supplied too many that the lower layers would simply say "tough" and combine some of them (achieving less than optimal results, but better than having the OS assign notification queues on a totally arbitrary basis). To use the actual number implies that it would be meaningful for *each* application to divide its CQs over that set, without any mechanism to balance applications themselves. 
That would seem to imply that a typical Consumer would have a large number of CQs, when I've never understood the need for more than one per core per application. At the minimum, if the actual number were published by the device, would the kernel consumers actually be able to distribute their CQs over the set? Tom, I definitely agree that userland consumers have absolutely no way to do that reasonably, but do you think it is plausible for the kernel to do so for kernel-resident consumers? If not, what would be needed to bridge that gap? Or is the need for parallelism so small amongst kernel completion handlers that the kernel does not need this feature? From arlin.r.davis at intel.com Wed Jul 11 14:36:41 2007 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 11 Jul 2007 14:36:41 -0700 Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter Message-ID: Sean, OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We are running into some cases on larger clusters that require longer timeouts than the default. Can you consider this rdma_cm patch for OFED 1.2.1 that adds a module parameter for the response timeout? Thanks.
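For scale, a rough sketch of what the timeout value means, assuming the usual IB CM encoding in which a timeout field n stands for 4.096 us * 2^n. The function names are hypothetical; the millisecond form mirrors the "1 << (timeout - 8)" expression used in cma.c.

```c
#include <stdint.h>

/*
 * Interval encoded by an IB CM timeout exponent n, in nanoseconds:
 * 4.096 us * 2^n, where 4.096 us == 4096 ns. (Assumed encoding.)
 */
static uint64_t cm_timeout_ns(unsigned int n)
{
	return (uint64_t)4096 << n;
}

/*
 * Millisecond approximation of the same interval, 2^(n-8) ms.
 * Exponents below 8 are clamped to 1 ms to avoid a negative shift.
 */
static unsigned int cm_timeout_ms_approx(unsigned int n)
{
	return n >= 8 ? 1u << (n - 8) : 1u;
}
```

With the default of 20 this works out to about 4.3 seconds (4096 ms in the approximate form), and each increment of the parameter doubles it.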
Signed-off by: Arlin Davis --- a/drivers/infiniband/core/cma.c 2007-07-11 10:46:48.000000000 -0700 +++ b/drivers/infiniband/core/cma.c 2007-07-11 10:54:16.000000000 -0700 @@ -58,6 +58,10 @@ MODULE_PARM_DESC(tavor_quirk, "Tavor per #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 15 +static int cma_response_timeout = CMA_CM_RESPONSE_TIMEOUT; +module_param_named(cma_response_timeout, cma_response_timeout, int, 0644); +MODULE_PARM_DESC(cma_response_timeout, "CMA_CM_RESPONSE_TIMEOUT default=20"); + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -2157,7 +2161,7 @@ static int cma_resolve_ib_udp(struct rdm req.path = route->path_rec; req.service_id = cma_get_service_id(id_priv->id.ps, &route->addr.dst_addr); - req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8); + req.timeout_ms = 1 << (cma_response_timeout - 8); req.max_cm_retries = CMA_MAX_CM_RETRIES; ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req); @@ -2216,8 +2220,8 @@ static int cma_connect_ib(struct rdma_id req.flow_control = conn_param->flow_control; req.retry_count = conn_param->retry_count; req.rnr_retry_count = conn_param->rnr_retry_count; - req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; + req.remote_cm_response_timeout = cma_response_timeout; + req.local_cm_response_timeout = cma_response_timeout; req.max_cm_retries = CMA_MAX_CM_RETRIES; req.srq = id_priv->srq ? 
1 : 0; @@ -2344,7 +2348,7 @@ static int cma_accept_ib(struct rdma_id_ rep.private_data_len = conn_param->private_data_len; rep.responder_resources = conn_param->responder_resources; rep.initiator_depth = conn_param->initiator_depth; - rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; + rep.target_ack_delay = cma_response_timeout; rep.failover_accepted = 0; rep.flow_control = conn_param->flow_control; rep.rnr_retry_count = conn_param->rnr_retry_count; From sean.hefty at intel.com Wed Jul 11 14:41:51 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 11 Jul 2007 14:41:51 -0700 Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: Message-ID: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com> >OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We >are running into some cases on larger clusters that require longer timeouts >then the default. Can you consider this rdma_cm patch for OFED 1.2.1 that adds >a module parameter for the response timeout? Thanks. What's in it for me? :) > >Signed-off by: Arlin Davis Acked-by: Sean Hefty Vlad, can you add this for OFED 1.2.1? - Sean From halr at voltaire.com Wed Jul 11 15:07:43 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Jul 2007 18:07:43 -0400 Subject: [ofa-general] Moving On Message-ID: <1184191631.17622.128348.camel@hal.voltaire.com> Hi, After more than three years of having the pleasure of being involved with PathForward, OpenIB, and now OpenFabrics, I have decided to move on and to be involved from a different perspective and will be unable to continue my current maintainership responsibility for IB management (OpenSM and diagnostics). I hope to resurface sooner rather than later :-) It's been a lot of fun to see how far this project has come in that time and want to thank everyone for their support in improving OpenSM and the management tools. Sasha Khapyorsky from Voltaire will be taking over my maintainership of management. 
He has been doing a lot of the "heavy lifting" for some time now and I am confident it couldn't be left in better hands. He will be taking over this starting on Friday 7/13 COB. As such, the git tree will change from my tree to Sasha's. Stay tuned for specifics on this. I will still be available for questions if needed @ hal.rosenstock at gmail.com -- Hal From arlin.r.davis at intel.com Wed Jul 11 16:06:43 2007 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 11 Jul 2007 16:06:43 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <72F9F35DE96242A08DDC701F9A8EBC00@Gaucho> Message-ID: OFA maintainers, Can we all agree on this process for download locations of packages/libraries and the mechanism to pickup changes? If so, Jeff will go ahead and update the download web page to pick up the links and descriptions automatically. Thanks, -arlin >-----Original Message----- >From: Jeffrey Scott [mailto:jeff at splitrockpr.com] >Sent: Tuesday, July 03, 2007 4:36 PM >To: Davis, Arlin R >Cc: 'Thad Omura'; Hefty, Sean; Smith, Stan; 'Vladimir Sokolovsky'; 'Tziporet Koren' >Subject: RE: OFA website edits > >OK. I think your idea is fine. I'll wait for you to confirm when the >format is agreed upon, and when the links are ready. > > >-----Original Message----- >From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] >Sent: Tuesday, July 03, 2007 11:29 AM >To: Jeffrey Scott >Cc: Thad Omura; Hefty, Sean; Smith, Stan; Vladimir Sokolovsky; Tziporet >Koren >Subject: RE: OFA website edits > >Jeff, > >After looking at this I think we need to agree on a standard mechanism >and location for downloads similar to what we do for git.
> >Maybe we could have maintainers that want a individual download link >provide a public_html directory along with a description? We could then >have the download page automatically setup links to all >/home/user/public_html/ directories that exist along with the >description. > >For example the download page would look something like the following: > >Individual library releases: > >Link Project >Description > >http://www.openfabrics.org/~ardavis/ uDAPL libraries and >Documentation: 1.2-1 and 2.0 >http://www.openfabrics.org/~shefty/ rdma_cm library: 1.0.1 > >etc... > > >OFED Releases and Binary Packages: > >Link Project >Description > >Download Binary RPMS >Download Old Releases > > Maybe Vlad could provide these and set this up under his >public_html directory? > > >OFED Development > >Link Project >Description > >http://www.openfabrics.org/git/ Linux git development tree >http://openib.tc.cornell.edu Windows WIKI, svn >development tree > > >Is something like this possible? > >Comments? > >-arlin From rdreier at cisco.com Wed Jul 11 16:12:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 11 Jul 2007 16:12:07 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: (Arlin R. Davis's message of "Wed, 11 Jul 2007 16:06:43 -0700") References: Message-ID: > Can we all agree on this process for download locations of > packages/libraries and the mechanism to pickup changes? If so, Jeff will > go ahead and update the download web page to pick up the links and > descriptions automatically. What's the process we're agreeing to exactly? I couldn't figure it out from the email thread you quoted. I like the current style of just being able to have a simple link like http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz Is the proposal to change that? 
Putting people's login names into the download URL seems like a step backwards, as we've seen just recently with maintainer changes now that Bryan and Hal have moved on -- breaking everyone's links just because of someone changing jobs seems silly. (And these URLs do get embedded in RPMs etc. so it's worth having a canonical location for each library) - R. From mshefty at ichips.intel.com Wed Jul 11 16:37:32 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 11 Jul 2007 16:37:32 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: References: Message-ID: <469569BC.60405@ichips.intel.com> > I like the current style of just being able to have a simple link like > > http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz > > Is the proposal to change that? I think the intent is to find a way for the developer to publish new libraries and automatically update the downloads page to link to it. - Sean From vu at mellanox.com Wed Jul 11 16:51:44 2007 From: vu at mellanox.com (Vu Pham) Date: Wed, 11 Jul 2007 16:51:44 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <1184117324.22408.0.camel@gentoo-linux.localdomain> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> <1184042252.15067.8.camel@gentoo-linux.localdomain> <4693B9E4.1070001@mellanox.com> <1184117324.22408.0.camel@gentoo-linux.localdomain> Message-ID: <46956D10.4030904@mellanox.com> Stanley Sufficool wrote: > Is this the same as the README in the srpt_inc branch? That is the > document I based the Wiki on (with a few embellishments). > > It's slightly different with update/correction. I need to update the readme in the srpt_inc branch with this one > On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote: > > >>> Added a new wiki page based on Vu Pham's readme and issues with recent >>> kernels. I hope to keep it current as I get our targets up and running. >>> >>> >> Thanks for doing this. 
>> Please use the latest readme from this link - >> http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt >> >> >> >>> http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation >>> >>> >>> WinIB initiators --> Gentoo Linux SRP Target. >>> >>> >> I mainly test linux initiators with gen2 srp-target. I have >> not tested win srp initiator with the target. >> >> >>> Anything wrong with the above approach, I would be interested in a best >>> practices if there is one. I saw a CentOS target post, is this more >>> stable or better performing? >>> >> There is no difference when you run the same srp target / >> scst codes in CentOS or RH/SuSe linux distributions. The >> storage back-end will determine the performance >> >> -vu >> >> >>> Thanks. >>> >>> On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: >>> >>>> Stanley Sufficool wrote: >>>> >>>>> Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch >>>>> >>>>> Got the latest srpt from the git repository on OpenFabrics and had the >>>>> following issues. >>>>> >>>>> ib_srpt.c Line 1997, missing second argument, should be? >>>>> sdev->scst_tgt = scst_register(tp, NULL); >>>>> >>>>> >>>> Yes. You need the change if you test with top of scst svn >>>> trunk (or from version 0.9.6-pre2) >>>> If you test with scst before 0.9.6-pre2 (ie. version <= >>>> 0.9.6-pre1) you don't need the second argument for >>>> scst_register() >>>> >>>> >>>> >>>>> SCST was built successfully after fixing an issue in scst_vdisk.c >>>>> (missing #include ) >>>>> >>>> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX >>>> - you should send the patch to scst devel >>>> >>>> >>>>> Just thought this would be nice to have documented, took me half a day >>>>> to track down as a novice in C programming. 
>>>>> >>>>> >>>> there is *lean and mean* srpt's README in srpt_inc >>>> SCST also has some document >>>> You can add some wiki/notes for the problems in openfabrics >>>> wiki page https://wiki.openfabrics.org/tiki-index.php >>>> >>>> -vu >>>> >>>> >>>>> ------------------------------------------------------------------------ >>>>> >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>>> > > From vu at mellanox.com Wed Jul 11 16:57:29 2007 From: vu at mellanox.com (Vu Pham) Date: Wed, 11 Jul 2007 16:57:29 -0700 Subject: [ofa-general] Compiling SRPT In-Reply-To: <1184130141.22408.7.camel@gentoo-linux.localdomain> References: <1183852853.6008.11.camel@gentoo-linux.localdomain> <46926868.8000704@mellanox.com> <1184042252.15067.8.camel@gentoo-linux.localdomain> <4693B9E4.1070001@mellanox.com> <1184130141.22408.7.camel@gentoo-linux.localdomain> Message-ID: <46956E69.6010908@mellanox.com> Stanley Sufficool wrote: > Do you have any reservations that the WinIB (Mellanox) SRP initiators > will not work with SRPT? > There are two version of SRPT: ofed/gen2 srpt and ibgold srpt You are working with ofed/gen2 srpt now WinIB srp initiator works well with ibgold srpt I quickly test WinIB srp intiator with ofed/gen2 srpt. It sees the target but not its lun - some debugs are required ibgold srpt only work with suse/sles 9, rhel 4. It does not work with rhel 5 or sles10 or vanilla kernel > 2.6.11 Do you have any restriction on kernel, version of IB driver/srpt driver on the target machine? > If there is any doubt, I need to know so that I can fall back to iSCSI > over IPoIB (iSIPIB??? ;) ) . This has lots more overhead, but it's a > sure bet until this can be worked out. 
> > On Tue, 2007-07-10 at 09:55 -0700, Vu Pham wrote: > > >>> Added a new wiki page based on Vu Pham's readme and issues with recent >>> kernels. I hope to keep it current as I get our targets up and running. >>> >>> >> Thanks for doing this. >> Please use the latest readme from this link - >> http://mellanox.com/pdf/products/software/Gen2_SRPT_README.txt >> >> >> >>> http://wiki.openfabrics.org/tiki-index.php?page=SRPT+Installation >>> >>> >>> WinIB initiators --> Gentoo Linux SRP Target. >>> >>> >> I mainly test linux initiators with gen2 srp-target. I have >> not tested win srp initiator with the target. >> >> >>> Anything wrong with the above approach, I would be interested in a best >>> practices if there is one. I saw a CentOS target post, is this more >>> stable or better performing? >>> >> There is no difference when you run the same srp target / >> scst codes in CentOS or RH/SuSe linux distributions. The >> storage back-end will determine the performance >> >> -vu >> >> >>> Thanks. >>> >>> On Mon, 2007-07-09 at 09:55 -0700, Vu Pham wrote: >>> >>>> Stanley Sufficool wrote: >>>> >>>>> Compiling on kernel 2.6.21-rc6 from kernel.org Torvald's branch >>>>> >>>>> Got the latest srpt from the git repository on OpenFabrics and had the >>>>> following issues. >>>>> >>>>> ib_srpt.c Line 1997, missing second argument, should be? >>>>> sdev->scst_tgt = scst_register(tp, NULL); >>>>> >>>>> >>>> Yes. You need the change if you test with top of scst svn >>>> trunk (or from version 0.9.6-pre2) >>>> If you test with scst before 0.9.6-pre2 (ie. 
version <= >>>> 0.9.6-pre1) you don't need the second argument for >>>> scst_register() >>>> >>>> >>>> >>>>> SCST was built successfully after fixing an issue in scst_vdisk.c >>>>> (missing #include ) >>>>> >>>> I tested with 2.6.20.x - I have not tested with 2.6.21-rcXX >>>> - you should send the patch to scst devel >>>> >>>> >>>>> Just thought this would be nice to have documented, took me half a day >>>>> to track down as a novice in C programming. >>>>> >>>>> >>>> there is *lean and mean* srpt's README in srpt_inc >>>> SCST also has some document >>>> You can add some wiki/notes for the problems in openfabrics >>>> wiki page https://wiki.openfabrics.org/tiki-index.php >>>> >>>> -vu >>>> >>>> >>>>> ------------------------------------------------------------------------ >>>>> >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>>> > > From ardavis at ichips.intel.com Wed Jul 11 17:04:09 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 11 Jul 2007 17:04:09 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: References: Message-ID: <46956FF9.50102@ichips.intel.com> Roland Dreier wrote: > > Can we all agree on this process for download locations of > > packages/libraries and the mechanism to pickup changes? If so, Jeff will > > go ahead and update the download web page to pick up the links and > > descriptions automatically. > >What's the process we're agreeing to exactly? I couldn't figure it >out from the email thread you quoted. > >I like the current style of just being able to have a simple link like > > http://openfabrics.org/downloads/libibverbs-1.1.1.tar.gz > >Is the proposal to change that? 
> > The proposal was an attempt to come up with a method to automatically link to a package and its description file from the download webpage. I have no problem targeting http://openfabrics.org/downloads as long as we come up with a way for the webpage to correlate a description with a package without hand-coding the links every time. We need a method for automatic links to keep our download webpage updated and complete. What if we add a directory for each project under downloads and provide a README as the description? Other suggestions? -arlin From jsquyres at cisco.com Wed Jul 11 17:29:40 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 11 Jul 2007 20:29:40 -0400 Subject: [ofa-general] Re: http://git.openfabrics.org/ In-Reply-To: References: Message-ID: Just a ping again to make sure that this request doesn't get lost... On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote: > I notice that http://git.openfabrics.org/ shows the main OFA web > site, but http://git.openfabrics.org/git/ shows all the git > repositories. > > Can a redirect be installed such that http://git.openfabrics.org/ > is automatically sent to http://git.openfabrics.org/git/? > > I think that would be a little more intuitive. > > Thanks! > > -- > Jeff Squyres > Cisco Systems > > -- Jeff Squyres Cisco Systems From sashak at voltaire.com Wed Jul 11 19:47:17 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 12 Jul 2007 05:47:17 +0300 Subject: [ofa-general] [PATCH] opensm/updn: root detector function simplification Message-ID: <20070712024716.GA2248@sashak.voltaire.com> These are mostly cosmetic simplifications of the up/down root auto-detection function: fewer variables and simpler control flow.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_updn.c | 142 ++++++++-------------------------------- 1 files changed, 28 insertions(+), 114 deletions(-) diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c index c8d5a7f..faf4249 100644 --- a/opensm/opensm/osm_ucast_updn.c +++ b/opensm/opensm/osm_ucast_updn.c @@ -66,13 +66,6 @@ typedef enum _updn_switch_dir DOWN } updn_switch_dir_t; -/* Histogram element - the number of occurences of the same hop value */ -typedef struct _updn_hist -{ - cl_map_item_t map_item; - uint32_t bar_value; -} updn_hist_t; - /* guids list */ typedef struct _updn_input { @@ -711,15 +704,12 @@ __osm_updn_find_root_nodes_by_min_hop( osm_switch_t *p_next_sw, *p_sw; osm_port_t *p_next_port, *p_port; osm_physp_t *p_physp; - uint32_t numCas = 0; - uint32_t numSws = cl_qmap_count(&p_osm->subn.sw_guid_tbl); - cl_qmap_t min_hop_hist; /* Histogram container */ - updn_hist_t *p_updn_hist, *p_up_ht; - uint8_t maxHops = 0; /* contain the max histogram index */ uint64_t *p_guid; cl_list_t *p_root_nodes_list = p_updn->p_root_nodes; + double thd1, thd2; + unsigned i, cas_num = 0; unsigned *cas_per_sw; - uint16_t sw_lid_ho; + uint16_t lid_ho; OSM_LOG_ENTER( &p_osm->log, osm_updn_find_root_nodes_by_min_hop ); @@ -727,8 +717,6 @@ __osm_updn_find_root_nodes_by_min_hop( "__osm_updn_find_root_nodes_by_min_hop: " "Current number of ports in the subnet is %d\n", cl_qmap_count(&p_osm->subn.port_guid_tbl) ); - /* Init the required vars */ - cl_qmap_init( &min_hop_hist ); cas_per_sw = malloc((IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw)); if (!cas_per_sw) { @@ -739,18 +727,6 @@ __osm_updn_find_root_nodes_by_min_hop( } memset(cas_per_sw, 0, (IB_LID_UCAST_END_HO + 1)*sizeof(*cas_per_sw)); - /* EZ: - p_ca_list = (cl_list_t*)malloc(sizeof(cl_list_t)); -#if 0 - if (!p_ca_list) - { - - } -#endif - cl_list_construct( p_ca_list ); - cl_list_init( p_ca_list, 10 ); - */ - /* Find the Maximum number of CAs (and routers) for histogram 
normalization */ osm_log( &p_osm->log, OSM_LOG_VERBOSE, "__osm_updn_find_root_nodes_by_min_hop: " @@ -764,128 +740,66 @@ __osm_updn_find_root_nodes_by_min_hop( p_physp = p_port->p_physp->p_remote_physp; if (!p_physp || !p_physp->p_node->sw) continue; - sw_lid_ho = osm_node_get_base_lid(p_physp->p_node, 0); - sw_lid_ho = cl_ntoh16(sw_lid_ho); + lid_ho = osm_node_get_base_lid(p_physp->p_node, 0); + lid_ho = cl_ntoh16(lid_ho); osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " "Inserting GUID 0x%" PRIx64 ", sw lid: 0x%X into array\n", - cl_ntoh64(osm_port_get_guid(p_port)), sw_lid_ho ); - cas_per_sw[sw_lid_ho]++; - numCas++; + cl_ntoh64(osm_port_get_guid(p_port)), lid_ho ); + cas_per_sw[lid_ho]++; + cas_num++; } } + + thd1 = cas_num * 0.9; + thd2 = cas_num * 0.05; osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " - "Found %u CAs and RTRs, %u SWs in the subnet\n", numCas, numSws ); + "Found %u CAs and RTRs, %u SWs in the subnet. " + "Thresholds are thd1 = %f && thd2 = %f\n", + cas_num, cl_qmap_count(&p_osm->subn.sw_guid_tbl), thd1, thd2); + p_next_sw = (osm_switch_t*)cl_qmap_head( &p_osm->subn.sw_guid_tbl ); osm_log( &p_osm->log, OSM_LOG_VERBOSE, "__osm_updn_find_root_nodes_by_min_hop: " "Passing through all switches to collect Min Hop info\n" ); while( p_next_sw != (osm_switch_t*)cl_qmap_end( &p_osm->subn.sw_guid_tbl ) ) { - uint16_t max_lid_ho, lid_ho; + unsigned hop_hist[IB_SUBNET_PATH_HOPS_MAX]; + uint16_t max_lid_ho; uint8_t hop_val; uint16_t numHopBarsOverThd1 = 0; uint16_t numHopBarsOverThd2 = 0; - double thd1, thd2; p_sw = p_next_sw; /* Roll to the next switch */ p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item ); - /* Clear Min Hop Table && FWD Tbls - This should cause opensm to - rebuild its FWD tables, post setting Min Hop Tables */ + memset(hop_hist, 0, sizeof(hop_hist)); + max_lid_ho = p_sw->max_lid_ho; /* Get base lid of switch by retrieving port 0 lid of node pointer */ - sw_lid_ho = 
cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) ); osm_log( &p_osm->log, OSM_LOG_DEBUG, "__osm_updn_find_root_nodes_by_min_hop: " - "Passing through switch lid 0x%X\n", sw_lid_ho ); + "Passing through switch lid 0x%X\n", + cl_ntoh16( osm_node_get_base_lid( p_sw->p_node, 0 ) ) ); for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) - { - /* Skip lids which are not CAs or RTRs - - for histogram purposes we only care about CAs and RTRs */ - - /* EZ: - boolean_t LidFound = FALSE; - cl_list_iterator_t ca_lid_iterator= cl_list_head(p_ca_list); - while( (ca_lid_iterator != cl_list_end(p_ca_list)) && !LidFound ) - { - uint16_t *p_lid; - - p_lid = (uint16_t*)cl_list_obj(ca_lid_iterator); - if ( *p_lid == lid_ho ) - LidFound = TRUE; - ca_lid_iterator = cl_list_next(ca_lid_iterator); - - } - if ( LidFound ) - */ if (cas_per_sw[lid_ho]) { hop_val = osm_switch_get_least_hops( p_sw, lid_ho ); - if (hop_val > maxHops) - maxHops = hop_val; - p_updn_hist = - (updn_hist_t*)cl_qmap_get( &min_hop_hist, (uint64_t)hop_val ); - if ( p_updn_hist == (updn_hist_t*)cl_qmap_end( &min_hop_hist )) - { - /* New entry in the histogram, first create it */ - p_updn_hist = (updn_hist_t*) malloc(sizeof(updn_hist_t)); - CL_ASSERT(p_updn_hist); - p_updn_hist->bar_value = 0; - cl_qmap_insert(&min_hop_hist, (uint64_t)hop_val, &p_updn_hist->map_item); - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Creating new entry in histogram %u\n", - hop_val ); - } - /* Entry exists in the table, just increment the value */ - p_updn_hist->bar_value += cas_per_sw[lid_ho]; - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Updating entry in histogram %u with bar value %d\n", - hop_val, p_updn_hist->bar_value ); + if (hop_val >= IB_SUBNET_PATH_HOPS_MAX) + continue; + + hop_hist[hop_val] += cas_per_sw[lid_ho]; } - } /* Now recognize the spines by requiring one bar to be above 90% of the number of CAs and RTRs */ - thd1 = numCas * 0.9; - thd2 = 
numCas * 0.05; - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Pass over the histogram value and found only one root node above " - "thd1 = %f && thd2 = %f\n", thd1, thd2 ); - - p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist ); - while( p_updn_hist != (updn_hist_t*)cl_qmap_end( &min_hop_hist ) ) - { - p_up_ht = p_updn_hist; - p_updn_hist = (updn_hist_t*)cl_qmap_next( &p_updn_hist->map_item ) ; - if ( p_up_ht->bar_value > thd1 ) + for (i = 0 ; i < IB_SUBNET_PATH_HOPS_MAX; i++) { + if (hop_hist[i] > thd1) numHopBarsOverThd1++; - if ( p_up_ht->bar_value > thd2 ) + if (hop_hist[i] > thd2) numHopBarsOverThd2++; - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Passing through histogram - Hop Index %u: " - "numHopBarsOverThd1 = %u, numHopBarsOverThd2 = %u\n", - (uint16_t)cl_qmap_key((cl_map_item_t*)p_up_ht), - numHopBarsOverThd1, numHopBarsOverThd2 ); - } - - /* destroy the qmap table and all its content - no longer needed */ - osm_log( &p_osm->log, OSM_LOG_DEBUG, - "__osm_updn_find_root_nodes_by_min_hop: " - "Cleanup: delete histogram " - "UPDN - Root nodes fetching by auto detect\n" ); - p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist ); - while ( p_updn_hist != (updn_hist_t*)cl_qmap_end( &min_hop_hist ) ) - { - cl_qmap_remove_item( &min_hop_hist, (cl_map_item_t*)p_updn_hist ); - free( p_updn_hist ); - p_updn_hist = (updn_hist_t*) cl_qmap_head( &min_hop_hist ); } /* If thd conditions are valid insert the root node to the list */ -- 1.5.3.rc0.121.gfdbc From infodept00001 at bellsouth.net Wed Jul 11 21:19:32 2007 From: infodept00001 at bellsouth.net (John Morris.) 
Date: Thu, 12 Jul 2007 0:19:32 -0400 Subject: [ofa-general] WINNING2007# Message-ID: <20070712041933.NFBJ12467.ibm68aec.bellsouth.net@mail.bellsouth.net> THE UK NATIONAL LOTTERY P O BOX 1010 LIVERPOOL, L70 1NL UNITED KINGDOM (Customer Services) Ref: UKNL/05/8256/53219/QE327 Batch: UKNL5/A115-07 You have won the sum of £1,500,000 (One Million Five Hundred Thousand Great British pounds sterling) from BRITISH LOTTERY on our 2007 Monthly charity bonanza.The winning ticket was selected from a Data Base of Internet E-mail Users, from which your Address came out as the winning coupon #. We hereby urge you to claim the winning amount quickly as this is a Monthly lottery. Failure to claim your prize will result into the Reversion of the fund to our following month draw.You are therefore requested to contact immediately your Claims Agent (Barrister John Morris) below quoting winning number: WINNING NUMBER UK07010220. Barrister John Morris. Alpha Consultants Law firm & Schmitz Associates (Solicitors Advocates & Arbitrators) Tel: +447045737335 Fax: +447005982213 E-mail: info_service07 at yahoo.co.uk Provide the following information needed to process your winning claim. (1).YOUR FULL NAMES (2).CONTACT ADDRESS. (3).TEL/FAX NUMBERS. (4).OCCUPATION. (5).WINNING NUMBERS. (6).AGE (7).SEX.. (8).NEXT OF KIN. (9).WINNING EMAIL. (10).COUNTRY.. Congratulations once again. Yours faithfully, Mr. Steven Jeff Online coordinator for THE NATIONAL LOTTERY Sweepstakes International Program. From yangdong at ncic.ac.cn Thu Jul 12 00:06:51 2007 From: yangdong at ncic.ac.cn (yangdong) Date: Thu, 12 Jul 2007 15:06:51 +0800 Subject: [ofa-general] How can i use the interface "rdma_xx" in linux kernel Message-ID: <4695D30B.4090300@ncic.ac.cn> So far, what i see is all about introduction of ib interface in linux kernel, e.g. Introduction to the InfiniBand Core Software, Bob Woodruff,Sean Hefty, 2005 Linux Symposium. But in linux kernel there are also rdma_xxx interface, e.g. 
rdma_connect, rdma_listen, etc. How can I use these interfaces? Please give me a tip. From ogerlitz at voltaire.com Thu Jul 12 00:09:28 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Jul 2007 10:09:28 +0300 Subject: [ofa-general] How can i use the interface "rdma_xx" in linux kernel In-Reply-To: <4695D30B.4090300@ncic.ac.cn> References: <4695D30B.4090300@ncic.ac.cn> Message-ID: <4695D3A8.1080300@voltaire.com> yangdong wrote: > So far, all I have seen are introductions to the IB interface in the Linux > kernel, e.g. Introduction to the InfiniBand Core Software, > Bob Woodruff, Sean Hefty, 2005 Linux Symposium. But in the Linux kernel there > are also rdma_xxx interfaces, e.g. rdma_connect, rdma_listen, etc. How can > I use these interfaces? Please give me a tip. See include/rdma/rdma_cm.h Or. From ogerlitz at voltaire.com Thu Jul 12 00:30:30 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Jul 2007 10:30:30 +0300 Subject: [ofa-general] How can i use the interface "rdma_xx" in linux kernel In-Reply-To: <4695D48B.6050508@ncic.ac.cn> References: <4695D30B.4090300@ncic.ac.cn> <4695D3A8.1080300@voltaire.com> <4695D48B.6050508@ncic.ac.cn> Message-ID: <4695D896.70607@voltaire.com> yangdong wrote: > OK. However, for a structure such as rdma_conn_param, there is not enough > info to tell me what its members mean. > How can I get this info? If you have librdmacm installed through OFED, run $ man rdma_connect If not, go to http://git.openfabrics.org/git/?p=~shefty/librdmacm.git;a=tree;f=man;h=c70c237c6e527dda4c6432f662a0331baffd4658;hb=HEAD take rdma_connect.3 and run $ nroff -man rdma_connect.3 Or. > > Or Gerlitz wrote: >> yangdong wrote: >> >>> So far, all I have seen are introductions to the IB interface in the Linux >>> kernel, e.g. Introduction to the InfiniBand Core Software, >>> Bob Woodruff, Sean Hefty, 2005 Linux Symposium. But in the Linux kernel there >>> are also rdma_xxx interfaces, e.g. rdma_connect, rdma_listen, etc. How can >>> I use these interfaces? Please give me a tip.
>>> >> see include/rdma/rdma_cm.h >> >> Or. >> >> >> >> >> > From ogerlitz at voltaire.com Thu Jul 12 00:44:35 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Jul 2007 10:44:35 +0300 Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: References: Message-ID: <4695DBE3.1070002@voltaire.com> Davis, Arlin R wrote: > Sean, > > OFED 1.2 removed the rdma_set_option call used to adjust response > timeout. We are running into some cases on larger clusters that require > longer timeouts then the default. Can you consider this rdma_cm patch > for OFED 1.2.1 that adds a module parameter for the response timeout? > Thanks. > > Signed-off by: Arlin Davis Sean, You have approved this patch for OFED 1.2.1, does it suitable also for upstream, and if not how you think it would be correct to proceed? thanks, Or. > > --- a/drivers/infiniband/core/cma.c 2007-07-11 10:46:48.000000000 > -0700 > +++ b/drivers/infiniband/core/cma.c 2007-07-11 10:54:16.000000000 > -0700 > @@ -58,6 +58,10 @@ MODULE_PARM_DESC(tavor_quirk, "Tavor per > #define CMA_CM_RESPONSE_TIMEOUT 20 > #define CMA_MAX_CM_RETRIES 15 > > +static int cma_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > +module_param_named(cma_response_timeout, cma_response_timeout, int, > 0644); > +MODULE_PARM_DESC(cma_response_timeout, "CMA_CM_RESPONSE_TIMEOUT > default=20"); > + > static void cma_add_one(struct ib_device *device); > static void cma_remove_one(struct ib_device *device); > > @@ -2157,7 +2161,7 @@ static int cma_resolve_ib_udp(struct rdm > req.path = route->path_rec; > req.service_id = cma_get_service_id(id_priv->id.ps, > &route->addr.dst_addr); > - req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8); > + req.timeout_ms = 1 << (cma_response_timeout - 8); > req.max_cm_retries = CMA_MAX_CM_RETRIES; > > ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req); > @@ -2216,8 +2220,8 @@ static int cma_connect_ib(struct rdma_id > req.flow_control = conn_param->flow_control; > 
req.retry_count = conn_param->retry_count; > req.rnr_retry_count = conn_param->rnr_retry_count; > - req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > - req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > + req.remote_cm_response_timeout = cma_response_timeout; > + req.local_cm_response_timeout = cma_response_timeout; > req.max_cm_retries = CMA_MAX_CM_RETRIES; > req.srq = id_priv->srq ? 1 : 0; > > @@ -2344,7 +2348,7 @@ static int cma_accept_ib(struct rdma_id_ > rep.private_data_len = conn_param->private_data_len; > rep.responder_resources = conn_param->responder_resources; > rep.initiator_depth = conn_param->initiator_depth; > - rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; > + rep.target_ack_delay = cma_response_timeout; > rep.failover_accepted = 0; > rep.flow_control = conn_param->flow_control; > rep.rnr_retry_count = conn_param->rnr_retry_count; > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at dev.mellanox.co.il Thu Jul 12 02:32:55 2007 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 12 Jul 2007 12:32:55 +0300 Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: References: Message-ID: <4695F547.5040001@dev.mellanox.co.il> Davis, Arlin R wrote: > Sean, > > OFED 1.2 removed the rdma_set_option call used to adjust response > timeout. We are running into some cases on larger clusters that require > longer timeouts then the default. Can you consider this rdma_cm patch > for OFED 1.2.1 that adds a module parameter for the response timeout? > Thanks. 
> > Signed-off by: Arlin Davis > Hi, This patch added as kernel_patches/fixes/cma_response_timeout.patch to git://git.openfabrics.org/ofed_1_2/linux-2.6.git Branches: ofed_1_2 and ofed_1_2_c Regards, Vladimir From vlad at lists.openfabrics.org Thu Jul 12 02:44:38 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 12 Jul 2007 02:44:38 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070712-0200 daily build status Message-ID: <20070712094438.9F8D6E60871@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on powerpc with 
linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: Build failed on i686 with linux-2.6.22-rc7 From ogerlitz at voltaire.com Thu Jul 12 03:13:48 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Jul 2007 13:13:48 +0300 (IDT) Subject: [ofa-general] ipoib attempting to join on junk MGID for child interface Message-ID: Opening ipoib debug prints, with OFED 1.2 (RH4 U3 i386) I see such prints: ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22 any idea what caused this jusk mgid to be used by ipoib? Or. 
below there is more complete dmesg output, basically to reproduce this I just do: $ ifconfig ib0 up $ echo 0x8007 > /sys/class/net/ib0/create_child $ echo 0x8007 > /sys/class/net/ib0/delete_child with waiting few second between each command ib0: bringing up interface ib0: starting multicast thread ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff ib0: restarting multicast task ib0: stopping multicast thread ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0001:ff98:2e61 ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0000:0000:0001 ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001 ib0: starting multicast thread ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) ib0: Created ah f45e7840 ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV f45e7840, LID 0xc000, SL 0 ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff98:2e61 ib0: join completion for ff12:601b:ffff:0000:0000:0001:ff98:2e61 (status 0) ib0: Created ah f6a62120 ib0: MGID ff12:601b:ffff:0000:0000:0001:ff98:2e61 AV f6a62120, LID 0xc00a, SL 0 ib0: joining MGID ff12:601b:ffff:0000:0000:0000:0000:0001 ib0: join completion for ff12:601b:ffff:0000:0000:0000:0000:0001 (status 0) ib0: Created ah f4bd35c0 ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0001 AV f4bd35c0, LID 0xc00b, SL 0 ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001 ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0) ib0: Created ah f4bd3560 ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV f4bd3560, LID 0xc001, SL 0 ib0: successfully joined all multicast groups ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0002 ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join ib0: Created ah f6a62560 ib0: MGID ff12:601b:ffff:0000:0000:0000:0000:0002 AV f6a62560, LID 0xc00d, SL 0 ib0: no IPv6 routers present divert: not allocating divert_blk for non-ethernet device ib0.8007 
ip_tables: (C) 2000-2002 Netfilter core team ib0.8007: bringing up interface ib0.8007: starting multicast thread ib0.8007: joining MGID ff12:401b:8007:0000:0000:0000:ffff:ffff ib0.8007: restarting multicast task ib0.8007: stopping multicast thread ib0.8007: adding multicast entry for mgid ff12:601b:8007:0000:0000:0001:ff98:2e61 ib0.8007: adding multicast entry for mgid ff12:601b:8007:0000:0000:0000:0000:0001 ib0.8007: starting multicast thread ib0.8007: join completion for ff12:401b:8007:0000:0000:0000:ffff:ffff (status 0) ib0.8007: Created ah f45e7500 ib0.8007: MGID ff12:401b:8007:0000:0000:0000:ffff:ffff AV f45e7500, LID 0xc005, SL 0 ib0.8007: joining MGID ff12:601b:8007:0000:0000:0001:ff98:2e61 ib0.8007: setting up send only multicast group for ff12:601b:8007:0000:0000:0000:0000:0016 ib0.8007: no multicast record for ff12:601b:8007:0000:0000:0000:0000:0016, starting join ib0.8007: join completion for ff12:601b:8007:0000:0000:0001:ff98:2e61 (status 0) ib0.8007: Created ah f34fece0 ib0.8007: MGID ff12:601b:8007:0000:0000:0001:ff98:2e61 AV f34fece0, LID 0xc00e, SL 0 ib0.8007: joining MGID ff12:601b:8007:0000:0000:0000:0000:0001 ib0.8007: Created ah f34fee60 ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0016 AV f34fee60, LID 0xc010, SL 0 ib0.8007: join completion for ff12:601b:8007:0000:0000:0000:0000:0001 (status 0) ib0.8007: Created ah f34fec60 ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0001 AV f34fec60, LID 0xc00c, SL 0 ib0.8007: successfully joined all multicast groups ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7 ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22 ib0.8007: setting up send only multicast group for ff12:601b:8007:0000:0000:0000:0000:0002 ib0.8007: no multicast record for ff12:601b:8007:0000:0000:0000:0000:0002, starting join ib0.8007: Created ah f34fec40 ib0.8007: no 
multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22 ib0.8007: MGID ff12:601b:8007:0000:0000:0000:0000:0002 AV f34fec40, LID 0xc011, SL 0 ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22 ib0.8007: restarting multicast task ib0.8007: stopping multicast thread ib0.8007: adding multicast entry for mgid ff12:401b:8007:0000:0000:0000:0000:0001 ib0.8007: starting multicast thread ib0.8007: joining MGID ff12:401b:8007:0000:0000:0000:0000:0001 ib0.8007: join completion for ff12:401b:8007:0000:0000:0000:0000:0001 (status 0) ib0.8007: Created ah f554d6c0 ib0.8007: MGID ff12:401b:8007:0000:0000:0000:0000:0001 AV f554d6c0, LID 0xc006, SL 0 ib0.8007: successfully joined all multicast groups ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:e89f:e9bf:8011:0406 ib0.8007: no multicast record for ffff:ffff:8007:0000:e89f:e9bf:8011:0406, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:e89f:e9bf:8011:0406, status -22 ib0.8007: setting up send only multicast group for ffff:ffff:8007:0000:70ba:e200:8011:0406 ib0.8007: no multicast record for ffff:ffff:8007:0000:70ba:e200:8011:0406, starting join ib0.8007: multicast join failed for ffff:ffff:8007:0000:70ba:e200:8011:0406, status -22 ib0.8007: no IPv6 routers present ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0001:ff98:2e61 ib0.8007: stopping interface ib0.8007: downing ib_dev ib0.8007: stopping multicast thread ib0.8007: flushing multicast list ib0.8007: leaving MGID ff12:601b:8007:0000:0000:0001:ff98:2e61 ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0001:ff98:2e61 ib0.8007: leaving MGID ff12:601b:8007:0000:0000:0000:0000:0001 ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0000:0000:0001 ib0.8007: deleting multicast 
group ff12:601b:8007:0000:0000:0000:0000:0016 ib0.8007: deleting multicast group ffff:ffff:8007:0000:df5b:10c0:24c6:1af7 ib0.8007: deleting multicast group ff12:601b:8007:0000:0000:0000:0000:0002 ib0.8007: leaving MGID ff12:401b:8007:0000:0000:0000:0000:0001 ib0.8007: deleting multicast group ff12:401b:8007:0000:0000:0000:0000:0001 ib0.8007: deleting multicast group ffff:ffff:8007:0000:e89f:e9bf:8011:0406 ib0.8007: deleting multicast group ffff:ffff:8007:0000:70ba:e200:8011:0406 ib0.8007: leaving MGID ff12:401b:8007:0000:0000:0000:ffff:ffff ib0.8007: deleting multicast group ff12:401b:8007:0000:0000:0000:ffff:ffff ib0.8007: All sends and receives done. divert: no divert_blk to free, ib0.8007 not ethernet ib0.8007: cleaning up ib_dev ib0.8007: stopping multicast thread ib0.8007: flushing multicast list ib0.8007: Cleanup ipoib connected mode. ib0: neigh_destructor for ffffff ff12:601b:ffff:0000:0000:0000:0000:0002 From mst at dev.mellanox.co.il Thu Jul 12 03:23:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jul 2007 13:23:37 +0300 Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface In-Reply-To: References: Message-ID: <20070712102326.GC12325@mellanox.co.il> > Quoting Or Gerlitz : > Subject: ipoib attempting to join on junk MGID for child interface > > Opening ipoib debug prints, with OFED 1.2 (RH4 U3 i386) I see such prints: > > ib0.8007: no multicast record for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, starting join > ib0.8007: multicast join failed for ffff:ffff:8007:0000:df5b:10c0:24c6:1af7, status -22 > > any idea what caused this jusk mgid to be used by ipoib? What does "ip maddr show" give you? 
-- MST

From ogerlitz at voltaire.com Thu Jul 12 03:29:55 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:29:55 +0300 (IDT)
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface
In-Reply-To:
References:
Message-ID:

Below is the system view. Note that the distro /sbin/ip util has a bug that prevents it from displaying the correct hw address (i.e. the one that the device sees, which is present in /proc/net/dev_mcast).

[root at rain1 ogerlitz]# cat /proc/net/dev_mcast
2    eth0      1  0  01005e000001
2    eth0      1  0  3333ff287e76
2    eth0      1  0  333300000001
5    ib0       1  0  00ffffffff12601b0000000000000001ff982e61
5    ib0       1  0  00ffffffff12601b000000000000000000000001
5    ib0       1  0  00ffffffff12401b000000000000000000000001
14   ib0.8007  1  0  00ffffffff12601b0000000000000001ff982e61
14   ib0.8007  1  0  00ffffffff12601b000000000000000000000001

[root at rain1 ogerlitz]# ip maddr show ib0
5: ib0
    link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:39:00:00:00
    link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:39:00:00:00
    link 00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:39:00:00:00
    inet 224.0.0.1
    inet6 ff02::1:ff98:2e61
    inet6 ff02::1

[root at rain1 ogerlitz]# ip maddr show ib0.8007
14: ib0.8007
    link 00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:39:00:00:00
    link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:39:00:00:00
    link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:39:00:00:00
    inet 224.0.0.1
    inet6 ff02::1:ff98:2e61
    inet6 ff02::1

From mst at dev.mellanox.co.il Thu Jul 12 03:34:27 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 12 Jul 2007 13:34:27 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface
In-Reply-To:
References:
Message-ID: <20070712103427.GD12325@mellanox.co.il>

> Quoting Or Gerlitz :
> Subject: Re: ipoib attempting to join on junk MGID for child interface
>
> below is the system view, note that the distro /sbin/ip util has
> a bug displaying the correct hw address (ie the one that the device
> sees - which is present in /proc/net/dev_mcast)

Which distro is this?

-- MST

From mst at dev.mellanox.co.il Thu Jul 12 03:42:05 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 13:42:05 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface
In-Reply-To:
References:
Message-ID: <20070712104205.GE12325@mellanox.co.il>

> Quoting Or Gerlitz :
> Subject: Re: ipoib attempting to join on junk MGID for child interface
>
> below is the system view, note that the distro /sbin/ip util has
> a bug displaying the correct hw address (ie the one that the device
> sees - which is present in /proc/net/dev_mcast)

You can use the one supplied with ofed instead.

-- MST

From ogerlitz at voltaire.com Thu Jul 12 03:52:36 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 13:52:36 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface
In-Reply-To: <20070712104205.GE12325@mellanox.co.il>
References: <20070712104205.GE12325@mellanox.co.il>
Message-ID: <469607F4.808@voltaire.com>

Michael S. Tsirkin wrote:
>> Quoting Or Gerlitz :
>> Subject: Re: ipoib attempting to join on junk MGID for child interface
>>
>> below is the system view, note that the distro /sbin/ip util has
>> a bug displaying the correct hw address (ie the one that the device
>> sees - which is present in /proc/net/dev_mcast)
>
> You can use the one supplied with ofed instead.

Whatever; /proc/net/dev_mcast gives you the full picture from the kernel's point of view.
Also is there any chance you would be pushing the /sbin/ip changes to the maintainer of the package the contains it? Or. From ogerlitz at voltaire.com Thu Jul 12 03:56:21 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 12 Jul 2007 13:56:21 +0300 (IDT) Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump In-Reply-To: References: Message-ID: OK, I did some checks with upstream kernel, the junk mkey for child interface phenomena does not reproduce, which probably means its either ofed or RH4 kernel issue. However, I started on 2.6.21-rc6 under which i saw the below crash, which does not reproduce now under 2.6.22, was there any fix that you are aware to around this area of the code? Or. ib0.8007: bringing up interface ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<6>ADDRCONF(NETDEV_UP): ib0.8007: link is not ready ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: stopping interface ib0.8007: downing ib_dev ib0.8007: stopping multicast thread ib0.8007: flushing multicast list Unable to handle kernel NULL pointer dereference at 0000000000000070 RIP: [] _spin_lock_irqsave+0x3/0x24 PGD 36305067 PUD 3b8fb067 PMD 0 Oops: 0002 [1] SMP CPU 1 Modules linked in: ib_ipoib ib_cm ib_sa ipv6 ib_mthca ib_mad ib_core sg st sd_mod sr_mod scsi_mod e100 i2c_amd8111 i2c_amd756 i2c_core Pid: 12633, comm: ifconfig Not tainted 2.6.21-rc6 #2 RIP: 0010:[] [] _spin_lock_irqsave+0x3/0x24 RSP: 0018:ffff810026dcbc50 EFLAGS: 00010092 RAX: 0000000000000292 RBX: ffff810016425000 RCX: ffff810016425750 RDX: ffff810026dcbd48 RSI: 0000000000000000 RDI: 0000000000000070 RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000 R10: ffff810000e6b2c0 R11: 0000000000000001 R12: 0000000000000070 R13: 0000000000000000 R14: ffff81003f8c6c00 R15: ffff810016425000 FS: 00002abdfeb7b740(0000) GS:ffff81003f8a7a40(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 
0000000000000070 CR3: 000000003897a000 CR4: 00000000000006e0 Process ifconfig (pid: 12633, threadinfo ffff810026dca000, task ffff81003d7537f0) Stack: ffffffff880e48f7 0000003000000010 ffff810026dcbd38 ffff810026dcbc78 ffff81003f8c6c00 ffff810016425000 ffff810016425000 ffff810026dcbd30 ffff810016425000 ffff810016425700 ffff810016425700 ffff810016425000 Call Trace: [] :ib_cm:cm_destroy_id+0x1c/0x25c [] :ib_ipoib:ipoib_cm_dev_stop+0x27/0xc5 [] :ib_ipoib:ipoib_ib_dev_stop+0x25/0x2c3 [] flush_cpu_workqueue+0xb3/0xc1 [] autoremove_wake_function+0x0/0x2e [] lock_timer_base+0x1b/0x3c [] :ib_ipoib:ipoib_mcast_dev_flush+0x10e/0x159 [] :ib_ipoib:ipoib_flush_paths+0x34/0x15a [] :ib_ipoib:ipoib_stop+0x63/0xef [] dev_close+0x58/0x77 [] dev_change_flags+0x57/0x119 [] devinet_ioctl+0x265/0x5cd [] inet_ioctl+0x3f/0x5e [] sock_ioctl+0x16c/0x189 [] do_ioctl+0x29/0x6f [] vfs_ioctl+0x274/0x285 [] sys_ioctl+0x3c/0x60 [] system_call+0x7e/0x83 Code: f0 ff 0f 79 1b a9 00 02 00 00 74 0b fb f3 90 83 3f 00 7e f9 RIP [] _spin_lock_irqsave+0x3/0x24 RSP CR2: 0000000000000070 From mst at dev.mellanox.co.il Thu Jul 12 04:01:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 12 Jul 2007 14:01:12 +0300 Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump In-Reply-To: References: Message-ID: <20070712110111.GF12325@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump > > OK, I did some checks with upstream kernel, the junk mkey for child interface phenomena > does not reproduce, which probably means its either ofed or RH4 kernel issue. > > However, I started on 2.6.21-rc6 under which i saw the below crash, which > does not reproduce now under 2.6.22, was there any fix that you are aware > to around this area of the code? Not directly here, but 841adfca9c5fc0fec6b1f0b2e5eb7a3b239a7730 fixed a bug that might thinkably trigger double free/memory corruption. 
-- MST

From mst at dev.mellanox.co.il Thu Jul 12 04:02:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 12 Jul 2007 14:02:06 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID for child interface
In-Reply-To: <469607F4.808@voltaire.com>
References: <20070712104205.GE12325@mellanox.co.il> <469607F4.808@voltaire.com>
Message-ID: <20070712110206.GG12325@mellanox.co.il>

> Quoting Or Gerlitz :
> Subject: Re: ipoib attempting to join on junk MGID for child interface
>
> Also is there any chance you would be pushing the /sbin/ip changes to
> the maintainer of the package the contains it?

There are no changes; it's just that Red Hat includes an old version of the tool and OFED packages a newer one.

-- MST

From halr at voltaire.com Thu Jul 12 04:07:34 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 07:07:34 -0400
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump
In-Reply-To:
References:
Message-ID: <1184238443.17622.180620.camel@hal.voltaire.com>

On Thu, 2007-07-12 at 06:56, Or Gerlitz wrote:
> OK, I did some checks with upstream kernel, the junk mkey for child interface phenomena
> does not reproduce, which probably means its either ofed or RH4 kernel issue.

FWIW (probably just as a data point to keep in mind), this problem was seen and reported on the list quite a while ago. It is extremely hard to reproduce. No clue as to what causes it.

-- Hal

> However, I started on 2.6.21-rc6 under which i saw the below crash, which
> does not reproduce now under 2.6.22, was there any fix that you are aware
> to around this area of the code?
>
> Or.
> > > ib0.8007: bringing up interface > ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<6>ADDRCONF(NETDEV_UP): ib0.8007: link is not ready > ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: IPOIB_FLAG_OPER_UP not set<7>ib0.8007: stopping interface > ib0.8007: downing ib_dev > ib0.8007: stopping multicast thread > ib0.8007: flushing multicast list > Unable to handle kernel NULL pointer dereference at 0000000000000070 RIP: > [] _spin_lock_irqsave+0x3/0x24 > PGD 36305067 PUD 3b8fb067 PMD 0 > Oops: 0002 [1] SMP > CPU 1 > Modules linked in: ib_ipoib ib_cm ib_sa ipv6 ib_mthca ib_mad ib_core sg st sd_mod sr_mod scsi_mod e100 i2c_amd8111 i2c_amd756 i2c_core > Pid: 12633, comm: ifconfig Not tainted 2.6.21-rc6 #2 > RIP: 0010:[] [] _spin_lock_irqsave+0x3/0x24 > RSP: 0018:ffff810026dcbc50 EFLAGS: 00010092 > RAX: 0000000000000292 RBX: ffff810016425000 RCX: ffff810016425750 > RDX: ffff810026dcbd48 RSI: 0000000000000000 RDI: 0000000000000070 > RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000 > R10: ffff810000e6b2c0 R11: 0000000000000001 R12: 0000000000000070 > R13: 0000000000000000 R14: ffff81003f8c6c00 R15: ffff810016425000 > FS: 00002abdfeb7b740(0000) GS:ffff81003f8a7a40(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000000000000070 CR3: 000000003897a000 CR4: 00000000000006e0 > Process ifconfig (pid: 12633, threadinfo ffff810026dca000, task ffff81003d7537f0) > Stack: ffffffff880e48f7 0000003000000010 ffff810026dcbd38 ffff810026dcbc78 > ffff81003f8c6c00 ffff810016425000 ffff810016425000 ffff810026dcbd30 > ffff810016425000 ffff810016425700 ffff810016425700 ffff810016425000 > Call Trace: > [] :ib_cm:cm_destroy_id+0x1c/0x25c > [] :ib_ipoib:ipoib_cm_dev_stop+0x27/0xc5 > [] :ib_ipoib:ipoib_ib_dev_stop+0x25/0x2c3 > [] flush_cpu_workqueue+0xb3/0xc1 > [] autoremove_wake_function+0x0/0x2e > [] lock_timer_base+0x1b/0x3c > [] :ib_ipoib:ipoib_mcast_dev_flush+0x10e/0x159 > [] 
:ib_ipoib:ipoib_flush_paths+0x34/0x15a
> [] :ib_ipoib:ipoib_stop+0x63/0xef
> [] dev_close+0x58/0x77
> [] dev_change_flags+0x57/0x119
> [] devinet_ioctl+0x265/0x5cd
> [] inet_ioctl+0x3f/0x5e
> [] sock_ioctl+0x16c/0x189
> [] do_ioctl+0x29/0x6f
> [] vfs_ioctl+0x274/0x285
> [] sys_ioctl+0x3c/0x60
> [] system_call+0x7e/0x83
>
>
> Code: f0 ff 0f 79 1b a9 00 02 00 00 74 0b fb f3 90 83 3f 00 7e f9
> RIP [] _spin_lock_irqsave+0x3/0x24
> RSP
> CR2: 0000000000000070

From ogerlitz at voltaire.com Thu Jul 12 04:52:09 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 12 Jul 2007 14:52:09 +0300
Subject: [ofa-general] Re: ipoib attempting to join on junk MGID / 2.6.21-rc6 crash dump
In-Reply-To: <1184238443.17622.180620.camel@hal.voltaire.com>
References: <1184238443.17622.180620.camel@hal.voltaire.com>
Message-ID: <469615E9.5060907@voltaire.com>

Hal Rosenstock wrote:
> On Thu, 2007-07-12 at 06:56, Or Gerlitz wrote:
>> OK, I did some checks with upstream kernel, the junk mkey for child interface phenomena
>> does not reproduce, which probably means its either ofed or RH4 kernel issue.
>
> FWIW (probably just as a data point to keep in mind), this problem has
> been seen and reported on the list quite a while ago. It is extremely
> hard to reproduce. No clue as to what causes it.

It reproduces 100% of the time on my system with RH4 U3; it just goes silent unless the multicast debug flag is enabled.

Or.
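
For readers comparing the /proc/net/dev_mcast output quoted earlier in this thread with the MGIDs in the ipoib debug prints: the 40-digit hex strings appear to be 20-byte IPoIB hardware addresses, i.e. 4 bytes of flags and QPN followed by a 16-byte GID. The sketch below assumes that layout and converts such a string into the colon-separated MGID notation. The function name ipoib_hwaddr_to_mgid is made up for illustration; it is not part of the kernel, iproute2 or OFED.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not an existing API): convert the
 * 40-digit hex address column of /proc/net/dev_mcast for an IPoIB device
 * into colon-separated MGID notation.
 *
 * Assumed layout: 4 bytes (flags + QPN) followed by a 16-byte GID.
 * The output needs 32 hex digits + 7 colons + NUL = 40 bytes.
 * Returns 0 on success, -1 on malformed input or a too-small buffer.
 */
int ipoib_hwaddr_to_mgid(const char *hex, char *out, size_t outlen)
{
	size_t i, o = 0;

	if (!hex || !out || strlen(hex) != 40 || outlen < 40)
		return -1;

	/* Reject anything that is not a plain hex string. */
	for (i = 0; i < 40; i++)
		if (!isxdigit((unsigned char)hex[i]))
			return -1;

	/* Skip the first 8 hex digits (flags + QPN); the rest is the GID. */
	for (i = 8; i < 40; i++) {
		if (i > 8 && (i - 8) % 4 == 0)
			out[o++] = ':';
		out[o++] = (char)tolower((unsigned char)hex[i]);
	}
	out[o] = '\0';
	return 0;
}
```

With this, the ib0.8007 entry 00ffffffff12601b0000000000000001ff982e61 decodes to ff12:601b:0000:0000:0000:0001:ff98:2e61. The debug log for the same group shows 8007 in the third 16-bit word, which suggests the driver fills in the P_Key before joining; that step is not modelled here.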
From halr at voltaire.com Thu Jul 12 06:18:44 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Jul 2007 09:18:44 -0400
Subject: [ofa-general] [ANNOUNCE] New management release
Message-ID: <1184246305.17622.188634.camel@hal.voltaire.com>

Hi,

There are new releases of the management libraries, diags, and OpenSM built off master (rather than the OFED 1.2 branch) available in:

http://www.openfabrics.org/~halr/

md5sum
0f9ec94d981ab381fb123550b4733d83  libibumad-1.1.2.tgz
6e33e38d7a8bdebe7960b057899483f6  libibmad-1.1.1.tgz
512b3d766220d3f757fe6fc4d10e78fe  infiniband-diags-1.3.1.tgz
04b25a2bf782955b3d01214756121f17  opensm-3.1.1.tgz

The existing libibcommon can be used with this (no changes with master):
a5b884775ed069da09ca0b60bfda3239  libibcommon-1.0.4.tar.gz

-- Hal

From vlad at dev.mellanox.co.il Thu Jul 12 06:23:55 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Thu, 12 Jul 2007 16:23:55 +0300
Subject: [ofa-general] OFED-1.2 release download link
Message-ID: <46962B6B.8050903@dev.mellanox.co.il>

Hi,

OFED-1.2 is currently available at
http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz

OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and RHEL 5.0 can be downloaded from:
http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/

Note: On http://www.openfabrics.org/downloads.htm the OFED 1.2 GA link points to the wrong place.

Regards,
Vladimir

From jackm at dev.mellanox.co.il Thu Jul 12 07:50:45 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 12 Jul 2007 17:50:45 +0300
Subject: [ofa-general] [PATCH v2] mlx4: add device reset to error handling mechanism
Message-ID: <200707121750.45629.jackm@dev.mellanox.co.il>

Add device reset to mlx4 Internal Error handling. Also, detect errors via polling the device error buffer (rather than via interrupt), because this is more reliable, and we do not wish to support two detection mechanisms.
This version incorporates suggestions made by Roland: - the error interrupt is entirely removed. - this patch uses round_jiffies_relative to reschedule polling timer. Signed-off-by: Jack Morgenstein Index: connectx_kernel/drivers/net/mlx4/catas.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/catas.c 2007-07-12 10:11:34.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/catas.c 2007-07-12 10:11:55.000000000 +0300 @@ -30,15 +30,31 @@ * SOFTWARE. */ +#include +#include #include "mlx4.h" +enum { + MLX4_CATAS_POLL_INTERVAL = 5 * HZ, +}; + +static DEFINE_SPINLOCK(catas_lock); + +static LIST_HEAD(catas_list); +static struct workqueue_struct *catas_wq; +static struct work_struct catas_work; + +static int ierr_reset_disable; +module_param_named(ierr_reset_disable, ierr_reset_disable, int, 0644); +MODULE_PARM_DESC(ierr_reset_disable, "disable reset on Internal Error event if nonzero"); + void mlx4_handle_catas_err(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); int i; - mlx4_err(dev, "Catastrophic error detected:\n"); + mlx4_err(dev, "Internal error detected:\n"); for (i = 0; i < priv->fw.catas_size; ++i) mlx4_err(dev, " buf[%02x]: %08x\n", i, swab32(readl(priv->catas_err.map + i))); @@ -46,25 +63,119 @@ void mlx4_handle_catas_err(struct mlx4_d mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0); } -void mlx4_map_catas_buf(struct mlx4_dev *dev) +static void catas_reset(struct work_struct *work) +{ + struct mlx4_priv *priv, *tmppriv; + struct mlx4_dev *dev; + + LIST_HEAD(tlist); + int ret; + + spin_lock_irq(&catas_lock); + list_splice_init(&catas_list, &tlist); + spin_unlock_irq(&catas_lock); + + list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) { + ret = mlx4_restart_one(priv->dev.pdev); + dev = &priv->dev; + if (ret) + mlx4_err(dev, "Reset failed (%d)\n", ret); + else + mlx4_dbg(dev, "Reset succeeded\n"); + } +} + +static void handle_catas(struct mlx4_dev *dev) +{ 
+ unsigned long flags; + struct mlx4_priv *priv = mlx4_priv(dev); + + mlx4_handle_catas_err(dev); + + if (ierr_reset_disable) + return; + + spin_lock_irqsave(&catas_lock, flags); + list_add(&priv->catas_err.list, &catas_list); + queue_work(catas_wq, &catas_work); + spin_unlock_irqrestore(&catas_lock, flags); +} + +static void poll_catas(unsigned long dev_ptr) +{ + struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr; + struct mlx4_priv *priv = mlx4_priv(dev); + unsigned long flags; + + if (readl(priv->catas_err.map)) { + handle_catas(&priv->dev); + return; + } + + spin_lock_irqsave(&catas_lock, flags); + if (!priv->catas_err.stop) + mod_timer(&priv->catas_err.timer, + round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL)); + spin_unlock_irqrestore(&catas_lock, flags); + + return; +} + +void mlx4_start_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); unsigned long addr; + init_timer(&priv->catas_err.timer); + priv->catas_err.stop = 0; + priv->catas_err.map = NULL; + addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) + priv->fw.catas_offset; priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4); if (!priv->catas_err.map) - mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n", + mlx4_warn(dev, "Failed to map Internal Error buffer at 0x%lx\n", addr); + priv->catas_err.timer.data = (unsigned long) dev; + priv->catas_err.timer.function = poll_catas; + priv->catas_err.timer.expires = + round_jiffies_relative(MLX4_CATAS_POLL_INTERVAL); + INIT_LIST_HEAD(&priv->catas_err.list); + add_timer(&priv->catas_err.timer); } -void mlx4_unmap_catas_buf(struct mlx4_dev *dev) +void mlx4_stop_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + spin_lock_irq(&catas_lock); + priv->catas_err.stop = 1; + spin_unlock_irq(&catas_lock); + + del_timer_sync(&priv->catas_err.timer); + if (priv->catas_err.map) iounmap(priv->catas_err.map); + + spin_lock_irq(&catas_lock); + list_del(&priv->catas_err.list); + 
spin_unlock_irq(&catas_lock); +} + +int __init mlx4_catas_init(void) +{ + INIT_WORK(&catas_work, catas_reset); + + catas_wq = create_singlethread_workqueue("mlx4_err"); + if (!catas_wq) + return -ENOMEM; + + return 0; +} + +void mlx4_catas_cleanup(void) +{ + destroy_workqueue(catas_wq); } Index: connectx_kernel/drivers/net/mlx4/eq.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/eq.c 2007-07-12 10:11:34.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/eq.c 2007-07-12 10:11:55.000000000 +0300 @@ -89,14 +89,12 @@ struct mlx4_eq_context { (1ull << MLX4_EVENT_TYPE_PATH_MIG_FAILED) | \ (1ull << MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ (1ull << MLX4_EVENT_TYPE_WQ_ACCESS_ERROR) | \ - (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ (1ull << MLX4_EVENT_TYPE_PORT_CHANGE) | \ (1ull << MLX4_EVENT_TYPE_ECC_DETECT) | \ (1ull << MLX4_EVENT_TYPE_SRQ_CATAS_ERROR) | \ (1ull << MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE) | \ (1ull << MLX4_EVENT_TYPE_SRQ_LIMIT) | \ (1ull << MLX4_EVENT_TYPE_CMD)) -#define MLX4_CATAS_EVENT_MASK (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) struct mlx4_eqe { u8 reserved1; @@ -264,7 +262,7 @@ static irqreturn_t mlx4_interrupt(int ir writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -281,14 +279,6 @@ static irqreturn_t mlx4_msi_x_interrupt( return IRQ_HANDLED; } -static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr) -{ - mlx4_handle_catas_err(dev_ptr); - - /* MSI-X vectors always belong to us */ - return IRQ_HANDLED; -} - static int mlx4_MAP_EQ(struct mlx4_dev *dev, u64 event_mask, int unmap, int eq_num) { @@ -490,11 +480,9 @@ static void mlx4_free_irqs(struct mlx4_d if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) if (eq_table->eq[i].have_irq) 
free_irq(eq_table->eq[i].irq, eq_table->eq + i); - if (eq_table->eq[MLX4_EQ_CATAS].have_irq) - free_irq(eq_table->eq[MLX4_EQ_CATAS].irq, dev); } static int __devinit mlx4_map_clr_int(struct mlx4_dev *dev) @@ -598,32 +586,19 @@ int __devinit mlx4_init_eq_table(struct if (dev->flags & MLX4_FLAG_MSI_X) { static const char *eq_name[] = { [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)", - [MLX4_EQ_CATAS] = DRV_NAME " (catas)" + [MLX4_EQ_ASYNC] = DRV_NAME " (async)" }; - err = mlx4_create_eq(dev, 1, MLX4_EQ_CATAS, - &priv->eq_table.eq[MLX4_EQ_CATAS]); - if (err) - goto err_out_async; - - for (i = 0; i < MLX4_EQ_CATAS; ++i) { + for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); if (err) - goto err_out_catas; + goto err_out_async; priv->eq_table.eq[i].have_irq = 1; } - err = request_irq(priv->eq_table.eq[MLX4_EQ_CATAS].irq, - mlx4_catas_interrupt, 0, - eq_name[MLX4_EQ_CATAS], dev); - if (err) - goto err_out_catas; - - priv->eq_table.eq[MLX4_EQ_CATAS].have_irq = 1; } else { err = request_irq(dev->pdev->irq, mlx4_interrupt, IRQF_SHARED, DRV_NAME, dev); @@ -639,22 +614,11 @@ int __devinit mlx4_init_eq_table(struct mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); - if (dev->flags & MLX4_FLAG_MSI_X) { - err = mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 0, - priv->eq_table.eq[MLX4_EQ_CATAS].eqn); - if (err) - mlx4_warn(dev, "MAP_EQ for catas EQ %d failed (%d)\n", - priv->eq_table.eq[MLX4_EQ_CATAS].eqn, err); - } - return 0; -err_out_catas: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); - err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); @@ -675,19 +639,13 @@ void mlx4_cleanup_eq_table(struct mlx4_d struct mlx4_priv *priv = mlx4_priv(dev); int i; - if (dev->flags & MLX4_FLAG_MSI_X) 
- mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 1, - priv->eq_table.eq[MLX4_EQ_CATAS].eqn); - mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 1, priv->eq_table.eq[MLX4_EQ_ASYNC].eqn); mlx4_free_irqs(dev); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); - if (dev->flags & MLX4_FLAG_MSI_X) - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); mlx4_unmap_clr_int(dev); Index: connectx_kernel/drivers/net/mlx4/intf.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/intf.c 2007-07-12 10:11:34.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/intf.c 2007-07-12 10:11:55.000000000 +0300 @@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev mlx4_add_device(intf, priv); mutex_unlock(&intf_mutex); + mlx4_start_catas_poll(dev); return 0; } @@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_ struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_interface *intf; + mlx4_stop_catas_poll(dev); mutex_lock(&intf_mutex); list_for_each_entry(intf, &intf_list, list) Index: connectx_kernel/drivers/net/mlx4/main.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/main.c 2007-07-12 10:11:34.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/main.c 2007-07-12 10:11:55.000000000 +0300 @@ -583,13 +583,11 @@ static int __devinit mlx4_setup_hca(stru goto err_pd_table_free; } - mlx4_map_catas_buf(dev); - err = mlx4_init_eq_table(dev); if (err) { mlx4_err(dev, "Failed to initialize " "event queue table, aborting.\n"); - goto err_catas_buf; + goto err_mr_table_free; } err = mlx4_cmd_use_events(dev); @@ -659,8 +657,7 @@ err_cmd_poll: err_eq_table_free: mlx4_cleanup_eq_table(dev); -err_catas_buf: - mlx4_unmap_catas_buf(dev); +err_mr_table_free: mlx4_cleanup_mr_table(dev); err_pd_table_free: @@ -849,9 +846,6 @@ err_cleanup: mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - 
mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); mlx4_cleanup_uar_table(dev); @@ -899,9 +893,6 @@ static void __devexit mlx4_remove_one(st mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); @@ -922,6 +913,12 @@ static void __devexit mlx4_remove_one(st } } +int mlx4_restart_one(struct pci_dev *pdev) +{ + mlx4_remove_one(pdev); + return mlx4_init_one(pdev, NULL); +} + static struct pci_device_id mlx4_pci_table[] = { { PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */ { PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */ @@ -944,6 +941,10 @@ static int __init mlx4_init(void) { int ret; + ret = mlx4_catas_init(); + if (ret) + return ret; + ret = pci_register_driver(&mlx4_driver); return ret < 0 ? ret : 0; } @@ -951,6 +952,7 @@ static int __init mlx4_init(void) static void __exit mlx4_cleanup(void) { pci_unregister_driver(&mlx4_driver); + mlx4_catas_cleanup(); } module_init(mlx4_init); Index: connectx_kernel/drivers/net/mlx4/mlx4.h =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/mlx4.h 2007-07-12 10:11:34.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/mlx4.h 2007-07-12 10:11:55.000000000 +0300 @@ -67,7 +67,6 @@ enum { enum { MLX4_EQ_ASYNC, MLX4_EQ_COMP, - MLX4_EQ_CATAS, MLX4_NUM_EQ }; @@ -248,7 +247,9 @@ struct mlx4_mcg_table { struct mlx4_catas_err { u32 __iomem *map; - int size; + u32 stop; + struct timer_list timer; + struct list_head list; }; struct mlx4_priv { @@ -311,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_d void mlx4_cleanup_srq_table(struct mlx4_dev *dev); void mlx4_cleanup_mcg_table(struct mlx4_dev *dev); -void mlx4_map_catas_buf(struct mlx4_dev *dev); -void mlx4_unmap_catas_buf(struct mlx4_dev *dev); - +void mlx4_start_catas_poll(struct mlx4_dev *dev); +void mlx4_stop_catas_poll(struct mlx4_dev *dev); +int 
mlx4_catas_init(void); +void mlx4_catas_cleanup(void); +int mlx4_restart_one(struct pci_dev *pdev); int mlx4_register_device(struct mlx4_dev *dev); void mlx4_unregister_device(struct mlx4_dev *dev); void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type, From tziporet at mellanox.co.il Thu Jul 12 08:01:08 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 12 Jul 2007 18:01:08 +0300 Subject: [ofa-general] OFED 1.2.c-9 is available Message-ID: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> Hi All, OFED 1.2.c-9 is available now on the OFA server under: http://www.openfabrics.org/builds/connectx/release/ Note: this release was tested with FW 2.1.000 that will soon be available on Mellanox web site for download. Supported Platforms and Operating Systems ================================= o CPU architectures: - x86_64 - x86 - ppc64 - ia64 o Linux Operating Systems: - RedHat EL4 up3: 2.6.9-34.ELsmp - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL5: 2.6.18-8.el5 - SLES10: 2.6.16.21-0.8-smp - kernel.org: 2.6.20.x - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested) Main changes from OFED 1.2.c-8: ========================= 1. Kernel oops in IPoIB on restart of the driver. 2. IPoIB CM is now the default. 3. MPI with SRQ is supported. 4. Itanium is now supported. mlx4 Fixed Bugs and Enhancements =========================== - Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428. - Query QP and query SRQ are now supported. - Internal error flow was added. - Number of QPs that can be attached to the same multicast group was increased to 56. - SRQ is now supported. - Fork is now supported. ConnectX specific known issues and limitations =================================== - The following commands and/or features are not supported: o Resize CQ o FMRs o APM o SQD - ibstat does not present all entries. Use ibv_devinfo instead. 
- To load the driver on machines with 64KB default page size UAR bar must be enlarged. 64KB page size is the default of PPC with RHEL5 and Itanium with 64KB page size enabled. Perform the following three steps: 1. Add the following line in the firmware configuration (INI) file under the [HCA] section: log2_uar_bar_megabytes = 5 2. Burn a modified firmware image with the changed INI file 3. Reboot the system Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380
From fenkes at de.ibm.com Thu Jul 12 08:45:26 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:45:26 +0200 Subject: [ofa-general] [PATCH 00/10] IB/ehca: Multiple Event Queues, MR/MW rework, large page MRs, fixes Message-ID: <200707121745.27592.fenkes@de.ibm.com> Building on top of the last patch series, this set of patches adds multi-EQ support, fixes a few nits (including formatting), refactors the MR/MW code and adds support for large page MRs. Another patch set will follow. Note that patch 7 will introduce a few lines over 80 chars that will be unindented in patch 8 - I hope that's okay with you. The patches, in detail, are: [01/10] adds support for multiple event queues (ie interrupt sources) [02/10] fixes a problem with HW autodetection [03/10] \ [04/10] | [05/10] | These refactor and clean up the MR/MW code. We split them into [06/10] | bite-sized chunks for easier review of the changes. [07/10] | [08/10] / [09/10] fixes a lot of checkpatch.pl warnings [10/10] adds large page MR support for eHCA2 The patches should apply cleanly, in order, against Roland's git. Please review the changes and apply the patches if they are okay. Regards, Joachim -- Joachim Fenkes  --  eHCA Linux Driver Developer and Hardware Tamer IBM Deutschland Entwicklung GmbH  --  Dept. 3627 (I/O Firmware Dev. 2) Schoenaicher Strasse 220  --  71032 Boeblingen  --  Germany eMail: fenkes at de.ibm.com From fenkes at de.ibm.com Thu Jul 12 08:46:35 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:46:35 +0200 Subject: [ofa-general] [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121746.36763.fenkes@de.ibm.com> From: Hoang-Nam Nguyen The eHCA driver can now handle multiple event queues (read: interrupt sources) instead of one.
The number of available EQs is selected via the nr_eqs module parameter. CQs are either assigned to the EQs based on the comp_vector index or, if the dist_eqs module parameter is supplied, using a round-robin scheme. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 13 +++- drivers/infiniband/hw/ehca/ehca_cq.c | 16 +++- drivers/infiniband/hw/ehca/ehca_eq.c | 139 ++++++++++++++++++----------- drivers/infiniband/hw/ehca/ehca_irq.c | 36 +++----- drivers/infiniband/hw/ehca/ehca_irq.h | 8 +- drivers/infiniband/hw/ehca/ehca_iverbs.h | 9 +- drivers/infiniband/hw/ehca/ehca_main.c | 118 ++++++++++++++++++++----- drivers/infiniband/hw/ehca/ehca_qp.c | 2 +- 8 files changed, 233 insertions(+), 108 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index daf823e..b2d614a 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -72,7 +72,11 @@ struct ehca_eqe_cache_entry { struct ehca_cq *cq; }; +struct ehca_shca; + struct ehca_eq { + struct ehca_shca *shca; + char name[17]; u32 length; struct ipz_queue ipz_queue; struct ipz_eq_handle ipz_eq_handle; @@ -100,6 +104,7 @@ struct ehca_sport { struct ehca_sma_attr saved_attr; }; +#define EHCA_MAX_NR_EQS 512 struct ehca_shca { struct ib_device ib_device; struct ibmebus_dev *ibmebus_dev; @@ -108,14 +113,16 @@ struct ehca_shca { struct list_head shca_list; struct ipz_adapter_handle ipz_hca_handle; struct ehca_sport sport[2]; - struct ehca_eq eq; - struct ehca_eq neq; + struct ehca_eq **eqs; + struct ehca_eq *aeq; /* async event for qps */ + struct ehca_eq *neq; struct ehca_mr *maxmr; struct ehca_pd *pd; struct h_galpas galpas; struct mutex modify_mutex; u64 hca_cap; int max_mtu; + atomic_t cur_eq_idx; }; struct ehca_pd { @@ -290,6 +297,8 @@ struct ehca_ucontext { int ehca_init_pd_cache(void); void ehca_cleanup_pd_cache(void); +int ehca_init_eq_cache(void); +void ehca_cleanup_eq_cache(void); int 
ehca_init_cq_cache(void); void ehca_cleanup_cq_cache(void); int ehca_init_qp_cache(void); diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 01d4a14..97da51e 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -117,6 +117,8 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, struct ib_udata *udata) { + extern int ehca_nr_eqs; + extern int ehca_dist_eqs; static const u32 additional_cqe = 20; struct ib_cq *cq; struct ehca_cq *my_cq; @@ -134,6 +136,12 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (cqe >= 0xFFFFFFFF - 64 - additional_cqe) return ERR_PTR(-EINVAL); + if (comp_vector < 0 || comp_vector >= ehca_nr_eqs) { + ehca_err(device, "Invalid comp_vector=%x ehca_nr_eqs=%x", + comp_vector, ehca_nr_eqs); + return ERR_PTR(-EINVAL); + } + my_cq = kmem_cache_zalloc(cq_cache, GFP_KERNEL); if (!my_cq) { ehca_err(device, "Out of memory for ehca_cq struct device=%p", @@ -153,7 +161,13 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, cq = &my_cq->ib_cq; adapter_handle = shca->ipz_hca_handle; - param.eq_handle = shca->eq.ipz_eq_handle; + if (!ehca_dist_eqs) + param.eq_handle = shca->eqs[comp_vector]->ipz_eq_handle; + else { + u32 eq_idx = atomic_inc_return(&shca->cur_eq_idx) % ehca_nr_eqs; + param.eq_handle = shca->eqs[eq_idx]->ipz_eq_handle; + ehca_dbg(device, "assigned comp_vector=%x", eq_idx); + } do { if (!idr_pre_get(&ehca_cq_idr, GFP_KERNEL)) { diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c index 4961eb8..d443bcb 100644 --- a/drivers/infiniband/hw/ehca/ehca_eq.c +++ b/drivers/infiniband/hw/ehca/ehca_eq.c @@ -8,6 +8,7 @@ * Reinhard Ernst * Heiko J Schick * Hoang-Nam Nguyen + * Joachim Fenkes * * * Copyright (c) 2005 IBM Corporation @@ -50,40 +51,54 @@ #include "hcp_if.h" #include "ipz_pt_fn.h" -int 
ehca_create_eq(struct ehca_shca *shca, - struct ehca_eq *eq, - const enum ehca_eq_type type, const u32 length) +static struct kmem_cache *eq_cache; + +struct ehca_eq *ehca_create_eq(struct ehca_shca *shca, + const enum ehca_eq_type type, const u32 length) { - u64 ret; + struct ehca_eq *eq = NULL; + int ret; + u64 h_ret; u32 nr_pages; u32 i; void *vpage; struct ib_device *ib_dev = &shca->ib_device; - spin_lock_init(&eq->spinlock); - spin_lock_init(&eq->irq_spinlock); - eq->is_initialized = 0; + if (!length) { + ehca_err(ib_dev, "EQ length must not be zero."); + return ERR_PTR(-EINVAL); + } if (type != EHCA_EQ && type != EHCA_NEQ) { - ehca_err(ib_dev, "Invalid EQ type %x. eq=%p", type, eq); - return -EINVAL; + ehca_err(ib_dev, "Invalid EQ type %x", type); + return ERR_PTR(-EINVAL); } - if (!length) { - ehca_err(ib_dev, "EQ length must not be zero. eq=%p", eq); - return -EINVAL; + + eq = kmem_cache_zalloc(eq_cache, GFP_KERNEL); + if (!eq) { + ehca_err(ib_dev, "Out of memory for ehca_eq struct device=%p", + ib_dev); + return ERR_PTR(-ENOMEM); } - ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle, - &eq->pf, - type, - length, - &eq->ipz_eq_handle, - &eq->length, - &nr_pages, &eq->ist); + spin_lock_init(&eq->spinlock); + spin_lock_init(&eq->irq_spinlock); + eq->is_initialized = 0; + eq->shca = shca; - if (ret != H_SUCCESS) { - ehca_err(ib_dev, "Can't allocate EQ/NEQ. eq=%p", eq); - return -EINVAL; + h_ret = hipz_h_alloc_resource_eq(shca->ipz_hca_handle, + &eq->pf, + type, + length, + &eq->ipz_eq_handle, + &eq->length, + &nr_pages, &eq->ist); + + if (h_ret != H_SUCCESS) { + ehca_err(ib_dev, "Can't allocate EQ/NEQ. 
eq=%p h_ret=%lx", + eq, h_ret); + ret = -EINVAL; + goto create_eq_exit0; } ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages, @@ -97,51 +112,51 @@ int ehca_create_eq(struct ehca_shca *shca, u64 rpage; if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) { - ret = H_RESOURCE; + ret = -ENOMEM; goto create_eq_exit2; } rpage = virt_to_abs(vpage); - ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle, - eq->ipz_eq_handle, - &eq->pf, - 0, 0, rpage, 1); + h_ret = hipz_h_register_rpage_eq(shca->ipz_hca_handle, + eq->ipz_eq_handle, + &eq->pf, + 0, 0, rpage, 1); if (i == (nr_pages - 1)) { /* last page */ vpage = ipz_qpageit_get_inc(&eq->ipz_queue); - if (ret != H_SUCCESS || vpage) + if (h_ret != H_SUCCESS || vpage) { + ret = -ENOMEM; goto create_eq_exit2; + } } else { - if (ret != H_PAGE_REGISTERED || !vpage) + if (h_ret != H_PAGE_REGISTERED || !vpage) { + ret = -ENOMEM; goto create_eq_exit2; + } } } ipz_qeit_reset(&eq->ipz_queue); /* register interrupt handlers and initialize work queues */ - if (type == EHCA_EQ) { - ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_eq, - IRQF_DISABLED, "ehca_eq", - (void *)shca); - if (ret < 0) - ehca_err(ib_dev, "Can't map interrupt handler."); - - tasklet_init(&eq->interrupt_task, ehca_tasklet_eq, (long)shca); - } else if (type == EHCA_NEQ) { - ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt_neq, - IRQF_DISABLED, "ehca_neq", - (void *)shca); - if (ret < 0) - ehca_err(ib_dev, "Can't map interrupt handler."); - - tasklet_init(&eq->interrupt_task, ehca_tasklet_neq, (long)shca); - } + if (type == EHCA_EQ) + snprintf(eq->name, sizeof(eq->name), "ehca_eq_%x", eq->ist); + else + snprintf(eq->name, sizeof(eq->name), "ehca_neq"); + + ret = ibmebus_request_irq(NULL, eq->ist, ehca_interrupt, + IRQF_DISABLED, eq->name, (void *)eq); + if (ret < 0) + ehca_err(ib_dev, "Can't map interrupt handler."); + + tasklet_init(&eq->interrupt_task, + (type == EHCA_EQ) ? 
ehca_tasklet_eq : ehca_tasklet_neq, + (long)eq); eq->is_initialized = 1; - return 0; + return eq; create_eq_exit2: ipz_queue_dtor(&eq->ipz_queue); @@ -149,10 +164,13 @@ create_eq_exit2: create_eq_exit1: hipz_h_destroy_eq(shca->ipz_hca_handle, eq); - return -EINVAL; +create_eq_exit0: + kmem_cache_free(eq_cache, eq); + + return ERR_PTR(ret); } -void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq) +void *ehca_poll_eq(struct ehca_eq *eq) { unsigned long flags; void *eqe; @@ -164,13 +182,14 @@ void *ehca_poll_eq(struct ehca_shca *shca, struct ehca_eq *eq) return eqe; } -int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) +int ehca_destroy_eq(struct ehca_eq *eq) { + struct ehca_shca *shca = eq->shca; unsigned long flags; u64 h_ret; spin_lock_irqsave(&eq->spinlock, flags); - ibmebus_free_irq(NULL, eq->ist, (void *)shca); + ibmebus_free_irq(NULL, eq->ist, (void *)eq); h_ret = hipz_h_destroy_eq(shca->ipz_hca_handle, eq); @@ -181,6 +200,24 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) return -EINVAL; } ipz_queue_dtor(&eq->ipz_queue); + kmem_cache_free(eq_cache, eq); return 0; } + +int ehca_init_eq_cache(void) +{ + eq_cache = kmem_cache_create("ehca_cache_eq", + sizeof(struct ehca_eq), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!eq_cache) + return -ENOMEM; + return 0; +} + +void ehca_cleanup_eq_cache(void) +{ + if (eq_cache) + kmem_cache_destroy(eq_cache); +} diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 96eba38..7a4071a 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -389,32 +389,24 @@ static inline void reset_eq_pending(struct ehca_cq *cq) return; } -irqreturn_t ehca_interrupt_neq(int irq, void *dev_id) -{ - struct ehca_shca *shca = (struct ehca_shca*)dev_id; - - tasklet_hi_schedule(&shca->neq.interrupt_task); - - return IRQ_HANDLED; -} - void ehca_tasklet_neq(unsigned long data) { - struct ehca_shca *shca = (struct 
ehca_shca*)data; + struct ehca_eq *neq = (struct ehca_eq *)data; + struct ehca_shca *shca = neq->shca; struct ehca_eqe *eqe; u64 ret; - eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + eqe = (struct ehca_eqe *)ehca_poll_eq(neq); while (eqe) { if (!EHCA_BMASK_GET(NEQE_COMPLETION_EVENT, eqe->entry)) parse_ec(shca, eqe->entry); - eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->neq); + eqe = (struct ehca_eqe *)ehca_poll_eq(neq); } ret = hipz_h_reset_event(shca->ipz_hca_handle, - shca->neq.ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL); + neq->ipz_eq_handle, 0xFFFFFFFFFFFFFFFFL); if (ret != H_SUCCESS) ehca_err(&shca->ib_device, "Can't clear notification events."); @@ -422,11 +414,11 @@ void ehca_tasklet_neq(unsigned long data) return; } -irqreturn_t ehca_interrupt_eq(int irq, void *dev_id) +irqreturn_t ehca_interrupt(int irq, void *dev_id) { - struct ehca_shca *shca = (struct ehca_shca*)dev_id; + struct ehca_eq *eq = (struct ehca_eq *)dev_id; - tasklet_hi_schedule(&shca->eq.interrupt_task); + tasklet_hi_schedule(&eq->interrupt_task); return IRQ_HANDLED; } @@ -468,9 +460,9 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) } } -void ehca_process_eq(struct ehca_shca *shca, int is_irq) +void ehca_process_eq(struct ehca_eq *eq, int is_irq) { - struct ehca_eq *eq = &shca->eq; + struct ehca_shca *shca = eq->shca; struct ehca_eqe_cache_entry *eqe_cache = eq->eqe_cache; u64 eqe_value; unsigned long flags; @@ -498,7 +490,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) do { u32 token; eqe_cache[eqe_cnt].eqe = - (struct ehca_eqe *)ehca_poll_eq(shca, eq); + (struct ehca_eqe *)ehca_poll_eq(eq); if (!eqe_cache[eqe_cnt].eqe) break; eqe_value = eqe_cache[eqe_cnt].eqe->entry; @@ -535,7 +527,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) } /* check eq */ spin_lock(&eq->spinlock); - eq_empty = (!ipz_eqit_eq_peek_valid(&shca->eq.ipz_queue)); + eq_empty = (!ipz_eqit_eq_peek_valid(&eq->ipz_queue)); spin_unlock(&eq->spinlock); /* 
call completion handler for cached eqes */ for (i = 0; i < eqe_cnt; i++) @@ -557,7 +549,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) goto unlock_irq_spinlock; do { struct ehca_eqe *eqe; - eqe = (struct ehca_eqe *)ehca_poll_eq(shca, &shca->eq); + eqe = (struct ehca_eqe *)ehca_poll_eq(eq); if (!eqe) break; process_eqe(shca, eqe); @@ -569,7 +561,7 @@ unlock_irq_spinlock: void ehca_tasklet_eq(unsigned long data) { - ehca_process_eq((struct ehca_shca*)data, 1); + ehca_process_eq((struct ehca_eq *)data, 1); } static inline int find_next_online_cpu(struct ehca_comp_pool* pool) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.h b/drivers/infiniband/hw/ehca/ehca_irq.h index 3346cb0..18d5397 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.h +++ b/drivers/infiniband/hw/ehca/ehca_irq.h @@ -50,12 +50,10 @@ struct ehca_shca; int ehca_error_data(struct ehca_shca *shca, void *data, u64 resource); -irqreturn_t ehca_interrupt_neq(int irq, void *dev_id); -void ehca_tasklet_neq(unsigned long data); - -irqreturn_t ehca_interrupt_eq(int irq, void *dev_id); +irqreturn_t ehca_interrupt(int irq, void *dev_id); void ehca_tasklet_eq(unsigned long data); -void ehca_process_eq(struct ehca_shca *shca, int is_irq); +void ehca_tasklet_neq(unsigned long data); +void ehca_process_eq(struct ehca_eq *eq, int is_irq); struct ehca_cpu_comp_task { wait_queue_head_t wait_queue; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 77aeca6..bf8fbf7 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -117,13 +117,12 @@ enum ehca_eq_type { EHCA_NEQ /* Notification Event Queue */ }; -int ehca_create_eq(struct ehca_shca *shca, struct ehca_eq *eq, - enum ehca_eq_type type, const u32 length); +struct ehca_eq *ehca_create_eq(struct ehca_shca *shca, + const enum ehca_eq_type type, const u32 length); -int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq); - -void *ehca_poll_eq(struct 
ehca_shca *shca, struct ehca_eq *eq); +int ehca_destroy_eq(struct ehca_eq *eq); +void *ehca_poll_eq(struct ehca_eq *eq); struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, struct ib_ucontext *context, diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 28ba2dd..d9a37dc 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -63,6 +63,8 @@ int ehca_port_act_time = 30; int ehca_poll_all_eqs = 1; int ehca_static_rate = -1; int ehca_scaling_code = 0; +int ehca_nr_eqs = 2; +int ehca_dist_eqs = 0; module_param_named(open_aqp1, ehca_open_aqp1, int, 0); module_param_named(debug_level, ehca_debug_level, int, 0); @@ -72,7 +74,9 @@ module_param_named(use_hp_mr, ehca_use_hp_mr, int, 0); module_param_named(port_act_time, ehca_port_act_time, int, 0); module_param_named(poll_all_eqs, ehca_poll_all_eqs, int, 0); module_param_named(static_rate, ehca_static_rate, int, 0); -module_param_named(scaling_code, ehca_scaling_code, int, 0); +module_param_named(scaling_code, ehca_scaling_code, int, 0); +module_param_named(nr_eqs, ehca_nr_eqs, int, 0); +module_param_named(dist_eqs, ehca_dist_eqs, int, 0); MODULE_PARM_DESC(open_aqp1, "AQP1 on startup (0: no (default), 1: yes)"); @@ -95,6 +99,11 @@ MODULE_PARM_DESC(static_rate, "set permanent static rate (default: disabled)"); MODULE_PARM_DESC(scaling_code, "set scaling code (0: disabled/default, 1: enabled)"); +MODULE_PARM_DESC(nr_eqs, + "set number of event queues (default : 2)"); +MODULE_PARM_DESC(dist_eqs, + "enable distributing EQs across CQs " + "(0: disabled/default, 1: enabled)"); DEFINE_RWLOCK(ehca_qp_idr_lock); DEFINE_RWLOCK(ehca_cq_idr_lock); @@ -135,6 +144,12 @@ static int ehca_create_slab_caches(void) return ret; } + ret = ehca_init_eq_cache(); + if (ret) { + ehca_gen_err("Cannot create EQ SLAB cache."); + goto create_slab_caches1; + } + ret = ehca_init_cq_cache(); if (ret) { ehca_gen_err("Cannot create CQ SLAB 
cache."); @@ -182,6 +197,9 @@ create_slab_caches3: ehca_cleanup_cq_cache(); create_slab_caches2: + ehca_cleanup_eq_cache(); + +create_slab_caches1: ehca_cleanup_pd_cache(); return ret; @@ -193,6 +211,7 @@ static void ehca_destroy_slab_caches(void) ehca_cleanup_av_cache(); ehca_cleanup_qp_cache(); ehca_cleanup_cq_cache(); + ehca_cleanup_eq_cache(); ehca_cleanup_pd_cache(); #ifdef CONFIG_PPC_64K_PAGES if (ctblk_cache) @@ -362,7 +381,7 @@ int ehca_init_device(struct ehca_shca *shca) shca->ib_device.node_type = RDMA_NODE_IB_CA; shca->ib_device.phys_port_cnt = shca->num_ports; - shca->ib_device.num_comp_vectors = 1; + shca->ib_device.num_comp_vectors = ehca_nr_eqs; shca->ib_device.dma_device = &shca->ibmebus_dev->ofdev.dev; shca->ib_device.query_device = ehca_query_device; shca->ib_device.query_port = ehca_query_port; @@ -585,6 +604,15 @@ static ssize_t ehca_show_adapter_handle(struct device *dev, } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); +static ssize_t ehca_show_nr_eqs(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", ehca_nr_eqs); +} + +static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL); + static struct attribute *ehca_dev_attrs[] = { &dev_attr_adapter_handle.attr, &dev_attr_num_ports.attr, @@ -601,6 +629,7 @@ static struct attribute *ehca_dev_attrs[] = { &dev_attr_cur_mw.attr, &dev_attr_max_pd.attr, &dev_attr_max_ah.attr, + &dev_attr_nr_eqs.attr, NULL }; @@ -608,13 +637,27 @@ static struct attribute_group ehca_dev_attr_grp = { .attrs = ehca_dev_attrs }; +static void destroy_all_eqs(struct ehca_shca *shca) +{ + int ret, i; + + for (i = 0; i < ehca_nr_eqs && shca->eqs[i]; i++) { + ret = ehca_destroy_eq(shca->eqs[i]); + if (ret) + ehca_err(&shca->ib_device, "Cannot destroy EQ " + "ret=%x i=%x eq=%p", ret, i, shca->eqs[i]); + } + + kfree(shca->eqs); +} + static int __devinit ehca_probe(struct ibmebus_dev *dev, const struct of_device_id *id) { struct ehca_shca *shca; const u64 
*handle; struct ib_pd *ibpd; - int ret; + int ret, i; handle = of_get_property(dev->ofdev.node, "ibm,hca-handle", NULL); if (!handle) { @@ -648,19 +691,35 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev, ret = ehca_init_device(shca); if (ret) { - ehca_gen_err("Cannot init ehca device struct"); + ehca_gen_err("Cannot init ehca device struct"); goto probe1; } /* create event queues */ - ret = ehca_create_eq(shca, &shca->eq, EHCA_EQ, 2048); - if (ret) { - ehca_err(&shca->ib_device, "Cannot create EQ."); + shca->eqs = kzalloc(ehca_nr_eqs * sizeof(*shca->eqs), GFP_KERNEL); + if (!shca->eqs) { + ehca_gen_err("Cannot alloc eqs array"); goto probe1; } - ret = ehca_create_eq(shca, &shca->neq, EHCA_NEQ, 513); - if (ret) { + for (i = 0; i < ehca_nr_eqs; i++) { + shca->eqs[i] = ehca_create_eq(shca, EHCA_EQ, 2048); + if (IS_ERR(shca->eqs[i])) { + ehca_err(&shca->ib_device, "Cannot create EQ."); + ret = PTR_ERR(shca->eqs[i]); + shca->eqs[i] = NULL; + goto probe2; + } + } + + shca->aeq = ehca_create_eq(shca, EHCA_EQ, 2048); + if (IS_ERR(shca->aeq)) { + ehca_err(&shca->ib_device, "Cannot create AEQ."); + goto probe2; + } + + shca->neq = ehca_create_eq(shca, EHCA_NEQ, 513); + if (IS_ERR(shca->neq)) { ehca_err(&shca->ib_device, "Cannot create NEQ."); goto probe3; } @@ -747,16 +806,20 @@ probe5: "Cannot destroy internal PD. ret=%x", ret); probe4: - ret = ehca_destroy_eq(shca, &shca->neq); + ret = ehca_destroy_eq(shca->neq); if (ret) ehca_err(&shca->ib_device, "Cannot destroy NEQ. ret=%x", ret); probe3: - ret = ehca_destroy_eq(shca, &shca->eq); + ret = ehca_destroy_eq(shca->aeq); if (ret) ehca_err(&shca->ib_device, - "Cannot destroy EQ. ret=%x", ret); + "Cannot destroy AEQ. 
ret=%x", ret); + +probe2: + if (shca->eqs) + destroy_all_eqs(shca); probe1: ib_dealloc_device(&shca->ib_device); @@ -767,12 +830,11 @@ probe1: static int __devexit ehca_remove(struct ibmebus_dev *dev) { struct ehca_shca *shca = dev->ofdev.dev.driver_data; - int ret; + int ret, i; sysfs_remove_group(&dev->ofdev.dev.kobj, &ehca_dev_attr_grp); if (ehca_open_aqp1 == 1) { - int i; for (i = 0; i < shca->num_ports; i++) { ret = ehca_destroy_aqp1(&shca->sport[i]); if (ret) @@ -794,11 +856,14 @@ static int __devexit ehca_remove(struct ibmebus_dev *dev) ehca_err(&shca->ib_device, "Cannot destroy internal PD. ret=%x", ret); - ret = ehca_destroy_eq(shca, &shca->eq); + if (shca->eqs) + destroy_all_eqs(shca); + + ret = ehca_destroy_eq(shca->aeq); if (ret) - ehca_err(&shca->ib_device, "Cannot destroy EQ. ret=%x", ret); + ehca_err(&shca->ib_device, "Canot destroy AEQ. ret=%x", ret); - ret = ehca_destroy_eq(shca, &shca->neq); + ret = ehca_destroy_eq(shca->neq); if (ret) ehca_err(&shca->ib_device, "Canot destroy NEQ. 
ret=%x", ret); @@ -829,16 +894,20 @@ static struct ibmebus_driver ehca_driver = { void ehca_poll_eqs(unsigned long data) { + extern int ehca_nr_eqs; struct ehca_shca *shca; spin_lock(&shca_list_lock); list_for_each_entry(shca, &shca_list, shca_list) { - if (shca->eq.is_initialized) { - /* call deadman proc only if eq ptr does not change */ - struct ehca_eq *eq = &shca->eq; + int i; + for (i = 0; i < ehca_nr_eqs; i++) { + struct ehca_eq *eq = shca->eqs[i]; int max = 3; volatile u64 q_ofs, q_ofs2; u64 flags; + if (!eq || !eq->is_initialized) + continue; + /* call deadman proc only if eq ptr does not change */ spin_lock_irqsave(&eq->spinlock, flags); q_ofs = eq->ipz_queue.current_q_offset; spin_unlock_irqrestore(&eq->spinlock, flags); @@ -849,7 +918,7 @@ void ehca_poll_eqs(unsigned long data) max--; } while (q_ofs == q_ofs2 && max > 0); if (q_ofs == q_ofs2) - ehca_process_eq(shca, 0); + ehca_process_eq(eq, 0); } } mod_timer(&poll_eqs_timer, jiffies + HZ); @@ -863,6 +932,13 @@ int __init ehca_module_init(void) printk(KERN_INFO "eHCA Infiniband Device Driver " "(Rel.: SVNEHCA_0023)\n"); + if (ehca_nr_eqs < 1 || ehca_nr_eqs > EHCA_MAX_NR_EQS) { + ehca_gen_err("Invalid option nr_eqs=%x. 
" + "Specify a number in range [1-%d].", + ehca_nr_eqs, EHCA_MAX_NR_EQS); + return -EINVAL; + } + if ((ret = ehca_create_comp_pool())) { ehca_gen_err("Cannot create comp pool."); return ret; diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 7467125..f6f4ef6 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -545,7 +545,7 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, } parms.token = my_qp->token; - parms.eq_handle = shca->eq.ipz_eq_handle; + parms.eq_handle = shca->aeq->ipz_eq_handle; parms.pd = my_pd->fw_pd; if (my_qp->send_cq) parms.send_cq_handle = my_qp->send_cq->ipz_cq_handle; -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:48:22 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:48:22 +0200 Subject: [ofa-general] [PATCH 03/10] IB/ehca: fix memory leak in error path of ehca_get_dma_mr() In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121748.23065.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index add79bd..98f2531 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -111,6 +111,7 @@ struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) &e_maxmr->ib.ib_mr.lkey, &e_maxmr->ib.ib_mr.rkey); if (ret) { + ehca_mr_delete(e_maxmr); ib_mr = ERR_PTR(ret); goto get_dma_mr_exit0; } -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:47:45 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:47:45 +0200 Subject: [ofa-general] [PATCH 02/10] IB/ehca: Fix HW level autodetection In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: 
<200707121747.46618.fenkes@de.ibm.com> Autodetection was missing a few HW revisions, causing certain eHCA1 revisions to be treated like eHCA2. Fixed. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 29 +++++++++++++++++------------ 1 files changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index d9a37dc..57c551e 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -282,22 +282,27 @@ int ehca_sense_attributes(struct ehca_shca *shca) ehca_gen_dbg(" ... hardware version=%x:%x", hcaaver, revid); - if ((hcaaver == 1) && (revid == 0)) - shca->hw_level = 0x11; - else if ((hcaaver == 1) && (revid == 1)) - shca->hw_level = 0x12; - else if ((hcaaver == 1) && (revid == 2)) - shca->hw_level = 0x13; - else if ((hcaaver == 2) && (revid == 0)) - shca->hw_level = 0x21; - else if ((hcaaver == 2) && (revid == 0x10)) - shca->hw_level = 0x22; - else { + if (hcaaver == 1) { + if (revid <= 3) + shca->hw_level = 0x10 | (revid + 1); + else + shca->hw_level = 0x14; + } else if (hcaaver == 2) { + if (revid == 0) + shca->hw_level = 0x21; + else if (revid == 0x10) + shca->hw_level = 0x22; + else if (revid == 0x20 || revid == 0x21) + shca->hw_level = 0x23; + } + + if (!shca->hw_level) { ehca_gen_warn("unknown hardware version" " - assuming default level"); shca->hw_level = 0x22; } - } + } else + shca->hw_level = ehca_hw_level; ehca_gen_dbg(" ... 
hardware level=%x", shca->hw_level); shca->sport[0].rate = IB_RATE_30_GBPS; -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:49:02 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:49:02 +0200 Subject: [ofa-general] [PATCH 04/10] IB/ehca: use common error code mapping instead of specific ones In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121749.03556.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Instead of one error mapping function for each potential error source in ehca_mrmw.c, use a centralized function that handles all cases, saving a three-figure line count. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 195 ++----------------------------- drivers/infiniband/hw/ehca/ehca_mrmw.h | 14 --- drivers/infiniband/hw/ehca/ehca_tools.h | 3 + 3 files changed, 15 insertions(+), 197 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 98f2531..7c1656a 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -537,7 +537,7 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) "hca_hndl=%lx mr_hndl=%lx lkey=%x", h_ret, mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); - ret = ehca_mrmw_map_hrc_query_mr(h_ret); + ret = ehca2ib_return_code(h_ret); goto query_mr_exit1; } mr_attr->pd = mr->pd; @@ -597,7 +597,7 @@ int ehca_dereg_mr(struct ib_mr *mr) "e_mr=%p hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", h_ret, shca, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); - ret = ehca_mrmw_map_hrc_free_mr(h_ret); + ret = ehca2ib_return_code(h_ret); goto dereg_mr_exit0; } @@ -637,7 +637,7 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%lx " "shca=%p hca_hndl=%lx mw=%p", h_ret, shca, shca->ipz_hca_handle.handle, e_mw); - ib_mw = 
ERR_PTR(ehca_mrmw_map_hrc_alloc(h_ret)); + ib_mw = ERR_PTR(ehca2ib_return_code(h_ret)); goto alloc_mw_exit1; } /* successful MW allocation */ @@ -680,7 +680,7 @@ int ehca_dealloc_mw(struct ib_mw *mw) "mw=%p rkey=%x hca_hndl=%lx mw_hndl=%lx", h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, e_mw->ipz_mw_handle.handle); - return ehca_mrmw_map_hrc_free_mw(h_ret); + return ehca2ib_return_code(h_ret); } /* successful deallocation */ ehca_mw_delete(e_mw); @@ -923,7 +923,7 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr) "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, fmr->lkey); - ret = ehca_mrmw_map_hrc_free_mr(h_ret); + ret = ehca2ib_return_code(h_ret); goto free_fmr_exit0; } /* successful deregistration */ @@ -964,7 +964,7 @@ int ehca_reg_mr(struct ehca_shca *shca, if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%lx " "hca_hndl=%lx", h_ret, shca->ipz_hca_handle.handle); - ret = ehca_mrmw_map_hrc_alloc(h_ret); + ret = ehca2ib_return_code(h_ret); goto ehca_reg_mr_exit0; } @@ -1079,7 +1079,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey); - ret = ehca_mrmw_map_hrc_rrpg_last(h_ret); + ret = ehca2ib_return_code(h_ret); break; } else ret = 0; @@ -1090,7 +1090,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, e_mr->ib.ib_mr.lkey, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle); - ret = ehca_mrmw_map_hrc_rrpg_notlast(h_ret); + ret = ehca2ib_return_code(h_ret); break; } else ret = 0; @@ -1254,7 +1254,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, h_ret, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey); - ret = ehca_mrmw_map_hrc_free_mr(h_ret); + ret = ehca2ib_return_code(h_ret); goto ehca_rereg_mr_exit0; } /* clean ehca_mr_t, without changing struct ib_mr and lock */ @@ -1351,7 +1351,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, h_ret, e_fmr, 
shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, e_fmr->ib.ib_fmr.lkey); - ret = ehca_mrmw_map_hrc_free_mr(h_ret); + ret = ehca2ib_return_code(h_ret); goto ehca_unmap_one_fmr_exit0; } /* clean ehca_mr_t, without changing lock */ @@ -1420,7 +1420,7 @@ int ehca_reg_smr(struct ehca_shca *shca, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, e_origmr->ib.ib_mr.lkey); - ret = ehca_mrmw_map_hrc_reg_smr(h_ret); + ret = ehca2ib_return_code(h_ret); goto ehca_reg_smr_exit0; } /* successful registration */ @@ -1539,7 +1539,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, h_ret, e_origmr, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, e_origmr->ib.ib_mr.lkey); - return ehca_mrmw_map_hrc_reg_smr(h_ret); + return ehca2ib_return_code(h_ret); } /* successful registration */ e_newmr->num_pages = e_origmr->num_pages; @@ -2043,177 +2043,6 @@ void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, /*----------------------------------------------------------------------*/ /* - * map HIPZ rc to IB retcodes for MR/MW allocations - * Used for hipz_mr_reg_alloc and hipz_mw_alloc. 
- */ -int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* successful completion */ - return 0; - case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ - case H_CONSTRAINED: /* resource constraint */ - case H_NO_MEM: - return -ENOMEM; - case H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_alloc() */ - -/*----------------------------------------------------------------------*/ - -/* - * map HIPZ rc to IB retcodes for MR register rpage - * Used for hipz_h_register_rpage_mr at registering last page - */ -int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* registration complete */ - return 0; - case H_PAGE_REGISTERED: /* page registered */ - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ -/* case H_QT_PARM: invalid queue type */ - case H_PARAMETER: /* - * invalid logical address, - * or count zero or greater 512 - */ - case H_TABLE_FULL: /* page table full */ - case H_HARDWARE: /* HCA not operational */ - return -EINVAL; - case H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_rrpg_last() */ - -/*----------------------------------------------------------------------*/ - -/* - * map HIPZ rc to IB retcodes for MR register rpage - * Used for hipz_h_register_rpage_mr at registering one page, but not last page - */ -int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_PAGE_REGISTERED: /* page registered */ - return 0; - case H_SUCCESS: /* registration complete */ - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ -/* case H_QT_PARM: invalid queue type */ - case H_PARAMETER: /* - * invalid logical address, - * or count zero or greater 512 - */ - case H_TABLE_FULL: /* page table full */ - case H_HARDWARE: /* HCA not operational */ - return -EINVAL; - case 
H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_rrpg_notlast() */ - -/*----------------------------------------------------------------------*/ - -/* map HIPZ rc to IB retcodes for MR query. Used for hipz_mr_query. */ -int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* successful completion */ - return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ - return -EINVAL; - case H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_query_mr() */ - -/*----------------------------------------------------------------------*/ -/*----------------------------------------------------------------------*/ - -/* - * map HIPZ rc to IB retcodes for freeing MR resource - * Used for hipz_h_free_resource_mr - */ -int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* resource freed */ - return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ - case H_R_STATE: /* invalid resource state */ - case H_HARDWARE: /* HCA not operational */ - return -EINVAL; - case H_RESOURCE: /* Resource in use */ - case H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_free_mr() */ - -/*----------------------------------------------------------------------*/ - -/* - * map HIPZ rc to IB retcodes for freeing MW resource - * Used for hipz_h_free_resource_mw - */ -int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* resource freed */ - return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ - case H_R_STATE: /* invalid resource state */ - case H_HARDWARE: /* HCA not operational */ - return -EINVAL; - case H_RESOURCE: /* Resource in use */ - case H_BUSY: /* long busy */ - return 
-EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_free_mw() */ - -/*----------------------------------------------------------------------*/ - -/* - * map HIPZ rc to IB retcodes for SMR registrations - * Used for hipz_h_register_smr. - */ -int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc) -{ - switch (hipz_rc) { - case H_SUCCESS: /* successful completion */ - return 0; - case H_ADAPTER_PARM: /* invalid adapter handle */ - case H_RH_PARM: /* invalid resource handle */ - case H_MEM_PARM: /* invalid MR virtual address */ - case H_MEM_ACCESS_PARM: /* invalid access controls */ - case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ - return -EINVAL; - case H_BUSY: /* long busy */ - return -EBUSY; - default: - return -EINVAL; - } -} /* end ehca_mrmw_map_hrc_reg_smr() */ - -/*----------------------------------------------------------------------*/ - -/* * MR destructor and constructor * used in Reregister MR verb, sets all fields in ehca_mr_t to 0, * except struct ib_mr and spinlock diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h index d936e40..fb69ede 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.h +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -121,20 +121,6 @@ void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl); void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, int *ib_acl); -int ehca_mrmw_map_hrc_alloc(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_rrpg_last(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_rrpg_notlast(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_query_mr(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_free_mr(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_free_mw(const u64 hipz_rc); - -int ehca_mrmw_map_hrc_reg_smr(const u64 hipz_rc); - void ehca_mr_deletenew(struct ehca_mr *mr); #endif /*_EHCA_MRMW_H_*/ diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 03b185f..fd8238b 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ 
b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -161,8 +161,11 @@ static inline int ehca2ib_return_code(u64 ehca_rc) switch (ehca_rc) { case H_SUCCESS: return 0; + case H_RESOURCE: /* Resource in use */ case H_BUSY: return -EBUSY; + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + case H_CONSTRAINED: /* resource constraint */ case H_NO_MEM: return -ENOMEM; default: -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:51:04 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:51:04 +0200 Subject: [ofa-general] [PATCH 05/10] IB/ehca: use #define for "pages per register_rpage" instead of hardcoded value In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121751.05587.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 19 +++++++++++-------- 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 7c1656a..1fe4f72 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -48,6 +48,9 @@ #include "hcp_if.h" #include "hipz_hw.h" +/* max number of rpages (per hcall register_rpages) */ +#define MAX_RPAGES 512 + static struct kmem_cache *mr_cache; static struct kmem_cache *mw_cache; @@ -1027,14 +1030,14 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } /* max 512 pages per shot */ - for (i = 0; i < ((pginfo->num_4k + 512 - 1) / 512); i++) { + for (i = 0; i < ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES); i++) { - if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) { - rnum = pginfo->num_4k % 512; /* last shot */ + if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) { + rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */ if (rnum == 0) - rnum = 512; /* last shot is full */ + rnum = MAX_RPAGES; /* last shot is full */ } else - rnum = 512; + rnum = MAX_RPAGES; if (rnum > 1) { 
ret = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage); @@ -1066,7 +1069,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, 0, /* pagesize 4k */ 0, rpage, rnum); - if (i == ((pginfo->num_4k + 512 - 1) / 512) - 1) { + if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) { /* * check for 'registration complete'==H_SUCCESS * and for 'page registered'==H_PAGE_REGISTERED @@ -1215,7 +1218,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, int rereg_3_hcall = 0; /* 1: use 3 hipz calls for reregistration */ /* first determine reregistration hCall(s) */ - if ((pginfo->num_4k > 512) || (e_mr->num_4k > 512) || + if ((pginfo->num_4k > MAX_RPAGES) || (e_mr->num_4k > MAX_RPAGES) || (pginfo->num_4k > e_mr->num_4k)) { ehca_dbg(&shca->ib_device, "Rereg3 case, pginfo->num_4k=%lx " "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k); @@ -1306,7 +1309,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; /* first check if reregistration hCall can be used for unmap */ - if (e_fmr->fmr_max_pages > 512) { + if (e_fmr->fmr_max_pages > MAX_RPAGES) { rereg_1_hcall = 0; rereg_3_hcall = 1; } -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:51:43 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:51:43 +0200 Subject: [ofa-general] [PATCH 06/10] IB/ehca: use macro to calculate number of chunks in a mem block In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121751.44394.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 47 ++++++++++++++++--------------- 1 files changed, 24 insertions(+), 23 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 1fe4f72..58e8b33 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -48,6 +48,8 @@ #include "hcp_if.h" #include "hipz_hw.h" +#define 
NUM_CHUNKS(length, chunk_size) \ + (((length) + (chunk_size - 1)) / (chunk_size)) /* max number of rpages (per hcall register_rpages) */ #define MAX_RPAGES 512 @@ -195,10 +197,10 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, } /* determine number of MR pages */ - num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size, + PAGE_SIZE); + num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size, + EHCA_PAGESIZE); /* register MR on HCA */ if (ehca_mr_is_maxmr(size, iova_start)) { @@ -305,10 +307,9 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt } /* determine number of MR pages */ - num_pages_mr = (((virt % PAGE_SIZE) + length + PAGE_SIZE - 1) / - PAGE_SIZE); - num_pages_4k = (((virt % EHCA_PAGESIZE) + length + EHCA_PAGESIZE - 1) / - EHCA_PAGESIZE); + num_pages_mr = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); + num_pages_4k = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length, + EHCA_PAGESIZE); /* register MR on HCA */ pginfo.type = EHCA_MR_PGI_USER; @@ -462,10 +463,10 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, ret = -EINVAL; goto rereg_phys_mr_exit1; } - num_pages_mr = ((((u64)new_start % PAGE_SIZE) + new_size + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = ((((u64)new_start % EHCA_PAGESIZE) + new_size + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) + + new_size, PAGE_SIZE); + num_pages_4k = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) + + new_size, EHCA_PAGESIZE); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_pages = num_pages_mr; pginfo.num_4k = num_pages_4k; @@ -1030,9 +1031,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } /* max 512 pages per shot */ - for (i = 0; i < ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES); i++) { + for (i = 0; i < NUM_CHUNKS(pginfo->num_4k, 
MAX_RPAGES); i++) { - if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) { + if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) { rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */ if (rnum == 0) rnum = MAX_RPAGES; /* last shot is full */ @@ -1069,7 +1070,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, 0, /* pagesize 4k */ 0, rpage, rnum); - if (i == ((pginfo->num_4k + MAX_RPAGES - 1) / MAX_RPAGES) - 1) { + if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) { /* * check for 'registration complete'==H_SUCCESS * and for 'page registered'==H_PAGE_REGISTERED @@ -1475,10 +1476,10 @@ int ehca_reg_internal_maxmr( iova_start = (u64*)KERNELBASE; ib_pbuf.addr = 0; ib_pbuf.size = size_maxmr; - num_pages_mr = ((((u64)iova_start % PAGE_SIZE) + size_maxmr + - PAGE_SIZE - 1) / PAGE_SIZE); - num_pages_4k = ((((u64)iova_start % EHCA_PAGESIZE) + size_maxmr + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE); + num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, + PAGE_SIZE); + num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + + size_maxmr, EHCA_PAGESIZE); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_pages = num_pages_mr; @@ -1700,8 +1701,8 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr, /* loop over desired phys_buf_array entries */ while (i < number) { pbuf = pginfo->phys_buf_array + pginfo->next_buf; - num4k = ((pbuf->addr % EHCA_PAGESIZE) + pbuf->size + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + num4k = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + + pbuf->size, EHCA_PAGESIZE); offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; while (pginfo->next_4k < offs4k + num4k) { /* sanity check */ @@ -1873,8 +1874,8 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr, goto ehca_set_pagebuf_1_exit0; } tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf; - num4k = ((tmp_pbuf->addr % EHCA_PAGESIZE) + tmp_pbuf->size + - EHCA_PAGESIZE - 1) / EHCA_PAGESIZE; + num4k = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) + + tmp_pbuf->size, EHCA_PAGESIZE); offs4k = (tmp_pbuf->addr 
& ~PAGE_MASK) / EHCA_PAGESIZE; *rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) + (pginfo->next_4k * EHCA_PAGESIZE)); -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:52:29 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:52:29 +0200 Subject: [ofa-general] [PATCH 07/10] IB/ehca: MR/MW structure refactoring In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121752.30129.fenkes@de.ibm.com> From: Hoang-Nam Nguyen - Rename struct ehca_mr fields to clearly distinguish between kernel and HW page size - Sort struct ehca_mr_pginfo into a common part and a union containing specific fields for physical, user and fast MR Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 50 ++-- drivers/infiniband/hw/ehca/ehca_mrmw.c | 511 +++++++++++++++-------------- 2 files changed, 284 insertions(+), 277 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index b2d614a..92103df 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -211,8 +211,8 @@ struct ehca_mr { spinlock_t mrlock; enum ehca_mr_flag flags; - u32 num_pages; /* number of MR pages */ - u32 num_4k; /* number of 4k "page" portions to form MR */ + u32 num_kpages; /* number of kernel pages */ + u32 num_hwpages; /* number of hw pages to form MR */ int acl; /* ACL (stored here for usage in reregister) */ u64 *start; /* virtual start address (stored here for */ /* usage in reregister) */ @@ -224,9 +224,6 @@ struct ehca_mr { /* fw specific data */ struct ipz_mrmw_handle ipz_mr_handle; /* MR handle for h-calls */ struct h_galpas galpas; - /* data for userspace bridge */ - u32 nr_of_pages; - void *pagearray; }; struct ehca_mw { @@ -248,26 +245,29 @@ enum ehca_mr_pgi_type { struct ehca_mr_pginfo { enum ehca_mr_pgi_type type; - u64 num_pages; - u64 page_cnt; - u64 num_4k; /* number of 4k "page" 
portions */ - u64 page_4k_cnt; /* counter for 4k "page" portions */ - u64 next_4k; /* next 4k "page" portion in buffer/chunk/listelem */ - - /* type EHCA_MR_PGI_PHYS section */ - int num_phys_buf; - struct ib_phys_buf *phys_buf_array; - u64 next_buf; - - /* type EHCA_MR_PGI_USER section */ - struct ib_umem *region; - struct ib_umem_chunk *next_chunk; - u64 next_nmap; - - /* type EHCA_MR_PGI_FMR section */ - u64 *page_list; - u64 next_listelem; - /* next_4k also used within EHCA_MR_PGI_FMR */ + u64 num_kpages; + u64 kpage_cnt; + u64 num_hwpages; /* number of hw pages */ + u64 hwpage_cnt; /* counter for hw pages */ + u64 next_hwpage; /* next hw page in buffer/chunk/listelem */ + + union { + struct { /* type EHCA_MR_PGI_PHYS section */ + int num_phys_buf; + struct ib_phys_buf *phys_buf_array; + u64 next_buf; + } phy; + struct { /* type EHCA_MR_PGI_USER section */ + struct ib_umem *region; + struct ib_umem_chunk *next_chunk; + u64 next_nmap; + } usr; + struct { /* type EHCA_MR_PGI_FMR section */ + u64 fmr_pgsize; + u64 *page_list; + u64 next_listelem; + } fmr; + } u; }; /* output parameters for MR/FMR hipz calls */ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 58e8b33..53b334b 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -150,9 +150,6 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd); u64 size; - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; - u32 num_pages_mr; - u32 num_pages_4k; /* 4k portion "pages" */ if ((num_phys_buf <= 0) || !phys_buf_array) { ehca_err(pd->device, "bad input values: num_phys_buf=%x " @@ -196,12 +193,6 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, goto reg_phys_mr_exit0; } - /* determine number of MR pages */ - num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size, - PAGE_SIZE); - num_pages_4k = NUM_CHUNKS(((u64)iova_start % 
EHCA_PAGESIZE) + size, - EHCA_PAGESIZE); - /* register MR on HCA */ if (ehca_mr_is_maxmr(size, iova_start)) { e_mr->flags |= EHCA_MR_FLAG_MAXMR; @@ -213,13 +204,22 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, goto reg_phys_mr_exit1; } } else { - pginfo.type = EHCA_MR_PGI_PHYS; - pginfo.num_pages = num_pages_mr; - pginfo.num_4k = num_pages_4k; - pginfo.num_phys_buf = num_phys_buf; - pginfo.phys_buf_array = phys_buf_array; - pginfo.next_4k = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + struct ehca_mr_pginfo pginfo; + u32 num_kpages; + u32 num_hwpages; + + num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size, + PAGE_SIZE); + num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + + size, EHCA_PAGESIZE); + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_kpages = num_kpages; + pginfo.num_hwpages = num_hwpages; + pginfo.u.phy.num_phys_buf = num_phys_buf; + pginfo.u.phy.phys_buf_array = phys_buf_array; + pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, @@ -254,10 +254,10 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt struct ehca_shca *shca = container_of(pd->device, struct ehca_shca, ib_device); struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd); - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_pginfo pginfo; int ret; - u32 num_pages_mr; - u32 num_pages_4k; /* 4k portion "pages" */ + u32 num_kpages; + u32 num_hwpages; if (!pd) { ehca_gen_err("bad pd=%p", pd); @@ -307,19 +307,20 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt } /* determine number of MR pages */ - num_pages_mr = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); - num_pages_4k = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length, - EHCA_PAGESIZE); + num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, 
PAGE_SIZE); + num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length, + EHCA_PAGESIZE); /* register MR on HCA */ - pginfo.type = EHCA_MR_PGI_USER; - pginfo.num_pages = num_pages_mr; - pginfo.num_4k = num_pages_4k; - pginfo.region = e_mr->umem; - pginfo.next_4k = e_mr->umem->offset / EHCA_PAGESIZE; - pginfo.next_chunk = list_prepare_entry(pginfo.next_chunk, - (&e_mr->umem->chunk_list), - list); + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_USER; + pginfo.num_kpages = num_kpages; + pginfo.num_hwpages = num_hwpages; + pginfo.u.usr.region = e_mr->umem; + pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE; + pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk, + (&e_mr->umem->chunk_list), + list); ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); @@ -365,9 +366,9 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, struct ehca_pd *new_pd; u32 tmp_lkey, tmp_rkey; unsigned long sl_flags; - u32 num_pages_mr = 0; - u32 num_pages_4k = 0; /* 4k portion "pages" */ - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + u32 num_kpages = 0; + u32 num_hwpages = 0; + struct ehca_mr_pginfo pginfo; u32 cur_pid = current->tgid; if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && @@ -463,17 +464,18 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, ret = -EINVAL; goto rereg_phys_mr_exit1; } - num_pages_mr = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) + - new_size, PAGE_SIZE); - num_pages_4k = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) + - new_size, EHCA_PAGESIZE); - pginfo.type = EHCA_MR_PGI_PHYS; - pginfo.num_pages = num_pages_mr; - pginfo.num_4k = num_pages_4k; - pginfo.num_phys_buf = num_phys_buf; - pginfo.phys_buf_array = phys_buf_array; - pginfo.next_4k = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) + + new_size, PAGE_SIZE); + num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) + + 
new_size, EHCA_PAGESIZE); + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_kpages = num_kpages; + pginfo.num_hwpages = num_hwpages; + pginfo.u.phy.num_phys_buf = num_phys_buf; + pginfo.u.phy.phys_buf_array = phys_buf_array; + pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / + EHCA_PAGESIZE); } if (mr_rereg_mask & IB_MR_REREG_ACCESS) new_acl = mr_access_flags; @@ -544,11 +546,11 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) ret = ehca2ib_return_code(h_ret); goto query_mr_exit1; } - mr_attr->pd = mr->pd; + mr_attr->pd = mr->pd; mr_attr->device_virt_addr = hipzout.vaddr; - mr_attr->size = hipzout.len; - mr_attr->lkey = hipzout.lkey; - mr_attr->rkey = hipzout.rkey; + mr_attr->size = hipzout.len; + mr_attr->lkey = hipzout.lkey; + mr_attr->rkey = hipzout.rkey; ehca_mrmw_reverse_map_acl(&hipzout.acl, &mr_attr->mr_access_flags); query_mr_exit1: @@ -704,7 +706,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, struct ehca_mr *e_fmr; int ret; u32 tmp_lkey, tmp_rkey; - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_pginfo pginfo; /* check other parameters */ if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && @@ -750,6 +752,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, e_fmr->flags |= EHCA_MR_FLAG_FMR; /* register MR on HCA */ + memset(&pginfo, 0, sizeof(pginfo)); ret = ehca_reg_mr(shca, e_fmr, NULL, fmr_attr->max_pages * (1 << fmr_attr->page_shift), mr_access_flags, e_pd, &pginfo, @@ -788,7 +791,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, container_of(fmr->device, struct ehca_shca, ib_device); struct ehca_mr *e_fmr = container_of(fmr, struct ehca_mr, ib.ib_fmr); struct ehca_pd *e_pd = container_of(fmr->pd, struct ehca_pd, ib_pd); - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_pginfo pginfo; u32 tmp_lkey, tmp_rkey; if (!(e_fmr->flags & EHCA_MR_FLAG_FMR)) { @@ -814,12 +817,13 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, 
fmr, e_fmr->fmr_map_cnt, e_fmr->fmr_max_maps); } - pginfo.type = EHCA_MR_PGI_FMR; - pginfo.num_pages = list_len; - pginfo.num_4k = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE); - pginfo.page_list = page_list; - pginfo.next_4k = ((iova & (e_fmr->fmr_page_size-1)) / - EHCA_PAGESIZE); + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_kpages = list_len; + pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE); + pginfo.u.fmr.page_list = page_list; + pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) / + EHCA_PAGESIZE); ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova, list_len * e_fmr->fmr_page_size, @@ -979,11 +983,11 @@ int ehca_reg_mr(struct ehca_shca *shca, goto ehca_reg_mr_exit1; /* successful registration */ - e_mr->num_pages = pginfo->num_pages; - e_mr->num_4k = pginfo->num_4k; - e_mr->start = iova_start; - e_mr->size = size; - e_mr->acl = acl; + e_mr->num_kpages = pginfo->num_kpages; + e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; *lkey = hipzout.lkey; *rkey = hipzout.rkey; return 0; @@ -993,10 +997,10 @@ ehca_reg_mr_exit1: if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "h_ret=%lx shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p lkey=%x " - "pginfo=%p num_pages=%lx num_4k=%lx ret=%x", + "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%x", h_ret, shca, e_mr, iova_start, size, acl, e_pd, - hipzout.lkey, pginfo, pginfo->num_pages, - pginfo->num_4k, ret); + hipzout.lkey, pginfo, pginfo->num_kpages, + pginfo->num_hwpages, ret); ehca_err(&shca->ib_device, "internal error in ehca_reg_mr, " "not recoverable"); } @@ -1004,9 +1008,9 @@ ehca_reg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " - "num_pages=%lx num_4k=%lx", + "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, - pginfo->num_pages, pginfo->num_4k); + pginfo->num_kpages, 
pginfo->num_hwpages); return ret; } /* end ehca_reg_mr() */ @@ -1031,10 +1035,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } /* max 512 pages per shot */ - for (i = 0; i < NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES); i++) { + for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) { - if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) { - rnum = pginfo->num_4k % MAX_RPAGES; /* last shot */ + if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { + rnum = pginfo->num_hwpages % MAX_RPAGES; /* last shot */ if (rnum == 0) rnum = MAX_RPAGES; /* last shot is full */ } else @@ -1070,7 +1074,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, 0, /* pagesize 4k */ 0, rpage, rnum); - if (i == NUM_CHUNKS(pginfo->num_4k, MAX_RPAGES) - 1) { + if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { /* * check for 'registration complete'==H_SUCCESS * and for 'page registered'==H_PAGE_REGISTERED @@ -1106,8 +1110,8 @@ ehca_reg_mr_rpages_exit1: ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " - "num_pages=%lx num_4k=%lx", ret, shca, e_mr, pginfo, - pginfo->num_pages, pginfo->num_4k); + "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, + pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; } /* end ehca_reg_mr_rpages() */ @@ -1142,12 +1146,12 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, } pginfo_save = *pginfo; - ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_4k, kpage); + ret = ehca_set_pagebuf(e_mr, pginfo, pginfo->num_hwpages, kpage); if (ret) { ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p " - "pginfo=%p type=%x num_pages=%lx num_4k=%lx kpage=%p", - e_mr, pginfo, pginfo->type, pginfo->num_pages, - pginfo->num_4k,kpage); + "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx " + "kpage=%p", e_mr, pginfo, pginfo->type, + pginfo->num_kpages, pginfo->num_hwpages, kpage); goto ehca_rereg_mr_rereg1_exit1; } rpage = virt_to_abs(kpage); @@ -1181,11 +1185,11 @@ inline int 
ehca_rereg_mr_rereg1(struct ehca_shca *shca, * successful reregistration * note: start and start_out are identical for eServer HCAs */ - e_mr->num_pages = pginfo->num_pages; - e_mr->num_4k = pginfo->num_4k; - e_mr->start = iova_start; - e_mr->size = size; - e_mr->acl = acl; + e_mr->num_kpages = pginfo->num_kpages; + e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->start = iova_start; + e_mr->size = size; + e_mr->acl = acl; *lkey = hipzout.lkey; *rkey = hipzout.rkey; } @@ -1195,9 +1199,9 @@ ehca_rereg_mr_rereg1_exit1: ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " - "pginfo=%p num_pages=%lx num_4k=%lx", - ret, *lkey, *rkey, pginfo, pginfo->num_pages, - pginfo->num_4k); + "pginfo=%p num_kpages=%lx num_hwpages=%lx", + ret, *lkey, *rkey, pginfo, pginfo->num_kpages, + pginfo->num_hwpages); return ret; } /* end ehca_rereg_mr_rereg1() */ @@ -1219,10 +1223,12 @@ int ehca_rereg_mr(struct ehca_shca *shca, int rereg_3_hcall = 0; /* 1: use 3 hipz calls for reregistration */ /* first determine reregistration hCall(s) */ - if ((pginfo->num_4k > MAX_RPAGES) || (e_mr->num_4k > MAX_RPAGES) || - (pginfo->num_4k > e_mr->num_4k)) { - ehca_dbg(&shca->ib_device, "Rereg3 case, pginfo->num_4k=%lx " - "e_mr->num_4k=%x", pginfo->num_4k, e_mr->num_4k); + if ((pginfo->num_hwpages > MAX_RPAGES) || + (e_mr->num_hwpages > MAX_RPAGES) || + (pginfo->num_hwpages > e_mr->num_hwpages)) { + ehca_dbg(&shca->ib_device, "Rereg3 case, " + "pginfo->num_hwpages=%lx e_mr->num_hwpages=%x", + pginfo->num_hwpages, e_mr->num_hwpages); rereg_1_hcall = 0; rereg_3_hcall = 1; } @@ -1286,9 +1292,9 @@ ehca_rereg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p " "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " - "num_pages=%lx lkey=%x rkey=%x rereg_1_hcall=%x " + "num_kpages=%lx lkey=%x rkey=%x rereg_1_hcall=%x " "rereg_3_hcall=%x", ret, shca, e_mr, iova_start, size, - acl, e_pd, pginfo, pginfo->num_pages, *lkey, *rkey, + acl, 
e_pd, pginfo, pginfo->num_kpages, *lkey, *rkey, rereg_1_hcall, rereg_3_hcall); return ret; } /* end ehca_rereg_mr() */ @@ -1306,7 +1312,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd); struct ehca_mr save_fmr; u32 tmp_lkey, tmp_rkey; - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_pginfo pginfo; struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; /* first check if reregistration hCall can be used for unmap */ @@ -1370,9 +1376,10 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; e_fmr->acl = save_fmr.acl; - pginfo.type = EHCA_MR_PGI_FMR; - pginfo.num_pages = 0; - pginfo.num_4k = 0; + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_kpages = 0; + pginfo.num_hwpages = 0; ret = ehca_reg_mr(shca, e_fmr, NULL, (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), e_fmr->acl, e_pd, &pginfo, &tmp_lkey, @@ -1428,11 +1435,11 @@ int ehca_reg_smr(struct ehca_shca *shca, goto ehca_reg_smr_exit0; } /* successful registration */ - e_newmr->num_pages = e_origmr->num_pages; - e_newmr->num_4k = e_origmr->num_4k; - e_newmr->start = iova_start; - e_newmr->size = e_origmr->size; - e_newmr->acl = acl; + e_newmr->num_kpages = e_origmr->num_kpages; + e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; e_newmr->ipz_mr_handle = hipzout.handle; *lkey = hipzout.lkey; *rkey = hipzout.rkey; @@ -1458,10 +1465,10 @@ int ehca_reg_internal_maxmr( struct ehca_mr *e_mr; u64 *iova_start; u64 size_maxmr; - struct ehca_mr_pginfo pginfo={0,0,0,0,0,0,0,NULL,0,NULL,NULL,0,NULL,0}; + struct ehca_mr_pginfo pginfo; struct ib_phys_buf ib_pbuf; - u32 num_pages_mr; - u32 num_pages_4k; /* 4k portion "pages" */ + u32 num_kpages; + u32 num_hwpages; e_mr = ehca_mr_new(); if (!e_mr) { @@ -1476,25 +1483,26 @@ int ehca_reg_internal_maxmr( iova_start = 
(u64*)KERNELBASE; ib_pbuf.addr = 0; ib_pbuf.size = size_maxmr; - num_pages_mr = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, - PAGE_SIZE); - num_pages_4k = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) - + size_maxmr, EHCA_PAGESIZE); - - pginfo.type = EHCA_MR_PGI_PHYS; - pginfo.num_pages = num_pages_mr; - pginfo.num_4k = num_pages_4k; - pginfo.num_phys_buf = 1; - pginfo.phys_buf_array = &ib_pbuf; + num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, + PAGE_SIZE); + num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr, + EHCA_PAGESIZE); + + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_PHYS; + pginfo.num_kpages = num_kpages; + pginfo.num_hwpages = num_hwpages; + pginfo.u.phy.num_phys_buf = 1; + pginfo.u.phy.phys_buf_array = &ib_pbuf; ret = ehca_reg_mr(shca, e_mr, iova_start, size_maxmr, 0, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); if (ret) { ehca_err(&shca->ib_device, "reg of internal max MR failed, " - "e_mr=%p iova_start=%p size_maxmr=%lx num_pages_mr=%x " - "num_pages_4k=%x", e_mr, iova_start, size_maxmr, - num_pages_mr, num_pages_4k); + "e_mr=%p iova_start=%p size_maxmr=%lx num_kpages=%x " + "num_hwpages=%x", e_mr, iova_start, size_maxmr, + num_kpages, num_hwpages); goto ehca_reg_internal_maxmr_exit1; } @@ -1546,11 +1554,11 @@ int ehca_reg_maxmr(struct ehca_shca *shca, return ehca2ib_return_code(h_ret); } /* successful registration */ - e_newmr->num_pages = e_origmr->num_pages; - e_newmr->num_4k = e_origmr->num_4k; - e_newmr->start = iova_start; - e_newmr->size = e_origmr->size; - e_newmr->acl = acl; + e_newmr->num_kpages = e_origmr->num_kpages; + e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->start = iova_start; + e_newmr->size = e_origmr->size; + e_newmr->acl = acl; e_newmr->ipz_mr_handle = hipzout.handle; *lkey = hipzout.lkey; *rkey = hipzout.rkey; @@ -1693,138 +1701,139 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr, struct ib_umem_chunk *chunk; struct ib_phys_buf *pbuf; 
u64 *fmrlist; - u64 num4k, pgaddr, offs4k; + u64 num_hw, pgaddr, offs_hw; u32 i = 0; u32 j = 0; if (pginfo->type == EHCA_MR_PGI_PHYS) { /* loop over desired phys_buf_array entries */ while (i < number) { - pbuf = pginfo->phys_buf_array + pginfo->next_buf; - num4k = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) - + pbuf->size, EHCA_PAGESIZE); - offs4k = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; - while (pginfo->next_4k < offs4k + num4k) { + pbuf = pginfo->u.phy.phys_buf_array + + pginfo->u.phy.next_buf; + num_hw = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + + pbuf->size, EHCA_PAGESIZE); + offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + while (pginfo->next_hwpage < offs_hw + num_hw) { /* sanity check */ - if ((pginfo->page_cnt >= pginfo->num_pages) || - (pginfo->page_4k_cnt >= pginfo->num_4k)) { - ehca_gen_err("page_cnt >= num_pages, " - "page_cnt=%lx " - "num_pages=%lx " - "page_4k_cnt=%lx " - "num_4k=%lx i=%x", - pginfo->page_cnt, - pginfo->num_pages, - pginfo->page_4k_cnt, - pginfo->num_4k, i); + if ((pginfo->kpage_cnt >= pginfo->num_kpages) || + (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { + ehca_gen_err("kpage_cnt >= num_kpages, " + "kpage_cnt=%lx " + "num_kpages=%lx " + "hwpage_cnt=%lx " + "num_hwpages=%lx i=%x", + pginfo->kpage_cnt, + pginfo->num_kpages, + pginfo->hwpage_cnt, + pginfo->num_hwpages, i); ret = -EFAULT; goto ehca_set_pagebuf_exit0; } *kpage = phys_to_abs( (pbuf->addr & EHCA_PAGEMASK) - + (pginfo->next_4k * EHCA_PAGESIZE)); + + (pginfo->next_hwpage * EHCA_PAGESIZE)); if ( !(*kpage) && pbuf->addr ) { ehca_gen_err("pbuf->addr=%lx " "pbuf->size=%lx " - "next_4k=%lx", pbuf->addr, + "next_hwpage=%lx", pbuf->addr, pbuf->size, - pginfo->next_4k); + pginfo->next_hwpage); ret = -EFAULT; goto ehca_set_pagebuf_exit0; } - (pginfo->page_4k_cnt)++; - (pginfo->next_4k)++; - if (pginfo->next_4k % + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; + if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->page_cnt)++; + (pginfo->kpage_cnt)++; 
kpage++; i++; if (i >= number) break; } - if (pginfo->next_4k >= offs4k + num4k) { - (pginfo->next_buf)++; - pginfo->next_4k = 0; + if (pginfo->next_hwpage >= offs_hw + num_hw) { + (pginfo->u.phy.next_buf)++; + pginfo->next_hwpage = 0; } } } else if (pginfo->type == EHCA_MR_PGI_USER) { /* loop over desired chunk entries */ - chunk = pginfo->next_chunk; - prev_chunk = pginfo->next_chunk; + chunk = pginfo->u.usr.next_chunk; + prev_chunk = pginfo->u.usr.next_chunk; list_for_each_entry_continue(chunk, - (&(pginfo->region->chunk_list)), + (&(pginfo->u.usr.region->chunk_list)), list) { - for (i = pginfo->next_nmap; i < chunk->nmap; ) { + for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) { pgaddr = ( page_to_pfn(chunk->page_list[i].page) << PAGE_SHIFT ); *kpage = phys_to_abs(pgaddr + - (pginfo->next_4k * + (pginfo->next_hwpage * EHCA_PAGESIZE)); if ( !(*kpage) ) { ehca_gen_err("pgaddr=%lx " "chunk->page_list[i]=%lx " - "i=%x next_4k=%lx mr=%p", + "i=%x next_hwpage=%lx mr=%p", pgaddr, (u64)sg_dma_address( &chunk-> page_list[i]), - i, pginfo->next_4k, e_mr); + i, pginfo->next_hwpage, e_mr); ret = -EFAULT; goto ehca_set_pagebuf_exit0; } - (pginfo->page_4k_cnt)++; - (pginfo->next_4k)++; + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; kpage++; - if (pginfo->next_4k % + if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0) { - (pginfo->page_cnt)++; - (pginfo->next_nmap)++; - pginfo->next_4k = 0; + (pginfo->kpage_cnt)++; + (pginfo->u.usr.next_nmap)++; + pginfo->next_hwpage = 0; i++; } j++; if (j >= number) break; } - if ((pginfo->next_nmap >= chunk->nmap) && + if ((pginfo->u.usr.next_nmap >= chunk->nmap) && (j >= number)) { - pginfo->next_nmap = 0; + pginfo->u.usr.next_nmap = 0; prev_chunk = chunk; break; - } else if (pginfo->next_nmap >= chunk->nmap) { - pginfo->next_nmap = 0; + } else if (pginfo->u.usr.next_nmap >= chunk->nmap) { + pginfo->u.usr.next_nmap = 0; prev_chunk = chunk; } else if (j >= number) break; else prev_chunk = chunk; } - pginfo->next_chunk = + 
pginfo->u.usr.next_chunk = list_prepare_entry(prev_chunk, - (&(pginfo->region->chunk_list)), + (&(pginfo->u.usr.region->chunk_list)), list); } else if (pginfo->type == EHCA_MR_PGI_FMR) { /* loop over desired page_list entries */ - fmrlist = pginfo->page_list + pginfo->next_listelem; + fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; for (i = 0; i < number; i++) { *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + - pginfo->next_4k * EHCA_PAGESIZE); + pginfo->next_hwpage * EHCA_PAGESIZE); if ( !(*kpage) ) { ehca_gen_err("*fmrlist=%lx fmrlist=%p " - "next_listelem=%lx next_4k=%lx", + "next_listelem=%lx next_hwpage=%lx", *fmrlist, fmrlist, - pginfo->next_listelem, - pginfo->next_4k); + pginfo->u.fmr.next_listelem, + pginfo->next_hwpage); ret = -EFAULT; goto ehca_set_pagebuf_exit0; } - (pginfo->page_4k_cnt)++; - (pginfo->next_4k)++; + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; kpage++; - if (pginfo->next_4k % + if (pginfo->next_hwpage % (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { - (pginfo->page_cnt)++; - (pginfo->next_listelem)++; + (pginfo->kpage_cnt)++; + (pginfo->u.fmr.next_listelem)++; fmrlist++; - pginfo->next_4k = 0; + pginfo->next_hwpage = 0; } } } else { @@ -1835,16 +1844,16 @@ int ehca_set_pagebuf(struct ehca_mr *e_mr, ehca_set_pagebuf_exit0: if (ret) - ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " - "num_4k=%lx next_buf=%lx next_4k=%lx number=%x " - "kpage=%p page_cnt=%lx page_4k_cnt=%lx i=%x " + ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx " + "num_hwpages=%lx next_buf=%lx next_hwpage=%lx number=%x " + "kpage=%p kpage_cnt=%lx hwpage_cnt=%lx i=%x " "next_listelem=%lx region=%p next_chunk=%p " "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type, - pginfo->num_pages, pginfo->num_4k, - pginfo->next_buf, pginfo->next_4k, number, kpage, - pginfo->page_cnt, pginfo->page_4k_cnt, i, - pginfo->next_listelem, pginfo->region, - pginfo->next_chunk, pginfo->next_nmap); + pginfo->num_kpages, pginfo->num_hwpages, + 
pginfo->u.phy.next_buf, pginfo->next_hwpage, number, kpage, + pginfo->kpage_cnt, pginfo->hwpage_cnt, i, + pginfo->u.fmr.next_listelem, pginfo->u.usr.region, + pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap); return ret; } /* end ehca_set_pagebuf() */ @@ -1860,101 +1869,101 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr, u64 *fmrlist; struct ib_umem_chunk *chunk; struct ib_umem_chunk *prev_chunk; - u64 pgaddr, num4k, offs4k; + u64 pgaddr, num_hw, offs_hw; if (pginfo->type == EHCA_MR_PGI_PHYS) { /* sanity check */ - if ((pginfo->page_cnt >= pginfo->num_pages) || - (pginfo->page_4k_cnt >= pginfo->num_4k)) { - ehca_gen_err("page_cnt >= num_pages, page_cnt=%lx " - "num_pages=%lx page_4k_cnt=%lx num_4k=%lx", - pginfo->page_cnt, pginfo->num_pages, - pginfo->page_4k_cnt, pginfo->num_4k); + if ((pginfo->kpage_cnt >= pginfo->num_kpages) || + (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { + ehca_gen_err("kpage_cnt >= num_hwpages, kpage_cnt=%lx " + "num_hwpages=%lx hwpage_cnt=%lx num_hwpages=%lx", + pginfo->kpage_cnt, pginfo->num_kpages, + pginfo->hwpage_cnt, pginfo->num_hwpages); ret = -EFAULT; goto ehca_set_pagebuf_1_exit0; } - tmp_pbuf = pginfo->phys_buf_array + pginfo->next_buf; - num4k = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) + - tmp_pbuf->size, EHCA_PAGESIZE); - offs4k = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + tmp_pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf; + num_hw = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) + + tmp_pbuf->size, EHCA_PAGESIZE); + offs_hw = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; *rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) + - (pginfo->next_4k * EHCA_PAGESIZE)); + (pginfo->next_hwpage * EHCA_PAGESIZE)); if ( !(*rpage) && tmp_pbuf->addr ) { ehca_gen_err("tmp_pbuf->addr=%lx" - " tmp_pbuf->size=%lx next_4k=%lx", + " tmp_pbuf->size=%lx next_hwpage=%lx", tmp_pbuf->addr, tmp_pbuf->size, - pginfo->next_4k); + pginfo->next_hwpage); ret = -EFAULT; goto ehca_set_pagebuf_1_exit0; } - (pginfo->page_4k_cnt)++; - 
(pginfo->next_4k)++; - if (pginfo->next_4k % (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->page_cnt)++; - if (pginfo->next_4k >= offs4k + num4k) { - (pginfo->next_buf)++; - pginfo->next_4k = 0; + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; + if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0) + (pginfo->kpage_cnt)++; + if (pginfo->next_hwpage >= offs_hw + num_hw) { + (pginfo->u.phy.next_buf)++; + pginfo->next_hwpage = 0; } } else if (pginfo->type == EHCA_MR_PGI_USER) { - chunk = pginfo->next_chunk; - prev_chunk = pginfo->next_chunk; + chunk = pginfo->u.usr.next_chunk; + prev_chunk = pginfo->u.usr.next_chunk; list_for_each_entry_continue(chunk, - (&(pginfo->region->chunk_list)), + (&(pginfo->u.usr.region->chunk_list)), list) { pgaddr = ( page_to_pfn(chunk->page_list[ - pginfo->next_nmap].page) + pginfo->u.usr.next_nmap].page) << PAGE_SHIFT); *rpage = phys_to_abs(pgaddr + - (pginfo->next_4k * EHCA_PAGESIZE)); + (pginfo->next_hwpage * EHCA_PAGESIZE)); if ( !(*rpage) ) { ehca_gen_err("pgaddr=%lx chunk->page_list[]=%lx" - " next_nmap=%lx next_4k=%lx mr=%p", + " next_nmap=%lx next_hwpage=%lx mr=%p", pgaddr, (u64)sg_dma_address( &chunk->page_list[ - pginfo-> + pginfo->u.usr. 
next_nmap]), - pginfo->next_nmap, pginfo->next_4k, + pginfo->u.usr.next_nmap, pginfo->next_hwpage, e_mr); ret = -EFAULT; goto ehca_set_pagebuf_1_exit0; } - (pginfo->page_4k_cnt)++; - (pginfo->next_4k)++; - if (pginfo->next_4k % + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; + if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0) { - (pginfo->page_cnt)++; - (pginfo->next_nmap)++; - pginfo->next_4k = 0; + (pginfo->kpage_cnt)++; + (pginfo->u.usr.next_nmap)++; + pginfo->next_hwpage = 0; } - if (pginfo->next_nmap >= chunk->nmap) { - pginfo->next_nmap = 0; + if (pginfo->u.usr.next_nmap >= chunk->nmap) { + pginfo->u.usr.next_nmap = 0; prev_chunk = chunk; } break; } - pginfo->next_chunk = + pginfo->u.usr.next_chunk = list_prepare_entry(prev_chunk, - (&(pginfo->region->chunk_list)), + (&(pginfo->u.usr.region->chunk_list)), list); } else if (pginfo->type == EHCA_MR_PGI_FMR) { - fmrlist = pginfo->page_list + pginfo->next_listelem; + fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; *rpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + - pginfo->next_4k * EHCA_PAGESIZE); + pginfo->next_hwpage * EHCA_PAGESIZE); if ( !(*rpage) ) { ehca_gen_err("*fmrlist=%lx fmrlist=%p " - "next_listelem=%lx next_4k=%lx", - *fmrlist, fmrlist, pginfo->next_listelem, - pginfo->next_4k); + "next_listelem=%lx next_hwpage=%lx", + *fmrlist, fmrlist, pginfo->u.fmr.next_listelem, + pginfo->next_hwpage); ret = -EFAULT; goto ehca_set_pagebuf_1_exit0; } - (pginfo->page_4k_cnt)++; - (pginfo->next_4k)++; - if (pginfo->next_4k % + (pginfo->hwpage_cnt)++; + (pginfo->next_hwpage)++; + if (pginfo->next_hwpage % (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { - (pginfo->page_cnt)++; - (pginfo->next_listelem)++; - pginfo->next_4k = 0; + (pginfo->kpage_cnt)++; + (pginfo->u.fmr.next_listelem)++; + pginfo->next_hwpage = 0; } } else { ehca_gen_err("bad pginfo->type=%x", pginfo->type); @@ -1964,15 +1973,15 @@ int ehca_set_pagebuf_1(struct ehca_mr *e_mr, ehca_set_pagebuf_1_exit0: if (ret) - 
ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_pages=%lx " - "num_4k=%lx next_buf=%lx next_4k=%lx rpage=%p " - "page_cnt=%lx page_4k_cnt=%lx next_listelem=%lx " + ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx " + "num_hwpages=%lx next_buf=%lx next_hwpage=%lx rpage=%p " + "kpage_cnt=%lx hwpage_cnt=%lx next_listelem=%lx " "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr, - pginfo, pginfo->type, pginfo->num_pages, - pginfo->num_4k, pginfo->next_buf, pginfo->next_4k, - rpage, pginfo->page_cnt, pginfo->page_4k_cnt, - pginfo->next_listelem, pginfo->region, - pginfo->next_chunk, pginfo->next_nmap); + pginfo, pginfo->type, pginfo->num_kpages, + pginfo->num_hwpages, pginfo->u.phy.next_buf, pginfo->next_hwpage, + rpage, pginfo->kpage_cnt, pginfo->hwpage_cnt, + pginfo->u.fmr.next_listelem, pginfo->u.usr.region, + pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap); return ret; } /* end ehca_set_pagebuf_1() */ @@ -2053,19 +2062,17 @@ void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, */ void ehca_mr_deletenew(struct ehca_mr *mr) { - mr->flags = 0; - mr->num_pages = 0; - mr->num_4k = 0; - mr->acl = 0; - mr->start = NULL; + mr->flags = 0; + mr->num_kpages = 0; + mr->num_hwpages = 0; + mr->acl = 0; + mr->start = NULL; mr->fmr_page_size = 0; mr->fmr_max_pages = 0; - mr->fmr_max_maps = 0; - mr->fmr_map_cnt = 0; + mr->fmr_max_maps = 0; + mr->fmr_map_cnt = 0; memset(&mr->ipz_mr_handle, 0, sizeof(mr->ipz_mr_handle)); memset(&mr->galpas, 0, sizeof(mr->galpas)); - mr->nr_of_pages = 0; - mr->pagearray = NULL; } /* end ehca_mr_deletenew() */ int ehca_init_mrmw_cache(void) -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:53:11 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:53:11 +0200 Subject: [ofa-general] [PATCH 08/10] IB/ehca: Restructure ehca_set_pagebuf() In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121753.12404.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Split 
ehca_set_pagebuf() into three functions depending on MR type (phys/user/fast) and remove superfluous ehca_set_pagebuf_1(). Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 531 ++++++++++++-------------------- 1 files changed, 200 insertions(+), 331 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 53b334b..93c26cc 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -824,6 +824,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, pginfo.u.fmr.page_list = page_list; pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) / EHCA_PAGESIZE); + pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size; ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova, list_len * e_fmr->fmr_page_size, @@ -1044,15 +1045,15 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } else rnum = MAX_RPAGES; - if (rnum > 1) { - ret = ehca_set_pagebuf(e_mr, pginfo, rnum, kpage); - if (ret) { - ehca_err(&shca->ib_device, "ehca_set_pagebuf " + ret = ehca_set_pagebuf(pginfo, rnum, kpage); + if (ret) { + ehca_err(&shca->ib_device, "ehca_set_pagebuf " "bad rc, ret=%x rnum=%x kpage=%p", ret, rnum, kpage); - ret = -EFAULT; - goto ehca_reg_mr_rpages_exit1; - } + goto ehca_reg_mr_rpages_exit1; + } + + if (rnum > 1) { rpage = virt_to_abs(kpage); if (!rpage) { ehca_err(&shca->ib_device, "kpage=%p i=%x", @@ -1060,15 +1061,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = -EFAULT; goto ehca_reg_mr_rpages_exit1; } - } else { /* rnum==1 */ - ret = ehca_set_pagebuf_1(e_mr, pginfo, &rpage); - if (ret) { - ehca_err(&shca->ib_device, "ehca_set_pagebuf_1 " - "bad rc, ret=%x i=%x", ret, i); - ret = -EFAULT; - goto ehca_reg_mr_rpages_exit1; - } - } + } else + rpage = *kpage; h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr, 0, /* pagesize 4k */ @@ -1146,7 +1140,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, } pginfo_save = *pginfo; - ret = ehca_set_pagebuf(e_mr, pginfo, 
pginfo->num_hwpages, kpage); + ret = ehca_set_pagebuf(pginfo, pginfo->num_hwpages, kpage); if (ret) { ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p " "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx " @@ -1306,98 +1300,86 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, { int ret = 0; u64 h_ret; - int rereg_1_hcall = 1; /* 1: use hipz_mr_reregister directly */ - int rereg_3_hcall = 0; /* 1: use 3 hipz calls for unmapping */ struct ehca_pd *e_pd = container_of(e_fmr->ib.ib_fmr.pd, struct ehca_pd, ib_pd); struct ehca_mr save_fmr; u32 tmp_lkey, tmp_rkey; struct ehca_mr_pginfo pginfo; struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr save_mr; - /* first check if reregistration hCall can be used for unmap */ - if (e_fmr->fmr_max_pages > MAX_RPAGES) { - rereg_1_hcall = 0; - rereg_3_hcall = 1; - } - - if (rereg_1_hcall) { + if (e_fmr->fmr_max_pages <= MAX_RPAGES) { /* * note: after using rereg hcall with len=0, * rereg hcall must be used again for registering pages */ h_ret = hipz_h_reregister_pmr(shca->ipz_hca_handle, e_fmr, 0, 0, 0, e_pd->fw_pd, 0, &hipzout); - if (h_ret != H_SUCCESS) { - /* - * should not happen, because length checked above, - * FMRs are not shared and no MW bound to FMRs - */ - ehca_err(&shca->ib_device, "hipz_reregister_pmr failed " - "(Rereg1), h_ret=%lx e_fmr=%p hca_hndl=%lx " - "mr_hndl=%lx lkey=%x lkey_out=%x", - h_ret, e_fmr, shca->ipz_hca_handle.handle, - e_fmr->ipz_mr_handle.handle, - e_fmr->ib.ib_fmr.lkey, hipzout.lkey); - rereg_3_hcall = 1; - } else { + if (h_ret == H_SUCCESS) { /* successful reregistration */ e_fmr->start = NULL; e_fmr->size = 0; tmp_lkey = hipzout.lkey; tmp_rkey = hipzout.rkey; + return 0; } + /* + * should not happen, because length checked above, + * FMRs are not shared and no MW bound to FMRs + */ + ehca_err(&shca->ib_device, "hipz_reregister_pmr failed " + "(Rereg1), h_ret=%lx e_fmr=%p hca_hndl=%lx " + "mr_hndl=%lx lkey=%x lkey_out=%x", + h_ret, e_fmr, shca->ipz_hca_handle.handle, + 
e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey, hipzout.lkey); + /* try free and rereg */ } - if (rereg_3_hcall) { - struct ehca_mr save_mr; - - /* first free old FMR */ - h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); - if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " - "lkey=%x", - h_ret, e_fmr, shca->ipz_hca_handle.handle, - e_fmr->ipz_mr_handle.handle, - e_fmr->ib.ib_fmr.lkey); - ret = ehca2ib_return_code(h_ret); - goto ehca_unmap_one_fmr_exit0; - } - /* clean ehca_mr_t, without changing lock */ - save_fmr = *e_fmr; - ehca_mr_deletenew(e_fmr); + /* first free old FMR */ + h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); + if (h_ret != H_SUCCESS) { + ehca_err(&shca->ib_device, "hipz_free_mr failed, " + "h_ret=%lx e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "lkey=%x", + h_ret, e_fmr, shca->ipz_hca_handle.handle, + e_fmr->ipz_mr_handle.handle, + e_fmr->ib.ib_fmr.lkey); + ret = ehca2ib_return_code(h_ret); + goto ehca_unmap_one_fmr_exit0; + } + /* clean ehca_mr_t, without changing lock */ + save_fmr = *e_fmr; + ehca_mr_deletenew(e_fmr); - /* set some MR values */ - e_fmr->flags = save_fmr.flags; - e_fmr->fmr_page_size = save_fmr.fmr_page_size; - e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; - e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; - e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; - e_fmr->acl = save_fmr.acl; + /* set some MR values */ + e_fmr->flags = save_fmr.flags; + e_fmr->fmr_page_size = save_fmr.fmr_page_size; + e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; + e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; + e_fmr->fmr_map_cnt = save_fmr.fmr_map_cnt; + e_fmr->acl = save_fmr.acl; - memset(&pginfo, 0, sizeof(pginfo)); - pginfo.type = EHCA_MR_PGI_FMR; - pginfo.num_kpages = 0; - pginfo.num_hwpages = 0; - ret = ehca_reg_mr(shca, e_fmr, NULL, - (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), - e_fmr->acl, e_pd, &pginfo, &tmp_lkey, - &tmp_rkey); - if (ret) { - u32 
offset = (u64)(&e_fmr->flags) - (u64)e_fmr; - memcpy(&e_fmr->flags, &(save_mr.flags), - sizeof(struct ehca_mr) - offset); - goto ehca_unmap_one_fmr_exit0; - } + memset(&pginfo, 0, sizeof(pginfo)); + pginfo.type = EHCA_MR_PGI_FMR; + pginfo.num_kpages = 0; + pginfo.num_hwpages = 0; + ret = ehca_reg_mr(shca, e_fmr, NULL, + (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), + e_fmr->acl, e_pd, &pginfo, &tmp_lkey, + &tmp_rkey); + if (ret) { + u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; + memcpy(&e_fmr->flags, &(save_mr.flags), + sizeof(struct ehca_mr) - offset); + goto ehca_unmap_one_fmr_exit0; } ehca_unmap_one_fmr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x tmp_lkey=%x tmp_rkey=%x " - "fmr_max_pages=%x rereg_1_hcall=%x rereg_3_hcall=%x", - ret, tmp_lkey, tmp_rkey, e_fmr->fmr_max_pages, - rereg_1_hcall, rereg_3_hcall); + "fmr_max_pages=%x", + ret, tmp_lkey, tmp_rkey, e_fmr->fmr_max_pages); return ret; } /* end ehca_unmap_one_fmr() */ @@ -1690,300 +1672,187 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, /*----------------------------------------------------------------------*/ -/* setup page buffer from page info */ -int ehca_set_pagebuf(struct ehca_mr *e_mr, - struct ehca_mr_pginfo *pginfo, - u32 number, - u64 *kpage) +/* PAGE_SIZE >= pginfo->hwpage_size */ +static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) { int ret = 0; struct ib_umem_chunk *prev_chunk; struct ib_umem_chunk *chunk; - struct ib_phys_buf *pbuf; - u64 *fmrlist; - u64 num_hw, pgaddr, offs_hw; + u64 pgaddr; u32 i = 0; u32 j = 0; - if (pginfo->type == EHCA_MR_PGI_PHYS) { - /* loop over desired phys_buf_array entries */ - while (i < number) { - pbuf = pginfo->u.phy.phys_buf_array - + pginfo->u.phy.next_buf; - num_hw = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + - pbuf->size, EHCA_PAGESIZE); - offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; - while (pginfo->next_hwpage < offs_hw + num_hw) { - /* sanity check */ - if ((pginfo->kpage_cnt >= 
pginfo->num_kpages) || - (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { - ehca_gen_err("kpage_cnt >= num_kpages, " - "kpage_cnt=%lx " - "num_kpages=%lx " - "hwpage_cnt=%lx " - "num_hwpages=%lx i=%x", - pginfo->kpage_cnt, - pginfo->num_kpages, - pginfo->hwpage_cnt, - pginfo->num_hwpages, i); - ret = -EFAULT; - goto ehca_set_pagebuf_exit0; - } - *kpage = phys_to_abs( - (pbuf->addr & EHCA_PAGEMASK) - + (pginfo->next_hwpage * EHCA_PAGESIZE)); - if ( !(*kpage) && pbuf->addr ) { - ehca_gen_err("pbuf->addr=%lx " - "pbuf->size=%lx " - "next_hwpage=%lx", pbuf->addr, - pbuf->size, - pginfo->next_hwpage); - ret = -EFAULT; - goto ehca_set_pagebuf_exit0; - } - (pginfo->hwpage_cnt)++; - (pginfo->next_hwpage)++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->kpage_cnt)++; - kpage++; - i++; - if (i >= number) break; - } - if (pginfo->next_hwpage >= offs_hw + num_hw) { - (pginfo->u.phy.next_buf)++; - pginfo->next_hwpage = 0; - } - } - } else if (pginfo->type == EHCA_MR_PGI_USER) { - /* loop over desired chunk entries */ - chunk = pginfo->u.usr.next_chunk; - prev_chunk = pginfo->u.usr.next_chunk; - list_for_each_entry_continue(chunk, - (&(pginfo->u.usr.region->chunk_list)), - list) { - for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) { - pgaddr = ( page_to_pfn(chunk->page_list[i].page) - << PAGE_SHIFT ); - *kpage = phys_to_abs(pgaddr + - (pginfo->next_hwpage * - EHCA_PAGESIZE)); - if ( !(*kpage) ) { - ehca_gen_err("pgaddr=%lx " - "chunk->page_list[i]=%lx " - "i=%x next_hwpage=%lx mr=%p", - pgaddr, - (u64)sg_dma_address( - &chunk-> - page_list[i]), - i, pginfo->next_hwpage, e_mr); - ret = -EFAULT; - goto ehca_set_pagebuf_exit0; - } - (pginfo->hwpage_cnt)++; - (pginfo->next_hwpage)++; - kpage++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) { - (pginfo->kpage_cnt)++; - (pginfo->u.usr.next_nmap)++; - pginfo->next_hwpage = 0; - i++; - } - j++; - if (j >= number) break; - } - if ((pginfo->u.usr.next_nmap >= chunk->nmap) && - (j >= 
number)) { - pginfo->u.usr.next_nmap = 0; - prev_chunk = chunk; - break; - } else if (pginfo->u.usr.next_nmap >= chunk->nmap) { - pginfo->u.usr.next_nmap = 0; - prev_chunk = chunk; - } else if (j >= number) - break; - else - prev_chunk = chunk; - } - pginfo->u.usr.next_chunk = - list_prepare_entry(prev_chunk, - (&(pginfo->u.usr.region->chunk_list)), - list); - } else if (pginfo->type == EHCA_MR_PGI_FMR) { - /* loop over desired page_list entries */ - fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; - for (i = 0; i < number; i++) { - *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + - pginfo->next_hwpage * EHCA_PAGESIZE); + /* loop over desired chunk entries */ + chunk = pginfo->u.usr.next_chunk; + prev_chunk = pginfo->u.usr.next_chunk; + list_for_each_entry_continue( + chunk, (&(pginfo->u.usr.region->chunk_list)), list) { + for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) { + pgaddr = page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ; + *kpage = phys_to_abs(pgaddr + + (pginfo->next_hwpage * + EHCA_PAGESIZE)); if ( !(*kpage) ) { - ehca_gen_err("*fmrlist=%lx fmrlist=%p " - "next_listelem=%lx next_hwpage=%lx", - *fmrlist, fmrlist, - pginfo->u.fmr.next_listelem, - pginfo->next_hwpage); - ret = -EFAULT; - goto ehca_set_pagebuf_exit0; + ehca_gen_err("pgaddr=%lx " + "chunk->page_list[i]=%lx " + "i=%x next_hwpage=%lx", + pgaddr, (u64)sg_dma_address( + &chunk->page_list[i]), + i, pginfo->next_hwpage); + return -EFAULT; } (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; kpage++; if (pginfo->next_hwpage % - (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { + (PAGE_SIZE / EHCA_PAGESIZE) == 0) { (pginfo->kpage_cnt)++; - (pginfo->u.fmr.next_listelem)++; - fmrlist++; + (pginfo->u.usr.next_nmap)++; pginfo->next_hwpage = 0; + i++; } + j++; + if (j >= number) break; } - } else { - ehca_gen_err("bad pginfo->type=%x", pginfo->type); - ret = -EFAULT; - goto ehca_set_pagebuf_exit0; + if ((pginfo->u.usr.next_nmap >= chunk->nmap) && + (j >= number)) { + 
pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->u.usr.next_nmap >= chunk->nmap) { + pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; } - -ehca_set_pagebuf_exit0: - if (ret) - ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx " - "num_hwpages=%lx next_buf=%lx next_hwpage=%lx number=%x " - "kpage=%p kpage_cnt=%lx hwpage_cnt=%lx i=%x " - "next_listelem=%lx region=%p next_chunk=%p " - "next_nmap=%lx", ret, e_mr, pginfo, pginfo->type, - pginfo->num_kpages, pginfo->num_hwpages, - pginfo->u.phy.next_buf, pginfo->next_hwpage, number, kpage, - pginfo->kpage_cnt, pginfo->hwpage_cnt, i, - pginfo->u.fmr.next_listelem, pginfo->u.usr.region, - pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap); + pginfo->u.usr.next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->u.usr.region->chunk_list)), + list); return ret; -} /* end ehca_set_pagebuf() */ - -/*----------------------------------------------------------------------*/ +} -/* setup 1 page from page info page buffer */ -int ehca_set_pagebuf_1(struct ehca_mr *e_mr, - struct ehca_mr_pginfo *pginfo, - u64 *rpage) +int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) { int ret = 0; - struct ib_phys_buf *tmp_pbuf; - u64 *fmrlist; - struct ib_umem_chunk *chunk; - struct ib_umem_chunk *prev_chunk; - u64 pgaddr, num_hw, offs_hw; - - if (pginfo->type == EHCA_MR_PGI_PHYS) { - /* sanity check */ - if ((pginfo->kpage_cnt >= pginfo->num_kpages) || - (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { - ehca_gen_err("kpage_cnt >= num_hwpages, kpage_cnt=%lx " - "num_hwpages=%lx hwpage_cnt=%lx num_hwpages=%lx", - pginfo->kpage_cnt, pginfo->num_kpages, - pginfo->hwpage_cnt, pginfo->num_hwpages); - ret = -EFAULT; - goto ehca_set_pagebuf_1_exit0; - } - tmp_pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf; - num_hw = NUM_CHUNKS((tmp_pbuf->addr % EHCA_PAGESIZE) + - tmp_pbuf->size, EHCA_PAGESIZE); - 
offs_hw = (tmp_pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; - *rpage = phys_to_abs((tmp_pbuf->addr & EHCA_PAGEMASK) + - (pginfo->next_hwpage * EHCA_PAGESIZE)); - if ( !(*rpage) && tmp_pbuf->addr ) { - ehca_gen_err("tmp_pbuf->addr=%lx" - " tmp_pbuf->size=%lx next_hwpage=%lx", - tmp_pbuf->addr, tmp_pbuf->size, - pginfo->next_hwpage); - ret = -EFAULT; - goto ehca_set_pagebuf_1_exit0; - } - (pginfo->hwpage_cnt)++; - (pginfo->next_hwpage)++; - if (pginfo->next_hwpage % (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->kpage_cnt)++; - if (pginfo->next_hwpage >= offs_hw + num_hw) { - (pginfo->u.phy.next_buf)++; - pginfo->next_hwpage = 0; - } - } else if (pginfo->type == EHCA_MR_PGI_USER) { - chunk = pginfo->u.usr.next_chunk; - prev_chunk = pginfo->u.usr.next_chunk; - list_for_each_entry_continue(chunk, - (&(pginfo->u.usr.region->chunk_list)), - list) { - pgaddr = ( page_to_pfn(chunk->page_list[ - pginfo->u.usr.next_nmap].page) - << PAGE_SHIFT); - *rpage = phys_to_abs(pgaddr + - (pginfo->next_hwpage * EHCA_PAGESIZE)); - if ( !(*rpage) ) { - ehca_gen_err("pgaddr=%lx chunk->page_list[]=%lx" - " next_nmap=%lx next_hwpage=%lx mr=%p", - pgaddr, (u64)sg_dma_address( - &chunk->page_list[ - pginfo->u.usr. 
- next_nmap]), - pginfo->u.usr.next_nmap, pginfo->next_hwpage, - e_mr); - ret = -EFAULT; - goto ehca_set_pagebuf_1_exit0; + struct ib_phys_buf *pbuf; + u64 num_hw, offs_hw; + u32 i = 0; + + /* loop over desired phys_buf_array entries */ + while (i < number) { + pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf; + num_hw = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + + pbuf->size, EHCA_PAGESIZE); + offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + while (pginfo->next_hwpage < offs_hw + num_hw) { + /* sanity check */ + if ((pginfo->kpage_cnt >= pginfo->num_kpages) || + (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { + ehca_gen_err("kpage_cnt >= num_kpages, " + "kpage_cnt=%lx num_kpages=%lx " + "hwpage_cnt=%lx " + "num_hwpages=%lx i=%x", + pginfo->kpage_cnt, + pginfo->num_kpages, + pginfo->hwpage_cnt, + pginfo->num_hwpages, i); + return -EFAULT; + } + *kpage = phys_to_abs( + (pbuf->addr & EHCA_PAGEMASK) + + (pginfo->next_hwpage * EHCA_PAGESIZE)); + if ( !(*kpage) && pbuf->addr ) { + ehca_gen_err("pbuf->addr=%lx " + "pbuf->size=%lx " + "next_hwpage=%lx", pbuf->addr, + pbuf->size, + pginfo->next_hwpage); + return -EFAULT; } (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) { + (PAGE_SIZE / EHCA_PAGESIZE) == 0) (pginfo->kpage_cnt)++; - (pginfo->u.usr.next_nmap)++; - pginfo->next_hwpage = 0; - } - if (pginfo->u.usr.next_nmap >= chunk->nmap) { - pginfo->u.usr.next_nmap = 0; - prev_chunk = chunk; - } - break; + kpage++; + i++; + if (i >= number) break; } - pginfo->u.usr.next_chunk = - list_prepare_entry(prev_chunk, - (&(pginfo->u.usr.region->chunk_list)), - list); - } else if (pginfo->type == EHCA_MR_PGI_FMR) { - fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; - *rpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + + if (pginfo->next_hwpage >= offs_hw + num_hw) { + (pginfo->u.phy.next_buf)++; + pginfo->next_hwpage = 0; + } + } + return ret; +} + +int ehca_set_pagebuf_fmr(struct 
ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int ret = 0; + u64 *fmrlist; + u32 i; + + /* loop over desired page_list entries */ + fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; + for (i = 0; i < number; i++) { + *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + pginfo->next_hwpage * EHCA_PAGESIZE); - if ( !(*rpage) ) { + if ( !(*kpage) ) { ehca_gen_err("*fmrlist=%lx fmrlist=%p " "next_listelem=%lx next_hwpage=%lx", - *fmrlist, fmrlist, pginfo->u.fmr.next_listelem, + *fmrlist, fmrlist, + pginfo->u.fmr.next_listelem, pginfo->next_hwpage); - ret = -EFAULT; - goto ehca_set_pagebuf_1_exit0; + return -EFAULT; } (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; + kpage++; if (pginfo->next_hwpage % - (e_mr->fmr_page_size / EHCA_PAGESIZE) == 0) { + (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) { (pginfo->kpage_cnt)++; (pginfo->u.fmr.next_listelem)++; + fmrlist++; pginfo->next_hwpage = 0; } - } else { + } + return ret; +} + +/* setup page buffer from page info */ +int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int ret; + + switch (pginfo->type) { + case EHCA_MR_PGI_PHYS: + ret = ehca_set_pagebuf_phys(pginfo, number, kpage); + break; + case EHCA_MR_PGI_USER: + ret = ehca_set_pagebuf_user1(pginfo, number, kpage); + break; + case EHCA_MR_PGI_FMR: + ret = ehca_set_pagebuf_fmr(pginfo, number, kpage); + break; + default: ehca_gen_err("bad pginfo->type=%x", pginfo->type); ret = -EFAULT; - goto ehca_set_pagebuf_1_exit0; + break; } - -ehca_set_pagebuf_1_exit0: - if (ret) - ehca_gen_err("ret=%x e_mr=%p pginfo=%p type=%x num_kpages=%lx " - "num_hwpages=%lx next_buf=%lx next_hwpage=%lx rpage=%p " - "kpage_cnt=%lx hwpage_cnt=%lx next_listelem=%lx " - "region=%p next_chunk=%p next_nmap=%lx", ret, e_mr, - pginfo, pginfo->type, pginfo->num_kpages, - pginfo->num_hwpages, pginfo->u.phy.next_buf, pginfo->next_hwpage, - rpage, pginfo->kpage_cnt, pginfo->hwpage_cnt, - pginfo->u.fmr.next_listelem, pginfo->u.usr.region, - 
pginfo->u.usr.next_chunk, pginfo->u.usr.next_nmap);
 	return ret;
-} /* end ehca_set_pagebuf_1() */
+} /* end ehca_set_pagebuf() */
 
 /*----------------------------------------------------------------------*/
 
-- 
1.5.2

From fenkes at de.ibm.com Thu Jul 12 08:53:47 2007
From: fenkes at de.ibm.com (Joachim Fenkes)
Date: Thu, 12 Jul 2007 17:53:47 +0200
Subject: [ofa-general] [PATCH 09/10] IB/ehca: Fix warnings issued by checkpatch.pl
In-Reply-To: <200707121745.27592.fenkes@de.ibm.com>
References: <200707121745.27592.fenkes@de.ibm.com>
Message-ID: <200707121753.48434.fenkes@de.ibm.com>

From: Hoang-Nam Nguyen

Signed-off-by: Joachim Fenkes
---
 drivers/infiniband/hw/ehca/ehca_av.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_classes.h         |    4 +-
 drivers/infiniband/hw/ehca/ehca_classes_pSeries.h |  156 ++++++++++----------
 drivers/infiniband/hw/ehca/ehca_cq.c              |    2 +-
 drivers/infiniband/hw/ehca/ehca_eq.c              |    3 +-
 drivers/infiniband/hw/ehca/ehca_hca.c             |   28 +++-
 drivers/infiniband/hw/ehca/ehca_irq.c             |   56 ++++----
 drivers/infiniband/hw/ehca/ehca_iverbs.h          |    7 +-
 drivers/infiniband/hw/ehca/ehca_main.c            |   21 ++--
 drivers/infiniband/hw/ehca/ehca_mrmw.c            |   59 ++++----
 drivers/infiniband/hw/ehca/ehca_mrmw.h            |    7 +-
 drivers/infiniband/hw/ehca/ehca_qes.h             |   22 ++--
 drivers/infiniband/hw/ehca/ehca_qp.c              |   39 +++---
 drivers/infiniband/hw/ehca/ehca_reqs.c            |   15 ++-
 drivers/infiniband/hw/ehca/ehca_tools.h           |   28 ++--
 drivers/infiniband/hw/ehca/ehca_uverbs.c          |   10 +-
 drivers/infiniband/hw/ehca/hcp_if.c               |    8 +-
 drivers/infiniband/hw/ehca/hcp_phyp.c             |    2 +-
 drivers/infiniband/hw/ehca/hipz_fns_core.h        |    4 +-
 drivers/infiniband/hw/ehca/hipz_hw.h              |   24 ++--
 drivers/infiniband/hw/ehca/ipz_pt_fn.c            |    2 +-
 drivers/infiniband/hw/ehca/ipz_pt_fn.h            |    4 +-
 22 files changed, 261 insertions(+), 242 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
index 3cd6bf3..e53a97a 100644
--- a/drivers/infiniband/hw/ehca/ehca_av.c
+++ b/drivers/infiniband/hw/ehca/ehca_av.c
@@ -79,7 +79,7 @@ struct
ib_ah *ehca_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) av->av.ipd = (ah_mult > 0) ? ((ehca_mult - 1) / ah_mult) : 0; } else - av->av.ipd = ehca_static_rate; + av->av.ipd = ehca_static_rate; av->av.lnh = ah_attr->ah_flags; av->av.grh.word_0 = EHCA_BMASK_SET(GRH_IPVERSION_MASK, 6); diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 92103df..1752821 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -215,7 +215,7 @@ struct ehca_mr { u32 num_hwpages; /* number of hw pages to form MR */ int acl; /* ACL (stored here for usage in reregister) */ u64 *start; /* virtual start address (stored here for */ - /* usage in reregister) */ + /* usage in reregister) */ u64 size; /* size (stored here for usage in reregister) */ u32 fmr_page_size; /* page size for FMR */ u32 fmr_max_pages; /* max pages for FMR */ @@ -400,6 +400,6 @@ struct ehca_alloc_qp_parms { int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp); int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int qp_num); -struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int qp_num); +struct ehca_qp *ehca_cq_get_qp(struct ehca_cq *cq, int qp_num); #endif diff --git a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h index fb3df5c..1798e64 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h +++ b/drivers/infiniband/hw/ehca/ehca_classes_pSeries.h @@ -154,83 +154,83 @@ struct hcp_modify_qp_control_block { u32 reserved_70_127[58]; /* 70 */ }; -#define MQPCB_MASK_QKEY EHCA_BMASK_IBM(0,0) -#define MQPCB_MASK_SEND_PSN EHCA_BMASK_IBM(2,2) -#define MQPCB_MASK_RECEIVE_PSN EHCA_BMASK_IBM(3,3) -#define MQPCB_MASK_PRIM_PHYS_PORT EHCA_BMASK_IBM(4,4) -#define MQPCB_PRIM_PHYS_PORT EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_ALT_PHYS_PORT EHCA_BMASK_IBM(5,5) -#define MQPCB_MASK_PRIM_P_KEY_IDX EHCA_BMASK_IBM(6,6) -#define MQPCB_PRIM_P_KEY_IDX 
EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_ALT_P_KEY_IDX EHCA_BMASK_IBM(7,7) -#define MQPCB_MASK_RDMA_ATOMIC_CTRL EHCA_BMASK_IBM(8,8) -#define MQPCB_MASK_QP_STATE EHCA_BMASK_IBM(9,9) -#define MQPCB_QP_STATE EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES EHCA_BMASK_IBM(11,11) -#define MQPCB_MASK_PATH_MIGRATION_STATE EHCA_BMASK_IBM(12,12) -#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP EHCA_BMASK_IBM(13,13) -#define MQPCB_MASK_DEST_QP_NR EHCA_BMASK_IBM(14,14) -#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD EHCA_BMASK_IBM(15,15) -#define MQPCB_MASK_SERVICE_LEVEL EHCA_BMASK_IBM(16,16) -#define MQPCB_MASK_SEND_GRH_FLAG EHCA_BMASK_IBM(17,17) -#define MQPCB_MASK_RETRY_COUNT EHCA_BMASK_IBM(18,18) -#define MQPCB_MASK_TIMEOUT EHCA_BMASK_IBM(19,19) -#define MQPCB_MASK_PATH_MTU EHCA_BMASK_IBM(20,20) -#define MQPCB_PATH_MTU EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_MAX_STATIC_RATE EHCA_BMASK_IBM(21,21) -#define MQPCB_MAX_STATIC_RATE EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_DLID EHCA_BMASK_IBM(22,22) -#define MQPCB_DLID EHCA_BMASK_IBM(16,31) -#define MQPCB_MASK_RNR_RETRY_COUNT EHCA_BMASK_IBM(23,23) -#define MQPCB_RNR_RETRY_COUNT EHCA_BMASK_IBM(29,31) -#define MQPCB_MASK_SOURCE_PATH_BITS EHCA_BMASK_IBM(24,24) -#define MQPCB_SOURCE_PATH_BITS EHCA_BMASK_IBM(25,31) -#define MQPCB_MASK_TRAFFIC_CLASS EHCA_BMASK_IBM(25,25) -#define MQPCB_TRAFFIC_CLASS EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_HOP_LIMIT EHCA_BMASK_IBM(26,26) -#define MQPCB_HOP_LIMIT EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_SOURCE_GID_IDX EHCA_BMASK_IBM(27,27) -#define MQPCB_SOURCE_GID_IDX EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_FLOW_LABEL EHCA_BMASK_IBM(28,28) -#define MQPCB_FLOW_LABEL EHCA_BMASK_IBM(12,31) -#define MQPCB_MASK_DEST_GID EHCA_BMASK_IBM(30,30) -#define MQPCB_MASK_SERVICE_LEVEL_AL EHCA_BMASK_IBM(31,31) -#define MQPCB_SERVICE_LEVEL_AL EHCA_BMASK_IBM(28,31) -#define MQPCB_MASK_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(32,32) -#define MQPCB_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(31,31) -#define 
MQPCB_MASK_RETRY_COUNT_AL EHCA_BMASK_IBM(33,33) -#define MQPCB_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) -#define MQPCB_MASK_TIMEOUT_AL EHCA_BMASK_IBM(34,34) -#define MQPCB_TIMEOUT_AL EHCA_BMASK_IBM(27,31) -#define MQPCB_MASK_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(35,35) -#define MQPCB_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_DLID_AL EHCA_BMASK_IBM(36,36) -#define MQPCB_DLID_AL EHCA_BMASK_IBM(16,31) -#define MQPCB_MASK_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(37,37) -#define MQPCB_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(29,31) -#define MQPCB_MASK_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(38,38) -#define MQPCB_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(25,31) -#define MQPCB_MASK_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(39,39) -#define MQPCB_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_HOP_LIMIT_AL EHCA_BMASK_IBM(40,40) -#define MQPCB_HOP_LIMIT_AL EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(41,41) -#define MQPCB_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(24,31) -#define MQPCB_MASK_FLOW_LABEL_AL EHCA_BMASK_IBM(42,42) -#define MQPCB_FLOW_LABEL_AL EHCA_BMASK_IBM(12,31) -#define MQPCB_MASK_DEST_GID_AL EHCA_BMASK_IBM(44,44) -#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(45,45) -#define MQPCB_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(16,31) -#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(46,46) -#define MQPCB_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(16,31) -#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(47,47) -#define MQPCB_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(31,31) -#define MQPCB_QP_NUMBER EHCA_BMASK_IBM(8,31) -#define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48,48) -#define MQPCB_QP_ENABLE EHCA_BMASK_IBM(31,31) -#define MQPCB_MASK_CURR_SRQ_LIMIT EHCA_BMASK_IBM(49,49) -#define MQPCB_CURR_SRQ_LIMIT EHCA_BMASK_IBM(16,31) -#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50,50) -#define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51,51) +#define MQPCB_MASK_QKEY EHCA_BMASK_IBM( 0, 0) +#define MQPCB_MASK_SEND_PSN EHCA_BMASK_IBM( 2, 2) +#define 
MQPCB_MASK_RECEIVE_PSN EHCA_BMASK_IBM( 3, 3) +#define MQPCB_MASK_PRIM_PHYS_PORT EHCA_BMASK_IBM( 4, 4) +#define MQPCB_PRIM_PHYS_PORT EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_ALT_PHYS_PORT EHCA_BMASK_IBM( 5, 5) +#define MQPCB_MASK_PRIM_P_KEY_IDX EHCA_BMASK_IBM( 6, 6) +#define MQPCB_PRIM_P_KEY_IDX EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_ALT_P_KEY_IDX EHCA_BMASK_IBM( 7, 7) +#define MQPCB_MASK_RDMA_ATOMIC_CTRL EHCA_BMASK_IBM( 8, 8) +#define MQPCB_MASK_QP_STATE EHCA_BMASK_IBM( 9, 9) +#define MQPCB_QP_STATE EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES EHCA_BMASK_IBM(11, 11) +#define MQPCB_MASK_PATH_MIGRATION_STATE EHCA_BMASK_IBM(12, 12) +#define MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP EHCA_BMASK_IBM(13, 13) +#define MQPCB_MASK_DEST_QP_NR EHCA_BMASK_IBM(14, 14) +#define MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD EHCA_BMASK_IBM(15, 15) +#define MQPCB_MASK_SERVICE_LEVEL EHCA_BMASK_IBM(16, 16) +#define MQPCB_MASK_SEND_GRH_FLAG EHCA_BMASK_IBM(17, 17) +#define MQPCB_MASK_RETRY_COUNT EHCA_BMASK_IBM(18, 18) +#define MQPCB_MASK_TIMEOUT EHCA_BMASK_IBM(19, 19) +#define MQPCB_MASK_PATH_MTU EHCA_BMASK_IBM(20, 20) +#define MQPCB_PATH_MTU EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_MAX_STATIC_RATE EHCA_BMASK_IBM(21, 21) +#define MQPCB_MAX_STATIC_RATE EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_DLID EHCA_BMASK_IBM(22, 22) +#define MQPCB_DLID EHCA_BMASK_IBM(16, 31) +#define MQPCB_MASK_RNR_RETRY_COUNT EHCA_BMASK_IBM(23, 23) +#define MQPCB_RNR_RETRY_COUNT EHCA_BMASK_IBM(29, 31) +#define MQPCB_MASK_SOURCE_PATH_BITS EHCA_BMASK_IBM(24, 24) +#define MQPCB_SOURCE_PATH_BITS EHCA_BMASK_IBM(25, 31) +#define MQPCB_MASK_TRAFFIC_CLASS EHCA_BMASK_IBM(25, 25) +#define MQPCB_TRAFFIC_CLASS EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_HOP_LIMIT EHCA_BMASK_IBM(26, 26) +#define MQPCB_HOP_LIMIT EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_SOURCE_GID_IDX EHCA_BMASK_IBM(27, 27) +#define MQPCB_SOURCE_GID_IDX EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_FLOW_LABEL EHCA_BMASK_IBM(28, 28) +#define 
MQPCB_FLOW_LABEL EHCA_BMASK_IBM(12, 31) +#define MQPCB_MASK_DEST_GID EHCA_BMASK_IBM(30, 30) +#define MQPCB_MASK_SERVICE_LEVEL_AL EHCA_BMASK_IBM(31, 31) +#define MQPCB_SERVICE_LEVEL_AL EHCA_BMASK_IBM(28, 31) +#define MQPCB_MASK_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(32, 32) +#define MQPCB_SEND_GRH_FLAG_AL EHCA_BMASK_IBM(31, 31) +#define MQPCB_MASK_RETRY_COUNT_AL EHCA_BMASK_IBM(33, 33) +#define MQPCB_RETRY_COUNT_AL EHCA_BMASK_IBM(29, 31) +#define MQPCB_MASK_TIMEOUT_AL EHCA_BMASK_IBM(34, 34) +#define MQPCB_TIMEOUT_AL EHCA_BMASK_IBM(27, 31) +#define MQPCB_MASK_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(35, 35) +#define MQPCB_MAX_STATIC_RATE_AL EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_DLID_AL EHCA_BMASK_IBM(36, 36) +#define MQPCB_DLID_AL EHCA_BMASK_IBM(16, 31) +#define MQPCB_MASK_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(37, 37) +#define MQPCB_RNR_RETRY_COUNT_AL EHCA_BMASK_IBM(29, 31) +#define MQPCB_MASK_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(38, 38) +#define MQPCB_SOURCE_PATH_BITS_AL EHCA_BMASK_IBM(25, 31) +#define MQPCB_MASK_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(39, 39) +#define MQPCB_TRAFFIC_CLASS_AL EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_HOP_LIMIT_AL EHCA_BMASK_IBM(40, 40) +#define MQPCB_HOP_LIMIT_AL EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(41, 41) +#define MQPCB_SOURCE_GID_IDX_AL EHCA_BMASK_IBM(24, 31) +#define MQPCB_MASK_FLOW_LABEL_AL EHCA_BMASK_IBM(42, 42) +#define MQPCB_FLOW_LABEL_AL EHCA_BMASK_IBM(12, 31) +#define MQPCB_MASK_DEST_GID_AL EHCA_BMASK_IBM(44, 44) +#define MQPCB_MASK_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(45, 45) +#define MQPCB_MAX_NR_OUTST_SEND_WR EHCA_BMASK_IBM(16, 31) +#define MQPCB_MASK_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(46, 46) +#define MQPCB_MAX_NR_OUTST_RECV_WR EHCA_BMASK_IBM(16, 31) +#define MQPCB_MASK_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(47, 47) +#define MQPCB_DISABLE_ETE_CREDIT_CHECK EHCA_BMASK_IBM(31, 31) +#define MQPCB_QP_NUMBER EHCA_BMASK_IBM( 8, 31) +#define MQPCB_MASK_QP_ENABLE EHCA_BMASK_IBM(48, 48) +#define MQPCB_QP_ENABLE 
EHCA_BMASK_IBM(31, 31) +#define MQPCB_MASK_CURR_SRQ_LIMIT EHCA_BMASK_IBM(49, 49) +#define MQPCB_CURR_SRQ_LIMIT EHCA_BMASK_IBM(16, 31) +#define MQPCB_MASK_QP_AFF_ASYN_EV_LOG_REG EHCA_BMASK_IBM(50, 50) +#define MQPCB_MASK_SHARED_RQ_HNDL EHCA_BMASK_IBM(51, 51) #endif /* __EHCA_CLASSES_PSERIES_H__ */ diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 97da51e..ba1bcb9 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -97,7 +97,7 @@ int ehca_cq_unassign_qp(struct ehca_cq *cq, unsigned int real_qp_num) return ret; } -struct ehca_qp* ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) +struct ehca_qp *ehca_cq_get_qp(struct ehca_cq *cq, int real_qp_num) { struct ehca_qp *ret = NULL; unsigned int key = real_qp_num & (QP_HASHTAB_LEN-1); diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c index d443bcb..78a2e5a 100644 --- a/drivers/infiniband/hw/ehca/ehca_eq.c +++ b/drivers/infiniband/hw/ehca/ehca_eq.c @@ -111,7 +111,8 @@ struct ehca_eq *ehca_create_eq(struct ehca_shca *shca, for (i = 0; i < nr_pages; i++) { u64 rpage; - if (!(vpage = ipz_qpageit_get_inc(&eq->ipz_queue))) { + vpage = ipz_qpageit_get_inc(&eq->ipz_queue); + if (!vpage) { ret = -ENOMEM; goto create_eq_exit2; } diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index bbd3c6a..fc19ef9 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -127,6 +127,7 @@ int ehca_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr *props) { int ret = 0; + u64 h_ret; struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); struct hipz_query_port *rblock; @@ -137,7 +138,8 @@ int ehca_query_port(struct ib_device *ibdev, return -ENOMEM; } - if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock); + if (h_ret != 
H_SUCCESS) { ehca_err(&shca->ib_device, "Can't query port properties"); ret = -EINVAL; goto query_port1; @@ -197,6 +199,7 @@ int ehca_query_sma_attr(struct ehca_shca *shca, u8 port, struct ehca_sma_attr *attr) { int ret = 0; + u64 h_ret; struct hipz_query_port *rblock; rblock = ehca_alloc_fw_ctrlblock(GFP_ATOMIC); @@ -205,7 +208,8 @@ int ehca_query_sma_attr(struct ehca_shca *shca, return -ENOMEM; } - if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock); + if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "Can't query port properties"); ret = -EINVAL; goto query_sma_attr1; @@ -230,9 +234,11 @@ query_sma_attr1: int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) { int ret = 0; - struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); + u64 h_ret; + struct ehca_shca *shca; struct hipz_query_port *rblock; + shca = container_of(ibdev, struct ehca_shca, ib_device); if (index > 16) { ehca_err(&shca->ib_device, "Invalid index: %x.", index); return -EINVAL; @@ -244,7 +250,8 @@ int ehca_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey) return -ENOMEM; } - if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock); + if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "Can't query port properties"); ret = -EINVAL; goto query_pkey1; @@ -262,6 +269,7 @@ int ehca_query_gid(struct ib_device *ibdev, u8 port, int index, union ib_gid *gid) { int ret = 0; + u64 h_ret; struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); struct hipz_query_port *rblock; @@ -277,7 +285,8 @@ int ehca_query_gid(struct ib_device *ibdev, u8 port, return -ENOMEM; } - if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + h_ret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock); + if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, 
"Can't query port properties"); ret = -EINVAL; goto query_gid1; @@ -302,11 +311,12 @@ int ehca_modify_port(struct ib_device *ibdev, struct ib_port_modify *props) { int ret = 0; - struct ehca_shca *shca = container_of(ibdev, struct ehca_shca, ib_device); + struct ehca_shca *shca; struct hipz_query_port *rblock; u32 cap; u64 hret; + shca = container_of(ibdev, struct ehca_shca, ib_device); if ((props->set_port_cap_mask | props->clr_port_cap_mask) & ~allowed_port_caps) { ehca_err(&shca->ib_device, "Non-changeable bits set in masks " @@ -325,7 +335,8 @@ int ehca_modify_port(struct ib_device *ibdev, goto modify_port1; } - if (hipz_h_query_port(shca->ipz_hca_handle, port, rblock) != H_SUCCESS) { + hret = hipz_h_query_port(shca->ipz_hca_handle, port, rblock); + if (hret != H_SUCCESS) { ehca_err(&shca->ib_device, "Can't query port properties"); ret = -EINVAL; goto modify_port2; @@ -337,7 +348,8 @@ int ehca_modify_port(struct ib_device *ibdev, hret = hipz_h_modify_port(shca->ipz_hca_handle, port, cap, props->init_type, port_modify_mask); if (hret != H_SUCCESS) { - ehca_err(&shca->ib_device, "Modify port failed hret=%lx", hret); + ehca_err(&shca->ib_device, "Modify port failed hret=%lx", + hret); ret = -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 7a4071a..1f043d0 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -49,26 +49,26 @@ #include "hipz_fns.h" #include "ipz_pt_fn.h" -#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) -#define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) -#define EQE_EE_IDENTIFIER EHCA_BMASK_IBM(2,7) -#define EQE_CQ_NUMBER EHCA_BMASK_IBM(8,31) -#define EQE_QP_NUMBER EHCA_BMASK_IBM(8,31) -#define EQE_QP_TOKEN EHCA_BMASK_IBM(32,63) -#define EQE_CQ_TOKEN EHCA_BMASK_IBM(32,63) - -#define NEQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) -#define NEQE_EVENT_CODE EHCA_BMASK_IBM(2,7) -#define NEQE_PORT_NUMBER EHCA_BMASK_IBM(8,15) -#define NEQE_PORT_AVAILABILITY 
EHCA_BMASK_IBM(16,16) -#define NEQE_DISRUPTIVE EHCA_BMASK_IBM(16,16) - -#define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52,63) -#define ERROR_DATA_TYPE EHCA_BMASK_IBM(0,7) +#define EQE_COMPLETION_EVENT EHCA_BMASK_IBM( 1, 1) +#define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM( 8, 31) +#define EQE_EE_IDENTIFIER EHCA_BMASK_IBM( 2, 7) +#define EQE_CQ_NUMBER EHCA_BMASK_IBM( 8, 31) +#define EQE_QP_NUMBER EHCA_BMASK_IBM( 8, 31) +#define EQE_QP_TOKEN EHCA_BMASK_IBM(32, 63) +#define EQE_CQ_TOKEN EHCA_BMASK_IBM(32, 63) + +#define NEQE_COMPLETION_EVENT EHCA_BMASK_IBM( 1, 1) +#define NEQE_EVENT_CODE EHCA_BMASK_IBM( 2, 7) +#define NEQE_PORT_NUMBER EHCA_BMASK_IBM( 8, 15) +#define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16, 16) +#define NEQE_DISRUPTIVE EHCA_BMASK_IBM(16, 16) + +#define ERROR_DATA_LENGTH EHCA_BMASK_IBM(52, 63) +#define ERROR_DATA_TYPE EHCA_BMASK_IBM( 0, 7) static void queue_comp_task(struct ehca_cq *__cq); -static struct ehca_comp_pool* pool; +static struct ehca_comp_pool *pool; #ifdef CONFIG_HOTPLUG_CPU static struct notifier_block comp_pool_callback_nb; #endif @@ -85,8 +85,8 @@ static inline void comp_event_callback(struct ehca_cq *cq) return; } -static void print_error_data(struct ehca_shca * shca, void* data, - u64* rblock, int length) +static void print_error_data(struct ehca_shca *shca, void *data, + u64 *rblock, int length) { u64 type = EHCA_BMASK_GET(ERROR_DATA_TYPE, rblock[2]); u64 resource = rblock[1]; @@ -94,7 +94,7 @@ static void print_error_data(struct ehca_shca * shca, void* data, switch (type) { case 0x1: /* Queue Pair */ { - struct ehca_qp *qp = (struct ehca_qp*)data; + struct ehca_qp *qp = (struct ehca_qp *)data; /* only print error data if AER is set */ if (rblock[6] == 0) @@ -107,7 +107,7 @@ static void print_error_data(struct ehca_shca * shca, void* data, } case 0x4: /* Completion Queue */ { - struct ehca_cq *cq = (struct ehca_cq*)data; + struct ehca_cq *cq = (struct ehca_cq *)data; ehca_err(&shca->ib_device, "CQ 0x%x (resource=%lx) has errors.", @@ -564,7 +564,7 
@@ void ehca_tasklet_eq(unsigned long data) ehca_process_eq((struct ehca_eq *)data, 1); } -static inline int find_next_online_cpu(struct ehca_comp_pool* pool) +static inline int find_next_online_cpu(struct ehca_comp_pool *pool) { int cpu; unsigned long flags; @@ -628,7 +628,7 @@ static void queue_comp_task(struct ehca_cq *__cq) __queue_comp_task(__cq, cct); } -static void run_comp_task(struct ehca_cpu_comp_task* cct) +static void run_comp_task(struct ehca_cpu_comp_task *cct) { struct ehca_cq *cq; unsigned long flags; @@ -658,12 +658,12 @@ static void run_comp_task(struct ehca_cpu_comp_task* cct) static int comp_task(void *__cct) { - struct ehca_cpu_comp_task* cct = __cct; + struct ehca_cpu_comp_task *cct = __cct; int cql_empty; DECLARE_WAITQUEUE(wait, current); set_current_state(TASK_INTERRUPTIBLE); - while(!kthread_should_stop()) { + while (!kthread_should_stop()) { add_wait_queue(&cct->wait_queue, &wait); spin_lock_irq(&cct->task_lock); @@ -737,7 +737,7 @@ static void take_over_work(struct ehca_comp_pool *pool, list_splice_init(&cct->cq_list, &list); - while(!list_empty(&list)) { + while (!list_empty(&list)) { cq = list_entry(cct->cq_list.next, struct ehca_cq, entry); list_del(&cq->entry); @@ -760,7 +760,7 @@ static int comp_pool_callback(struct notifier_block *nfb, case CPU_UP_PREPARE: case CPU_UP_PREPARE_FROZEN: ehca_gen_dbg("CPU: %x (CPU_PREPARE)", cpu); - if(!create_comp_task(pool, cpu)) { + if (!create_comp_task(pool, cpu)) { ehca_gen_err("Can't create comp_task for cpu: %x", cpu); return NOTIFY_BAD; } @@ -830,7 +830,7 @@ int ehca_create_comp_pool(void) #ifdef CONFIG_HOTPLUG_CPU comp_pool_callback_nb.notifier_call = comp_pool_callback; - comp_pool_callback_nb.priority =0; + comp_pool_callback_nb.priority = 0; register_cpu_notifier(&comp_pool_callback_nb); #endif diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index bf8fbf7..99881e3 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ 
b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -81,8 +81,9 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, int num_phys_buf, int mr_access_flags, u64 *iova_start); -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, - int mr_access_flags, struct ib_udata *udata); +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int mr_access_flags, + struct ib_udata *udata); int ehca_rereg_phys_mr(struct ib_mr *mr, int mr_rereg_mask, @@ -191,7 +192,7 @@ void ehca_poll_eqs(unsigned long data); void *ehca_alloc_fw_ctrlblock(gfp_t flags); void ehca_free_fw_ctrlblock(void *ptr); #else -#define ehca_alloc_fw_ctrlblock(flags) ((void *) get_zeroed_page(flags)) +#define ehca_alloc_fw_ctrlblock(flags) ((void *)get_zeroed_page(flags)) #define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) #endif diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 57c551e..ecf4ef4 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -116,7 +116,7 @@ static DEFINE_SPINLOCK(shca_list_lock); static struct timer_list poll_eqs_timer; #ifdef CONFIG_PPC_64K_PAGES -static struct kmem_cache *ctblk_cache = NULL; +static struct kmem_cache *ctblk_cache; void *ehca_alloc_fw_ctrlblock(gfp_t flags) { @@ -219,8 +219,8 @@ static void ehca_destroy_slab_caches(void) #endif } -#define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) -#define EHCA_REVID EHCA_BMASK_IBM(40,63) +#define EHCA_HCAAVER EHCA_BMASK_IBM(32, 39) +#define EHCA_REVID EHCA_BMASK_IBM(40, 63) static struct cap_descr { u64 mask; @@ -314,7 +314,7 @@ int ehca_sense_attributes(struct ehca_shca *shca) if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap)) ehca_gen_dbg(" %s", hca_cap_descr[i].descr); - port = (struct hipz_query_port *) rblock; + port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { ehca_gen_err("Cannot query port 
properties. h_ret=%lx", @@ -463,7 +463,7 @@ static int ehca_create_aqp1(struct ehca_shca *shca, u32 port) return -EPERM; } - ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void*)(-1), 10, 0); + ibcq = ib_create_cq(&shca->ib_device, NULL, NULL, (void *)(-1), 10, 0); if (IS_ERR(ibcq)) { ehca_err(&shca->ib_device, "Cannot create AQP1 CQ."); return PTR_ERR(ibcq); @@ -730,7 +730,7 @@ static int __devinit ehca_probe(struct ibmebus_dev *dev, } /* create internal protection domain */ - ibpd = ehca_alloc_pd(&shca->ib_device, (void*)(-1), NULL); + ibpd = ehca_alloc_pd(&shca->ib_device, (void *)(-1), NULL); if (IS_ERR(ibpd)) { ehca_err(&shca->ib_device, "Cannot create internal PD."); ret = PTR_ERR(ibpd); @@ -944,18 +944,21 @@ int __init ehca_module_init(void) return -EINVAL; } - if ((ret = ehca_create_comp_pool())) { + ret = ehca_create_comp_pool(); + if (ret) { ehca_gen_err("Cannot create comp pool."); return ret; } - if ((ret = ehca_create_slab_caches())) { + ret = ehca_create_slab_caches(); + if (ret) { ehca_gen_err("Cannot create SLAB caches"); ret = -ENOMEM; goto module_init1; } - if ((ret = ibmebus_register_driver(&ehca_driver))) { + ret = ibmebus_register_driver(&ehca_driver); + if (ret) { ehca_gen_err("Cannot register eHCA device driver"); ret = -EINVAL; goto module_init2; diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 93c26cc..6262c54 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -61,9 +61,9 @@ static struct ehca_mr *ehca_mr_new(void) struct ehca_mr *me; me = kmem_cache_zalloc(mr_cache, GFP_KERNEL); - if (me) { + if (me) spin_lock_init(&me->mrlock); - } else + else ehca_gen_err("alloc failed"); return me; @@ -79,9 +79,9 @@ static struct ehca_mw *ehca_mw_new(void) struct ehca_mw *me; me = kmem_cache_zalloc(mw_cache, GFP_KERNEL); - if (me) { + if (me) spin_lock_init(&me->mwlock); - } else + else ehca_gen_err("alloc failed"); return me; @@ -111,7 +111,7 @@ 
struct ib_mr *ehca_get_dma_mr(struct ib_pd *pd, int mr_access_flags) goto get_dma_mr_exit0; } - ret = ehca_reg_maxmr(shca, e_maxmr, (u64*)KERNELBASE, + ret = ehca_reg_maxmr(shca, e_maxmr, (u64 *)KERNELBASE, mr_access_flags, e_pd, &e_maxmr->ib.ib_mr.lkey, &e_maxmr->ib.ib_mr.rkey); @@ -246,8 +246,9 @@ reg_phys_mr_exit0: /*----------------------------------------------------------------------*/ -struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt, - int mr_access_flags, struct ib_udata *udata) +struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt, int mr_access_flags, + struct ib_udata *udata) { struct ib_mr *ib_mr; struct ehca_mr *e_mr; @@ -295,7 +296,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt e_mr->umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags); if (IS_ERR(e_mr->umem)) { - ib_mr = (void *) e_mr->umem; + ib_mr = (void *)e_mr->umem; goto reg_user_mr_exit1; } @@ -322,8 +323,9 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt (&e_mr->umem->chunk_list), list); - ret = ehca_reg_mr(shca, e_mr, (u64*) virt, length, mr_access_flags, e_pd, - &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags, + e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, + &e_mr->ib.ib_mr.rkey); if (ret) { ib_mr = ERR_PTR(ret); goto reg_user_mr_exit2; @@ -420,7 +422,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, goto rereg_phys_mr_exit0; } if (!phys_buf_array || num_phys_buf <= 0) { - ehca_err(mr->device, "bad input values: mr_rereg_mask=%x" + ehca_err(mr->device, "bad input values mr_rereg_mask=%x" " phys_buf_array=%p num_phys_buf=%x", mr_rereg_mask, phys_buf_array, num_phys_buf); ret = -EINVAL; @@ -444,10 +446,10 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, /* set requested values dependent on rereg request */ spin_lock_irqsave(&e_mr->mrlock, sl_flags); - new_start = e_mr->start; /* new 
== old address */ - new_size = e_mr->size; /* new == old length */ - new_acl = e_mr->acl; /* new == old access control */ - new_pd = container_of(mr->pd,struct ehca_pd,ib_pd); /*new == old PD*/ + new_start = e_mr->start; + new_size = e_mr->size; + new_acl = e_mr->acl; + new_pd = container_of(mr->pd, struct ehca_pd, ib_pd); if (mr_rereg_mask & IB_MR_REREG_TRANS) { new_start = iova_start; /* change address */ @@ -517,7 +519,7 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) struct ehca_pd *my_pd = container_of(mr->pd, struct ehca_pd, ib_pd); u32 cur_pid = current->tgid; unsigned long sl_flags; - struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && (my_pd->ownpid != cur_pid)) { @@ -629,7 +631,7 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) struct ehca_pd *e_pd = container_of(pd, struct ehca_pd, ib_pd); struct ehca_shca *shca = container_of(pd->device, struct ehca_shca, ib_device); - struct ehca_mw_hipzout_parms hipzout = {{0},0}; + struct ehca_mw_hipzout_parms hipzout; e_mw = ehca_mw_new(); if (!e_mw) { @@ -826,7 +828,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, EHCA_PAGESIZE); pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size; - ret = ehca_rereg_mr(shca, e_fmr, (u64*)iova, + ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova, list_len * e_fmr->fmr_page_size, e_fmr->acl, e_pd, &pginfo, &tmp_lkey, &tmp_rkey); if (ret) @@ -841,8 +843,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, map_phys_fmr_exit0: if (ret) ehca_err(fmr->device, "ret=%x fmr=%p page_list=%p list_len=%x " - "iova=%lx", - ret, fmr, page_list, list_len, iova); + "iova=%lx", ret, fmr, page_list, list_len, iova); return ret; } /* end ehca_map_phys_fmr() */ @@ -960,12 +961,12 @@ int ehca_reg_mr(struct ehca_shca *shca, int ret; u64 h_ret; u32 hipz_acl; - struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); 
ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); if (ehca_use_hp_mr == 1) - hipz_acl |= 0x00000001; + hipz_acl |= 0x00000001; h_ret = hipz_h_alloc_resource_mr(shca->ipz_hca_handle, e_mr, (u64)iova_start, size, hipz_acl, @@ -1127,7 +1128,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, u64 *kpage; u64 rpage; struct ehca_mr_pginfo pginfo_save; - struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); @@ -1167,7 +1168,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, "(Rereg1), h_ret=%lx e_mr=%p", h_ret, e_mr); *pginfo = pginfo_save; ret = -EAGAIN; - } else if ((u64*)hipzout.vaddr != iova_start) { + } else if ((u64 *)hipzout.vaddr != iova_start) { ehca_err(&shca->ib_device, "PHYP changed iova_start in " "rereg_pmr, iova_start=%p iova_start_out=%lx e_mr=%p " "mr_handle=%lx lkey=%x lkey_out=%x", iova_start, @@ -1305,7 +1306,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, struct ehca_mr save_fmr; u32 tmp_lkey, tmp_rkey; struct ehca_mr_pginfo pginfo; - struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; struct ehca_mr save_mr; if (e_fmr->fmr_max_pages <= MAX_RPAGES) { @@ -1397,7 +1398,7 @@ int ehca_reg_smr(struct ehca_shca *shca, int ret = 0; u64 h_ret; u32 hipz_acl; - struct ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); @@ -1462,7 +1463,7 @@ int ehca_reg_internal_maxmr( /* register internal max-MR on HCA */ size_maxmr = (u64)high_memory - PAGE_OFFSET; - iova_start = (u64*)KERNELBASE; + iova_start = (u64 *)KERNELBASE; ib_pbuf.addr = 0; ib_pbuf.size = size_maxmr; num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, @@ -1519,7 +1520,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, u64 h_ret; struct ehca_mr *e_origmr = shca->maxmr; u32 hipz_acl; - struct 
ehca_mr_hipzout_parms hipzout = {{0},0,0,0,0,0}; + struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); @@ -1865,7 +1866,7 @@ int ehca_mr_is_maxmr(u64 size, { /* a MR is treated as max-MR only if it fits following: */ if ((size == ((u64)high_memory - PAGE_OFFSET)) && - (iova_start == (void*)KERNELBASE)) { + (iova_start == (void *)KERNELBASE)) { ehca_gen_dbg("this is a max-MR"); return 1; } else diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h index fb69ede..24f13fe 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.h +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -101,15 +101,10 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, u64 *page_list, int list_len); -int ehca_set_pagebuf(struct ehca_mr *e_mr, - struct ehca_mr_pginfo *pginfo, +int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo, u32 number, u64 *kpage); -int ehca_set_pagebuf_1(struct ehca_mr *e_mr, - struct ehca_mr_pginfo *pginfo, - u64 *rpage); - int ehca_mr_is_maxmr(u64 size, u64 *iova_start); diff --git a/drivers/infiniband/hw/ehca/ehca_qes.h b/drivers/infiniband/hw/ehca/ehca_qes.h index 8707d29..8188030 100644 --- a/drivers/infiniband/hw/ehca/ehca_qes.h +++ b/drivers/infiniband/hw/ehca/ehca_qes.h @@ -53,13 +53,13 @@ struct ehca_vsgentry { u32 length; }; -#define GRH_FLAG_MASK EHCA_BMASK_IBM(7,7) -#define GRH_IPVERSION_MASK EHCA_BMASK_IBM(0,3) -#define GRH_TCLASS_MASK EHCA_BMASK_IBM(4,12) -#define GRH_FLOWLABEL_MASK EHCA_BMASK_IBM(13,31) -#define GRH_PAYLEN_MASK EHCA_BMASK_IBM(32,47) -#define GRH_NEXTHEADER_MASK EHCA_BMASK_IBM(48,55) -#define GRH_HOPLIMIT_MASK EHCA_BMASK_IBM(56,63) +#define GRH_FLAG_MASK EHCA_BMASK_IBM( 7, 7) +#define GRH_IPVERSION_MASK EHCA_BMASK_IBM( 0, 3) +#define GRH_TCLASS_MASK EHCA_BMASK_IBM( 4, 12) +#define GRH_FLOWLABEL_MASK EHCA_BMASK_IBM(13, 31) +#define GRH_PAYLEN_MASK EHCA_BMASK_IBM(32, 47) +#define GRH_NEXTHEADER_MASK EHCA_BMASK_IBM(48, 55) +#define GRH_HOPLIMIT_MASK 
EHCA_BMASK_IBM(56, 63) /* * Unreliable Datagram Address Vector Format @@ -206,10 +206,10 @@ struct ehca_wqe { }; -#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0,0) -#define WC_IMM_DATA EHCA_BMASK_IBM(1,1) -#define WC_GRH_PRESENT EHCA_BMASK_IBM(2,2) -#define WC_SE_BIT EHCA_BMASK_IBM(3,3) +#define WC_SEND_RECEIVE EHCA_BMASK_IBM(0, 0) +#define WC_IMM_DATA EHCA_BMASK_IBM(1, 1) +#define WC_GRH_PRESENT EHCA_BMASK_IBM(2, 2) +#define WC_SE_BIT EHCA_BMASK_IBM(3, 3) #define WC_STATUS_ERROR_BIT 0x80000000 #define WC_STATUS_REMOTE_ERROR_FLAGS 0x0000F800 #define WC_STATUS_PURGE_BIT 0x10 diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index f6f4ef6..3bd13e1 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -602,10 +602,10 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, /* UD circumvention */ parms.act_nr_send_sges -= 2; parms.act_nr_recv_sges -= 2; - swqe_size = offsetof(struct ehca_wqe, - u.ud_av.sg_list[parms.act_nr_send_sges]); - rwqe_size = offsetof(struct ehca_wqe, - u.ud_av.sg_list[parms.act_nr_recv_sges]); + swqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[ + parms.act_nr_send_sges]); + rwqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[ + parms.act_nr_recv_sges]); } if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) { @@ -690,8 +690,8 @@ struct ehca_qp *internal_create_qp(struct ib_pd *pd, if (my_qp->send_cq) { ret = ehca_cq_assign_qp(my_qp->send_cq, my_qp); if (ret) { - ehca_err(pd->device, "Couldn't assign qp to send_cq ret=%x", - ret); + ehca_err(pd->device, + "Couldn't assign qp to send_cq ret=%x", ret); goto create_qp_exit4; } } @@ -749,7 +749,7 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ehca_qp *ret; ret = internal_create_qp(pd, qp_init_attr, NULL, udata, 0); - return IS_ERR(ret) ? (struct ib_qp *) ret : &ret->ib_qp; + return IS_ERR(ret) ? 
(struct ib_qp *)ret : &ret->ib_qp; } int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, @@ -780,7 +780,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, my_qp = internal_create_qp(pd, &qp_init_attr, srq_init_attr, udata, 1); if (IS_ERR(my_qp)) - return (struct ib_srq *) my_qp; + return (struct ib_srq *)my_qp; /* copy back return values */ srq_init_attr->attr.max_wr = qp_init_attr.cap.max_recv_wr; @@ -875,7 +875,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } - bad_send_wqe_p = (void*)((u64)bad_send_wqe_p & (~(1L<<63))); + bad_send_wqe_p = (void *)((u64)bad_send_wqe_p & (~(1L << 63))); ehca_dbg(&shca->ib_device, "qp_num=%x bad_send_wqe_p=%p", qp_num, bad_send_wqe_p); /* convert wqe pointer to vadr */ @@ -890,7 +890,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, } /* loop sets wqe's purge bit */ - wqe = (struct ehca_wqe*)ipz_qeit_calc(squeue, q_ofs); + wqe = (struct ehca_wqe *)ipz_qeit_calc(squeue, q_ofs); *bad_wqe_cnt = 0; while (wqe->optype != 0xff && wqe->wqef != 0xff) { if (ehca_debug_level) @@ -898,7 +898,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, wqe->nr_of_data_seg = 0; /* suppress data access */ wqe->wqef = WQEF_PURGE; /* WQE to be purged */ q_ofs = ipz_queue_advance_offset(squeue, q_ofs); - wqe = (struct ehca_wqe*)ipz_qeit_calc(squeue, q_ofs); + wqe = (struct ehca_wqe *)ipz_qeit_calc(squeue, q_ofs); *bad_wqe_cnt = (*bad_wqe_cnt)+1; } /* @@ -1003,7 +1003,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, goto modify_qp_exit1; } - ehca_dbg(ibqp->device,"ehca_qp=%p qp_num=%x current qp_state=%x " + ehca_dbg(ibqp->device, "ehca_qp=%p qp_num=%x current qp_state=%x " "new qp_state=%x attribute_mask=%x", my_qp, ibqp->qp_num, qp_cur_state, attr->qp_state, attr_mask); @@ -1019,7 +1019,8 @@ static int internal_modify_qp(struct ib_qp *ibqp, goto modify_qp_exit1; } - if ((mqpcb->qp_state 
= ib2ehca_qp_state(qp_new_state))) + mqpcb->qp_state = ib2ehca_qp_state(qp_new_state); + if (mqpcb->qp_state) update_mask = EHCA_BMASK_SET(MQPCB_MASK_QP_STATE, 1); else { ret = -EINVAL; @@ -1077,7 +1078,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, spin_lock_irqsave(&my_qp->spinlock_s, flags); squeue_locked = 1; /* mark next free wqe */ - wqe = (struct ehca_wqe*) + wqe = (struct ehca_wqe *) ipz_qeit_get(&my_qp->ipz_squeue); wqe->optype = wqe->wqef = 0xff; ehca_dbg(ibqp->device, "qp_num=%x next_free_wqe=%p", @@ -1312,7 +1313,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(ibqp->device, "hipz_h_modify_qp() failed rc=%lx " - "ehca_qp=%p qp_num=%x",h_ret, my_qp, ibqp->qp_num); + "ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1411,7 +1412,7 @@ int ehca_query_qp(struct ib_qp *qp, } if (qp_attr_mask & QP_ATTR_QUERY_NOT_SUPPORTED) { - ehca_err(qp->device,"Invalid attribute mask " + ehca_err(qp->device, "Invalid attribute mask " "ehca_qp=%p qp_num=%x qp_attr_mask=%x ", my_qp, qp->qp_num, qp_attr_mask); return -EINVAL; @@ -1419,7 +1420,7 @@ int ehca_query_qp(struct ib_qp *qp, qpcb = ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!qpcb) { - ehca_err(qp->device,"Out of memory for qpcb " + ehca_err(qp->device, "Out of memory for qpcb " "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); return -ENOMEM; } @@ -1431,7 +1432,7 @@ int ehca_query_qp(struct ib_qp *qp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(qp->device,"hipz_h_query_qp() failed " + ehca_err(qp->device, "hipz_h_query_qp() failed " "ehca_qp=%p qp_num=%x h_ret=%lx", my_qp, qp->qp_num, h_ret); goto query_qp_exit1; @@ -1442,7 +1443,7 @@ int ehca_query_qp(struct ib_qp *qp, if (qp_attr->cur_qp_state == -EINVAL) { ret = -EINVAL; - ehca_err(qp->device,"Got invalid ehca_qp_state=%x " + ehca_err(qp->device, "Got invalid ehca_qp_state=%x " "ehca_qp=%p qp_num=%x", qpcb->qp_state, my_qp, 
qp->qp_num); goto query_qp_exit1; diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index 61da65e..94eed70 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -79,7 +79,8 @@ static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue, } if (ehca_debug_level) { - ehca_gen_dbg("RECEIVE WQE written into ipz_rqueue=%p", ipz_rqueue); + ehca_gen_dbg("RECEIVE WQE written into ipz_rqueue=%p", + ipz_rqueue); ehca_dmp( wqe_p, 16*(6 + wqe_p->nr_of_data_seg), "recv wqe"); } @@ -99,7 +100,7 @@ static void trace_send_wr_ud(const struct ib_send_wr *send_wr) struct ib_mad_hdr *mad_hdr = send_wr->wr.ud.mad_hdr; struct ib_sge *sge = send_wr->sg_list; ehca_gen_dbg("send_wr#%x wr_id=%lx num_sge=%x " - "send_flags=%x opcode=%x",idx, send_wr->wr_id, + "send_flags=%x opcode=%x", idx, send_wr->wr_id, send_wr->num_sge, send_wr->send_flags, send_wr->opcode); if (mad_hdr) { @@ -116,7 +117,7 @@ static void trace_send_wr_ud(const struct ib_send_wr *send_wr) mad_hdr->attr_mod); } for (j = 0; j < send_wr->num_sge; j++) { - u8 *data = (u8 *) abs_to_virt(sge->addr); + u8 *data = (u8 *)abs_to_virt(sge->addr); ehca_gen_dbg("send_wr#%x sge#%x addr=%p length=%x " "lkey=%x", idx, j, data, sge->length, sge->lkey); @@ -534,9 +535,11 @@ poll_cq_one_read_cqe: cqe_count++; if (unlikely(cqe->status & WC_STATUS_PURGE_BIT)) { - struct ehca_qp *qp=ehca_cq_get_qp(my_cq, cqe->local_qp_number); + struct ehca_qp *qp; int purgeflag; unsigned long flags; + + qp = ehca_cq_get_qp(my_cq, cqe->local_qp_number); if (!qp) { ehca_err(cq->device, "cq_num=%x qp_num=%x " "could not find qp -> ignore cqe", @@ -551,8 +554,8 @@ poll_cq_one_read_cqe: spin_unlock_irqrestore(&qp->spinlock_s, flags); if (purgeflag) { - ehca_dbg(cq->device, "Got CQE with purged bit qp_num=%x " - "src_qp=%x", + ehca_dbg(cq->device, + "Got CQE with purged bit qp_num=%x src_qp=%x", cqe->local_qp_number, cqe->remote_qp_number); if (ehca_debug_level) ehca_dmp(cqe, 
64, "qp_num=%x src_qp=%x", diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index fd8238b..678b813 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -93,14 +93,14 @@ extern int ehca_debug_level; #define ehca_gen_dbg(format, arg...) \ do { \ if (unlikely(ehca_debug_level)) \ - printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n",\ + printk(KERN_DEBUG "PU%04x EHCA_DBG:%s " format "\n", \ get_paca()->paca_index, __FUNCTION__, ## arg); \ } while (0) #define ehca_gen_warn(format, arg...) \ do { \ if (unlikely(ehca_debug_level)) \ - printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n",\ + printk(KERN_INFO "PU%04x EHCA_WARN:%s " format "\n", \ get_paca()->paca_index, __FUNCTION__, ## arg); \ } while (0) @@ -114,12 +114,12 @@ extern int ehca_debug_level; * adr=X ofs=Y <8 bytes hex> <8 bytes hex> */ #define ehca_dmp(adr, len, format, args...) \ - do { \ - unsigned int x; \ + do { \ + unsigned int x; \ unsigned int l = (unsigned int)(len); \ - unsigned char *deb = (unsigned char*)(adr); \ + unsigned char *deb = (unsigned char *)(adr); \ for (x = 0; x < l; x += 16) { \ - printk("EHCA_DMP:%s " format \ + printk(KERN_INFO "EHCA_DMP:%s " format \ " adr=%p ofs=%04x %016lx %016lx\n", \ __FUNCTION__, ##args, deb, x, \ *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ @@ -128,16 +128,16 @@ extern int ehca_debug_level; } while (0) /* define a bitmask, little endian version */ -#define EHCA_BMASK(pos,length) (((pos)<<16)+(length)) +#define EHCA_BMASK(pos, length) (((pos) << 16) + (length)) /* define a bitmask, the ibm way... 
*/ -#define EHCA_BMASK_IBM(from,to) (((63-to)<<16)+((to)-(from)+1)) +#define EHCA_BMASK_IBM(from, to) (((63 - to) << 16) + ((to) - (from) + 1)) /* internal function, don't use */ -#define EHCA_BMASK_SHIFTPOS(mask) (((mask)>>16)&0xffff) +#define EHCA_BMASK_SHIFTPOS(mask) (((mask) >> 16) & 0xffff) /* internal function, don't use */ -#define EHCA_BMASK_MASK(mask) (0xffffffffffffffffULL >> ((64-(mask))&0xffff)) +#define EHCA_BMASK_MASK(mask) (~0ULL >> ((64 - (mask)) & 0xffff)) /** * EHCA_BMASK_SET - return value shifted and masked by mask @@ -145,14 +145,14 @@ extern int ehca_debug_level; * variable&=~EHCA_BMASK_SET(MY_MASK,-1) clears the bits from the mask * in variable */ -#define EHCA_BMASK_SET(mask,value) \ - ((EHCA_BMASK_MASK(mask) & ((u64)(value)))<<EHCA_BMASK_SHIFTPOS(mask)) +#define EHCA_BMASK_SET(mask, value) \ + ((EHCA_BMASK_MASK(mask) & ((u64)(value))) << EHCA_BMASK_SHIFTPOS(mask)) /** * EHCA_BMASK_GET - extract a parameter from value by mask */ -#define EHCA_BMASK_GET(mask,value) \ - (EHCA_BMASK_MASK(mask)& (((u64)(value))>>EHCA_BMASK_SHIFTPOS(mask))) +#define EHCA_BMASK_GET(mask, value) \ + (EHCA_BMASK_MASK(mask) & (((u64)(value)) >> EHCA_BMASK_SHIFTPOS(mask))) /* Converts ehca to ib return code */ diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 3031b3b..05c4157 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -70,7 +70,7 @@ int ehca_dealloc_ucontext(struct ib_ucontext *context) static void ehca_mm_open(struct vm_area_struct *vma) { - u32 *count = (u32*)vma->vm_private_data; + u32 *count = (u32 *)vma->vm_private_data; if (!count) { ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", vma->vm_start, vma->vm_end); @@ -86,7 +86,7 @@ static void ehca_mm_open(struct vm_area_struct *vma) static void ehca_mm_close(struct vm_area_struct *vma) { - u32 *count = (u32*)vma->vm_private_data; + u32 *count = (u32 *)vma->vm_private_data; if (!count) { ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", vma->vm_start, vma->vm_end); @@ -215,7 +215,8 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, case 2: /* qp rqueue_addr */ ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue", qp->ib_qp.qp_num); - ret =
ehca_mmap_queue(vma, &qp->ipz_rqueue, &qp->mm_count_rqueue); + ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, + &qp->mm_count_rqueue); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, "ehca_mmap_queue(rq) failed rc=%x qp_num=%x", @@ -227,7 +228,8 @@ static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, case 3: /* qp squeue_addr */ ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue", qp->ib_qp.qp_num); - ret = ehca_mmap_queue(vma, &qp->ipz_squeue, &qp->mm_count_squeue); + ret = ehca_mmap_queue(vma, &qp->ipz_squeue, + &qp->mm_count_squeue); if (unlikely(ret)) { ehca_err(qp->ib_qp.device, "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 4776a8b..3394e05 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -501,8 +501,8 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, return H_PARAMETER; } - return hipz_h_register_rpage(adapter_handle,pagesize,queue_type, - qp_handle.handle,logical_address_of_page, + return hipz_h_register_rpage(adapter_handle, pagesize, queue_type, + qp_handle.handle, logical_address_of_page, count); } @@ -522,9 +522,9 @@ u64 hipz_h_disable_and_get_wqe(const struct ipz_adapter_handle adapter_handle, qp_handle.handle, /* r6 */ 0, 0, 0, 0, 0, 0); if (log_addr_next_sq_wqe2processed) - *log_addr_next_sq_wqe2processed = (void*)outs[0]; + *log_addr_next_sq_wqe2processed = (void *)outs[0]; if (log_addr_next_rq_wqe2processed) - *log_addr_next_rq_wqe2processed = (void*)outs[1]; + *log_addr_next_rq_wqe2processed = (void *)outs[1]; return ret; } diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c index 0b1a477..069c69e 100644 --- a/drivers/infiniband/hw/ehca/hcp_phyp.c +++ b/drivers/infiniband/hw/ehca/hcp_phyp.c @@ -50,7 +50,7 @@ int hcall_map_page(u64 physaddr, u64 *mapaddr) int hcall_unmap_page(u64 mapaddr) { - iounmap((volatile void __iomem*)mapaddr); + 
iounmap((volatile void __iomem *)mapaddr); return 0; } diff --git a/drivers/infiniband/hw/ehca/hipz_fns_core.h b/drivers/infiniband/hw/ehca/hipz_fns_core.h index 20898a1..868735f 100644 --- a/drivers/infiniband/hw/ehca/hipz_fns_core.h +++ b/drivers/infiniband/hw/ehca/hipz_fns_core.h @@ -53,10 +53,10 @@ #define hipz_galpa_load_cq(gal, offset) \ hipz_galpa_load(gal, CQTEMM_OFFSET(offset)) -#define hipz_galpa_store_qp(gal,offset, value) \ +#define hipz_galpa_store_qp(gal, offset, value) \ hipz_galpa_store(gal, QPTEMM_OFFSET(offset), value) #define hipz_galpa_load_qp(gal, offset) \ - hipz_galpa_load(gal,QPTEMM_OFFSET(offset)) + hipz_galpa_load(gal, QPTEMM_OFFSET(offset)) static inline void hipz_update_sqa(struct ehca_qp *qp, u16 nr_wqes) { diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h index dad6dea..d9739e5 100644 --- a/drivers/infiniband/hw/ehca/hipz_hw.h +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -161,11 +161,11 @@ struct hipz_qptemm { /* 0x1000 */ }; -#define QPX_SQADDER EHCA_BMASK_IBM(48,63) -#define QPX_RQADDER EHCA_BMASK_IBM(48,63) -#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3,3) +#define QPX_SQADDER EHCA_BMASK_IBM(48, 63) +#define QPX_RQADDER EHCA_BMASK_IBM(48, 63) +#define QPX_AAELOG_RESET_SRQ_LIMIT EHCA_BMASK_IBM(3, 3) -#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm,x) +#define QPTEMM_OFFSET(x) offsetof(struct hipz_qptemm, x) /* MRMWPT Entry Memory Map */ struct hipz_mrmwmm { @@ -187,7 +187,7 @@ struct hipz_mrmwmm { }; -#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm,x) +#define MRMWMM_OFFSET(x) offsetof(struct hipz_mrmwmm, x) struct hipz_qpedmm { /* 0x00 */ @@ -238,7 +238,7 @@ struct hipz_qpedmm { u64 qpedx_rrva3; }; -#define QPEDMM_OFFSET(x) offsetof(struct hipz_qpedmm,x) +#define QPEDMM_OFFSET(x) offsetof(struct hipz_qpedmm, x) /* CQ Table Entry Memory Map */ struct hipz_cqtemm { @@ -263,12 +263,12 @@ struct hipz_cqtemm { /* 0x1000 */ }; -#define CQX_FEC_CQE_CNT EHCA_BMASK_IBM(32,63) -#define 
CQX_FECADDER EHCA_BMASK_IBM(32,63) -#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0,0) -#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0,0) +#define CQX_FEC_CQE_CNT EHCA_BMASK_IBM(32, 63) +#define CQX_FECADDER EHCA_BMASK_IBM(32, 63) +#define CQX_N0_GENERATE_SOLICITED_COMP_EVENT EHCA_BMASK_IBM(0, 0) +#define CQX_N1_GENERATE_COMP_EVENT EHCA_BMASK_IBM(0, 0) -#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm,x) +#define CQTEMM_OFFSET(x) offsetof(struct hipz_cqtemm, x) /* EQ Table Entry Memory Map */ struct hipz_eqtemm { @@ -293,7 +293,7 @@ struct hipz_eqtemm { }; -#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm,x) +#define EQTEMM_OFFSET(x) offsetof(struct hipz_eqtemm, x) /* access control defines for MR/MW */ #define HIPZ_ACCESSCTRL_L_WRITE 0x00800000 diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c index bf7a400..9606f13 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c @@ -114,7 +114,7 @@ int ipz_queue_ctor(struct ipz_queue *queue, */ f = 0; while (f < nr_of_pages) { - u8 *kpage = (u8*)get_zeroed_page(GFP_KERNEL); + u8 *kpage = (u8 *)get_zeroed_page(GFP_KERNEL); int k; if (!kpage) goto ipz_queue_ctor_exit0; /*NOMEM*/ diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index 007f088..39a4f64 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -240,7 +240,7 @@ void *ipz_qeit_eq_get_inc(struct ipz_queue *queue); static inline void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue) { void *ret = ipz_qeit_get(queue); - u32 qe = *(u8 *) ret; + u32 qe = *(u8 *)ret; if ((qe >> 7) != (queue->toggle_state & 1)) return NULL; ipz_qeit_eq_get_inc(queue); /* this is a good one */ @@ -250,7 +250,7 @@ static inline void *ipz_eqit_eq_get_inc_valid(struct ipz_queue *queue) static inline void *ipz_eqit_eq_peek_valid(struct ipz_queue *queue) { void *ret = ipz_qeit_get(queue); - u32 qe 
= *(u8 *) ret; + u32 qe = *(u8 *)ret; if ((qe >> 7) != (queue->toggle_state & 1)) return NULL; return ret; -- 1.5.2 From fenkes at de.ibm.com Thu Jul 12 08:54:19 2007 From: fenkes at de.ibm.com (Joachim Fenkes) Date: Thu, 12 Jul 2007 17:54:19 +0200 Subject: [ofa-general] [PATCH 10/10] IB/ehca: Support large page MRs In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: <200707121754.20293.fenkes@de.ibm.com> From: Hoang-Nam Nguyen Add support for MR pages larger than 4K on eHCA2. This reduces firmware memory consumption. If enabled via the mr_largepage module parameter, the MR page size will be determined based on the MR length and the hardware capabilities - if the MR is >= 16M, 16M pages are used, for example. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 10 + drivers/infiniband/hw/ehca/ehca_main.c | 17 ++- drivers/infiniband/hw/ehca/ehca_mrmw.c | 371 ++++++++++++++++++++++++----- drivers/infiniband/hw/ehca/ehca_mrmw.h | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 20 ++- 5 files changed, 357 insertions(+), 63 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 1752821..2a39cfa 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -105,6 +105,12 @@ struct ehca_sport { }; #define EHCA_MAX_NR_EQS 512 + +#define HCA_CAP_MR_PGSIZE_4K 1 +#define HCA_CAP_MR_PGSIZE_64K 2 +#define HCA_CAP_MR_PGSIZE_1M 4 +#define HCA_CAP_MR_PGSIZE_16M 8 + struct ehca_shca { struct ib_device ib_device; struct ibmebus_dev *ibmebus_dev; @@ -121,6 +127,8 @@ struct ehca_shca { struct h_galpas galpas; struct mutex modify_mutex; u64 hca_cap; + /* MR pgsize: bit 0-3 means 4K, 64K, 1M, 16M respectively */ + u32 hca_cap_mr_pgsize; int max_mtu; atomic_t cur_eq_idx; }; @@ -213,6 +221,7 @@ struct ehca_mr { enum ehca_mr_flag flags; u32 num_kpages; /* number of kernel pages */ u32 num_hwpages; /* 
number of hw pages to form MR */ + u64 hwpage_size; /* hw page size used for this MR */ int acl; /* ACL (stored here for usage in reregister) */ u64 *start; /* virtual start address (stored here for */ /* usage in reregister) */ @@ -247,6 +256,7 @@ struct ehca_mr_pginfo { enum ehca_mr_pgi_type type; u64 num_kpages; u64 kpage_cnt; + u64 hwpage_size; /* hw page size used for this MR */ u64 num_hwpages; /* number of hw pages */ u64 hwpage_cnt; /* counter for hw pages */ u64 next_hwpage; /* next hw page in buffer/chunk/listelem */ diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index ecf4ef4..5f207f2 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -65,6 +65,7 @@ int ehca_static_rate = -1; int ehca_scaling_code = 0; int ehca_nr_eqs = 2; int ehca_dist_eqs = 0; +int ehca_mr_largepage = 0; module_param_named(open_aqp1, ehca_open_aqp1, int, 0); module_param_named(debug_level, ehca_debug_level, int, 0); @@ -77,6 +78,7 @@ module_param_named(static_rate, ehca_static_rate, int, 0); module_param_named(scaling_code, ehca_scaling_code, int, 0); module_param_named(nr_eqs, ehca_nr_eqs, int, 0); module_param_named(dist_eqs, ehca_dist_eqs, int, 0); +module_param_named(mr_largepage, ehca_mr_largepage, int, 0); MODULE_PARM_DESC(open_aqp1, "AQP1 on startup (0: no (default), 1: yes)"); @@ -104,6 +106,9 @@ MODULE_PARM_DESC(nr_eqs, MODULE_PARM_DESC(dist_eqs, "enable distributing EQs across CQs " "(0: disabled/default, 1: enabled)"); +MODULE_PARM_DESC(mr_largepage, + "use large page for MR (0: use PAGE_SIZE (default), " + "1: use large page depending on MR size)"); DEFINE_RWLOCK(ehca_qp_idr_lock); DEFINE_RWLOCK(ehca_cq_idr_lock); @@ -314,6 +319,8 @@ int ehca_sense_attributes(struct ehca_shca *shca) if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap)) ehca_gen_dbg(" %s", hca_cap_descr[i].descr); + shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported; + port = (struct hipz_query_port
*)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { @@ -609,13 +616,20 @@ static ssize_t ehca_show_adapter_handle(struct device *dev, } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); +static ssize_t ehca_show_mr_largepage(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", ehca_mr_largepage); +} +static DEVICE_ATTR(mr_largepage, S_IRUGO, ehca_show_mr_largepage, NULL); + static ssize_t ehca_show_nr_eqs(struct device *dev, struct device_attribute *attr, char *buf) { return sprintf(buf, "%d\n", ehca_nr_eqs); } - static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL); static struct attribute *ehca_dev_attrs[] = { @@ -635,6 +649,7 @@ static struct attribute *ehca_dev_attrs[] = { &dev_attr_max_pd.attr, &dev_attr_max_ah.attr, &dev_attr_nr_eqs.attr, + &dev_attr_mr_largepage.attr, NULL }; diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 6262c54..ba28783 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -5,6 +5,7 @@ * * Authors: Dietmar Decker * Christoph Raisch + * Hoang-Nam Nguyen * * Copyright (c) 2005 IBM Corporation * @@ -56,6 +57,37 @@ static struct kmem_cache *mr_cache; static struct kmem_cache *mw_cache; +enum ehca_mr_pgsize { + EHCA_MR_PGSIZE4K = 0x1000L, + EHCA_MR_PGSIZE64K = 0x10000L, + EHCA_MR_PGSIZE1M = 0x100000L, + EHCA_MR_PGSIZE16M = 0x1000000L +}; + +extern int ehca_mr_largepage; + +static u32 ehca_encode_hwpage_size(u32 pgsize) +{ + u32 idx = 0; + pgsize >>= 12; + /* + * map mr page size into hw code: + * 0, 1, 2, 3 for 4K, 64K, 1M, 16M + */ + while (!(pgsize & 1)) { + idx++; + pgsize >>= 4; + } + return idx; +} + +static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca) +{ + if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M) + return EHCA_MR_PGSIZE16M; + return EHCA_MR_PGSIZE4K; +} + static struct ehca_mr *ehca_mr_new(void) { struct
ehca_mr *me; @@ -207,19 +239,23 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, struct ehca_mr_pginfo pginfo; u32 num_kpages; u32 num_hwpages; + u64 hw_pgsize; num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + - size, EHCA_PAGESIZE); + /* for kernel space we try most possible pgsize */ + hw_pgsize = ehca_get_max_hwpage_size(shca); + num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size, + hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; + pginfo.hwpage_size = hw_pgsize; pginfo.num_hwpages = num_hwpages; pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array = phys_buf_array; - pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, @@ -259,6 +295,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, int ret; u32 num_kpages; u32 num_hwpages; + u64 hwpage_size; if (!pd) { ehca_gen_err("bad pd=%p", pd); @@ -309,16 +346,32 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, /* determine number of MR pages */ num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); - num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length, - EHCA_PAGESIZE); + /* select proper hw_pgsize */ + if (ehca_mr_largepage && + (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) { + if (length <= EHCA_MR_PGSIZE4K + && PAGE_SIZE == EHCA_MR_PGSIZE4K) + hwpage_size = EHCA_MR_PGSIZE4K; + else if (length <= EHCA_MR_PGSIZE64K) + hwpage_size = EHCA_MR_PGSIZE64K; + else if (length <= EHCA_MR_PGSIZE1M) + hwpage_size = EHCA_MR_PGSIZE1M; + else + hwpage_size = EHCA_MR_PGSIZE16M; + } else + hwpage_size = EHCA_MR_PGSIZE4K; + ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size); 
+reg_user_mr_fallback: + num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size); /* register MR on HCA */ memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_USER; + pginfo.hwpage_size = hwpage_size; pginfo.num_kpages = num_kpages; pginfo.num_hwpages = num_hwpages; pginfo.u.usr.region = e_mr->umem; - pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE; + pginfo.next_hwpage = e_mr->umem->offset / hwpage_size; pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk, (&e_mr->umem->chunk_list), list); @@ -326,6 +379,18 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, &e_mr->ib.ib_mr.rkey); + if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) { + ehca_warn(pd->device, "failed to register mr " + "with hwpage_size=%lx", hwpage_size); + ehca_info(pd->device, "try to register mr with " + "kpage_size=%lx", PAGE_SIZE); + /* + * this means kpages are not contiguous for a hw page + * try kernel page size as fallback solution + */ + hwpage_size = PAGE_SIZE; + goto reg_user_mr_fallback; + } if (ret) { ib_mr = ERR_PTR(ret); goto reg_user_mr_exit2; @@ -452,6 +517,8 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, new_pd = container_of(mr->pd, struct ehca_pd, ib_pd); if (mr_rereg_mask & IB_MR_REREG_TRANS) { + u64 hw_pgsize = ehca_get_max_hwpage_size(shca); + new_start = iova_start; /* change address */ /* check physical buffer list and calculate size */ ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array, @@ -468,16 +535,17 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, } num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) + new_size, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) + - new_size, EHCA_PAGESIZE); + num_hwpages = NUM_CHUNKS(((u64)new_start % hw_pgsize) + + new_size, hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; + 
pginfo.hwpage_size = hw_pgsize; pginfo.num_hwpages = num_hwpages; pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array = phys_buf_array; - pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; } if (mr_rereg_mask & IB_MR_REREG_ACCESS) new_acl = mr_access_flags; @@ -709,6 +777,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, int ret; u32 tmp_lkey, tmp_rkey; struct ehca_mr_pginfo pginfo; + u64 hw_pgsize; /* check other parameters */ if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && @@ -738,8 +807,8 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, ib_fmr = ERR_PTR(-EINVAL); goto alloc_fmr_exit0; } - if (((1 << fmr_attr->page_shift) != EHCA_PAGESIZE) && - ((1 << fmr_attr->page_shift) != PAGE_SIZE)) { + hw_pgsize = ehca_get_max_hwpage_size(shca); + if ((1 << fmr_attr->page_shift) != hw_pgsize) { ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x", fmr_attr->page_shift); ib_fmr = ERR_PTR(-EINVAL); @@ -755,6 +824,10 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, /* register MR on HCA */ memset(&pginfo, 0, sizeof(pginfo)); + /* + * pginfo.num_hwpages==0, ie register_rpages() will not be called + * but deferred to map_phys_fmr() + */ ret = ehca_reg_mr(shca, e_fmr, NULL, fmr_attr->max_pages * (1 << fmr_attr->page_shift), mr_access_flags, e_pd, &pginfo, @@ -765,6 +838,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, } /* successful */ + e_fmr->hwpage_size = hw_pgsize; e_fmr->fmr_page_size = 1 << fmr_attr->page_shift; e_fmr->fmr_max_pages = fmr_attr->max_pages; e_fmr->fmr_max_maps = fmr_attr->max_maps; @@ -822,10 +896,12 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_FMR; pginfo.num_kpages = list_len; - pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE); + pginfo.hwpage_size = e_fmr->hwpage_size; + pginfo.num_hwpages = + list_len * e_fmr->fmr_page_size / 
pginfo.hwpage_size; pginfo.u.fmr.page_list = page_list; - pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + (iova & (e_fmr->fmr_page_size-1)) / pginfo.hwpage_size; pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size; ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova, @@ -964,7 +1040,7 @@ int ehca_reg_mr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl); if (ehca_use_hp_mr == 1) hipz_acl |= 0x00000001; @@ -987,6 +1063,7 @@ int ehca_reg_mr(struct ehca_shca *shca, /* successful registration */ e_mr->num_kpages = pginfo->num_kpages; e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->hwpage_size = pginfo->hwpage_size; e_mr->start = iova_start; e_mr->size = size; e_mr->acl = acl; @@ -1029,6 +1106,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, u32 i; u64 *kpage; + if (!pginfo->num_hwpages) /* in case of fmr */ + return 0; + kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); @@ -1036,7 +1116,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, goto ehca_reg_mr_rpages_exit0; } - /* max 512 pages per shot */ + /* max MAX_RPAGES ehca mr pages per register call */ for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) { if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { @@ -1049,8 +1129,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = ehca_set_pagebuf(pginfo, rnum, kpage); if (ret) { ehca_err(&shca->ib_device, "ehca_set_pagebuf " - "bad rc, ret=%x rnum=%x kpage=%p", - ret, rnum, kpage); + "bad rc, ret=%x rnum=%x kpage=%p", + ret, rnum, kpage); goto ehca_reg_mr_rpages_exit1; } @@ -1065,9 +1145,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } else rpage = *kpage; - h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr, - 0, /* pagesize 4k */ - 0, rpage, rnum); + h_ret = 
hipz_h_register_rpage_mr( + shca->ipz_hca_handle, e_mr, + ehca_encode_hwpage_size(pginfo->hwpage_size), + 0, rpage, rnum); if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { /* @@ -1131,7 +1212,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl); kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!kpage) { @@ -1182,6 +1263,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, */ e_mr->num_kpages = pginfo->num_kpages; e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->hwpage_size = pginfo->hwpage_size; e_mr->start = iova_start; e_mr->size = size; e_mr->acl = acl; @@ -1268,13 +1350,14 @@ int ehca_rereg_mr(struct ehca_shca *shca, /* set some MR values */ e_mr->flags = save_mr.flags; + e_mr->hwpage_size = save_mr.hwpage_size; e_mr->fmr_page_size = save_mr.fmr_page_size; e_mr->fmr_max_pages = save_mr.fmr_max_pages; e_mr->fmr_max_maps = save_mr.fmr_max_maps; e_mr->fmr_map_cnt = save_mr.fmr_map_cnt; ret = ehca_reg_mr(shca, e_mr, iova_start, size, acl, - e_pd, pginfo, lkey, rkey); + e_pd, pginfo, lkey, rkey); if (ret) { u32 offset = (u64)(&e_mr->flags) - (u64)e_mr; memcpy(&e_mr->flags, &(save_mr.flags), @@ -1355,6 +1438,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, /* set some MR values */ e_fmr->flags = save_fmr.flags; + e_fmr->hwpage_size = save_fmr.hwpage_size; e_fmr->fmr_page_size = save_fmr.fmr_page_size; e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; @@ -1363,8 +1447,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_FMR; - pginfo.num_kpages = 0; - pginfo.num_hwpages = 0; ret = ehca_reg_mr(shca, e_fmr, NULL, (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), e_fmr->acl, e_pd, &pginfo, &tmp_lkey, @@ -1373,7 +1455,6 @@ int ehca_unmap_one_fmr(struct ehca_shca 
*shca, u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; memcpy(&e_fmr->flags, &(save_mr.flags), sizeof(struct ehca_mr) - offset); - goto ehca_unmap_one_fmr_exit0; } ehca_unmap_one_fmr_exit0: @@ -1401,7 +1482,7 @@ int ehca_reg_smr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl); h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, (u64)iova_start, hipz_acl, e_pd->fw_pd, @@ -1420,6 +1501,7 @@ int ehca_reg_smr(struct ehca_shca *shca, /* successful registration */ e_newmr->num_kpages = e_origmr->num_kpages; e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->hwpage_size = e_origmr->hwpage_size; e_newmr->start = iova_start; e_newmr->size = e_origmr->size; e_newmr->acl = acl; @@ -1452,6 +1534,7 @@ int ehca_reg_internal_maxmr( struct ib_phys_buf ib_pbuf; u32 num_kpages; u32 num_hwpages; + u64 hw_pgsize; e_mr = ehca_mr_new(); if (!e_mr) { @@ -1468,13 +1551,15 @@ int ehca_reg_internal_maxmr( ib_pbuf.size = size_maxmr; num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr, - EHCA_PAGESIZE); + hw_pgsize = ehca_get_max_hwpage_size(shca); + num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size_maxmr, + hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; pginfo.num_hwpages = num_hwpages; + pginfo.hwpage_size = hw_pgsize; pginfo.u.phy.num_phys_buf = 1; pginfo.u.phy.phys_buf_array = &ib_pbuf; @@ -1523,7 +1608,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl); h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, (u64)iova_start, hipz_acl, 
e_pd->fw_pd, @@ -1539,6 +1624,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, /* successful registration */ e_newmr->num_kpages = e_origmr->num_kpages; e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->hwpage_size = e_origmr->hwpage_size; e_newmr->start = iova_start; e_newmr->size = e_origmr->size; e_newmr->acl = acl; @@ -1684,6 +1770,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, u64 pgaddr; u32 i = 0; u32 j = 0; + int hwpages_per_kpage = PAGE_SIZE / pginfo->hwpage_size; /* loop over desired chunk entries */ chunk = pginfo->u.usr.next_chunk; @@ -1695,7 +1782,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, << PAGE_SHIFT ; *kpage = phys_to_abs(pgaddr + (pginfo->next_hwpage * - EHCA_PAGESIZE)); + pginfo->hwpage_size)); if ( !(*kpage) ) { ehca_gen_err("pgaddr=%lx " "chunk->page_list[i]=%lx " @@ -1708,8 +1795,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; kpage++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) { + if (pginfo->next_hwpage % hwpages_per_kpage == 0) { (pginfo->kpage_cnt)++; (pginfo->u.usr.next_nmap)++; pginfo->next_hwpage = 0; @@ -1738,6 +1824,143 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, return ret; } +/* + * check given pages for contiguous layout + * last page addr is returned in prev_pgaddr for further check + */ +static int ehca_check_kpages_per_ate(struct scatterlist *page_list, + int start_idx, int end_idx, + u64 *prev_pgaddr) +{ + int t; + for (t = start_idx; t <= end_idx; t++) { + u64 pgaddr = page_to_pfn(page_list[t].page) << PAGE_SHIFT; + ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr, + *(u64 *)abs_to_virt(phys_to_abs(pgaddr))); + if (pgaddr - PAGE_SIZE != *prev_pgaddr) { + ehca_gen_err("uncontiguous page found pgaddr=%lx " + "prev_pgaddr=%lx page_list_i=%x", + pgaddr, *prev_pgaddr, t); + return -EINVAL; + } + *prev_pgaddr = pgaddr; + } + return 0; +} + +/* PAGE_SIZE < 
pginfo->hwpage_size */ +static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int ret = 0; + struct ib_umem_chunk *prev_chunk; + struct ib_umem_chunk *chunk; + u64 pgaddr, prev_pgaddr; + u32 i = 0; + u32 j = 0; + int kpages_per_hwpage = pginfo->hwpage_size / PAGE_SIZE; + int nr_kpages = kpages_per_hwpage; + + /* loop over desired chunk entries */ + chunk = pginfo->u.usr.next_chunk; + prev_chunk = pginfo->u.usr.next_chunk; + list_for_each_entry_continue( + chunk, (&(pginfo->u.usr.region->chunk_list)), list) { + for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) { + if (nr_kpages == kpages_per_hwpage) { + pgaddr = ( page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ); + *kpage = phys_to_abs(pgaddr); + if ( !(*kpage) ) { + ehca_gen_err("pgaddr=%lx i=%x", + pgaddr, i); + ret = -EFAULT; + return ret; + } + /* + * The first page in a hwpage must be aligned; + * the first MR page is exempt from this rule. + */ + if (pgaddr & (pginfo->hwpage_size - 1)) { + if (pginfo->hwpage_cnt) { + ehca_gen_err( + "invalid alignment " + "pgaddr=%lx i=%x " + "mr_pgsize=%lx", + pgaddr, i, + pginfo->hwpage_size); + ret = -EFAULT; + return ret; + } + /* first MR page */ + pginfo->kpage_cnt = + (pgaddr & + (pginfo->hwpage_size - 1)) >> + PAGE_SHIFT; + nr_kpages -= pginfo->kpage_cnt; + *kpage = phys_to_abs( + pgaddr & + ~(pginfo->hwpage_size - 1)); + } + ehca_gen_dbg("kpage=%lx chunk_page=%lx " + "value=%016lx", *kpage, pgaddr, + *(u64 *)abs_to_virt( + phys_to_abs(pgaddr))); + prev_pgaddr = pgaddr; + i++; + pginfo->kpage_cnt++; + pginfo->u.usr.next_nmap++; + nr_kpages--; + if (!nr_kpages) + goto next_kpage; + continue; + } + if (i + nr_kpages > chunk->nmap) { + ret = ehca_check_kpages_per_ate( + chunk->page_list, i, + chunk->nmap - 1, &prev_pgaddr); + if (ret) return ret; + pginfo->kpage_cnt += chunk->nmap - i; + pginfo->u.usr.next_nmap += chunk->nmap - i; + nr_kpages -= chunk->nmap - i; + break; + } + + ret = 
ehca_check_kpages_per_ate(chunk->page_list, i, + i + nr_kpages - 1, + &prev_pgaddr); + if (ret) return ret; + i += nr_kpages; + pginfo->kpage_cnt += nr_kpages; + pginfo->u.usr.next_nmap += nr_kpages; +next_kpage: + nr_kpages = kpages_per_hwpage; + (pginfo->hwpage_cnt)++; + kpage++; + j++; + if (j >= number) break; + } + if ((pginfo->u.usr.next_nmap >= chunk->nmap) && + (j >= number)) { + pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->u.usr.next_nmap >= chunk->nmap) { + pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; + } + pginfo->u.usr.next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->u.usr.region->chunk_list)), + list); + return ret; +} + int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, u32 number, u64 *kpage) @@ -1750,9 +1973,10 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, /* loop over desired phys_buf_array entries */ while (i < number) { pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf; - num_hw = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + - pbuf->size, EHCA_PAGESIZE); - offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + num_hw = NUM_CHUNKS((pbuf->addr % pginfo->hwpage_size) + + pbuf->size, pginfo->hwpage_size); + offs_hw = (pbuf->addr & ~(pginfo->hwpage_size - 1)) / + pginfo->hwpage_size; while (pginfo->next_hwpage < offs_hw + num_hw) { /* sanity check */ if ((pginfo->kpage_cnt >= pginfo->num_kpages) || @@ -1768,21 +1992,23 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, return -EFAULT; } *kpage = phys_to_abs( - (pbuf->addr & EHCA_PAGEMASK) - + (pginfo->next_hwpage * EHCA_PAGESIZE)); + (pbuf->addr & ~(pginfo->hwpage_size - 1)) + + (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) && pbuf->addr ) { - ehca_gen_err("pbuf->addr=%lx " - "pbuf->size=%lx " + ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx " "next_hwpage=%lx", pbuf->addr, - pbuf->size, - pginfo->next_hwpage); + pbuf->size, 
pginfo->next_hwpage); return -EFAULT; } (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->kpage_cnt)++; + if (PAGE_SIZE >= pginfo->hwpage_size) { + if (pginfo->next_hwpage % + (PAGE_SIZE / pginfo->hwpage_size) == 0) + (pginfo->kpage_cnt)++; + } else + pginfo->kpage_cnt += pginfo->hwpage_size / + PAGE_SIZE; kpage++; i++; if (i >= number) break; @@ -1806,8 +2032,8 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, /* loop over desired page_list entries */ fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; for (i = 0; i < number; i++) { - *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + - pginfo->next_hwpage * EHCA_PAGESIZE); + *kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) + + pginfo->next_hwpage * pginfo->hwpage_size); if ( !(*kpage) ) { ehca_gen_err("*fmrlist=%lx fmrlist=%p " "next_listelem=%lx next_hwpage=%lx", @@ -1817,15 +2043,38 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, return -EFAULT; } (pginfo->hwpage_cnt)++; - (pginfo->next_hwpage)++; - kpage++; - if (pginfo->next_hwpage % - (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) { - (pginfo->kpage_cnt)++; - (pginfo->u.fmr.next_listelem)++; - fmrlist++; - pginfo->next_hwpage = 0; + if (pginfo->u.fmr.fmr_pgsize >= pginfo->hwpage_size) { + if (pginfo->next_hwpage % + (pginfo->u.fmr.fmr_pgsize / + pginfo->hwpage_size) == 0) { + (pginfo->kpage_cnt)++; + (pginfo->u.fmr.next_listelem)++; + fmrlist++; + pginfo->next_hwpage = 0; + } else + (pginfo->next_hwpage)++; + } else { + unsigned int cnt_per_hwpage = pginfo->hwpage_size / + pginfo->u.fmr.fmr_pgsize; + unsigned int j; + u64 prev = *kpage; + /* check if adrs are contiguous */ + for (j = 1; j < cnt_per_hwpage; j++) { + u64 p = phys_to_abs(fmrlist[j] & + ~(pginfo->hwpage_size - 1)); + if (prev + pginfo->u.fmr.fmr_pgsize != p) { + ehca_gen_err("uncontiguous fmr pages " + "found prev=%lx p=%lx " + "idx=%x", prev, p, i + j); + return -EINVAL; + } + 
prev = p; + } + pginfo->kpage_cnt += cnt_per_hwpage; + pginfo->u.fmr.next_listelem += cnt_per_hwpage; + fmrlist += cnt_per_hwpage; } + kpage++; } return ret; } @@ -1842,7 +2091,9 @@ int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo, ret = ehca_set_pagebuf_phys(pginfo, number, kpage); break; case EHCA_MR_PGI_USER: - ret = ehca_set_pagebuf_user1(pginfo, number, kpage); + ret = PAGE_SIZE >= pginfo->hwpage_size ? + ehca_set_pagebuf_user1(pginfo, number, kpage) : + ehca_set_pagebuf_user2(pginfo, number, kpage); break; case EHCA_MR_PGI_FMR: ret = ehca_set_pagebuf_fmr(pginfo, number, kpage); @@ -1895,9 +2146,9 @@ void ehca_mrmw_map_acl(int ib_acl, /*----------------------------------------------------------------------*/ /* sets page size in hipz access control for MR/MW. */ -void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/ +void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl) /*INOUT*/ { - return; /* HCA supports only 4k */ + *hipz_acl |= (ehca_encode_hwpage_size(pgsize) << 24); } /* end ehca_mrmw_set_pgsize_hipz_acl() */ /*----------------------------------------------------------------------*/ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h index 24f13fe..bc8f4e3 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.h +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -111,7 +111,7 @@ int ehca_mr_is_maxmr(u64 size, void ehca_mrmw_map_acl(int ib_acl, u32 *hipz_acl); -void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl); +void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl); void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, int *ib_acl); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 3394e05..358796c 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -427,7 +427,8 @@ u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, { return ehca_plpar_hcall_norets(H_REGISTER_RPAGES, adapter_handle.handle, /* 
r4 */ - queue_type | pagesize << 8, /* r5 */ + (u64)queue_type | ((u64)pagesize) << 8, + /* r5 */ resource_handle, /* r6 */ logical_address_of_page, /* r7 */ count, /* r8 */ @@ -724,6 +725,9 @@ u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, u64 ret; u64 outs[PLPAR_HCALL9_BUFSIZE]; + ehca_gen_dbg("kernel PAGE_SIZE=%x access_ctrl=%016x " + "vaddr=%lx length=%lx", + (u32)PAGE_SIZE, access_ctrl, vaddr, length); ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs, adapter_handle.handle, /* r4 */ 5, /* r5 */ @@ -746,8 +750,22 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, const u64 logical_address_of_page, const u64 count) { + extern int ehca_debug_level; u64 ret; + if (unlikely(ehca_debug_level >= 2)) { + if (count > 1) { + u64 *kpage; + int i; + kpage = (u64 *)abs_to_virt(logical_address_of_page); + for (i = 0; i < count; i++) + ehca_gen_dbg("kpage[%d]=%p", + i, (void *)kpage[i]); + } else + ehca_gen_dbg("kpage=%p", + (void *)logical_address_of_page); + } + if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) { ehca_gen_err("logical_address_of_page not on a 4k boundary " "adapter_handle=%lx mr=%p mr_handle=%lx " -- 1.5.2 From halr at voltaire.com Thu Jul 12 09:04:45 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Jul 2007 12:04:45 -0400 Subject: [ofa-general] Re: [PATCH] opensm/updn: root detector function simplification In-Reply-To: <20070712024716.GA2248@sashak.voltaire.com> References: <20070712024716.GA2248@sashak.voltaire.com> Message-ID: <1184255984.17622.197967.camel@hal.voltaire.com> On Wed, 2007-07-11 at 22:47, Sasha Khapyorsky wrote: > There are pretty cosmetic simplifications for up/down root auto detector > function - reducing some vars and flows. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From mshefty at ichips.intel.com Thu Jul 12 09:19:26 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jul 2007 09:19:26 -0700 Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: <4695DBE3.1070002@voltaire.com> References: <4695DBE3.1070002@voltaire.com> Message-ID: <4696548E.60208@ichips.intel.com> > You have approved this patch for OFED 1.2.1, does it suitable also for > upstream, and if not how you think it would be correct to proceed? I started the following thread to determine an appropriate upstream fix: http://lists.openfabrics.org/pipermail/general/2007-July/037763.html I wasn't sure that we'd have an upstream fix ready in time for OFED 1.2.1. - Sean From becker at nas.nasa.gov Thu Jul 12 09:28:29 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Thu, 12 Jul 2007 09:28:29 -0700 Subject: [ofa-general] Re: http://git.openfabrics.org/ In-Reply-To: References: Message-ID: <795c49870707120928h16d86980ga2cd272dab26865@mail.gmail.com> Hi Jeff. Ping received. Will git (8^)) to it when I can. I'm in the middle of an acceptance test , part of which is getting OFED 1.2 up on an IBM Power/ehca system. -jeff On 7/11/07, Jeff Squyres wrote: > Just a ping again to make sure that this request doesn't get lost... > > On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote: > > > I notice that http://git.openfabrics.org/ shows the main OFA web > > site, but http://git.openfabrics.org/git/ shows all the git > > repositories. > > > > Can a redirect be installed such that http://git.openfabrics.org/ > > is automatically sent to http://git.openfabrics.org/git/? > > > > I think that would be a little more intuitive. > > > > Thanks! 
> > > > -- > > Jeff Squyres > > Cisco Systems > > > > > > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From becker at nas.nasa.gov Thu Jul 12 09:46:20 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Thu, 12 Jul 2007 09:46:20 -0700 Subject: [ofa-general] OFED-1.2 release download link In-Reply-To: <46962B6B.8050903@dev.mellanox.co.il> References: <46962B6B.8050903@dev.mellanox.co.il> Message-ID: <795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com> GA link should now be correct. -jeff On 7/12/07, Vladimir Sokolovsky wrote: > Hi, > OFED-1.2 is currently available at > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz > > OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and RHEL 5.0 > can be downloaded from: > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/ > > Note: > On http://www.openfabrics.org/downloads.htm > OFED 1.2 GA link points to the wrong place. > > > Regards, > Vladimir > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From inverts at phentermine.com Thu Jul 12 11:00:56 2007 From: inverts at phentermine.com (Robyn Carroll) Date: Thu, 12 Jul 2007 17:00:56 -0100 Subject: [ofa-general] Ten times cheaper Message-ID: <794165711.10341716695959@phentermine.com> An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Thu Jul 12 10:15:24 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 10:15:24 -0700 Subject: [ofa-general] Re: [PATCH 00/10] IB/ehca: Multiple Event Queues, MR/MW rework, large page MRs, fixes In-Reply-To: <200707121745.27592.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 12 Jul 2007 17:45:26 +0200") References: <200707121745.27592.fenkes@de.ibm.com> Message-ID: > Note that patch 7 will introduce a few lines over 80 chars that will be > unindented in patch 8 - I hope that's okay with you. That's fine -- the 80 column rule is one thing I don't worry about too much; absurdly long lines are bad, but if a line is, say, 84 chars and breaking it makes the code uglier, then I just leave the 84 char line. > [09/10] fixes a lot of checkpatch.pl warnings Are these warnings from earlier patches in the series, or problems that already existed in the code? If they are coming from other patches in the series, please just fix the earlier patches before I merge them. Thanks, Roland From tnguyen at pantasys.com Thu Jul 12 10:38:42 2007 From: tnguyen at pantasys.com (Tung M. Nguyen) Date: Thu, 12 Jul 2007 10:38:42 -0700 Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link In-Reply-To: <795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com> References: <46962B6B.8050903@dev.mellanox.co.il> <795c49870707120946w5dc6896q186686294aeb75ac@mail.gmail.com> Message-ID: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN> Guys, I saw the attached message regarding a OFED 1.2 rc9. This is quite confusing. Do we have a GA version or not? It seems that there is some work needs to be done for Mellanox latest HCA, ConnectX. maybe it should not hold up OFED 1.2 GA? 
Regards, Tung > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org > [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Jeff Becker > Sent: Thursday, July 12, 2007 9:46 AM > To: Vladimir Sokolovsky > Cc: Jeffrey Scott; OpenFabricsEWG; OpenFabrics General > Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link > > GA link should now be correct. > > -jeff > > On 7/12/07, Vladimir Sokolovsky wrote: > > Hi, > > OFED-1.2 is currently available at > > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz > > > > OFED-1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 > and RHEL 5.0 > > can be downloaded from: > > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/ > > > > Note: > > On http://www.openfabrics.org/downloads.htm > > OFED 1.2 GA link points to the wrong place. > > > > > > Regards, > > Vladimir > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -------------- next part -------------- An embedded message was scrubbed... From: "Tziporet Koren" Subject: [ewg] OFED 1.2.c-9 is available Date: Thu, 12 Jul 2007 08:01:08 -0700 Size: 11035 URL: From rdreier at cisco.com Thu Jul 12 10:42:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 10:42:39 -0700 Subject: [ofa-general] Moving On In-Reply-To: <1184191631.17622.128348.camel@hal.voltaire.com> (Hal Rosenstock's message of "11 Jul 2007 18:07:43 -0400") References: <1184191631.17622.128348.camel@hal.voltaire.com> Message-ID: Hal, Good luck with whatever comes next in your life. 
I guess it makes sense to remove the halr at voltaire.com line from the kernel MAINTAINERS file. Do you want to replace it with your gmail address, or just move your entry out of MAINTAINERS and into CREDITS? - R. From sean.hefty at intel.com Thu Jul 12 10:45:07 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 12 Jul 2007 10:45:07 -0700 Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link In-Reply-To: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN> Message-ID: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> >I saw the attached message regarding a OFED 1.2 rc9. This is quite >confusing. Do we have a GA version or not? It seems that there is some >work needs to be done for Mellanox latest HCA, ConnectX. maybe it should >not hold up OFED 1.2 GA? Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', but just 'c-9' - meaning it includes support for Mellanox ConnectX adapter). OFED 1.2 GA was released in June. Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox specific code release that repackages the OFED 1.2 code? - Sean From tnguyen at pantasys.com Thu Jul 12 10:47:58 2007 From: tnguyen at pantasys.com (Tung M. Nguyen) Date: Thu, 12 Jul 2007 10:47:58 -0700 Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link In-Reply-To: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> References: <000b01c7c4ab$7ddaa000$8c28010a@EXECTMN> <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> Message-ID: <002501c7c4ac$c8c57030$8c28010a@EXECTMN> Oops. I missed it. Sorry for the spam. Regards, Tung > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, July 12, 2007 10:45 AM > To: 'Tung M. Nguyen'; 'Jeff Becker'; 'Vladimir Sokolovsky' > Cc: 'OpenFabricsEWG'; 'OpenFabrics General' > Subject: RE: [ewg] Re: [ofa-general] OFED-1.2 release download link > > >I saw the attached message regarding a OFED 1.2 rc9. This is quite > >confusing. Do we have a GA version or not? 
It seems that > there is some > >work needs to be done for Mellanox latest HCA, ConnectX. > maybe it should > >not hold up OFED 1.2 GA? > > Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', > but just 'c-9' - > meaning it includes support for Mellanox ConnectX adapter). > OFED 1.2 GA was > released in June. > > Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox > specific code > release that repackages the OFED 1.2 code? > > - Sean From halr at voltaire.com Thu Jul 12 12:09:30 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Jul 2007 15:09:30 -0400 Subject: [ofa-general] Moving On In-Reply-To: References: <1184191631.17622.128348.camel@hal.voltaire.com> Message-ID: <1184267369.13276.12705.camel@hal.voltaire.com> Roland, On Thu, 2007-07-12 at 13:42, Roland Dreier wrote: > Hal, > > Good luck with whatever comes next in your life. Thanks. > I guess it makes sense to remove the halr at voltaire.com line from the > kernel MAINTAINERS file. Do you want to replace it with your gmail > address, or just move your entry out of MAINTAINERS and into CREDITS? I think the best thing for now is to replace it with my gmail account (to make sure SMI and agent are covered at a minimum). -- Hal > - R. From pradeeps at linux.vnet.ibm.com Thu Jul 12 12:28:27 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 12 Jul 2007 12:28:27 -0700 Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation (for IPoIB CM) In-Reply-To: References: Message-ID: <469680DB.6000602@linux.vnet.ibm.com> Roland Dreier wrote: > > It is not clear if anything is better yet, but instead you have to go back > > to the IPoIB-CM RFC 4755 that we wrote. In the spec you will see that the > > approach for this driver is to have the IPoIB driver select the most > > appropriate method of connecting. If RC was not available then UD was > > used. 
You can extend that to UC mode as Michael proposed, as long as you > > allow selecting the most appropriate method of connection. By pushing the > > issue of SRQ or not SRQ to the driver you have broken the IPoIB-CM > > original design. Since SRQ was not a required function in the IB spec we > > never addressed that issue in the RFC along with UC. I think we can agree > > that adding UC is a good thing and follows the approach in the original > > spec. Including SRQ as one of the tests for the best possible connection > > method follows this same approach. > > > .... > > I can't really follow this. We're talking about the internal > implementation inside the Linux kernel, which I really hope that an > IETF RFC does not address at all. We surely intend to follow the RFC, > and if we run into problems because the RFC was written without any > implementation experience, then we'll work to correct those problems > through a new IETF document. > > It makes perfect sense for ehca systems to be able to use IPoIB CM. I > understand that current ehca HW doesn't natively support SRQs. The > only question is how to implement IPoIB CM for ehca systems, and we > have to weigh tradeoffs like avoiding code duplication vs the > additional cost of branches on the data path. > In the absence of any further discussions about the IPoIB CM without SRQ patches, I will incorporate Sean Hefty's comments and plan to resubmit the patches, unless I hear something soon. Pradeep From ardavis at ichips.intel.com Thu Jul 12 12:43:04 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 12 Jul 2007 12:43:04 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46956FF9.50102@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> Message-ID: <46968448.2000401@ichips.intel.com> Arlin Davis wrote: > The proposal was attempting to come up with a method to automatically > link to a package and description file from the download webpage. 
I > have no problem > targeting http://openfabrics.org/downloads as long as we come up with > a way for the webpage to correlate a description with a package > without hand coding the links everytime. We need to come up with a > method for automatic links to keep our download webpage updated and > complete. > > What if we add a directory for each project under downloads and > provide a README for a description? Other suggestions? > Here is a stab at what we have today for discussion purposes: Linux Libraries: - libibverbs -http://www.openfabrics.org/downloads/ - librdmacm - http://www.openfabrics.org/~shefty/ - dapl - http://www.openfabrics.org/~ardavis/ - management -http://www.openfabrics.org/~halr/ OFED Linux: - OFED 1.2 release - http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz - OFED 1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and RHEL 5.0 http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/ - OFED connectx release - _http://www.openfabrics.org/builds/connectx/release/_ OFED Linux Archives: - SLES 10 OFED 1.0 RPMS - http://www.openfabrics.org/downloads/ - OFED 1.1 release - https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/releases/ - OFED 1.0 release - https://svn.openfabrics.org/svn/openib/gen2/branches/1.0/ofed/releases/ WinOF for windows: WinOF 1.0 release - http://www.oprnfabrics.org/~ardavis/WinOF 1.0/WinOF_1-0.zip WinOF source - svn://openib.tc.cornell.edu WinOF faq - https://wiki.openfabrics.org/tiki-index.php?page=OpenIB+Windows I would like to propose adding project directories under http://www.openfabrics.org/downloads/ where appropriate and give maintainers access. 
For example: http://www.openfabrics.org/downloads/verbs (rdreier) http://www.openfabrics.org/downloads/rdmacm (shefty) http://www.openfabrics.org/downloads/dapl (ardavis) http://www.openfabrics.org/downloads/management (sashak) http://www.openfabrics.org/downloads/OFED (vlad) http://www.openfabrics.org/downloads/WinOF (ardavis) http://www.openfabrics.org/downloads/archives (vlad) ?? etc... Each of these would contain a README that details the contents of the directory along with a WEB_README that provides a short description for the webpage. Jeff could then automatically parse for directories under downloads and, if one contains a WEB_README, add a webpage link to the directory along with the short description. Jeff, is this possible? Comments? -arlin From or.gerlitz at gmail.com Thu Jul 12 13:18:07 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 12 Jul 2007 23:18:07 +0300 Subject: [ofa-general] [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: <4696548E.60208@ichips.intel.com> References: <4695DBE3.1070002@voltaire.com> <4696548E.60208@ichips.intel.com> Message-ID: <15ddcffd0707121318h7c9a037ap5f6dc5cf182fb529@mail.gmail.com> On 7/12/07, Sean Hefty wrote: > > I started the following thread to determine an appropriate upstream fix: > > http://lists.openfabrics.org/pipermail/general/2007-July/037763.html > > I wasn't sure that we'd have an upstream fix ready in time for OFED 1.2.1. > Got it, I guess that if the upstream solution turns out to be different than this patch, you would ask to remove it from OFED and deploy the upstream one. Or. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From twbowman at gmail.com Thu Jul 12 13:53:23 2007 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 12 Jul 2007 14:53:23 -0600 Subject: [ofa-general] IB performance stats (revisited) In-Reply-To: <1184170906.17622.104663.camel@hal.voltaire.com> References: <46826370.4090602@hp.com> <1182978496.28870.106214.camel@hal.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901CAD914@mtlexch01.mtl.com> <20070710094659.50df9b39.weiny2@llnl.gov> <6C2C79E72C305246B504CBA17B5500C901E04867@mtlexch01.mtl.com> <1184160670.17622.92728.camel@hal.voltaire.com> <4694E61F.8000502@hp.com> <1184163750.17622.96256.camel@hal.voltaire.com> <4694F085.4010502@hp.com> <1184170906.17622.104663.camel@hal.voltaire.com> Message-ID: This seems to be a good topic to share some work we have been doing here at LANL. ibmon is an app that I developed that is currently monitoring our IB production systems. It's small, written in C and Perl, follows the standalone model, and is SM independent. It can be found at http://sourceforge.net/projects/ibmon. Key features: - SM independent - Reports "interesting" events via syslog, email or console - Events can be reported in detailed and/or "high-level" form - Detailed events are reported as a "point-to-point" link. - Makes for easier transformation to "high-level" form - Fast, query on a ~4000 node network is < 5s. - Uses sqlite for internal temp storage and archival storage. - Modular design: discover, query and reporting are separated. Can move towards distributed model. - Built for crontab. - Can clear counters on query or when pegged. - Keeps historical performance and topology data - Gathers and stores most of the IB tables: nodeinfo, switchinfo, sminfo, portinfo, perfcounters, lfdb (optional) - Reports changes in SMs Known issues: - Does not receive SM traps, needs to rediscover every so often. - Threshold values for errors need to be moved to a config file, currently in a db. - Does not clear counters when "nearly" pegged. 
Todd On 11 Jul 2007 12:21:51 -0400, Hal Rosenstock wrote: > > On Wed, 2007-07-11 at 11:00, Mark Seger wrote: > > Hal Rosenstock wrote: > > > > >On Wed, 2007-07-11 at 10:15, Mark Seger wrote: > > > > > > > > >>My basic philosophy, and I suspect there are those who might disagree, > > >>is that you can't use the network to monitor the network, at least not > > >>in times of trouble. > > >> > > >> > > > > > >Right, in times of certain troubles. > > > > > > > > and that is the key. since you can't know apriori when you're about to > > have troubles, you need to be collecting the data locally before they > occur. > > > > >>That's why I insist on having to query the HCAs > > >>directly since I can't always be sure the network is there and/or > > >>reliable. If you are willing to concede that this can indeed happen > > >>than the question becomes one of how do you reliably get data from an > > >>HCA and that's the basis for my (re)starting this discussion. > > >> > > >> > > > > > >The reliability comes from timeout/retry mechanisms. If performance > data > > >cannot be obtained on an IB network, it needs to be trouble shooted at > a > > >lower level (by SMPs). > > > > > >In any case, a rearchitecture of the PMA was proposed and seems > > >reasonable to me in that it can accomodate either approach. All that is > > >needed now is for someone to step up and champion an implementation of > > >this. Unfortunately, I do not have time to do so. > > > > > > > > I don't know if what I've been proposing requires any rearchitecting as > > I see is as something local to each node. Specificially, and there is > > already an implementation of this in an earlier voltaire stack, is to > > export wrapping HCA counters to /proc. The module that does this > > read/clears the counters on every access but since no local applications > > are accessing the counters directly, clearing them doesn't hurt anyone. > > Alas, anyone else who wants to query the counters will find them reset. 
> > No local application but perhaps a remote one. This is the reason for > the proposed rearchitecture (along with synthesizing the wider > counters). > > -- Hal > > > The other side benefit of exporting these counters is such a way is now > > lots of others can collect/report this info. In other words is someone > > chose to add IB stats to sar, it would become very easy to do! > > > > If this is the type of thing people are interested in, I might be able > > to supply some code to do it. > > > > >>As for querying the switch for counters, what do you do on a very > large > > >>network, say 10s of thousands of nodes if you want to get performance > > >>data every second? I also realize this is an extreme situation today > > >>(the node count not the frequency of monitoring) but I'm sure everyone > > >>would agree systems of these sizes are not that far off. > > >> > > >> > > > > > >You have a distributed performance manager to handle this. A hierarchy > > >of performance managers has been discussed on the list before. > > > > > > > > ahh, I see. > > -mark > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From akpm at linux-foundation.org Thu Jul 12 14:35:01 2007 From: akpm at linux-foundation.org (Andrew Morton) Date: Thu, 12 Jul 2007 14:35:01 -0700 Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs In-Reply-To: <1184097931.3020.73.camel@localhost.localdomain> References: <200707021919.27251.hnguyen@linux.vnet.ibm.com> <1183422700.3130.27.camel@localhost.localdomain> <200707041611.30056.hnguyen@linux.vnet.ibm.com> <1184097931.3020.73.camel@localhost.localdomain> Message-ID: <20070712143501.2c2cdf1f.akpm@linux-foundation.org> On Tue, 10 Jul 2007 16:05:31 -0400 Jim Houston wrote: > Hoang-Nam Nguyen reported a bug in idr_get_new_above() > which occurred with a starting id value like 0x3ffffffc. > His test module easily reproduced the problem. Thanks. > > The test revealed the following bugs: > > 1. Relying on shift operations which have undefined results > e.g.: 1 << n where n > word size. On i386 an integer shift > only uses the low 5 bits of the shift count. > > 2. An off by one error which prevented the top most layer > of the radix tree from being allocated. This meant that > sub_alloc() would allocate an entry in the existing portion > of the radix tree which aliased the requested address. When > it tried to allocate id 0x40000000, it might use the slot > belonging to id 0. > > 3. There was also a failure in the code which walked back up > the tree if an allocation failed. The normal case is to > descend the tree checking the starting id value against the > bitmap at each level. If the bit is set, we know that the > entire sub-tree is full and we can short cut the search. > We may still descend to the lowest level and find that the > portion of the id space we want is full. In this case we > need to walk back up the tree and continue the search. > The existing code just returned to the previous level and > continued. 
This resulted in an attempt to allocate an id
> above 0x3ffffffc using the slot for id 0x3ffffc00 instead of
> 0x40000000 which it then claimed to have allocated. The same
> problem occurs with 0x3ff as the requested id value if it
> is already in use.
>
> With this patch, idr.c should work as advertised allocating id
> values in the range 0...0x7fffffff. Andrew had speculated that
> it should allow the full range 0...0xffffffff to be used. I was
> tempted to make changes to allow this, but it would require changes
> to API, e.g. making the starting id value and the return value
> unsigned.

Problem. There are a large number of IDR changes pending and this
patch breaks in a way which I am not at all confident in fixing.

Ordinarily I'd just dump the earlier patches because bugfixes come
first. But this time there's a very large dependency trail on the
earlier patches (especially Tejun's extensive sysfs rework in Greg's
driver tree) so the wreckage would be extensive.

Also, it's possible that Tejun's changes already fixed some of the things
which you fixed. Or added new bugs ;)

Bottom line: a reworked patch against 2.6.22-rc6-mm1 would be muchly
appreciated if poss, please.
While you're there, it would be helpful if you could review all these pending IDR changes: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-ida-implement-idr-based-id-allocator.patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-fix-obscure-bug-in-allocation-path.patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-separate-out-idr_mark_full.patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each.patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each-fix.patch ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_remove_all.patch Thanks. From rick.jones2 at hp.com Thu Jul 12 14:42:44 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Thu, 12 Jul 2007 14:42:44 -0700 Subject: [ofa-general] missing "balance" in aggregate bi-directional SDP bulk transfer Message-ID: <4696A054.8010102@hp.com> I've been trudging through a set of netperf tests with OFED 1.2, and came to a point where I was running concurrent netperf bidirectional tests through both ports of: 03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) I configured ib0 and ib1 into separate IP subnets, and ran the "bidirectional TCP_RR" test (./configure --enable-bursts, large socket buffer, large req/rsp size and a burst of 12 transactions in flight at one time) and the results were rather even - each connection achieved about the same performance. However, when I run the same test over SDP, some connections seem to get much better performance than others. 
For example, with two concurrent connections, one over each port, one
will get a much higher result than the other.

Four iterations of a pair of SDP_RR tests, one each across the two ports
of the HCA (ie run two concurrent netperfs, four times in a row), what
this calls port "1" is running over ib0, what it calls "3" is running
over ib1 (1 and 3 were the subnet numbers and were simply convenient
tags), the units are transactions per second, process completion
notification messages trimmed for readability:

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3 ; do netperf -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t SDP_RR -- -s 1M -S 1M -r 64K -b 12 & done;wait;done

2294.65 port 1
10003.66 port 3

398.63 port 1
11898.55 port 3

269.73 port 3
12025.79 port 1

478.29 port 3
11819.61 port 1

It doesn't seem that the favoritism is pegged to a specific port since
they traded places there in the middle.

Now, if I reload the ib_sdp module, and set recv_poll and send_poll to 0
I get this behaviour:

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3 ; do netperf -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t SDP_RR -- -s 1M -S 1M -r 64K -b 12 & done;wait;done

6132.89 port 1
6132.79 port 3

6127.32 port 1
6127.27 port 3

6006.84 port 1
6006.34 port 3

6134.83 port 1
6134.29 port 3

I guess it is possible for one of the netperfs or netservers to spin
such that they preclude the other from running, even though I have four
cores on the system.
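One way to read the two sets of results above is to compare the aggregate rate of each pair with the share taken by the slower connection. The C helpers below are only a reader-side sketch of that arithmetic: the names aggregate_rate and slower_share are made up for illustration and are not part of netperf or OFED.

```c
#include <assert.h>

/* Hypothetical helper: combined transaction rate of a pair of
 * concurrent connections, in transactions per second. */
static double aggregate_rate(double a, double b)
{
    assert(a >= 0.0 && b >= 0.0);
    return a + b;
}

/* Hypothetical helper: fraction of the aggregate obtained by the
 * slower of the two connections. 0.5 means a perfectly even split;
 * values near 0 mean one connection is being starved. */
static double slower_share(double a, double b)
{
    double slow = (a < b) ? a : b;
    double total = aggregate_rate(a, b);

    return (total > 0.0) ? slow / total : 0.0;
}
```

Fed with the first pair under default polling, aggregate_rate(2294.65, 10003.66) is about 12298 transactions per second and the slower connection gets less than a fifth of it. For the first pair with recv_poll and send_poll set to 0, the aggregate is about 12266 and the split is essentially even. On these figures the total barely moves; only its distribution between the two connections changes.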
For additional grins I pinned each netperf/netserver to its own CPU,
with the send_poll and recv_poll put back to defaults (unloaded and
reloaded the ib_sdp module)

[root at hpcpc106 netperf2_work]# for j in 1 2 3 4; do echo; for i in 1 3 ; do netperf -T $i -H 192.168.$i.107 -B "port $i" -l 60 -P 0 -v 0 -t SDP_RR -- -s 1M -S 1M -r 64K -b 12 & done;wait;done

10108.65 port 1
2187.80 port 3

7754.14 port 3
4541.81 port 1

7013.78 port 3
5282.01 port 1

6499.44 port 3
5796.42 port 1

And I still see this apparent starvation of one of the connections,
although it isn't (overall) as bad as without the binding so I guess it
isn't anything one can work around via CPU binding trickery.

Is this behaviour expected?

rick jones

From stan.smith at intel.com Thu Jul 12 14:51:49 2007
From: stan.smith at intel.com (Smith, Stan)
Date: Thu, 12 Jul 2007 14:51:49 -0700
Subject: [ofa-general] WinOF 1.0 (Windows OpenFabrics) is available
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DC01667387@orsmsx413.amr.corp.intel.com>

WinOF 1.0 'gold release' is available @
http://www.openfabrics.org/~ardavis/WinOF_1.0/

A hearty 'Thank you' to all who assisted in WinOF 1.0 creation.
Special recognition goes to Erez Cohen for being patient.

Stan.

From cebbert at redhat.com Thu Jul 12 14:56:59 2007
From: cebbert at redhat.com (Chuck Ebbert)
Date: Thu, 12 Jul 2007 17:56:59 -0400
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <20070712143501.2c2cdf1f.akpm@linux-foundation.org>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
 <1183422700.3130.27.camel@localhost.localdomain>
 <200707041611.30056.hnguyen@linux.vnet.ibm.com>
 <1184097931.3020.73.camel@localhost.localdomain>
 <20070712143501.2c2cdf1f.akpm@linux-foundation.org>
Message-ID: <4696A3AB.2020602@redhat.com>

On 07/12/2007 05:35 PM, Andrew Morton wrote:
>>
>> With this patch, idr.c should work as advertised allocating id
>> values in the range 0...0x7fffffff.
Andrew had speculated that
>> it should allow the full range 0...0xffffffff to be used. I was
>> tempted to make changes to allow this, but it would require changes
>> to API, e.g. making the starting id value and the return value
>> unsigned.
>
> Problem. There are a large number of IDR changes pending and this
> patch breaks in a way which I am not at all confident in fixing.
>
> Ordinarily I'd just dump the earlier patches because bugfixes come
> first. But this time there's a very large dependency trail on the
> earlier patches (especially Tejun's extensive sysfs rework in Greg's
> driver tree) so the wreckage would be extensive.
>
> Also, it's possible that Tejun's changes already fixed some of the things
> which you fixed. Or added new bugs ;)
>
> Bottom line: a reworked patch against 2.6.22-rc6-mm1 would be muchly
> appreciated if poss, please.
>
> While you're there, it would be helpful if you could review all these
> pending IDR changes:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-ida-implement-idr-based-id-allocator.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-fix-obscure-bug-in-allocation-path.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/gregkh-driver-idr-separate-out-idr_mark_full.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_for_each-fix.patch
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/lib-add-idr_remove_all.patch
>

The first three just got merged into mainline...
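
The first two bugs in Jim Houston's description (shifts by a count of at least the word size, and a missing top layer causing id 0x40000000 to alias id 0) can be illustrated with a small stand-alone C sketch. This is not the lib/idr.c code and not the patch under discussion: every name below is invented for illustration, and the figure of 5 index bits per layer is an assumption for a 32-bit build.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed for illustration only: index bits consumed per tree layer. */
#define SKETCH_BITS_PER_LAYER 5u

/* Shift that is defined for any count: once the count reaches the
 * width of the type, the result is 0 instead of undefined behaviour. */
static uint32_t safe_shl32(uint32_t value, unsigned int count)
{
    if (count >= sizeof(value) * CHAR_BIT)
        return 0;
    return value << count;
}

/* Models a CPU that only looks at the low 5 bits of the shift count,
 * as the thread describes for i386: a shift by 32 acts like a shift
 * by 0. */
static uint32_t masked_shl32(uint32_t value, unsigned int count)
{
    return value << (count & 31u);
}

/* Number of ids a tree with `layers` levels can address, capped at
 * 2^31 to mirror the 0...0x7fffffff range mentioned in the thread. */
static uint32_t ids_covered(unsigned int layers)
{
    unsigned int bits = layers * SKETCH_BITS_PER_LAYER;

    assert(layers >= 1);
    if (bits >= 31)
        return 0x80000000u;
    return safe_shl32(1u, bits);
}

/* Slot actually reached when an id is looked up in a tree that has
 * only `layers` levels: the id bits above the covered range are
 * silently dropped, which is the aliasing effect. */
static uint32_t slot_in_tree(uint32_t id, unsigned int layers)
{
    return id & (ids_covered(layers) - 1u);
}
```

Under these assumptions a six-layer tree covers 30 bits, so slot_in_tree(0x40000000, 6) lands on slot 0, while with a seventh layer the same id keeps its own slot. Likewise masked_shl32(1, 32) yields 1 where a naive reading expects 0, which is why code that computes a layer's span with a plain shift needs an explicit bound check such as the one in safe_shl32.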
From rdreier at cisco.com Thu Jul 12 15:28:25 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 15:28:25 -0700 Subject: [ofa-general] Re: [PATCH 06/13] IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2 In-Reply-To: (Christoph Raisch's message of "Tue, 10 Jul 2007 18:35:49 +0200") References: Message-ID: > > What decides if a QP is LL or not? > Currently we use a high bit in the QP type, which is not how we > want to keep it permanently. What would you suggest, add two > additional LL QP types, or change something more fundamental in > libibverbs and kernel ib core? We think we can get along quite > well with the existing parameters in the current create QP. The > current user-kernel interface is ok for these new QPs for post_send > + post_recv, but unfortunately the libibverbs userspace calls don't > match exactly how the LL queues are to be used. We would need > something like the LL QP interface in libehca in libibverbs to keep > that interface generic. Yes, using the high bit of the QP type is yucky. If there's no need for LL QPs in the kernel, then at least the internal part (libehca -> ehca driver) could be cleaned up by using a flag in the create_qp udata. I think that's worth doing. I also think it's worth exposing some more flags for the libibverbs ibv_create_qp function. mlx4 could potentially use a hint from the user that certain QPs want low latency, so we could share this with ehca. But I'm not sure I know what you mean by "how the LL queues are to be used". Could you expand on that? I assume it has something to do with ehcau_send_wr_trigger(), ehcau_recv_wr_trigger() etc. but I don't know what they do. Having libehca export functions that are called directly by applications definitely seems wrong to me. > We didn't see a usage yet for LL QP in kernel, so maybe we should continue > that discussion on general at openfabrics only. Makes sense, removed other CCs... - R. 
From rdreier at cisco.com Thu Jul 12 15:39:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 15:39:47 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: implement query-srq In-Reply-To: <200706211303.11949.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 13:03:11 +0300") References: <200706211303.11949.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Thu Jul 12 15:39:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 15:39:55 -0700 Subject: [ofa-general] Re: [PATCH 1 of 2] mlx4: implement query-qp In-Reply-To: <200706211227.47794.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 12:27:47 +0300") References: <200706211227.47794.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Thu Jul 12 15:43:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 15:43:52 -0700 Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp In-Reply-To: <200706211229.08703.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 12:29:08 +0300") References: <200706211229.08703.jackm@dev.mellanox.co.il> Message-ID: > + init_attr->cap.max_recv_wr = mqp->rq.max_post; > + init_attr->cap.max_recv_sge = mqp->rq.max_gs; Why do we have to reset these in userspace? Doesn't the kernel already give us correct info for the receive queue? - R. 
From rdreier at cisco.com Thu Jul 12 15:57:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 15:57:36 -0700 Subject: [ewg] Re: [ofa-general] Moving On In-Reply-To: <1184267369.13276.12705.camel@hal.voltaire.com> (Hal Rosenstock's message of "12 Jul 2007 15:09:30 -0400") References: <1184191631.17622.128348.camel@hal.voltaire.com> <1184267369.13276.12705.camel@hal.voltaire.com> Message-ID: OK, I'll merge this upstream: diff --git a/MAINTAINERS b/MAINTAINERS index 96a174b..336edd9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1850,7 +1850,7 @@ M: rolandd at cisco.com P: Sean Hefty M: mshefty at ichips.intel.com P: Hal Rosenstock -M: halr at voltaire.com +M: hal.rosenstock at gmail.com L: general at lists.openfabrics.org W: http://www.openib.org/ T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git From rdreier at cisco.com Thu Jul 12 16:07:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 16:07:59 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the first batch of changes for the 2.6.23 merge window: Andrew Morton (1): IB: Fix ib_umem_get() when npages == 0 Arthur Jones (3): IB/ipath: Update MAINTAINERS entry IB/ipath: Test interrupts at driver startup IB/ipath: Remove bogus RD_ATOMIC checks from modify_qp Bryan O'Sullivan (1): IB/ipath: Include to fix ppc64 build Dave Olson (5): IB/ipath: Support the IBA6110 revision 4 IB/ipath: Fix the mtrr_add args for chips with 2 buffer sizes IB/ipath: Use S_ABORT not cancel and abort on exit freeze mode after recovery IB/ipath: Be more cautious about coming out of freeze mode IB/ipath: Change version wording to be less confusing with release number Dotan Barak (2): mlx4_core: Get the maximum 
message size from reported device capabilities IB/core: Take sizeof the correct pointer when calling kmalloc() Hal Rosenstock (1): IB/mad: Enhance SMI for switch support Hoang-Nam Nguyen (3): IB/ehca: Change scaling_code parameter description to match default value IB/ehca: Report RDMA atomic attributes in query_qp() IB/ehca: Improve latency by unlocking after triggering the hardware Jack Morgenstein (2): IB/mlx4: Implement query QP IB/mlx4: Implement query SRQ Jan Engelhardt (1): IB: Use menuconfig for InfiniBand menu Joachim Fenkes (9): IB/ehca: Refactor "maybe missed event" code IB/ehca: HW level, HW caps and MTU autodetection IB/ehca: QP code restructuring in preparation for SRQ IB/ehca: add Shared Receive Queue support IB/ehca: Lock renaming, static initializers IB/ehca: Refactor sync between completions and destroy_cq using atomic_t IB/ehca: Change idr spinlocks into rwlocks IB/ehca: Return QP pointer in poll_cq() IB/ehca: Notify consumers of LID/PKEY/SM changes after nondisruptive events Joan Eslinger (1): IB/ipath: Change use of constants for TID type to defined values John Gregor (2): IB/ipath: Remove incompletely implemented ipath_runtime flags and code IB/ipath: Update copyright dates Mark Debbage (2): IB/ipath: Correct checking of swminor version field when using subports IB/ipath: Make handling of one subport consistent Michael Albaugh (4): IB/ipath: Support blinking LEDs with an led_override file IB/ipath: Lock and always use shadow copies of GPIO register IB/ipath: Log "active" time and some errors to EEPROM IB/ipath: Add capability to modify PBC word Michael S. 
Tsirkin (2): IB/mlx4: Include linux/mutex.h from mlx4_ib.h mlx4_core: Include linux/mutex.h from mlx4.h Ralph Campbell (10): IB/ipath: Fix problem with next WQE after a UC completion IB/ipath: Fix local loopback bug when waiting for resources IB/ipath: Set M bit in BTH according to IB spec IB/ipath: Fix RDMA read retry code IB/ipath: Wait for PIO available interrupt IB/ipath: Fix possible data corruption if multiple SGEs used for receive IB/ipath: Duplicate RDMA reads can cause responder to NAK inappropriately IB/ipath: Add barrier before updating WC head in shared memory IB/ipath: Lower default number of kernel send buffers IB/ipath: Remove support for preproduction HTX InfiniPath cards Robert Walsh (5): IB/ipath: Fix maximum MTU reporting IB/ipath: Fill in some missing FMR-related fields in query_device IB/ipath: Send ACK invalid where appropriate IB/ipath: ipath_poll fixups and enhancements IB/ipath: Clean send flags properly on QP reset Roland Dreier (5): IB: Remove garbage non-ASCII characters from comments IB: Update mailing list address IPoIB/cm: Fix warning if IPV6 is not enabled IPoIB: Recycle loopback skbs instead of freeing and reallocating IB: Update MAINTAINERS with Hal's new email address Sean Hefty (7): IB/ipath: return correct PortGUID in NodeInfo IB/sa: Make sure SA queries use default P_Key IB/cm: Use spin_lock_irq() instead of spin_lock_irqsave() when possible IB/cm: Include HCA ACK delay in local ACK timeout IB/cm: cm_msgs.h should include ib_cm.h IB/cm: Fix handling of duplicate SIDR REQs IB/cm: Send no match if a SIDR REQ does not match a listen Shani Moideen (1): IB/mthca: Replace memset(, 0, PAGE_SIZE) with clear_page() Stefan Roscher (2): IB/ehca: Support UD low-latency QPs IB/ehca: Set SEND_GRH flag for all non-LL UD QPs on eHCA2 Steve Wise (6): RDMA/cxgb3: Streaming -> RDMA mode transition fixes RDMA/cxgb3: TERMINATE WRs can hang the tx ofld queue RDMA/cxgb3: Don't count neg_adv abort_req_rss messages as real aborts RDMA/cxgb3: ctrl-qp 
init/clear shouldn't set the gen bit RDMA/cxgb3: Don't post TID_RELEASE message RDMA/cxgb3: Don't abort after failures sending the mpa reply WANG Cong (1): RDMA/cxgb3: Check return of kmalloc() in iwch_register_device() MAINTAINERS | 15 +- drivers/infiniband/Kconfig | 15 +- drivers/infiniband/core/agent.c | 19 +- drivers/infiniband/core/cm.c | 247 ++++--- drivers/infiniband/core/cm_msgs.h | 1 + drivers/infiniband/core/cma.c | 1 - drivers/infiniband/core/mad.c | 50 ++- drivers/infiniband/core/multicast.c | 2 +- drivers/infiniband/core/sa.h | 2 +- drivers/infiniband/core/sa_query.c | 87 ++- drivers/infiniband/core/smi.c | 16 +- drivers/infiniband/core/smi.h | 2 + drivers/infiniband/core/sysfs.c | 2 +- drivers/infiniband/core/ucm.c | 1 - drivers/infiniband/core/umem.c | 1 + drivers/infiniband/hw/amso1100/Kconfig | 2 +- drivers/infiniband/hw/cxgb3/Kconfig | 2 +- drivers/infiniband/hw/cxgb3/cxio_hal.c | 6 +- drivers/infiniband/hw/cxgb3/cxio_wr.h | 3 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 108 ++-- drivers/infiniband/hw/cxgb3/iwch_cm.h | 1 + drivers/infiniband/hw/cxgb3/iwch_provider.c | 7 +- drivers/infiniband/hw/cxgb3/iwch_qp.c | 7 +- drivers/infiniband/hw/ehca/Kconfig | 2 +- drivers/infiniband/hw/ehca/ehca_av.c | 6 +- drivers/infiniband/hw/ehca/ehca_classes.h | 75 ++- drivers/infiniband/hw/ehca/ehca_classes_pSeries.h | 4 +- drivers/infiniband/hw/ehca/ehca_cq.c | 50 +- drivers/infiniband/hw/ehca/ehca_hca.c | 61 ++- drivers/infiniband/hw/ehca/ehca_irq.c | 140 +++-- drivers/infiniband/hw/ehca/ehca_irq.h | 1 - drivers/infiniband/hw/ehca/ehca_iverbs.h | 18 + drivers/infiniband/hw/ehca/ehca_main.c | 98 +++- drivers/infiniband/hw/ehca/ehca_qp.c | 751 +++++++++++++++------ drivers/infiniband/hw/ehca/ehca_reqs.c | 85 ++- drivers/infiniband/hw/ehca/ehca_tools.h | 1 + drivers/infiniband/hw/ehca/ehca_uverbs.c | 13 +- drivers/infiniband/hw/ehca/hcp_if.c | 58 +- drivers/infiniband/hw/ehca/hcp_if.h | 1 - drivers/infiniband/hw/ehca/hipz_hw.h | 19 + 
drivers/infiniband/hw/ehca/ipz_pt_fn.h | 28 +- drivers/infiniband/hw/ipath/Kconfig | 2 +- drivers/infiniband/hw/ipath/ipath_common.h | 33 +- drivers/infiniband/hw/ipath/ipath_cq.c | 7 +- drivers/infiniband/hw/ipath/ipath_debug.h | 2 +- drivers/infiniband/hw/ipath/ipath_diag.c | 41 +- drivers/infiniband/hw/ipath/ipath_driver.c | 187 +++++- drivers/infiniband/hw/ipath/ipath_eeprom.c | 303 ++++++++- drivers/infiniband/hw/ipath/ipath_file_ops.c | 205 ++++-- drivers/infiniband/hw/ipath/ipath_fs.c | 9 +- drivers/infiniband/hw/ipath/ipath_iba6110.c | 101 ++-- drivers/infiniband/hw/ipath/ipath_iba6120.c | 92 ++- drivers/infiniband/hw/ipath/ipath_init_chip.c | 26 +- drivers/infiniband/hw/ipath/ipath_intr.c | 141 ++++- drivers/infiniband/hw/ipath/ipath_kernel.h | 85 +++- drivers/infiniband/hw/ipath/ipath_keys.c | 2 +- drivers/infiniband/hw/ipath/ipath_layer.c | 2 +- drivers/infiniband/hw/ipath/ipath_layer.h | 2 +- drivers/infiniband/hw/ipath/ipath_mad.c | 11 +- drivers/infiniband/hw/ipath/ipath_mmap.c | 2 +- drivers/infiniband/hw/ipath/ipath_mr.c | 2 +- drivers/infiniband/hw/ipath/ipath_qp.c | 19 +- drivers/infiniband/hw/ipath/ipath_rc.c | 116 +++- drivers/infiniband/hw/ipath/ipath_registers.h | 2 +- drivers/infiniband/hw/ipath/ipath_ruc.c | 36 +- drivers/infiniband/hw/ipath/ipath_srq.c | 4 +- drivers/infiniband/hw/ipath/ipath_stats.c | 25 +- drivers/infiniband/hw/ipath/ipath_sysfs.c | 43 ++- drivers/infiniband/hw/ipath/ipath_uc.c | 9 +- drivers/infiniband/hw/ipath/ipath_ud.c | 6 +- drivers/infiniband/hw/ipath/ipath_user_pages.c | 2 +- drivers/infiniband/hw/ipath/ipath_verbs.c | 29 +- drivers/infiniband/hw/ipath/ipath_verbs.h | 3 +- drivers/infiniband/hw/ipath/ipath_verbs_mcast.c | 2 +- drivers/infiniband/hw/ipath/ipath_wc_ppc64.c | 2 +- drivers/infiniband/hw/ipath/ipath_wc_x86_64.c | 29 +- drivers/infiniband/hw/mlx4/Kconfig | 1 - drivers/infiniband/hw/mlx4/main.c | 6 +- drivers/infiniband/hw/mlx4/mlx4_ib.h | 4 + drivers/infiniband/hw/mlx4/qp.c | 137 ++++ 
drivers/infiniband/hw/mlx4/srq.c | 18 + drivers/infiniband/hw/mthca/Kconfig | 2 +- drivers/infiniband/hw/mthca/mthca_allocator.c | 2 +- drivers/infiniband/hw/mthca/mthca_eq.c | 2 +- drivers/infiniband/ulp/ipoib/Kconfig | 2 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 4 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 33 +- drivers/infiniband/ulp/iser/Kconfig | 2 +- drivers/infiniband/ulp/srp/Kconfig | 2 +- drivers/net/cxgb3/version.h | 2 +- drivers/net/mlx4/fw.c | 3 + drivers/net/mlx4/fw.h | 1 + drivers/net/mlx4/main.c | 1 + drivers/net/mlx4/mlx4.h | 1 + drivers/net/mlx4/qp.c | 21 + drivers/net/mlx4/srq.c | 30 + include/linux/mlx4/device.h | 2 + include/linux/mlx4/qp.h | 3 + include/rdma/ib_cm.h | 1 - include/rdma/ib_mad.h | 3 + 100 files changed, 2812 insertions(+), 1061 deletions(-) From rdreier at cisco.com Thu Jul 12 16:15:58 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 16:15:58 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: (Roland Dreier's message of "Thu, 12 Jul 2007 16:07:59 -0700") References: Message-ID: As you can see, I just sent my first 2.6.23 pull request for Linus. There are still a few more things I plan to do in before the merge window closes (in ~10 days): - Write a patch to add P_Key handling to user_mad in the way we discussed (add an ioctl to enable P_Key mode without breaking old apps) -- I hope to do this tomorrow so we can get some review and testing before merging it. - Take a look at Sean's local SA caching patches. I merged everything else from Sean's tree, but I'm still undecided about these. I haven't read them carefully yet, but even aside from that I don't have a good feeling about whether there's consensus about this yet. Any opinions about merging, for or against, would be appreciated here. - Merge up pending hardware driver changes, including the cxgb3 and ehca patches I have in my queue, plus Jack's catastrophic error patch for mlx4. 
- Try to get to resolution on the IPoIB "CM without SRQ" solution. Also, if there's something I didn't list and didn't already include in the tree I asked Linus to pull, please remind me. I probably dropped it. - R. From rdreier at cisco.com Thu Jul 12 16:19:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 12 Jul 2007 16:19:47 -0700 Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation (for IPoIB CM) In-Reply-To: <469680DB.6000602@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 12 Jul 2007 12:28:27 -0700") References: <469680DB.6000602@linux.vnet.ibm.com> Message-ID: > In the absence of any further discussions about the IPoIB CM without SRQ > patches, I will incorporate Sean Hefty's comments and plan to resubmit > the patches, unless I hear something soon. Sorry for not devoting enough time to this, but something always seems to come up, and I really want to be able to focus a concentrated chunk of time on this, and I never seem to be able to. Anyway, I would prefer to find a solution that everyone can agree on, without me having to rule by decree. I think updating the patch is a good idea. Although I didn't get a chance to review it carefully there were a number of obvious messy parts that should be cleaned up. I am beginning to think that your basic approach is probably right, but I also still think it should be possible to handle both SRQ and non-SRQ without any overhead on the fast path. I don't understand the "maintainability" argument against doing this. Can you expand on your position a little? Thanks, Roland From fidgetvq563 at phentermine.com Thu Jul 12 19:56:19 2007 From: fidgetvq563 at phentermine.com (Leroy Thorpe) Date: Thu, 12 Jul 2007 23:56:19 -0300 Subject: [ofa-general] Re. Message-ID: <516398503.41827932936671@phentermine.com> An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Thu Jul 12 17:17:37 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Jul 2007 20:17:37 -0400 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: References: Message-ID: <1184285856.13276.34352.camel@hal.voltaire.com> On Thu, 2007-07-12 at 19:15, Roland Dreier wrote: > As you can see, I just sent my first 2.6.23 pull request for Linus. > There are still a few more things I plan to do in before the merge > window closes (in ~10 days): > > - Write a patch to add P_Key handling to user_mad in the way we > discussed (add an ioctl to enable P_Key mode without breaking old > apps) -- I hope to do this tomorrow so we can get some review and > testing before merging it. Unfortunately, I'll mostly just be able to review it. Not sure how much testing I will be able to do but we'll see... -- Hal > - Take a look at Sean's local SA caching patches. I merged > everything else from Sean's tree, but I'm still undecided about > these. I haven't read them carefully yet, but even aside from that > I don't have a good feeling about whether there's consensus about > this yet. Any opinions about merging, for or against, would be > appreciated here. > > - Merge up pending hardware driver changes, including the cxgb3 and > ehca patches I have in my queue, plus Jack's catastrophic error > patch for mlx4. > > - Try to get to resolution on the IPoIB "CM without SRQ" solution. > > Also, if there's something I didn't list and didn't already include in > the tree I asked Linus to pull, please remind me. I probably dropped it. > > - R. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Jul 12 18:14:27 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 12 Jul 2007 18:14:27 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: References: Message-ID: <4696D1F3.2040507@ichips.intel.com> > - Take a look at Sean's local SA caching patches. I merged > everything else from Sean's tree, but I'm still undecided about > these. I haven't read them carefully yet, but even aside from that > I don't have a good feeling about whether there's consensus about > this yet. Any opinions about merging, for or against, would be > appreciated here. Obviously I'm biased here, but we've definitely seen local caching of path records (PR) greatly improve performance for large MPI job runs. (Our largest jobs wouldn't run without it.) The development of the feature was requested and paid for by the US national labs. Infinicon/Silverstorm/QLogic also had this feature in their IB stack for scalability reasons as well. PR caching is done in the stack today by IPoIB. The implementation is hidden under the current kernel ib_sa interface, is disabled by default, and automatically fails over to standard PR queries if needed. Removing the cache later should be fairly easy. But to be fair, it will be difficult to enable both QoS and local PR caching. To me, this would be the strongest reason against using it. However, QoS places additional burden on the SA, which will make scaling even more challenging. - Sean From thanhviet_25 at yahoo.com Thu Jul 12 19:57:56 2007 From: thanhviet_25 at yahoo.com (CONG TY BAT DONG SAN DAI GIA VIET) Date: Fri, 13 Jul 2007 09:57:56 +0700 Subject: [ofa-general] CAN HO CAO CAP THE MANSION_ CO HOI LY TUONG DE DAU TU & AN CU...!!! 
Message-ID: <20070713025831.84B83E6038A@openfabrics.org>

An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TMansion 314k email.jpg
Type: image/jpeg
Size: 322536 bytes
Desc: not available
URL: 

From htejun at gmail.com Thu Jul 12 20:46:53 2007
From: htejun at gmail.com (Tejun Heo)
Date: Fri, 13 Jul 2007 12:46:53 +0900
Subject: [ofa-general] Re: [PATCH] fix idr_get_new_above id alias bugs
In-Reply-To: <20070712143501.2c2cdf1f.akpm@linux-foundation.org>
References: <200707021919.27251.hnguyen@linux.vnet.ibm.com>
	<1183422700.3130.27.camel@localhost.localdomain>
	<200707041611.30056.hnguyen@linux.vnet.ibm.com>
	<1184097931.3020.73.camel@localhost.localdomain>
	<20070712143501.2c2cdf1f.akpm@linux-foundation.org>
Message-ID: <4696F5AD.1050306@gmail.com>

Hello,

Andrew Morton wrote:
>> Hoang-Nam Nguyen reported a bug in idr_get_new_above()
>> which occurred with a starting id value like 0x3ffffffc.
>> His test module easily reproduced the problem. Thanks.
>>
>> The test revealed the following bugs:
>>
>> 1. Relying on shift operations which have undefined results
>> e.g.: 1 << n where n > word size. On i386 an integer shift
>> only uses the low 5 bits of the shift count.
>>
>> 2. An off by one error which prevented the top most layer
>> of the radix tree from being allocated. This meant that
>> sub_alloc() would allocate an entry in the existing portion
>> of the radix tree which aliased the requested address. When
>> it tried to allocate id 0x40000000, it might use the slot
>> belonging to id 0.
>>
>> 3. There was also a failure in the code which walked back up
>> the tree if an allocation failed. The normal case is to
>> descend the tree checking the starting id value against the
>> bitmap at each level. If the bit is set, we know that the
>> entire sub-tree is full and we can short cut the search.
>> We may still descend to the lowest level and find that the
>> portion of the id space we want is full. In this case we
>> need to walk back up the tree and continue the search.
>> The existing code just returned to the previous level and
>> continued. This resulted in an attempt to allocate an id
>> above 0x3ffffffc using the slot for id 0x3ffffc00 instead of
>> 0x40000000 which it then claimed to have allocated. The same
>> problem occurs with 0x3ff as the requested id value if it
>> is already in use.

The third one sounds like the bug I fixed. With it fixed, I verified
that idr works correctly at least in the lower range of allocation by
running it in parallel with a simple bitmap allocator, but I haven't
tested a higher range like 0x3ffffffc.

-- 
tejun

From mst at dev.mellanox.co.il Thu Jul 12 22:47:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 13 Jul 2007 08:47:11 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: 
References: 
Message-ID: <20070713054711.GA21709@mellanox.co.il>

> Also, if there's something I didn't list and didn't already include in
> the tree I asked Linus to pull, please remind me. I probably dropped it.

Any plans to do something with multiple EQ support in mthca?

-- 
MST

From FENKES at de.ibm.com Fri Jul 13 01:26:39 2007
From: FENKES at de.ibm.com (Joachim Fenkes)
Date: Fri, 13 Jul 2007 10:26:39 +0200
Subject: [ofa-general] Re: [PATCH 00/10] IB/ehca: Multiple Event Queues, MR/MW rework, large page MRs, fixes
In-Reply-To: 
Message-ID: 

> > [09/10] fixes a lot of checkpatch.pl warnings
>
> Are these warnings from earlier patches in the series, or problems
> that already existed in the code? If they are coming from other
> patches in the series, please just fix the earlier patches before I
> merge them.

Nam did a diff -Nurp empty_dir ehca | checkpatch.pl and fixed all the
existing problems in the code. That's why this is such a big hunk - we've
been doing the pointer-typecast thing wrong for a long time, for example.
Joachim

From vlad at lists.openfabrics.org Fri Jul 13 02:44:45 2007
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky)
Date: Fri, 13 Jul 2007 02:44:45 -0700 (PDT)
Subject: [ofa-general] ofa_1_2_kernel 20070713-0200 daily build status
Message-ID: <20070713094446.2D1B7E6038A@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git
git_branch: ofed_1_2_c

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.15
Passed on powerpc with linux-2.6.17
Passed on powerpc with linux-2.6.19
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.18
Passed on x86_64 with linux-2.6.5-7.244-smp
Passed on ppc64 with linux-2.6.15
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.12
Passed on ia64 with linux-2.6.18
Passed on powerpc with linux-2.6.12
Passed on ia64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on powerpc with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ia64 with linux-2.6.19
Passed on ppc64 with linux-2.6.19
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp

Failed:
Build failed on i686 with linux-2.6.22-rc7

From ramachandra.kuchimanchi at qlogic.com Fri Jul 13 03:41:48 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 05:41:48 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
Message-ID: 

Hi,

If the sm_lid value from /sys/class/infiniband/mthca0/ports/1/sm_lid is 0x0
(or /sys/class/infiniband/ipath0/ports/1/sm_lid is 0xffff) should it be
considered as an invalid value for an SM LID and should one wait till it
changes to some other value before using that SM LID value in MADs ?
The IB spec says that LID 0x0 is reserved and 0xFFFF is a permissive DLID
value. Does this mean that the SM can never have either 0x0 or 0xFFFF as
an LID ?

Sometimes I have noticed this issue with ibsrpdm when the sm_lid value is
set after some delay.
If I run ibsrpdm immediately after doing a "service openibd start",
ibsrpdm does not give any output. This is because, when ibsrpdm reads the
sm_lid value it gets the value to be 0x0 on mthca (0xffff on ipath) and
when it uses it in the MADs, the MADs timeout.

Regards,
Ram

From halr at voltaire.com Fri Jul 13 06:31:41 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 09:31:41 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: 
References: 
Message-ID: <1184333495.13276.90353.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 06:41, Kuchimanchi, Ramachandra wrote:
> Hi,
>
> If the sm_lid value from /sys/class/infiniband/mthca0/ports/1/sm_lid
> is 0x0 (or /sys/class/infiniband/ipath0/ports/1/sm_lid is 0xffff)
> should it be considered as an invalid value for an SM LID and should
> one wait till it changes to some other value before using that SM LID
> value in MADs ?
> The IB spec says that LID 0x0 is reserved and 0xFFFF is a permissive
> DLID value. Does this mean that the SM can never have either 0x0 or
> 0xFFFF as an LID ?
>
> Sometimes I have noticed this issue with ibsrpdm when the sm_lid value
> is set after some delay. If I run ibsrpdm immediately after doing a
> "service openibd start", ibsrpdm does not give any output. This
> is because, when ibsrpdm reads the sm_lid value it gets the value to
> be 0x0 on mthca (0xffff on ipath) and when it uses it in the MADs,
> the MADs timeout.

Those local values indicate the SM has not yet initialized the SMLID on
those ports. Is your SM running ? Are those ports active when you run
ibsrpdm ?

-- Hal

> Regards,
> Ram

From ramachandra.kuchimanchi at qlogic.com Fri Jul 13 07:22:54 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 09:22:54 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <1184333495.13276.90353.camel@hal.voltaire.com>
Message-ID: 

Hal,

> Those local values indicate the SM has not yet initialized the SMLID on
> those ports. Is your SM running ? Are those ports active when you run
> ibsrpdm ?

Yes the SM is running. I guess I am running ibsrpdm even before the port
is active and that's why it is getting an invalid SM LID value. If I run
it a little later, ibsrpdm works fine.

So I guess there should be a check to see that the port state is active
before reading the SM LID value.

Regards,
Ram

From halr at voltaire.com Fri Jul 13 07:24:32 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 10:24:32 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: 
References: <1184333495.13276.90353.camel@hal.voltaire.com>
Message-ID: <1184336672.13276.94041.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 10:22, Kuchimanchi, Ramachandra wrote:
> Hal,
>
> > Those local values indicate the SM has not yet initialized the SMLID
> > on those ports. Is your SM running ? Are those ports active when you
> > run ibsrpdm ?
>
> Yes the SM is running. I guess I am running ibsrpdm even before the
> port is active and that's why it is getting an invalid SM LID value.
> If I run it a little later, ibsrpdm works fine.
>
> So I guess there should be a check to see that the port state is
> active before reading the SM LID value.

Where ? In ibsrpdm ?

I think the IB spec requirement is that the SMLID needs to be there at
armed (so I think if there is a check it should be armed or beyond).
Some SMs may do it sooner (like INIT) but that is not a requirement.

-- Hal

> Regards,
> Ram
>

From ramachandra.kuchimanchi at qlogic.com Fri Jul 13 07:58:55 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 09:58:55 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <1184333495.13276.90353.camel@hal.voltaire.com>
	<1184336672.13276.94041.camel@hal.voltaire.com>
Message-ID: 

Hal,

> Where ? In ibsrpdm ?

Yes or in general anyone who is reading the SM LID value.

Put another way, how do you know when the SM LID value in
/sys/infiniband/.../sm_lid is the correct value ? Is it
better to check that the value is neither 0x0 nor 0xffff ?

Or do you go by the state of the port (armed or beyond as you mentioned)
and then read the sm_lid value ?

Regards,
Ram

From halr at voltaire.com Fri Jul 13 08:00:46 2007
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Jul 2007 11:00:46 -0400
Subject: [ofa-general] What are the valid values for SM LID ?
In-Reply-To: 
References: <1184333495.13276.90353.camel@hal.voltaire.com>
	<1184336672.13276.94041.camel@hal.voltaire.com>
Message-ID: <1184338843.13276.96386.camel@hal.voltaire.com>

Ram,

On Fri, 2007-07-13 at 10:58, Kuchimanchi, Ramachandra wrote:
> Hal,
>
> > Where ? In ibsrpdm ?
>
> Yes or in general anyone who is reading the SM LID value.
>
> Put another way, how do you know when the SM LID value in
> /sys/infiniband/.../sm_lid is the correct value ? Is it
> better to check that the value is neither 0x0 nor 0xffff ?
>
> Or do you go by the state of the port (armed or beyond as you
> mentioned) and then read the sm_lid value ?

I think there are multiple algorithms that work:
1. If port state >= armed (i.e. armed or active), SMLID is required to be
valid
2. If (SMLID != 0xffff) && (SMLID != 0x0), SMLID is valid

Maybe other algorithms too.

(Same for LID too)

-- Hal

> Regards,
> Ram
>
>
>

From ramachandra.kuchimanchi at qlogic.com Fri Jul 13 08:05:23 2007
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra)
Date: Fri, 13 Jul 2007 10:05:23 -0500
Subject: [ofa-general] What are the valid values for SM LID ?
References: <1184333495.13276.90353.camel@hal.voltaire.com><1184336672.13276.94041.camel@hal.voltaire.com>
	<1184338843.13276.96386.camel@hal.voltaire.com>
Message-ID: 

Hal,

Thanks a lot for the information.

Regards,
Ram

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Fri 7/13/2007 8:30 PM
To: Kuchimanchi, Ramachandra
Cc: general at lists.openfabrics.org
Subject: RE: [ofa-general] What are the valid values for SM LID ?

Ram,

On Fri, 2007-07-13 at 10:58, Kuchimanchi, Ramachandra wrote:
> Hal,
>
> > Where ? In ibsrpdm ?
>
> Yes or in general anyone who is reading the SM LID value.
>
> Put another way, how do you know when the SM LID value in
> /sys/infiniband/.../sm_lid is the correct value ? Is it
> better to check that the value is neither 0x0 nor 0xffff ?
>
> Or do you go by the state of the port (armed or beyond as you
> mentioned) and then read the sm_lid value ?

I think there are multiple algorithms that work:
1. If port state >= armed (i.e. armed or active), SMLID is required to be
valid
2. If (SMLID != 0xffff) && (SMLID != 0x0), SMLID is valid

Maybe other algorithms too.

(Same for LID too)

-- Hal

> Regards,
> Ram
>
>
>
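
[Editorial sketch, not part of the thread.] The two checks Hal lists in the
SM LID thread above could be combined roughly as follows. This is a minimal
sketch under stated assumptions, not what ibsrpdm does: it assumes the sysfs
"state" attribute reads like "4: ACTIVE" and "sm_lid" reads like "0x1", and
that ARMED and ACTIVE are port states 3 and 4. All function and constant
names are hypothetical.

```python
from pathlib import Path

# Assumed numeric port states (ARMED = 3, ACTIVE = 4).
PORT_STATE_ARMED = 3
PORT_STATE_ACTIVE = 4

# LID values the thread treats as "not yet set by the SM".
LID_RESERVED = 0x0000
LID_PERMISSIVE = 0xFFFF


def parse_sm_lid(text):
    """Parse an sm_lid attribute such as '0x1' into an int."""
    return int(text.strip(), 16)


def parse_port_state(text):
    """Parse a state attribute such as '4: ACTIVE' into its numeric part."""
    return int(text.split(":", 1)[0])


def sm_lid_usable(sm_lid, port_state=None):
    """Check 2 always (LID is neither 0x0 nor 0xffff); check 1 only if a
    port state is supplied (port must be armed or active)."""
    if port_state is not None and port_state not in (
        PORT_STATE_ARMED,
        PORT_STATE_ACTIVE,
    ):
        return False
    return sm_lid not in (LID_RESERVED, LID_PERMISSIVE)


def read_sm_lid(device, port, root="/sys/class/infiniband"):
    """Return the SM LID for device/port if it looks usable, else None."""
    port_dir = Path(root) / device / "ports" / str(port)
    state = parse_port_state((port_dir / "state").read_text())
    sm_lid = parse_sm_lid((port_dir / "sm_lid").read_text())
    return sm_lid if sm_lid_usable(sm_lid, state) else None
```

A caller such as a startup script could poll read_sm_lid("mthca0", 1) until
it returns something other than None before sending MADs, instead of reading
sm_lid once right after "service openibd start".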

From pradeeps at linux.vnet.ibm.com Fri Jul 13 09:34:43 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Jul 2007 09:34:43 -0700
Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation (for IPoIB CM)
In-Reply-To: 
References: <469680DB.6000602@linux.vnet.ibm.com>
Message-ID: <4697A9A3.2020706@linux.vnet.ibm.com>

Roland Dreier wrote:
> > In the absence of any further discussions about the IPoIB CM without SRQ
> > patches, I will incorporate Sean Hefty's comments and plan to resubmit
> > the patches, unless I hear something soon.
>
> Sorry for not devoting enough time to this, but something always seems
> to come up, and I really want to be able to focus a concentrated chunk
> of time on this, and I never seem to be able to. Anyway, I would
> prefer to find a solution that everyone can agree on, without me
> having to rule by decree.
>
> I think updating the patch is a good idea. Although I didn't get a
> chance to review it carefully there were a number of obvious messy
> parts that should be cleaned up.
>
> I am beginning to think that your basic approach is probably right,
> but I also still think it should be possible to handle both SRQ and
> non-SRQ without any overhead on the fast path. I don't understand the
> "maintainability" argument against doing this. Can you expand on your
> position a little?
>

I will try to illustrate with an example:

One of the ways to do this is to completely split SRQ and non-SRQ
processing starting in ipoib_poll(). This would eliminate most of the
if (srq) kind of branches. However, there would be a lot of code
duplication. If a bug is discovered in one path, then one needs to fix
that in the other path too.

One way to mitigate this situation is to alter the current SRQ code to
use common code (between SRQ and non-SRQ). However, one might not want
to factor off a few lines of common code into a new function. There may
be several such occurrences of this resulting in code bloat.

If you look back, several weeks ago ipoib_drain_cq() did not exist. This
is another function that calls ipoib_cm_handle_rx_wc(). We would need to
alter this function too to accommodate SRQ and non-SRQ split. In effect,
we have propagated the SRQ and non-SRQ code to functions outside
ipoib_cm.c.

In the future, if IPoIB CM would support UC mode this might mean
additional functions handling the split.

On the other hand, in V6 (and previous versions) of the patch
ipoib_cm_handle_rx_wc() handles the SRQ and non-SRQ paths. Both SRQ and
non-SRQ functionality is contained within ipoib_cm.c.

What we now have is probably one extra branch in the packet handling
path than the minimum (desired) with a lot of common code.

Pradeep

From xma at us.ibm.com Fri Jul 13 10:25:34 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 10:25:34 -0700
Subject: [ofa-general] OFED 1.3 timeline
In-Reply-To: <4693BF47.8070700@mellanox.co.il>
Message-ID: 

Hello Tziporet,

> Full features list will be published in a different mail

Do we limit the features only on the list? I only saw IPoIB-CM w/o SRQ.
My impression was whenever the features go into 2.6.23, then they will
be in ofed-1.3. Are you saying that we only limit the list features into
2.6.23?

We are working on several IPoIB performance improvement patches which
are not on the list. Some of the patches are under test, some of the
patches are going to be submitted soon. They are:

1. skb aggregations for both dev xmit(networking layer) and IPoIB send
2. multiple interrupt vectors in IPoIB for multiple links scalability
3. split CQ and send completion aggregation
4. LRO for IPoIB when generic LRO is available in networking layer.

Some of them might be made on time in ofed-1.3 timeline, some of them
might not. It will depend on our test progress and community review
feedback.

I hope ofed-1.3 won't leave these patches out if they can be made into
2.6.23 on time.

Thanks
Shirley

From rdreier at cisco.com Fri Jul 13 11:14:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 13 Jul 2007 11:14:31 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070713054711.GA21709@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 13 Jul 2007 08:47:11 +0300")
References: <20070713054711.GA21709@mellanox.co.il>
Message-ID: 

> Any plans to do something with multiple EQ support in mthca?

I haven't done any work on it or seen anything from anyone else, so I
expect this will have to wait for 2.6.24.

From xma at us.ibm.com Fri Jul 13 11:50:54 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 11:50:54 -0700
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: 
Message-ID: 

Hello Roland,

> > Any plans to do something with multiple EQ support in mthca?
>
> I haven't done any work on it or seen anything from anyone else, so I
> expect this will have to wait for 2.6.24.

We are working on IPoIB to use multiple EQ for multiple
links/connections scalability. Does this mean this will wait for 2.6.24?

Thanks
Shirley

From xma at us.ibm.com Fri Jul 13 11:56:57 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 11:56:57 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: 
Message-ID: 

Hello Roland,

FYI, we are working on several IPoIB performance improvement patches
which are not on the list. Some of the patches are under test, some of
the patches are going to be submitted soon. They are:

1. skb aggregations for both dev xmit(networking layer) and IPoIB send
   (it will be submitted soon, for both UD and RC mode)
2. multiple interrupt vectors in IPoIB for multiple links scalability
   (working on patch for both UD and RC mode)
3. split CQ and send completion aggregation
   (for both UD and RC mode)
4. LRO for IPoIB when generic LRO is available in networking layer.
   (UD mode only)

Some of them might be made in 2.6.23 timeline, some of them might not,
it depends on our test progress and community review feedback.

Thanks
Shirley

From Don.Kerr at Sun.COM Fri Jul 13 13:50:45 2007
From: Don.Kerr at Sun.COM (Don Kerr)
Date: Fri, 13 Jul 2007 16:50:45 -0400
Subject: [ofa-general] uDAPL Question
In-Reply-To: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com>
References: <1EF1E44200D82B47BD5BA61171E8CE9D0475D260@NT-IRVA-0750.brcm.ad.broadcom.com>
Message-ID: <4697E5A5.1030500@Sun.COM>

Caitlin Bestler wrote:

>Don.Kerr at Sun.COM wrote:
>
>
>>I am working on a uDAPL layer for Open MPI. The situation is
>>if I have more than one port/HCA my users may want to be
>>selective in what is used and to do this they would need to
>>provide some information regarding which port/HCA to use. So
>>my thought is that the users are more familiar with the output
>>from "ifconfig", for example ib0, ib1, etc, and I was trying
>>to find a way to correlate that to what is available from the
>>uDAPL API. Maybe I need to reprogram them to look at dat.conf.
>>
>>-DON
>>
>>
>>
>
>You definitely do not want to parse dat.conf, you want to see
>what the dat_registry has loaded. dat.conf is static, Providers
>are allowed to dynamically adapt how they register themselves.
>I don't believe that is an active concern, but it's simpler to
>take advantage of the existing code and be safe in case somebody
>comes along later and decides to do dynamic registration only.
>
>But you hit the nail on the head in terms of needing to correlate
>devices as reported by "ifconfig" and the Interface Adapter that
>you try to open.
>
>
Which brings us back to one of my original questions which was "is there
a way to get the entire dat.conf entry from the uDAPL API". And what I
am hearing is no, not yet anyway.

Just to take this one more step, and talking about the ofed dat.conf
example now.
Example:
OpenIB-cma u1.2 nonthreadsafe default /usr/local/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""

Since I can get the first field, in this example "OpenIB-cma", from the
ia name attribute of the uDAPL API was the data in the 6th field,
example "ib0 0" considered for the first entry? Or does that just not
make sense?

-DON

>Basically, the intent has always been that the correlation between
>an Interface Adapter and an "ifconfig" entry should be so obvious
>that a complete idiot could figure out which went with which.
>Once that linkage is clear then you merely use the RDMA device/port
>implied by the routing of the device listed by ifconfig.
>
>
>To the best of my knowledge, for every DAPL provider ever created
>the correlation with the IP layer device has indeed been so obvious
>that any idiot could figure it out -- unfortunately software can only
>hope to someday reach that degree of intelligence, and other than
>configuring the links there really isn't much that can be done.
>
>Once there is a link between the RDMA device and the IP layer device,
>you could use the routing tables to determine which port a connection
>request could be received on, which ports could originate a packet with
>a given IP address and which ports could send a packet to a given IP
>destination. Given that, you want the matching RDMA device.
>
>Such a linkage would allow the application to correctly determine
>the exact DAPL Provider that needed to be opened, and only
>that one. Without it the application has to scan the registry list
>and essentially do a serial search. The good news is that it won't
>be a very long serial search and it doesn't have to be performed
>that often.
>
>

From pradeeps at linux.vnet.ibm.com Fri Jul 13 13:58:21 2007
From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana)
Date: Fri, 13 Jul 2007 13:58:21 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4695290F.7090005@hp.com>
References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il>
	<46950756.5090501@hp.com> <4695290F.7090005@hp.com>
Message-ID: <4697E76D.4060706@linux.vnet.ibm.com>

Rick Jones wrote:
>>> Was this data posted on-list? I didn't see it.
>>>
>>
>> Hasn't been. I presume that folks are curious?-)
>
> RedHat Enterprise Linux 5
> Single-Stream Performance
>
>                            Bulk Transfer                    "Latency"
>                       Unidir              Bidir
> Card               Mbit/s  SDx    SDr    Mbit/s  SDx   SDr    Tran/s  SDx    SDr
> -------------------------------------------------------------------------
> AD313A IPoIB 1.1     2970  4.418  4.544    3530  3.59  3.95    19290  n/a    n/a
> AD313A SDP 1.1       7810  0.453  1.048   12820  0.69  0.68    38030  26.29  26.29
> AD313A SDP p0        7810  0.346  0.527   12670  0.42  0.43    19380  n/a    n/a
> AD313A IPoIB 1.2     5510  0.426  1.593    5730  n/a   n/a     18990  n/a    n/a
> AD313A SDP 1.2       7820  0.409  1.047   12890  0.64  0.68    41988  25.89  26.32
> AD313A SDP p0 1.2    7820  0.309  0.517   12760  0.36  0.36    19800  15.47  15.72
>
> netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM,
> SDP_STREAM), -s 1M -S 1M -r 64K -b 12 for the bidirectional [SDP|TCP]_RR
> test, -r 1 for the [TCP|SDP]_RR test.
>

What was the mtu used for these tests?

Pradeep

From rick.jones2 at hp.com Fri Jul 13 14:10:01 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 14:10:01 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4697E76D.4060706@linux.vnet.ibm.com>
References: <4694044D.8010208@hp.com> <20070711061444.GG11320@mellanox.co.il>
	<46950756.5090501@hp.com> <4695290F.7090005@hp.com>
	<4697E76D.4060706@linux.vnet.ibm.com>
Message-ID: <4697EA29.1030307@hp.com>

Pradeep Satyanarayana wrote:
> Rick Jones wrote:
>
>>>> Was this data posted on-list? I didn't see it.
>>>>
>>>
>>> Hasn't been. I presume that folks are curious?-)
>>
>>
>> RedHat Enterprise Linux 5
>> Single-Stream Performance
>>
>>                            Bulk Transfer                    "Latency"
>>                       Unidir              Bidir
>> Card               Mbit/s  SDx    SDr    Mbit/s  SDx   SDr    Tran/s  SDx    SDr
>> -------------------------------------------------------------------------
>> AD313A IPoIB 1.1     2970  4.418  4.544    3530  3.59  3.95    19290  n/a    n/a
>> AD313A SDP 1.1       7810  0.453  1.048   12820  0.69  0.68    38030  26.29  26.29
>> AD313A SDP p0        7810  0.346  0.527   12670  0.42  0.43    19380  n/a    n/a
>> AD313A IPoIB 1.2     5510  0.426  1.593    5730  n/a   n/a     18990  n/a    n/a
>> AD313A SDP 1.2       7820  0.409  1.047   12890  0.64  0.68    41988  25.89  26.32
>> AD313A SDP p0 1.2    7820  0.309  0.517   12760  0.36  0.36    19800  15.47  15.72
>>
>> netperf, -s 1M -S 1M -m 64K on the unidir tests (TCP_STREAM,
>> SDP_STREAM), -s 1M -S 1M -r 64K -b 12 for the bidirectional
>> [SDP|TCP]_RR test, -r 1 for the [TCP|SDP]_RR test.
>>
>
> What was the mtu used for these tests?

The defaults, which are, IIRC, 2044 bytes for 1.1 and 65520 bytes for
1.2. Netperf will convert "1M" to 1048576 and "64K" to 65536.

rick jones
wonders what other numbers are out there...

From caitlinb at broadcom.com Fri Jul 13 14:10:33 2007
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 13 Jul 2007 14:10:33 -0700
Subject: [ofa-general] uDAPL Question
In-Reply-To: <4697E5A5.1030500@Sun.COM>
Message-ID: <1EF1E44200D82B47BD5BA61171E8CE9D048991C2@NT-IRVA-0750.brcm.ad.broadcom.com>

> >
> >But you hit the nail on the head in terms of needing to correlate
> >devices as reported by "ifconfig" and the Interface Adapter that you
> >try to open.
> >
> >
> Which brings us back to one of my original questions which
> was "is there a way to get the entire dat.conf entry from the
> uDAPL API". And what I am hearing is no, not yet anyway.
>
> Just to take this one more step, and talking about the ofed
> dat.conf example now.
> Example:
> OpenIB-cma u1.2 nonthreadsafe default
> /usr/local/lib64/libdaplcma.so
> dapl.1.2 "ib0 0" ""
>
> Since I can get the first field, in this example
> "OpenIB-cma", from the ia name attribute of the uDAPL API was
> the data in the 6th field, example "ib0 0" considered for the
> first entry? Or does that just not make sense?
>

dat_registry_list_providers will give you a list of all registered
providers. If you open and query them you can have all of the info you
require. If you want more info without opening it, I suppose you could
read dat.conf, but I'd strongly suggest figuring out a way to use the
existing code and take advantage of the existing data structures.

Any host platform, such as openfabrics, could adopt a naming convention
that tied the DAT Provider IA Name directly to the underlying device
name(s). DAT, being OS independent, could not mandate any such pattern.
But a specific OS certainly could, and openfabrics is definitely the
place to make such conventions for Linux.

Without such a convention the only way to cross-correlate the DAT IA
name with the underlying transport device is by matching their IP
addresses.

From xma at us.ibm.com Fri Jul 13 14:30:37 2007
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 13 Jul 2007 14:30:37 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4695290F.7090005@hp.com>
Message-ID: 

Rick,

Could you please run netperf/netserver on different CPU with the irq
handler to see any difference?

The bidirectional BW is much different from the unidirectional. We are
working on split CQ, send completion aggregation patch, and will test it
to see how much bidirectional BW improvement on Mellanox later.

Thanks
Shirley

From rick.jones2 at hp.com Fri Jul 13 14:44:39 2007
From: rick.jones2 at hp.com (Rick Jones)
Date: Fri, 13 Jul 2007 14:44:39 -0700
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: 
References: 
Message-ID: <4697F247.7050903@hp.com>

Shirley Ma wrote:
> Rick,
>
> Could you please run netperf/netserver on different CPU with the
> irq handler to see any difference?

I have already done that - what is in the table is the peak performance
out of four different runs where I change the netperf/netserver CPU
binding relative to the interrupt CPU.

For which specific entries in the table would you like to see the four
sets of results? If you can give me the line from the table I can go
back and find the four results I ran to get there.

rick jones

BTW, what is "general-bounces" - it seems to have been one of the emails
in the dist...

> The bidirectional BW is much different from the unidirectional. We are
> working on split CQ, send completion aggregation patch, and will test it
> to see how much bidirectional BW improvement on Mellanox later.
>
> Thanks
> Shirley

From jgunthorpe at obsidianresearch.com Fri Jul 13 15:05:48 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 13 Jul 2007 16:05:48 -0600
Subject: [ofa-general] Re: should it be possible to run SDP over a T320?
In-Reply-To: <4697F247.7050903@hp.com>
References: <4697F247.7050903@hp.com>
Message-ID: <20070713220548.GU13618@obsidianresearch.com>

On Fri, Jul 13, 2007 at 02:44:39PM -0700, Rick Jones wrote:

> BTW, what is "general-bounces" - it seems to have been one of the emails in
> the dist...

general-bounces is the 'Envelope From' for all messages from the list
server. This address is generally hidden from all MUA's and any that
manages to stick it in a CC list is severely broken..
Sending email to the -bounces address of the list could get you unsubscribed; don't do it :) Jason
From xma at us.ibm.com Fri Jul 13 15:09:16 2007 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 13 Jul 2007 15:09:16 -0700 Subject: [ofa-general] Re: should it be possible to run SDP over a T320? In-Reply-To: <4697F247.7050903@hp.com> Message-ID: Rick, >For which specific entries in the table would you like to see the four sets of results? I am only interested in IPoIB at this moment for both ofed-1.1 and ofed-1.2. Is the device PCI-X or PCI-e based? Thanks Shirley
From rick.jones2 at hp.com Fri Jul 13 15:30:08 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 13 Jul 2007 15:30:08 -0700 Subject: [ofa-general] Re: should it be possible to run SDP over a T320? In-Reply-To: <20070713220548.GU13618@obsidianresearch.com> References: <4697F247.7050903@hp.com> <20070713220548.GU13618@obsidianresearch.com> Message-ID: <4697FCF0.7060400@hp.com> Jason Gunthorpe wrote: > On Fri, Jul 13, 2007 at 02:44:39PM -0700, Rick Jones wrote: > > >>BTW, what is "general-bounces" - it seems to have been one of the emails in >>the dist... > > > general-bounces is the 'Envelope From' for all messages from the list > server. This address is generally hidden from all MUAs and any that > manages to stick it in a CC list is severely broken.. > > Sending email to the -bounces address of the list could get you > unsubscribed; don't do it :) Well, I guess I got at least one sent OK since I got this from you, but I'll try to remember to strip general-bounces in the future should I see it again. rick
From rick.jones2 at hp.com Fri Jul 13 15:45:24 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 13 Jul 2007 15:45:24 -0700 Subject: [ofa-general] Re: should it be possible to run SDP over a T320? In-Reply-To: References: Message-ID: <46980084.2000802@hp.com> > I am only interested in IPoIB at this moment for both ofed-1.1 and > ofed-1.2. Is the device PCI-X or PCI-e based?
Well, I guess that's better than "everything" :) but it is still a trifle broad. Anyway, I'll suppress my "sending to another .com" paranoia by reminding myself that all this is shipping :) and include the results here.

The device is PCIe. lspci shows:

03:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)

RHEL5 included OFED 1.1 bits:

rx2660 to rx2660, rhel5, AD313A, IPoIB and now SDP, same sysctl settings, irqbalance killed to keep things from moving around, interrupts on the HCA now on cpu0 on one system and cpu 1 on the other

[here are the sysctl.conf settings:
[root at hpcpc107 ~]# tail /etc/sysctl.conf
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 2147483648
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_wmem = 4096 87380 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.conf.default.arp_ignore = 1
net.ipv4.conf.default.arp_filter = 1
]

[ the first number is the CPU to which netperf is bound, the second is the CPU to which netserver is bound. the systems under test had _four_ cores, which means that when netperf reports 25% CPU util it means the equivalent of a full core was consumed etc etc ]

single-connection, unidirectional TCP_STREAM 1Mx64:
[root at hpcpc106 netperf2_work]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_STREAM -i 30,3 -l 30 -- -s 1M -S 1M -m 64K`; done; done
0 0 2097152 2097152 65536 30.00 2382.94 25.00 37.29 3.438 5.128
0 1 2097152 2097152 65536 30.00 2315.88 19.97 25.03 2.826 3.542
1 0 2097152 2097152 65536 30.00 2974.46 40.10 41.25 4.418 4.544 *
1 1 2097152 2097152 65536 30.00 2358.11 27.39 25.01 3.807 3.476

[NOTE NOTE NOTE - the units here are still transactions per second! so, to get to mbit/s multiply by 2x65536x8 and divide by 1000000...
To get the service demand in usec per KB transferred, divide the service demand by 128 since that was the number of KB transferred per transaction]

single-connection, bidirectional TCP_RR 1Mx64x, ad313a hca in x8 slot
[root at hpcpc106 netperf2_work]# for i in 0 1 ; do for j in 0 1 ; do echo $i $j `netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_RR -i 30,3 -l 30 -- -s 1M -S 1M -r 64K -b 12`; done; done
0 0 2097152 2097152 65536 65536 30.00 2485.29 25.01 31.45 402.595 506.196 2097152 2097152
0 1 2097152 2097152 65536 65536 30.00 2414.18 23.38 23.33 387.354 386.507 2097152 2097152
1 0 2097152 2097152 65536 65536 30.00 3368.20 38.75 38.72 460.153 459.788 2097152 2097152 *
1 1 2097152 2097152 65536 65536 30.00 2504.03 31.54 25.05 503.753 400.236 2097152 2097152

[NOTE NOTE NOTE - when netperf reports a confidence of 20.7% it means +/- 10.35%]

single-connection, single-byte, TCP_RR, ad313a hca in x8 slot:
[root at hpcpc106 netperf2_work]# for i in 0 1 ; do for j in 0 1 ; do echo $i $j `netperf -P 0 -T $i,$j -c -C -H 192.168.0.107 -t TCP_RR -i 30,3 -l 30 `; done; done
0 0 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.1% !!! Local CPU util : 20.7% !!! Remote CPU util : 13.7% 87380 87380 1 1 30.00 15743.40 4.84 10.08 12.293 25.610 87380 87380
0 1 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.4% !!! Local CPU util : 59.3% !!! Remote CPU util : 51.1% 87380 87380 1 1 30.00 19298.77 4.70 7.09 9.751 14.694 87380 87380
1 0 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!!
must be investigated before going further. !!! Confidence intervals: Throughput : 0.2% !!! Local CPU util : 28.6% !!! Remote CPU util : 34.4% 87380 87380 1 1 30.00 13016.11 6.15 6.57 18.912 20.195 87380 87380
1 1 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.1% !!! Local CPU util : 8.7% !!! Remote CPU util : 23.4% 87380 87380 1 1 30.00 15375.13 9.93 6.30 25.839 16.393 87380 87380

And now the OFED 1.2 bits I installed overtop of the 1.1 stuff which shipped with RHEL5

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, TCP_STREAM 1Mx64K. CPU 0 taking interrupts, IB switch in place:
[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T $i,$j -t TCP_STREAM -H 192.168.1.107 -c -C -l 30 -i 30,3 -- -s 1M -S 1M -m 64K`; done; done
0 0 2097152 2097152 65536 30.00 5227.08 6.19 25.00 0.388 1.568
0 1 2097152 2097152 65536 30.00 5449.90 6.47 26.77 0.389 1.610
1 0 2097152 2097152 65536 30.00 5235.90 6.70 25.01 0.420 1.565
1 1 2097152 2097152 65536 30.00 5511.77 7.16 26.80 0.426 1.593 *

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, bidirectional TCP_RR 1Mx64Kx12, CPU 0 taking interrupts, IB switch in place:
[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.1.107 -c -C -l 30 -i 30,3 -- -s 1M -S 1M -r 64K -b 12`; done; done
0 0 2097152 2097152 65536 65536 30.00 5314.44 16.13 16.08 121.431 121.049 2097152 2097152
0 1 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.3% !!! Local CPU util : 20.4% !!!
Remote CPU util : 48.2% 2097152 2097152 65536 65536 30.00 5384.71 17.24 23.42 128.082 174.245 2097152 2097152
1 0 2097152 2097152 65536 65536 30.00 5388.18 17.06 16.27 126.619 120.784 2097152 2097152
1 1 !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 0.3% !!! Local CPU util : 45.3% !!! Remote CPU util : 0.3% 2097152 2097152 65536 65536 30.00 5469.22 22.58 17.08 165.328 124.947 2097152 2097152 *

RHEL5 rx2660 to rx2660, AD313A, OFED 1.2 GA software, TCP_RR, CPU 0 taking interrupts, IB switch in place:
[root at hpcpc106 ~]# for i in 0 1; do for j in 0 1; do echo $i $j `netperf -P 0 -T $i,$j -t TCP_RR -H 192.168.1.107 -l 30 -i 30,3`; done; done
0 0 87380 87380 1 1 30.00 18990.16 87380 87380 *
0 1 87380 87380 1 1 30.00 14985.03 87380 87380
1 0 87380 87380 1 1 30.00 15045.17 87380 87380
1 1 87380 87380 1 1 30.00 12408.56 87380 87380

(I didn't bother asking for CPU util in the single-byte TCP_RR tests because I knew that the confidence intervals wouldn't be met and it would only lengthen the runtime)

Sorry that the confidence interval warnings make things hard to read there.
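For anyone wanting to redo the arithmetic from the NOTE about transactions per second, here is a minimal sketch. It assumes the note applies to the bidirectional TCP_RR runs with 64K requests and responses (-r 64K), so each transaction moves 2 x 65536 bytes, i.e. 128 KB. The function names are made up for illustration and are not part of netperf.

```python
# Sketch only: converts netperf TCP_RR figures as described in the NOTE,
# assuming 65536-byte requests and 65536-byte responses per transaction.

def tcp_rr_to_mbit_per_s(trans_per_s, request_bytes=65536, response_bytes=65536):
    """Transactions/s -> Mbit/s (10^6 bits), counting both directions."""
    bits_per_transaction = (request_bytes + response_bytes) * 8
    return trans_per_s * bits_per_transaction / 1_000_000


def service_demand_per_kb(usec_per_transaction, request_bytes=65536, response_bytes=65536):
    """Service demand in usec/transaction -> usec per KB transferred."""
    kb_per_transaction = (request_bytes + response_bytes) / 1024  # 128 for 64K/64K
    return usec_per_transaction / kb_per_transaction


if __name__ == "__main__":
    # Best OFED 1.1 bidirectional row from the table: 3368.20 trans/s, 460.153 usec/tran
    print(f"{tcp_rr_to_mbit_per_s(3368.20):.2f} Mbit/s")
    print(f"{service_demand_per_kb(460.153):.3f} usec/KB")
```

With the figures quoted in the table, the best OFED 1.1 bidirectional row works out to roughly 3532 Mbit/s and the best OFED 1.2 row (5469.22 trans/s) to roughly 5735 Mbit/s.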
rick jones > > Thanks > Shirley
From vlad at lists.openfabrics.org Sat Jul 14 02:44:16 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 14 Jul 2007 02:44:16 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070714-0200 daily build status Message-ID: <20070714094416.9E511E6084E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with
linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7
From mst at dev.mellanox.co.il Sat Jul 14 10:54:25 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sat, 14 Jul 2007 20:54:25 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: References: <20070713054711.GA21709@mellanox.co.il> Message-ID: <20070714175425.GA17597@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Further 2.6.23 merge plans... > > > Any plans to do something with multiple EQ support in mthca? > > I haven't done any work on it or seen anything from anyone else, so I > expect this will have to wait for 2.6.24. I'm surprised to hear this.
How about this: http://lists.openfabrics.org/pipermail/general/2007-May/035757.html -- MST
From jackm at dev.mellanox.co.il Sun Jul 15 00:28:23 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 15 Jul 2007 10:28:23 +0300 Subject: [ofa-general] Re: [PATCH 1 of 2] mlx4: implement query-qp In-Reply-To: References: <200706211227.47794.jackm@dev.mellanox.co.il> Message-ID: <200707151028.24013.jackm@dev.mellanox.co.il> On Friday 13 July 2007 01:39, Roland Dreier wrote: > thanks, applied > Need 2 fixes to this patch (sorry about that).
- Jack 2 fixes for mlx4-query-qp.patch: 1. Flow label field is 20 bits, not 24 bits. Need appropriate mask. 2. When the QP is in the INIT state, the sched_queue field is not yet available in the firmware, so the f/w cannot provide the port number in query_qp. In this case, need to use the port number which was saved in the kernel qp object. Found by Dotan Barak and Yaron Gepstein of Mellanox. Signed-off-by: Jack Morgenstein --- kernel_patches/fixes/mlx4-query-qp.patch 2007-07-15 10:04:02.678561000 +0300 +++ kernel_patches/fixes/mlx4-query-qp.patch 2007-07-15 10:07:13.883508000 +0300 @@ -102,7 +101,7 @@ Index: new_connectx_kernel/drivers/infin + ib_ah_attr->grh.traffic_class = + (be32_to_cpu(path->tclass_flowlabel) >> 20) & 0xff; + ib_ah_attr->grh.flow_label = -+ be32_to_cpu(path->tclass_flowlabel) & 0xffffff; ++ be32_to_cpu(path->tclass_flowlabel) & 0xfffff; + memcpy(ib_ah_attr->grh.dgid.raw, + path->rgid, sizeof ib_ah_attr->grh.dgid.raw); + } @@ -147,7 +146,10 @@ Index: new_connectx_kernel/drivers/infin + } + + qp_attr->pkey_index = context.pri_path.pkey_index & 0x7f; -+ qp_attr->port_num = context.pri_path.sched_queue & 0x40 ? 2 : 1; ++ if (qp_attr->qp_state == IB_QPS_INIT) ++ qp_attr->port_num = qp->port; ++ else ++ qp_attr->port_num = context.pri_path.sched_queue & 0x40 ? 2 : 1; + + /* qp_attr->en_sqd_async_notify is only applicable in modify qp */ + qp_attr->sq_draining = mlx4_state == MLX4_QP_STATE_SQ_DRAINING; From jackm at dev.mellanox.co.il Sun Jul 15 00:58:55 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 15 Jul 2007 10:58:55 +0300 Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp In-Reply-To: References: <200706211229.08703.jackm@dev.mellanox.co.il> Message-ID: <200707151058.55805.jackm@dev.mellanox.co.il> On Friday 13 July 2007 01:43, Roland Dreier wrote: > > + init_attr->cap.max_recv_wr = mqp->rq.max_post; > > + init_attr->cap.max_recv_sge = mqp->rq.max_gs; > > Why do we have to reset these in userspace? 
Doesn't the kernel > already give us correct info for the receive queue? > > - R. > I just thought it was cleaner to have kernel-space deal with kernel-space qp capabilities, and user-space deal with user-space qp capabilities (and not split things between sq capabilities -- which do require user-space-only info -- and rq capabilities, which do not). Thus, in the kernel-space patch, at the end of procedure mlx4_ib_query_qp(), in file drivers/infiniband/hw/mlx4/qp.c, I have: + if (!ibqp->uobject) { + qp_attr->cap.max_send_wr = qp->sq.wqe_cnt; + qp_attr->cap.max_recv_wr = qp->rq.wqe_cnt; + qp_attr->cap.max_send_sge = qp->sq.max_gs; + qp_attr->cap.max_recv_sge = qp->rq.max_gs; + qp_attr->cap.max_inline_data = (1 << qp->sq.wqe_shift) - + send_wqe_overhead(qp->ibqp.qp_type) - + sizeof (struct mlx4_wqe_inline_seg); + qp_init_attr->cap = qp_attr->cap; + } If you wish to have the kernel return max_recv_wr and max_recv_sge, you will need to change the above code snippet, and move the max_recv_wr and max_recv_sge assignments outside the "if". - Jack From mst at dev.mellanox.co.il Sun Jul 15 02:41:45 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 15 Jul 2007 12:41:45 +0300 Subject: [ofa-general] [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: References: Message-ID: <20070715094145.GA16231@mellanox.co.il> Make mad module use a single workqueue rather than a per-port workqueue. This way, we'll have less clutter on systems with a lot of ports. Signed-off-by: Michael S. Tsirkin --- > Quoting Or Gerlitz : > Subject: [PATCH] IB/mad: fix duplicated kernel thread name > > Roland, > > This is the best I could come with, its still a problem > if you have multiple devices of different providers or > more than ten devices of the same provider... any other idea? > > -------------------------------------------------------------- > > The mad module creates thread per active port where the thread name is > derived from the port name. 
This cause different threads to have same > names when there are multiple devices. Fix that by using both the device > and the port numbers to derive the name. > > Signed-off-by: Or Gerlitz Thinking about it, why would we *want* a per-port thread? What do you guys think about the following? As a bonus, this makes it easier to renice the mad thread for people that want to do this. diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 85ccf13..626d3e4 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -45,6 +45,8 @@ MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_AUTHOR("Hal Rosenstock"); MODULE_AUTHOR("Sean Hefty"); +struct workqueue_struct *ib_mad_wq; + static struct kmem_cache *ib_mad_cache; static struct list_head ib_mad_port_list; @@ -525,7 +527,7 @@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, flags); - flush_workqueue(port_priv->wq); + flush_workqueue(ib_mad_wq); ib_cancel_rmpp_recvs(mad_agent_priv); deref_mad_agent(mad_agent_priv); @@ -774,8 +776,7 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, spin_lock_irqsave(&mad_agent_priv->lock, flags); list_add_tail(&local->completion_list, &mad_agent_priv->local_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); - queue_work(mad_agent_priv->qp_info->port_priv->wq, - &mad_agent_priv->local_work); + queue_work(ib_mad_wq, &mad_agent_priv->local_work); ret = 1; out: return ret; @@ -1965,9 +1966,7 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->qp_info-> - port_priv->wq, - &mad_agent_priv->timed_work, delay); + queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay); } } } @@ -2002,8 +2001,7 @@ static void wait_for_response(struct ib_mad_send_wr_private 
*mad_send_wr) /* Reschedule a work item if we have a shorter timeout */ if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { cancel_delayed_work(&mad_agent_priv->timed_work); - queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, - &mad_agent_priv->timed_work, delay); + queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay); } } @@ -2462,9 +2460,7 @@ static void timeout_sends(struct work_struct *work) delay = mad_send_wr->timeout - jiffies; if ((long)delay <= 0) delay = 1; - queue_delayed_work(mad_agent_priv->qp_info-> - port_priv->wq, - &mad_agent_priv->timed_work, delay); + queue_delayed_work(ib_mad_wq, &mad_agent_priv->timed_work, delay); break; } @@ -2496,7 +2492,7 @@ static void ib_mad_thread_completion_handler(struct ib_cq *cq, void *arg) spin_lock_irqsave(&ib_mad_port_list_lock, flags); if (!list_empty(&port_priv->port_list)) - queue_work(port_priv->wq, &port_priv->work); + queue_work(ib_mad_wq, &port_priv->work); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); } @@ -2800,11 +2796,6 @@ static int ib_mad_port_open(struct ib_device *device, goto error7; snprintf(name, sizeof name, "ib_mad%d", port_num); - port_priv->wq = create_singlethread_workqueue(name); - if (!port_priv->wq) { - ret = -ENOMEM; - goto error8; - } INIT_WORK(&port_priv->work, ib_mad_completion_handler); spin_lock_irqsave(&ib_mad_port_list_lock, flags); @@ -2814,18 +2805,15 @@ static int ib_mad_port_open(struct ib_device *device, ret = ib_mad_port_start(port_priv); if (ret) { printk(KERN_ERR PFX "Couldn't start port\n"); - goto error9; + goto error8; } return 0; -error9: +error8: spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_del_init(&port_priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - - destroy_workqueue(port_priv->wq); -error8: destroy_mad_qp(&port_priv->qp_info[1]); error7: destroy_mad_qp(&port_priv->qp_info[0]); @@ -2863,7 +2851,7 @@ static int ib_mad_port_close(struct ib_device *device, int port_num) 
list_del_init(&port_priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - destroy_workqueue(port_priv->wq); + flush_workqueue(ib_mad_wq); destroy_mad_qp(&port_priv->qp_info[1]); destroy_mad_qp(&port_priv->qp_info[0]); ib_dereg_mr(port_priv->mr); @@ -2960,6 +2948,12 @@ static int __init ib_mad_init_module(void) { int ret; + ib_mad_wq = create_singlethread_workqueue("ib_mad"); + if (!ib_mad_wq) { + ret = -ENOMEM; + goto error0; + } + spin_lock_init(&ib_mad_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", @@ -2987,6 +2981,8 @@ static int __init ib_mad_init_module(void) error2: kmem_cache_destroy(ib_mad_cache); error1: + destroy_workqueue(ib_mad_wq); +error0: return ret; } diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 9be5cc0..5cd2eb9 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -206,7 +206,6 @@ struct ib_mad_port_private { spinlock_t reg_lock; struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; struct list_head agent_list; - struct workqueue_struct *wq; struct work_struct work; struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; }; @@ -225,4 +224,6 @@ void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr); void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr, int timeout_ms); +extern struct workqueue_struct *ib_mad_wq; + #endif /* __IB_MAD_PRIV_H__ */ diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index 3663fd7..b8ee2b7 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -94,7 +94,7 @@ void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent) } spin_unlock_irqrestore(&agent->lock, flags); - flush_workqueue(agent->qp_info->port_priv->wq); + flush_workqueue(ib_mad_wq); list_for_each_entry_safe(rmpp_recv, temp_rmpp_recv, &agent->rmpp_list, list) { @@ -445,8 +445,7 @@ static struct ib_mad_recv_wc * complete_rmpp(struct mad_rmpp_recv 
*rmpp_recv) rmpp_wc = rmpp_recv->rmpp_wc; rmpp_wc->mad_len = get_mad_len(rmpp_recv); /* 10 seconds until we can find the packet lifetime */ - queue_delayed_work(rmpp_recv->agent->qp_info->port_priv->wq, - &rmpp_recv->cleanup_work, msecs_to_jiffies(10000)); + queue_delayed_work(ib_mad_wq, &rmpp_recv->cleanup_work, msecs_to_jiffies(10000)); return rmpp_wc; } @@ -538,8 +537,7 @@ start_rmpp(struct ib_mad_agent_private *agent, } else { spin_unlock_irqrestore(&agent->lock, flags); /* 40 seconds until we can find the packet lifetimes */ - queue_delayed_work(agent->qp_info->port_priv->wq, - &rmpp_recv->timeout_work, + queue_delayed_work(ib_mad_wq, &rmpp_recv->timeout_work, msecs_to_jiffies(40000)); rmpp_recv->newwin += window_size(agent); ack_recv(rmpp_recv, mad_recv_wc); -- MST From vlad at lists.openfabrics.org Sun Jul 15 02:45:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 15 Jul 2007 02:45:35 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status Message-ID: <20070715094536.2109FE603CA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.17 Passed on 
x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: Build failed on i686 with linux-2.6.22-rc7 From halr at voltaire.com Sun Jul 15 03:43:20 2007 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Jul 2007 06:43:20 -0400 Subject: [ofa-general] [PATCH] OpenSM: Change force_link_speed to allow for local policy and more flexibility Message-ID: <1184496198.4908.154970.camel@hal.voltaire.com> OpenSM: Change force_link_speed to allow for local policy and more flexibility Extend (and change) the use of force_link_speed as follows: 0 - no change 1 - set to SDR 15 - set 
as supported (default) (Non zero values are used to set LinkSpeedEnabled component in PortInfo) Note that force_link_speed 0 which used to force SDR is now force_link_speed 1 "Ideally", there were be a per port configuration of this. [Note this is largely untested.] Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index bc3f8b3..aeb1bcc 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -1109,14 +1109,20 @@ __osm_lid_mgr_set_physp_pi( send_set = TRUE; if ( p_mgr->p_subn->opt.force_link_speed ) - ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); - else if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi )) - ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); - else - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi )); - if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, - sizeof(p_pi->link_speed) )) - send_set = TRUE; + { + if ( p_mgr->p_subn->opt.force_link_speed == 15 ) /* LinkSpeedSupported */ + { + if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi )) + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); + else + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi )); + } + else + ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed ); + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, + sizeof(p_pi->link_speed) )) + send_set = TRUE; + } /* M_KeyProtectBits are always zero */ p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc; diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c index 25f0fc3..4c0ebc1 100644 --- a/opensm/opensm/osm_link_mgr.c +++ b/opensm/opensm/osm_link_mgr.c @@ -304,14 +304,20 @@ __osm_link_mgr_set_physp_pi( send_set = TRUE; if ( p_mgr->p_subn->opt.force_link_speed ) - 
ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); - else if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi )) - ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); - else - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi )); - if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, - sizeof(p_pi->link_speed) )) - send_set = TRUE; + { + if ( p_mgr->p_subn->opt.force_link_speed == 15 ) /* LinkSpeedSupported */ + { + if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi )) + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); + else + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi )); + } + else + ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed ); + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, + sizeof(p_pi->link_speed) )) + send_set = TRUE; + } /* calc new op_vls and mtu */ op_vls = diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index ae672f8..c60dcb4 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -449,7 +449,7 @@ osm_subn_set_default_opt( p_opt->lmc = OSM_DEFAULT_LMC; p_opt->lmc_esp0 = FALSE; p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS; - p_opt->force_link_speed = 0; + p_opt->force_link_speed = 15; p_opt->reassign_lids = FALSE; p_opt->reassign_lfts = TRUE; p_opt->ignore_other_sm = FALSE; @@ -1017,6 +1017,17 @@ osm_subn_verify_conf_file( p_opts->sm_priority = OSM_DEFAULT_SM_PRIORITY; } + if ((15 < p_opts->force_link_speed) || + (p_opts->force_link_speed > 7 && p_opts->force_link_speed < 15)) + { + sprintf(buff, " Invalid Cached Option Value:force_link_speed = %u:" + "Using Default:%u\n", + p_opts->force_link_speed, IB_PORT_LINK_SPEED_ENABLED_MASK); + printf(buff); + cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0); + p_opts->force_link_speed = 
IB_PORT_LINK_SPEED_ENABLED_MASK; + } + if (strcmp(p_opts->console, "off") && strcmp(p_opts->console, "local") #ifdef ENABLE_OSM_CONSOLE_SOCKET @@ -1476,17 +1487,19 @@ osm_subn_write_conf_file( "# to zero is undefined.\n" "leaf_vl_stall_count 0x%02x\n\n" "# The code of maximal time a packet can wait at the head of\n" - "# transmission queue. \n" + "# transmission queue.\n" "# The actual time is 4.096usec * 2^\n" "# The value 0x14 disables this mechanism\n" "head_of_queue_lifetime 0x%02x\n\n" - "# The maximal time a packet can wait at the head of queue on \n" + "# The maximal time a packet can wait at the head of queue on\n" "# switch port connected to a CA or router port\n" "leaf_head_of_queue_lifetime 0x%02x\n\n" "# Limit the maximal operational VLs\n" "max_op_vls %u\n\n" - "# Force switch links which are more than SDR capable to \n" - "# operate at SDR speed\n\n" + "# Force link speed enable on switch links\n" + "# If 0, don't modify PortInfo:LinkSpeedEnabled on switch port\n" + "# Otherwise, use value for PortInfo:LinkSpeedEnabled on switch port\n" + "# Default is 15 (to set to PortInfo:LinkSpeedSupported\n\n" "force_link_speed %u\n\n" "# The subnet_timeout code that will be set for all the ports\n" "# The actual timeout is 4.096usec * 2^\n" From tziporet at dev.mellanox.co.il Sun Jul 15 03:58:30 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 15 Jul 2007 13:58:30 +0300 Subject: [ewg] Re: [ofa-general] OFED-1.2 release download link In-Reply-To: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> References: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> Message-ID: <4699FDD6.3010305@mellanox.co.il> Sean Hefty wrote: > > Unless I'm off, this was OFED 1.2.c-9 (this is NOT 'rc-9', but just 'c-9' - > meaning it includes support for Mellanox ConnectX adapter). OFED 1.2 GA was > released in June. > > Is OFED 1.2.c-9 really an 'OFED' release, or is it a Mellanox specific code > release that repackages the OFED 1.2 code? 
> It should be OFED release, since several OEMs said they are going to test and QA it. We cannot wait for 1.3 since some clusters are raised at Q3. It will be good if we can unify 1.2.c with 1.2.1 that was requested in the same time frame Any thoughts on this? Tziporet From kliteyn at dev.mellanox.co.il Sun Jul 15 04:56:32 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 15 Jul 2007 14:56:32 +0300 Subject: [ofa-general] [PATCH] osm: some improvements to fat-tree routing Message-ID: <469A0B70.1020101@dev.mellanox.co.il> Hi Sasha This patch adds a small improvement to fat-tree routing for asymmetrical (or unusual) trees: 1. When routing down-going routes (by climbing up the tree), first selecting the least loaded group, and then least loaded port in the selected group. 2. When routing up-going routes (by descending down the tree), scan groups by indexing order, but the start group is selected by round-robin. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 79 ++++++++++++++++++++++++++------------- 1 files changed, 53 insertions(+), 26 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 38bee8a..cfe5435 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -179,6 +179,7 @@ typedef struct ftree_port_group_t_ ftree_hca_or_sw remote_hca_or_sw; /* pointer to remote hca/switch */ cl_ptr_vector_t ports; /* vector of ports to the same lid */ boolean_t is_cn; /* whether this port is a compute node */ + uint32_t counter_down; /* number of allocated routs downwards */ } ftree_port_group_t; /*************************************************** @@ -200,6 +201,7 @@ typedef struct ftree_sw_t_ uint8_t up_port_groups_num; ftree_fwd_tbl_t lft_buf; boolean_t is_leaf; + int down_port_groups_idx; } ftree_sw_t; /*************************************************** @@ -681,6 +683,8 @@ __osm_ftree_sw_create( p_sw->lft_buf = 
(ftree_fwd_tbl_t)cl_pool_get(&p_ftree->sw_fwd_tbl_pool); memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN); + p_sw->down_port_groups_idx = -1; + return p_sw; } /* __osm_ftree_sw_create() */ @@ -2145,6 +2149,7 @@ __osm_ftree_fabric_route_upgoing_by_going_down( ftree_port_t * p_min_port; uint16_t i; uint16_t j; + uint16_t k; /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); @@ -2153,9 +2158,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down( if (p_sw->down_port_groups_num == 0) return; - /* foreach down-going port group (in indexing order) */ - for (i = 0; i < p_sw->down_port_groups_num; i++) + /* promote the index that indicates which group should we + start with when going through all the downgoing groups */ + if (p_sw->down_port_groups_idx == -1) + p_sw->down_port_groups_idx = 0; + else + p_sw->down_port_groups_idx = + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num; + + /* foreach down-going port group (in indexing order) + starting with the least loaded group */ + for ( k = 0; k < p_sw->down_port_groups_num; k++ ) { + if ( k == 0 ) + i = p_sw->down_port_groups_idx; + else + i = (i+1) % p_sw->down_port_groups_num; + p_group = p_sw->down_port_groups[i]; /* Skip this port group unless it points to a switch */ @@ -2352,34 +2371,40 @@ __osm_ftree_fabric_route_downgoing_by_going_up( if (p_sw->rank == 0) return; - /* Find the least loaded port of all the upgoing port groups - (in indexing order of the remote switches). 
*/ + /* Find the least loaded upgoing port group */ p_min_group = NULL; - p_min_port = NULL; for (i = 0; i < p_sw->up_port_groups_num; i++) { p_group = p_sw->up_port_groups[i]; + if (!p_min_group) + { + /* first group that we're checking - use + it as a group with the lowest load */ + p_min_group = p_group; + } + else if ( p_group->counter_down < p_min_group->counter_down ) + { + /* this group is less loaded - use it as min */ + p_min_group = p_group; + } + } - ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports); - for (j = 0; j < ports_num; j++) + /* Find the least loaded upgoing port in the selected group */ + p_min_port = NULL; + ports_num = (uint16_t)cl_ptr_vector_get_size(&p_min_group->ports); + for (j = 0; j < ports_num; j++) + { + cl_ptr_vector_at(&p_min_group->ports, j, (void **)&p_port); + if (!p_min_port) { - cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port); - if (!p_min_group) - { - /* first port that we're checking - use - it as a port with the lowest load */ - p_min_group = p_group; - p_min_port = p_port; - } - else - { - if ( p_port->counter_down < p_min_port->counter_down ) - { - /* this port is less loaded - use it as min */ - p_min_group = p_group; - p_min_port = p_port; - } - } + /* first port that we're checking - use + it as a port with the lowest load */ + p_min_port = p_port; + } + else if ( p_port->counter_down < p_min_port->counter_down ) + { + /* this port is less loaded - use it as min */ + p_min_port = p_port; } } @@ -2435,8 +2460,10 @@ __osm_ftree_fabric_route_downgoing_by_going_up( __osm_ftree_tuple_to_str(p_remote_sw->tuple)); } /* The number of downgoing routes is tracked in the - p_port->counter_down counter of the port that belongs to - the lower side of the link (on switch with higher rank) */ + p_group->counter_down p_port->counter_down counters of the + group and port that belong to the lower side of the link + (on switch with higher rank) */ + p_min_group->counter_down++; p_min_port->counter_down++; if 
(is_real_lid)
 {
-- 
1.5.1.4

From dotanb at dev.mellanox.co.il Sun Jul 15 05:00:09 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 15 Jul 2007 15:00:09 +0300
Subject: [ofa-general] [PATCH] mlx4/IB: Take sizeof the correct pointer when calling to memset
Message-ID: <200707151500.09578.dotanb@dev.mellanox.co.il>

Take sizeof the correct pointer when calling to memset.

Signed-off-by: Dotan Barak

---

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4004218..ab6f0b7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1498,7 +1498,7 @@ static int to_ib_qp_access_flags(int mlx4_flags)
 static void to_ib_ah_attr(struct mlx4_dev *dev, struct ib_ah_attr *ib_ah_attr,
                           struct mlx4_qp_path *path)
 {
-        memset(ib_ah_attr, 0, sizeof *path);
+        memset(ib_ah_attr, 0, sizeof *ib_ah_attr);
         ib_ah_attr->port_num = path->sched_queue & 0x40 ? 2 : 1;
 
         if (ib_ah_attr->port_num == 0 || ib_ah_attr->port_num > dev->caps.num_ports)

From tziporet at dev.mellanox.co.il Sun Jul 15 05:15:41 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Jul 2007 15:15:41 +0300
Subject: [ofa-general] OFED 1.3 timeline
In-Reply-To:
References:
Message-ID: <469A0FED.5040903@mellanox.co.il>

Shirley Ma wrote:
> 1. skb aggregations for both dev xmit(networking layer) and IPoIB send
> 2. multiple interrupt vectors in IPoIB for multiple links scalability
> 3. split CQ and send completion aggregation
> 4. LRO for IPoIB when generic LRO is available in networking layer.
>
> Some of them might be made on time in ofed-1.3 timeline, some of
> them might not. It will depend on our test progresses and community review
> feedbacks. I hope ofed-1.3 won't leave these patches out if they can be
> made into 2.6.23 on time.
>
> Thanks
> Shirley
>
>

OFED 1.3 kernel code is based on 2.6.23.
Anything that makes it into the kernel on time will be in OFED 1.3 too.

Tziporet

From tziporet at dev.mellanox.co.il Sun Jul 15 05:26:59 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 15 Jul 2007 15:26:59 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To:
References:
Message-ID: <469A1293.6020902@mellanox.co.il>

Roland Dreier wrote:
> As you can see, I just sent my first 2.6.23 pull request for Linus.
> There are still a few more things I plan to do in before the merge
> window closes (in ~10 days):
>
>

Until when can we get mlx4 with FMRs in?

Tziporet

From kliteyn at dev.mellanox.co.il Sun Jul 15 06:36:30 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 15 Jul 2007 16:36:30 +0300
Subject: [ofa-general] [PATCH] opensm/updn: --connect_roots option
In-Reply-To: <20070621212919.GL25653@sashak.voltaire.com>
References: <20070621212919.GL25653@sashak.voltaire.com>
Message-ID: <469A22DE.5010301@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> With this option up/down preserves route paths (based on min hops
> knowledge) between root switches. This makes up/down IBA complaint
> (where all to all connectivity is required), OTOH this violates up/down
> deadlock free algorithm. By default this option is 'off'.
>
> Signed-off-by: Sasha Khapyorsky

If I understand you correctly, this patch does what it says - it connects
*roots*. But what if other switches are not connected because of the up/down
constraints?

For instance, the fabric can actually be built of several sub-trees that are
connected only at the leaf switch rank, so there is no up/down path between any
two switches from different sub-trees at ranks 0 to leaf rank (exclusive).
Moreover, I can think of a topology where some CA-to-CA paths will be missing too.

A similar problem exists in fat-tree routing.

Thoughts?
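
The constraint in question can be illustrated with a small sketch. This is not
OpenSM code: updn_path_is_legal() is a hypothetical helper, and it assumes a
path is given simply as the list of switch ranks it visits (0 = root, larger
values closer to the leaves). Links between switches of equal rank are ignored
here, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of OpenSM.
 *
 * ranks[i] is the rank of the i-th switch on a path (0 = root).
 * Under the up/down rule a path may climb towards the roots and then
 * descend, but it may never turn up again once it has gone down.
 * Hops between equal ranks are treated as neutral (assumption).
 */
static bool updn_path_is_legal(const unsigned *ranks, size_t n)
{
	bool went_down = false;
	size_t i;

	assert(ranks != NULL || n == 0);

	for (i = 1; i < n; i++) {
		if (ranks[i] > ranks[i - 1])
			went_down = true;	/* moving away from the roots */
		else if (ranks[i] < ranks[i - 1] && went_down)
			return false;		/* up after down: forbidden */
	}
	return true;
}
```

With ranks {2, 1, 0, 1, 2} (up to a root, then down) the check passes. With
ranks {1, 2, 1} (down to a shared leaf switch, then up into the other sub-tree)
it fails, which is the missing-path case described above.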
-- Yevgeny > --- > opensm/include/opensm/osm_subnet.h | 6 ++++++ > opensm/man/opensm.8 | 8 +++++++- > opensm/opensm/main.c | 15 ++++++++++++++- > opensm/opensm/osm_subnet.c | 10 ++++++++++ > opensm/opensm/osm_ucast_updn.c | 27 ++++++++++++++++++++++++++- > 5 files changed, 63 insertions(+), 3 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index 2ee5689..43b1589 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -276,6 +276,7 @@ typedef struct _osm_subn_opt > boolean_t sweep_on_trap; > osm_testability_modes_t testability_mode; > char * routing_engine_name; > + boolean_t connect_roots; > char * lid_matrix_dump_file; > char * ucast_dump_file; > char * root_guid_file; > @@ -445,6 +446,11 @@ typedef struct _osm_subn_opt > * Name of used routing engine > * (other than default Min Hop Algorithm) > * > +* connect_roots > +* The option which will enfoce root to root connectivity with > +* up/down routing engine (even if this violates "pure" deadlock > +* free up/down algorithm) > +* > * lid_matrix_dump_file > * Name of the lid matrix dump file from where switch > * lid matrices (min hops tables) will be loaded > diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 > index 4d35689..40e0235 100644 > --- a/opensm/man/opensm.8 > +++ b/opensm/man/opensm.8 > @@ -5,7 +5,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) > > .SH SYNOPSIS > .B opensm > -[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-\-routing_engine ] [\-M | \-\-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] 
[\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s ] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] > +[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-\-routing_engine ] [\-z | \-\-connect_roots] [\-M | \-\-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s ] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] > > .SH DESCRIPTION > .PP > @@ -94,6 +94,12 @@ This option chooses routing engine instead of Min Hop > algorithm (default). > Supported engines: updn, file, ftree, lash > .TP > +\fB\-z\fR, \fB\-\-connect_roots\fR > +This option enforces a routing engine (currently up/down > +only) to make connectivity between root switches and in > +this way to be fully IBA complaint. In many cases this can > +violate "pure" deadlock free algorithm, so use it carefully. > +.TP > \fB\-M\fR, \fB\-\-lid_matrix_file\fR > This option specifies the name of the lid matrix dump file > from where switch lid matrices (min hops tables will be > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 0d5e0eb..e182276 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -175,6 +175,13 @@ show_usage(void) > " This option chooses routing engine instead of Min Hop\n" > " algorithm (default).\n" > " Supported engines: updn, file, ftree\n\n"); > + printf( "-z\n" > + "--connect_roots\n" > + " This option enforces a routing engine (currently\n" > + " up/down only) to make connectivity between root switches\n" > + " and in this way to be fully IBA complaint. 
In many cases\n" > + " this can violate \"pure\" deadlock free algorithm, so\n" > + " use it carefully.\n\n"); > printf( "-M\n" > "--lid_matrix_file \n" > " This option specifies the name of the lid matrix dump file\n" > @@ -591,7 +598,7 @@ main( > char *ignore_guids_file_name = NULL; > uint32_t val; > const char * const short_option = > - "i:f:ed:g:l:L:s:t:a:u:R:M:U:S:P:NBIQvVhorcyxp:n:q:k:C:"; > + "i:f:ed:g:l:L:s:t:a:u:R:zM:U:S:P:NBIQvVhorcyxp:n:q:k:C:"; > > /* > In the array below, the 2nd parameter specifies the number > @@ -625,6 +632,7 @@ main( > { "priority", 1, NULL, 'p'}, > { "smkey", 1, NULL, 'k'}, > { "routing_engine",1, NULL, 'R'}, > + { "connect_roots", 0, NULL, 'z'}, > { "lid_matrix_file",1, NULL, 'M'}, > { "ucast_file", 1, NULL, 'U'}, > { "sadb_file", 1, NULL, 'S'}, > @@ -876,6 +884,11 @@ main( > printf(" Activate \'%s\' routing engine\n", optarg); > break; > > + case 'z': > + opt.connect_roots = TRUE; > + printf(" Connect roots option is on\n"); > + break; > + > case 'M': > opt.lid_matrix_dump_file = optarg; > printf(" Lid matrix dump file is \'%s\'\n", optarg); > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 82d66f9..8f429ae 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -500,6 +500,7 @@ osm_subn_set_default_opt( > p_opt->sweep_on_trap = TRUE; > p_opt->testability_mode = OSM_TEST_MODE_NONE; > p_opt->routing_engine_name = NULL; > + p_opt->connect_roots = FALSE; > p_opt->lid_matrix_dump_file = NULL; > p_opt->ucast_dump_file = NULL; > p_opt->root_guid_file = NULL; > @@ -1290,6 +1291,10 @@ osm_subn_parse_conf_file( > "routing_engine", > p_key, p_val, &p_opts->routing_engine_name); > > + __osm_subn_opts_unpack_boolean( > + "connect_roots", > + p_key, p_val, &p_opts->connect_roots); > + > __osm_subn_opts_unpack_charp( > "log_file", p_key, p_val, &p_opts->log_file); > > @@ -1545,6 +1550,11 @@ osm_subn_write_conf_file( > "# Routing engine\n" > "routing_engine %s\n\n", > 
p_opts->routing_engine_name); > + if (p_opts->connect_roots) > + fprintf( opts_file, > + "# Connect roots (use FALSE if unsure)\n" > + "connect_roots %s\n\n", > + p_opts->connect_roots ? "TRUE" : "FALSE"); > if (p_opts->lid_matrix_dump_file) > fprintf( opts_file, > "# Lid matrix dump file name\n" > diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c > index af5ee4e..db8e60a 100644 > --- a/opensm/opensm/osm_ucast_updn.c > +++ b/opensm/opensm/osm_ucast_updn.c > @@ -449,6 +449,24 @@ updn_subn_rank( > > /********************************************************************** > **********************************************************************/ > +/* hack: preserve min hops entries to any other root switches */ > +static void > +updn_clear_root_hops(updn_t *p_updn, osm_switch_t *p_sw) > +{ > + osm_port_t *p_port; > + unsigned i; > + > + for ( i = 0 ; i < p_sw->num_hops ; i++ ) > + if (p_sw->hops[i]) { > + p_port = cl_ptr_vector_get(&p_updn->p_osm->subn.port_lid_tbl, i); > + if (!p_port || !p_port->p_node->sw || > + ((struct updn_node *)p_port->p_node->sw->priv)->rank != 0) > + memset(p_sw->hops[i], 0xff, p_sw->num_ports); > + } > +} > + > +/********************************************************************** > + **********************************************************************/ > static int > __osm_subn_set_up_down_min_hop_table( > IN updn_t* p_updn ) > @@ -471,7 +489,10 @@ __osm_subn_set_up_down_min_hop_table( > p_sw = p_next_sw; > p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item ); > /* Clear Min Hop Table */ > - osm_switch_clear_hops(p_sw); > + if (p_subn->opt.connect_roots && !((struct updn_node *)p_sw->priv)->rank) > + updn_clear_root_hops(p_updn, p_sw); > + else > + osm_switch_clear_hops(p_sw); > } > > osm_log( p_log, OSM_LOG_VERBOSE, > @@ -607,6 +628,10 @@ __osm_updn_call( > osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr ); > __osm_updn_find_root_nodes_by_min_hop( p_updn ); > } > + else if 
(p_updn->p_osm->subn.opt.connect_roots &&
> + p_updn->updn_ucast_reg_inputs.num_guids > 1)
> + osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr );
> +
> /* printf ("-V- after osm_updn_find_root_nodes_by_min_hop\n"); */
> /* Only if there are assigned root nodes do the algorithm, otherwise perform do nothing */
> if ( p_updn->updn_ucast_reg_inputs.num_guids > 0)

From sashak at voltaire.com Sun Jul 15 06:47:17 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 15 Jul 2007 16:47:17 +0300
Subject: [ofa-general] [PATCH] opensm/updn: --connect_roots option
In-Reply-To: <469A22DE.5010301@dev.mellanox.co.il>
References: <20070621212919.GL25653@sashak.voltaire.com> <469A22DE.5010301@dev.mellanox.co.il>
Message-ID: <1184507237.19232.9.camel@localhost>

Hi Yevgeny,

On Sun, 2007-07-15 at 16:36 +0300, Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Sasha Khapyorsky wrote:
> > With this option up/down preserves route paths (based on min hops
> > knowledge) between root switches. This makes up/down IBA complaint
> > (where all to all connectivity is required), OTOH this violates up/down
> > deadlock free algorithm. By default this option is 'off'.
> >
> > Signed-off-by: Sasha Khapyorsky
>
> If I understand you correctly, this patch does what it says - connects
> *roots*.

Yes, and in this respect it can violate up/down rules.

> But what if other switches are not connected because of the up/down
> constraints?

Another constraint is which roots are defined. In your example you could
add the connected leaf switches to the roots list, or have only the
connected leaf switches as roots (depending on the exact topology).

Sasha

> For instance, the fabric can be actually built of several sub-trees that are
> connected only at leaf switch rank, so there is no path in up/down between any
> two switches from different sub-trees at ranks 0 to leaf rank (not inclusively).
> Moreover, I can think of a topology where some CA-to-CA paths will be missing too.
>
> Similar problem exists in fat-tree routing.
> > Thoughts? > > -- Yevgeny > > > > --- > > opensm/include/opensm/osm_subnet.h | 6 ++++++ > > opensm/man/opensm.8 | 8 +++++++- > > opensm/opensm/main.c | 15 ++++++++++++++- > > opensm/opensm/osm_subnet.c | 10 ++++++++++ > > opensm/opensm/osm_ucast_updn.c | 27 ++++++++++++++++++++++++++- > > 5 files changed, 63 insertions(+), 3 deletions(-) > > > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > > index 2ee5689..43b1589 100644 > > --- a/opensm/include/opensm/osm_subnet.h > > +++ b/opensm/include/opensm/osm_subnet.h > > @@ -276,6 +276,7 @@ typedef struct _osm_subn_opt > > boolean_t sweep_on_trap; > > osm_testability_modes_t testability_mode; > > char * routing_engine_name; > > + boolean_t connect_roots; > > char * lid_matrix_dump_file; > > char * ucast_dump_file; > > char * root_guid_file; > > @@ -445,6 +446,11 @@ typedef struct _osm_subn_opt > > * Name of used routing engine > > * (other than default Min Hop Algorithm) > > * > > +* connect_roots > > +* The option which will enfoce root to root connectivity with > > +* up/down routing engine (even if this violates "pure" deadlock > > +* free up/down algorithm) > > +* > > * lid_matrix_dump_file > > * Name of the lid matrix dump file from where switch > > * lid matrices (min hops tables) will be loaded > > diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8 > > index 4d35689..40e0235 100644 > > --- a/opensm/man/opensm.8 > > +++ b/opensm/man/opensm.8 > > @@ -5,7 +5,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) > > > > .SH SYNOPSIS > > .B opensm > > -[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-\-routing_engine ] [\-M | \-\-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit 
] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s ] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] > > +[\-c(ache-options)] [\-g(uid)[=]] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-\-routing_engine ] [\-z | \-\-connect_roots] [\-M | \-\-lid_matrix_file ] [\-U | \-ucast_file ] [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] [\-o(nce)] [\-s(weep) ] [\-t(imeout) ] [\-maxsmps ] [\-console [off | local | socket]] [\-console-port ] [\-i(gnore-guids) ] [\-f | \-\-log_file] [\-L | \-\-log_limit ] [\-e(rase_log_file)] [\-P(config)] [\-Q | \-qos] [\-N | \-no_part_enforce] [\-y | \-stay_on_fatal] [\-B | \-daemon] [\-I | \-inactive] [\-perfmgr] [\-perfmgr_sweep_time_s ] [\-v(erbose)] [\-V] [\-D ] [\-d(ebug) ] [\-h(elp)] [\-?] > > > > .SH DESCRIPTION > > .PP > > @@ -94,6 +94,12 @@ This option chooses routing engine instead of Min Hop > > algorithm (default). > > Supported engines: updn, file, ftree, lash > > .TP > > +\fB\-z\fR, \fB\-\-connect_roots\fR > > +This option enforces a routing engine (currently up/down > > +only) to make connectivity between root switches and in > > +this way to be fully IBA complaint. In many cases this can > > +violate "pure" deadlock free algorithm, so use it carefully. 
> > +.TP > > \fB\-M\fR, \fB\-\-lid_matrix_file\fR > > This option specifies the name of the lid matrix dump file > > from where switch lid matrices (min hops tables will be > > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > > index 0d5e0eb..e182276 100644 > > --- a/opensm/opensm/main.c > > +++ b/opensm/opensm/main.c > > @@ -175,6 +175,13 @@ show_usage(void) > > " This option chooses routing engine instead of Min Hop\n" > > " algorithm (default).\n" > > " Supported engines: updn, file, ftree\n\n"); > > + printf( "-z\n" > > + "--connect_roots\n" > > + " This option enforces a routing engine (currently\n" > > + " up/down only) to make connectivity between root switches\n" > > + " and in this way to be fully IBA complaint. In many cases\n" > > + " this can violate \"pure\" deadlock free algorithm, so\n" > > + " use it carefully.\n\n"); > > printf( "-M\n" > > "--lid_matrix_file \n" > > " This option specifies the name of the lid matrix dump file\n" > > @@ -591,7 +598,7 @@ main( > > char *ignore_guids_file_name = NULL; > > uint32_t val; > > const char * const short_option = > > - "i:f:ed:g:l:L:s:t:a:u:R:M:U:S:P:NBIQvVhorcyxp:n:q:k:C:"; > > + "i:f:ed:g:l:L:s:t:a:u:R:zM:U:S:P:NBIQvVhorcyxp:n:q:k:C:"; > > > > /* > > In the array below, the 2nd parameter specifies the number > > @@ -625,6 +632,7 @@ main( > > { "priority", 1, NULL, 'p'}, > > { "smkey", 1, NULL, 'k'}, > > { "routing_engine",1, NULL, 'R'}, > > + { "connect_roots", 0, NULL, 'z'}, > > { "lid_matrix_file",1, NULL, 'M'}, > > { "ucast_file", 1, NULL, 'U'}, > > { "sadb_file", 1, NULL, 'S'}, > > @@ -876,6 +884,11 @@ main( > > printf(" Activate \'%s\' routing engine\n", optarg); > > break; > > > > + case 'z': > > + opt.connect_roots = TRUE; > > + printf(" Connect roots option is on\n"); > > + break; > > + > > case 'M': > > opt.lid_matrix_dump_file = optarg; > > printf(" Lid matrix dump file is \'%s\'\n", optarg); > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index 
82d66f9..8f429ae 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -500,6 +500,7 @@ osm_subn_set_default_opt( > > p_opt->sweep_on_trap = TRUE; > > p_opt->testability_mode = OSM_TEST_MODE_NONE; > > p_opt->routing_engine_name = NULL; > > + p_opt->connect_roots = FALSE; > > p_opt->lid_matrix_dump_file = NULL; > > p_opt->ucast_dump_file = NULL; > > p_opt->root_guid_file = NULL; > > @@ -1290,6 +1291,10 @@ osm_subn_parse_conf_file( > > "routing_engine", > > p_key, p_val, &p_opts->routing_engine_name); > > > > + __osm_subn_opts_unpack_boolean( > > + "connect_roots", > > + p_key, p_val, &p_opts->connect_roots); > > + > > __osm_subn_opts_unpack_charp( > > "log_file", p_key, p_val, &p_opts->log_file); > > > > @@ -1545,6 +1550,11 @@ osm_subn_write_conf_file( > > "# Routing engine\n" > > "routing_engine %s\n\n", > > p_opts->routing_engine_name); > > + if (p_opts->connect_roots) > > + fprintf( opts_file, > > + "# Connect roots (use FALSE if unsure)\n" > > + "connect_roots %s\n\n", > > + p_opts->connect_roots ? 
"TRUE" : "FALSE"); > > if (p_opts->lid_matrix_dump_file) > > fprintf( opts_file, > > "# Lid matrix dump file name\n" > > diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c > > index af5ee4e..db8e60a 100644 > > --- a/opensm/opensm/osm_ucast_updn.c > > +++ b/opensm/opensm/osm_ucast_updn.c > > @@ -449,6 +449,24 @@ updn_subn_rank( > > > > /********************************************************************** > > **********************************************************************/ > > +/* hack: preserve min hops entries to any other root switches */ > > +static void > > +updn_clear_root_hops(updn_t *p_updn, osm_switch_t *p_sw) > > +{ > > + osm_port_t *p_port; > > + unsigned i; > > + > > + for ( i = 0 ; i < p_sw->num_hops ; i++ ) > > + if (p_sw->hops[i]) { > > + p_port = cl_ptr_vector_get(&p_updn->p_osm->subn.port_lid_tbl, i); > > + if (!p_port || !p_port->p_node->sw || > > + ((struct updn_node *)p_port->p_node->sw->priv)->rank != 0) > > + memset(p_sw->hops[i], 0xff, p_sw->num_ports); > > + } > > +} > > + > > +/********************************************************************** > > + **********************************************************************/ > > static int > > __osm_subn_set_up_down_min_hop_table( > > IN updn_t* p_updn ) > > @@ -471,7 +489,10 @@ __osm_subn_set_up_down_min_hop_table( > > p_sw = p_next_sw; > > p_next_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item ); > > /* Clear Min Hop Table */ > > - osm_switch_clear_hops(p_sw); > > + if (p_subn->opt.connect_roots && !((struct updn_node *)p_sw->priv)->rank) > > + updn_clear_root_hops(p_updn, p_sw); > > + else > > + osm_switch_clear_hops(p_sw); > > } > > > > osm_log( p_log, OSM_LOG_VERBOSE, > > @@ -607,6 +628,10 @@ __osm_updn_call( > > osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr ); > > __osm_updn_find_root_nodes_by_min_hop( p_updn ); > > } > > + else if (p_updn->p_osm->subn.opt.connect_roots && > > + p_updn->updn_ucast_reg_inputs.num_guids > 1) > > + 
osm_ucast_mgr_build_lid_matrices( &p_updn->p_osm->sm.ucast_mgr ); > > + > > /* printf ("-V- after osm_updn_find_root_nodes_by_min_hop\n"); */ > > /* Only if there are assigned root nodes do the algorithm, otherwise perform do nothing */ > > if ( p_updn->updn_ucast_reg_inputs.num_guids > 0) > From a-a.sesa at abc-solutions.de Sun Jul 15 09:48:34 2007 From: a-a.sesa at abc-solutions.de (Thanh Arroyo) Date: Sun, 15 Jul 2007 15:48:34 -0100 Subject: [ofa-general] Pics Message-ID: <882174187.96161526012828@abc-solutions.de> An HTML attachment was scrubbed... URL: From sashak at voltaire.com Sun Jul 15 13:58:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 15 Jul 2007 23:58:52 +0300 Subject: [ofa-general] Re: [PATCH] OpenSM: Change force_link_speed to allow for local policy and more flexibility In-Reply-To: <1184496198.4908.154970.camel@hal.voltaire.com> References: <1184496198.4908.154970.camel@hal.voltaire.com> Message-ID: <20070715205852.GA30202@sashak.voltaire.com> On 06:43 Sun 15 Jul , Hal Rosenstock wrote: > OpenSM: Change force_link_speed to allow for local policy and more > flexibility > > Extend (and change) the use of force_link_speed as follows: > 0 - no change > 1 - set to SDR > 15 - set as supported (default) > (Non zero values are used to set LinkSpeedEnabled component in PortInfo) > > Note that force_link_speed 0 which used to force SDR is now > force_link_speed 1 > > "Ideally", there were be a per port configuration of this. > > [Note this is largely untested.] > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Jul 15 14:10:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 16 Jul 2007 00:10:52 +0300 Subject: [ofa-general] Re: [PATCH] osm: some improvements to fat-tree routing In-Reply-To: <469A0B70.1020101@dev.mellanox.co.il> References: <469A0B70.1020101@dev.mellanox.co.il> Message-ID: <20070715211052.GC30202@sashak.voltaire.com> On 14:56 Sun 15 Jul , Yevgeny Kliteynik wrote: > Hi Sasha > > This patch adds a small improvement to fat-tree routing for > asymmetrical (or unusual) trees: > 1. When routing down-going routes (by climbing up the tree), > first selecting the least loaded group, and then least loaded > port in the selected group. > 2. When routing up-going routes (by descending down the tree), > scan groups by indexing order, but the start group is selected > by round-robin. > > Signed-off-by: Yevgeny Kliteynik Applied (some trailing whitespaces were stripped by git-am). Thanks. Sasha From akepner at sgi.com Sun Jul 15 14:21:46 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Sun, 15 Jul 2007 14:21:46 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix Message-ID: <20070715212146.GF6921@sgi.com> Here's a first cut at OFED 1.3/Linux 2.6.23 patches to address the "CQ/DMA race" that's possible on Altix systems when CQs are allocated in user space. (A description of this bug appears here: http://lists.openfabrics.org/pipermail/general/2006-December/030251.html) I'll post the kernel patch to lkml, but I'd appreciate any comments from this list before doing that. Obviously this is just a subset of the necessary kernel changes required, since every use of dma_map_sg() would need to be modified. Comments? 
arch/ia64/sn/pci/pci_dma.c | 19 ++++++++++++++----- drivers/infiniband/core/umem.c | 5 +++-- drivers/infiniband/hw/mthca/mthca_provider.c | 11 ++++++++++- drivers/infiniband/hw/mthca/mthca_user.h | 8 +++++++- drivers/infiniband/ulp/srp/ib_srp.c | 2 +- include/asm-generic/dma-mapping.h | 4 ++-- include/asm-generic/pci-dma-compat.h | 2 +- include/asm-ia64/machvec.h | 2 +- include/rdma/ib_umem.h | 2 +- include/rdma/ib_verbs.h | 5 +++-- -- diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c index d79ddac..d942390 100644 --- a/arch/ia64/sn/pci/pci_dma.c +++ b/arch/ia64/sn/pci/pci_dma.c @@ -245,7 +245,7 @@ EXPORT_SYMBOL(sn_dma_unmap_sg); * Maps each entry of @sg for DMA. */ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries, - int direction) + int direction, int coherent) { unsigned long phys_addr; struct scatterlist *saved_sg = sg; @@ -259,12 +259,21 @@ int sn_dma_map_sg(struct device *dev, struct scatterlist *sg, int nhwentries, * Setup a DMA address for each entry in the scatterlist. 
*/ for (i = 0; i < nhwentries; i++, sg++) { + dma_addr_t dma_addr; phys_addr = SG_ENT_PHYS_ADDRESS(sg); - sg->dma_address = provider->dma_map(pdev, - phys_addr, sg->length, - SN_DMA_ADDR_PHYS); - if (!sg->dma_address) { + if (coherent) { + dma_addr= provider->dma_map_consistent(pdev, + phys_addr, + sg->length, + SN_DMA_ADDR_PHYS); + } else { + dma_addr = provider->dma_map(pdev, + phys_addr, sg->length, + SN_DMA_ADDR_PHYS); + } + + if (!(sg->dma_address = dma_addr)) { printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__); /* diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index d40652a..e9f9f42 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -66,7 +66,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d * @access: IB_ACCESS_xxx flags for memory being pinned */ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, - size_t size, int access) + size_t size, int access, int coherent) { struct ib_umem *umem; struct page **page_list; @@ -154,7 +154,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, chunk->nmap = ib_dma_map_sg(context->device, &chunk->page_list[0], chunk->nents, - DMA_BIDIRECTIONAL); + DMA_BIDIRECTIONAL, + coherent); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) put_page(chunk->page_list[i].page); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 6bcde1c..c0cf5f1 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1017,6 +1017,8 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; struct mthca_mr *mr; + struct mthca_reg_mr ucmd; + int coherent; u64 *pages; int shift, n, len; int i, j, k; @@ -1027,7 +1029,14 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length, if (!mr) return ERR_PTR(-ENOMEM); - mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) { + err = -EFAULT; + goto err; + } + coherent = (int) ucmd.mr_attrs & MTHCA_MR_COHERENT; + + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, + coherent); if (IS_ERR(mr->umem)) { err = PTR_ERR(mr->umem); goto err; diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h index 02cc0a7..f46773e 100644 --- a/drivers/infiniband/hw/mthca/mthca_user.h +++ b/drivers/infiniband/hw/mthca/mthca_user.h @@ -41,7 +41,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -61,6 +61,12 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + __u32 mr_attrs; +#define MTHCA_MR_COHERENT 0x1 + __u32 reserved; +}; + struct mthca_create_cq { __u32 lkey; __u32 pdn; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 39bf057..b7a4301 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -699,7 +699,7 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target, dev = target->srp_host->dev; ibdev = dev->dev; - count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction); + count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction, 0); fmt = SRP_DATA_DESC_DIRECT; len = sizeof (struct srp_cmd) + sizeof (struct srp_direct_buf); diff --git a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h index 783ab99..34e8357 100644 --- a/include/asm-generic/dma-mapping.h +++ b/include/asm-generic/dma-mapping.h @@ -89,7 +89,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, static inline int dma_map_sg(struct 
device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) + enum dma_data_direction direction, int coherent) { BUG_ON(dev->bus != &pci_bus_type); @@ -213,7 +213,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, static inline int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) + enum dma_data_direction direction, int coherent) { BUG(); return 0; diff --git a/include/asm-generic/pci-dma-compat.h b/include/asm-generic/pci-dma-compat.h index 25c10e9..3e85b8e 100644 --- a/include/asm-generic/pci-dma-compat.h +++ b/include/asm-generic/pci-dma-compat.h @@ -60,7 +60,7 @@ static inline int pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg, int nents, int direction) { - return dma_map_sg(hwdev == NULL ? NULL : &hwdev->dev, sg, nents, (enum dma_data_direction)direction); + return dma_map_sg(hwdev == NULL ? NULL : &hwdev->dev, sg, nents, (enum dma_data_direction)direction, 0); } static inline void diff --git a/include/asm-ia64/machvec.h b/include/asm-ia64/machvec.h index ca33eb1..34e9a58 100644 --- a/include/asm-ia64/machvec.h +++ b/include/asm-ia64/machvec.h @@ -46,7 +46,7 @@ typedef void *ia64_mv_dma_alloc_coherent (struct device *, size_t, dma_addr_t *, typedef void ia64_mv_dma_free_coherent (struct device *, size_t, void *, dma_addr_t); typedef dma_addr_t ia64_mv_dma_map_single (struct device *, void *, size_t, int); typedef void ia64_mv_dma_unmap_single (struct device *, dma_addr_t, size_t, int); -typedef int ia64_mv_dma_map_sg (struct device *, struct scatterlist *, int, int); +typedef int ia64_mv_dma_map_sg (struct device *, struct scatterlist *, int, int, int); typedef void ia64_mv_dma_unmap_sg (struct device *, struct scatterlist *, int, int); typedef void ia64_mv_dma_sync_single_for_cpu (struct device *, dma_addr_t, size_t, int); typedef void ia64_mv_dma_sync_sg_for_cpu (struct device *, struct scatterlist *, int, int); diff --git 
a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index c533d6c..08aeb87 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -61,7 +61,7 @@ struct ib_umem_chunk { #ifdef CONFIG_INFINIBAND_USER_MEM struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, - size_t size, int access); + size_t size, int access, int coherent); void ib_umem_release(struct ib_umem *umem); int ib_umem_page_count(struct ib_umem *umem); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 0627a6a..d5d3180 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1555,11 +1555,12 @@ static inline void ib_dma_unmap_page(struct ib_device *dev, */ static inline int ib_dma_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) + enum dma_data_direction direction, + int coherent) { if (dev->dma_ops) return dev->dma_ops->map_sg(dev, sg, nents, direction); - return dma_map_sg(dev->dma_device, sg, nents, direction); + return dma_map_sg(dev->dma_device, sg, nents, direction, coherent); } /** -- Arthur From akepner at sgi.com Sun Jul 15 14:24:45 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Sun, 15 Jul 2007 14:24:45 -0700 Subject: [ofa-general] [RFC 1/1] libmthca: CQ/DMA race on Altix Message-ID: <20070715212445.GG6921@sgi.com> The libmthca-specific changes for this RFC follow. 
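
A rough illustration of what the library-side patch below does: the
driver-private command carries an attribute word after the generic verbs
command, and the kernel tests one bit of that word to decide whether the
region should be mapped coherently. This is only a sketch. The types
mock_core_cmd and mock_reg_mr_cmd and the two helper functions are
hypothetical stand-ins, not the real struct ibv_reg_mr or
struct mthca_reg_mr, and the sketch zeroes the command first so that the
|= starts from a known value.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors the MTHCA_MR_COHERENT bit introduced by the patch. */
#define MOCK_MR_COHERENT 0x1u

/* Hypothetical stand-in for the generic verbs command (struct ibv_reg_mr). */
struct mock_core_cmd {
    uint64_t start;
    uint64_t length;
    uint64_t hca_va;
    uint32_t pd_handle;
    uint32_t access_flags;
};

/* Hypothetical stand-in for the patch's struct mthca_reg_mr:
 * generic command first, driver-private fields after it. */
struct mock_reg_mr_cmd {
    struct mock_core_cmd core;
    uint32_t mr_attrs;
    uint32_t reserved;
};

/* Library side: clear the whole command, then set the coherent bit. */
static void mock_fill_reg_mr(struct mock_reg_mr_cmd *cmd, int coherent)
{
    memset(cmd, 0, sizeof *cmd);
    if (coherent)
        cmd->mr_attrs |= MOCK_MR_COHERENT;
}

/* Kernel side: test the bit, as mthca_reg_user_mr() does with ucmd.mr_attrs. */
static int mock_wants_coherent(const struct mock_reg_mr_cmd *cmd)
{
    return (cmd->mr_attrs & MOCK_MR_COHERENT) != 0;
}
```

In the patch, only the CQ buffers (mthca_create_cq and mthca_resize_cq)
pass 1 for the coherent argument; the SRQ, QP and ordinary user
registrations pass 0.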
mthca-abi.h | 9 ++++++++- verbs.c | 22 +++++++++++++--------- 2 files changed, 21 insertions(+), 10 deletions(-) -- diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h --- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h 2007-06-23 02:00:34.000000000 -0700 +++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h 2007-07-15 12:18:54.505352246 -0700 @@ -36,7 +36,7 @@ #include -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_ABI_VERSION 2 struct mthca_alloc_ucontext_resp { struct ibv_get_context_resp ibv_resp; @@ -50,6 +50,13 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + struct ibv_reg_mr ibv_cmd; + __u32 mr_attrs; +#define MTHCA_MR_COHERENT 0x1 + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c --- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c 2007-06-23 02:00:34.000000000 -0700 +++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c 2007-07-15 13:26:24.371410587 -0700 @@ -117,26 +117,30 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int coherent) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr cmd; int ret; mr = malloc(sizeof *mr); if (!mr) return NULL; + cmd.mr_attrs |= (__u32) coherent ? 
MTHCA_MR_COHERENT: 0; + #ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS { struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, sizeof cmd, &resp, + sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, sizeof cmd); #endif if (ret) { free(mr); @@ -149,7 +153,7 @@ static struct ibv_mr *__mthca_reg_mr(str struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +206,7 @@ struct ibv_cq *mthca_create_cq(struct ib cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -294,7 +298,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -402,7 +406,7 @@ struct ibv_srq *mthca_create_srq(struct if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -520,7 +524,7 @@ struct ibv_qp *mthca_create_qp(struct ib pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; -- Arthur From sweitzen at cisco.com Sun Jul 15 22:42:45 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Sun, 15 Jul 2007 22:42:45 -0700 Subject: [ewg] Re: [ofa-general] 
OFED-1.2 release download link
In-Reply-To: <4699FDD6.3010305@mellanox.co.il>
References: <000001c7c4ac$631bbe10$ff0da8c0@amr.corp.intel.com> <4699FDD6.3010305@mellanox.co.il>
Message-ID:

> It will be good if we can unify 1.2.c with 1.2.1 that was
> requested in
> the same time frame
> Any thoughts on this?

I am in favor of unifying them.

Scott

From muli at il.ibm.com Mon Jul 16 00:34:35 2007
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Mon, 16 Jul 2007 10:34:35 +0300
Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix
In-Reply-To: <20070715212146.GF6921@sgi.com>
References: <20070715212146.GF6921@sgi.com>
Message-ID: <20070716073435.GE3530@rhun.haifa.ibm.com>

On Sun, Jul 15, 2007 at 02:21:46PM -0700, akepner at sgi.com wrote:

> diff --git a/include/asm-generic/dma-mapping.h b/include/asm-generic/dma-mapping.h
> index 783ab99..34e8357 100644
> --- a/include/asm-generic/dma-mapping.h
> +++ b/include/asm-generic/dma-mapping.h
> @@ -89,7 +89,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
>
>  static inline int
>  dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> -           enum dma_data_direction direction)
> +           enum dma_data_direction direction, int coherent)
>  {
>          BUG_ON(dev->bus != &pci_bus_type);
>
> @@ -213,7 +213,7 @@ dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
>
>  static inline int
>  dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> -           enum dma_data_direction direction)
> +           enum dma_data_direction direction, int coherent)
>  {
>          BUG();
>          return 0;

This will be very painful and frankly I don't think the pain is
justified. Can't you confine the changes to the IB layer so that the
mapping happens through dma_alloc_coherent if you need
coherent/consistent memory rather than through dma_map_sg?

Also, this kind of thing should definitely be CC'd to lkml.

Cheers,
Muli

From ogerlitz at voltaire.com Mon Jul 16 01:55:34 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 16 Jul 2007 11:55:34 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name
In-Reply-To: <20070715094145.GA16231@mellanox.co.il>
References: <20070715094145.GA16231@mellanox.co.il>
Message-ID: <469B3286.3060902@voltaire.com>

Michael S. Tsirkin wrote:
> Make mad module use a single workqueue rather than a per-port
> workqueue. This way, we'll have less clutter on systems with
> a lot of ports.
> Signed-off-by: Michael S. Tsirkin
> Thinking about it, why would we *want* a per-port thread?
> What do you guys think about the following?
> As a bonus, this makes it easier to renice the mad thread
> for people that want to do this.

Indeed, today the mad module creates thread per device port, and the cm
module creates thread per cpu, so if the system has 16 cores and two
hcas each with two ports, the IB stack would create 20 threads just for
the sake of the mad and cm modules... I also think it would be good if
the mad module would create one thread and the cm module as well.

Sean - does it make sense to you to change the CM for that matter?

I brought below listing of the kernel threads on an rh4 u3 / ofed 1.2
system.

Or.

> root 1 0 0 Jul15 ? 00:00:00 init [5]
> root 2 1 0 Jul15 ? 00:00:00 [migration/0]
> root 3 1 0 Jul15 ? 00:00:00 [ksoftirqd/0]
> root 4 1 0 Jul15 ? 00:00:00 [migration/1]
> root 5 1 0 Jul15 ? 00:00:00 [ksoftirqd/1]
> root 6 1 0 Jul15 ? 00:00:00 [migration/2]
> root 7 1 0 Jul15 ? 00:00:00 [ksoftirqd/2]
> root 8 1 0 Jul15 ? 00:00:00 [migration/3]
> root 9 1 0 Jul15 ? 00:00:00 [ksoftirqd/3]
> root 10 1 0 Jul15 ? 00:00:00 [events/0]
> root 11 1 0 Jul15 ? 00:00:00 [events/1]
> root 12 1 0 Jul15 ? 00:00:00 [events/2]
> root 13 1 0 Jul15 ? 00:00:00 [events/3]
> root 14 10 0 Jul15 ? 00:00:00 [khelper]
> root 15 10 0 Jul15 ? 00:00:00 [kacpid]
> root 60 10 0 Jul15 ? 00:00:00 [kblockd/0]
> root 61 10 0 Jul15 ?
00:00:00 [kblockd/1]
> root 62 10 0 Jul15 ? 00:00:00 [kblockd/2]
> root 63 10 0 Jul15 ? 00:00:00 [kblockd/3]
> root 64 1 0 Jul15 ? 00:00:00 [khubd]
> root 73 10 0 Jul15 ? 00:00:00 [pdflush]
> root 74 10 0 Jul15 ? 00:00:00 [pdflush]
> root 76 10 0 Jul15 ? 00:00:00 [aio/0]
> root 77 10 0 Jul15 ? 00:00:00 [aio/1]
> root 78 10 0 Jul15 ? 00:00:00 [aio/2]
> root 79 10 0 Jul15 ? 00:00:00 [aio/3]
> root 75 1 0 Jul15 ? 00:00:00 [kswapd0]
> root 152 1 0 Jul15 ? 00:00:00 [kseriod]
> root 230 12 0 Jul15 ? 00:00:00 [ata/0]
> root 231 12 0 Jul15 ? 00:00:00 [ata/1]
> root 232 12 0 Jul15 ? 00:00:00 [ata/2]
> root 233 12 0 Jul15 ? 00:00:00 [ata/3]
> root 239 1 0 Jul15 ? 00:00:00 [scsi_eh_0]
> root 267 1 0 Jul15 ? 00:00:01 [kjournald]
> root 1372 10 0 Jul15 ? 00:00:00 [ib_mcast]
> root 1377 10 0 Jul15 ? 00:00:00 [ib_cm/0]
> root 1378 10 0 Jul15 ? 00:00:00 [ib_cm/1]
> root 1379 10 0 Jul15 ? 00:00:00 [ib_cm/2]
> root 1380 10 0 Jul15 ? 00:00:00 [ib_cm/3]
> root 1401 13 0 Jul15 ? 00:00:00 [ipoib]
> root 1418 12 0 Jul15 ? 00:00:00 [mthcacatas]
> root 1421 13 0 Jul15 ? 00:00:00 [ib_mad1]
> root 1422 10 0 Jul15 ? 00:00:00 [ib_mad2]
> root 2132 10 0 Jul15 ? 00:00:00 [kauditd]
> root 2209 10 0 Jul15 ? 00:00:00 [kmpathd/0]
> root 2210 10 0 Jul15 ? 00:00:00 [kmpathd/1]
> root 2211 10 0 Jul15 ? 00:00:00 [kmpathd/2]
> root 2212 10 0 Jul15 ? 00:00:00 [kmpathd/3]
> root 2220 10 0 Jul15 ? 00:00:00 [kmirrord]
> root 2221 10 0 Jul15 ? 00:00:00 [kmir_mon]
> root 3330 11 0 Jul15 ? 00:00:00 [local_sa]
> root 3335 13 0 Jul15 ? 00:00:00 [ib_addr_wq]
> root 3340 11 0 Jul15 ? 00:00:00 [iw_cm_wq]
> root 3345 12 0 Jul15 ? 00:00:00 [rdma_cm_wq]
> root 3350 13 0 Jul15 ? 00:00:00 [sdp]
> root 3358 12 0 Jul15 ?
00:00:00 [krdsd] From ogerlitz at voltaire.com Mon Jul 16 02:16:28 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 16 Jul 2007 12:16:28 +0300 Subject: [ofa-general] missing "balance" in aggregate bi-directional SDP bulk transfer In-Reply-To: <4696A054.8010102@hp.com> References: <4696A054.8010102@hp.com> Message-ID: <469B376C.2070103@voltaire.com> Rick Jones wrote: > I configured ib0 and ib1 into separate IP subnets, and ran the > "bidirectional TCP_RR" test > However, when I run the same test over SDP, some connections seem to get > much better performance than others. For example, with two concurrent > connections, one over each port, one will get a much higher result than > the other. Did you make sure that each neighbour was actually pointing to a different port? see the below excerpt from the IPoIB release notes (note that IPoIB is the ARP provider used by the RDMA CM which is what SDP is working with, so this applies both your IPoIB and SDP tests). Or. > from /usr/share/doc/ofed-docs-1.2/ipoib_release_notes.txt > > 3. Known Issues > =============================================================================== > 1. If a host has multiple interfaces and (a) each interface belongs to a > different IP subnet, (b) they all use the same InfiniBand Partition, and (c) > they are connected to the same IB Switch, then the host violates the IP rule > requiring different broadcast domains. Consequently, the host may build an > incorrect ARP table. > > The correct setting of a multi-homed IPoIB host is achieved by using a > different PKEY for each IP subnet. If a host has multiple interfaces on the > same IP subnet, then to prevent a peer from building an incorrect ARP entry > (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X > stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). 
This > causes the network stack to send ARP replies only on the interface with the > IP address specified in the ARP request: > > sysctl -w net.ipv4.conf.ib0.arp_ignore=1 > sysctl -w net.ipv4.conf.ib1.arp_ignore=1 > > Or, globally, > > sysctl -w net.ipv4.conf.all.arp_ignore=1 > > To learn more about the arp_ignore parameter, see Documentation/networking/ip-sysctl.txt. > Note that distributions have the means to make kernel parameters persistent. > From vlad at lists.openfabrics.org Mon Jul 16 02:44:44 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 16 Jul 2007 02:44:44 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070716-0200 daily build status Message-ID: <20070716094444.C61C7E6080B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with 
linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From cxp at saunalahti.fi Mon Jul 16 04:55:48 2007 From: cxp at saunalahti.fi (Peter) Date: Mon, 16 Jul 2007 12:55:48 +0100 Subject: [ofa-general] Fwd: Message-ID: <469B5CC4.7010504@saunalahti.fi> -------------- next part -------------- A non-text attachment was scrubbed... Name: warning.pdf Type: application/pdf Size: 14441 bytes Desc: not available URL: From mst at dev.mellanox.co.il Mon Jul 16 04:59:11 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Mon, 16 Jul 2007 14:59:11 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name
In-Reply-To: <469B3286.3060902@voltaire.com>
References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com>
Message-ID: <20070716115911.GA3379@mellanox.co.il>

> Quoting Or Gerlitz :
> Subject: Re: [PATCHv2] IB/mad: fix duplicated kernel thread name
>
> Michael S. Tsirkin wrote:
> >Make mad module use a single workqueue rather than a per-port
> >workqueue. This way, we'll have less clutter on systems with
> >a lot of ports.
> >Signed-off-by: Michael S. Tsirkin
> >Thinking about it, why would we *want* a per-port thread?
> >What do you guys think about the following?
> >As a bonus, this makes it easier to renice the mad thread
> >for people that want to do this.
>
> Indeed, today the mad module creates thread per device port, and the cm
> module creates thread per cpu, so if the system has 16 cores and two
> hcas each with two ports, the IB stack would create 20 threads just for
> the sake of the mad and cm modules... I also think it would be good if
> the mad module would create one thread and the cm module as well.
>
> Sean - does it make sense to you to change the CM for that matter?
>
> I brought below listing of the kernel threads on an rh4 u3 / ofed 1.2
> system.

Per-CPU threads like CM does might make sense since they improve data locality.

--
MST

From vlad at dev.mellanox.co.il Mon Jul 16 05:24:58 2007
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 16 Jul 2007 15:24:58 +0300
Subject: [ofa-general] RFC OFED-1.3 installation
Message-ID: <469B639A.1090804@dev.mellanox.co.il>

Hi,
I am starting to work on the new installation procedure for OFED-1.3.
Please review and comment.

Main changes from OFED-1.2:

- Split ofa_user-1.2.src.rpm into separate sources RPMs per package.
    * Requires RPM spec file for each package.
      Currently, the following packages lack an RPM spec file:
      libehca, mstflint, qlvnictools, perftest, sdpnetstat

  User space RPM packages list taken from maintainers' RPM spec files:

  libibverbs:
      libibverbs
      libibverbs-devel
      libibverbs-devel-static
      libibverbs-utils
  libmthca:
      libmthca
      libmthca-devel-static
  libehca:
      No RPM spec file
  libipathverbs:
      libipathverbs
      libipathverbs-devel
  libibcm:
      libibcm
      libibcm-devel
  libsdp:
      libsdp
      libsdp-devel should be created
  librdmacm:
      librdmacm
      librdmacm-devel
      librdmacm-utils
  libcxgb3:
      libcxgb3
      libcxgb3-devel
      Note: libcxgb3 rpmbuild fails:
      cp: cannot stat `ChangeLog': No such file or directory
  management:
      libibcommon
      libibcommon-devel
      libibmad
      libibmad-devel
      libibumad
      libibumad-devel
      opensm
      opensm-libs
      opensm-devel
      opensm-static
      infiniband-diags
  dapl:
      dapl
      dapl-devel
      dapl-utils
  srptools:
      srptools
  ibutils:
      ibutils
  mpi-selector:
      mpi-selector

- OFED-1.3 build procedure:
  OFED-1.3 daily/rc builds will be created on OFA server:
  userspace and kernel packages will be taken from git trees:
      git.openfabrics.org/ofed_1_3/package.git ofed_1_3

  Source RPMs will be created for each userspace package in the following way:
      git clone ...
      autogen.sh
      configure --disable-libcheck
      make dist
      rpmbuild -bs package.spec

  The following packages will be taken from maintainers as src.rpm:
      mvapich http://www.openfabrics.org/~pasha/ofed_1_3/mvapich,
      mvapich2 http://www.openfabrics.org/~rowland/ofed_1_3,
      openmpi http://www.openfabrics.org/~jsquyres/ofed_1_3,
      mpitests http://www.openfabrics.org/~pasha/ofed_1_3/mpitests,
      rds-tools http://www.openfabrics.org/~vlad/ofed_1_3/rds-tools,
      ib-bonding http://www.openfabrics.org/~monis/ofed_1_3,

- OFED-1.3 Installation
  install.pl script
  Flow:
      make list of packages following selection and dependencies.
for package in the list: build RPM from package.src.rpm install package RPM go to the next package in the list configuration if required Regards, Vladimir From ogerlitz at voltaire.com Mon Jul 16 05:36:04 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 16 Jul 2007 15:36:04 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <20070716115911.GA3379@mellanox.co.il> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> Message-ID: <469B6634.1050709@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> Indeed, today the mad module creates thread per device port, and the cm >> module creates thread per cpu, so if the system has 16 cores and two >> hcas each with two ports, the IB stack would create 20 threads just for >> the sake of the mad and cm modules... I also think it would be good if >> the mad module would create one thread and the cm module as well. >> Sean - does it make sense to you to change the CM for that matter? > Per-CPU threads like CM does might make sense since they improve data locality. Sorry but "improve data locality" is not enough information for me to understand why the IB CM --neeed-- to spawn n kernel threads on my n-core system, after all its slow path and the data does not moves on QP1, what's the story here? and if it needs thread-per-cpu, why not use the system threads/softirqs as does the TCP/IP stack connection mgmt code? Or. From mst at dev.mellanox.co.il Mon Jul 16 05:43:51 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 16 Jul 2007 15:43:51 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469B6634.1050709@voltaire.com> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> Message-ID: <20070716124351.GA23035@mellanox.co.il> > and if it needs thread-per-cpu, why not use > the system threads/softirqs as does the TCP/IP stack connection mgmt code? softirqs would be very awkward because things like create qp can't be done from that context. Using system threads might be possible, but one needs to be careful this might create problems for anyone who wants to e.g. destroy cm id from a system thread, which needs a flush. -- MST From ogerlitz at voltaire.com Mon Jul 16 05:47:59 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 16 Jul 2007 15:47:59 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <20070716124351.GA23035@mellanox.co.il> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <20070716124351.GA23035@mellanox.co.il> Message-ID: <469B68FF.4070806@voltaire.com> Michael S. Tsirkin wrote: >> and if it needs thread-per-cpu, why not use >> the system threads/softirqs as does the TCP/IP stack connection mgmt code? > > softirqs would be very awkward because things like create qp can't be done from > that context. Using system threads might be possible, but one needs to be > careful this might create problems for anyone who wants to e.g. destroy cm id > from a system thread, which needs a flush. you have decided to move directly to the and-if part, however, sometimes it worth to stop and explain yourself, you know Anyway, grep-ing for "flush" in cm.c yields nothing Or. 
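[Editor's note] The thread count quoted in this thread (16 cores, two HCAs with two ports each, 20 threads) can be checked with a short calculation. The following is a minimal userspace sketch, not kernel code; the helper name ib_thread_count and its parameters are hypothetical and exist only for illustration. It assumes one mad thread per device port (or a single shared one, as the patch proposes) and one cm thread per CPU.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the IB stack): count the kernel
 * threads created by the mad and cm modules under the model described
 * in the thread.
 *
 *   ncpus          number of CPUs (cm creates one thread per CPU)
 *   ports_per_hca  array with the port count of each HCA
 *   nhcas          number of entries in ports_per_hca
 *   single_mad_wq  non-zero models the proposed single mad workqueue
 */
static unsigned int ib_thread_count(unsigned int ncpus,
                                    const unsigned int *ports_per_hca,
                                    size_t nhcas,
                                    int single_mad_wq)
{
    unsigned int mad_threads = 0;
    size_t i;

    if (single_mad_wq) {
        /* One thread shared by all ports of all devices. */
        mad_threads = 1;
    } else {
        /* One thread per device port. */
        for (i = 0; i < nhcas; i++)
            mad_threads += ports_per_hca[i];
    }

    /* cm contributes one thread per CPU in either case. */
    return mad_threads + ncpus;
}
```

Under these assumptions, the per-port scheme gives 4 + 16 = 20 threads for the system described, and the single-workqueue scheme gives 1 + 16 = 17, so most of the count comes from the per-CPU cm threads.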
From mst at dev.mellanox.co.il Mon Jul 16 06:08:16 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jul 2007 16:08:16 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469B68FF.4070806@voltaire.com> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <20070716124351.GA23035@mellanox.co.il> <469B68FF.4070806@voltaire.com> Message-ID: <20070716130804.GA4454@mellanox.co.il> > Anyway, grep-ing for "flush" in cm.c yields nothing wait_for_completion is an implicit flush. That's one of the reasons why the comment near the callback says: "Users may not call ib_destroy_cm_id while in the context of this callback". -- MST From xma at us.ibm.com Mon Jul 16 07:41:20 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 16 Jul 2007 07:41:20 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <20070714175425.GA17597@mellanox.co.il> Message-ID: Michael, I would like to try this patch for one adapter/2 ports scalability performance for IPoIB. Is this patch appliable to OFED-1.2? Thanks Shirley From tziporet at mellanox.co.il Mon Jul 16 07:50:49 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 16 Jul 2007 17:50:49 +0300 Subject: [ofa-general] Agenda for OFED meeting today Message-ID: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> Hi All, We have our OFED synch meeting today at 9am PST. Agenda: 1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August. 2. Agree on OFED 1.3 schedule: * Feature freeze - Sep 4 * Alpha release - Sep 10 * Beta release - Sep 25 * RC1 - Oct 16 * RC2 - Oct 30 * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11) * RC4 - Nov 20 * GA release - Nov 30 (or first week of Dec) 3.
Review OFED 1.3 features list: In last meeting we decided that the schedule is one of the most important parameters in OFED 1.3. Thus I divided the features for two categories: * "must have" features - features that must be ready for the release (marked with *) * "optional" features - features that can be included in the release in case they are ready according to the schedule Must have general features: ==================== * Kernel base on 2.6.23 (all new features that will be part of this kernel will be included in OFED 1.3) * Install: * Break the packages RPMs (work with Novell and Redhat) to minimize integration effort into OS distribution * Package: * Sources arrangement for the end user (for the labs) * New HCAs & RNICs: * ConnectX support * Any other new HW? * QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP) Other features (must have marked with *) ============================== * libibverbs: New verbs: * Scalable Reliable Connected Transport (with Mellanox ConnectX)* * Reliable Multicast? ULPs: * IPoIB: * Performance improvements (those that will be stable on time) * NAPI - done * SDP: * * Keepalive * * AIO * uDAPL: * DAT 2.0 support with IB extensions for immediate data, atomics; * Add extensions for new verbs (SRCT,RM) * VNIC: * GA quality. Not a technology preview version anymore. 
* Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet gateway) - in GA * RDS: RDMA API (using FMRs); GA quality with Oracle 11 * NFSoRDMA integration - pending we have a maintainer * Management: * * Multiple partitions via libibumad * OpenSM * More routing performance improvements - done * Even more speedups - done * Better packaging/installation - done "Native" daemon mode - done * * Performance management * * Quality of Service manager: Based on IBTA annex * Enhancements for fat tree routing (non pure tree support) - done * More console commands and telnet access to console - done * More diagnostics * ibidsverify.pl: validate LIDs and GUIDs in subnet - done * Updated ibnetdiscover format with link width and speed, and GUIDs - done * ibnetdiscover grouping support for new Voltaire chassis - done * diag updates for IB router support - done * iblinkinfo.pl: Support peer port link width and speed validation - done * ibdatacounters: Add script and man page for subnet wide data counters saquery enhancements - done * iWARP: * * Chelsio: Get to GA level * NetEffect: Get the drivers into OFED Tziporet Koren Software Director Mellanox Technologies mailto: tziporet at mellanox.co.il Tel +972-4-9097200, ext 380 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Mon Jul 16 07:55:33 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jul 2007 17:55:33 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: References: <20070714175425.GA17597@mellanox.co.il> Message-ID: <20070716145533.GC4454@mellanox.co.il> > Quoting Shirley Ma : > Subject: Re: [ofa-general] Re: Further 2.6.23 merge plans... > > Michael, > > I would like to try this patch for one adapter/2 ports scalability performance > for IPoIB. Is this patch appliable to OFED-1.2? Most likely yes. 
-- MST From jsquyres at cisco.com Mon Jul 16 07:57:26 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 16 Jul 2007 10:57:26 -0400 Subject: [ofa-general] Agenda for OFED meeting today In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> Message-ID: <1A09652C-4BEC-4FAE-A9F1-E5937258235E@cisco.com> Reminder for all -- here is the dial-in information for the meeting: Code: 2102061 US/Canada: +1.866.432.9903 India: +91.80.4103.3979 Israel: +972.9.892.7026 Others: http://cisco.com/en/US/about/doing_business/conferencing/ The Outlook invitation expires today; I'll make a new one after the meeting starting 2 weeks from today. On Jul 16, 2007, at 10:50 AM, Tziporet Koren wrote: > Hi All, > > We have our OFED synch meeting today at 9am PST. > > Agenda: > 1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August. > 2. Agree on OFED 1.3 schedule: > * Feature freeze - Sep 4 > * Alpha release - Sep 10 > * Beta release - Sep 25 > * RC1 - Oct 16 > * RC2 - Oct 30 > * RC3 - Nov 8 (assuming many of us are at SC07 on the week > of Nov 11) > * RC4 - Nov 20 > * GA release - Nov 30 (or first week of Dec) > 3. Review OFED 1.3 features list: > In last meeting we decided that the schedule is one of the most > important parameters in OFED 1.3. 
> Thus I divided the features for two categories: > > "must have" features - features that must be ready for the release > (marked with *) > "optional" features - features that can be included in the release > in case they are ready according to the schedule > > Must have general features: > ==================== > > Kernel base on 2.6.23 (all new features that will be part of this > kernel will be included in OFED 1.3) > Install: > Break the packages RPMs (work with Novell and Redhat) to minimize > integration effort into OS distribution > Package: > Sources arrangement for the end user (for the labs) > New HCAs & RNICs: > ConnectX support > Any other new HW? > QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP) > > Other features (must have marked with *) > ============================== > > libibverbs: New verbs: > Scalable Reliable Connected Transport (with Mellanox ConnectX)* > Reliable Multicast? > > ULPs: > > IPoIB: > Performance improvements (those that will be stable on time) > NAPI - done > SDP: > * Keepalive > * AIO > uDAPL: > DAT 2.0 support with IB extensions for immediate data, atomics; > Add extensions for new verbs (SRCT,RM) > VNIC: > GA quality. Not a technology preview version anymore. 
> Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet > gateway) - in GA > RDS: RDMA API (using FMRs); GA quality with Oracle 11 > NFSoRDMA integration - pending we have a maintainer > > Management: > * Multiple partitions via libibumad > OpenSM > More routing performance improvements - done > Even more speedups - done > Better packaging/installation - done > “Native” daemon mode - done > * Performance management > * Quality of Service manager: Based on IBTA annex > Enhancements for fat tree routing (non pure tree support) - done > More console commands and telnet access to console - done > More diagnostics > ibidsverify.pl: validate LIDs and GUIDs in subnet - done > Updated ibnetdiscover format with link width and speed, and GUIDs - > done > ibnetdiscover grouping support for new Voltaire chassis - done > diag updates for IB router support - done > iblinkinfo.pl: Support peer port link width and speed validation - > done > ibdatacounters: Add script and man page for subnet wide data > counters saquery enhancements - done > > iWARP: > * Chelsio: Get to GA level > NetEffect: Get the drivers into OFED > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: tziporet at mellanox.co.il > Tel +972-4-9097200, ext 380 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Cisco Systems From rdreier at cisco.com Mon Jul 16 08:25:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 08:25:13 -0700 Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support In-Reply-To: <1184579123437-git-send-email-jens.axboe@oracle.com> (Jens Axboe's message of "Mon, 16 Jul 2007 11:45:17 +0200") References: <11845791213043-git-send-email-jens.axboe@oracle.com> <1184579123437-git-send-email-jens.axboe@oracle.com> 
Message-ID: [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland] Cc: rolandd at cisco.com Signed-off-by: Jens Axboe --- drivers/infiniband/hw/ipath/ipath_dma.c | 9 ++-- drivers/infiniband/ulp/iser/iser_memory.c | 75 +++++++++++++++------------- 2 files changed, 45 insertions(+), 39 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_dma.c b/drivers/infiniband/hw/ipath/ipath_dma.c index f87f003..62c87e6 100644 --- a/drivers/infiniband/hw/ipath/ipath_dma.c +++ b/drivers/infiniband/hw/ipath/ipath_dma.c @@ -96,17 +96,18 @@ static void ipath_dma_unmap_page(struct ib_device *dev, BUG_ON(!valid_dma_direction(direction)); } -static int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) +static int ipath_map_sg(struct ib_device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction direction) { + struct scatterlist *sg; u64 addr; int i; int ret = nents; BUG_ON(!valid_dma_direction(direction)); - for (i = 0; i < nents; i++) { - addr = (u64) page_address(sg[i].page); + for_each_sg(sgl, sg, nents, i) { + addr = (u64) page_address(sg->page); /* TODO: handle highmem pages */ if (!addr) { ret = 0; diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index fc9f1fd..ff0c701 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -37,7 +37,6 @@ #include #include #include -#include #include #include "iscsi_iser.h" @@ -126,17 +125,19 @@ int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, if (cmd_dir == ISER_DIR_OUT) { /* copy the unaligned sg the buffer which is used for RDMA */ - struct scatterlist *sg = (struct scatterlist *)data->buf; + struct scatterlist *sgl = (struct scatterlist *)data->buf; + struct scatterlist *sg; int i; char *p, *from; - for (p = mem, i = 0; i < data->size; i++) { - from = kmap_atomic(sg[i].page, KM_USER0); + p = mem; + for_each_sg(sgl, 
sg, data->size, i) { + from = kmap_atomic(sg->page, KM_USER0); memcpy(p, - from + sg[i].offset, - sg[i].length); + from + sg->offset, + sg->length); kunmap_atomic(from, KM_USER0); - p += sg[i].length; + p += sg->length; } } @@ -178,7 +179,7 @@ void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, if (cmd_dir == ISER_DIR_IN) { char *mem; - struct scatterlist *sg; + struct scatterlist *sgl, *sg; unsigned char *p, *to; unsigned int sg_size; int i; @@ -186,16 +187,17 @@ void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, /* copy back read RDMA to unaligned sg */ mem = mem_copy->copy_buf; - sg = (struct scatterlist *)iser_ctask->data[ISER_DIR_IN].buf; + sgl = (struct scatterlist *)iser_ctask->data[ISER_DIR_IN].buf; sg_size = iser_ctask->data[ISER_DIR_IN].size; - for (p = mem, i = 0; i < sg_size; i++){ - to = kmap_atomic(sg[i].page, KM_SOFTIRQ0); - memcpy(to + sg[i].offset, + p = mem; + for_each_sg(sgl, sg, sg_size, i) { + to = kmap_atomic(sg->page, KM_SOFTIRQ0); + memcpy(to + sg->offset, p, - sg[i].length); + sg->length); kunmap_atomic(to, KM_SOFTIRQ0); - p += sg[i].length; + p += sg->length; } } @@ -226,7 +228,8 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data, struct iser_page_vec *page_vec, struct ib_device *ibdev) { - struct scatterlist *sg = (struct scatterlist *)data->buf; + struct scatterlist *sgl = (struct scatterlist *)data->buf; + struct scatterlist *sg; u64 first_addr, last_addr, page; int end_aligned; unsigned int cur_page = 0; @@ -234,14 +237,14 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data, int i; /* compute the offset of first element */ - page_vec->offset = (u64) sg[0].offset & ~MASK_4K; + page_vec->offset = (u64) sgl[0].offset & ~MASK_4K; - for (i = 0; i < data->dma_nents; i++) { - unsigned int dma_len = ib_sg_dma_len(ibdev, &sg[i]); + for_each_sg(sgl, sg, data->dma_nents, i) { + unsigned int dma_len = ib_sg_dma_len(ibdev, sg); total_sz += dma_len; - first_addr = 
ib_sg_dma_address(ibdev, &sg[i]); + first_addr = ib_sg_dma_address(ibdev, sg); last_addr = first_addr + dma_len; end_aligned = !(last_addr & ~MASK_4K); @@ -249,9 +252,9 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data, /* continue to collect page fragments till aligned or SG ends */ while (!end_aligned && (i + 1 < data->dma_nents)) { i++; - dma_len = ib_sg_dma_len(ibdev, &sg[i]); + dma_len = ib_sg_dma_len(ibdev, sg); total_sz += dma_len; - last_addr = ib_sg_dma_address(ibdev, &sg[i]) + dma_len; + last_addr = ib_sg_dma_address(ibdev, sg) + dma_len; end_aligned = !(last_addr & ~MASK_4K); } @@ -286,25 +289,26 @@ static int iser_sg_to_page_vec(struct iser_data_buf *data, static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data, struct ib_device *ibdev) { - struct scatterlist *sg; + struct scatterlist *sgl, *sg; u64 end_addr, next_addr; int i, cnt; unsigned int ret_len = 0; - sg = (struct scatterlist *)data->buf; + sgl = (struct scatterlist *)data->buf; - for (cnt = 0, i = 0; i < data->dma_nents; i++, cnt++) { + cnt = 0; + for_each_sg(sgl, sg, data->dma_nents, i) { /* iser_dbg("Checking sg iobuf [%d]: phys=0x%08lX " "offset: %ld sz: %ld\n", i, - (unsigned long)page_to_phys(sg[i].page), - (unsigned long)sg[i].offset, - (unsigned long)sg[i].length); */ - end_addr = ib_sg_dma_address(ibdev, &sg[i]) + - ib_sg_dma_len(ibdev, &sg[i]); + (unsigned long)page_to_phys(sg->page), + (unsigned long)sg->offset, + (unsigned long)sg->length); */ + end_addr = ib_sg_dma_address(ibdev, sg) + + ib_sg_dma_len(ibdev, sg); /* iser_dbg("Checking sg iobuf end address " "0x%08lX\n", end_addr); */ if (i + 1 < data->dma_nents) { - next_addr = ib_sg_dma_address(ibdev, &sg[i+1]); + next_addr = ib_sg_dma_address(ibdev, sg_next(sg)); /* are i, i+1 fragments of the same page? 
*/ if (end_addr == next_addr) continue; @@ -324,15 +328,16 @@ static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data, static void iser_data_buf_dump(struct iser_data_buf *data, struct ib_device *ibdev) { - struct scatterlist *sg = (struct scatterlist *)data->buf; + struct scatterlist *sgl = (struct scatterlist *)data->buf; + struct scatterlist *sg; int i; - for (i = 0; i < data->dma_nents; i++) + for_each_sg(sgl, sg, data->dma_nents, i) iser_err("sg[%d] dma_addr:0x%lX page:0x%p " "off:0x%x sz:0x%x dma_len:0x%x\n", - i, (unsigned long)ib_sg_dma_address(ibdev, &sg[i]), - sg[i].page, sg[i].offset, - sg[i].length, ib_sg_dma_len(ibdev, &sg[i])); + i, (unsigned long)ib_sg_dma_address(ibdev, sg), + sg->page, sg->offset, + sg->length, ib_sg_dma_len(ibdev, sg)); } static void iser_dump_page_vec(struct iser_page_vec *page_vec) -- 1.5.3.rc0.90.gbaa79 From swise at opengridcomputing.com Mon Jul 16 08:56:14 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 16 Jul 2007 10:56:14 -0500 Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status In-Reply-To: <20070715094536.2109FE603CA@openfabrics.org> References: <20070715094536.2109FE603CA@openfabrics.org> Message-ID: <469B951E.4070006@opengridcomputing.com> What is the status of fixing the build breaks? It appears that these breaks cause the weekly ofed kits to be broken as well. I'm trying to provide customers with the latest cxgb3 fixes and this makes it difficult. Steve. 
Vladimir Sokolovsky wrote: > This email was generated automatically, please do not reply > > > git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git > git_branch: ofed_1_2_c > > Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod > > Passed: > Passed on i686 with 2.6.15-23-server > Passed on i686 with linux-2.6.21.1 > Passed on i686 with linux-2.6.18 > Passed on i686 with linux-2.6.13 > Passed on i686 with linux-2.6.17 > Passed on i686 with linux-2.6.15 > Passed on i686 with linux-2.6.19 > Passed on i686 with linux-2.6.14 > Passed on i686 with linux-2.6.16 > Passed on i686 with linux-2.6.12 > Passed on powerpc with linux-2.6.18 > Passed on ia64 with linux-2.6.19 > Passed on x86_64 with linux-2.6.19 > Passed on ia64 with linux-2.6.18 > Passed on x86_64 with linux-2.6.21.1 > Passed on powerpc with linux-2.6.17 > Passed on x86_64 with linux-2.6.18 > Passed on ppc64 with linux-2.6.18 > Passed on x86_64 with linux-2.6.15 > Passed on x86_64 with linux-2.6.12 > Passed on x86_64 with linux-2.6.20 > Passed on powerpc with linux-2.6.19 > Passed on x86_64 with linux-2.6.16 > Passed on x86_64 with linux-2.6.5-7.244-smp > Passed on x86_64 with linux-2.6.13 > Passed on ia64 with linux-2.6.15 > Passed on ia64 with linux-2.6.12 > Passed on ia64 with linux-2.6.13 > Passed on x86_64 with linux-2.6.14 > Passed on ppc64 with linux-2.6.12 > Passed on ppc64 with linux-2.6.15 > Passed on x86_64 with linux-2.6.17 > Passed on powerpc with linux-2.6.13 > Passed on ia64 with linux-2.6.16 > Passed on ppc64 with linux-2.6.13 > Passed on powerpc with linux-2.6.16 > Passed on powerpc with linux-2.6.14 > Passed on ia64 with linux-2.6.21.1 > Passed on ppc64 with linux-2.6.19 > Passed on ppc64 with linux-2.6.14 > Passed on ia64 with linux-2.6.14 > Passed on ia64 with linux-2.6.17 > Passed on ppc64 with linux-2.6.16 > Passed on 
powerpc with linux-2.6.15 > Passed on powerpc with linux-2.6.12 > Passed on ppc64 with linux-2.6.17 > Passed on x86_64 with linux-2.6.16.43-0.3-smp > Passed on x86_64 with linux-2.6.16.21-0.8-smp > Passed on ppc64 with linux-2.6.18-8.el5 > Passed on x86_64 with linux-2.6.9-55.ELsmp > Passed on x86_64 with linux-2.6.9-34.ELsmp > Passed on x86_64 with linux-2.6.18-8.el5 > Passed on x86_64 with linux-2.6.9-42.ELsmp > Passed on x86_64 with linux-2.6.9-22.ELsmp > Passed on ia64 with linux-2.6.16.21-0.8-default > Passed on x86_64 with linux-2.6.18-1.2798.fc6 > > Failed: > Build failed on i686 with linux-2.6.22-rc7 > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Mon Jul 16 09:01:44 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 16 Jul 2007 11:01:44 -0500 Subject: [ofa-general] Agenda for OFED meeting today In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> Message-ID: <469B9668.6020007@opengridcomputing.com> Tziporet Koren wrote: > > Must have general features: > ==================== > > * Kernel base on 2.6.23 (all new features that will be part of this > kernel will be included in OFED 1.3) Note that the cxgb3 drivers are in kernel.org 2.6.23. They weren't in the upstream kernel when we started ofed_1_2 so they were added directly into that git tree. So for ofed_1_3, the cxgb3 drivers should be taken directly from 2.6.23 (and discard the ofed_1_2 cxgb3 drivers). I don't know who will create the initial ofed_1_2 tree, but this will require some git surgery I think. 
Steve From rdreier at cisco.com Mon Jul 16 09:04:26 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:04:26 -0700 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: <200707121746.36763.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 12 Jul 2007 17:46:35 +0200") References: <200707121745.27592.fenkes@de.ibm.com> <200707121746.36763.fenkes@de.ibm.com> Message-ID: > The eHCA driver can now handle multiple event queues (read: interrupt > sources) instead of one. The number of available EQs is selected via the > nr_eqs module parameter. > CQs are either assigned to the EQs based on the comp_vector index or, if the > dist_eqs module parameter is supplied, using a round-robin scheme. Do you have any data on how well this round-robin assignment works? It seems not quite right to me for the driver to advertise nr_eqs completion vectors, but then if round-robin is turned on to ignore the consumer's decision about which vector to use. Maybe if round-robin is turned on you should report 0 as the number of completion vectors? Or maybe we should allow well-known values for the completion vector passed to ib_create_cq to allow consumers to specify a policy (like round robin) instead of a particular vector? Maybe the whole interface is broken and we should only be exposing policies to consumers instead of the specific vector? I think I would rather hold off on multiple EQs for this merge window and plan on having something really solid and thought-out for 2.6.24. - R. 
From swise at opengridcomputing.com Mon Jul 16 09:04:46 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 16 Jul 2007 11:04:46 -0500 Subject: [ofa-general] Agenda for OFED meeting today In-Reply-To: <469B9668.6020007@opengridcomputing.com> References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> <469B9668.6020007@opengridcomputing.com> Message-ID: <469B971E.3040604@opengridcomputing.com> Steve Wise wrote: > Tziporet Koren wrote: > >> >> Must have general features: >> ==================== >> >> * Kernel base on 2.6.23 (all new features that will be part of this >> kernel will be included in OFED 1.3) > > Note that the cxgb3 drivers are in kernel.org 2.6.23. They weren't in > the upstream kernel when we started ofed_1_2 so they were added directly > into that git tree. So for ofed_1_3, the cxgb3 drivers should be taken > directly from 2.6.23 (and discard the ofed_1_2 cxgb3 drivers). > > I don't know who will create the initial ofed_1_2 tree, but this will ^^^^^^^^ I meant ofed_1_3 > require some git surgery I think. > > Steve > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From tziporet at dev.mellanox.co.il Mon Jul 16 09:06:57 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 16 Jul 2007 19:06:57 +0300 Subject: [ofa-general] ofa_1_2_kernel 20070715-0200 daily build status In-Reply-To: <469B951E.4070006@opengridcomputing.com> References: <20070715094536.2109FE603CA@openfabrics.org> <469B951E.4070006@opengridcomputing.com> Message-ID: <469B97A1.8080306@mellanox.co.il> Steve Wise wrote: > What is the status of fixing the build breaks? It appears that these > breaks cause the weekly ofed kits to be broken as well. I'm trying to > provide customers with the latest cxgb3 fixes and this makes it > difficult. 
> I think Jim just fixed the SDP issues that was the ULP that broke the build Tziporet From mshefty at ichips.intel.com Mon Jul 16 09:22:50 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 16 Jul 2007 09:22:50 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469B6634.1050709@voltaire.com> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> Message-ID: <469B9B5A.2040707@ichips.intel.com> > Sorry but "improve data locality" is not enough information for me to > understand why the IB CM --neeed-- to spawn n kernel threads on my > n-core system, after all its slow path and the data does not moves on > QP1, what's the story here? and if it needs thread-per-cpu, why not use > the system threads/softirqs as does the TCP/IP stack connection mgmt code? IMO, if we're going to have multiple cores, then we should create multiple threads to use them. This becomes more important as the number of cores increases. (The overhead of a non-running thread can't be that much.) Stating that connection establishment is a slow path operation assumes that all connections are long lived. The current behavior of the MAD layer is that all callbacks for a given registration are serialized. We either need to preserve this functionality or verify that MAD users can handle simultaneous callbacks. (Hopefully MAD users didn't make any assumptions regarding the threading model used by the MAD layer, but we need to verify this. I'm more worried about code in the MAD layer itself.) - Sean From rdreier at cisco.com Mon Jul 16 09:32:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:32:52 -0700 Subject: [ofa-general] [PATCH 2.6.23] iw_cxgb3: remove the cm_id reference on listen failures. References: <20070711180435.11665.71117.stgit@dell3.ogc.int> Message-ID: thanks, applied. 
From rdreier at cisco.com Mon Jul 16 09:37:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:37:52 -0700 Subject: [ofa-general] Re: [PATCH 1 of 2] mlx4: implement query-qp References: <200706211227.47794.jackm@dev.mellanox.co.il> <200707151028.24013.jackm@dev.mellanox.co.il> Message-ID: this was a patch to a patch, which is not very useful (especially since the original patch is upstream in Linus's tree). anyway I applied this as two patches... From rdreier at cisco.com Mon Jul 16 09:42:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:42:53 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... References: <20070713054711.GA21709@mellanox.co.il> <20070714175425.GA17597@mellanox.co.il> Message-ID: > > I haven't done any work on it or seen anything from anyone else, so I > > expect this will have to wait for 2.6.24. > I'm surprised to hear this. How about this: > http://lists.openfabrics.org/pipermail/general/2007-May/035757.html Sure, I remember that. But I haven't seen anything to suggest that anyone has given any further thought to the issues that were raised in that thread. - R. From rdreier at cisco.com Mon Jul 16 09:42:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:42:52 -0700 Subject: [ofa-general] Re: [PATCH] mlx4/IB: Take sizeof the correct pointer when calling to memset References: <200707151500.09578.dotanb@dev.mellanox.co.il> Message-ID: thanks, applied. unfortunately I copied the buggy mthca code before the fix was merged there... From rdreier at cisco.com Mon Jul 16 09:42:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:42:53 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... References: <469A1293.6020902@mellanox.co.il> Message-ID: > Till when can we insert mlx4 with FMRs? 2.6.22 came out on July 8, so I would expect 2.6.23-rc1 (the end of the merge window) to be July 22. 
From rdreier at cisco.com Mon Jul 16 09:47:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:47:53 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix References: <20070715212146.GF6921@sgi.com> Message-ID: > dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, > - enum dma_data_direction direction) > + enum dma_data_direction direction, int coherent) "coherent" seems like the wrong name here... really the property being asked for is "flush other in-flight DMAs" or something like that (I don't know precisely what setting the magic bit in the DMA address does on Altix). Also maybe it would make more sense to fold this into the existing direction parameter somehow, so that most of the kernel can stay unchanged (because as far as I know, Altix is the only platform that has this extra quirk of allowing DMAs to pass each other). - R. From rdreier at cisco.com Mon Jul 16 09:47:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:47:52 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... References: Message-ID: > FYI, we are working on several IPoIB performance improvement > patches which are not on the list. Some of the patches are under test, > some of the patches are going to be submitted soon. They are: There is less than a week left in the merge window, and none of these changes has been reviewed yet. So being realistic, I don't think we can expect to get any of this into 2.6.23. - R. From rdreier at cisco.com Mon Jul 16 09:52:53 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:52:53 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> Message-ID: > This will be very painful and frankly I don't think the pain is > justified. 
Can't you confine the changes to the IB layer so that the > mapping happens through dma_alloc_coherent if you need > coherent/consistent memory rather than through dma_map_sg? The memory being dealt with here is buffers that are only used by the device and userspace. And the problem being solved is not really that the memory needs to be coherent -- it is just that on Altix, using coherent memory turns on another side effect that DMAs to that memory flush other in-flight DMAs to other memory. So there are several reasons I don't like using dma_alloc_coherent() to allocate this memory, and then mapping it into userspace (rather than having userspace allocate it and then map it to the device, as these patches do): - dma_alloc_coherent() has to allocate kernel address space for memory, and in this case the kernel will never touch the memory. So this is pure waste, and on 32-bit systems, these allocations could easily fail since kernel address space is scarce. - The property being asked for is not really coherent memory but rather "set the magic bit in the bus address so the Altix chipset flushes other DMAs", and I think it would be cleaner to ask for that explicitly rather than relying on the side effect of coherent memory. > Also, this kind of thing should definitely be CC'd to lkml. I agree on that. - R. From rdreier at cisco.com Mon Jul 16 09:52:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:52:52 -0700 Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name References: Message-ID: > The mad module creates a thread per active port where the thread name is > derived from the port name. This causes different threads to have the same > names when there are multiple devices. Fix that by using both the device > and the port numbers to derive the name. What problem does the duplicate name cause in the first place? - R.
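The naming collision described in the quoted IB/mad patch text can be sketched in user-space C. This is only an illustration under stated assumptions: the helper names, the device index, and the "ib_mad..." format strings are hypothetical and are not taken from the patch itself.

```c
#include <stdio.h>

/* Hypothetical helper: the name depends on the port number only, so
 * port 1 of every device ends up with the same thread name. */
int name_from_port(char *buf, size_t len, int port_num)
{
	return snprintf(buf, len, "ib_mad%d", port_num);
}

/* Hypothetical helper: the name depends on a device index and the port
 * number, so port 1 of device 0 and port 1 of device 1 differ. */
int name_from_dev_and_port(char *buf, size_t len, int dev_index, int port_num)
{
	return snprintf(buf, len, "ib_mad_%d_%d", dev_index, port_num);
}
```

With two single-port devices, the first helper yields the same string for both ports, while the second keeps them apart. Whether the duplicate name causes any real problem is the open question asked in the reply above.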
From rdreier at cisco.com Mon Jul 16 09:57:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 09:57:52 -0700 Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix References: <20070715212445.GG6921@sgi.com> Message-ID: Looks reasonable but I would prefer to see explicit tests of the abi version so that we use the old register MR ABI for old kernels rather than unconditionally passing the extra parameter. From rdreier at cisco.com Mon Jul 16 10:14:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:14:03 -0700 Subject: [ofa-general] Re: [PATCH 04/10] IB/ehca: use common error code mapping instead of specific ones In-Reply-To: <200707121749.03556.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 12 Jul 2007 17:49:02 +0200") References: <200707121745.27592.fenkes@de.ibm.com> <200707121749.03556.fenkes@de.ibm.com> Message-ID: > @@ -161,8 +161,11 @@ static inline int ehca2ib_return_code(u64 ehca_rc) applied, but as a further cleanup it seems that ehca2ib_return_code() should be moved into a .c file and moved out of line -- I think it would probably shrink the compiled code quite a bit, and as far as I can see it is never used in the data path where the function call overhead would matter at all. From rdreier at cisco.com Mon Jul 16 10:37:09 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:37:09 -0700 Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs In-Reply-To: <200707121754.20293.fenkes@de.ibm.com> (Joachim Fenkes's message of "Thu, 12 Jul 2007 17:54:19 +0200") References: <200707121745.27592.fenkes@de.ibm.com> <200707121754.20293.fenkes@de.ibm.com> Message-ID: > Add support for MR pages larger than 4K on eHCA2. This reduces firmware > memory consumption. If enabled via the mr_largepage module parameter, the MR > page size will be determined based on the MR length and the hardware > capabilities - if the MR is >= 16M, 16M pages are used, for example. 
Why the module parameter? Is there any reason a user would want to turn this off? Or conversely, why is it off by default? Also this patch seems to depend heavily on the multiple EQ patch, which I am holding off on now. So you may want to rebase to my current tree, which has all the ehca patches except the EQ one. > static ssize_t ehca_show_nr_eqs(struct device *dev, > struct device_attribute *attr, > char *buf) > { > return sprintf(buf, "%d\n", ehca_nr_eqs); > } > - > static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL); Although trivial, this chunk doesn't really belong in this patch -- just fix it up in the multiple EQ patch (which I haven't merged yet). - R. From rdreier at cisco.com Mon Jul 16 10:38:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:38:52 -0700 Subject: [ofa-general] [PATCH] IB/iser: Make a couple of functions static Message-ID: Make iser_conn_release() and iser_start_rdma_unaligned_sg() static, since they are only used in the .c file where they are defined. In addition to being a cleanup, this even shrinks the generated code by allowing the single call of iser_start_rdma_unaligned_sg() to be inlined into its callsite. On x86_64:
add/remove: 0/1 grow/shrink: 1/0 up/down: 466/-533 (-67)
function                          old     new   delta
iser_reg_rdma_mem                1518    1984    +466
iser_start_rdma_unaligned_sg      533       -    -533
Signed-off-by: Roland Dreier --- Erez, does this look OK to merge for 2.6.23?
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index 8960196..671faff 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -310,8 +310,6 @@ int iser_conn_init(struct iser_conn **ib_conn); void iser_conn_terminate(struct iser_conn *ib_conn); -void iser_conn_release(struct iser_conn *ib_conn); - void iser_rcv_completion(struct iser_desc *desc, unsigned long dto_xfer_len); @@ -329,9 +327,6 @@ void iser_reg_single(struct iser_device *device, struct iser_regd_buf *regd_buf, enum dma_data_direction direction); -int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, - enum iser_data_dir cmd_dir); - void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *ctask, enum iser_data_dir cmd_dir); diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index fc9f1fd..36cdf77 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -103,8 +103,8 @@ void iser_reg_single(struct iser_device *device, /** * iser_start_rdma_unaligned_sg */ -int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, - enum iser_data_dir cmd_dir) +static int iser_start_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, + enum iser_data_dir cmd_dir) { int dma_nents; struct ib_device *dev; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 3702e23..132edc6 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -311,6 +311,29 @@ static int iser_conn_state_comp_exch(struct iser_conn *ib_conn, } /** + * Frees all conn objects and deallocs conn descriptor + */ +static void iser_conn_release(struct iser_conn *ib_conn) +{ + struct iser_device *device = ib_conn->device; + + BUG_ON(ib_conn->state != ISER_CONN_DOWN); + + mutex_lock(&ig.connlist_mutex); + list_del(&ib_conn->conn_list); + 
mutex_unlock(&ig.connlist_mutex); + + iser_free_ib_conn_res(ib_conn); + ib_conn->device = NULL; + /* on EVENT_ADDR_ERROR there's no device yet for this conn */ + if (device != NULL) + iser_device_try_release(device); + if (ib_conn->iser_conn) + ib_conn->iser_conn->ib_conn = NULL; + kfree(ib_conn); +} + +/** * triggers start of the disconnect procedures and wait for them to be done */ void iser_conn_terminate(struct iser_conn *ib_conn) @@ -550,30 +573,6 @@ connect_failure: } /** - * Frees all conn objects and deallocs conn descriptor - */ -void iser_conn_release(struct iser_conn *ib_conn) -{ - struct iser_device *device = ib_conn->device; - - BUG_ON(ib_conn->state != ISER_CONN_DOWN); - - mutex_lock(&ig.connlist_mutex); - list_del(&ib_conn->conn_list); - mutex_unlock(&ig.connlist_mutex); - - iser_free_ib_conn_res(ib_conn); - ib_conn->device = NULL; - /* on EVENT_ADDR_ERROR there's no device yet for this conn */ - if (device != NULL) - iser_device_try_release(device); - if (ib_conn->iser_conn) - ib_conn->iser_conn->ib_conn = NULL; - kfree(ib_conn); -} - - -/** * iser_reg_page_vec - Register physical memory * * returns: 0 on success, errno code on failure From rdreier at cisco.com Mon Jul 16 10:43:02 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:43:02 -0700 Subject: [ofa-general] is ipath_layer.c dead code? Message-ID: My kernel seems to build and link fine with the patch below. Is ipath_layer.c being used for anything, or can we just kill it? - R. 
diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile index ec2e603..fe67388 100644 --- a/drivers/infiniband/hw/ipath/Makefile +++ b/drivers/infiniband/hw/ipath/Makefile @@ -14,7 +14,6 @@ ib_ipath-y := \ ipath_init_chip.o \ ipath_intr.o \ ipath_keys.o \ - ipath_layer.o \ ipath_mad.o \ ipath_mmap.o \ ipath_mr.o \ diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c deleted file mode 100644 index 82616b7..0000000 --- a/drivers/infiniband/hw/ipath/ipath_layer.c +++ /dev/null @@ -1,365 +0,0 @@ -/* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - */ - -/* - * These are the routines used by layered drivers, currently just the - * layered ethernet driver and verbs layer. - */ - -#include -#include - -#include "ipath_kernel.h" -#include "ipath_layer.h" -#include "ipath_verbs.h" -#include "ipath_common.h" - -/* Acquire before ipath_devs_lock. */ -static DEFINE_MUTEX(ipath_layer_mutex); - -u16 ipath_layer_rcv_opcode; - -static int (*layer_intr)(void *, u32); -static int (*layer_rcv)(void *, void *, struct sk_buff *); -static int (*layer_rcv_lid)(void *, void *); - -static void *(*layer_add_one)(int, struct ipath_devdata *); -static void (*layer_remove_one)(void *); - -int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_intr) - ret = layer_intr(dd->ipath_layer.l_arg, arg); - - return ret; -} - -int ipath_layer_intr(struct ipath_devdata *dd, u32 arg) -{ - int ret; - - mutex_lock(&ipath_layer_mutex); - - ret = __ipath_layer_intr(dd, arg); - - mutex_unlock(&ipath_layer_mutex); - - return ret; -} - -int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr, - struct sk_buff *skb) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_rcv) - ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb); - - return ret; -} - -int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_rcv_lid) - ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr); - - return ret; -} - -void ipath_layer_lid_changed(struct ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (dd->ipath_layer.l_arg && layer_intr) - layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID); - - mutex_unlock(&ipath_layer_mutex); -} - -void 
ipath_layer_add(struct ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (layer_add_one) - dd->ipath_layer.l_arg = - layer_add_one(dd->ipath_unit, dd); - - mutex_unlock(&ipath_layer_mutex); -} - -void ipath_layer_remove(struct ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (dd->ipath_layer.l_arg && layer_remove_one) { - layer_remove_one(dd->ipath_layer.l_arg); - dd->ipath_layer.l_arg = NULL; - } - - mutex_unlock(&ipath_layer_mutex); -} - -int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *), - void (*l_remove)(void *), - int (*l_intr)(void *, u32), - int (*l_rcv)(void *, void *, struct sk_buff *), - u16 l_rcv_opcode, - int (*l_rcv_lid)(void *, void *)) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - - layer_add_one = l_add; - layer_remove_one = l_remove; - layer_intr = l_intr; - layer_rcv = l_rcv; - layer_rcv_lid = l_rcv_lid; - ipath_layer_rcv_opcode = l_rcv_opcode; - - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - if (!(dd->ipath_flags & IPATH_INITTED)) - continue; - - if (dd->ipath_layer.l_arg) - continue; - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd); - spin_lock_irqsave(&ipath_devs_lock, flags); - } - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - mutex_unlock(&ipath_layer_mutex); - - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_register); - -void ipath_layer_unregister(void) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - if (dd->ipath_layer.l_arg && layer_remove_one) { - spin_unlock_irqrestore(&ipath_devs_lock, flags); - layer_remove_one(dd->ipath_layer.l_arg); - spin_lock_irqsave(&ipath_devs_lock, flags); - dd->ipath_layer.l_arg = NULL; - } - } - - 
spin_unlock_irqrestore(&ipath_devs_lock, flags); - - layer_add_one = NULL; - layer_remove_one = NULL; - layer_intr = NULL; - layer_rcv = NULL; - layer_rcv_lid = NULL; - - mutex_unlock(&ipath_layer_mutex); -} - -EXPORT_SYMBOL_GPL(ipath_layer_unregister); - -int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax) -{ - int ret; - u32 intval = 0; - - mutex_lock(&ipath_layer_mutex); - - if (!dd->ipath_layer.l_arg) { - ret = -EINVAL; - goto bail; - } - - ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS); - - if (ret < 0) - goto bail; - - *pktmax = dd->ipath_ibmaxlen; - - if (*dd->ipath_statusp & IPATH_STATUS_IB_READY) - intval |= IPATH_LAYER_INT_IF_UP; - if (dd->ipath_lid) - intval |= IPATH_LAYER_INT_LID; - if (dd->ipath_mlid) - intval |= IPATH_LAYER_INT_BCAST; - /* - * do this on open, in case low level is already up and - * just layered driver was reloaded, etc. - */ - if (intval) - layer_intr(dd->ipath_layer.l_arg, intval); - - ret = 0; -bail: - mutex_unlock(&ipath_layer_mutex); - - return ret; -} - -EXPORT_SYMBOL_GPL(ipath_layer_open); - -u16 ipath_layer_get_lid(struct ipath_devdata *dd) -{ - return dd->ipath_lid; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_lid); - -/** - * ipath_layer_get_mac - get the MAC address - * @dd: the infinipath device - * @mac: the MAC is put here - * - * This is the EUID-64 OUI octets (top 3), then - * skip the next 2 (which should both be zero or 0xff). - * The returned MAC is in network order - * mac points to at least 6 bytes of buffer - * We assume that by the time the LID is set, that the GUID is as valid - * as it's ever going to be, rather than adding yet another status bit. 
- */ - -int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac) -{ - u8 *guid; - - guid = (u8 *) &dd->ipath_guid; - - mac[0] = guid[0]; - mac[1] = guid[1]; - mac[2] = guid[2]; - mac[3] = guid[5]; - mac[4] = guid[6]; - mac[5] = guid[7]; - if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff)) - ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: " - "%x %x\n", guid[3], guid[4]); - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_mac); - -u16 ipath_layer_get_bcast(struct ipath_devdata *dd) -{ - return dd->ipath_mlid; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_bcast); - -int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr) -{ - int ret = 0; - u32 __iomem *piobuf; - u32 plen, *uhdr; - size_t count; - __be16 vlsllnh; - - if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) { - ipath_dbg("send while not open\n"); - ret = -EINVAL; - } else - if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) || - dd->ipath_lid == 0) { - /* - * lid check is for when sma hasn't yet configured - */ - ret = -ENETDOWN; - ipath_cdbg(VERBOSE, "send while not ready, " - "mylid=%u, flags=0x%x\n", - dd->ipath_lid, dd->ipath_flags); - } - - vlsllnh = *((__be16 *) hdr); - if (vlsllnh != htons(IPATH_LRH_BTH)) { - ipath_dbg("Warning: lrh[0] wrong (%x, not %x); " - "not sending\n", be16_to_cpu(vlsllnh), - IPATH_LRH_BTH); - ret = -EINVAL; - } - if (ret) - goto done; - - /* Get a PIO buffer to use. 
*/ - piobuf = ipath_getpiobuf(dd, NULL); - if (piobuf == NULL) { - ret = -EBUSY; - goto done; - } - - plen = (sizeof(*hdr) >> 2); /* actual length */ - ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf); - - writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */ - ipath_flush_wc(); - piobuf += 2; - uhdr = (u32 *)hdr; - count = plen-1; /* amount we can copy before trigger word */ - __iowrite32_copy(piobuf, uhdr, count); - ipath_flush_wc(); - __raw_writel(uhdr[count], piobuf + count); - ipath_flush_wc(); /* ensure it's sent, now */ - - ipath_stats.sps_ether_spkts++; /* ether packet sent */ - -done: - return ret; -} - -EXPORT_SYMBOL_GPL(ipath_layer_send_hdr); - -int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd) -{ - set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); - - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl); - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int); From rdreier at cisco.com Mon Jul 16 10:49:12 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:49:12 -0700 Subject: [ofa-general] [PATCH] IB/ipath: Make a few functions static In-Reply-To: (Roland Dreier's message of "Mon, 16 Jul 2007 10:43:02 -0700") References: Message-ID: Make some functions that are only used in a single .c file static. In addition to being a cleanup, this shrinks the generated code. On x86_64:
add/remove: 1/3 grow/shrink: 2/1 up/down: 4777/-4956 (-179)
function                      old     new   delta
handle_errors                   -    3994   +3994
__verbs_timer                  42     710    +668
ipath_do_ruc_send            2131    2246    +115
ipath_no_bufs_available       136       -    -136
ipath_disarm_senderrbufs      639       -    -639
ipath_ib_timer                658       -    -658
ipath_intr                   5878    2355   -3523
Signed-off-by: Roland Dreier --- Does this look OK to merge for 2.6.23?
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 9361f5a..09c5fd8 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1889,7 +1889,7 @@ void ipath_write_kreg_port(const struct ipath_devdata *dd, ipath_kreg regno, /* Below is "non-zero" to force override, but both actual LEDs are off */ #define LED_OVER_BOTH_OFF (8) -void ipath_run_led_override(unsigned long opaque) +static void ipath_run_led_override(unsigned long opaque) { struct ipath_devdata *dd = (struct ipath_devdata *)opaque; int timeoff; diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c index 6b91479..b4503e9 100644 --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c @@ -426,8 +426,8 @@ bail: * @buffer: data to write * @len: number of bytes to write */ -int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset, - const void *buffer, int len) +static int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset, + const void *buffer, int len) { u8 single_byte; int sub_len; diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 47aa434..1fd91c5 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -70,7 +70,7 @@ static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum) * If rewrite is true, and bits are set in the sendbufferror registers, * we'll write to the buffer, for error recovery on parity errors. 
*/ -void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) +static void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) { u32 piobcnt; unsigned long sbuf[4]; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 3105005..b6ccd04 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -776,7 +776,6 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); -void ipath_disarm_senderrbufs(struct ipath_devdata *, int); /* * Set LED override, only the two LSBs have "public" meaning, but diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 8525674..c69c252 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -507,7 +507,7 @@ static int want_buffer(struct ipath_devdata *dd) * * Called when we run out of PIO buffers. */ -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) +static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) { unsigned long flags; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 65f7181..16aa61f 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -488,7 +488,7 @@ bail:; * This is called from ipath_do_rcv_timer() at interrupt level to check for * QPs which need retransmits and to collect performance numbers. 
*/ -void ipath_ib_timer(struct ipath_ibdev *dev) +static void ipath_ib_timer(struct ipath_ibdev *dev) { struct ipath_qp *resend = NULL; struct list_head *last; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index f3d1f2c..9bbe819 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -782,8 +782,6 @@ void ipath_update_mmap_info(struct ipath_ibdev *dev, int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); - void ipath_insert_rnr_queue(struct ipath_qp *qp); int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only); @@ -807,8 +805,6 @@ void ipath_ib_rcv(struct ipath_ibdev *, void *, void *, u32); int ipath_ib_piobufavail(struct ipath_ibdev *); -void ipath_ib_timer(struct ipath_ibdev *); - unsigned ipath_get_npkeys(struct ipath_devdata *); u32 ipath_get_cr_errpkey(struct ipath_devdata *); From rdreier at cisco.com Mon Jul 16 10:49:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 10:49:51 -0700 Subject: [ofa-general] is ipath_get_user_pages_nocopy() dead code? In-Reply-To: (Roland Dreier's message of "Mon, 16 Jul 2007 10:43:02 -0700") References: Message-ID: I don't see any callers of ipath_get_user_pages_nocopy(). Should we just delete it? - R. 
From pradeeps at linux.vnet.ibm.com Mon Jul 16 11:00:49 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 16 Jul 2007 11:00:49 -0700 Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation (for IPoIB CM) In-Reply-To: <4697A9A3.2020706@linux.vnet.ibm.com> References: <469680DB.6000602@linux.vnet.ibm.com> <4697A9A3.2020706@linux.vnet.ibm.com> Message-ID: <469BB251.5050808@linux.vnet.ibm.com> Pradeep Satyanarayana wrote: > Roland Dreier wrote: >> > In the absence of any further discussions about the IPoIB CM >> without SRQ >> > patches, I will incorporate Sean Hefty's comments and plan to resubmit >> > the patches, unless I hear something soon. >> >> Sorry for not devoting enough time to this, but something always seems >> to come up, and I really want to be able to focus a concentrated chunk >> of time on this, and I never seem to be able to. Anyway, I would >> prefer to find a solution that everyone can agree on, without me >> having to rule by decree. >> >> I think updating the patch is a good idea. Although I didn't get a >> chance to review it carefully there were a number of obvious messy >> parts that should be cleaned up. >> >> I am beginning to think that your basic approach is probably right, >> but I also still think it should be possible to handle both SRQ and >> non-SRQ without any overhead on the fast path. I don't understand the >> "maintainability" argument against doing this. Can you expand on your >> position a little? >> > > I will try to illustrate with an example: > > One of the ways to do this is to completely split SRQ and non-SRQ > processing starting in ipoib_poll(). This would eliminate most of > the if (srq) kind of branches. However, there would be a lot of code > duplication. If a bug is discovered in one path, then one needs to > fix that in the other path too. > > One way to mitigate this situation is to alter the current SRQ code > to use common code (between SRQ and non-SRQ). 
However, one might not > want to factor out a few lines of common code into a new function. There > may be several such occurrences of this resulting in code bloat. > > If you look back, several weeks ago ipoib_drain_cq() did not exist. This > is another function that calls ipoib_cm_handle_rx_wc(). We would need > to alter this function too to accommodate the SRQ and non-SRQ split. In > effect, we have propagated the SRQ and non-SRQ code to functions > outside ipoib_cm.c. In the future, if IPoIB CM supports UC mode > this might mean additional functions handling the split. > > On the other hand, in V6 (and previous versions) of the patch > ipoib_cm_handle_rx_wc() handles the SRQ and non-SRQ paths. Both SRQ and > non-SRQ functionality is contained within ipoib_cm.c. What we now have > is probably one more branch in the packet handling path than the > minimum (desired) with a lot of common code. > > Pradeep > Roland, Since the merge window will close in the next few days, do you have a few suggestions that you would like me to incorporate into the patch? Pradeep From sean.hefty at intel.com Mon Jul 16 11:03:06 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 16 Jul 2007 11:03:06 -0700 Subject: [ofa-general] [PATCH] for-23 ib/local_sa: adjust data offset by attribute offset, not size Message-ID: <000301c7c7d3$8ff4a3f0$3c98070a@amr.corp.intel.com> I merged the patch below with the local_sa patch in my for-roland branch. It's shown below separately for review purposes only. The fix is based on code review rather than on an observed bug. (Since a path record is 64 bytes, it's almost guaranteed that the size and offset will be the same.) - Sean We should adjust the data offset by the attribute offset, and not the size of the attribute. The attribute offset includes any necessary padding between the attributes.
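The size-versus-offset distinction in the local_sa description above can be sketched as a small stand-alone C helper. It is only a sketch under stated assumptions: the function and its parameters are hypothetical (attr_stride plays the role of the attribute offset, i.e. payload plus padding), and it does not reproduce the local_sa iterator or its handling of attributes that span two MADs.

```c
#include <stddef.h>

/*
 * Hypothetical helper: record the start position of each fixed-size
 * attribute in a buffer of data_len bytes.
 *
 * attr_size   - bytes of payload in one attribute
 * attr_stride - distance between consecutive attributes
 *               (payload plus any padding), so attr_stride >= attr_size
 *
 * Returns the number of start positions written to out (at most max_out).
 */
size_t attr_starts(size_t data_len, size_t attr_size, size_t attr_stride,
		   size_t *out, size_t max_out)
{
	size_t pos = 0, n = 0;

	if (attr_stride == 0 || attr_stride < attr_size)
		return 0;

	while (n < max_out && pos + attr_size <= data_len) {
		out[n++] = pos;
		pos += attr_stride;	/* advance by the stride, not the size */
	}
	return n;
}
```

With 6-byte attributes padded to an 8-byte stride, stepping by the size instead of the stride would land the second attribute at byte 6 rather than byte 8. When the two values are equal, as the message above expects for 64-byte path records, both ways of stepping give the same positions.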
Signed-off-by: Sean Hefty --- drivers/infiniband/core/local_sa.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/local_sa.c b/drivers/infiniband/core/local_sa.c index 6c073a3..75545a5 100644 --- a/drivers/infiniband/core/local_sa.c +++ b/drivers/infiniband/core/local_sa.c @@ -369,11 +369,11 @@ static void *ib_sa_iter_next(struct ib_sa_mad_iter *iter) /* copy the second piece of the attribute */ memcpy(iter->attr + offset, &mad->data[0], iter->attr_size - offset); - iter->data_offset = iter->attr_size - offset; + iter->data_offset = iter->attr_offset - offset; offset = 0; } else { iter->attr = &mad->data[iter->data_offset]; - iter->data_offset += iter->attr_size; + iter->data_offset += iter->attr_offset; } iter->data_left -= iter->attr_offset; From bob.kossey at hp.com Mon Jul 16 11:54:21 2007 From: bob.kossey at hp.com (Bob Kossey) Date: Mon, 16 Jul 2007 14:54:21 -0400 Subject: [ofa-general] RFC OFED-1.3 installation In-Reply-To: <469BBB83.2010100@hp.com> References: <469BBB83.2010100@hp.com> Message-ID: <469BBEDD.20806@hp.com> Hi Vlad, This looks good, a few comments. As you are splitting out RPM spec files for each package, I would like to see the RPM release numbers be consistently updated whenever changes are made to a package. Ideally, this would be coordinated with the release numbers from the distros, so that we could tell whether a version of an OFED RPM in a distro was the older, the same or more recent than an OFED RPM from openfabrics.org. This would also allow us to update them with rpm -Uvh. Extra credit would be given for adding dependency information to the packages. I also like the idea of clearly separating the build of the RPMs from their installation. I would like to see all target system modifications be made by RPM files, or postinstall scripts, rather than from the install.pl script, which may not always be run on a target. 
Thanks, Bob > Hi, > I am starting to work on the new installation procedure for OFED-1.3. > Please review and comment. > > Main changes from OFED-1.2: > - Split ofa_user-1.2.src.rpm into separate source RPMs per package. > * Requires RPM spec file for each package. > Currently, the following packages lack an RPM spec file: > libehca, > mstflint, > qlvnictools, > perftest, > sdpnetstat > > User space RPM packages list taken from maintainers' RPM spec files: > > libibverbs: > libibverbs > libibverbs-devel > libibverbs-devel-static > libibverbs-utils > > libmthca: > libmthca > libmthca-devel-static > > libehca: > No RPM spec file > > libipathverbs: > libipathverbs > libipathverbs-devel > > libibcm: > libibcm > libibcm-devel > > libsdp: > libsdp > libsdp-devel should be created > > librdmacm: > librdmacm > librdmacm-devel > librdmacm-utils > > libcxgb3: > libcxgb3 > libcxgb3-devel > > Note: libcxgb3 rpmbuild fails: > cp: cannot stat `ChangeLog': No such file or directory > > management: > libibcommon > libibcommon-devel > libibmad > libibmad-devel > libibumad > libibumad-devel > opensm > opensm-libs > opensm-devel > opensm-static > infiniband-diags > > dapl: > dapl > dapl-devel > dapl-utils > > srptools: > srptools > > ibutils: > ibutils > > mpi-selector: > mpi-selector > > - OFED-1.3 build procedure: > OFED-1.3 daily/rc builds will be created on the OFA server: > userspace and kernel packages will be taken from git trees: > git.openfabrics.org/ofed_1_3/package.git ofed_1_3 > > Source RPMs will be created for each userspace package in the > following way: > > git clone ...
> autogen.sh > configure --disable-libcheck > make dist > rpmbuild -bs package.spec > > The following packages will be taken from maintainers as src.rpm: > > mvapich http://www.openfabrics.org/~pasha/ofed_1_3/mvapich, > > mvapich2 http://www.openfabrics.org/~rowland/ofed_1_3, > > openmpi http://www.openfabrics.org/~jsquyres/ofed_1_3, > > mpitests http://www.openfabrics.org/~pasha/ofed_1_3/mpitests, > > rds-tools http://www.openfabrics.org/~vlad/ofed_1_3/rds-tools, > > ib-bonding http://www.openfabrics.org/~monis/ofed_1_3, > > > > > - OFED-1.3 Installation > install.pl script > Flow: > make list of packages following selection and dependencies. > for package in the list: > build RPM from package.src.rpm > install package RPM > go to the next package in the list > > configuration if required > > > Regards, > Vladimir > > From xma at us.ibm.com Mon Jul 16 12:32:59 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 16 Jul 2007 12:32:59 -0700 Subject: [ofa-general] RFC OFED-1.3 installation In-Reply-To: <469BBEDD.20806@hp.com> Message-ID: Is ib-utils depends on opensm-libs? If so I would suggest to change opensm-libs as libsmutils. Otherwise ib-utils won't work without installing opensm package. Does this make sense? Thanks Shirley Bob Kossey To Sent by: general at lists.openfabrics.org general-bounces at l cc ists.openfabrics. org Subject Re: [ofa-general] RFC OFED-1.3 installation 07/16/07 11:54 AM Hi Vlad, This looks good, a few comments. As you are splitting out RPM spec files for each package, I would like to see the RPM release numbers be consistently updated whenever changes are made to a package. Ideally, this would be coordinated with the release numbers from the distros, so that we could tell whether a version of an OFED RPM in a distro was the older, the same or more recent than an OFED RPM from openfabrics.org. This would also allow us to update them with rpm -Uvh. Extra credit would be given for adding dependency information to the packages. 
I also like the idea of clearly separating the build of the RPMs from their installation. I would like to see all target system modifications be made by RPM files, or postinstall scripts, rather than from the install.pl script, which may not always be run on a target. Thanks, Bob _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general 
From hal.rosenstock at gmail.com Mon Jul 16 12:52:51 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 16 Jul 2007 12:52:51 -0700 Subject: [ofa-general] OpenFabrics Bugzilla change Message-ID: Hi Scott, Would you change anything I'm a maintainer for (OpenSM and diags) over to Sasha ? Thanks. 
-- Hal From sweitzen at cisco.com Mon Jul 16 12:55:01 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 16 Jul 2007 12:55:01 -0700 Subject: [ofa-general] RE: OpenFabrics Bugzilla change In-Reply-To: References: Message-ID: For just new bugs, or for existing bugs, too? Scott ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, July 16, 2007 12:53 PM To: Scott Weitzenkamp (sweitzen) Cc: sashak at voltaire.com; general at lists.openfabrics.org Subject: OpenFabrics Bugzilla change Hi Scott, Would you change anything I'm a maintainer for (OpenSM and diags) over to Sasha ? Thanks. -- Hal From mst at dev.mellanox.co.il Mon Jul 16 13:05:40 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 16 Jul 2007 23:05:40 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: References: <20070713054711.GA21709@mellanox.co.il> <20070714175425.GA17597@mellanox.co.il> Message-ID: <20070716200540.GA8527@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: Further 2.6.23 merge plans... > > > > I haven't done any work on it or seen anything from anyone else, so I > > > expect this will have to wait for 2.6.24. > > > I'm surprised to hear this. How about this: > > http://lists.openfabrics.org/pipermail/general/2007-May/035757.html > > Sure, I remember that. But I haven't seen anything to suggest that > anyone has given any further thought to the issues that were raised in > that thread. Well, the only issue I recall is about the # of EQs we want to allocate. Was there something else? Maybe code can be merged as-is (2 EQs) and the number be tuned later as applications start using vectors? 
-- MST From FENKES at de.ibm.com Mon Jul 16 13:34:19 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Mon, 16 Jul 2007 22:34:19 +0200 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: Message-ID: Roland Dreier wrote on 16.07.2007 18:04:26: > It seems not quite right to me for the driver to advertise nr_eqs > completion vectors, but then if round-robin is turned on to ignore the > consumer's decision about which vector to use. The round-robin feature was primarily meant as a debug/evaluation feature; it is not supposed to be active by default. ULP programmers can, for example, quickly evaluate the performance increase that comp_vectors could give them, without changing their code. Without this debug option, the comp_vector policy is still up to the ULPs. > Maybe if round-robin is turned on you should report 0 as the number of > completion vectors? That sounds like a reasonable idea -- I'll change that right away. > Maybe the whole interface is broken and we should only be exposing > policies to consumers instead of the specific vector? If so, I think the policies should be handled by the IB core code instead of being re-invented by each driver. The IB core would then again pass actual comp_vector values to the driver. > I think I would rather hold off on multiple EQs for this merge window > and plan on having something really solid and thought-out for 2.6.24. It's your call, but the code is there and I don't expect it to change a lot later, so it could be used by others to get a first impression of what's possible using comp_vectors and to gather some experience with them. 
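The behaviour discussed above can be sketched in plain C. This is a userspace model under stated assumptions, not the ehca driver's code: the struct and function names are hypothetical, and reporting 0 vectors while round-robin is active follows the suggestion made in this thread.

```c
#include <assert.h>

/* Hypothetical stand-in for per-device state; not the ehca driver's types. */
struct eq_policy {
    int nr_eqs;        /* number of event queues the device has */
    int round_robin;   /* debug switch: nonzero overrides the consumer */
    unsigned int next; /* next queue handed out in round-robin mode */
};

/* Number of completion vectors advertised to consumers. While the
 * round-robin debug switch is on, the consumer's choice is ignored,
 * so no selectable vectors are advertised. */
static int advertised_vectors(const struct eq_policy *p)
{
    return p->round_robin ? 0 : p->nr_eqs;
}

/* Returns the event queue index for a new CQ, or -1 if the requested
 * vector is out of range. */
static int pick_event_queue(struct eq_policy *p, int requested)
{
    assert(p->nr_eqs > 0);

    if (p->round_robin) {
        int eq = (int)(p->next % (unsigned int)p->nr_eqs);

        p->next++;
        return eq;
    }
    if (requested < 0 || requested >= p->nr_eqs)
        return -1;
    return requested;
}
```

With the switch off, the policy stays entirely with the consumer: the requested vector is used as given or rejected.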
Regards, Joachim From FENKES at de.ibm.com Mon Jul 16 13:35:25 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Mon, 16 Jul 2007 22:35:25 +0200 Subject: [ofa-general] Re: [PATCH 04/10] IB/ehca: use common error code mapping instead of specific ones In-Reply-To: Message-ID: Roland Dreier wrote on 16.07.2007 19:14:03: > applied, but as a further cleanup it seems that ehca2ib_return_code() > should be moved into a .c file and moved out of line -- I think it > would probably shrink the compiled code quite a bit, and as far as I > can see it is never used in the data path where the function call > overhead would matter at all. Sounds reasonable; I'll put it in the next patch series. Joachim From HNGUYEN at de.ibm.com Mon Jul 16 13:37:44 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 16 Jul 2007 22:37:44 +0200 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: Message-ID: Roland Dreier wrote on 16.07.2007 18:04:26: > Do you have any data on how well this round-robin assignment works? > It seems not quite right to me for the driver to advertise nr_eqs > completion vectors, but then if round-robin is turned on to ignore the > consumer's decision about which vector to use. No, I have no figures to provide here. The purpose of this dist_eqs option is to let us test across all event queues without changing the test cases or consumers to use a specific event queue number. So should I mark it as EXPERIMENTAL? > Maybe if round-robin is turned on you should report 0 as the number of > completion vectors? Or maybe we should allow well-known values for > the completion vector passed to ib_create_cq to allow consumers to > specify a policy (like round robin) instead of a particular vector? > Maybe the whole interface is broken and we should only be exposing > policies to consumers instead of the specific vector? 
I agree that the device driver should not override the consumer's policy for event queue assignment. Since dist_eqs is disabled by default, there's no issue, is there? Regarding ib_verbs: perhaps we should provide create/destroy_eq() and let upper level protocols or consumers dictate the assignment to a CQ by passing an event queue pointer to create_cq()... > I think I would rather hold off on multiple EQs for this merge window > and plan on having something really solid and thought-out for 2.6.24. Fair enough. However, why not let us gather experience with this feature now? Should we remove the dist_eqs option for more consistency? Thanks Nam From rdreier at cisco.com Mon Jul 16 13:39:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 13:39:13 -0700 Subject: [ofa-general] Re: [PATCH draft, untested] ehca srq emulation (for IPoIB CM) In-Reply-To: <469BB251.5050808@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Mon, 16 Jul 2007 11:00:49 -0700") References: <469680DB.6000602@linux.vnet.ibm.com> <4697A9A3.2020706@linux.vnet.ibm.com> <469BB251.5050808@linux.vnet.ibm.com> Message-ID: > Since the merge window will close in the next few days, do you have a > few suggestions that you would like me to incorporate into the patch? The only thing I can remember from the quick look I took at your last posting was that it used an atomic variable in a silly way to keep track of how many connections were already established, since the way the value was used was racy anyway. - R. From hal.rosenstock at gmail.com Mon Jul 16 14:05:55 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 16 Jul 2007 14:05:55 -0700 Subject: [ofa-general] Re: OpenFabrics Bugzilla change In-Reply-To: References: Message-ID: On 7/16/07, Scott Weitzenkamp (sweitzen) wrote: > > For just new bugs, or for existing bugs, too? > Both. Thanks. If you want I will go over all the existing ones and reassign them. Let me know. 
-- Hal Scott > > ------------------------------ > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Monday, July 16, 2007 12:53 PM > *To:* Scott Weitzenkamp (sweitzen) > *Cc:* sashak at voltaire.com; general at lists.openfabrics.org > *Subject:* OpenFabrics Bugzilla change > > Hi Scott, > > Would you change anything I'm a maintainer for (OpenSM and diags) over to > Sasha ? > > Thanks. > > -- Hal > > From FENKES at de.ibm.com Mon Jul 16 14:11:47 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Mon, 16 Jul 2007 23:11:47 +0200 Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs In-Reply-To: Message-ID: Roland Dreier wrote on 16.07.2007 19:37:09: > > If enabled via the mr_largepage module parameter, > > Why the module parameter? Is there any reason a user would want to > turn this off? Or conversely, why is it off by default? We're pretty confident this new feature works, but as with all new and possibly experimental features, there are chances it might explode your machine when activated. So, like with the scaling code, we want the user to make the conscious decision of using this code instead of activating it by default. > > static ssize_t ehca_show_nr_eqs(struct device *dev, > > struct device_attribute *attr, > > char *buf) > > { > > return sprintf(buf, "%d\n", ehca_nr_eqs); > > } > > - > > static DEVICE_ATTR(nr_eqs, S_IRUGO, ehca_show_nr_eqs, NULL); > > Although trivial, this chunk doesn't really belong in this patch -- > just fix it up in the multiple EQ patch (which I haven't merged yet). Sure thing. 
Regards, Joachim From muli at il.ibm.com Mon Jul 16 14:40:25 2007 From: muli at il.ibm.com (Muli Ben-Yehuda) Date: Tue, 17 Jul 2007 00:40:25 +0300 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix In-Reply-To: References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> Message-ID: <20070716214025.GI4902@rhun.haifa.ibm.com> On Mon, Jul 16, 2007 at 09:52:53AM -0700, Roland Dreier wrote: > > This will be very painful and frankly I don't think the pain is > > justified. Can't you confine the changes to the IB layer so that > > the mapping happens through dma_alloc_coherent if you need > > coherent/consistent memory rather than through dma_map_sg? > > The memory being dealt with here is buffers that are only used by > the device and userspace. And the problem being solved is not > really that the memory needs to be coherent -- it is just that on > Altix, using coherent memory turns on another side effect that DMAs > to that memory flush other in-flight DMAs to other memory. > > So there are several reasons I don't like using dma_alloc_coherent() > to allocate this memory, and then mapping it into userspace (rather > than having userspace allocate it and then map it to the device, as > these patches do): > > - dma_alloc_coherent() has to allocate kernel address space for > memory, and in this case the kernel will never touch the memory. > So this is pure waste, and on 32-bit systems, these allocations > could easily fail since kernel address space is scarce. But isn't this an Altix-specific issue, which makes the 32-bit issue moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only implemented for Altix, which is inarguably ugly). > - The property being asked for is not really coherent memory but > rather "set the magic bit in the bus address so the Altix chipset > flushes other DMAs", and I think it would be cleaner to ask for > that explicitly rather than relying on the side effect of > coherent memory. 
That makes sense. However I didn't quite understand if the above means that you're ok with the patch posted, or prefer a different (third) approach? Cheers, Muli From rdreier at cisco.com Mon Jul 16 14:47:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 14:47:28 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix In-Reply-To: <20070716214025.GI4902@rhun.haifa.ibm.com> (Muli Ben-Yehuda's message of "Tue, 17 Jul 2007 00:40:25 +0300") References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> <20070716214025.GI4902@rhun.haifa.ibm.com> Message-ID: > But isn't this an Altix specific issue, which makes the 32-bit issue > moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only > implemented for Altix, which is in-arguably ugly). Well, I don't want to have one code path just for Altix and another for all normal systems. It kind of goes against the spirit of the DMA API, which is to provide an abstraction so that drivers can be written without system-specific details. > > - The property being asked for is not really coherent memory but > > rather "set the magic bit in the bus address so the Altix chipset > > flushes other DMAs", and I think it would be cleaner to ask for > > that explicitly rather than relying on the side effect of > > coherent memory. > > That makes sense. However I didn't quite understand if the above means > that you're ok with the patch posted, or prefer a different (third) > approach? I'm OK with the main idea, but I don't think adding a "coherent" flag to the mapping API is the right way to ask for this bit to be set on Altix. I can think of two approaches that seem somewhat sane: - Add a flag that gets passed in the normal "direction" parameter of the DMA mapping APIs, which is ignored on most systems and not set by most drivers. Adds some churn to the internal implementation on all archs, though. 
or - Add new functions dma_map_single_flushing(), dma_map_sg_flushing() and dma_map_page_flushing() that are defined to be the same as the non-flushing variants except on Altix. Fairly quick and easy to implement, but arguably makes the DMA API even more bloated with even more functions that are only slightly different. Dunno... - R. From akepner at sgi.com Mon Jul 16 14:56:07 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Mon, 16 Jul 2007 14:56:07 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix In-Reply-To: <20070716214025.GI4902@rhun.haifa.ibm.com> References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> <20070716214025.GI4902@rhun.haifa.ibm.com> Message-ID: <20070716215607.GB16538@sgi.com> On Tue, Jul 17, 2007 at 12:40:25AM +0300, Muli Ben-Yehuda wrote: > On Mon, Jul 16, 2007 at 09:52:53AM -0700, Roland Dreier wrote: > .... > > The memory being dealt with here is buffers that are only used by > > the device and userspace. And the problem being solved is not > > really that the memory needs to be coherent -- it is just that on > > Altix, using coherent memory turns on another side effect that DMAs > > to that memory flush other in-flight DMAs to other memory. > > > > So there are several reasons I don't like using dma_alloc_coherent() > > to allocate this memory, and then mapping it into userspace (rather > > than having userspace allocate it and then map it to the device, as > > these patches do): > > > > - dma_alloc_coherent() has to allocate kernel address space for > > memory, and in this case the kernel will never touch the memory. > > So this is pure waste, and on 32-bit system, these allocations > > could easily fail since kernel address space is scarce. > > But isn't this an Altix specific issue, which makes the 32-bit issue > moot? (I'm assuming the "fix" to use dma_alloc_coherent() is only > implemented for Altix, which is in-arguably ugly). 
> I believe Roland was referring to an alternate solution to the problem. One that I did before, and which didn't involve changing the dma_map_sg() prototype. (Instead it allocated memory with dma_alloc_coherent() and then mmap()-ed it into user space. (See: http://lists.openfabrics.org/pipermail/general/2007-January/032218.html ) > > - The property being asked for is not really coherent memory but > > rather "set the magic bit in the bus address so the Altix chipset > > flushes other DMAs", and I think it would be cleaner to ask for > > that explicitly rather than relying on the side effect of > > coherent memory. > > That makes sense. However I didn't quite understand if the above means > that you're ok with the patch posted, or prefer a different (third) > approach? > Another patchset is imminent. This one won't change the dma_map_sg() prototype. It'll pass extra flags (only for IA64_SGI_SN2) in the upper bits of the direction argument. Otherwise it's similar to what I posted at the start of this thread. -- Arthur From muli at il.ibm.com Mon Jul 16 14:57:35 2007 From: muli at il.ibm.com (Muli Ben-Yehuda) Date: Tue, 17 Jul 2007 00:57:35 +0300 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix In-Reply-To: References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> <20070716214025.GI4902@rhun.haifa.ibm.com> Message-ID: <20070716215735.GJ4902@rhun.haifa.ibm.com> On Mon, Jul 16, 2007 at 02:47:28PM -0700, Roland Dreier wrote: > - Add a flag that gets passed in the normal "direction" parameter > of the DMA mapping APIs, which is ignored on most systems and not > set by most drivers. Adds some churn to the internal > implementation on all archs, though. This will potentially break a bunch of things that assume the only valid values of direction are NONE, TO_DEVICE, FROM_DEVICE, or BOTH (e.g., include/linux/dma-mapping.h:valid_dma_direction()). 
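The concern about reusing the direction argument can be shown with a small userspace model. Everything here is an assumption made for illustration: the constants, the flag position and the function names are invented and are not the kernel's DMA API. The check simply accepts the four direction values named above.

```c
#include <assert.h>

/* Illustrative direction values; names and numbering are assumptions. */
enum {
    DIR_BIDIRECTIONAL = 0,
    DIR_TO_DEVICE     = 1,
    DIR_FROM_DEVICE   = 2,
    DIR_NONE          = 3
};

#define DIR_MASK       0xff     /* low bits carry the direction proper */
#define DIR_FLAG_FLUSH (1 << 8) /* hypothetical "flush other DMAs" flag */

/* A check that assumes the argument is exactly one of the four values. */
static int is_valid_direction(int dir)
{
    return dir == DIR_BIDIRECTIONAL || dir == DIR_TO_DEVICE ||
           dir == DIR_FROM_DEVICE || dir == DIR_NONE;
}

/* The same check, made tolerant of flags carried in the upper bits. */
static int is_valid_direction_masked(int dir)
{
    return is_valid_direction(dir & DIR_MASK);
}

/* Extracts only the flag bits from a combined value. */
static int direction_flags(int dir)
{
    return dir & ~DIR_MASK;
}

/* Packs a direction and flag bits into one integer argument. */
static int encode_direction(int dir, int flags)
{
    assert(is_valid_direction(dir));
    assert((flags & DIR_MASK) == 0);
    return dir | flags;
}
```

A value built by encode_direction() with the flag set fails the unmasked check, so any existing check of that shape would need to mask the argument before validating it.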
> or > > - Add new functions dma_map_single_flushing(), > dma_map_sg_flushing() and dma_map_page_flushing() that are > defined to be the same as the non-flushing variants except on > Altix. Fairly quick and easy to implement, but arguably makes > the DMA API even more bloated with even more functions that are > only slightly different. > > Dunno... Looks like we need to harness the collective power of lkml. Better hope everyone isn't distracted by the flamewar-du-jour. Cheers, Muli From akepner at sgi.com Mon Jul 16 15:00:54 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Mon, 16 Jul 2007 15:00:54 -0700 Subject: [ofa-general] [RFC 0/1] libmthca: CQ/DMA race on Altix In-Reply-To: <20070716215735.GJ4902@rhun.haifa.ibm.com> References: <20070715212146.GF6921@sgi.com> <20070716073435.GE3530@rhun.haifa.ibm.com> <20070716214025.GI4902@rhun.haifa.ibm.com> <20070716215735.GJ4902@rhun.haifa.ibm.com> Message-ID: <20070716220054.GC16538@sgi.com> On Tue, Jul 17, 2007 at 12:57:35AM +0300, Muli Ben-Yehuda wrote: > This will potentially break a bunch of things that assume the only > valid values of direction are NONE, TO_DEVICE, FROM_DEVICE, or BOTH > (e.g., include/linux/dma-mapping.h:valid_dma_direction()). I think I've got this covered in a reasonably unobjectionable way. > .... > Looks like we need to harness the collective power of lkml. Better > hope everyone isn't distracted by the flamewar-du-jour. > I'm preparing my asbestos suit now. -- Arthur From xma at us.ibm.com Mon Jul 16 15:16:00 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 16 Jul 2007 15:16:00 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: Message-ID: Hello Roland, Roland Dreier wrote on 07/16/2007 09:47:52 AM: > > FYI, we are working on several IPoIB performance improvement > > patches which are not on the list. Some of the patches are under test, > > some of the patches are going to be submitted soon. 
They are: > > There is less than a week left in the merge window, and none of these > changes has been reviewed yet. So being realistic, I don't think we > can expect to get any of this into 2.6.23. > > - R. Yes, most of the patches are depends on IPoIB-CM no SRQ support. We can't submit them for review without a full performance matrix test (Mellanox, Galaxy1 ... for both UD/RC modes). Thanks Shirley 
From swise at opengridcomputing.com Mon Jul 16 16:06:56 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 16 Jul 2007 18:06:56 -0500 Subject: [ofa-general] problem with daily builds Message-ID: <469BFA10.7070209@opengridcomputing.com> Vlad, It appears the daily ofa_1_2_kernel builds are not building the latest code from the ofed_1_2 git tree. For example, I pulled down the ofa_1_2_kernel-20070716-0200 tree and the file drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 git repository. Here's the BUILD_ID from that tree. Note it's the wrong git repository... # cat BUILD_ID Git: git://git.openfabrics.org/ofed_1_2/linux-2.6.git commit 556f7870719506619990a58fddb3fd9eab4b9990 What's up? Steve. From dledford at redhat.com Mon Jul 16 20:29:28 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 03:29:28 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <469B639A.1090804@dev.mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> Message-ID: <1184642968.5165.414.camel@firewall.xsintricity.com> On Mon, 2007-07-16 at 15:24 +0300, Vladimir Sokolovsky wrote: [ snip ] Most of this proposal was just about splitting the packages up. That's good, but it doesn't warrant much comment. It's not the existence of different packages that will draw my comments, it's the *content* of the packages that will, the actual spec files themselves. However, there is this one tidbit: > Source RPMs will be created for each userspace package in the following way: > > git clone ... > autogen.sh > configure --disable-libcheck > make dist This is so fundamentally broken as to be brain dead. Yet it's what has been done since OFED 1.0. Can you imagine how screwed the open source world would be if one day Linus released kernel-2.6.24.tar.gz on kernel.org, only to silently update the file kernel-2.6.24.tar.gz to something else the next? 
This is the *ONLY* open source software group I know of that creates new tar.gz files any time they make a change but keeps the version of the file the same. Let me copy and paste an email conversation I had with Or that highlights why this is broken: ------- Begin cut-n-paste On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote: > [sorry for breaking the thread, I am working from home now and unable to use normal mailer.] > > Does this means that the OFED 1.3 effort is useless from your point of view? Yes and no. The effort to get a complete set of working libraries and stacks pulled together and debugged is good and worthwhile. The packaging has been done all wrong though. Because the ewg has concentrated on supporting local compile installations, they don't really have the faintest clue about several important issues that crop up specifically when you are attempting to support binary distribution instead of source distribution. That in turn has led them to make decisions that have proved to be very counterproductive to my end goal of a supportable environment for my customers. Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In OFED 1.1, you also shipped dapl version 1.2. However, code inspection shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you have two different versions of dapl, but with exactly the same version number. A person can't tell them apart. Furthermore, unless the person is compiling locally, they'll never get the OFED 1.1 dapl installed because RPM/up2date will see that they already have the current version even when they have the OFED 1.0 version. So, in our RPMs, I updated the OFED 1.1 dapl version we built to be 1.2.1. Without doing that, the binary upgrade process that we use would have never worked. Then, in OFED 1.2, you guys update the dapl code again, and this time you decide to use...wait for it...that's right, 1.2.1. Great. 
Now we have a conflict between your 1.2.1 and our 1.2.1. How do people know which is which? They don't. And, of course, in order for binary upgrades to work, I once again had to bump our number. Our OFED 1.2 package now builds dapl 1.2.1.1 just because I had to do *something* in order to make upgrades work. The only reason that the OFED distribution has *ever* reliably installed the rpms you wanted installed is because you compile things locally and then *force* the upgrade of rpms over the top of older rpms that have the same version number. And even then, you yourselves can't tell the difference between a customer with the OFED 1.0 or OFED 1.1 dapl installed by checking the RPM version, you just have to go off what the end user *tells* you he installed and hope he's right. So, quite simply, the EWG has *chosen* to support source distribution and local compiles. That's fine really. But they've also chosen to bury their head in the sand about basic, non-flexible rules associated with any successful binary distribution and update process, even when I've brought those rules up multiple times. It should be no wonder then why I get all up in arms about packaging issues. Everything I give my customers has to automatically and correctly install, upgrade, downgrade, delete, verify, etc using RPM/up2date/yum. It can't require any --force options. And I don't have a choice about that. And I have to *know* what software my customer is running in order to support them. Because you guys have done things the way you have, I can't know that. I might be able to know if I could also guarantee they didn't download and locally compile your packages, but if they did, then the same version number of RPM can mean two different things entirely depending on whether it's your RPM or mine. ------- End cut-n-paste I posted links to a wealth of valuable information on the topic of making a proper spec file and creating *good* packages during my talk at Sonoma. 
I gather you haven't read those or you never would have suggested the above for creating the RPMs. I've already reached the decision that the next release of the RDMA stack that Red Hat releases will adhere to much stricter guidelines than in the past. From now on, all packages I build based upon software from the OpenFabrics Alliance will adhere to these guidelines: 1. All tar.gz files will be imported once and exactly once into our SCM repo. At that point they will be MD5 summed and the MD5 sum will be checked on all subsequent builds to verify the tar.gz file has not changed. 2. All fixes to released tar.gz files will be in the form of patches applied in the spec file during the %prep phase, or they will require a new tar.gz file. Under no circumstances will an existing tar.gz file be updated to include fixes. 3. All packages will have a version and release number appropriate to the tar.gz release of the software and the build of the package. 4. All tar.gz files *must* have a publicly available URL from which they can be downloaded. 5. All tar.gz files that have a home site other than openfabrics.org will be taken from their home site. Eg. openmpi will come from the openmpi site. No special openfabrics versions of already existing packages will be considered. 6. All source repos that utilize the autoconf configure capability will have configure run at build time. Any configure output produced prior to build will not be considered usable. On the other hand, we expect that autogen.sh *will* be run prior to making the tarball. If the software does not meet the above minimal guidelines, then it won't be considered for inclusion in our product. In addition to these rules, if you want me to consider using your spec files to build the packages, then these additional rules apply: 7. All spec changes will be accompanied by a changelog entry. 8. The spec file will be clean and readable. 
Spec files cluttered up with multitudes of options that have no impact on a standardized distribution will not be included. 9. All spec files must be Linux File Hierarchy Standards compliant. 10. All spec files must pass rpmlint tests. 11. All code must be built using the %build section of the spec file. The %install section is for installation *ONLY*. 12. Spec files must build debug packages. 13. Spec files must leave the default build scripts enabled. 14. Spec files must list appropriate BuildRequires entries. 15. Spec files must not list Provides entries unless the build scripts are unable to determine that they provide a particular item, or in cases like the MPI packages where they can specify that they provide the generic mpi facility in addition to the specific mpi library provides that the build system will pick up automatically. There's probably more things to list, but I really don't feel like repeating what amounts to our standard build requirements when they are already all written out in the guides I talked about in my Sonoma talk. Hopefully, all of this will help you get a clearer picture of what I expect the EWG's work on cleaning up their packaging to cover. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rdreier at cisco.com Mon Jul 16 20:48:40 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 20:48:40 -0700 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: (Hoang-Nam Nguyen's message of "Mon, 16 Jul 2007 22:37:44 +0200") References: Message-ID: > No, I've no figures to provide here. 
The background of this dist_eqs > option is actually to allow us testing across all event queues > without to change the testcases resp consumers to use certain > event queue number. Thus, I should comment it as EXPERIMENTAL? Seems like it's just development/testing code that shouldn't escape into the wild? > > I think I would rather hold off on multiple EQs for this merge window > > and plan on having something really solid and thought-out for 2.6.24. > Fair enough. However why don't let us gather experience with this > feature now? Should we remove dist_eqs option for more consistency? As I said I definitely think the dist_eqs switch doesn't sound like something we want to expose to people. With that said I still am not sure about putting the multiple EQs feature in this release. All the infrastructure is there to make experimenting with it fairly painless (just the low-level driver needs to change), and I still haven't seen much code using the feature or even any anecdotal information about the performance impact. From rdreier at cisco.com Mon Jul 16 20:50:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 16 Jul 2007 20:50:13 -0700 Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs In-Reply-To: (Joachim Fenkes's message of "Mon, 16 Jul 2007 23:11:47 +0200") References: Message-ID: > > Why the module parameter? Is there any reason a user would want to > > turn this off? Or conversely, why is it off by default? > > We're pretty confident this new feature works, but as with all new and > possibly experimental features, there are chances it might explode your > machine when activated. So, like with the scaling code, we want the user > to make the conscious decision of using this code instead of activating it > by default. OK, I guess. So can we expect to, say, change the default to turning it on for 2.6.24 and remove the option entirely (so it's always on) in 2.6.25? - R. 
From kliteyn at mellanox.co.il Mon Jul 16 21:37:59 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 17 Jul 2007 07:37:59 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-17:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From mst at dev.mellanox.co.il Mon Jul 16 21:37:40 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jul 2007 07:37:40 +0300 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: References: Message-ID: <20070717043740.GB8527@mellanox.co.il> > I still haven't seen much code using the feature or > even any anecdotal information about the performance impact. 
Here's some anecdotal evidence :) http://lists.openfabrics.org/pipermail/general/2007-May/035758.html -- MST From erezz at voltaire.com Mon Jul 16 22:10:46 2007 From: erezz at voltaire.com (Erez Zilber) Date: Tue, 17 Jul 2007 08:10:46 +0300 Subject: [ofa-general] Re: [PATCH] IB/iser: Make a couple of functions static In-Reply-To: References: Message-ID: <469C4F56.6060404@voltaire.com> Roland Dreier wrote: > Make iser_conn_release() and iser_start_rdma_unaligned_sg() static, > since they are only used in the .c file where they are defined. In > addition to being a cleanup, this even shrinks the generated code by > allowing the single call of iser_start_rdma_unaligned_sg() to be > inlined into its callsite. On x86_64: > > add/remove: 0/1 grow/shrink: 1/0 up/down: 466/-533 (-67) > function old new delta > iser_reg_rdma_mem 1518 1984 +466 > iser_start_rdma_unaligned_sg 533 - -533 > > Signed-off-by: Roland Dreier > --- > Erez, does this look OK to merge for 2.6.23? > Yes, thanks. Erez From xma at us.ibm.com Mon Jul 16 22:57:40 2007 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 16 Jul 2007 22:57:40 -0700 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: Message-ID: Hello Roland, >I still haven't seen much code using the feature or >even any anecdotal information about the performance impact. The multiple-link performance has been significantly improved according to the prototype IPoIB-UD mode test for the eHCA driver, especially for two links on the same adapter. I haven't tried mthca (PCI-X and PCI-E) yet. Thanks Shirley From mst at dev.mellanox.co.il Mon Jul 16 23:21:59 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 17 Jul 2007 09:21:59 +0300 Subject: [ofa-general] Re: [PATCH] IB/mad: fix duplicated kernel thread name In-Reply-To: References: Message-ID: <20070717062159.GA2177@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] IB/mad: fix duplicated kernel thread name > > > The mad module creates thread per active port where the thread name is > > derived from the port name. This cause different threads to have same > > names when there are multiple devices. Fix that by using both the device > > and the port numbers to derive the name. > > What problem does the duplicate name cause in the first place? I don't really see any serious problem this would cause. However, creating a thread per port does seem somewhat arbitrary, and would mean wasting (a small amount of) resources apparently for no gain if there are lots of HCA ports in a box. Further, renicing the mad thread to work around bug 229 is easier if there's a fixed number of threads: as it is, the threads come and go on hotplug, so the renicing must be repeated on each hotplug event. -- MST From FENKES at de.ibm.com Mon Jul 16 23:29:54 2007 From: FENKES at de.ibm.com (Joachim Fenkes) Date: Tue, 17 Jul 2007 08:29:54 +0200 Subject: [ofa-general] Re: [PATCH 10/10] IB/ehca: Support large page MRs In-Reply-To: Message-ID: Roland Dreier wrote on 17.07.2007 05:50:13: > > > Why the module parameter? Is there any reason a user would want to > > > turn this off? Or conversely, why is it off by default? > > > > We're pretty confident this new feature works, but as with all new and > > possibly experimental features, there are chances it might explode your > > machine when activated. So, like with the scaling code, we want the user > > to make the conscious decision of using this code instead of activating it > > by default. > > OK, I guess. So can we expect to, say, change the default to turning > it on for 2.6.24 and remove the option entirely (so it's always on) in > 2.6.25? Deal.
Joachim From jackm at dev.mellanox.co.il Mon Jul 16 23:55:10 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 17 Jul 2007 09:55:10 +0300 Subject: [ofa-general] Re: [PATCH 1 of 2] mlx4: implement query-qp In-Reply-To: References: <200706211227.47794.jackm@dev.mellanox.co.il> <200707151028.24013.jackm@dev.mellanox.co.il> Message-ID: <200707170955.10933.jackm@dev.mellanox.co.il> On Monday 16 July 2007 19:37, Roland Dreier wrote: > this was a patch to a patch, which is not very useful (especially > since the original patch is upstream in Linus's tree). > > anyway I applied this as two patches... > Thanks for applying it. I sent it to you as a patch to a patch because I thought the change would be much more obvious to you this way. Would you rather next time that I just send you an updated version of the original patch, or should I send the fix as a patch to the code after the original patch has been applied? - Jack From mst at dev.mellanox.co.il Tue Jul 17 02:24:27 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jul 2007 12:24:27 +0300 Subject: [ofa-general] cxgb3: ofed patches vs upstream?
Message-ID: <20070717092427.GA16698@mellanox.co.il> Steve, since the ofed 1.2 release, the following patches were applied, at your request, to cxgb3 on the ofed_1_2 support branch: git log --pretty=short vofed-1.2.. drivers/infiniband/hw/cxgb3/ commit 1b7184a542c709b2c54a9cd4cab06953481991fd Author: Steve Wise Don't allow interrupts while obtaining the ctrl-qp mutex. commit 7aaef231e8ba8c6f7b021f495f9769afc4cf46ff Author: Steve Wise iw_cxgb3: Don't abort after failures sending the mpa reply. commit 12ed1ec920e4cc3d2c1e32afa49f1dc611d8f1f1 Author: Steve Wise iw_cxgb3: Don't post TID_RELEASE message. commit 1c3d43ff4f544fa202f4fe53962130a2a21e1a58 Author: Steve Wise iw_cxgb3: ctrl-qp init/clear shouldn't set the gen bit. Could you please comment on where these patches are wrt upstream submission? Are these patches already in 2.6.22, or are they queued for 2.6.23? If neither, could you post the missing patches on list please? -- MST From vlad at lists.openfabrics.org Tue Jul 17 02:45:36 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 17 Jul 2007 02:45:36 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070717-0200 daily build status Message-ID: <20070717094536.75649E6085C@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22-rc7 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on powerpc with
linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From ogerlitz at voltaire.com Tue Jul 17 03:05:07 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 17 Jul 2007 13:05:07 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469B9B5A.2040707@ichips.intel.com> References: <20070715094145.GA16231@mellanox.co.il> 
<469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com> Message-ID: <469C9453.80905@voltaire.com> Sean Hefty wrote: >> Sorry but "improve data locality" is not enough information for me to >> understand why the IB CM --neeed-- to spawn n kernel threads on my >> n-core system, after all its slow path and the data does not moves on >> QP1, what's the story here? and if it needs thread-per-cpu, why not >> use the system threads/softirqs as does the TCP/IP stack connection >> mgmt code? > > IMO, if we're going to have multiple cores, then we should create > multiple threads to use them. This becomes more important as the number > of cores increases. (The overhead of a non-running thread can't be that > much.) Sean, Can you explain why would not the IB CM use the thread context provided by the mad layer? Second, if the CM needs a different context why not use the system threads? I understood from Michael's reply that the CM code relies on some thread/queue flushing at the time of CM ID destruction, is it an implementation issue that can change? if not, can't one dedicated thread do the job? Or. From jackm at dev.mellanox.co.il Tue Jul 17 03:11:43 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 17 Jul 2007 13:11:43 +0300 Subject: [ofa-general] [PATCH] mlx4: increase max outstanding rdma reads per qp Message-ID: <200707171311.43680.jackm@dev.mellanox.co.il> Change max outstanding rdma reads per QP from 4 to 16. This enables an improvement in latency for rdma-read applications. Pointed out by Dotan Barak and Sagi Rotem. 
Signed-off-by: Jack Morgenstein Index: connectx_kernel/drivers/net/mlx4/main.c =================================================================== --- connectx_kernel.orig/drivers/net/mlx4/main.c 2007-07-11 11:55:52.000000000 +0300 +++ connectx_kernel/drivers/net/mlx4/main.c 2007-07-11 11:59:54.000000000 +0300 @@ -78,7 +78,7 @@ static struct mlx4_profile default_profile = { .num_qp = 1 << 16, .num_srq = 1 << 16, - .rdmarc_per_qp = 4, + .rdmarc_per_qp = 1 << 4, .num_cq = 1 << 16, .num_mcg = 1 << 13, .num_mpt = 1 << 17, From jsquyres at cisco.com Tue Jul 17 04:12:01 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 17 Jul 2007 07:12:01 -0400 Subject: [ofa-general] Re: [ewg] Re: RFC OFED-1.3 installation In-Reply-To: <1184642968.5165.414.camel@firewall.xsintricity.com> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> Message-ID: <6435E9B8-DB00-4249-A616-2C4ABADBE6AD@cisco.com> All: You may have skipped this mail because of its length. Please read it; Doug lists specific guidelines in here that SRPMs will need to adhere to for RH to include OFED v1.3 (many of the OFED RPMs -- including the OMPI RPM -- do not adhere to these guidelines; we all have work to do). On Jul 16, 2007, at 11:29 PM, Doug Ledford wrote: > On Mon, 2007-07-16 at 15:24 +0300, Vladimir Sokolovsky wrote: > > [ snip ] > > Most of this proposal was just about splitting the packages up. > That's > good, but it doesn't warrant much comment. It's not the existence of > different packages that will draw my comments, it's the *content* > of the > packages that will, the actual spec files themselves. > > However, there is this one tidbit: > >> Source RPMs will be created for each userspace package in the >> following way: >> >> git clone ... >> autogen.sh >> configure --disable-libcheck >> make dist > > This is so fundamentally broken as to be brain dead. Yet it's what > has > been done since OFED 1.0. 
Can you imagine how screwed the open source > world would be if one day Linus released kernel-2.6.24.tar.gz on > kernel.org, only to silently update the file kernel-2.6.24.tar.gz to > something else the next? This is the *ONLY* open source software > group > I know of that creates new tar.gz files any time they make a change > but > keeps the version of the file the same. > > Let me copy and paste an email conversation I had with Or that > highlights why this is broken: > > ------- Begin cut-n-paste > On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote: >> [sorry for breaking the thread, I am working from home now and unable > to use normal mailer.] >> >> Does this means that the OFED 1.3 effort is useless from your >> point of > view? > > Yes and no. The effort to get a complete set of working libraries and > stacks pulled together and debugged is good and worthwhile. The > packaging has been done all wrong though. Because the ewg has > concentrated on supporting local compile installations, they don't > really have the faintest clue about several important issues that crop > up specifically when you are attempting to support binary distribution > instead of source distribution. That in turn has led them to make > decisions that have proved to be very counterproductive to my end goal > of a supportable environment for my customers. > > Let me give an example. In OFED 1.0, you shipped dapl version > 1.2. In > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change > (not a > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > have two different versions of dapl, but with exactly the same version > number. A person can't tell them apart. Furthermore, unless the > person > is compiling locally, they'll never get the OFED 1.1 dapl installed > because RPM/up2date will see that they already have the current > version > even when they have the OFED 1.0 version. 
So, in our RPMs, I updated > the OFED 1.1 dapl version we built to be 1.2.1. Without doing > that, the > binary upgrade process that we use would have never worked. Then, in > OFED 1.2, you guys update the dapl code again, and this time you > decide > to use...wait for it...that's right, 1.2.1. Great. Now we have a > conflict between your 1.2.1 and our 1.2.1. How do people know > which is > which? They don't. And, of course, in order for binary upgrades to > work, I once again had to bump our number. Our OFED 1.2 package now > builds dapl 1.2.1.1 just because I had to do *something* in order to > make upgrades work. > > The only reason that the OFED distribution has *ever* reliably > installed > the rpms you wanted installed is because you compile things locally > and > then *force* the upgrade of rpms over the top of older rpms that have > the same version number. And even then, you yourselves can't tell the > difference between a customer with the OFED 1.0 or OFED 1.1 dapl > installed by checking the RPM version, you just have to go off what > the > end user *tells* you he installed and hope he's right. > > So, quite simply, the EWG has *chosen* to support source distribution > and local compiles. That's fine really. But they've also chosen to > bury their head in the sand about basic, non-flexible rules associated > with any successful binary distribution and update process, even when > I've brought those rules up multiple times. > > It should be no wonder then why I get all up in arms about packaging > issues. Everything I give my customers has to automatically and > correctly install, upgrade, downgrade, delete, verify, etc using > RPM/up2date/yum. It can't require any --force options. And I don't > have a choice about that. > > And I have to *know* what software my customer is running in order to > support them. Because you guys have done things the way you have, I > can't know that. 
I might be able to know if I could also guarantee > they > didn't download and locally compile your packages, but if they did, > then > the same version number of RPM can mean two different things entirely > depending on whether it's your RPM or mine. > > ------- End cut-n-paste > > I posted links to a wealth of valuable information on the topic of > making a proper spec file and creating *good* packages during my > talk at > Sonoma. I gather you haven't read those or you never would have > suggested the above for creating the RPMs. > > I've already reached the decision that the next release of the RDMA > stack that Red Hat releases will adhere to much stricter guidelines > than > in the past. From now on, all packages I build based upon software > from > the OpenFabrics Alliance will adhere to these guidelines: > > 1. All tar.gz files will be imported once and exactly once into > our SCM > repo. At that point they will be MD5 summed and the MD5 sum will be > checked on all subsequent builds to verify the tar.gz file has not > changed. > > 2. All fixes to released tar.gz files will be in the form of patches > applied in the spec file during the %prep phase, or they will > require a > new tar.gz file. Under no circumstances will an existing tar.gz > file be > updated to include fixes. > > 3. All packages will have a version and release number appropriate to > the tar.gz release of the software and the build of the package. > > 4. All tar.gz files *must* have a publicly available URL from which > they can be downloaded. > > 5. All tar.gz files that have a home site other than openfabrics.org > will be taken from their home site. Eg. openmpi will come from the > openmpi site. No special openfabrics versions of already existing > packages will be considered. > > 6. All source repos that utilize the autoconf configure capability > will > have configure run at build time. Any configure output produced prior > to build will not be considered usable. 
On the other hand, we expect > that autogen.sh *will* be run prior to making the tarball. > > If the software does not meet the above minimal guidelines, then it > won't be considered for inclusion in our product. > > In addition to these rules, if you want me to consider using your spec > files to build the packages, then these additional rules apply: > > 7. All spec changes will be accompanied by a changelog entry. > > 8. The spec file will be clean and readable. Spec files cluttered up > with multitudes of options that have no impact on a standardized > distribution will not be included. > > 9. All spec files must be Linux File Hierarchy Standards compliant. > > 10. All spec files must pass rpmlint tests. > > 11. All code must be built using the %build section of the spec file. > The %install section is for installation *ONLY*. > > 12. Spec files must build debug packages. > > 13. Spec files must leave the default build scripts enabled. > > 14. Spec files must list appropriate BuildRequires entries. > > 15. Spec files must not list Provides entries unless the build scripts > are unable to determine that they provide a particular item, or in > cases > like the MPI packages where they can specify that they provide the > generic mpi facility in addition to the specific mpi library provides > that the build system will pick up automatically. > > There's probably more things to list, but I really don't feel like > repeating what amounts to our standard build requirements when they > are > already all written out in the guides I talked about in my Sonoma > talk. > > Hopefully, all of this will help you get a clearer picture of what I > expect the EWG's work on cleaning up their packaging to cover. 
> > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jeff Squyres Cisco Systems From swise at opengridcomputing.com Tue Jul 17 06:20:38 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 17 Jul 2007 08:20:38 -0500 Subject: [ofa-general] Re: cxgb3: ofed patches vs upstream? In-Reply-To: <20070717092427.GA16698@mellanox.co.il> References: <20070717092427.GA16698@mellanox.co.il> Message-ID: <469CC226.4000703@opengridcomputing.com> All of these are upstream for 2.6.23. Michael S.
Tsirkin wrote: > Steve, > since ofed 1.2 release, the following patches where applied, > at your request, to cxgb3 on ofed_1_2 support branch: > > git log --pretty=short vofed-1.2.. drivers/infiniband/hw/cxgb3/ > commit 1b7184a542c709b2c54a9cd4cab06953481991fd > Author: Steve Wise > > Don't allow interrupts while obtaining the ctrl-qp mutex. > > commit 7aaef231e8ba8c6f7b021f495f9769afc4cf46ff > Author: Steve Wise > > iw_cxgb3: Don't abort after failures sending the mpa reply. > > commit 12ed1ec920e4cc3d2c1e32afa49f1dc611d8f1f1 > Author: Steve Wise > > iw_cxgb3: Don't post TID_RELEASE message. > > commit 1c3d43ff4f544fa202f4fe53962130a2a21e1a58 > Author: Steve Wise > > iw_cxgb3: ctrl-qp init/clear shouldn't set the gen bit. > > Could you please comment on where are these patches wrt upstream > submission? > Are these patches already in 2.6.22, or are they queued for 2.6.23? > If neither, could you post the missing patches on list please? > From vlad at mellanox.co.il Tue Jul 17 06:28:51 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 17 Jul 2007 16:28:51 +0300 Subject: [ofa-general] RE: problem with daily builds References: <469BFA10.7070209@opengridcomputing.com> <20070717115701.GI16698@mellanox.co.il> <469CC308.9050101@opengridcomputing.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com> Hi Steve, Some ofa_1_2_c_kernel builds were mistakenly placed under ofa_1_2_kernel build tree. I am fixing this right now... ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release. I can renew this on daily or weekly basis. Regards, Vladimir > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Tuesday, July 17, 2007 4:24 PM > To: Michael S. Tsirkin > Cc: Vladimir Sokolovsky > Subject: Re: problem with daily builds > > Michael S. 
Tsirkin wrote: > >> Quoting Steve Wise : > >> Subject: problem with daily builds > >> > >> Vlad, > >> > >> It appears the daily ofa_1_2_kernel builds are not building the > latest > >> code from the ofed_1_2 git tree. For example, I pulled down the > >> ofa_1_2_kernel-20070716-0200 tree and the file > >> drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 > git > >> repository. > >> > >> Here's the BUILD_ID from that tree. Note it's the wrong git > repository... > >> > >> # cat BUILD_ID > >> Git: > >> git://git.openfabrics.org/ofed_1_2/linux-2.6.git > >> commit 556f7870719506619990a58fddb3fd9eab4b9990 > > > > I think this is not the ofed_1_2 branch, but rather the current 1.2c, > which took > > the chelsio code from 2.6.22. I did my best to verify that > everything is up to > > date there, but of course it's human to err. Given that 2.6.22 went > out after > > ofed code freeze - how come version.h there is older? > > > > Why is the ofed-1.2 daily build using the 1.2c base? That means we're > not building the ofed-1.2 post ga code for anybody to use. > > > Steve, I really think if upstream chelsio code is not up to date, > > you should post patches to update it and we'll put it in 1.2c. > > > > A set of changes including firmware version bumps didn't make 2.6.22. > They are in 2.6.23, however. So the chelsio drivers are up to date in > ofed-1.2 and 2.6.23. 2.6.22 is missing some changes... > > I suggest you keep the ofed-1.2 chelsio code instead of the 2.6.22 code > for 1.2c. Is that possible? > > Steve. 
From swise at opengridcomputing.com Tue Jul 17 06:36:57 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 17 Jul 2007 08:36:57 -0500 Subject: [ofa-general] Re: problem with daily builds In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com> References: <469BFA10.7070209@opengridcomputing.com> <20070717115701.GI16698@mellanox.co.il> <469CC308.9050101@opengridcomputing.com> <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com> Message-ID: <469CC5F9.8080800@opengridcomputing.com> Vladimir Sokolovsky wrote: > Hi Steve, > Some ofa_1_2_c_kernel builds were mistakenly placed under ofa_1_2_kernel > build tree. > I am fixing this right now... > > ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release. > I can renew this on daily or weekly basis. > > What I'm looking for is a current top-of-tree ofed-1.2 or ofa_1_2_kernel build that works so I can point customers at that kit since it has a slew of chelsio fixes in it... Steve. > Regards, > Vladimir > > >> -----Original Message----- >> From: Steve Wise [mailto:swise at opengridcomputing.com] >> Sent: Tuesday, July 17, 2007 4:24 PM >> To: Michael S. Tsirkin >> Cc: Vladimir Sokolovsky >> Subject: Re: problem with daily builds >> >> Michael S. Tsirkin wrote: >>>> Quoting Steve Wise : >>>> Subject: problem with daily builds >>>> >>>> Vlad, >>>> >>>> It appears the daily ofa_1_2_kernel builds are not building the >> latest >>>> code from the ofed_1_2 git tree. For example, I pulled down the >>>> ofa_1_2_kernel-20070716-0200 tree and the file >>>> drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 >> git >>>> repository. >>>> >>>> Here's the BUILD_ID from that tree. Note it's the wrong git >> repository... >>>> # cat BUILD_ID >>>> Git: >>>> git://git.openfabrics.org/ofed_1_2/linux-2.6.git >>>> commit 556f7870719506619990a58fddb3fd9eab4b9990 >>> I think this is not the ofed_1_2 branch, but rather the current > 1.2c, >> which took >>> the chelsio code from 2.6.22. 
I did my best to verify that >> everything is up to >>> date there, but of course it's human to err. Given that 2.6.22 went >> out after >>> ofed code freeze - how come version.h there is older? >>> >> Why is the ofed-1.2 daily build using the 1.2c base? That means we're >> not building the ofed-1.2 post ga code for anybody to use. >> >>> Steve, I really think if upstream chelsio code is not up to date, >>> you should post patches to update it and we'll put it in 1.2c. >>> >> A set of changes including firmware version bumps didn't make 2.6.22. >> They are in 2.6.23, however. So the chelsio drivers are up to date in >> ofed-1.2 and 2.6.23. 2.6.22 is missing some changes... >> >> I suggest you keep the ofed-1.2 chelsio code instead of the 2.6.22 > code >> for 1.2c. Is that possible? >> >> Steve. From vlad at mellanox.co.il Tue Jul 17 06:40:33 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 17 Jul 2007 16:40:33 +0300 Subject: [ofa-general] RE: problem with daily builds References: <469BFA10.7070209@opengridcomputing.com> <20070717115701.GI16698@mellanox.co.il> <469CC308.9050101@opengridcomputing.com> <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com> <469CC5F9.8080800@opengridcomputing.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com> http://www.openfabrics.org/builds/ofa_1_2_kernel/ofa_1_2_kernel-20070717 -0454.tgz Regards, Vladimir > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Tuesday, July 17, 2007 4:37 PM > To: Vladimir Sokolovsky > Cc: Michael S. Tsirkin; OpenFabrics General > Subject: Re: problem with daily builds > > Vladimir Sokolovsky wrote: > > Hi Steve, > > Some ofa_1_2_c_kernel builds were mistakenly placed under > ofa_1_2_kernel > > build tree. > > I am fixing this right now... > > > > ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release. > > I can renew this on daily or weekly basis. 
> > > > > > > What I'm looking for is a current top-of-tree ofed-1.2 or > ofa_1_2_kernel > build that works so I can point customers at that kit since it has a > slew of chelsio fixes in it... > > Steve. > > > > Regards, > > Vladimir > > > > > >> -----Original Message----- > >> From: Steve Wise [mailto:swise at opengridcomputing.com] > >> Sent: Tuesday, July 17, 2007 4:24 PM > >> To: Michael S. Tsirkin > >> Cc: Vladimir Sokolovsky > >> Subject: Re: problem with daily builds > >> > >> Michael S. Tsirkin wrote: > >>>> Quoting Steve Wise : > >>>> Subject: problem with daily builds > >>>> > >>>> Vlad, > >>>> > >>>> It appears the daily ofa_1_2_kernel builds are not building the > >> latest > >>>> code from the ofed_1_2 git tree. For example, I pulled down the > >>>> ofa_1_2_kernel-20070716-0200 tree and the file > >>>> drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 > >> git > >>>> repository. > >>>> > >>>> Here's the BUILD_ID from that tree. Note it's the wrong git > >> repository... > >>>> # cat BUILD_ID > >>>> Git: > >>>> git://git.openfabrics.org/ofed_1_2/linux-2.6.git > >>>> commit 556f7870719506619990a58fddb3fd9eab4b9990 > >>> I think this is not the ofed_1_2 branch, but rather the current > > 1.2c, > >> which took > >>> the chelsio code from 2.6.22. I did my best to verify that > >> everything is up to > >>> date there, but of course it's human to err. Given that 2.6.22 > went > >> out after > >>> ofed code freeze - how come version.h there is older? > >>> > >> Why is the ofed-1.2 daily build using the 1.2c base? That means > we're > >> not building the ofed-1.2 post ga code for anybody to use. > >> > >>> Steve, I really think if upstream chelsio code is not up to date, > >>> you should post patches to update it and we'll put it in 1.2c. > >>> > >> A set of changes including firmware version bumps didn't make > 2.6.22. > >> They are in 2.6.23, however. So the chelsio drivers are up to date > in > >> ofed-1.2 and 2.6.23. 
2.6.22 is missing some changes... > >> > >> I suggest you keep the ofed-1.2 chelsio code instead of the 2.6.22 > > code > >> for 1.2c. Is that possible? > >> > >> Steve. From swise at opengridcomputing.com Tue Jul 17 06:53:07 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 17 Jul 2007 08:53:07 -0500 Subject: [ofa-general] Re: problem with daily builds In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com> References: <469BFA10.7070209@opengridcomputing.com> <20070717115701.GI16698@mellanox.co.il> <469CC308.9050101@opengridcomputing.com> <6C2C79E72C305246B504CBA17B5500C901E73838@mtlexch01.mtl.com> <469CC5F9.8080800@opengridcomputing.com> <6C2C79E72C305246B504CBA17B5500C901E73849@mtlexch01.mtl.com> Message-ID: <469CC9C3.1000502@opengridcomputing.com> Vladimir Sokolovsky wrote: > http://www.openfabrics.org/builds/ofa_1_2_kernel/ofa_1_2_kernel-20070717 > -0454.tgz > Thanks Vlad! I'll try this out. > > Regards, > Vladimir > > >> -----Original Message----- >> From: Steve Wise [mailto:swise at opengridcomputing.com] >> Sent: Tuesday, July 17, 2007 4:37 PM >> To: Vladimir Sokolovsky >> Cc: Michael S. Tsirkin; OpenFabrics General >> Subject: Re: problem with daily builds >> >> Vladimir Sokolovsky wrote: >>> Hi Steve, >>> Some ofa_1_2_c_kernel builds were mistakenly placed under >> ofa_1_2_kernel >>> build tree. >>> I am fixing this right now... >>> >>> ofa_1_2 _kernel daily builds were stopped after OFED-1.2 release. >>> I can renew this on daily or weekly basis. >>> >>> >> >> What I'm looking for is a current top-of-tree ofed-1.2 or >> ofa_1_2_kernel >> build that works so I can point customers at that kit since it has a >> slew of chelsio fixes in it... >> >> Steve. >> >> >>> Regards, >>> Vladimir >>> >>> >>>> -----Original Message----- >>>> From: Steve Wise [mailto:swise at opengridcomputing.com] >>>> Sent: Tuesday, July 17, 2007 4:24 PM >>>> To: Michael S. 
Tsirkin >>>> Cc: Vladimir Sokolovsky >>>> Subject: Re: problem with daily builds >>>> >>>> Michael S. Tsirkin wrote: >>>>>> Quoting Steve Wise : >>>>>> Subject: problem with daily builds >>>>>> >>>>>> Vlad, >>>>>> >>>>>> It appears the daily ofa_1_2_kernel builds are not building the >>>> latest >>>>>> code from the ofed_1_2 git tree. For example, I pulled down the >>>>>> ofa_1_2_kernel-20070716-0200 tree and the file >>>>>> drivers/net/cxgb3/version.h is older than what is in the ofed_1_2 >>>> git >>>>>> repository. >>>>>> >>>>>> Here's the BUILD_ID from that tree. Note it's the wrong git >>>> repository... >>>>>> # cat BUILD_ID >>>>>> Git: >>>>>> git://git.openfabrics.org/ofed_1_2/linux-2.6.git >>>>>> commit 556f7870719506619990a58fddb3fd9eab4b9990 >>>>> I think this is not the ofed_1_2 branch, but rather the current >>> 1.2c, >>>> which took >>>>> the chelsio code from 2.6.22. I did my best to verify that >>>> everything is up to >>>>> date there, but of course it's human to err. Given that 2.6.22 >> went >>>> out after >>>>> ofed code freeze - how come version.h there is older? >>>>> >>>> Why is the ofed-1.2 daily build using the 1.2c base? That means >> we're >>>> not building the ofed-1.2 post ga code for anybody to use. >>>> >>>>> Steve, I really think if upstream chelsio code is not up to date, >>>>> you should post patches to update it and we'll put it in 1.2c. >>>>> >>>> A set of changes including firmware version bumps didn't make >> 2.6.22. >>>> They are in 2.6.23, however. So the chelsio drivers are up to date >> in >>>> ofed-1.2 and 2.6.23. 2.6.22 is missing some changes... >>>> >>>> I suggest you keep the ofed-1.2 chelsio code instead of the 2.6.22 >>> code >>>> for 1.2c. Is that possible? >>>> >>>> Steve. 
>

From dotanb at dev.mellanox.co.il Tue Jul 17 07:58:57 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 17 Jul 2007 17:58:57 +0300
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in QP access flags
Message-ID: <200707171758.57442.dotanb@dev.mellanox.co.il>

Remove local write permission enable in QP access flags
(this attribute is being used only for remote permissions).

Signed-off-by: Dotan Barak

---

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 23af7a0..9ffb998 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
 		break;
 	case RDMA_TRANSPORT_IWARP:
 		if (!id_priv->cm_id.iw) {
-			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+			qp_attr->qp_access_flags = 0;
 			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
 		} else
 			ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,

From mst at dev.mellanox.co.il Tue Jul 17 08:25:46 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 18:25:46 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184642968.5165.414.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com>
Message-ID: <20070717152546.GA6863@mellanox.co.il>

> Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In
> OFED 1.1, you also shipped dapl version 1.2. However, code inspection
> shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a
> lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you
> have two different versions of dapl, but with exactly the same version
> number. A person can't tell them apart.

Yes, this sure looks like a problem. I think that versioning needs to be addressed
at the package level, not at OFED level though. Right?
-- MST
From vlad at mellanox.co.il Tue Jul 17 08:36:23 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 17 Jul 2007 18:36:23 +0300 Subject: [ofa-general] RE: RFC OFED-1.3 installation References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com> [ snip ] > Let me copy and paste an email conversation I had with Or that > highlights why this is broken: > > ------- Begin cut-n-paste > On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote: > > [sorry for breaking the thread, I am working from home now and unable > to use normal mailer.] > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not > a lot, but anything is enough). I am not supposed to maintain correct versioning for every package in OFED. That should be done by the maintainer of each package. > The only reason that the OFED distribution has *ever* reliably > installed the rpms you wanted installed is because you compile things > locally and then *force* the upgrade of rpms over the top of older rpms > that have the same version number. And even then, you yourselves can't > tell the difference between a customer with the OFED 1.0 or OFED 1.1 > dapl installed by checking the RPM version, you just have to go off > what the end user *tells* you he installed and hope he's right. > OFED does not force an upgrade; it simply removes the previous version and then installs the new one. This is why package versioning does not affect OFED installation. I agree that it is different for Linux distributions and should be fixed for OFED-1.3, but it should be the responsibility of the package maintainer. So, all RPM spec files should be fixed for OFED-1.3 and properly maintained.
We should discuss the kernel-ib package structure and its spec file.

> And I have to *know* what software my customer is running in order to
> support them. Because you guys have done things the way you have, I
> can't know that. I might be able to know if I could also guarantee
> they didn't download and locally compile your packages, but if they
> did, then the same version number of RPM can mean two different things
> entirely depending on whether it's your RPM or mine.
>

You can easily check whether there is an OFED installation by running 'ofed_info'.

> I posted links to a wealth of valuable information on the topic of
> making a proper spec file and creating *good* packages during my talk
> at Sonoma. I gather you haven't read those or you never would have
> suggested the above for creating the RPMs.
>

I just looked into your presentation from Sonoma.
There you provide an example of a management package and your make.dist
script for creating daily builds and releases.
I have some questions about this script:

...
VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
...
    DATE=`date +%Y%m%d`
    if [ -f $TMPDIR/$target.release ]; then
        RELEASE=`cat $TMPDIR/$target.release`
        RELEASE=`expr $RELEASE + 1`
    else
        RELEASE=1
    fi
    echo $RELEASE > $TMPDIR/$target.release
    RELEASE=0.${RELEASE}.${DATE}git
    TARBALL=$target-git.tgz
fi
...
cp -a $target $target-$VERSION
sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target-$VERSION/$target.spec
cd $target-$VERSION
./autogen.sh
cd ..
echo "Creating $TMPDIR/$TARBALL"
tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION

I thought that the standard way to get a tar.gz file is to use autotools
(3 commands), as I wrote before: autogen.sh, configure, make dist.
Can you explain why your way is better?

Do you have a proposal for daily builds?
We need OFED daily builds for verification. We can't wait for RedHat updates to get the updated OFED packages. What OFED-1.3 structure do you propose? Should it consist of source RPMs or tgz files? What features install script should support? Regards, Vladimir
From sashak at voltaire.com Tue Jul 17 09:04:10 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 17 Jul 2007 19:04:10 +0300 Subject: [ofa-general] RFC OFED-1.3 installation In-Reply-To: References: Message-ID: <1184688250.10172.8.camel@localhost> Hi, On Mon, 2007-07-16 at 12:32 -0700, Shirley Ma wrote: > Is ib-utils depends on opensm-libs? If so I would suggest to change > opensm-libs as libsmutils. Otherwise ib-utils won't work without > installing opensm package. Does this make sense? Not whole opensm, but opensm-libs. Why the name ("opensm-libs" or "libsmutils") is matter? Sasha
From dledford at redhat.com Tue Jul 17 09:20:49 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 16:20:49 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717152546.GA6863@mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> Message-ID: <1184689249.5165.419.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > Let me give an example.
In OFED 1.0, you shipped dapl version 1.2. In > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > have two different versions of dapl, but with exactly the same version > > number. A person can't tell them apart. > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > at the package level, not at OFED level though. Right? Versioning needs to be addressed at both levels. You need versions of software to start with, but then you still need releases of packages to differentiate between different builds of a specific version of software. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Tue Jul 17 09:21:35 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 09:21:35 -0700 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469C9453.80905@voltaire.com> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com> <469C9453.80905@voltaire.com> Message-ID: <469CEC8F.4050106@ichips.intel.com> > Can you explain why would not the IB CM use the thread context provided > by the mad layer? You can end up with deadlock conditions when destroying cm_id's that have outstanding MADs. It also increases MAD processing time, which can increase dropping MADs. > Second, if the CM needs a different context why not use the system > threads? 
I understood from Michael's reply that the CM code relies on > some thread/queue flushing at the time of CM ID destruction, is it an > implementation issue that can change? if not, can't one dedicated thread > do the job? The timing and use of the system threads is unknown. When the ib_mad module was created, it was suggested that the system threads not be used. (I think it was Roland who recommended this.) We can change to system threads, but it does open the possibility of complicated deadlock conditions if other modules use the system threads as well. The CM could change to using a single dedicated thread, but if there are multiple processors available, why restrict processing to only being able to use one of them? - Sean From mst at dev.mellanox.co.il Tue Jul 17 09:27:31 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jul 2007 19:27:31 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184689249.5165.419.camel@firewall.xsintricity.com> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> Message-ID: <20070717162731.GA7479@mellanox.co.il> > Quoting Doug Ledford : > Subject: Re: RFC OFED-1.3 installation > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > > have two different versions of dapl, but with exactly the same version > > > number. A person can't tell them apart. > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > > at the package level, not at OFED level though. Right? 
> > Versioning needs to be addressed at both levels. You need versions of > software to start with, but then you still need releases of packages to > differentiate between different builds of a specific version of > software. Why would we want to have different builds of a specific version of software for a specific OS? Could you give an example pls? -- MST From dledford at redhat.com Tue Jul 17 09:39:40 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 16:39:40 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717162731.GA7479@mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> Message-ID: <1184690380.5165.430.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote: > > Quoting Doug Ledford : > > Subject: Re: RFC OFED-1.3 installation > > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > > > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > > > have two different versions of dapl, but with exactly the same version > > > > number. A person can't tell them apart. > > > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > > > at the package level, not at OFED level though. Right? > > > > Versioning needs to be addressed at both levels. You need versions of > > software to start with, but then you still need releases of packages to > > differentiate between different builds of a specific version of > > software. 
> > Why would we want to have different builds of a specific version of software > for a specific OS? Could you give an example pls? It's how you integrate needed patches immediately while waiting on the next release of the software. For example, when mdadm-2.6.2.tar.gz was released, I built an mdadm-2.6.2-1 package (the 1 being the release number). I then went to work on some mdadm bug reports I had, and I wrote a number of patches that squashed about 10 bug reports. During that time, I had three intervening builds as I integrated those patches into the spec file and applied them to the 2.6.2 base source code during the build process. Those builds were 2.6.2-{2,3,4}. I also forwarded those patches upstream, they've been integrated into the upstream code base, but a 2.6.3 has not yet been released, the upstream maintainer is waiting until everything he's putting into it settles down. When 2.6.3 is released, then I'll integrate 2.6.3 into our source SCM system, drop all of the patches that have been integrated into the base 2.6.3 source code, and build mdadm-2.6.3-1. The point of all this being that most software maintainers don't release new versions of their software on a daily or even weekly basis, so when you are busy fixing up bugs in the software between releases, the patches go in the spec file and you bump the release number so that each subsequent build has a unique number that can positively identify both the base source code used and all patches applied to that source code. You also bump the release number of the package any time you make changes to the spec file and rebuild. So, for instance, if the only change I made to a package was to change the %doc macro in the %files section, I would still bump the release number and rebuild so that the new rpm name-version-release combination would uniquely identify the change. 
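The release-bump workflow described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the helper names `nvr` and `bump_release` are hypothetical and are not part of rpm or of any OFED script, and the mdadm numbers are the ones given in the message.

```shell
#!/bin/sh
# Illustrative helpers only (hypothetical names, not rpm tooling).

# nvr NAME VERSION RELEASE -> prints NAME-VERSION-RELEASE
nvr() {
    printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# bump_release RELEASE -> prints RELEASE + 1.
# Applied on every rebuild of the same upstream version: a new patch,
# or even a spec-file-only change.
bump_release() {
    echo $(( $1 + 1 ))
}

name=mdadm
version=2.6.2
release=1                                  # first build of a new tarball
nvr "$name" "$version" "$release"

for change in patch-a patch-b patch-c; do  # three intervening builds
    release=$(bump_release "$release")
    nvr "$name" "$version" "$release"
done

version=2.6.3                              # upstream ships a new tarball
release=1                                  # merged patches dropped, release resets
nvr "$name" "$version" "$release"
```

Each printed string names exactly one combination of base source and applied patches, which is the property the message argues for.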
-- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at dev.mellanox.co.il Tue Jul 17 09:45:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jul 2007 19:45:00 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184690380.5165.430.camel@firewall.xsintricity.com> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> Message-ID: <20070717164500.GB7479@mellanox.co.il> > Quoting Doug Ledford : > Subject: Re: RFC OFED-1.3 installation > > On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote: > > > Quoting Doug Ledford : > > > Subject: Re: RFC OFED-1.3 installation > > > > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > > > > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > > > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > > > > have two different versions of dapl, but with exactly the same version > > > > > number. A person can't tell them apart. > > > > > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > > > > at the package level, not at OFED level though. Right? > > > > > > Versioning needs to be addressed at both levels. 
You need versions of > > > software to start with, but then you still need releases of packages to > > > differentiate between different builds of a specific version of > > > software. > > > > Why would we want to have different builds of a specific version of software > > for a specific OS? Could you give an example pls? > > It's how you integrate needed patches immediately while waiting on the > next release of the software. OK. > ... > You also bump the release number of the package any time you make > changes to the spec file and rebuild. Since we have spec files as part of package, this will be really the same as the previous case, right? -- MST
From bramesh at vt.edu Tue Jul 17 09:55:53 2007 From: bramesh at vt.edu (Bharath Ramesh) Date: Tue, 17 Jul 2007 12:55:53 -0400 Subject: [ofa-general] OpenIB development help Message-ID: <20070717165553.GA10298@vt.edu>
I am trying to migrate my research work to InfiniBand. I was searching for resources that would help me with the migration, but I couldn't find any technical documentation on how to develop applications using IB VAPI. The only documentation that closely resembles an API description is Chapter 11 of the InfiniBand Architecture release, which talks about the software transport Verbs. I tried reading infiniband/verbs.h to get some understanding of how to develop code that uses ibverbs, but there are still many aspects that I don't understand. I was wondering if the development community could help by pointing me to some resources so that I can better understand how to use ibverbs. I am more interested in using the reliable datagram transport provided by ibverbs. I am not subscribed to the mailing list, so I would really appreciate it if you could cc me in the reply. I really appreciate anyone taking time out of their busy schedule to provide me some help.
Thanks, Bharath --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From dledford at redhat.com Tue Jul 17 10:06:02 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 17:06:02 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717164500.GB7479@mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> Message-ID: <1184691962.5165.450.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 19:45 +0300, Michael S. Tsirkin wrote: > > Quoting Doug Ledford : > > Subject: Re: RFC OFED-1.3 installation > > > > On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote: > > > > Quoting Doug Ledford : > > > > Subject: Re: RFC OFED-1.3 installation > > > > > > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > > > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > > > > > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > > > > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > > > > > have two different versions of dapl, but with exactly the same version > > > > > > number. A person can't tell them apart. > > > > > > > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > > > > > at the package level, not at OFED level though. Right? > > > > > > > > Versioning needs to be addressed at both levels. You need versions of > > > > software to start with, but then you still need releases of packages to > > > > differentiate between different builds of a specific version of > > > > software. 
> > > > > > Why would we want to have different builds of a specific version of software > > > for a specific OS? Could you give an example pls? > > > > It's how you integrate needed patches immediately while waiting on the > > next release of the software. > > OK. > > > ... > > You also bump the release number of the package any time you make > > changes to the spec file and rebuild. > > Since we have spec files as part of package, this will be really > the same as the previous case, right? Depends. Right now the spec file gets its version out of the configure stuff. That version only updates when you update the version of the software itself. It doesn't increment on each change to the source repo, only on the major updates when you would release a new tarball anyway. Package versioning is, by necessity, finer grained than source repo versioning. You don't release a new dapl tarball just because you updated some comments to remove a typo. But you *do* update rpm versions on every single change, at least if you are going to distribute the rpm. Look, rpms are just like versioned tarballs. Once they go out in the wild, that particular name-version-release combination is FROZEN. It NEVER changes. Changing the code underlying that particular name-version-release is just as bad as the whole Linus scenario I described. We couldn't stay in business if we let that happen, period. That's why we have the guidelines that we do for package versioning. If you need daily builds, there is a way to make that happen that preserves the upgrade process and preserves unique name-version-release combinations. In that case, you would use the daily feature of that script I wrote. It spits out a tarball named package-git.tar.gz. The -git nomenclature clearly identifies that this is *not* a versioned tarball and it is *not* required to stay the same. You could put a date or head tag on the name as well if you want to make it unique. 
I didn't do that because then the daily git tarballs take up *way* too
much space in our SCM repo.

Then, you name the package

name-version-0.release.git${DATE}

This way, each daily build has a unique name. You increment the release
number with each daily build, and the date tag allows you to see at a
glance what date of pull the release goes with. Once the software has
reached maturity, you simply pull the final name-version.tar.gz tarball
and update the spec to be name-version-1 and it automatically compares
as newer than the daily builds and upgrades. Then subsequent rpm builds
from that official release version start incrementing the release number
like normal.

-- 
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

From jsquyres at cisco.com Tue Jul 17 10:11:01 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 17 Jul 2007 13:11:01 -0400
Subject: [ofa-general] Re: [ewg] Re: RFC OFED-1.3 installation
In-Reply-To: <1184691962.5165.450.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
Message-ID: <8BD5CE10-FB60-4694-8DF4-2BBF21FA762A@cisco.com>

On Jul 17, 2007, at 1:06 PM, Doug Ledford wrote:

> Look, rpms are just like versioned tarballs. Once they go out in the
> wild, that particular name-version-release combination is FROZEN. It
> NEVER changes.

I think that these 3 statements sum up the whole argument.
I find it hard to disagree with them. :-) -- Jeff Squyres Cisco Systems From mst at dev.mellanox.co.il Tue Jul 17 10:12:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 17 Jul 2007 20:12:50 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184691962.5165.450.camel@firewall.xsintricity.com> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> Message-ID: <20070717171250.GD7479@mellanox.co.il> > Quoting Doug Ledford : > Subject: Re: RFC OFED-1.3 installation > > On Tue, 2007-07-17 at 19:45 +0300, Michael S. Tsirkin wrote: > > > Quoting Doug Ledford : > > > Subject: Re: RFC OFED-1.3 installation > > > > > > On Tue, 2007-07-17 at 19:27 +0300, Michael S. Tsirkin wrote: > > > > > Quoting Doug Ledford : > > > > > Subject: Re: RFC OFED-1.3 installation > > > > > > > > > > On Tue, 2007-07-17 at 18:25 +0300, Michael S. Tsirkin wrote: > > > > > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. In > > > > > > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > > > > > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not a > > > > > > > lot, but anything is enough). So, between OFED 1.0 and OFED 1.1, you > > > > > > > have two different versions of dapl, but with exactly the same version > > > > > > > number. A person can't tell them apart. > > > > > > > > > > > > Yes, this sure looks like a problem. I think that versioning needs to be addressed > > > > > > at the package level, not at OFED level though. Right? > > > > > > > > > > Versioning needs to be addressed at both levels. 
You need versions of > > > > > software to start with, but then you still need releases of packages to > > > > > differentiate between different builds of a specific version of > > > > > software. > > > > > > > > Why would we want to have different builds of a specific version of software > > > > for a specific OS? Could you give an example pls? > > > > > > It's how you integrate needed patches immediately while waiting on the > > > next release of the software. > > > > OK. > > > > > ... > > > You also bump the release number of the package any time you make > > > changes to the spec file and rebuild. > > > > Since we have spec files as part of package, this will be really > > the same as the previous case, right? > > Depends. Right now the spec file gets its version out of the configure > stuff. That version only updates when you update the version of the > software itself. It doesn't increment on each change to the source > repo, only on the major updates when you would release a new tarball > anyway. Package versioning is, by necessity, finer grained than source > repo versioning. You don't release a new dapl tarball just because you > updated some comments to remove a typo. But you *do* update rpm > versions on every single change, at least if you are going to distribute > the rpm. > > Look, rpms are just like versioned tarballs. Once they go out in the > wild, that particular name-version-release combination is FROZEN. It really looks like this is a work around for when you want to apply a patch without going through maintainer. The way OFED release process works, we really don't do releases all that often, and when we do, we can coordinate with the maintainer. 
-- MST From mshefty at ichips.intel.com Tue Jul 17 10:22:03 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 10:22:03 -0700 Subject: [ofa-general] Re: [PATCH] IB/mad: fix duplicated kernel thread name In-Reply-To: <20070717062159.GA2177@mellanox.co.il> References: <20070717062159.GA2177@mellanox.co.il> Message-ID: <469CFABB.8090502@ichips.intel.com> > However, creating a thread per port does seem > somewhat arbitrary, and would mean wasting (a small amount of) resources > apparently for no gain if there are lots of HCA ports in a box. At least in theory, it should be easy to change the CM threading model to 1 thread per processor or a single thread. I don't know if systems are more likely to have more HCA ports or processors, but all of our systems here (a few hundred nodes total) have more processors. And given current IB speeds, I suspect this may be the common configuration. - Sean From dledford at redhat.com Tue Jul 17 10:36:40 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 17:36:40 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717171250.GD7479@mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> Message-ID: <1184693800.5165.480.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 20:12 +0300, Michael S. Tsirkin wrote: > > Look, rpms are just like versioned tarballs. Once they go out in the > > wild, that particular name-version-release combination is FROZEN. > > It really looks like this is a work around for when you want to apply > a patch without going through maintainer. Not really. 
When you have a customer with a sev 1 issue, you don't wait for upstream to release a new version of gcc before you get them their fix. There are also those times when you have an older, long released product that isn't up to date with upstream, for instance RHEL4 mdadm is 1.12.0 and will not be updated to the 2.6.2 version that's in Fedora. If I find a bug in that 1.12.0 version of mdadm, then I'll fix it using a patch in the spec file. If the bug also exists in upstream then it will get sent upstream to be included in the latest upstream release. But, upstream won't care about version 1.12.0, and they won't release a new version 1 mdadm just for our bugfix, so we carry those targeted fixes around as long as we have that version 1 mdadm on systems. There are other reasons to do this as well, for instance when you need to make a change as part of package integration that simply isn't needed or wanted upstream. For example, many times upstream couldn't care less about patches that implement our particular file system layout for a package. There are lots of things that we as a distributor have to care about that upstream generally does not. The spec file and patches are how we solve our customer's problems. They are what make a stable distribution, as opposed to a "bleeding edge, must always update to latest upstream version to fix any problem" system, a reality. It's the difference between RHEL and Fedora. > The way OFED release process works, we really don't > do releases all that often, and when we do, we can coordinate with > the maintainer. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From rdreier at cisco.com Tue Jul 17 10:41:49 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:41:49 -0700
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
In-Reply-To: <1183643723.25031.262.camel@mtls03> (Eli Cohen's message of
	"Thu, 05 Jul 2007 16:55:22 +0300")
References: <1183643723.25031.262.camel@mtls03>
Message-ID:

I did a quick hack to enable copybreak for UD packets up to 256 bytes
(see below). This is still missing copybreak for CM / RC mode.
However I just wanted to see how it affected performance. And the
answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe
HCA) is that it didn't make any difference in small-message latency or
throughput, at least none that I could measure with netpipe (NPtcp).

I'm not sure whether to pursue this or not.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 285c143..bf60bbb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -59,6 +59,8 @@ enum {
 	IPOIB_PACKET_SIZE = 2048,
 	IPOIB_BUF_SIZE    = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
 
+	IPOIB_COPYBREAK   = 256,
+
 	IPOIB_ENCAP_LEN   = 4,
 
 	IPOIB_CM_MTU      = 0x10000 - 0x10, /* padding to align header to 16 */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 1094488..8d6d0d0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -203,22 +203,48 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
 
-	/*
-	 * If we can't allocate a new RX buffer, dump
-	 * this packet and reuse the old buffer.
-	 */
-	if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) {
-		++priv->stats.rx_dropped;
-		goto repost;
-	}
-
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
-	ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+	if (wc->byte_len < IPOIB_COPYBREAK + IB_GRH_BYTES) {
+		struct sk_buff *new_skb;
+
+		/*
+		 * Add 12 bytes to 4-byte IPoIB header to get IP
+		 * header at a multiple of 16.
+		 */
+		new_skb = dev_alloc_skb(wc->byte_len - IB_GRH_BYTES + 12);
+		if (unlikely(!new_skb)) {
+			++priv->stats.rx_dropped;
+			goto repost;
+		}
+
+		skb_reserve(new_skb, 12);
+		skb_put(new_skb, wc->byte_len - IB_GRH_BYTES);
 
-	skb_put(skb, wc->byte_len);
-	skb_pull(skb, IB_GRH_BYTES);
+		ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
+					   DMA_FROM_DEVICE);
+		skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
+						 wc->byte_len - IB_GRH_BYTES);
+		ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
+					      DMA_FROM_DEVICE);
+
+		skb = new_skb;
+	} else {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) {
+			++priv->stats.rx_dropped;
+			goto repost;
+		}
+
+		ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE);
+
+		skb_put(skb, wc->byte_len);
+		skb_pull(skb, IB_GRH_BYTES);
+	}
 
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
 	skb_reset_mac_header(skb);

From rdreier at cisco.com Tue Jul 17 10:45:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 10:45:00 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in QP access flags
In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il> (Dotan Barak's
	message of "Tue, 17 Jul 2007 17:58:57 +0300")
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
Message-ID:

 >  	case RDMA_TRANSPORT_IWARP:
 >  		if (!id_priv->cm_id.iw) {
 > -			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
 > +			qp_attr->qp_access_flags = 0;
 >  			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;

Looks sane to me... among iWARP drivers, cxgb3 ignores
IB_ACCESS_LOCAL_WRITE in qp_access_flags and amso1100 doesn't look at
qp_access_flags at all (??).

From mst at dev.mellanox.co.il Tue Jul 17 10:45:26 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 17 Jul 2007 20:45:26 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <1184693800.5165.480.camel@firewall.xsintricity.com>
References: <469B639A.1090804@dev.mellanox.co.il>
	<1184642968.5165.414.camel@firewall.xsintricity.com>
	<20070717152546.GA6863@mellanox.co.il>
	<1184689249.5165.419.camel@firewall.xsintricity.com>
	<20070717162731.GA7479@mellanox.co.il>
	<1184690380.5165.430.camel@firewall.xsintricity.com>
	<20070717164500.GB7479@mellanox.co.il>
	<1184691962.5165.450.camel@firewall.xsintricity.com>
	<20070717171250.GD7479@mellanox.co.il>
	<1184693800.5165.480.camel@firewall.xsintricity.com>
Message-ID: <20070717174526.GE7479@mellanox.co.il>

> There are lots of things that we as a distributor have to care about
> that upstream generally does not.
The spec file and patches are how we > solve our customer's problems. They are what make a stable > distribution, as opposed to a "bleeding edge, must always update to > latest upstream version to fix any problem" system, a reality. It's the > difference between RHEL and Fedora. I think I am getting it - you want to release a patched version of some OFED library without going through openfabrics? OK. So I imagine that's when you would increment the rpm-specific version number. But I can't see why would an OFED release want to play with these. -- MST From rdreier at cisco.com Tue Jul 17 10:49:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 10:49:52 -0700 Subject: [ofa-general] Re: [PATCH 1 of 2] mlx4: implement query-qp In-Reply-To: <200707170955.10933.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 17 Jul 2007 09:55:10 +0300") References: <200706211227.47794.jackm@dev.mellanox.co.il> <200707151028.24013.jackm@dev.mellanox.co.il> <200707170955.10933.jackm@dev.mellanox.co.il> Message-ID: > Thanks for applying it. I sent it to you as a patch to a patch because > I thought the change would be much more obvious to you this way. OK, but I basically have to apply it by hand then. I guess the best I could do would be to revert the original patch but save a copy, apply your patch to the patch, and then apply that patch. Anyway it makes things much more laborious. > Would you rather next time that I just send you an updated version of the original patch, > or should I send the fix as a patch to the code after the original patch has been applied? Either way is fine, but an incremental patch is probably better (especially because it makes the changes easiest to see). And especially in this case, where the original buggy patch was already upstream, an incremental patch is definitely best. From mst at dev.mellanox.co.il Tue Jul 17 10:51:54 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 17 Jul 2007 20:51:54 +0300 Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib In-Reply-To: References: <1183643723.25031.262.camel@mtls03> Message-ID: <20070717175154.GF7479@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: socket buffer accounting with UDP/ipoib > > I did a quick hack to enable copybreak for UD packets up to 256 bytes > (see below). This is still missing copybreak for CM / RC mode. > However I just wanted to see how it affected performance. And the > answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe > HCA) is that it didn't make any difference in small-message latency or > throughput, at least none that I could measure with netpipe (NPtcp). Not any benchmark would show an improvement: what we save with copybreak is actually memory, which only has performance impact if you start reaching RCVBUF size. And the savings are only if message size is below the threshold, so you better set NDELAY to see any effect. Try running a UDP benchmark with small message size and NDELAY, and looking at number of UDP errors with netstat. -- MST From rdreier at cisco.com Tue Jul 17 10:52:55 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 10:52:55 -0700 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: <20070717043740.GB8527@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jul 2007 07:37:40 +0300") References: <20070717043740.GB8527@mellanox.co.il> Message-ID: > Here's some anecdotal evidence :) > http://lists.openfabrics.org/pipermail/general/2007-May/035758.html Right, but then we went on to say that we probably want to use multiple vectors to separate out multiple HCA ports rather than send/sreceive on the same port. And the current IPoIB implementation of having that second CQ seems suboptimal anyway, since it seems to leave us susceptible to the interrupt overload that NAPI was supposed to solve. 
At a higher level, I'm left wondering why nobody talked about multiple EQs during the last months of the 2.6.22 process and now all of a sudden it becomes urgent in the last few days of the 2.6.23 merge window. That's not really how I like to merge features.... - R. From rdreier at cisco.com Tue Jul 17 10:53:50 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 10:53:50 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <20070716200540.GA8527@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 16 Jul 2007 23:05:40 +0300") References: <20070713054711.GA21709@mellanox.co.il> <20070714175425.GA17597@mellanox.co.il> <20070716200540.GA8527@mellanox.co.il> Message-ID: > Well, the only issue I recall is about the # of EQs we want to allocate. > Was there something else? Yes, some ideas about how applications should pick which EQ to use. And how to handle CPU affinity. And whether we want to try to do something NUMA-aware. - R. From rdreier at cisco.com Tue Jul 17 10:57:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 10:57:42 -0700 Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib In-Reply-To: <20070717175154.GF7479@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jul 2007 20:51:54 +0300") References: <1183643723.25031.262.camel@mtls03> <20070717175154.GF7479@mellanox.co.il> Message-ID: > Try running a UDP benchmark with small message size and > NDELAY, and looking at number of UDP errors with netstat. If you give me an exact command line I can try it but I don't think I'll have time to figure out what to run by myself. From rdreier at cisco.com Tue Jul 17 11:05:33 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 11:05:33 -0700 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717152546.GA6863@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 17 Jul 2007 18:25:46 +0300") References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> Message-ID: It seems to me that this is all stemming from the same old fundamental confusion between a "release" and a "distribution." I think everyone would be better served by a process where individual maintainers were responsible for releasing tarballs of their packages, with schedules coordinated toward an overall "openfabrics release" (see http://live.gnome.org/TwoPointNineteen for hints about a process that might work), and then an OFED team handled spec files and kernel module packaging for various distros. In this world I would expect Doug could just take tarballs from the openfabrics world and not be bothered by the OFED RPM spec files (unless he wants to use them as a reference). To summarize, there would be two separate "products": - openfabrics release: format: .tar.gz files customers: OFED, Red Hat/Novell/Debian/etc packagers - OFED release: format: .srpm and binary .rpm files customers: end users who need newer drivers than their distribution includes - R. From rdreier at cisco.com Tue Jul 17 11:06:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 11:06:39 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: (Shirley Ma's message of "Fri, 13 Jul 2007 11:50:54 -0700") References: Message-ID: > We are working on IPoIB to use multiple EQ for multiple > links/connetions scalability. Does this mean this will wait for 2.6.24? I think so -- I don't want to merge something that first appears in the last few days of the merge window. The idea is to get your stuff queued up *before* the merge window opens. - R. From rdreier at cisco.com Tue Jul 17 11:07:59 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 11:07:59 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... 
In-Reply-To: (Roland Dreier's message of "Thu, 12 Jul 2007 16:15:58 -0700") References: Message-ID: > - Take a look at Sean's local SA caching patches. I merged > everything else from Sean's tree, but I'm still undecided about > these. I haven't read them carefully yet, but even aside from that > I don't have a good feeling about whether there's consensus about > this yet. Any opinions about merging, for or against, would be > appreciated here. Does anyone other than Sean have an opinion here? If you want this feature, if you've tested it, if you don't think it's ready yet, whatever, please speak up -- I don't feel comfortable making a decision on my own here (although I will if I have to). From mshefty at ichips.intel.com Tue Jul 17 11:20:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 11:20:49 -0700 Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in QP access flags In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il> References: <200707171758.57442.dotanb@dev.mellanox.co.il> Message-ID: <469D0881.6050409@ichips.intel.com> Dotan Barak wrote: > Remove local write permission enable in QP access flags > (this attribute is being used only for remote permissions). > > Signed-off-by: Dotan Barak Acked-by: Sean Hefty Steve, does this look okay to you? 
>
> ---
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 23af7a0..9ffb998 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
>  		break;
>  	case RDMA_TRANSPORT_IWARP:
>  		if (!id_priv->cm_id.iw) {
> -			qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE;
> +			qp_attr->qp_access_flags = 0;
>  			*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
>  		} else
>  			ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From swise at opengridcomputing.com Tue Jul 17 11:29:20 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 13:29:20 -0500
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable
	in QP access flags
In-Reply-To: <469D0881.6050409@ichips.intel.com>
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
	<469D0881.6050409@ichips.intel.com>
Message-ID: <469D0A80.9060906@opengridcomputing.com>

Why are you changing this? I think we set it for a specific reason (but
I don't remember why just now)...

Sean Hefty wrote:
> Dotan Barak wrote:
>> Remove local write permission enable in QP access flags
>> (this attribute is being used only for remote permissions).
>>
>> Signed-off-by: Dotan Barak
>
> Acked-by: Sean Hefty
>
> Steve, does this look okay to you?
> >> >> --- >> >> diff --git a/drivers/infiniband/core/cma.c >> b/drivers/infiniband/core/cma.c >> index 23af7a0..9ffb998 100644 >> --- a/drivers/infiniband/core/cma.c >> +++ b/drivers/infiniband/core/cma.c >> @@ -573,7 +573,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, >> struct ib_qp_attr *qp_attr, >> break; >> case RDMA_TRANSPORT_IWARP: >> if (!id_priv->cm_id.iw) { >> - qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE; >> + qp_attr->qp_access_flags = 0; >> *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; >> } else >> ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr, >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> From dledford at redhat.com Tue Jul 17 11:34:14 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 18:34:14 +0000 Subject: [ofa-general] RE: RFC OFED-1.3 installation In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <6C2C79E72C305246B504CBA17B5500C901E738DE@mtlexch01.mtl.com> Message-ID: <1184697254.5165.527.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 18:36 +0300, Vladimir Sokolovsky wrote: > [ snip ] > > Let me copy and paste an email conversation I had with Or that > > highlights why this is broken: > > > > ------- Begin cut-n-paste > > On Mon, 2007-07-02 at 22:25 +0300, Or Gerlitz wrote: > > > [sorry for breaking the thread, I am working from home now and > unable > > to use normal mailer.] > > > > > > > Let me give an example. In OFED 1.0, you shipped dapl version 1.2. > In > > OFED 1.1, you also shipped dapl version 1.2. However, code inspection > > shows that between OFED 1.0 and OFED 1.1, dapl did in fact change (not > > a lot, but anything is enough). 
>
> I am not suppose to support correct versioning for every package in
> OFED.
> It should be done by the maintainer of the package.

This may be true going forward when you've split all the packages up,
but it definitely was *not* true when all the packages were thrown into
one huge tarball and built out of one spec file. Since the versioning
information was lost when that tarball was recreated over and over
again, versioning responsibility necessarily fell back upon the spec
file.

> > The only reason that the OFED distribution has *ever* reliably
> > installed the rpms you wanted installed is because you compile things
> > locally and then *force* the upgrade of rpms over the top of older rpms
> > that have the same version number. And even then, you yourselves can't
> > tell the difference between a customer with the OFED 1.0 or OFED 1.1
> > dapl installed by checking the RPM version, you just have to go off
> > what the end user *tells* you he installed and hope he's right.
> >
>
> OFED does not force an upgrade, it simply removes the previous version
> and then installs the new one.

From the viewpoint of proper upgrades, there is no difference. Removing
and then installing is just a work around for broken upgrades.

> This is why package versioning does not affect OFED installation.

Right, you guys did things in a way that allowed you to not care about
something that any distributor *must* care about.

> I agree that it is different for Linux Distributions

Open Fabrics Enterprise Distribution

> and should be fixed
> for OFED-1.3 but it should
> be under responsibility of package maintainer.

The maintainer of any given software is responsible for their tarballs.
The maintainer of any given rpm is responsible for their spec file and
rpms. If a person takes on both roles, like Roland does, then they
handle both roles and the roles mostly merge into one.
But whenever someone other than the project maintainer decides to be the package maintainer, they are different roles and each is responsible for their own versioning requirements. > So, all RPM spec files should be fixed for OFED-1.3 and properly > maintained. > We should discuss the kernel-ib package structure and its spec file. > > > And I have to *know* what software my customer is running in order to > > support them. Because you guys have done things the way you have, I > > can't know that. I might be able to know if I could also guarantee > > they didn't download and locally compile your packages, but if they > > did, then the same version number of RPM can mean two different things > > entirely depending on whether it's your RPM or mine. > > > > You can easily check if there OFED installation by running 'ofed_info'. No, you can't. At least not on any system running our packages. We don't, and won't, include anything like ofed_info in our distribution. We have one tool, and only one, that we use to tell what software a system is running: rpm. We will not include things like ofed_info just to find out what rpm should, if used properly, already tell us. That would be unnecessary duplication and results in all sorts of support problems when you start needing to be able to tell customers to use multiple different tools to try and figure out what one tool should be able to tell them. > > > I posted links to a wealth of valuable information on the topic of > > making a proper spec file and creating *good* packages during my talk > > at Sonoma. I gather you haven't read those or you never would have > > suggested the above for creating the RPMs. > > > > I just looked into your presentation from Sonoma. You providing there an > example > of management package and your make.dist script for creating daily > builds and releases. > > I have a some questions about this script: > ... 
> VERSION=`grep "AC_INIT.*$target" $target/configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`
> ...
> DATE=`date +%Y%m%d`
> if [ -f $TMPDIR/$target.release ]; then
>     RELEASE=`cat $TMPDIR/$target.release`
>     RELEASE=`expr $RELEASE + 1`
> else
>     RELEASE=1
> fi
> echo $RELEASE > $TMPDIR/$target.release
> RELEASE=0.${RELEASE}.${DATE}git
> TARBALL=$target-git.tgz
> fi
> ...
> cp -a $target $target-$VERSION
> sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $target/$target.spec.in > $target-$VERSION/$target.spec
> cd $target-$VERSION
> ./autogen.sh
> cd ..
> echo "Creating $TMPDIR/$TARBALL"
> tar -czf $TMPDIR/$TARBALL --exclude=.git $target-$VERSION
>
> I thought that the standard way to get tar.gz file is using autotools (3
> commands) like I wrote before:
> autogen.sh, configure, make dist.
> Can you explain why your way is better?

autogen.sh yes (and it should have been in my script, my current one has
it, but that one didn't). Since configure tries to figure out a bunch of
stuff about the build environment, it must be run in the software
development environment of the platform you are targeting the final
build for. If you run it on your local RHEL4 machine, but our RHEL4
build environment for our next update has a different glibc that changes
some minor thing that configure actually checks, then it would be wrong.
So, even if you run configure, I can't trust the output from it.
Obviously, if you aren't running configure, then make dist is
irrelevant. So, you can run configure if you want, but I will ignore
the output in anything I build. And if the make dist operation removes
any files necessary for me to properly reconfigure the software using
configure, then it will be a totally broken tarball from my perspective.

> Do you have a proposal for daily builds? We need OFED daily builds for
> verification.
> We can't wait for RedHat updates to get the updated OFED packages.

I have a newer version of that make.dist script that I wrote to
specifically work for the repos other than the management tree. Using
that script, you could just do this:

for repo in *; do
    ./make.dist $repo daily
    rm $RPMDIR/${repo}*
    rpmbuild --rebuild dist/$repo-git.tar.gz
    rpm -Uvh $RPMDIR/${repo}*
done

That's really all you need for anything you are building for internal
use. And if one of you wanted to be responsible for providing the rpms,
then a single person could actually maintain versioned rpms that way.
It would only break down when you try to run the make.dist script from
different systems since it creates a file that lets it know what the
next number in sequence is each time it builds that git.tar.gz file.
However, even that could be solved by putting the release file in some
sort of SCM if you wanted multiple people to be able to build properly
versioned rpms.

Really, the strictest guidelines apply to things you make publicly
available. If you want to have a private, EWG only area on the ofa
server where you guys can share daily, unversioned builds, go right
ahead. It's when they go out in the wild and you expect other people to
pick them up that you have to care.

> What OFED-1.3 structure do you propose? Should it consist of source RPMs
> or tgz files?
> What features install script should support?

From my standpoint, tgz files are really about all I care about. For
instance, no matter what install script you write, I won't be using it
because we have our own install/update methods. And it's hard for you
to make a spec file that's both relevant for Red Hat and SuSE and at the
same time clean enough to meet our requirements.

There is one suggestion I would make though that greatly helps with the whole package versioning issue.
We have this trick we use in our kernel RPMs back when we used to ship a kernel-source rpm (which was different than the src.rpm, it was a pre-prepared, already prep'ed source tree ready to be built from). When we built our own kernel RPMs, we would go into the top level Makefile in the kernel source tree and edit the extraversion to be what matched the rpm. When we made that source tree that would become the kernel-source package, we edited extraversion to -prep so that the final result if a customer used it to build a kernel would be something like 2.6.9-prep in the kernel version. You guys could do something similar in all the src.rpms you ship. Since you know they will be compiled locally, you could easily put something like .local at the end of you release string, so that say dapl would be version: 1.2.1, release: 1.local or 1.ofa or something like that. It doesn't solve package version comparison issues (aka, telling which package is newer by the number), but it does help to solve identification issues. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From sean.hefty at intel.com Tue Jul 17 11:34:40 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 11:34:40 -0700 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: Message-ID: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com> >I think everyone >would be better served by a process where individual maintainers were >responsible for releasing tarballs of their packages, with schedules >coordinated toward an overall "openfabrics release" For what it's worth, I agree with this approach. 
- Sean From rdreier at cisco.com Tue Jul 17 11:35:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 11:35:30 -0700 Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in QP access flags In-Reply-To: <469D0A80.9060906@opengridcomputing.com> (Steve Wise's message of "Tue, 17 Jul 2007 13:29:20 -0500") References: <200707171758.57442.dotanb@dev.mellanox.co.il> <469D0881.6050409@ichips.intel.com> <469D0A80.9060906@opengridcomputing.com> Message-ID: > Why are you changing this? Because "local write" doesn't make sense as a QP permission -- the QP access flags are about what the remote end of a connection is allowed to do with RDMA. > I think we set it for a specific reason (but I don't remember why just > now)... I can't see anything in the cxgb3 or amso11000 drivers that would pay attention to this flag -- in fact amso1100 ignores qp_access_flags entirely. - R. From rdreier at cisco.com Tue Jul 17 11:37:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 11:37:39 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads per qp In-Reply-To: <200707171311.43680.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 17 Jul 2007 13:11:43 +0300") References: <200707171311.43680.jackm@dev.mellanox.co.il> Message-ID: > Change max outstanding rdma reads per QP from 4 to 16. > This enables an improvement in latency for rdma-read applications. This only affects performance if an app queues more than 4 RDMA READ requests, right? (Because the 5th request doesn't have to wait for the 1st request to complete before it's sent) Do we want to increase this for mthca too, or is mlx4 so much faster than mthca that we need more requests in flight to keep the pipeline full? - R. 
From swise at opengridcomputing.com Tue Jul 17 11:39:53 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 17 Jul 2007 13:39:53 -0500 Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in QP access flags In-Reply-To: References: <200707171758.57442.dotanb@dev.mellanox.co.il> <469D0881.6050409@ichips.intel.com> <469D0A80.9060906@opengridcomputing.com> Message-ID: <469D0CF9.1020302@opengridcomputing.com> Roland Dreier wrote: > > Why are you changing this? > > Because "local write" doesn't make sense as a QP permission -- the QP > access flags are about what the remote end of a connection is allowed > to do with RDMA. > > > I think we set it for a specific reason (but I don't remember why just > > now)... > > I can't see anything in the cxgb3 or amso11000 drivers that would pay > attention to this flag -- in fact amso1100 ignores qp_access_flags entirely. > > - R. ok then... Acked-by: Steve Wise From dledford at redhat.com Tue Jul 17 11:41:44 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 14:41:44 -0400 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com> References: <000101c7c8a1$2342d1e0$3c98070a@amr.corp.intel.com> Message-ID: <1184697704.5165.534.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 11:34 -0700, Sean Hefty wrote: > >I think everyone > >would be better served by a process where individual maintainers were > >responsible for releasing tarballs of their packages, with schedules > >coordinated toward an overall "openfabrics release" > > For what it's worth, I agree with this approach. Ditto. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Tue Jul 17 11:43:18 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 18:43:18 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717174526.GE7479@mellanox.co.il> References: <469B639A.1090804@dev.mellanox.co.il> <1184642968.5165.414.camel@firewall.xsintricity.com> <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> Message-ID: <1184697799.5165.536.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 20:45 +0300, Michael S. Tsirkin wrote: > > There are lots of things that we as a distributor have to care about > > that upstream generally does not. The spec file and patches are how we > > solve our customer's problems. They are what make a stable > > distribution, as opposed to a "bleeding edge, must always update to > > latest upstream version to fix any problem" system, a reality. It's the > > difference between RHEL and Fedora. > > I think I am getting it - you want to release a patched version of some OFED > library without going through openfabrics? OK. > So I imagine that's when you would increment the rpm-specific version number. > But I can't see why would an OFED release want to play with these. You don't want to, you *have* to. It's because you are distributing source software packages that build RPMs. And you aren't waiting until OFED is final, you release pre-releases too. 
So you need to be able to tell the difference between a customer running libibverbs-1.0.4 from OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final. In order to do so, they need a different release number because the version number is the same. The only way this changes is if every component of OFED 1.3 releases their final tar.gz file in concert with OFED 1.3. Otherwise, at least *some* items in there will need a bumped release number. Unless of course you are just relying on ofed_info, which as I pointed out in my last email, is a workaround for not doing this. We *won't* use that workaround because having two means to tell the same thing increases our support personnel training costs and makes things more confusing for the customer. We have one tool already, that's good enough. Additionally, once you step into the "create rpms" space, there are only two ways things can go. You can adhere to RPM packaging standards, and your custom built RPMs will peacefully coexist on a system were there are similar RPMs coming from the OS distributor, aka Red Hat. Or, you can do what you've been doing, where RPMs you build don't maintain consistent numbering, and the customer can end up getting screwed when your RPMs and our RPMs collide. It would be careless and reckless to risk customer systems going belly up because your RPM and mine collide in a way that renders the machine dysfunctional. So don't think of it as playing games with bumping release numbers, think of it as finally making OFED RPMs standard compliant so you no longer need the workaround of ofed_info. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From tziporet at dev.mellanox.co.il Tue Jul 17 12:25:54 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 17 Jul 2007 22:25:54 +0300
Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads per qp
In-Reply-To:
References: <200707171311.43680.jackm@dev.mellanox.co.il>
Message-ID: <469D17C2.3040403@mellanox.co.il>

Roland Dreier wrote:
> > Change max outstanding rdma reads per QP from 4 to 16.
> > This enables an improvement in latency for rdma-read applications.
>
> This only affects performance if an app queues more than 4 RDMA READ
> requests, right? (Because the 5th request doesn't have to wait for
> the 1st request to complete before it's sent)
>
> Do we want to increase this for mthca too, or is mlx4 so much faster
> than mthca that we need more requests in flight to keep the pipeline
> full?
>
>
I suggest we do this in mthca too.

tziporet

From tziporet at dev.mellanox.co.il Tue Jul 17 12:54:36 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 17 Jul 2007 22:54:36 +0300
Subject: [ofa-general] OFED July 16 meeting summary
Message-ID: <469D1E7C.7040701@mellanox.co.il>

*OFED July 16 meeting summary*

*1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August.*

There was a long discussion on pros & cons regarding merging the two
releases.

Pros:
- Everybody will be focused on the same release
- All user space libs (except for the new libmlx4) are the same
- Reduce QA efforts

Cons:
- The kernel was changed to be 2.6.22 based and this can cause
  instability.
- Harder to distinguish what the differences are between 1.2 and 1.2.c
  (since it's not only a few patches).
- The 1.2.c release was aimed at ConnectX support only. If we lump the
  two releases together it may slow the convergence of this release.

In addition there is a need to check with IBM and Chelsio, who actually
asked for the 1.2.1 release, whether this suits them. Steve agreed to
test 1.2.c to see if it's OK with his fixes. Need a response from IBM
too. (BTW - no patches from IBM were sent so far.)

Decision: No decision was taken. I suggest we stay with two different
branches for now. After more people test 1.2.c and see whether it is
stable enough, we can decide not to do 1.2.1.

*2. Agree on OFED 1.3 schedule:*

The suggested schedule:
* Feature freeze - Sep 4
* Alpha release - Sep 10
* Beta release - Sep 25
* RC1 - Oct 16
* RC2 - Oct 30
* RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov 11)
* RC4 - Nov 20
* GA release - Nov 30 (or first week of Dec)

Discussion:
- Due to the 1.2.c release the schedule seems very tight.
- Since 1.2.c progresses only the kernel, many user level features that
  are already done are not exposed to customers in an OFED release.

Decision: Revisit the schedule in September according to the readiness
of the "must have" features.

*3. Review OFED 1.3 features list:*

There was an agreement on the must have features, except QoS, which
should be defined after the IBTA SPEC is published.
We have not reviewed the list of features thoroughly. Each company
should review the features and send comments to the list.

Must have general features:
====================
* Kernel based on 2.6.23 (all new features that will be part of this
  kernel will be included in OFED 1.3)
* Install:
  o Break the packages RPMs (work with Novell and Red Hat) to minimize
    integration effort into OS distribution
* Package:
  o Sources arrangement for the end user (for the labs)
* New HCAs & RNICs:
  o ConnectX support
  o NetEffect support
* QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP)

Other features (must have marked with *)
==============================
* libibverbs: New verbs:
  o Scalable Reliable Connected Transport (with Mellanox ConnectX)*
  o Reliable Multicast?

ULPs:
* IPoIB:
  o Performance improvements (those that will be stable on time)
  o NAPI - done
* SDP:
  o * Keepalive
  o * AIO
* uDAPL:
  o DAT 2.0 support with IB extensions for immediate data, atomics;
  o Add extensions for new verbs (SRCT,RM)
* VNIC:
  o GA quality. Not a technology preview version anymore.
  o Added support for QLogic EVIC (10 Gbps Infiniband-to-Ethernet
    gateway) - in GA
* RDS: RDMA API (using FMRs); GA quality with Oracle 11
* NFSoRDMA integration - pending we have a maintainer
* Management:
  o * Multiple partitions via libibumad
  o OpenSM
    + More routing performance improvements - done
    + Even more speedups - done
    + Better packaging/installation - done
    + "Native" daemon mode - done
    + * Performance management
    + * Quality of Service manager: Based on IBTA annex
    + Enhancements for fat tree routing (non pure tree support) - done
    + More console commands and telnet access to console - done
  o More diagnostics
    + ibidsverify.pl: validate LIDs and GUIDs in subnet - done
    + Updated ibnetdiscover format with link width and speed, and
      GUIDs - done
    + ibnetdiscover grouping support for new Voltaire chassis - done
    + diag updates for IB router support - done
    + iblinkinfo.pl: Support peer port link width and speed
      validation - done
    + ibdatacounters: Add script and man page for subnet wide data
      counters saquery enhancements - done
* iWARP:
  o * Chelsio: Get to GA level
  o NetEffect: Get the drivers into OFED

From coutinho at dcc.ufmg.br Tue Jul 17 12:55:03 2007
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Tue, 17 Jul 2007 16:55:03 -0300
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717165553.GA10298@vt.edu>
References: <20070717165553.GA10298@vt.edu>
Message-ID:

Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
It's more documented and it's more portable. It works with Infiniband and iWARP
(http://en.wikipedia.org/wiki/IWARP).
2007/7/17, Bharath Ramesh :
>
> I am trying to migrate my research work to InfiniBand. I was searching
> for different resources which would help me in migrating to use
> InfiniBand. I couldnt find any technical documentation on how to develop
> applications using IB VAPI. The only documentation that closely
> resembles an API description is the InfiniBand Architecture release's
> Chapter 11 which talks about the software transport Verbs. I tried using
> the infiniband/verbs.h and to get some kind of understanding on how to
> develop code to use ibverbs.
>
> There are many aspects that one still doesnt understand. I was just
> wondering if the development community could help me in providing me
> with some resources or pointers so that I can better understand on how
> to use ibverbs. I am more interested in using the reliable datagram
> transport provided by ibverbs. I am not subscribed to the mailing list,
> I would really appreciate it if you could cc me in the reply. I really
> appreciate anyone taking time out of their busy schedule in providing me
> some help.
>
> Thanks,
>
> Bharath
>
> ---
> Bharath Ramesh
> http://people.cs.vt.edu/~bramesh
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From eli at mellanox.co.il Tue Jul 17 12:59:08 2007
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 17 Jul 2007 22:59:08 +0300
Subject: [ofa-general] socket buffer accounting with UDP/ipoib
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901E739A0@mtlexch01.mtl.com>

I just got back from vacation and started working on a version that does
the same for both UD and CM modes. I will send a distinct patch for CM
later this week.
-----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Tuesday, July 17, 2007 8:42 PM To: Eli Cohen Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] socket buffer accounting with UDP/ipoib I did a quick hack to enable copybreak for UD packets up to 256 bytes (see below). This is still missing copybreak for CM / RC mode. However I just wanted to see how it affected performance. And the answer is that on my system (fast quad-core Xeon, 1-port Mellanox PCIe HCA) is that it didn't make any difference in small-message latency or throughput, at least none that I could measure with netpipe (NPtcp). From rdreier at cisco.com Tue Jul 17 13:11:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 13:11:30 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads per qp In-Reply-To: <469D17C2.3040403@mellanox.co.il> (Tziporet Koren's message of "Tue, 17 Jul 2007 22:25:54 +0300") References: <200707171311.43680.jackm@dev.mellanox.co.il> <469D17C2.3040403@mellanox.co.il> Message-ID: > I suggest we do this in mthca too. Have you tested this to know whether it matters? Increasing the limit uses more memory per QP... Does the rdma read latency test in OFED queue up enough work requests to measure this? - R. From rdreier at cisco.com Tue Jul 17 13:15:44 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 13:15:44 -0700 Subject: [ofa-general] OpenIB development help In-Reply-To: (Bruno Coutinho's message of "Tue, 17 Jul 2007 16:55:03 -0300") References: <20070717165553.GA10298@vt.edu> Message-ID: > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ). > It's more documented and it's more portable. It works with Infiniband and iWARP > (http://en.wikipedia.org/wiki/IWARP). Oh no!! First of all I don't know of any kDAPL implementation for any OS that is still being developed. Everyone completely gave up on kDAPL a long time ago. 
And if all you care about is being able to work on top of IB and
iWARP, then libibverbs + librdmacm works perfectly fine without having
to add another layer and all the complexity of DAPL. And you don't
have to worry about code like (from dapl/common/dapl_cookie.c):

    new_head = (dapl_os_atomic_read (&buffer->head) + 1) % buffer->pool_size;

    if ( new_head == dapl_os_atomic_read (&buffer->tail) ) {
        dat_status = DAT_INSUFFICIENT_RESOURCES;
        goto bail;
    } else {
        dapl_os_atomic_set (&buffer->head, new_head);
        *cookie_ptr = &buffer->pool[dapl_os_atomic_read (&buffer->head)];
        dat_status = DAT_SUCCESS;
    }

From rdreier at cisco.com Tue Jul 17 13:18:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:18:21 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717165553.GA10298@vt.edu> (Bharath Ramesh's message of "Tue, 17 Jul 2007 12:55:53 -0400")
References: <20070717165553.GA10298@vt.edu>
Message-ID:

 > There are many aspects that one still doesnt understand. I was just
 > wondering if the development community could help me in providing me
 > with some resources or pointers so that I can better understand on how
 > to use ibverbs. I am more interested in using the reliable datagram
 > transport provided by ibverbs.

Unfortunately, no existing hardware supports RD (reliable datagram),
and even the API in libibverbs is not complete.

Anyway, the libibverbs source contains some example code in examples/,
and there are several other packages that have other examples, eg
librdmacm, the performance tests in OFED, etc.

If you have specific questions then please ask them on the mailing
list. It's very hard to answer a general query like "please teach me
how to use IB" but specific issues are easy to address.

From mst at dev.mellanox.co.il Tue Jul 17 13:27:30 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 17 Jul 2007 23:27:30 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184697799.5165.536.camel@firewall.xsintricity.com> References: <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> Message-ID: <20070717202730.GA15990@mellanox.co.il> > So you need to be able to > tell the difference between a customer running libibverbs-1.0.4 from > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final. I don't really think we want customers to run beta code, or intend to support such configurations. -- MST From sweitzen at cisco.com Tue Jul 17 13:29:12 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Tue, 17 Jul 2007 13:29:12 -0700 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717202730.GA15990@mellanox.co.il> References: <20070717152546.GA6863@mellanox.co.il><1184689249.5165.419.camel@firewall.xsintricity.com><20070717162731.GA7479@mellanox.co.il><1184690380.5165.430.camel@firewall.xsintricity.com><20070717164500.GB7479@mellanox.co.il><1184691962.5165.450.camel@firewall.xsintricity.com><20070717171250.GD7479@mellanox.co.il><1184693800.5165.480.camel@firewall.xsintricity.com><20070717174526.GE7479@mellanox.co.il><1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> Message-ID: > > So you need to be able to > > tell the difference between a customer running libibverbs-1.0.4 from > > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final. > > I don't really think we want customers to run beta code, or > intend to support > such configurations. 
But we still need to tell the difference, so we can tell the customer
they are running beta code and should upgrade.

Scott

From bramesh at vt.edu Tue Jul 17 13:34:28 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:34:28 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu>
Message-ID: <20070717203428.GA12927@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
> > There are many aspects that one still doesn't understand. I was just
> > wondering if the development community could help me in providing me
> > with some resources or pointers so that I can better understand on how
> > to use ibverbs. I am more interested in using the reliable datagram
> > transport provided by ibverbs.
>
> Unfortunately, no existing hardware supports RD (reliable datagram),
> and even the API in libibverbs is not complete.
>
> Anyway, the libibverbs source contains some example code in examples/,
> and there are several other packages that have other examples, eg
> librdmacm, the performance tests in OFED, etc.
>
> If you have specific questions then please ask them on the mailing
> list. It's very hard to answer a general query like "please teach me
> how to use IB" but specific issues are easy to address.
>

Thanks for replying to my mail. I have some basic understanding of IB. I
have gone through some of the example code in the example directory and
OFED performance test. I noticed that every one of those examples used
TCP to exchange information regarding lid, psn and qpn. My question is
basically that is there any other way to exchange this information using
only IB. Since no hardware supports RD, I have to bite the bullet and
use RC.

Thanks,

Bharath

---
Bharath Ramesh http://people.cs.vt.edu/~bramesh

From swise at opengridcomputing.com Tue Jul 17 13:40:24 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 17 Jul 2007 15:40:24 -0500
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu>
Message-ID: <469D2938.20104@opengridcomputing.com>

Roland Dreier wrote:
> > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
> > It's more documented and it's more portable. It works with Infiniband and iWARP
> > (http://en.wikipedia.org/wiki/IWARP).
>
> Oh no!!
>
> First of all I don't know of any kDAPL implementation for any OS that
> is still being developed. Everyone completely gave up on kDAPL a long
> time ago.
>
> And if all you care about is being able to work on top of IB and
> iWARP, then libibverbs + librdmacm works perfectly fine without having
> to add another layer and all the complexity of DAPL.

And librdmacm integrates with the routing subsystem allowing the RDMA CM
to choose the correct rdma device. DAPL forces you to open each device
and "hope" your remote destination is reachable...

Steve.

From leininger2 at llnl.gov Tue Jul 17 13:43:07 2007
From: leininger2 at llnl.gov (Matt Leininger)
Date: Tue, 17 Jul 2007 13:43:07 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To:
References:
Message-ID: <1184704987.7702.106.camel@hyperion>

On Tue, 2007-07-17 at 11:07 -0700, Roland Dreier wrote:
> > - Take a look at Sean's local SA caching patches. I merged
> > everything else from Sean's tree, but I'm still undecided about
> > these. I haven't read them carefully yet, but even aside from that
> > I don't have a good feeling about whether there's consensus about
> > this yet. Any opinions about merging, for or against, would be
> > appreciated here.
>
> Does anyone other than Sean have an opinion here? If you want this
> feature, if you've tested it, if you don't think it's ready yet,
> whatever, please speak up -- I don't feel comfortable making a
> decision on my own here (although I will if I have to).

Roland,

I would like to see these features moved upstream. DOE funded this
work as part of the items we see needing on our large scale IB
deployment (both present and future). So from at least one big customer
perspective we see this as useful.

I'll let others comment on specific code/implementation issues.

Thanks,

- Matt

> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

--
Matt Leininger, Ph.D.
Lawrence Livermore National Laboratory
leininger2 at llnl.gov
V 925-422-4110

From rdreier at cisco.com Tue Jul 17 13:43:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:43:42 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717202730.GA15990@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 17 Jul 2007 23:27:30 +0300")
References: <20070717152546.GA6863@mellanox.co.il>
 <1184689249.5165.419.camel@firewall.xsintricity.com>
 <20070717162731.GA7479@mellanox.co.il>
 <1184690380.5165.430.camel@firewall.xsintricity.com>
 <20070717164500.GB7479@mellanox.co.il>
 <1184691962.5165.450.camel@firewall.xsintricity.com>
 <20070717171250.GD7479@mellanox.co.il>
 <1184693800.5165.480.camel@firewall.xsintricity.com>
 <20070717174526.GE7479@mellanox.co.il>
 <1184697799.5165.536.camel@firewall.xsintricity.com>
 <20070717202730.GA15990@mellanox.co.il>
Message-ID:

 > I don't really think we want customers to run beta code

What's the point of a beta then??

 - R.
From rdreier at cisco.com Tue Jul 17 13:44:53 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:44:53 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717203428.GA12927@vt.edu> (Bharath Ramesh's message of "Tue, 17 Jul 2007 16:34:28 -0400")
References: <20070717165553.GA10298@vt.edu>
 <20070717203428.GA12927@vt.edu>
Message-ID:

 > Thanks for replying to my mail. I have some basic understanding of IB. I
 > have gone through some of the example code in the example directory and
 > OFED performance test. I noticed that every one of those examples used
 > TCP to exchange information regarding lid, psn and qpn. My question is
 > basically that is there any other way to exchange this information using
 > only IB. Since no hardware supports RD, I have to bite the bullet and
 > use RC.

Look at librdmacm (or libibcm). They provide higher-level
abstractions for connection establishment.

From rdreier at cisco.com Tue Jul 17 13:45:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:45:42 -0700
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <1184704987.7702.106.camel@hyperion> (Matt Leininger's message of "Tue, 17 Jul 2007 13:43:07 -0700")
References: <1184704987.7702.106.camel@hyperion>
Message-ID:

 > I would like to see these features moved upstream. DOE funded this
 > work as part of the items we see needing on our large scale IB
 > deployment (both present and future). So from at least one big customer
 > perspective we see this as useful.

Does your reference to "present deployment" mean you are running this
code now?

 - R.
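
On the lid/psn/qpn question raised in this thread, the following is a
minimal C sketch of what such an exchange carries, independent of the
transport that delivers it. It is only an illustration under stated
assumptions: struct conn_info, conn_info_encode(), conn_info_decode()
and the fixed-width hex layout are hypothetical names and choices made
for this sketch, loosely modeled on the style of the libibverbs
examples. They are not part of libibverbs, librdmacm, or libibcm, and
the sketch does not show librdmacm connection establishment.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the values two RC peers swap. */
struct conn_info {
	uint16_t lid;	/* port local identifier (16 bits) */
	uint32_t qpn;	/* queue pair number (24 bits used) */
	uint32_t psn;	/* initial packet sequence number (24 bits used) */
};

/* Size of the encoded message, including the terminating NUL. */
#define CONN_MSG_LEN sizeof("0000:000000:000000")

/*
 * Encode as fixed-width hex "llll:qqqqqq:pppppp" (an assumed layout).
 * Returns 0 on success, -1 if the buffer is too small.
 */
int conn_info_encode(const struct conn_info *ci, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%04x:%06x:%06x",
			 (unsigned)ci->lid,
			 (unsigned)(ci->qpn & 0xffffff),
			 (unsigned)(ci->psn & 0xffffff));

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Decode a message produced by conn_info_encode().
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int conn_info_decode(const char *buf, struct conn_info *ci)
{
	unsigned lid, qpn, psn;

	if (sscanf(buf, "%x:%x:%x", &lid, &qpn, &psn) != 3)
		return -1;
	if (lid > 0xffff || qpn > 0xffffff || psn > 0xffffff)
		return -1;

	ci->lid = (uint16_t)lid;
	ci->qpn = qpn;
	ci->psn = psn;
	return 0;
}
```

Each side would encode its own values, send the string over whatever
channel the peers already share (the TCP socket in the examples
mentioned above, for instance), and decode the peer's reply before
connecting its RC QP.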
From bramesh at vt.edu Tue Jul 17 13:47:45 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:47:45 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu>
Message-ID: <20070717204745.GB12927@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
> > Perhaps you should use uDAPL/kDAPL ( http://www.datcollaborative.org/ ).
> > It's more documented and it's more portable. It works with Infiniband and iWARP
> > (http://en.wikipedia.org/wiki/IWARP).
>
> Oh no!!
>
> First of all I don't know of any kDAPL implementation for any OS that
> is still being developed. Everyone completely gave up on kDAPL a long
> time ago.
>
> And if all you care about is being able to work on top of IB and
> iWARP, then libibverbs + librdmacm works perfectly fine without having
> to add another layer and all the complexity of DAPL. And you don't
> have to worry about code like (from dapl/common/dapl_cookie.c):
>
>     new_head = (dapl_os_atomic_read (&buffer->head) + 1) % buffer->pool_size;
>
>     if ( new_head == dapl_os_atomic_read (&buffer->tail) )
>     {
>         dat_status = DAT_INSUFFICIENT_RESOURCES;
>         goto bail;
>     }
>     else
>     {
>         dapl_os_atomic_set (&buffer->head, new_head);
>
>         *cookie_ptr = &buffer->pool[dapl_os_atomic_read (&buffer->head)];
>         dat_status = DAT_SUCCESS;
>     }
>

I care about only working over IB. I don't want to add any more layers of
software because I want to minimize the number of software layers that I
need to traverse.

Thanks,

Bharath

---
Bharath Ramesh http://people.cs.vt.edu/~bramesh

From bramesh at vt.edu Tue Jul 17 13:52:50 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 16:52:50 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu>
 <20070717203428.GA12927@vt.edu>
Message-ID: <20070717205250.GA13127@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
> > Thanks for replying to my mail. I have some basic understanding of IB. I
> > have gone through some of the example code in the example directory and
> > OFED performance test. I noticed that every one of those examples used
> > TCP to exchange information regarding lid, psn and qpn. My question is
> > basically that is there any other way to exchange this information using
> > only IB. Since no hardware supports RD, I have to bite the bullet and
> > use RC.
>
> Look at librdmacm (or libibcm). They provide higher-level
> abstractions for connection establishment.
>

Thanks for pointing to them. Another question off-topic I would say. I
noticed that you are the maintainer for libibverbs in debian. Is there
any time line when you might get librdmacm or libibcm into debian
experimental/unstable?

Thanks,

Bharath

---
Bharath Ramesh http://people.cs.vt.edu/~bramesh

From rdreier at cisco.com Tue Jul 17 13:58:37 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 13:58:37 -0700
Subject: [ofa-general] OpenIB development help
In-Reply-To: <20070717205250.GA13127@vt.edu> (Bharath Ramesh's message of "Tue, 17 Jul 2007 16:52:50 -0400")
References: <20070717165553.GA10298@vt.edu>
 <20070717203428.GA12927@vt.edu>
 <20070717205250.GA13127@vt.edu>
Message-ID:

 > Thanks for pointing to them. Another question off-topic I would say. I
 > noticed that you are the maintainer for libibverbs in debian. Is there
 > any time line when you might get librdmacm or libibcm into debian
 > experimental/unstable?

I don't have any plans to do any more Debian packages (like librdmacm)
right now. I am the upstream for libibverbs in addition to being the
Debian maintainer, which is why I package it for Debian (and Fedora).
And I am not the upstream for librdmacm.

Actually now that I think of it, I do plan to prepare libmlx4 packages
at some point, but again I am the upstream there.

 - R.
From bramesh at vt.edu Tue Jul 17 14:02:01 2007
From: bramesh at vt.edu (Bharath Ramesh)
Date: Tue, 17 Jul 2007 17:02:01 -0400
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu>
 <20070717203428.GA12927@vt.edu>
 <20070717205250.GA13127@vt.edu>
Message-ID: <20070717210201.GA13439@vt.edu>

* Roland Dreier (rdreier at cisco.com) wrote:
> > Thanks for pointing to them. Another question off-topic I would say. I
> > noticed that you are the maintainer for libibverbs in debian. Is there
> > any time line when you might get librdmacm or libibcm into debian
> > experimental/unstable?
>
> I don't have any plans to do any more Debian packages (like librdmacm)
> right now. I am the upstream for libibverbs in addition to being the
> Debian maintainer, which is why I package it for Debian (and Fedora).
> And I am not the upstream for librdmacm.
>
> Actually now that I think of it, I do plan to prepare libmlx4 packages
> at some point, but again I am the upstream there.
>
> - R.
>

Thanks. If there are no plans to build librdmacm for Debian for now, I
guess I will build it myself.

Thanks,

Bharath

---
Bharath Ramesh http://people.cs.vt.edu/~bramesh

From mst at dev.mellanox.co.il Tue Jul 17 14:09:35 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:09:35 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To:
References: <20070717162731.GA7479@mellanox.co.il>
 <1184690380.5165.430.camel@firewall.xsintricity.com>
 <20070717164500.GB7479@mellanox.co.il>
 <1184691962.5165.450.camel@firewall.xsintricity.com>
 <20070717171250.GD7479@mellanox.co.il>
 <1184693800.5165.480.camel@firewall.xsintricity.com>
 <20070717174526.GE7479@mellanox.co.il>
 <1184697799.5165.536.camel@firewall.xsintricity.com>
 <20070717202730.GA15990@mellanox.co.il>
Message-ID: <20070717210935.GA17168@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
>
> > I don't really think we want customers to run beta code
>
> What's the point of a beta then??

Dunno.
In previous OFED releases, we had "release candidates" rather than "beta".
Openfabrics members were running RCs and reporting issues on the list and in
bugzilla. Do you really ask your customers to do this for you?

--
MST

From mst at dev.mellanox.co.il Tue Jul 17 14:14:44 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:14:44 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To:
References: <20070717202730.GA15990@mellanox.co.il>
Message-ID: <20070717211444.GB17168@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) :
> Subject: RE: [ofa-general] Re: RFC OFED-1.3 installation
>
> > > So you need to be able to
> > > tell the difference between a customer running libibverbs-1.0.4 from
> > > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final.
> >
> > I don't really think we want customers to run beta code, or
> > intend to support
> > such configurations.
>
> But we still need to tell the difference, so we can tell the customer
> they are running beta code and should upgrade.

Sure, this makes sense. Non-release code such as nightly builds must be
marked as such as clearly as possible. Installing such a version will
always have an element of risk in it, though, and I don't think we want
to encourage such use in a production environment.

--
MST

From sweitzen at cisco.com Tue Jul 17 14:16:49 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 17 Jul 2007 14:16:49 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717210935.GA17168@mellanox.co.il>
References: <20070717162731.GA7479@mellanox.co.il>
 <1184690380.5165.430.camel@firewall.xsintricity.com>
 <20070717164500.GB7479@mellanox.co.il>
 <1184691962.5165.450.camel@firewall.xsintricity.com>
 <20070717171250.GD7479@mellanox.co.il>
 <1184693800.5165.480.camel@firewall.xsintricity.com>
 <20070717174526.GE7479@mellanox.co.il>
 <1184697799.5165.536.camel@firewall.xsintricity.com>
 <20070717202730.GA15990@mellanox.co.il>
 <20070717210935.GA17168@mellanox.co.il>
Message-ID:

> > > I don't really think we want customers to run beta code
> >
> > What's the point of a beta then??
>
> Dunno.
> In previous OFED releases, we had "release candidates" rather
> than "beta".
> Openfabrics members were running RCs and reporting issues on
> the list and in
> bugzilla. Do you really ask your customers to do this for you?

You say toMAYto, I say toMAHto.

We had many customers running various OFED 1.2 pre-GA builds for
testing, sometimes we had to use a daily build because of certain bug
fixes.

Scott

From arthur.jones at qlogic.com Tue Jul 17 14:19:18 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 17 Jul 2007 14:19:18 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipath: Make a few functions static
In-Reply-To:
References:
Message-ID: <20070717211918.GD30170@bauxite.pathscale.com>

hi roland, this patch looks good, thanks!

arthur

On Mon, Jul 16, 2007 at 10:49:12AM -0700, Roland Dreier wrote:
> Make some functions that are only used in a single .c file static. In
> addition to being a cleanup, this shrinks the generated code. On x86_64:
On x86_64: > > add/remove: 1/3 grow/shrink: 2/1 up/down: 4777/-4956 (-179) > function old new delta > handle_errors - 3994 +3994 > __verbs_timer 42 710 +668 > ipath_do_ruc_send 2131 2246 +115 > ipath_no_bufs_available 136 - -136 > ipath_disarm_senderrbufs 639 - -639 > ipath_ib_timer 658 - -658 > ipath_intr 5878 2355 -3523 > > Signed-off-by: Roland Dreier > --- > Does this look OK to merge for 2.6.23? > > diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c > index 9361f5a..09c5fd8 100644 > --- a/drivers/infiniband/hw/ipath/ipath_driver.c > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c > @@ -1889,7 +1889,7 @@ void ipath_write_kreg_port(const struct ipath_devdata *dd, ipath_kreg regno, > /* Below is "non-zero" to force override, but both actual LEDs are off */ > #define LED_OVER_BOTH_OFF (8) > > -void ipath_run_led_override(unsigned long opaque) > +static void ipath_run_led_override(unsigned long opaque) > { > struct ipath_devdata *dd = (struct ipath_devdata *)opaque; > int timeoff; > diff --git a/drivers/infiniband/hw/ipath/ipath_eeprom.c b/drivers/infiniband/hw/ipath/ipath_eeprom.c > index 6b91479..b4503e9 100644 > --- a/drivers/infiniband/hw/ipath/ipath_eeprom.c > +++ b/drivers/infiniband/hw/ipath/ipath_eeprom.c > @@ -426,8 +426,8 @@ bail: > * @buffer: data to write > * @len: number of bytes to write > */ > -int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset, > - const void *buffer, int len) > +static int ipath_eeprom_internal_write(struct ipath_devdata *dd, u8 eeprom_offset, > + const void *buffer, int len) > { > u8 single_byte; > int sub_len; > diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c > index 47aa434..1fd91c5 100644 > --- a/drivers/infiniband/hw/ipath/ipath_intr.c > +++ b/drivers/infiniband/hw/ipath/ipath_intr.c > @@ -70,7 +70,7 @@ static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum) > * If rewrite is true, and bits are 
set in the sendbufferror registers, > * we'll write to the buffer, for error recovery on parity errors. > */ > -void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) > +static void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) > { > u32 piobcnt; > unsigned long sbuf[4]; > diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h > index 3105005..b6ccd04 100644 > --- a/drivers/infiniband/hw/ipath/ipath_kernel.h > +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h > @@ -776,7 +776,6 @@ void ipath_get_eeprom_info(struct ipath_devdata *); > int ipath_update_eeprom_log(struct ipath_devdata *dd); > void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); > u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); > -void ipath_disarm_senderrbufs(struct ipath_devdata *, int); > > /* > * Set LED override, only the two LSBs have "public" meaning, but > diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c > index 8525674..c69c252 100644 > --- a/drivers/infiniband/hw/ipath/ipath_ruc.c > +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c > @@ -507,7 +507,7 @@ static int want_buffer(struct ipath_devdata *dd) > * > * Called when we run out of PIO buffers. > */ > -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) > +static void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) > { > unsigned long flags; > > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c > index 65f7181..16aa61f 100644 > --- a/drivers/infiniband/hw/ipath/ipath_verbs.c > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c > @@ -488,7 +488,7 @@ bail:; > * This is called from ipath_do_rcv_timer() at interrupt level to check for > * QPs which need retransmits and to collect performance numbers. 
> */ > -void ipath_ib_timer(struct ipath_ibdev *dev) > +static void ipath_ib_timer(struct ipath_ibdev *dev) > { > struct ipath_qp *resend = NULL; > struct list_head *last; > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h > index f3d1f2c..9bbe819 100644 > --- a/drivers/infiniband/hw/ipath/ipath_verbs.h > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h > @@ -782,8 +782,6 @@ void ipath_update_mmap_info(struct ipath_ibdev *dev, > > int ipath_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); > > -void ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev); > - > void ipath_insert_rnr_queue(struct ipath_qp *qp); > > int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only); > @@ -807,8 +805,6 @@ void ipath_ib_rcv(struct ipath_ibdev *, void *, void *, u32); > > int ipath_ib_piobufavail(struct ipath_ibdev *); > > -void ipath_ib_timer(struct ipath_ibdev *); > - > unsigned ipath_get_npkeys(struct ipath_devdata *); > > u32 ipath_get_cr_errpkey(struct ipath_devdata *); From arthur.jones at qlogic.com Tue Jul 17 14:20:05 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 17 Jul 2007 14:20:05 -0700 Subject: [ofa-general] Re: is ipath_get_user_pages_nocopy() dead code? In-Reply-To: References: Message-ID: <20070717212004.GE30170@bauxite.pathscale.com> hi roland, ... On Mon, Jul 16, 2007 at 10:49:51AM -0700, Roland Dreier wrote: > I don't see any callers of ipath_get_user_pages_nocopy(). Should we > just delete it? yes, shall i queue it up and post it? thanks... arthur From arthur.jones at qlogic.com Tue Jul 17 14:20:59 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 17 Jul 2007 14:20:59 -0700 Subject: [ofa-general] is ipath_layer.c dead code? In-Reply-To: References: Message-ID: <20070717212059.GF30170@bauxite.pathscale.com> hi roland, i'm still testing this one, i'll get back to you soon... 

arthur

On Mon, Jul 16, 2007 at 10:43:02AM -0700, Roland Dreier wrote:
> My kernel seems to build and link fine with the patch below. Is
> ipath_layer.c being used for anything, or can we just kill it?
>
> - R.
>
> diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile
> index ec2e603..fe67388 100644
> --- a/drivers/infiniband/hw/ipath/Makefile
> +++ b/drivers/infiniband/hw/ipath/Makefile
> @@ -14,7 +14,6 @@ ib_ipath-y := \
>  	ipath_init_chip.o \
>  	ipath_intr.o \
>  	ipath_keys.o \
> -	ipath_layer.o \
>  	ipath_mad.o \
>  	ipath_mmap.o \
>  	ipath_mr.o \
> diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c
> deleted file mode 100644
> index 82616b7..0000000
> --- a/drivers/infiniband/hw/ipath/ipath_layer.c
> +++ /dev/null
> @@ -1,365 +0,0 @@
> -/*
> - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
> - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
> - *
> - * This software is available to you under a choice of one of two
> - * licenses. You may choose to be licensed under the terms of the GNU
> - * General Public License (GPL) Version 2, available from the file
> - * COPYING in the main directory of this source tree, or the
> - * OpenIB.org BSD license below:
> - *
> - *     Redistribution and use in source and binary forms, with or
> - *     without modification, are permitted provided that the following
> - *     conditions are met:
> - *
> - *      - Redistributions of source code must retain the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer.
> - *
> - *      - Redistributions in binary form must reproduce the above
> - *        copyright notice, this list of conditions and the following
> - *        disclaimer in the documentation and/or other materials
> - *        provided with the distribution.
> - *
> - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> - * SOFTWARE.
> - */
> -
> -/*
> - * These are the routines used by layered drivers, currently just the
> - * layered ethernet driver and verbs layer.
> - */
> -
> -#include
> -#include
> -
> -#include "ipath_kernel.h"
> -#include "ipath_layer.h"
> -#include "ipath_verbs.h"
> -#include "ipath_common.h"
> -
> -/* Acquire before ipath_devs_lock. */
> -static DEFINE_MUTEX(ipath_layer_mutex);
> -
> -u16 ipath_layer_rcv_opcode;
> -
> -static int (*layer_intr)(void *, u32);
> -static int (*layer_rcv)(void *, void *, struct sk_buff *);
> -static int (*layer_rcv_lid)(void *, void *);
> -
> -static void *(*layer_add_one)(int, struct ipath_devdata *);
> -static void (*layer_remove_one)(void *);
> -
> -int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_intr)
> -		ret = layer_intr(dd->ipath_layer.l_arg, arg);
> -
> -	return ret;
> -}
> -
> -int ipath_layer_intr(struct ipath_devdata *dd, u32 arg)
> -{
> -	int ret;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	ret = __ipath_layer_intr(dd, arg);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return ret;
> -}
> -
> -int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr,
> -		      struct sk_buff *skb)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_rcv)
> -		ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb);
> -
> -	return ret;
> -}
> -
> -int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr)
> -{
> -	int ret = -ENODEV;
> -
> -	if (dd->ipath_layer.l_arg && layer_rcv_lid)
> -		ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr);
> -
> -	return ret;
> -}
> -
> -void ipath_layer_lid_changed(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (dd->ipath_layer.l_arg && layer_intr)
> -		layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -void ipath_layer_add(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (layer_add_one)
> -		dd->ipath_layer.l_arg =
> -			layer_add_one(dd->ipath_unit, dd);
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -void ipath_layer_remove(struct ipath_devdata *dd)
> -{
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (dd->ipath_layer.l_arg && layer_remove_one) {
> -		layer_remove_one(dd->ipath_layer.l_arg);
> -		dd->ipath_layer.l_arg = NULL;
> -	}
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *),
> -			 void (*l_remove)(void *),
> -			 int (*l_intr)(void *, u32),
> -			 int (*l_rcv)(void *, void *, struct sk_buff *),
> -			 u16 l_rcv_opcode,
> -			 int (*l_rcv_lid)(void *, void *))
> -{
> -	struct ipath_devdata *dd, *tmp;
> -	unsigned long flags;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	layer_add_one = l_add;
> -	layer_remove_one = l_remove;
> -	layer_intr = l_intr;
> -	layer_rcv = l_rcv;
> -	layer_rcv_lid = l_rcv_lid;
> -	ipath_layer_rcv_opcode = l_rcv_opcode;
> -
> -	spin_lock_irqsave(&ipath_devs_lock, flags);
> -
> -	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
> -		if (!(dd->ipath_flags & IPATH_INITTED))
> -			continue;
> -
> -		if (dd->ipath_layer.l_arg)
> -			continue;
> -
> -		spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -		dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd);
> -		spin_lock_irqsave(&ipath_devs_lock, flags);
> -	}
> -
> -	spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_register);
> -
> -void ipath_layer_unregister(void)
> -{
> -	struct ipath_devdata *dd, *tmp;
> -	unsigned long flags;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -	spin_lock_irqsave(&ipath_devs_lock, flags);
> -
> -	list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) {
> -		if (dd->ipath_layer.l_arg && layer_remove_one) {
> -			spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -			layer_remove_one(dd->ipath_layer.l_arg);
> -			spin_lock_irqsave(&ipath_devs_lock, flags);
> -			dd->ipath_layer.l_arg = NULL;
> -		}
> -	}
> -
> -	spin_unlock_irqrestore(&ipath_devs_lock, flags);
> -
> -	layer_add_one = NULL;
> -	layer_remove_one = NULL;
> -	layer_intr = NULL;
> -	layer_rcv = NULL;
> -	layer_rcv_lid = NULL;
> -
> -	mutex_unlock(&ipath_layer_mutex);
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_unregister);
> -
> -int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax)
> -{
> -	int ret;
> -	u32 intval = 0;
> -
> -	mutex_lock(&ipath_layer_mutex);
> -
> -	if (!dd->ipath_layer.l_arg) {
> -		ret = -EINVAL;
> -		goto bail;
> -	}
> -
> -	ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS);
> -
> -	if (ret < 0)
> -		goto bail;
> -
> -	*pktmax = dd->ipath_ibmaxlen;
> -
> -	if (*dd->ipath_statusp & IPATH_STATUS_IB_READY)
> -		intval |= IPATH_LAYER_INT_IF_UP;
> -	if (dd->ipath_lid)
> -		intval |= IPATH_LAYER_INT_LID;
> -	if (dd->ipath_mlid)
> -		intval |= IPATH_LAYER_INT_BCAST;
> -	/*
> -	 * do this on open, in case low level is already up and
> -	 * just layered driver was reloaded, etc.
> -	 */
> -	if (intval)
> -		layer_intr(dd->ipath_layer.l_arg, intval);
> -
> -	ret = 0;
> -bail:
> -	mutex_unlock(&ipath_layer_mutex);
> -
> -	return ret;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_open);
> -
> -u16 ipath_layer_get_lid(struct ipath_devdata *dd)
> -{
> -	return dd->ipath_lid;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_lid);
> -
> -/**
> - * ipath_layer_get_mac - get the MAC address
> - * @dd: the infinipath device
> - * @mac: the MAC is put here
> - *
> - * This is the EUID-64 OUI octets (top 3), then
> - * skip the next 2 (which should both be zero or 0xff).
> - * The returned MAC is in network order
> - * mac points to at least 6 bytes of buffer
> - * We assume that by the time the LID is set, that the GUID is as valid
> - * as it's ever going to be, rather than adding yet another status bit.
> - */
> -
> -int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac)
> -{
> -	u8 *guid;
> -
> -	guid = (u8 *) &dd->ipath_guid;
> -
> -	mac[0] = guid[0];
> -	mac[1] = guid[1];
> -	mac[2] = guid[2];
> -	mac[3] = guid[5];
> -	mac[4] = guid[6];
> -	mac[5] = guid[7];
> -	if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff))
> -		ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: "
> -			  "%x %x\n", guid[3], guid[4]);
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_mac);
> -
> -u16 ipath_layer_get_bcast(struct ipath_devdata *dd)
> -{
> -	return dd->ipath_mlid;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_get_bcast);
> -
> -int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr)
> -{
> -	int ret = 0;
> -	u32 __iomem *piobuf;
> -	u32 plen, *uhdr;
> -	size_t count;
> -	__be16 vlsllnh;
> -
> -	if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) {
> -		ipath_dbg("send while not open\n");
> -		ret = -EINVAL;
> -	} else
> -		if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) ||
> -		    dd->ipath_lid == 0) {
> -			/*
> -			 * lid check is for when sma hasn't yet configured
> -			 */
> -			ret = -ENETDOWN;
> -			ipath_cdbg(VERBOSE, "send while not ready, "
> -				   "mylid=%u, flags=0x%x\n",
> -				   dd->ipath_lid, dd->ipath_flags);
> -		}
> -
> -	vlsllnh = *((__be16 *) hdr);
> -	if (vlsllnh != htons(IPATH_LRH_BTH)) {
> -		ipath_dbg("Warning: lrh[0] wrong (%x, not %x); "
> -			  "not sending\n", be16_to_cpu(vlsllnh),
> -			  IPATH_LRH_BTH);
> -		ret = -EINVAL;
> -	}
> -	if (ret)
> -		goto done;
> -
> -	/* Get a PIO buffer to use. */
> -	piobuf = ipath_getpiobuf(dd, NULL);
> -	if (piobuf == NULL) {
> -		ret = -EBUSY;
> -		goto done;
> -	}
> -
> -	plen = (sizeof(*hdr) >> 2); /* actual length */
> -	ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf);
> -
> -	writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */
> -	ipath_flush_wc();
> -	piobuf += 2;
> -	uhdr = (u32 *)hdr;
> -	count = plen-1; /* amount we can copy before trigger word */
> -	__iowrite32_copy(piobuf, uhdr, count);
> -	ipath_flush_wc();
> -	__raw_writel(uhdr[count], piobuf + count);
> -	ipath_flush_wc(); /* ensure it's sent, now */
> -
> -	ipath_stats.sps_ether_spkts++; /* ether packet sent */
> -
> -done:
> -	return ret;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_send_hdr);
> -
> -int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd)
> -{
> -	set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl);
> -
> -	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl,
> -			 dd->ipath_sendctrl);
> -	return 0;
> -}
> -
> -EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int);
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mst at dev.mellanox.co.il Tue Jul 17 14:32:15 2007
From: mst at dev.mellanox.co.il (Michael S.
 Tsirkin)
Date: Wed, 18 Jul 2007 00:32:15 +0300
Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues
In-Reply-To:
References: <20070717043740.GB8527@mellanox.co.il>
Message-ID: <20070717213215.GC17168@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH 01/10] IB/ehca: Support for multiple event queues
>
> > Here's some anecdotal evidence :)
> > http://lists.openfabrics.org/pipermail/general/2007-May/035758.html
>
> Right, but then we went on to say that we probably want to use
> multiple vectors to separate out multiple HCA ports rather than
> send/receive on the same port. And the current IPoIB implementation
> of having that second CQ seems suboptimal anyway, since it seems to
> leave us susceptible to the interrupt overload that NAPI was supposed
> to solve.

Sure, the ipoib patch is just a proof of concept anyway.
And I'm actually working on merging send/recv CQs now, to address the
livelocks.

> At a higher level, I'm left wondering why nobody talked about multiple
> EQs during the last months of the 2.6.22 process and now all of a
> sudden it becomes urgent in the last few days of the 2.6.23 merge
> window.

I don't see any emergency in merging the IPoIB hack either.

I just hoped that once we merge the core changes people will start
experimenting with multiple vectors. This did not seem to have happened.
Could this be because there's no low level driver support upstream yet?

So I wonder whether merging the mthca patch [that was patch 2 of the
series] in 2.6.23 will finally get the ball rolling, get people to
experiment with multiple vectors in userspace, and that will hopefully
teach us something.

> That's not really how I like to merge features....

If you look just at the mthca patch in isolation, do you still see a
problem?

--
MST

From rdreier at cisco.com Tue Jul 17 14:38:13 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 14:38:13 -0700
Subject: [ofa-general] Re: [PATCH] IB/ipath: Make a few functions static
In-Reply-To: <20070717211918.GD30170@bauxite.pathscale.com> (Arthur Jones's message of "Tue, 17 Jul 2007 14:19:18 -0700")
References: <20070717211918.GD30170@bauxite.pathscale.com>
Message-ID:

OK, I queued it for my next merge.

From rdreier at cisco.com Tue Jul 17 14:38:31 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 14:38:31 -0700
Subject: [ofa-general] Re: is ipath_get_user_pages_nocopy() dead code?
In-Reply-To: <20070717212004.GE30170@bauxite.pathscale.com> (Arthur Jones's message of "Tue, 17 Jul 2007 14:20:05 -0700")
References: <20070717212004.GE30170@bauxite.pathscale.com>
Message-ID:

 > yes, shall i queue it up and post it?

No need, I can do it locally just as easily.

From mst at dev.mellanox.co.il Tue Jul 17 14:44:17 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:44:17 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To:
References:
Message-ID: <20070717214417.GE17168@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: Further 2.6.23 merge plans...
>
> > - Take a look at Sean's local SA caching patches. I merged
> > everything else from Sean's tree, but I'm still undecided about
> > these. I haven't read them carefully yet, but even aside from that
> > I don't have a good feeling about whether there's consensus about
> > this yet. Any opinions about merging, for or against, would be
> > appreciated here.
>
> Does anyone other than Sean have an opinion here? If you want this
> feature, if you've tested it, if you don't think it's ready yet,
> whatever, please speak up -- I don't feel comfortable making a
> decision on my own here (although I will if I have to).

We have the patches applied in ofed 1.2.c with default module parameter
set to caching disabled (ofed 1.2 had a different version of the patches,
but caching is disabled by default there, too).

At least in this configuration (caching disabled), all issues I've seen
seem to be fixed now, and tests seem to be running smoothly. So I think
it's safe to merge it up if the module parameter is set to cache disabled
by default.

No idea what happens if it's enabled though :)

--
MST

From mst at dev.mellanox.co.il Tue Jul 17 14:58:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 00:58:11 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To:
References: <20070717210935.GA17168@mellanox.co.il>
Message-ID: <20070717215811.GA19243@mellanox.co.il>

> Quoting Scott Weitzenkamp (sweitzen) :
> Subject: RE: [ofa-general] Re: RFC OFED-1.3 installation
>
> > > > I don't really think we want customers to run beta code
> > >
> > > What's the point of a beta then??
> >
> > Dunno.
> > In previous OFED releases, we had "release candidates" rather
> > than "beta".
> > Openfabrics members were running RCs and reporting issues on
> > the list and in
> > bugzilla. Do you really ask your customers to do this for you?
>
> You say toMAYto, I say toMAHto.
>
> We had many customers running various OFED 1.2 pre-GA builds for
> testing, sometimes we had to use a daily build because of certain bug
> fixes.

OK then, I guess we could try to make it easy to switch between RCs.
But daily ... we don't want to increment a revision on each change, do we?
Maybe a nonstandard way like ofedinfo is enough for these testing setups?

--
MST

From rdreier at cisco.com Tue Jul 17 15:07:27 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 15:07:27 -0700
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To: <20070717215811.GA19243@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jul 2007 00:58:11 +0300")
References: <20070717210935.GA17168@mellanox.co.il>
 <20070717215811.GA19243@mellanox.co.il>
Message-ID:

 > But daily ... we don't want to increment a revision on each change, do we?

I think it's easy enough to make the revision of the RPMS be something
like -0.1.2007-07-17.1 or something like that.

From mst at dev.mellanox.co.il Tue Jul 17 15:12:06 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 01:12:06 +0300
Subject: [ofa-general] Re: RFC OFED-1.3 installation
In-Reply-To:
References: <20070717210935.GA17168@mellanox.co.il>
 <20070717215811.GA19243@mellanox.co.il>
Message-ID: <20070717221206.GC19243@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation
>
> > But daily ... we don't want to increment a revision on each change, do we?
>
> I think it's easy enough to make the revision of the RPMS be something
> like -0.1.2007-07-17.1 or something like that.

OK, so you say just ignore the content and stick a date in there?
Fine, that'll work, and we can cover the RCs this way too I think.

--
MST

From hal.rosenstock at gmail.com Tue Jul 17 15:33:11 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 17 Jul 2007 15:33:11 -0700
Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com>
Message-ID:

Hi Tziporet,

On 7/16/07, Tziporet Koren wrote:
>
> Hi All,
>
> We have our OFED synch meeting today at 9am PST.
>
> Agenda:
> 1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August.
> 2. Agree on OFED 1.3 schedule:
>    * Feature freeze - Sep 4
>    * Alpha release - Sep 10
>    * Beta release - Sep 25
>    * RC1 - Oct 16
>    * RC2 - Oct 30
>    * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov
>      11)
>    * RC4 - Nov 20
>    * GA release - Nov 30 (or first week of Dec)
> 3.
Review OFED 1.3 features list: > In last meeting we decided that the schedule is one of the most important > parameters in OFED 1.3. > Thus I divided the features for two categories: > > - "must have" features - features that must be ready for the release > (marked with *) > - "optional" features - features that can be included in the release > in case they are ready according to the schedule > > Must have general features: > ==================== > > - Kernel base on 2.6.23 (all new features that will be part of this > kernel will be included in OFED 1.3) > - Install: > - Break the packages RPMs (work with Novell and Redhat) to > minimize integration effort into OS distribution > - Package: > - Sources arrangement for the end user (for the labs) > - New HCAs & RNICs: > - ConnectX support > - Any other new HW? > - QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP) > > Other features (must have marked with *) > ============================== > > - libibverbs: New verbs: > - Scalable Reliable Connected Transport (with Mellanox > ConnectX)* > - Reliable Multicast? > > ULPs: > > - IPoIB: > - Performance improvements (those that will be stable on time) > - NAPI - done > - SDP: > - * Keepalive > - * AIO > - uDAPL: > - DAT 2.0 support with IB extensions for immediate data, > atomics; > - Add extensions for new verbs (SRCT,RM) > - VNIC: > - GA quality. Not a technology preview version anymore. 
> - Added support for QLogic EVIC (10 Gbps > Infiniband-to-Ethernet gateway) - in GA > - RDS: RDMA API (using FMRs); GA quality with Oracle 11 > - NFSoRDMA integration - pending we have a maintainer > - Management: > - * Multiple partitions via libibumad > - OpenSM > - More routing performance improvements - done > - Even more speedups - done > - Better packaging/installation - done > - "Native" daemon mode - done > - * Performance management > - * Quality of Service manager: Based on IBTA annex > - Enhancements for fat tree routing (non pure tree > support) - done > - More console commands and telnet access to console - > done > - More diagnostics > - ibidsverify.pl: validate LIDs and GUIDs in subnet - > done > - Updated ibnetdiscover format with link width and > speed, and GUIDs - done > - ibnetdiscover grouping support for new Voltaire > chassis - done > - diag updates for IB router support - done > - iblinkinfo.pl: Support peer port link width and speed > validation - done > - ibdatacounters: Add script and man page for subnet > wide data counters saquery enhancements - done > > What happened to ibsim ? I thought that was on the list I originally sent. -- Hal > - iWARP: > - * Chelsio: Get to GA level > - NetEffect: Get the drivers into OFED > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: *tziporet at mellanox.co.il* > Tel +972-4-9097200, ext 380 > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradeeps at linux.vnet.ibm.com Tue Jul 17 15:39:27 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 17 Jul 2007 15:39:27 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch Message-ID: <469D451F.6030706@linux.vnet.ibm.com> Here is a seventh version of the IPOIB_CM_NOSRQ patch. Changes from V6: 1. 
Minor changes incorporating Sean Hefty's comments (changed spin lock to an atomic and additional cleanups) This patch has been tested with linux-2.6.22 derived from Roland's for-2.6.23 git tree on ppc64 machines Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-10 18:30:10.000000000 -0400 @@ -95,11 +95,16 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (1ul << 16) #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) + +#define NOSRQ_INDEX_TABLE_SIZE 128 +#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_TABLE_SIZE -1) + #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +171,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by NOSRQ only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -NOSRQ only */ enum ipoib_cm_state state; }; @@ -215,6 +223,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long dev_kfree_skb_any(skb); } -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { } - #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-10 17:02:33.000000000 -0400 +++ 
b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 17:45:16.000000000 -0400 @@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; +int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported"); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); + +atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */ + #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +92,20 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +115,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK 
; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +187,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -198,16 +251,21 @@ static struct ib_qp *ipoib_cm_create_rx_ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { - .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* For drain WR */ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + if (!priv->cm.srq) { + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; + attr.event_handler = NULL; + } else + attr.event_handler = ipoib_cm_rx_event_handler; return ib_create_qp(priv->pd, &attr); } @@ -282,12 +340,129 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 
0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } + spin_unlock_irq(&priv->lock); +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 qp_num, index; + u64 i, recv_mem_used; + + qp_num = p->qp->qp_num; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the NOSRQ we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + + init_context_and_add_list(cm_id, p, priv); + spin_lock_irq(&priv->lock); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) + * CM_PACKET_SIZE; /* packets are 64K */ + if ((index == max_rc_qp) || + ( recv_mem_used >= max_recv_buf * (1ul << 20))) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "NOSRQ has reached the configurable limit " + "of either %d RC QPs or, max recv buf size of " + "0x%x MB\n", max_rc_qp, max_recv_buf); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. 
+ */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + + priv->cm.rx_index_table[index] = p; + spin_unlock_irq(&priv->lock); + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + if (post_receive_nosrq(dev, i << 32 | index)) { + ipoib_warn(priv, "post_receive_nosrq " + "failed for buf %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + kfree(p->rx_ring); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -298,13 +473,13 @@ static int ipoib_cm_req_handler(struct i ipoib_dbg(priv, "REQ arrived\n"); p = kzalloc(sizeof *p, GFP_KERNEL); - if (!p) + if (!p) { + printk(KERN_WARNING "Failed to allocate RX control block when " + "REQ arrived\n"); return -ENOMEM; + } p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -314,19 +489,20 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (!priv->cm.srq) { + if ((ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn))) + goto err_post_nosrq; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); 
+ if (ret) + goto err_modify; + } - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (priv->cm.srq) { + p->state = IPOIB_CM_RX_LIVE; + init_context_and_add_list(cm_id, p, priv); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -336,6 +512,9 @@ static int ipoib_cm_req_handler(struct i } return 0; +err_post_nosrq: + list_del_init(&p->list); + atomic_dec(&current_rc_qp); err_modify: ib_destroy_qp(p->qp); err_qp: @@ -399,29 +578,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> 0x%x)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -429,23 +639,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=0x%llx vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -457,13 +659,112 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %ld\n", + wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + + ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", + wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ; + + /* This is the 
only place where rx_ptr could be a NULL - could + * have just received a packet from a connection that has become + * stale and so is going away. We will simply drop the packet and + * let the hardware (it s IB_QPT_RC) handle the dropped packet. + * In the timer_check() function below, p->jiffies is updated and + * hence the connection will not be stale after that. + */ + rx_ptr = priv->cm.rx_index_table[index]; + if (unlikely(!rx_ptr)) { + ipoib_warn(priv, "Received packet from a connection " + "that is going away. Hardware will handle it.\n"); + return; + } + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%ld vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. */ + timer_check_nosrq(priv, rx_ptr); + } + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, + rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -483,10 +784,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %ld\n", + wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -678,6 +991,42 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for(i = 0; i < ipoib_recvq_size; ++i) + if(p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + 
kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -692,6 +1041,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -737,6 +1091,7 @@ void ipoib_cm_dev_stop(struct net_device kfree(p); } + cancel_delayed_work(&priv->cm.stale_task); } @@ -815,7 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -855,7 +1212,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1200,6 +1557,9 @@ static void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) { + atomic_dec(&current_rc_qp); + } kfree(p); } } @@ -1218,12 +1578,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + 
list_del_init(&p->list); + priv->cm.rx_index_table[p->index] = NULL; + spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1277,16 +1644,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1303,20 +1694,32 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + if ((ret = ib_query_device(priv->ca, &attr))) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - 
printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + if ((ret = create_srq(dev, priv))) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n"); + return -ENOMEM; + } + + atomic_set(&current_rc_qp, 0); } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1329,17 +1732,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-10 18:30:10.000000000 -0400 @@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d do { n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-10 18:30:10.000000000 -0400 @@ -175,6 +175,15 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; + /* We increase the size of the CQ in the NOSRQ case to prevent CQ + * overflow. 
Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if(!priv->cm.srq) + size += (NOSRQ_INDEX_TABLE_SIZE -1)* ipoib_recvq_size; + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); From rdreier at cisco.com Tue Jul 17 15:41:17 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 15:41:17 -0700 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717221206.GC19243@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jul 2007 01:12:06 +0300") References: <20070717210935.GA17168@mellanox.co.il> <20070717215811.GA19243@mellanox.co.il> <20070717221206.GC19243@mellanox.co.il> Message-ID: > > I think it's easy enough to make the revision of the RPMS be something > > like -0.1.2007-07-17.1 or something like that. > > OK, so you say just ignore the content and stick a date in there? > Fine, that'll work, and we can cover the RCs this way too I think. I just meant to add a revision that encodes the daily build if you want to do daily builds. So you could have libibverbs RPMs with version 1.1.2-0.1.2007-07-17.1 or whatever, and then do 1.1.2-0.2.beta1 and 1.1.2-1 final. - R. From pradeeps at linux.vnet.ibm.com Tue Jul 17 15:43:37 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 17 Jul 2007 15:43:37 -0700 Subject: [ofa-general] IPOIB CM (NOSRQ) extension [PATCH V2] patch Message-ID: <469D4619.2040904@linux.vnet.ibm.com> This patch handles the corner case of running out of RC QPs. In that case it switches to UD mode. This patch can be used both by NOSRQ and SRQ code. This is a resubmission of the previous patch against the 2.6.22 kernel. No changes otherwise. 
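To make the fallback concrete, here is a minimal standalone sketch of the decision the two hunks of this patch appear to implement. It is not the patch itself: neigh_sketch, rej_reason, handle_rej() and choose_tx_path() are hypothetical names invented for illustration, standing in for struct ipoib_neigh, the CM reject reason, the IB_CM_REJ_RECEIVED case and the test in ipoib_start_xmit().

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for illustration only. The real patch works on
 * struct ipoib_neigh, IPOIB_FLAG_OPER_UP and IB_CM_REJ_NO_QP.
 */
enum rej_reason { REJ_NO_QP, REJ_OTHER };
enum tx_path { TX_PATH_CM, TX_PATH_UD };

struct neigh_sketch {
	bool has_cm;	/* a connected-mode context exists (cf. ipoib_cm_get) */
	bool cm_up;	/* the RC connection is established (cf. ipoib_cm_up) */
	bool oper_up;	/* cleared once the peer reports it has no QP left */
};

/* A REJ arrived for this neighbour: only "no QP" disables the CM path. */
static void handle_rej(struct neigh_sketch *n, enum rej_reason reason)
{
	if (reason == REJ_NO_QP)
		n->oper_up = false;
}

/* Use connected mode only if all three conditions hold, otherwise UD. */
static enum tx_path choose_tx_path(const struct neigh_sketch *n)
{
	if (n->has_cm && n->cm_up && n->oper_up)
		return TX_PATH_CM;
	return TX_PATH_UD;
}
```

In this sketch a neighbour that starts with all three flags set keeps using the connected-mode path after an unrelated reject, and drops to UD only once handle_rej() has seen REJ_NO_QP.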
This patch has been tested with linux-2.6.22 derived from Roland's for-2.6.23 git tree on ppc64 machines Signed-off-by: Pradeep Satyanarayana --- --- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 17:56:17.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 17:59:16.000000000 -0400 @@ -1372,8 +1372,18 @@ static int ipoib_cm_tx_handler(struct ib ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); break; - case IB_CM_REQ_ERROR: case IB_CM_REJ_RECEIVED: + ipoib_warn(priv, "REJ received\n"); + spin_lock(&priv->lock); + neigh = tx->neigh; + spin_unlock(&priv->lock); + + if ((neigh) && (event->param.rej_rcvd.reason == + IB_CM_REJ_NO_QP)) { + clear_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags); + break; + } + case IB_CM_REQ_ERROR: case IB_CM_TIMEWAIT_EXIT: ipoib_dbg(priv, "CM error %d.\n", event->event); spin_lock_irq(&priv->tx_lock); --- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-17 17:59:16.000000000 -0400 @@ -679,11 +679,10 @@ static int ipoib_start_xmit(struct sk_bu neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (ipoib_cm_get(neigh)) { - if (ipoib_cm_up(neigh)) { + if (ipoib_cm_get(neigh) && ipoib_cm_up(neigh) && + test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) { ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); goto out; - } } else if (neigh->ah) { if (unlikely(memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, From rdreier at cisco.com Tue Jul 17 15:44:16 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 15:44:16 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch In-Reply-To: <469D451F.6030706@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Tue, 17 Jul 2007 15:39:27 -0700") References: <469D451F.6030706@linux.vnet.ibm.com> Message-ID: I'll take a closer look later, but please try to find a way to post patches so they don't get line 
wrapped (otherwise I won't be able to apply it, even if we converge on something acceptable). Also please do some basic quality control. Running this patch through scripts/checkpatch.pl shows many many small style problems -- you can ignore the 80 character limit warnings for the most part, but there is plenty of whitespace damage and other stuff to fix. - R. From hal.rosenstock at gmail.com Tue Jul 17 15:50:02 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 17 Jul 2007 15:50:02 -0700 Subject: [ofa-general] Re: [ewg] OFED July 16 meeting summary In-Reply-To: <469D1E7C.7040701@mellanox.co.il> References: <469D1E7C.7040701@mellanox.co.il> Message-ID: Hi Tziporet, On 7/17/07, Tziporet Koren wrote: > > *OFED July 16 meeting summary* > > *1. Merge the OFED 1.2.1 release with OFED 1.2.c release in August. * > > There was a long discussion on pros & cons regarding merging the two > releases. > > Pros: > - Everybody will be focused on the same release > - All user space libs (except for the new libmlx4) are the same > - Reduce QA efforts > > Cons: > - The kernel was changed to 2.6.22 based and this can cause instability. > - Harder to distinguish what are the differences between 1.2 to 1.2.c. > (since its not only few patches) > - 1.2.c release was aimed for ConnectX support only. If we lump the two > releases together it may slow the convergence of this release. > > In addition there is a need to check with IBM and Chelsio, who actually > asked for the 1.2.1 release, if this suites them. > Steve agreed to test 1.2.c to see if its OK with his fixes. > Need a respond from IBM too. (BTW - no patches from IBM were sent so far.) > > Decision: No decision was taken. > I suggest we stay with two different branches for now. > After more people will test 1.2.c and see if its stable enough we can > decide not to do 1.2.1 > > *2. 
Agree on OFED 1.3 schedule: > *The suggested schedule:* > * * Feature freeze - Sep 4 > * Alpha release - Sep 10 > * Beta release - Sep 25 > * RC1 - Oct 16 > * RC2 - Oct 30 > * RC3 - Nov 8 (assuming many of us are at SC07 on the week of Nov > 11) > * RC4 - Nov 20 > * GA release - Nov 30 (or first week of Dec) > > Discussion: > - Due to the 1.2.c release the schedule seems very tight. > - Since 1.2.c progress only the kernel, many user level features that are > already done are not exposed to customers in OFED release. > > Decision: Revisit the schedule on September according to the "must have" > features readiness. > > *3. Review OFED 1.3 features list: > * > > There was an agreement on the must have features, except QoS that should > be defined after IBTA SPEC is published > We have not reviewed the list of features thoroughly. Each company should > review the features and send comments to the list. > > Must have general features: > ==================== > > - Kernel base on 2.6.23 (all new features that will be part of this > kernel will be included in OFED 1.3) > - Install: > - Break the packages RPMs (work with Novell and Redhat) to > minimize integration effort into OS distribution > - Package: > - Sources arrangement for the end user (for the labs) > - New HCAs & RNICs: > - ConnectX support > - Neteffect support > - QoS: OSM, CM, CMA, ULPs (IPoIB, SDP, SRP) > > Other features (must have marked with *) > ============================== > > - libibverbs: New verbs: > - Scalable Reliable Connected Transport (with Mellanox > ConnectX)* > - Reliable Multicast? > > ULPs: > > - IPoIB: > - Performance improvements (those that will be stable on time) > - NAPI - done > - SDP: > - * Keepalive > - * AIO > - uDAPL: > - DAT 2.0 support with IB extensions for immediate data, > atomics; > - Add extensions for new verbs (SRCT,RM) > - VNIC: > - GA quality. Not a technology preview version anymore. 
> - Added support for QLogic EVIC (10 Gbps > Infiniband-to-Ethernet gateway) - in GA > - RDS: RDMA API (using FMRs); GA quality with Oracle 11 > - NFSoRDMA integration - pending we have a maintainer > - Management: > - * Multiple partitions via libibumad > - OpenSM > - More routing performance improvements - done > - Even more speedups - done > - Better packaging/installation - done > - "Native" daemon mode - done > - * Performance management > - * Quality of Service manager: Based on IBTA annex > - Enhancements for fat tree routing (non pure tree > support) - done > - More console commands and telnet access to console - > done > - More diagnostics > - ibidsverify.pl: validate LIDs and GUIDs in subnet - > done > - Updated ibnetdiscover format with link width and > speed, and GUIDs - done > - ibnetdiscover grouping support for new Voltaire > chassis - done > - diag updates for IB router support - done > - iblinkinfo.pl: Support peer port link width and speed > validation - done > - ibdatacounters: Add script and man page for subnet > wide data counters saquery enhancements - done > > What happened to ibsim ? It was on the list I sent. Is there any reason it can't be included ? Thanks. -- Hal iWARP: > > > - > - * Chelsio: Get to GA level > - NetEffect: Get the drivers into OFED > > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dledford at redhat.com Tue Jul 17 16:09:27 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 23:09:27 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717202730.GA15990@mellanox.co.il> References: <20070717152546.GA6863@mellanox.co.il> <1184689249.5165.419.camel@firewall.xsintricity.com> <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> Message-ID: <1184713767.5165.547.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 23:27 +0300, Michael S. Tsirkin wrote: > > So you need to be able to > > tell the difference between a customer running libibverbs-1.0.4 from > > OFED-1.3-beta1 and libibverbs-1.0.4 from OFED-1.3 final. > > I don't really think we want customers to run beta code, or intend to support > such configurations. It's not so much whether you want them to or not. If you make it available, some of them *will* run it. Not necessarily in production, but still run it none the less. You need to be able to tell the difference. And they need to be able to tell the difference. What if they installed it to test and provide feedback to you, but then because the version numbers weren't distinct, and they forgot just which machines they put it in, *they* no longer knew which was the beta/rc code or the final release? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Tue Jul 17 16:11:47 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 23:11:47 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070717210935.GA17168@mellanox.co.il> References: <20070717162731.GA7479@mellanox.co.il> <1184690380.5165.430.camel@firewall.xsintricity.com> <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> Message-ID: <1184713907.5165.549.camel@firewall.xsintricity.com> On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote: > > Quoting Roland Dreier : > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > I don't really think we want customers to run beta code > > > > What's the point of a beta then?? > > Donnu. > In previous OFED releases, we had "release candidates" rather than "beta". > Openfabrics members were running RCs and reporting issues on the list and in > bugzilla. Do you really ask your customers to do this for you? Sure, as much as possible. I generally don't recommend using it in production, but just as close as they can get to production is fine with me. The more issues they find while I'm still actually working on it and making new revisions, the less issues they'll find after I stupidly think I'm done. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From pradeeps at linux.vnet.ibm.com Tue Jul 17 17:01:04 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 17 Jul 2007 17:01:04 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch In-Reply-To: References: <469D451F.6030706@linux.vnet.ibm.com> Message-ID: <469D5840.5040802@linux.vnet.ibm.com> Roland Dreier wrote: > I'll take a closer look later, but please try to find a way to post > patches so they don't get line wrapped (otherwise I won't be able to > apply it, even if we converge on something acceptable). > As far as I know there should be no line wrap issues any more. Do you still see it? > Also please do some basic quality control. Running this patch through > scripts/checkpatch.pl shows many many small style problems -- you can > ignore the 80 character limit warnings for the most part, but there is > plenty of whitespace damage and other stuff to fix. > Will do. Pradeep From rdreier at cisco.com Tue Jul 17 18:08:07 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 18:08:07 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch In-Reply-To: <469D5840.5040802@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Tue, 17 Jul 2007 17:01:04 -0700") References: <469D451F.6030706@linux.vnet.ibm.com> <469D5840.5040802@linux.vnet.ibm.com> Message-ID: > As far as I know there should be no line wrap issues any more. Do you > still see it? 
Yes, eg: -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) I think the problem is the "flowed" in: > Content-Type: text/plain; charset=ISO-8859-1; format=flowed I see > User-Agent: Thunderbird 2.0.0.4 (Windows/20070604) and I think some people have managed to use thunderbird to send non-mangled patches, so you should be able to find some documentation. From mst at dev.mellanox.co.il Tue Jul 17 19:18:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jul 2007 05:18:54 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184713907.5165.549.camel@firewall.xsintricity.com> References: <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> Message-ID: <20070718021854.GD19243@mellanox.co.il> > Quoting Doug Ledford : > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote: > > > Quoting Roland Dreier : > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > > > I don't really think we want customers to run beta code > > > > > > What's the point of a beta then?? > > > > Donnu. > > In previous OFED releases, we had "release candidates" rather than "beta". > > Openfabrics members were running RCs and reporting issues on the list and in > > bugzilla. Do you really ask your customers to do this for you? > > Sure, as much as possible. I generally don't recommend using it in > production, but just as close as they can get to production is fine with > me. 
The more issues they find while I'm still actually working on it > and making new revisions, the less issues they'll find after I stupidly > think I'm done. So, Roland's idea of sticking a date in RPM revision will work, won't it? -- MST From dledford at redhat.com Tue Jul 17 19:49:24 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 18 Jul 2007 02:49:24 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070718021854.GD19243@mellanox.co.il> References: <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> <20070718021854.GD19243@mellanox.co.il> Message-ID: <1184726964.5165.552.camel@firewall.xsintricity.com> On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote: > > Quoting Doug Ledford : > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote: > > > > Quoting Roland Dreier : > > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > > > > > I don't really think we want customers to run beta code > > > > > > > > What's the point of a beta then?? > > > > > > Donnu. > > > In previous OFED releases, we had "release candidates" rather than "beta". > > > Openfabrics members were running RCs and reporting issues on the list and in > > > bugzilla. Do you really ask your customers to do this for you? > > > > Sure, as much as possible. I generally don't recommend using it in > > production, but just as close as they can get to production is fine with > > me.
The more issues they find while I'm still actually working on it > > and making new revisions, the less issues they'll find after I stupidly > > think I'm done. > > So, Roland's idea of sticking a date in RPM revision will work, won't it? As long as you don't do two package builds on the same day. That's why my script encodes both an increasing number and the date into the revision. For reference, I'll attach the updated script I made for spitting out a buildable tarball. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: make.dist Type: application/x-shellscript Size: 5272 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Tue Jul 17 19:58:28 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 18 Jul 2007 02:58:28 +0000 Subject: [ofa-general] RFC OFED-1.3 installation In-Reply-To: <1184688250.10172.8.camel@localhost> References: <1184688250.10172.8.camel@localhost> Message-ID: <1184727508.5165.559.camel@firewall.xsintricity.com> On Tue, 2007-07-17 at 19:04 +0300, Sasha Khapyorsky wrote: > Hi, > > On Mon, 2007-07-16 at 12:32 -0700, Shirley Ma wrote: > > Is ib-utils depends on opensm-libs? If so I would suggest to change > > opensm-libs as libsmutils. Otherwise ib-utils won't work without > > installing opensm package. Does this make sense? > > Not whole opensm, but opensm-libs. Why the name ("opensm-libs" or > "libsmutils") is matter? It doesn't. In the case of opensm, opensm requires opensm-libs, so it's perfectly acceptable to install opensm-libs without opensm as there is no requirement on opensm from opensm-libs.
Generally, it's standard practice that when you have something that's primarily an app, but happens to provide libs that *can* be utilized by other apps, then the naming is , -libs, -devel. Only when you have a package that is primarily a library and any apps in the package are demo/test/example apps that don't serve a useful purpose outside of the scope of the library do you name the packages lib, lib-devel, and put the apps in lib-utils. In addition, it is generally frowned upon to ship any static libraries that customers might link against, but if you find that's truly necessary, then it is preferred that the static libs be in a separate -static package. This way customers must intentionally install the package to be able to link statically, so it won't happen by accident. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From or.gerlitz at gmail.com Tue Jul 17 20:20:15 2007 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 18 Jul 2007 06:20:15 +0300 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <4696D1F3.2040507@ichips.intel.com> References: <4696D1F3.2040507@ichips.intel.com> Message-ID: <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> On 7/13/07, Sean Hefty wrote: > > > - Take a look at Sean's local SA caching patches. I merged > > everything else from Sean's tree, but I'm still undecided about > > these. I haven't read them carefully yet, but even aside from that > > I don't have a good feeling about whether there's consensus about > > this yet. Any opinions about merging, for or against, would be > > appreciated here. > > But to be fair, it will be difficult to enable both QoS and local PR > caching. 
To me, this would be the strongest reason against using it. > However, QoS places additional burden on the SA, which will make scaling > even more challenging. My understanding is that the local SA does a path-query where all the fields except for the SGID are wildcard-ed. This means we expect the result to be a table of all the paths from this port to every other port on the fabrics for every pkey which this port is a member of etc, correct? How do you plug here the QoS concept of SID in the path query? Are you expecting the SA to realize what are all the services for which this port is a "member"? Does the proposed definition for QoS management at the SA define "services per gids"? Isn't it "what SL to use per Service"? Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Tue Jul 17 20:23:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 20:23:38 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> (Or Gerlitz's message of "Wed, 18 Jul 2007 06:20:15 +0300") References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> Message-ID: > > But to be fair, it will be difficult to enable both QoS and local PR > > caching. To me, this would be the strongest reason against using it. > > However, QoS places additional burden on the SA, which will make scaling > > even more challenging. > > My understanding is that the local SA does a path-query where all the fields > except for the SGID are wildcard-ed. This means we expect the result to be a > table of all the paths from this port to every other port on the fabrics for > every pkey which this port is a member of etc, correct? > > How do you plug here the QoS concept of SID in the path query? Are you > expecting the SA to realize what are all the services for which this port is > a "member"?
Does the proposed definition for QoS management at the SA > define "services per gids"? Isn't it "what SL to use per Service"? Or, thanks for rescuing this post. I think this is an important question. If we merge the local SA stuff, then are we creating a problem for dealing with QoS? Are we going to have to revert the local SA stuff once the QoS stuff is available? Or is there at least a sketch of a plan on how to handle this? - R. From dledford at redhat.com Tue Jul 17 20:30:15 2007 From: dledford at redhat.com (Doug Ledford) Date: Tue, 17 Jul 2007 23:30:15 -0400 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070718021854.GD19243@mellanox.co.il> References: <20070717164500.GB7479@mellanox.co.il> <1184691962.5165.450.camel@firewall.xsintricity.com> <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> <20070718021854.GD19243@mellanox.co.il> Message-ID: <1184729415.5165.570.camel@firewall.xsintricity.com> On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote: > > Quoting Doug Ledford : > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote: > > > > Quoting Roland Dreier : > > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > > > > > I don't really think we want customers to run beta code > > > > > > > > What's the point of a beta then?? > > > > > > Donnu. > > > In previous OFED releases, we had "release candidates" rather than "beta". > > > Openfabrics members were running RCs and reporting issues on the list and in > > > bugzilla. Do you really ask your customers to do this for you? > > > > Sure, as much as possible.
I generally don't recommend using it in > > production, but just as close as they can get to production is fine with > > me. The more issues they find while I'm still actually working on it > > and making new revisions, the less issues they'll find after I stupidly > > think I'm done. > > So, Roland's idea of sticking a date in RPM revision will work, won't it? As long as you don't do two package builds on the same day. That's why my script encodes both an increasing number and the date into the revision. For reference, I'll attach the updated script I made for spitting out a buildable tarball. Hehehe...resending because the ofa list server ate my message due to the script attachment :-D I'll inline it instead. I guess I'll also mention that this script exists in my ~/repos/upstream directory, and also in that directory are all the git repos that I have cloned from ofa (as well as other places). So, it's one level above all the various git clones and spits everything out into dist/. The easiest way to use this script for any given package you want to create a daily snapshot of is to run ./make.dist repodir daily; scp dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads. That simple action would (assuming you create a reasonable reponame.spec.in file in the repos that are missing one) spit out a tarball that can be passed directly to rpmbuild --rebuild reponame-git.tgz and rpm will spit out the packages, and the repodir-daily.HEAD file shows the HEAD of the git repo so you know exactly what state the tarball represents and you can always get to it in another more recent repo by just updating to that commit as head of tree.

#!/bin/bash

usage() {
	echo "$0 repo daily | release [ signed | ]"
	echo
	echo " You must specify the repo to make a distribution tarball in. This"
	echo "script will not work with complex repos like the management repo that"
	echo "builds more than one package. It expects a repo to be a single package"
	echo "repo where the directory name and the package name are the same, and"
	echo "where a properly formatted reponame.spec.in file exists."
	echo
	echo " You must specify either release or daily in order for this script"
	echo "to make tarballs. If this is a daily release, the tarballs will"
	echo "be named -git.tgz and will overwrite existing tarballs."
	echo "If this is a release build, then the tarball will be named"
	echo "-.tgz and must be a new file. In addition,"
	echo "the script will add a new set of symbolic tags to the git repo"
	echo "that correspond to the - of each tarball."
	echo
	echo " If the script detects that the tag on any component already exists,"
	echo "it will abort the release and prompt you to update the version on"
	echo "the already tagged component. This enforces the proper behavior of"
	echo "treating any released tarball as set in stone so that in the future"
	echo "you will always be able to get to any given release tarball by"
	echo "checking out the git tag and know with certainty that it is the same"
	echo "code as released before even if you no longer have the same tarball"
	echo "around."
	echo
	echo " As part of this process, the script will parse the .spec.in"
	echo "file and output a .spec file. Since this script isn't smart"
	echo "enough to deal with other random changes that should have their own"
	echo "checkin the script will refuse to run if the current repo state is not"
	echo "clean."
	echo
	echo " NOTE: the script has no clue if you are tagging on the right branch,"
	echo "it will however show you the git branch output so you can confirm it"
	echo "is on the right branch before proceeding with the release."
	echo
	echo " In addition to just tagging the git repo, whenever creating a release"
	echo "there is an optional argument of either signed or a hex gpg key-id."
	echo "If you do not pass an argument to release, then the tag will be a"
	echo "simple git annotated tag. If you pass signed as the argument, the"
	echo "git tag operation will use your default signing key to sign the tag."
	echo "Or you can pass an actual gpg key id in hex format and git will sign"
	echo "the tag with that key."
	echo
}

if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi
if [ ! -d "$1" ]; then usage; exit 1; fi

TMPDIR=dist

if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi

if [ "$2" = "daily" -o "$2" = "release" ]; then
	if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
		touch $TMPDIR/$1-$2.HEAD
	fi
	NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
else
	usage
	exit 1
fi

cd "$1"
echo "Updating git repo..."
git pull
RESULT=$?
HEAD=`git log --pretty=oneline -1`
if [ "$RESULT" -ne 0 ]; then
	echo "Failed to update the git repo cleanly, manual intervention required"
	exit 1
fi
if [ "$HEAD" = "$NEWHEAD" ]; then
	echo "No new commits since last tarball creation, nothing to do."
	cd ..
	exit 0
fi

# Package version from configure.in; both the daily and the release
# paths need it for the staging directory name and the spec file
VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'`

if [ "$2" = "release" ]; then
	# Is the repo clean?
	git status | grep modified > /dev/null 2>&1
	if [ $? = 0 ]; then
		echo "There are modified files in the repo. Please check any"
		echo "changes in before proceeding."
		exit 4
	fi
	# Since we will be tagging things, make sure we are on the right
	# branch
	git branch
	echo -n "Is the active branch the right one to tag this release on [y/N]? "
	read answer
	if [ "$answer" = y -o "$answer" = Y ]; then
		echo "Proceeding..."
	else
		echo "Please check out the right branch and run make.dist again"
		exit 0
	fi
	# Check versions to make sure that we can proceed
	TARBALL=$1-$VERSION.tgz
	if [ -f ../$TMPDIR/$TARBALL ]; then
		echo "Target $TARBALL already exists, please update the version of"
		echo "$1"
		exit 2
	fi
	if [ ! -z "`git tag -l $1-$VERSION`" ]; then
		echo "A git tag already exists for $1-$VERSION. Please change the version"
		echo "of $1 so a tag replacement won't occur."
		exit 3
	fi
	# On a real release, this resets the daily release starting point, on the
	# assumption that any new daily builds will have a version number that is
	# incrementally higher than the last officially released tarball.
	RELEASE=1
	echo $RELEASE > ../$TMPDIR/$1.release
else
	DATE=`date +%Y%m%d`
	if [ -f ../$TMPDIR/$1.release ]; then
		RELEASE=`cat ../$TMPDIR/$1.release`
		RELEASE=`expr $RELEASE + 1`
	else
		RELEASE=1
	fi
	echo $RELEASE > ../$TMPDIR/$1.release
	RELEASE=0.${RELEASE}.${DATE}git
	TARBALL=$1-git.tgz
fi
cd ..

cp -a $1 $1-$VERSION
[ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec
if [ -f $1-$VERSION/autogen.sh ]; then
	cd $1-$VERSION
	./autogen.sh
	cd ..
fi
echo "Creating $TMPDIR/$TARBALL"
tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION
rm -rf $1-$VERSION
echo "$HEAD" > $TMPDIR/$1-$2.HEAD

if [ $2 = release ]; then
	echo "Tagging release."
	cd $1
	if [ ! -z "$3" ]; then
		if [ $3 = "signed" ]; then
			git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
		else
			git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
		fi
	else
		git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION
	fi
	cd ..
fi

-- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From rdreier at cisco.com Tue Jul 17 20:31:15 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:31:15 -0700
Subject: [ofa-general] [PATCH] IB/CM: Remove local write permission enable in QP access flags
In-Reply-To: <200707171758.57442.dotanb@dev.mellanox.co.il> (Dotan Barak's message of "Tue, 17 Jul 2007 17:58:57 +0300")
References: <200707171758.57442.dotanb@dev.mellanox.co.il>
Message-ID:

thanks, applied

From rdreier at cisco.com Tue Jul 17 20:41:02 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 17 Jul 2007 20:41:02 -0700
Subject: [ofa-general] Re: [PATCH v2] mlx4: add device reset to error handling mechanism
In-Reply-To: <200707121750.45629.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 12 Jul 2007 17:50:45 +0300")
References: <200707121750.45629.jackm@dev.mellanox.co.il>
Message-ID:

thanks, applied as below with quite a few changes:

 - I was wrong to suggest round_jiffies_relative() -- we really just
   want round_jiffies()
 - I don't think the "stop" variable is needed at all --
   del_timer_sync() should be safe without it (and yes this cleanup
   applies to mthca as well)
 - Don't start polling if the ioremap fails, it will obviously cause
   an instant oops

 - R.

commit ee49bd9397cd2b8fe7a1962505d81c1d0a1366fc
Author: Jack Morgenstein
Date: Thu Jul 12 17:50:45 2007 +0300

mlx4_core: Reset device when internal error is detected

Reset the device when an internal error is detected. Also, detect
errors by polling the error buffer rather than using interrupts. This
is more robust and doesn't depend on MSI-X. Remove the old interrupt
handler entirely, since we don't want to support two mechanisms for
detecting internal errors.

Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c index 1bb088a..6b32ec9 100644 --- a/drivers/net/mlx4/catas.c +++ b/drivers/net/mlx4/catas.c @@ -30,41 +30,133 @@ * SOFTWARE. */ +#include + #include "mlx4.h" -void mlx4_handle_catas_err(struct mlx4_dev *dev) +enum { + MLX4_CATAS_POLL_INTERVAL = 5 * HZ, +}; + +static DEFINE_SPINLOCK(catas_lock); + +static LIST_HEAD(catas_list); +static struct workqueue_struct *catas_wq; +static struct work_struct catas_work; + +static int internal_err_reset = 1; +module_param(internal_err_reset, int, 0644); +MODULE_PARM_DESC(internal_err_reset, + "Reset device on internal errors if non-zero (default 1)"); + +static void dump_err_buf(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); int i; - mlx4_err(dev, "Catastrophic error detected:\n"); + mlx4_err(dev, "Internal error detected:\n"); for (i = 0; i < priv->fw.catas_size; ++i) mlx4_err(dev, " buf[%02x]: %08x\n", i, swab32(readl(priv->catas_err.map + i))); +} - mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0); +static void poll_catas(unsigned long dev_ptr) +{ + struct mlx4_dev *dev = (struct mlx4_dev *) dev_ptr; + struct mlx4_priv *priv = mlx4_priv(dev); + + if (readl(priv->catas_err.map)) { + dump_err_buf(dev); + + mlx4_dispatch_event(dev, MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR, 0, 0); + + if (internal_err_reset) { + spin_lock(&catas_lock); + list_add(&priv->catas_err.list, &catas_list); + spin_unlock(&catas_lock); + + queue_work(catas_wq, &catas_work); + } + } else + mod_timer(&priv->catas_err.timer, + round_jiffies(jiffies + MLX4_CATAS_POLL_INTERVAL)); } -void mlx4_map_catas_buf(struct mlx4_dev *dev) +static void catas_reset(struct work_struct *work) +{ + struct mlx4_priv *priv, *tmppriv; + struct mlx4_dev *dev; + + LIST_HEAD(tlist); + int ret; + + spin_lock_irq(&catas_lock); + list_splice_init(&catas_list, &tlist); + spin_unlock_irq(&catas_lock); + + list_for_each_entry_safe(priv, 
tmppriv, &tlist, catas_err.list) { + ret = mlx4_restart_one(priv->dev.pdev); + dev = &priv->dev; + if (ret) + mlx4_err(dev, "Reset failed (%d)\n", ret); + else + mlx4_dbg(dev, "Reset succeeded\n"); + } +} + +void mlx4_start_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); unsigned long addr; + INIT_LIST_HEAD(&priv->catas_err.list); + init_timer(&priv->catas_err.timer); + priv->catas_err.map = NULL; + addr = pci_resource_start(dev->pdev, priv->fw.catas_bar) + priv->fw.catas_offset; priv->catas_err.map = ioremap(addr, priv->fw.catas_size * 4); - if (!priv->catas_err.map) - mlx4_warn(dev, "Failed to map catastrophic error buffer at 0x%lx\n", + if (!priv->catas_err.map) { + mlx4_warn(dev, "Failed to map internal error buffer at 0x%lx\n", addr); + return; + } + priv->catas_err.timer.data = (unsigned long) dev; + priv->catas_err.timer.function = poll_catas; + priv->catas_err.timer.expires = + round_jiffies(jiffies + MLX4_CATAS_POLL_INTERVAL); + add_timer(&priv->catas_err.timer); } -void mlx4_unmap_catas_buf(struct mlx4_dev *dev) +void mlx4_stop_catas_poll(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + del_timer_sync(&priv->catas_err.timer); + if (priv->catas_err.map) iounmap(priv->catas_err.map); + + spin_lock_irq(&catas_lock); + list_del(&priv->catas_err.list); + spin_unlock_irq(&catas_lock); +} + +int __init mlx4_catas_init(void) +{ + INIT_WORK(&catas_work, catas_reset); + + catas_wq = create_singlethread_workqueue("mlx4_err"); + if (!catas_wq) + return -ENOMEM; + + return 0; +} + +void mlx4_catas_cleanup(void) +{ + destroy_workqueue(catas_wq); } diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index 27a82ce..2095c84 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -89,14 +89,12 @@ struct mlx4_eq_context { (1ull << MLX4_EVENT_TYPE_PATH_MIG_FAILED) | \ (1ull << MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ (1ull << MLX4_EVENT_TYPE_WQ_ACCESS_ERROR) | \ - (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ 
(1ull << MLX4_EVENT_TYPE_PORT_CHANGE) | \ (1ull << MLX4_EVENT_TYPE_ECC_DETECT) | \ (1ull << MLX4_EVENT_TYPE_SRQ_CATAS_ERROR) | \ (1ull << MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE) | \ (1ull << MLX4_EVENT_TYPE_SRQ_LIMIT) | \ (1ull << MLX4_EVENT_TYPE_CMD)) -#define MLX4_CATAS_EVENT_MASK (1ull << MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR) struct mlx4_eqe { u8 reserved1; @@ -264,7 +262,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -281,14 +279,6 @@ static irqreturn_t mlx4_msi_x_interrupt(int irq, void *eq_ptr) return IRQ_HANDLED; } -static irqreturn_t mlx4_catas_interrupt(int irq, void *dev_ptr) -{ - mlx4_handle_catas_err(dev_ptr); - - /* MSI-X vectors always belong to us */ - return IRQ_HANDLED; -} - static int mlx4_MAP_EQ(struct mlx4_dev *dev, u64 event_mask, int unmap, int eq_num) { @@ -490,11 +480,9 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); - if (eq_table->eq[MLX4_EQ_CATAS].have_irq) - free_irq(eq_table->eq[MLX4_EQ_CATAS].irq, dev); } static int __devinit mlx4_map_clr_int(struct mlx4_dev *dev) @@ -598,32 +586,19 @@ int __devinit mlx4_init_eq_table(struct mlx4_dev *dev) if (dev->flags & MLX4_FLAG_MSI_X) { static const char *eq_name[] = { [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)", - [MLX4_EQ_CATAS] = DRV_NAME " (catas)" + [MLX4_EQ_ASYNC] = DRV_NAME " (async)" }; - err = mlx4_create_eq(dev, 1, MLX4_EQ_CATAS, - &priv->eq_table.eq[MLX4_EQ_CATAS]); - if (err) - goto err_out_async; - - for (i = 0; i < MLX4_EQ_CATAS; ++i) { + for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, 
mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); if (err) - goto err_out_catas; + goto err_out_async; priv->eq_table.eq[i].have_irq = 1; } - err = request_irq(priv->eq_table.eq[MLX4_EQ_CATAS].irq, - mlx4_catas_interrupt, 0, - eq_name[MLX4_EQ_CATAS], dev); - if (err) - goto err_out_catas; - - priv->eq_table.eq[MLX4_EQ_CATAS].have_irq = 1; } else { err = request_irq(dev->pdev->irq, mlx4_interrupt, IRQF_SHARED, DRV_NAME, dev); @@ -639,22 +614,11 @@ int __devinit mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); - if (dev->flags & MLX4_FLAG_MSI_X) { - err = mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 0, - priv->eq_table.eq[MLX4_EQ_CATAS].eqn); - if (err) - mlx4_warn(dev, "MAP_EQ for catas EQ %d failed (%d)\n", - priv->eq_table.eq[MLX4_EQ_CATAS].eqn, err); - } - return 0; -err_out_catas: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); - err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); @@ -675,19 +639,13 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); int i; - if (dev->flags & MLX4_FLAG_MSI_X) - mlx4_MAP_EQ(dev, MLX4_CATAS_EVENT_MASK, 1, - priv->eq_table.eq[MLX4_EQ_CATAS].eqn); - mlx4_MAP_EQ(dev, MLX4_ASYNC_EVENT_MASK, 1, priv->eq_table.eq[MLX4_EQ_ASYNC].eqn); mlx4_free_irqs(dev); - for (i = 0; i < MLX4_EQ_CATAS; ++i) + for (i = 0; i < MLX4_NUM_EQ; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); - if (dev->flags & MLX4_FLAG_MSI_X) - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_CATAS]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c index 9ae951b..be5d9e9 100644 --- a/drivers/net/mlx4/intf.c +++ b/drivers/net/mlx4/intf.c @@ -142,6 +142,7 @@ int mlx4_register_device(struct mlx4_dev *dev) mlx4_add_device(intf, priv); mutex_unlock(&intf_mutex); + 
mlx4_start_catas_poll(dev); return 0; } @@ -151,6 +152,7 @@ void mlx4_unregister_device(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_interface *intf; + mlx4_stop_catas_poll(dev); mutex_lock(&intf_mutex); list_for_each_entry(intf, &intf_list, list) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a4f2e04..e8f45e6 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -583,13 +583,11 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev) goto err_pd_table_free; } - mlx4_map_catas_buf(dev); - err = mlx4_init_eq_table(dev); if (err) { mlx4_err(dev, "Failed to initialize " "event queue table, aborting.\n"); - goto err_catas_buf; + goto err_mr_table_free; } err = mlx4_cmd_use_events(dev); @@ -659,8 +657,7 @@ err_cmd_poll: err_eq_table_free: mlx4_cleanup_eq_table(dev); -err_catas_buf: - mlx4_unmap_catas_buf(dev); +err_mr_table_free: mlx4_cleanup_mr_table(dev); err_pd_table_free: @@ -836,9 +833,6 @@ err_cleanup: mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); mlx4_cleanup_uar_table(dev); @@ -885,9 +879,6 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev) mlx4_cleanup_cq_table(dev); mlx4_cmd_use_polling(dev); mlx4_cleanup_eq_table(dev); - - mlx4_unmap_catas_buf(dev); - mlx4_cleanup_mr_table(dev); mlx4_cleanup_pd_table(dev); @@ -908,6 +899,12 @@ static void __devexit mlx4_remove_one(struct pci_dev *pdev) } } +int mlx4_restart_one(struct pci_dev *pdev) +{ + mlx4_remove_one(pdev); + return mlx4_init_one(pdev, NULL); +} + static struct pci_device_id mlx4_pci_table[] = { { PCI_VDEVICE(MELLANOX, 0x6340) }, /* MT25408 "Hermon" SDR */ { PCI_VDEVICE(MELLANOX, 0x634a) }, /* MT25408 "Hermon" DDR */ @@ -930,6 +927,10 @@ static int __init mlx4_init(void) { int ret; + ret = mlx4_catas_init(); + if (ret) + return ret; + ret = pci_register_driver(&mlx4_driver); return ret < 0 ? 
ret : 0; } @@ -937,6 +938,7 @@ static int __init mlx4_init(void) static void __exit mlx4_cleanup(void) { pci_unregister_driver(&mlx4_driver); + mlx4_catas_cleanup(); } module_init(mlx4_init); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index d9c91a7..be304a7 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -67,7 +68,6 @@ enum { enum { MLX4_EQ_ASYNC, MLX4_EQ_COMP, - MLX4_EQ_CATAS, MLX4_NUM_EQ }; @@ -248,7 +248,8 @@ struct mlx4_mcg_table { struct mlx4_catas_err { u32 __iomem *map; - int size; + struct timer_list timer; + struct list_head list; }; struct mlx4_priv { @@ -311,9 +312,11 @@ void mlx4_cleanup_qp_table(struct mlx4_dev *dev); void mlx4_cleanup_srq_table(struct mlx4_dev *dev); void mlx4_cleanup_mcg_table(struct mlx4_dev *dev); -void mlx4_map_catas_buf(struct mlx4_dev *dev); -void mlx4_unmap_catas_buf(struct mlx4_dev *dev); - +void mlx4_start_catas_poll(struct mlx4_dev *dev); +void mlx4_stop_catas_poll(struct mlx4_dev *dev); +int mlx4_catas_init(void); +void mlx4_catas_cleanup(void); +int mlx4_restart_one(struct pci_dev *pdev); int mlx4_register_device(struct mlx4_dev *dev); void mlx4_unregister_device(struct mlx4_dev *dev); void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_event type, From rdreier at cisco.com Tue Jul 17 20:51:38 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 20:51:38 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads per qp In-Reply-To: <200707171311.43680.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 17 Jul 2007 13:11:43 +0300") References: <200707171311.43680.jackm@dev.mellanox.co.il> Message-ID: thanks, applied From rdreier at cisco.com Tue Jul 17 20:59:37 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 20:59:37 -0700 Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp In-Reply-To: 
<200707151058.55805.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 15 Jul 2007 10:58:55 +0300") References: <200706211229.08703.jackm@dev.mellanox.co.il> <200707151058.55805.jackm@dev.mellanox.co.il> Message-ID: OK, I think I'll merge this, since it seems cleaner to me. commit 7f5eb9bb8c7fb3bd411674b856872d7ab4a7b1a3 Author: Roland Dreier Date: Tue Jul 17 20:59:02 2007 -0700 IB/mlx4: Return receive queue sizes for userspace QPs from query QP Return the receive queue sizes for both userspace QPs and kernel Qps (not just kernel QPs) from mlx4_ib_query_qp(). Also zero the send queue sizes for userspace QPs to avoid a possible information leak, and set the max_inline_data for kernel QPs to 0 since inline sends are not supported for kernel QPs. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 0793059..8d09aa3 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1581,17 +1581,25 @@ int mlx4_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr done: qp_attr->cur_qp_state = qp_attr->qp_state; + qp_attr->cap.max_recv_wr = qp->rq.wqe_cnt; + qp_attr->cap.max_recv_sge = qp->rq.max_gs; + if (!ibqp->uobject) { - qp_attr->cap.max_send_wr = qp->sq.wqe_cnt; - qp_attr->cap.max_recv_wr = qp->rq.wqe_cnt; - qp_attr->cap.max_send_sge = qp->sq.max_gs; - qp_attr->cap.max_recv_sge = qp->rq.max_gs; - qp_attr->cap.max_inline_data = (1 << qp->sq.wqe_shift) - - send_wqe_overhead(qp->ibqp.qp_type) - - sizeof (struct mlx4_wqe_inline_seg); - qp_init_attr->cap = qp_attr->cap; + qp_attr->cap.max_send_wr = qp->sq.wqe_cnt; + qp_attr->cap.max_send_sge = qp->sq.max_gs; + } else { + qp_attr->cap.max_send_wr = 0; + qp_attr->cap.max_send_sge = 0; } + /* + * We don't support inline sends for kernel QPs (yet), and we + * don't know what userspace's value should be. 
+ */ + qp_attr->cap.max_inline_data = 0; + + qp_init_attr->cap = qp_attr->cap; + return 0; } From rdreier at cisco.com Tue Jul 17 21:07:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 21:07:51 -0700 Subject: [ofa-general] Re: [PATCH 2 of 2] libmlx4: implement query_qp In-Reply-To: <200706211229.08703.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Thu, 21 Jun 2007 12:29:08 +0300") References: <200706211229.08703.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From mst at dev.mellanox.co.il Tue Jul 17 21:31:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jul 2007 07:31:59 +0300 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184729415.5165.570.camel@firewall.xsintricity.com> References: <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> <20070718021854.GD19243@mellanox.co.il> <1184729415.5165.570.camel@firewall.xsintricity.com> Message-ID: <20070718043159.GA28541@mellanox.co.il> > Quoting Doug Ledford : > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > On Wed, 2007-07-18 at 05:18 +0300, Michael S. Tsirkin wrote: > > > Quoting Doug Ledford : > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > > On Wed, 2007-07-18 at 00:09 +0300, Michael S. Tsirkin wrote: > > > > > Quoting Roland Dreier : > > > > > Subject: Re: [ofa-general] Re: RFC OFED-1.3 installation > > > > > > > > > > > I don't really think we want customers to run beta code > > > > > > > > > > What's the point of a beta then?? > > > > > > > > Donnu. > > > > In previous OFED releases, we had "release candidates" rather than "beta". 
> > > > Openfabrics members were running RCs and reporting issues on the list and in
> > > > bugzilla. Do you really ask your customers to do this for you?
> > >
> > > Sure, as much as possible. I generally don't recommend using it in
> > > production, but just as close as they can get to production is fine with
> > > me. The more issues they find while I'm still actually working on it
> > > and making new revisions, the less issues they'll find after I stupidly
> > > think I'm done.
> >
> > So, Roland's idea of sticking a date in RPM revision will work, won't it?
>
> As long as you don't do two package builds on the same day. That's why
> my script encodes both an increasing number and the date into the
> revision.
>
> For reference, I'll attach the updated script I made for spitting out a
> buildable tarball.
>
> Hehehe...resending because the ofa list server ate my message due to the
> script attachment :-D I'll inline it instead.
>
> I guess I'll also mention that this script exists in my ~/repos/upstream
> directory, and also in that directory are all the git repos that I have
> cloned from ofa (as well as other places). So, it's one level above all
> the various git clones and spits everything out into dist/. The easiest
> way to use this script for any given package you want to create a daily
> snapshot of is to run ./make.dist repodir daily; scp
> dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads. That
> simple action would (assuming you create a reasonable reponame.spec.in
> file in the repos that are missing one) spit out a tarball that can be
> passed directly to rpmbuild --rebuild reponame-git.tgz and rpm will spit
> out the packages, and the repodir-daily.HEAD file shows the HEAD of the
> git repo so you know exactly what state the tarball represents and you
> can always get to it in another more recent repo by just updating to
> that commit as head of tree.

Thanks for the script.
In OFED, since we control the upstream, I think we'll try to do as much as possible at the package level, for example make sure that each package has a reasonable spec file. Some ideas on how we might want to do this below. > #!/bin/bash > > usage() { > echo "$0 repo daily | release [ signed | ]" > echo > echo " You must specify the repo to make a distribution tarball in. This" > echo "script will not work with complex repos like the management repo that" > echo "builds more than one package. It expects a repo to be a single package" > echo "repo where the directory name and the package name are the same, and" > echo "where a properly formatted reponame.spec.in file exists." > echo > echo " You must specify either release or daily in order for this script" > echo "to make tarballs. If this is a daily release, the tarballs will" > echo "be named -git.tgz and will overwrite existing tarballs." > echo "If this is a release build, then the tarball will be named" > echo "-.tgz and must be a new file. In addition," > echo "the script will add a new set of symbolic tags to the git repo" > echo "that correspond to the - of each tarball." > echo > echo " If the script detects that the tag on any component already exists," > echo "it will abort the release and prompt you to update the version on" > echo "the already tagged component. This enforces the proper behavior of" > echo "treating any released tarball as set in stone so that in the future" > echo "you will always be able to get to any given release tarball by" > echo "checking out the git tag and know with certainty that it is the same" > echo "code as released before even if you no longer have the same tarball" > echo "around." > echo > echo " As part of this process, the script will parse the .spec.in" > echo "file and output a .spec file. 
Since this script isn't smart" > echo "enough to deal with other random changes that should have their own" > echo "checkin the script will refuse to run if the current repo state is not" > echo "clean." > echo > echo " NOTE: the script has no clue if you are tagging on the right branch," > echo "it will however show you the git branch output so you can confirm it" > echo "is on the right branch before proceeding with the release." > echo > echo " In addition to just tagging the git repo, whenever creating a release" > echo "there is an optional argument of either signed or a hex gpg key-id." > echo "If you do not pass an argument to release, then the tag will be a" > echo "simple git annotated tag. If you pass signed as the argument, the" > echo "git tag operation will use your default signing key to sign the tag." > echo "Or you can pass an actual gpg key id in hex format and git will sign" > echo "the tag with that key." > echo > } > > if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi > > if [ ! -d "$1" ]; then usage; exit 1; fi > > TMPDIR=dist > if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi > > if [ "$2" = "daily" -o "$2" = "release" ]; then > if [ ! -f $TMPDIR/$1-$2.HEAD ]; then > touch $TMPDIR/$1-$2.HEAD > fi > NEWHEAD=`cat $TMPDIR/$1-$2.HEAD` > else > usage > exit 1 > fi > > cd "$1" > echo "Updating git repo..." > git pull > RESULT=$? > HEAD=`git log --pretty=oneline -1` > > if [ "$RESULT" -ne 0 ]; then > echo "Failed to update the git repo cleanly, manual intervention required" > exit 1 > fi pull really will merge your local modifications with upstream. In OFED we really want just git clone, and use upstream code unmodified. > if [ "$HEAD" = "$NEWHEAD" ]; then > echo "No new commits since last tarball creation, nothing to do." > cd .. > exit 0 > fi > > if [ "$2" = "release" ]; then > # Is the repo clean? > git status | grep modified > /dev/null 2>&1 > if [ $? = 0 ]; then > echo "There are modified files in the repo. 
Please check any" > echo "changes in before proceeding." > exit 4 > fi > # Since we will be tagging things, make sure we are on the right > # branch > git branch > echo -n "Is the active branch the right one to tag this release on [y/N]? " > read answer > if [ "$answer" = y -o "$answer" = Y ]; then > echo "Proceeding..." > else > echo "Please check out the right branch and run make.dist again" > exit 0 > fi See below on what we should do in OFED IMO. > # Check versions to make sure that we can proceed > VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` > TARBALL=$1-$VERSION.tgz > if [ -f ../$TMPDIR/$TARBALL ]; then > echo "Target $TARBALL already exists, please update the version of" > echo "$1" > exit 2 > fi > if [ ! -z "`git tag -l $1-$VERSION`" ]; then > echo "A git tag already exists for $1-$VERSION. Please change the version" > echo "of $1 so a tag replacement won't occur." > exit 3 > fi > # On a real release, this resets the daily release starting point, on the > # assumption that any new daily builds will have a version number that is > # incrementally higher than the last officially released tarball. > RELEASE=1 > echo $RELEASE > ../$TMPDIR/$1.release > else > DATE=`date +%Y%m%d` > if [ -f ../$TMPDIR/$1.release ]; then > RELEASE=`cat ../$TMPDIR/$1.release` > RELEASE=`expr $RELEASE + 1` > else > RELEASE=1 > fi > echo $RELEASE > ../$TMPDIR/$1.release > RELEASE=0.${RELEASE}.${DATE}git > TARBALL=$1-git.tgz > fi > > cd .. > cp -a $1 $1-$VERSION > [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec This, I think, is the bit that we definitely want to reuse. > if [ -f $1-$VERSION/autogen.sh ]; then > cd $1-$VERSION > ./autogen.sh > cd .. I think we will want to call make dist too. 
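
As a rough sketch of what calling make dist at this point could look like (an illustration only, not something from the script as posted): the hypothetical dist_steps function below prints, without running them, the commands for one package directory. It assumes the package uses autoconf/automake, so that ./configure is available after ./autogen.sh and the generated Makefile has the standard dist target. The function name and the print-only behaviour are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical sketch, not part of make.dist: print (do not run) the
# commands that would build a tarball via automake's "dist" target.
# Assumes an autoconf/automake package living in directory $1.
dist_steps() {
    local dir="$1"
    echo "cd $dir"
    # autogen.sh is optional, mirroring the check in the quoted script
    if [ -f "$dir/autogen.sh" ]; then
        echo "./autogen.sh"
    fi
    echo "./configure"
    echo "make dist"
}
```

Printing the steps rather than executing them keeps the sketch safe to try in any checkout; a real version would run each command and stop on the first failure.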
> fi > echo "Creating $TMPDIR/$TARBALL" > tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION > rm -rf $1-$VERSION > echo "$HEAD" > $TMPDIR/$1-$2.HEAD > > if [ $2 = release ]; then > echo "Tagging release." > cd $1 > if [ ! -z "$3" ]; then > if [ $3 = "signed" ]; then > git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > else > git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > fi > else > git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > fi > cd .. > fi This takes whatever's at git head and then tags that. In OFED it is the other way around: maintainers tag the appropriate bits, release script just packages that. -- MST From hal.rosenstock at gmail.com Tue Jul 17 21:40:07 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Jul 2007 00:40:07 -0400 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> Message-ID: On 7/17/07, Roland Dreier wrote: > > > > But to be fair, it will be difficult to enable both QoS and local PR > > > caching. To me, this would be the strongest reason against using it. > > > However, QoS places additional burden on the SA, which will make > scaling > > > even more challenging. > > > > my understanding is that the local sa does a path-query where all the > fields > > except for the SGID are wildcard-ed. This means we expect the result to > be a > > table of all the paths from this port to every other port on the fabrics > for > > every pkey which this port is a member of etc, correct? > > > > How do you plug here the QoS concept of SID in the path query? are you > > expecting the SA to realize what are all the services for which this > port is > > a "member"? does the proposed definision for QoS management at the SA > > defines "services per gids" isn't it "what SL to user per Service"? 
>
> Or, thanks for rescuing this post.
>
> I think this is an important question. If we merge the local SA
> stuff, then are we creating a problem for dealing with QoS? Are we
> going to have to revert the local SA stuff once the QoS stuff is
> available? Or is there at least a sketch of a plan on how to handle
> this?

Is the worst case that local SA cache and QoS on an end node are mutually
exclusive? I think there is a way to shut off the local SA cache.

-- Hal

- R.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kliteyn at mellanox.co.il Tue Jul 17 21:44:34 2007
From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il)
Date: 18 Jul 2007 07:44:34 +0300
Subject: [ofa-general] nightly osm_sim report 2007-07-18:normal completion
Message-ID:

OSM Simulation Regression Summary

[Generated mail - please do NOT reply]

OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]

Total=560 Pass=560 Fail=0

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail
4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From dledford at redhat.com Tue Jul 17 21:56:59 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 18 Jul 2007 04:56:59 +0000 Subject: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <20070718043159.GA28541@mellanox.co.il> References: <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> <20070718021854.GD19243@mellanox.co.il> <1184729415.5165.570.camel@firewall.xsintricity.com> <20070718043159.GA28541@mellanox.co.il> Message-ID: <1184734619.5165.579.camel@firewall.xsintricity.com> On Wed, 2007-07-18 at 07:31 +0300, Michael S. Tsirkin wrote: > Thanks for the script. > In OFED, since we control the upstream, I think we'll try to do > as much as possible at the package level, for example make sure > that each package has a reasonable spec file. > Some ideas on how we might want to do this below. > > > #!/bin/bash > > > > usage() { > > echo "$0 repo daily | release [ signed | ]" > > echo > > echo " You must specify the repo to make a distribution tarball in. This" > > echo "script will not work with complex repos like the management repo that" > > echo "builds more than one package. It expects a repo to be a single package" > > echo "repo where the directory name and the package name are the same, and" > > echo "where a properly formatted reponame.spec.in file exists." > > echo > > echo " You must specify either release or daily in order for this script" > > echo "to make tarballs. If this is a daily release, the tarballs will" > > echo "be named -git.tgz and will overwrite existing tarballs." 
> > echo "If this is a release build, then the tarball will be named" > > echo "-.tgz and must be a new file. In addition," > > echo "the script will add a new set of symbolic tags to the git repo" > > echo "that correspond to the - of each tarball." > > echo > > echo " If the script detects that the tag on any component already exists," > > echo "it will abort the release and prompt you to update the version on" > > echo "the already tagged component. This enforces the proper behavior of" > > echo "treating any released tarball as set in stone so that in the future" > > echo "you will always be able to get to any given release tarball by" > > echo "checking out the git tag and know with certainty that it is the same" > > echo "code as released before even if you no longer have the same tarball" > > echo "around." > > echo > > echo " As part of this process, the script will parse the .spec.in" > > echo "file and output a .spec file. Since this script isn't smart" > > echo "enough to deal with other random changes that should have their own" > > echo "checkin the script will refuse to run if the current repo state is not" > > echo "clean." > > echo > > echo " NOTE: the script has no clue if you are tagging on the right branch," > > echo "it will however show you the git branch output so you can confirm it" > > echo "is on the right branch before proceeding with the release." > > echo > > echo " In addition to just tagging the git repo, whenever creating a release" > > echo "there is an optional argument of either signed or a hex gpg key-id." > > echo "If you do not pass an argument to release, then the tag will be a" > > echo "simple git annotated tag. If you pass signed as the argument, the" > > echo "git tag operation will use your default signing key to sign the tag." > > echo "Or you can pass an actual gpg key id in hex format and git will sign" > > echo "the tag with that key." 
> > echo
> > }
> >
> > if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi
> >
> > if [ ! -d "$1" ]; then usage; exit 1; fi
> >
> > TMPDIR=dist
> > if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi
> >
> > if [ "$2" = "daily" -o "$2" = "release" ]; then
> >     if [ ! -f $TMPDIR/$1-$2.HEAD ]; then
> >         touch $TMPDIR/$1-$2.HEAD
> >     fi
> >     NEWHEAD=`cat $TMPDIR/$1-$2.HEAD`
> > else
> >     usage
> >     exit 1
> > fi
> >
> > cd "$1"
> > echo "Updating git repo..."
> > git pull
> > RESULT=$?
> > HEAD=`git log --pretty=oneline -1`
> >
> > if [ "$RESULT" -ne 0 ]; then
> >     echo "Failed to update the git repo cleanly, manual intervention required"
> >     exit 1
> > fi
>
> pull really will merge your local modifications with upstream.
> In OFED we really want just git clone, and use upstream code unmodified.

That depends on how you have your repos set up. I keep separate repos for tracking upstream and for doing local work. You can run this on either repo, and it will either give you a clean upstream copy or your local copy merged up to date (assuming you are on a branch that merges from upstream, otherwise if you are on a local branch, the pull will update the repo, but not your checked out files).

This script is a little schizophrenic at the moment because it acts like it's both a customer of the git repo and a master of the git repo. In truth, the release part of the script was only ever intended to be used by a maintainer in a clean master repo. The daily part of the script can be used by anyone who wants to spit out quick daily builds. But, if you are a consumer of the repo instead of the maintainer, then for the daily builds you need to update the repo.

So, in short, the daily part is usable by anyone tracking development of any repo and will pull from the upstream repo to keep up to date. The release functionality should only be used by maintainers, and then only in their master repo. Make more sense that way?
> > if [ "$HEAD" = "$NEWHEAD" ]; then > > echo "No new commits since last tarball creation, nothing to do." > > cd .. > > exit 0 > > fi > > > > if [ "$2" = "release" ]; then > > # Is the repo clean? > > git status | grep modified > /dev/null 2>&1 > > if [ $? = 0 ]; then > > echo "There are modified files in the repo. Please check any" > > echo "changes in before proceeding." > > exit 4 > > fi > > # Since we will be tagging things, make sure we are on the right > > # branch > > git branch > > echo -n "Is the active branch the right one to tag this release on [y/N]? " > > read answer > > if [ "$answer" = y -o "$answer" = Y ]; then > > echo "Proceeding..." > > else > > echo "Please check out the right branch and run make.dist again" > > exit 0 > > fi > > See below on what we should do in OFED IMO. > > > # Check versions to make sure that we can proceed > > VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` > > TARBALL=$1-$VERSION.tgz > > if [ -f ../$TMPDIR/$TARBALL ]; then > > echo "Target $TARBALL already exists, please update the version of" > > echo "$1" > > exit 2 > > fi > > if [ ! -z "`git tag -l $1-$VERSION`" ]; then > > echo "A git tag already exists for $1-$VERSION. Please change the version" > > echo "of $1 so a tag replacement won't occur." > > exit 3 > > fi > > # On a real release, this resets the daily release starting point, on the > > # assumption that any new daily builds will have a version number that is > > # incrementally higher than the last officially released tarball. > > RELEASE=1 > > echo $RELEASE > ../$TMPDIR/$1.release > > else > > DATE=`date +%Y%m%d` > > if [ -f ../$TMPDIR/$1.release ]; then > > RELEASE=`cat ../$TMPDIR/$1.release` > > RELEASE=`expr $RELEASE + 1` > > else > > RELEASE=1 > > fi > > echo $RELEASE > ../$TMPDIR/$1.release > > RELEASE=0.${RELEASE}.${DATE}git > > TARBALL=$1-git.tgz > > fi > > > > cd .. 
> > cp -a $1 $1-$VERSION > > [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec > > This, I think, is the bit that we definitely want to reuse. > > > if [ -f $1-$VERSION/autogen.sh ]; then > > cd $1-$VERSION > > ./autogen.sh > > cd .. > > I think we will want to call make dist too. As long as make dist doesn't remove vital files, then sure. Gonna have to run configure to run make dist so you'll have to add both calls. > > fi > > echo "Creating $TMPDIR/$TARBALL" > > tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION > > rm -rf $1-$VERSION > > echo "$HEAD" > $TMPDIR/$1-$2.HEAD > > > > if [ $2 = release ]; then > > echo "Tagging release." > > cd $1 > > if [ ! -z "$3" ]; then > > if [ $3 = "signed" ]; then > > git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > > else > > git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > > fi > > else > > git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > > fi > > cd .. > > fi > > This takes whatever's at git head and then tags that. > In OFED it is the other way around: maintainers tag > the appropriate bits, release script just packages that. Like I said earlier, the release operation is intended to only be used by a maintainer, and in this case it just automates the process of tagging and spitting out a tarball into the same action. And for clarity, it tags the head of whatever branch you are on. So, if you've been working in the ofed_1_2 branch, and made some changes, and are ready for a release, it spits out the tarball and tags the ofed_1_2 branch head as the symbolic tag reponame-version. If you want to work on master and do a release there, then it works similarly. 
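For readers skimming the quoted script, the tagging step described above can be sketched on its own. This is only a minimal illustration, not part of make.dist: the helper names tag_mode_args and release_tag_name are hypothetical, and the sketch assumes the same three cases as the script (no third argument, the word "signed", or a hex gpg key id). The git options it selects (-a, -s, -u) are the ones the quoted script already passes to git tag.

```shell
#!/bin/bash
# Hypothetical helpers, not part of make.dist: they only mirror the
# decision the quoted script makes when tagging a release.

# Pick the `git tag` option from the optional third argument.
tag_mode_args() {
    local signer="$1"
    if [ -z "$signer" ]; then
        echo "-a"              # no argument: plain annotated tag
    elif [ "$signer" = "signed" ]; then
        echo "-s"              # "signed": sign with the default key
    else
        echo "-u $signer"      # anything else is treated as a gpg key id
    fi
}

# Compose the symbolic tag name as reponame-version.
release_tag_name() {
    echo "$1-$2"
}
```

A maintainer-side call would then look roughly like `git tag $(tag_mode_args "$3") -m "..." "$(release_tag_name "$1" "$VERSION")"`, run on whatever branch head is currently checked out.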
-- 
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

From jgunthorpe at obsidianresearch.com Tue Jul 17 22:09:28 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 17 Jul 2007 23:09:28 -0600
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To:
References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com>
Message-ID: <20070718050928.GA3103@obsidianresearch.com>

On Wed, Jul 18, 2007 at 12:40:07AM -0400, Hal Rosenstock wrote:
>> I think this is an important question. If we merge the local SA
>> stuff, then are we creating a problem for dealing with QoS? Are we
>> going to have to revert the local SA stuff once the QoS stuff is
>> available? Or is there at least a sketch of a plan on how to handle
>> this?
>
> Is the worst case that local SA cache and QoS on an end node are mutually
> exclusive ? I think there is a way to shut off the local SA cache.

IMHO, I still think that without some kind of SM/SA sourced invalidation mechanism all client side caching (including the ipoib stuff we have now) is a bad idea. There just isn't any way to maintain coherence.

I think QoS is just a specific case of why. Routers are also likely to cause similar kinds of headaches. There are a bunch of other corner cases even without those two.

It seems to me this would be a lot better as a patch set to let a user space daemon have first dibs at responding to a PR lookup. Then the labs could have a special daemon that worked with the SA in a vendor-specific way to do replication and get some big speed ups.
This should be pretty easy if you use a shared filesystem to distribute a routing database produced by opensm. But I'm not working on this stuff ;) Jason From sean.hefty at intel.com Tue Jul 17 22:26:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 22:26:35 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: Message-ID: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com> >I think this is an important question. If we merge the local SA >stuff, then are we creating a problem for dealing with QoS? Yes - I do believe that merging PR caching and QoS together will be difficult. I don't think the problems are insurmountable, but I can't say that I have a definite solution for how to deal with this. My current thoughts are that the purpose of the cache is to increase SA scalability on large clusters. We've seen issues running MPI, trying to establish all-to-all connections, on our 256 node cluster. (With 4 processes per node, this results in about 500,000+ PR queries hitting the SA.) The SA was swamped with work, and it wasn't trying to enforce QoS requirements across the cluster. I just don't see how an SA that is already having trouble scaling to this number of nodes will be able to perform the additional task of providing QoS across the cluster. It may be that, at least initially, an administrator may need to select between enabling PR caching or QoS. >Are we going to have to revert the local SA stuff once the QoS stuff is >available? In the best case, the local SA will need enhancements added to the base support. In the worst case, a user would have to choose between QoS or PR caching. If all users choose QoS, then it would make sense to remove the local SA. >Or is there at least a sketch of a plan on how to handle this? This is only a rough idea, and it depends on how the QoS is implemented. The idea is to create a local QoS module on each node. The local QoS modules would be programmed with basic QoS information. 
For example, which types of queries to handle locally, versus which ones to forward to the SA. Locally handled queries would return PRs based on some QoS mapping table. (I haven't looked into any details of this.) Ideally, local QoS modules would be programmed by a QoS master. This would require a new vendor-specific protocol, but would allow for a simple distributed QoS manager. We will have a better idea of the issues and possible solutions once the QoS spec is released, and we can hold discussions on it. I will be working more details on QoS enhancements starting in the next couple of weeks. - Sean From rdreier at cisco.com Tue Jul 17 22:39:11 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 17 Jul 2007 22:39:11 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <20070718050928.GA3103@obsidianresearch.com> (Jason Gunthorpe's message of "Tue, 17 Jul 2007 23:09:28 -0600") References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> Message-ID: > IMHO, I still think that without some kind of SM/SA sourced > invalidation mechanism all client side caching (including the ipoib > stuff we have now) is a bad idea. But for IPoIB at least doing a path lookup for every packet is obviously not feasible. And ARP table aging gives a way to recover from stale cached data, eventually at least. In fact this may be a good argument in favor of local SA caching -- by analogy with IPoIB it makes sense to avoid going to the SA too often. From sean.hefty at intel.com Tue Jul 17 23:04:54 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 17 Jul 2007 23:04:54 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... 
In-Reply-To: <20070718050928.GA3103@obsidianresearch.com>
Message-ID: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>

>IMHO, I still think that without some kind of SM/SA sourced
>invalidation mechanism all client side caching (including the ipoib
>stuff we have now) is a bad idea.

These are not foolproof mechanisms, but the SA does have client re-registration and GID in/out of service events that the local SA responds to. Anything beyond that becomes vendor specific.

The local SA exposes the ability for a user space application to force an update of the cache, and leaves the refresh policy up to the user. In our use model, we force a refresh immediately before starting a large MPI job. Nothing precludes a user space daemon from updating the cache at timed intervals, or from communicating with an SA in some vendor defined way to maintain coherency. I'm only trying to provide the kernel framework. (We can debate whether another framework would have been better, and I've held this discussion on the list before...)

I do envision someone creating user space applications to control refreshes and, with local SA extensions, allow pre-loading of the cache, updates to specific paths, etc.

We can gain additional benefits by integrating the local SA tighter with the stack. For example, the CM could update the local SA on path migration events or CM message timeouts. For now, I want to start with a fairly simple framework that's useful and extensible. And, IMO, I don't believe that the cache coherency issues are reason enough alone to prevent merging this patch.

- Sean

From ogerlitz at voltaire.com Tue Jul 17 23:36:17 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:36:17 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To:
References: <1184704987.7702.106.camel@hyperion>
Message-ID: <469DB4E1.6080200@voltaire.com>

Roland Dreier wrote:
> > I would like to see these features moved upstream.
DOE funded this
> > work as part of the items we see needing on our large scale IB
> > deployment (both present and future). So from at least one big customer
> > perspective we see this as useful.
>
> Does your reference to "present deployment" mean you are running this
> code now?

Indeed, my understanding is that the DOE uses an Open MPI device (I think it's called PTE) which is implemented directly over libibverbs, and hence no path queries are issued at all. If this is indeed the case, for them it's more of a "for-the-future" thing.

Or.

From tziporet at dev.mellanox.co.il Tue Jul 17 23:41:02 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 09:41:02 +0300
Subject: [ofa-general] OpenIB development help
In-Reply-To:
References: <20070717165553.GA10298@vt.edu> <20070717203428.GA12927@vt.edu>
Message-ID: <469DB5FE.1010206@mellanox.co.il>

Roland Dreier wrote:
> > Thanks for replying to mail. I have a some basic understanding of IB. I
> > have gone through some of the example code in the example directory and
> > OFED performance test. I noticed that every one of those examples used
> > TCP to exchange information regarding lid, psn and qpn. My question is
> > basically that is there any other way to exchange this information using
> > only IB. Since no hardware supports RD, I have to bite the bullet and
> > use RC.
>
Also, in the test rdma_lat (under ~mst/perftest.git) there is an option -c that opens the connection using rdmacm.

> Look at librdmacm (or libibcm). They provide higher-level
> abstractions for connection establishment.

There is an example there too: rping, which opens an RC connection with librdmacm.

Tziporet

From ogerlitz at voltaire.com Tue Jul 17 23:43:52 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:43:52 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To:
References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com>
Message-ID: <469DB6A8.1050107@voltaire.com>

Roland Dreier wrote:
> > IMHO, I still think that without some kind of SM/SA sourced
> > invalidation mechanism all client side caching (including the ipoib
> > stuff we have now) is a bad idea.
>
> But for IPoIB at least doing a path lookup for every packet is
> obviously not feasible. And ARP table aging gives a way to recover
> from stale cached data, eventually at least.
> In fact this may be a good argument in favor of local SA caching -- by
> analogy with IPoIB it makes sense to avoid going to the SA too often.

For each neighbour, IPoIB-UD (*) keeps an IB UD Address Handle (AH), so the neighbouring subsystem GC mechanism, which does unicast ARP probes etc., actually --verifies-- that the cached AH is valid. With the local SA, even though the network stack has invalidated the AH (neighbour), a new path query would not be initiated. If this is also the case with the current IPoIB code, it seems to me to be a bug. Actually, I never managed to understand --why-- there's a need to keep the path record in ipoib (except for debugfs reasons) and not only the AH?!

Or.

(*) For IPoIB-CM it's the same idea: the neighbour points to an IB connection and the probe is sent over the connection.

From ogerlitz at voltaire.com Tue Jul 17 23:53:51 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 09:53:51 +0300
Subject: [ofa-general] Further 2.6.23 merge plans...
In-Reply-To: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
References: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com>
Message-ID: <469DB8FF.20604@voltaire.com>

Sean Hefty wrote:
> The local SA exposes the ability for a user space
> application to force an update of the cache, and leaves the refresh policy up to
> the user.
In our use model, we force a refresh immediately before starting a
> large MPI job.

The last statement left me confused... if you refresh the cache before you use it (spawning a large MPI job), what does it buy you at all?! Also, how is the forced update mechanism implemented?

Or.

From ogerlitz at voltaire.com Wed Jul 18 00:08:29 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 10:08:29 +0300
Subject: [ofa-general] [PATCH] IB/mad: fix duplicated kernel thread name
In-Reply-To:
References: <469C923C.5000307@voltaire.com>
Message-ID: <469DBC6D.5030604@voltaire.com>

Roland Dreier wrote:
> > the patch for itself only fixes a possible confusion created for the
> > user as of two processes with the same name, however the discussion
> > evolved to the question of how many threads should be used by the MAD
> > and CM layers.
>
> Is there any practical impact of two kernel threads with the same
> name, though? I have tons of processes that all are "/bin/bash" on my
> box and it doesn't hurt too much.

Yes. When looking at the system for debugging purposes, since I know the mad layer uses a thread per device/port, when there is a problem (e.g. starvation, crash, deadlock, you name it), it's beneficial to know which traffic flow it's related to, so the duplicate name creates confusion, that's all.

> The simplest way to make sure all the threads have unique names would
> seem to be just a private counter in the mad module that counts up,
> rather than trying to do device or port number. Sticking in the last
> character of the device name is obviously too ugly.

OK, this seems quite simple to implement. However, Michael has sent you a patch that changes the mad layer to use only one thread, and I have raised a question: with the current code the mad layer uses a thread per device/port and the CM uses a thread per CPU. Is this really needed? What is the correct path here? Some discussion on that is going on in this thread.

Or.
From mst at dev.mellanox.co.il Wed Jul 18 00:28:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jul 2007 10:28:41 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> Message-ID: <20070718072841.GC1115@mellanox.co.il> > And ARP table aging gives a way to recover > from stale cached data, eventually at least. Does it? $ grep path_list drivers/infiniband/ulp/ipoib/*c drivers/infiniband/ulp/ipoib/ipoib_main.c: list_add_tail(&path->list, &priv->path_list); drivers/infiniband/ulp/ipoib/ipoib_main.c: list_splice(&priv->path_list, &remove_list); drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); In other words we add paths to ipoib specific cache, but we never seem to *remove* individual paths from cache - we only know how to do full cache invalidates on events such as port state change. Right? -- MST From tziporet at dev.mellanox.co.il Wed Jul 18 00:34:52 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 18 Jul 2007 10:34:52 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <20070717214417.GE17168@mellanox.co.il> References: <20070717214417.GE17168@mellanox.co.il> Message-ID: <469DC29C.3070205@mellanox.co.il> Michael S. Tsirkin wrote: > We have the patches applied in ofed 1.2.c with default module parameter set to > caching disabled (ofed 1.2 had a different version of the patches, but caching > is disabled by default there, too). At least in this configuration > (caching disabled), all issues I've seen seem to be fixed now, and tests seem to > be running smoothly. 
>
As far as I know Intel run with SA cache enabled on large clusters with Intel MPI

Tziporet

From mst at dev.mellanox.co.il Wed Jul 18 00:38:31 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 10:38:31 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <469DC29C.3070205@mellanox.co.il>
References: <20070717214417.GE17168@mellanox.co.il> <469DC29C.3070205@mellanox.co.il>
Message-ID: <20070718073831.GE1115@mellanox.co.il>

> Quoting Tziporet Koren :
> Subject: Re: [ofa-general] Re: Further 2.6.23 merge plans...
>
> Michael S. Tsirkin wrote:
> >We have the patches applied in ofed 1.2.c with default module parameter set
> >to caching disabled (ofed 1.2 had a different version of the patches, but
> >caching is disabled by default there, too). At least
> >in this configuration (caching disabled), all issues I've seen seem to be
> >fixed now, and tests seem to be running smoothly.
> >
> As far as I know Intel run with SA cache enabled on large clusters with
> Intel MPI

With OFED 1.2 version of the code, right?

-- 
MST

From mst at dev.mellanox.co.il Wed Jul 18 00:46:32 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 18 Jul 2007 10:46:32 +0300
Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib
In-Reply-To:
References: <1183643723.25031.262.camel@mtls03>
Message-ID: <20070718074632.GF1115@mellanox.co.il>

> + ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE,
> + DMA_FROM_DEVICE);
> + skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data,
> + wc->byte_len - IB_GRH_BYTES);
> + ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE,
> + DMA_FROM_DEVICE);

BTW, why is ib_dma_sync_single_for_device necessary here?

-- 
MST

From tziporet at dev.mellanox.co.il Wed Jul 18 01:48:58 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 18 Jul 2007 11:48:58 +0300
Subject: [ofa-general] Re: Further 2.6.23 merge plans...
In-Reply-To: <20070718073831.GE1115@mellanox.co.il>
References: <20070717214417.GE17168@mellanox.co.il> <469DC29C.3070205@mellanox.co.il> <20070718073831.GE1115@mellanox.co.il>
Message-ID: <469DD3FA.305@mellanox.co.il>

Michael S. Tsirkin wrote:
>> As far as I know Intel run with SA cache enabled on large clusters with
>> Intel MPI
>>
>
> With OFED 1.2 version of the code, right?
>
>
Yes. But maybe they also used the new module - Sean?

From ogerlitz at voltaire.com Wed Jul 18 01:50:27 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 18 Jul 2007 11:50:27 +0300
Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name
In-Reply-To: <469CEC8F.4050106@ichips.intel.com>
References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com> <469C9453.80905@voltaire.com> <469CEC8F.4050106@ichips.intel.com>
Message-ID: <469DD453.8010700@voltaire.com>

Sean Hefty wrote:
>> Can you explain why would not the IB CM use the thread context
>> provided by the mad layer?

> You can end up with deadlock conditions when destroying cm_id's that
> have outstanding MADs.
It also increases MAD processing time, which can
> increase dropping MADs.

OK, thanks for the clarification.

>> Second, if the CM needs a different context why not use the system
>> threads? I understood from Michael's reply that the CM code relies on
>> some thread/queue flushing at the time of CM ID destruction, is it an
>> implementation issue that can change? if not, can't one dedicated
>> thread do the job?

> The timing and use of the system threads is unknown. When the ib_mad
> module was created, it was suggested that the system threads not be
> used. (I think it was Roland who recommended this.) We can change to
> system threads, but it does open the possibility of complicated deadlock
> conditions if other modules use the system threads as well.

I know that for reasons such as the timing and use you mention, people tend not to use the system threads for their fast path tasks. As for the possibility of deadlock because of system threads usage, this is an argument which I like less (...); e.g. the network stack does well with its deadlock avoidance code without spawning dedicated threads.

Is it all about the net stack using softirqs, where the IB stack needs threads for its control path (because of the usage of commands for IB resource create/modify/destroy)? If this is the case, do you think it justifies spawning a thread per CPU?

Or.
From vlad at lists.openfabrics.org Wed Jul 18 01:51:45 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 18 Jul 2007 01:51:45 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070718-0100 daily build status Message-ID: <20070718085145.4D6DDE60B76@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with 
linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From ogerlitz at voltaire.com Wed Jul 18 02:04:59 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 18 Jul 2007 12:04:59 +0300 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <20070718072841.GC1115@mellanox.co.il> References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> <20070718072841.GC1115@mellanox.co.il> Message-ID: <469DD7BB.6060009@voltaire.com> Michael S. Tsirkin wrote: >> And ARP table aging gives a way to recover >> from stale cached data, eventually at least. > > Does it? > > $ grep path_list drivers/infiniband/ulp/ipoib/*c > drivers/infiniband/ulp/ipoib/ipoib_main.c: list_add_tail(&path->list, &priv->path_list); > drivers/infiniband/ulp/ipoib/ipoib_main.c: list_splice(&priv->path_list, &remove_list); > drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); > drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); > > In other words we add paths to ipoib specific cache, but we never seem > to *remove* individual paths from cache - we only know how to do > full cache invalidates on events such as port state change. > > Right? 
this seems like a bug, if the stack decided to delete OR change a neighbour, the path associated with it must not be re-used to create the address handle or to establish the connection, same for multicast neighbours. Or. From vlad at lists.openfabrics.org Wed Jul 18 02:45:39 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 18 Jul 2007 02:45:39 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070718-0200 daily build status Message-ID: <20070718094539.DF301E60825@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22-rc7 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.13 Passed on ia64 
with linux-2.6.15 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From mst at dev.mellanox.co.il Wed Jul 18 02:55:31 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jul 2007 12:55:31 +0300 Subject: [ofa-general] Re: [PATCHv2] IB/mad: fix duplicated kernel thread name In-Reply-To: <469DD453.8010700@voltaire.com> References: <20070715094145.GA16231@mellanox.co.il> <469B3286.3060902@voltaire.com> <20070716115911.GA3379@mellanox.co.il> <469B6634.1050709@voltaire.com> <469B9B5A.2040707@ichips.intel.com> <469C9453.80905@voltaire.com> <469CEC8F.4050106@ichips.intel.com> <469DD453.8010700@voltaire.com> Message-ID: <20070718095531.GH1115@mellanox.co.il> > Is it all about that the net stack uses softirqs where the ib stack > needs threads for its control path (as of the usage of commands for IB > resource create/modify/destroy). Yes. > If this is the case, do you think it > justifies spawning thread per CPU? Thread per CPU is really the default. 
It's the single-threaded WQs that need justification :) -- MST From sashak at voltaire.com Wed Jul 18 03:31:11 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Jul 2007 13:31:11 +0300 Subject: [ofa-general] svn converted repos moved away Message-ID: <20070718103111.GN31073@sashak.voltaire.com> Hi, I moved the repos where the original svn-to-git conversion was done to a private place. I guess that nobody needs them anymore. Please let me know if I'm wrong. Sasha From dotanb at dev.mellanox.co.il Wed Jul 18 04:21:04 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 18 Jul 2007 14:21:04 +0300 Subject: [ofa-general] [PATCH] core/iwcm: Remove local write permission enable in QP access flags Message-ID: <200707181421.04336.dotanb@dev.mellanox.co.il> Remove local write permission enable in QP access flags (this attribute is being used only for remote connections). Signed-off-by: Dotan Barak --- diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c index 223b1aa..44b0a3d 100644 --- a/drivers/infiniband/core/iwcm.c +++ b/drivers/infiniband/core/iwcm.c @@ -941,8 +941,7 @@ static int iwcm_init_qp_init_attr(struct iwcm_id_private *cm_id_priv, case IW_CM_STATE_CONN_RECV: case IW_CM_STATE_ESTABLISHED: *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS; - qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | - IB_ACCESS_REMOTE_WRITE| + qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ; ret = 0; break; From tziporet at dev.mellanox.co.il Wed Jul 18 04:50:27 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 18 Jul 2007 14:50:27 +0300 Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> Message-ID: <469DFE83.4030507@mellanox.co.il> Hal Rosenstock wrote: > Hi Tziporet, > > > What happened to ibsim ? I thought that was on the list I originally > sent.
It was, but Sasha told me it's not actually part of OFED. If it is, no problem to add it again. Tziporet From eli at mellanox.co.il Wed Jul 18 05:25:52 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 18 Jul 2007 15:25:52 +0300 Subject: [ofa-general] socket buffer accounting with UDP/ipoib In-Reply-To: References: <1183643723.25031.262.camel@mtls03> Message-ID: <1184761552.3520.9.camel@mtls03> I ran some experiments with iperf in CM mode over TCP sockets. I can see that there is no adverse effect on BW (Excel file attached). We did see a slight improvement in packet loss in UDP mode with an application supplied by a customer. Copy small received packets to newly allocated SKBs just big enough to contain the packet. This relieves the accounting done on the socket so that a smaller size is used. Signed-off-by: Eli Cohen --- Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-17 15:41:29.000000000 +0300 +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-18 09:34:49.000000000 +0300 @@ -651,4 +651,7 @@ #define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) +#define SKB_LEN_THOLD 256 +#define CM_SKB_LEN_THOLD min(SKB_LEN_THOLD, IPOIB_CM_HEAD_SIZE) + #endif /* _IPOIB_H */ Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c =================================================================== --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 15:41:29.000000000 +0300 +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-18 10:46:54.000000000 +0300 @@ -452,26 +452,40 @@ frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; - - newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); - if (unlikely(!newskb)) { - /* - * If we can't allocate a new RX buffer, dump - * this packet and reuse the old buffer.
- */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); - ++priv->stats.rx_dropped; - goto repost; + if (wc->byte_len < CM_SKB_LEN_THOLD) { + newskb = dev_alloc_skb(wc->byte_len); + if (!newskb) + ipoib_warn(priv, "failed to allocate skb\n"); + + ib_dma_sync_single_for_cpu(priv->ca, priv->cm.srq_ring[wr_id].mapping[0], + IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); + skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data, + wc->byte_len - IB_GRH_BYTES); + ib_dma_sync_single_for_device(priv->ca, priv->cm.srq_ring[wr_id].mapping[0], + IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); + + skb_put(newskb, wc->byte_len); + skb = newskb; + } + else { + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ++priv->stats.rx_dropped; + goto repost; + } + ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); - ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); - skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); From eli at mellanox.co.il Wed Jul 18 05:27:12 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 18 Jul 2007 15:27:12 +0300 Subject: [ofa-general] socket buffer accounting with UDP/ipoib In-Reply-To: <1184761552.3520.9.camel@mtls03> References: <1183643723.25031.262.camel@mtls03> <1184761552.3520.9.camel@mtls03> Message-ID: <1184761632.3520.11.camel@mtls03> Attaching the file On 
Wed, 2007-07-18 at 15:25 +0300, Eli Cohen wrote: > I made some experiments with iperf running on CM mode and TCP sockets. I > can see that there is no bad affect on BW (excel file attached). We did > see a slight improvement in packet loss in UDP mode with an application > supplied by a customer. > > > > Copy small received patckets to newly allocated SKBs just > big enough to contain the packet. This will relief accounting > done on the socket so that a smaller size is used. > > Signed-off-by: Eli Cohen > > --- > > Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h > =================================================================== > --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-17 15:41:29.000000000 +0300 > +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-18 09:34:49.000000000 +0300 > @@ -651,4 +651,7 @@ > > #define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) > > +#define SKB_LEN_THOLD 256 > +#define CM_SKB_LEN_THOLD min(SKB_LEN_THOLD, IPOIB_CM_HEAD_SIZE) > + > #endif /* _IPOIB_H */ > Index: connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c > =================================================================== > --- connectx_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 15:41:29.000000000 +0300 > +++ connectx_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-18 10:46:54.000000000 +0300 > @@ -452,26 +452,40 @@ > > frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, > (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; > - > - newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); > - if (unlikely(!newskb)) { > - /* > - * If we can't allocate a new RX buffer, dump > - * this packet and reuse the old buffer. 
> - */ > - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > - ++priv->stats.rx_dropped; > - goto repost; > + if (wc->byte_len < CM_SKB_LEN_THOLD) { > + newskb = dev_alloc_skb(wc->byte_len); > + if (!newskb) > + ipoib_warn(priv, "failed to allocate skb\n"); > + > + ib_dma_sync_single_for_cpu(priv->ca, priv->cm.srq_ring[wr_id].mapping[0], > + IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); > + skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data, > + wc->byte_len - IB_GRH_BYTES); > + ib_dma_sync_single_for_device(priv->ca, priv->cm.srq_ring[wr_id].mapping[0], > + IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); > + > + skb_put(newskb, wc->byte_len); > + skb = newskb; > + } > + else { > + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, mapping); > + if (unlikely(!newskb)) { > + /* > + * If we can't allocate a new RX buffer, dump > + * this packet and reuse the old buffer. > + */ > + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); > + ++priv->stats.rx_dropped; > + goto repost; > + } > + ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); > + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); > + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > } > > - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); > - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); > - > ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > wc->byte_len, wc->slid); > > - skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); > - > skb->protocol = ((struct ipoib_header *) skb->data)->proto; > skb_reset_mac_header(skb); > skb_pull(skb, IPOIB_ENCAP_LEN); > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: skb_vs_noskb_patch.xls Type: application/vnd.ms-excel Size: 17408 bytes Desc: not available URL: From hal.rosenstock at gmail.com Wed Jul 18 05:43:57 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Jul 2007 08:43:57 -0400 Subject: [ofa-general] Re: [ewg] Agenda for OFED meeting today In-Reply-To: <469DFE83.4030507@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563963@mtlexch01.mtl.com> <469DFE83.4030507@mellanox.co.il> Message-ID: On 7/18/07, Tziporet Koren wrote: > > Hal Rosenstock wrote: > > Hi Tziporet, > > > > > > What happened to ibsim ? I thought that was on the list I originally > > sent. > It was but Sasha told me its not actually part of OFED. That was the case for OFED 1.2 but it was proposed to add it for OFED 1.3. If it is no > problem to add it again Could this please be done ? Thanks. -- Hal Tziporet > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Wed Jul 18 06:36:31 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 18 Jul 2007 16:36:31 +0300 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> <20070710171142.GC11320@mellanox.co.il> <20070710183006.GE11320@mellanox.co.il> Message-ID: <20070718133630.GF17765@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [ofa-general] Re: mthca use of dma_sync_single is bogus > > > Hmm. This means there's no way to sync a range within > > mapping created with map_sg? > > It doesn't seem that there is one right now at least. > > > > It actually doesn't look too bad to replace our use of pci_map_sg() > > > with dma_map_single(), at least at first glance. I'll try to write a > > > patch later. > > > > Well, the reason map_sg is there is presumably because on some > > architectures it's worth it to try and make the region contigious in DMA space. 
> > But I agree this seems the lesser evil at this point ... > > Given that we're already trying to allocate big chunks of physically > contiguous memory, I think that any virtual merging we get is likely > to be of very small benefit. > > It is kind of a shame to give this up though. Did we reach any conclusion? Are you switching to map_single? -- MST From bramesh at vt.edu Wed Jul 18 07:08:03 2007 From: bramesh at vt.edu (Bharath Ramesh) Date: Wed, 18 Jul 2007 10:08:03 -0400 Subject: [ofa-general] OpenIB development help In-Reply-To: <469DB5FE.1010206@mellanox.co.il> References: <20070717165553.GA10298@vt.edu> <20070717203428.GA12927@vt.edu> <469DB5FE.1010206@mellanox.co.il> Message-ID: <20070718140803.GA31599@vt.edu> * Tziporet Koren (tziporet at dev.mellanox.co.il) wrote: > Roland Dreier wrote: >> > Thanks for replying to mail. I have a some basic understanding of IB. I >> > have gone through some of the example code in the example directory and >> > OFED performance test. I noticed that every one of those examples used >> > TCP to exchange information regarding lid, psn and qpn. My question is >> > basically that is there any other way to exchange this information >> using >> > only IB. Since no hardware supports RD, I have to bite the bullet and >> > use RC. >> > Also in the test rdma_lat (under ~mst/perftest.git) there is an option -c > that opens the connection using rdmacm >> Look at librdmacm (or libibcm). They provide higher-level >> abstractions for connection establishment. > There is an example there too - rping that open RC connection with > librdmacm > > Tziporet > Thanks for pointing out the various examples. Really appreciate it. 
Thanks, Bharath --- Bharath Ramesh http://people.cs.vt.edu/~bramesh From rdreier at cisco.com Wed Jul 18 08:12:18 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 08:12:18 -0700 Subject: [ofa-general] Re: mthca use of dma_sync_single is bogus In-Reply-To: <20070718133630.GF17765@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jul 2007 16:36:31 +0300") References: <20070709213913.GB20052@mellanox.co.il> <20070710071547.GA3814@mellanox.co.il> <20070710171142.GC11320@mellanox.co.il> <20070710183006.GE11320@mellanox.co.il> <20070718133630.GF17765@mellanox.co.il> Message-ID: > Did we reach any conclusion? Are you switching to map_single? haven't had a chance to work on it yet, but I don't see a better alternative. From rdreier at cisco.com Wed Jul 18 08:10:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 08:10:47 -0700 Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib In-Reply-To: <20070718074632.GF1115@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 18 Jul 2007 10:46:32 +0300") References: <1183643723.25031.262.camel@mtls03> <20070718074632.GF1115@mellanox.co.il> Message-ID: > > + ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE, > > + DMA_FROM_DEVICE); > > + skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data, > > + wc->byte_len - IB_GRH_BYTES); > > + ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE, > > + DMA_FROM_DEVICE); > > BTW, why is ib_dma_sync_single_for_device necessary here? Not sure what you're asking exactly. The sync for device is needed to match the previous sync for the cpu obviously. We need both syncs for the same reason we need the unmap when we don't copy -- we're copying data out of the skb we gave to the device earlier, so we need to make sure the cpu sees the right data. 
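The pairing described above (sync for the CPU, copy the data out, sync back for the device) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: struct rx_buf, the owner field, sync_for_cpu(), sync_for_device(), copy_small_packet() and COPY_THRESHOLD are all hypothetical names that stand in for the receive buffer, ib_dma_sync_single_for_cpu(), ib_dma_sync_single_for_device() and the 256-byte threshold from the patch under discussion. Unlike that patch, the sketch bails out when the allocation fails instead of continuing with a NULL pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum owner { OWNER_DEVICE, OWNER_CPU };

struct rx_buf {
	unsigned char data[2048];
	enum owner owner;	/* who may touch data[] right now */
};

#define COPY_THRESHOLD 256	/* assumed to mirror SKB_LEN_THOLD */

/* Stand-ins for the sync-for-cpu / sync-for-device pair. */
static void sync_for_cpu(struct rx_buf *b)    { b->owner = OWNER_CPU; }
static void sync_for_device(struct rx_buf *b) { b->owner = OWNER_DEVICE; }

/*
 * Copy a small received packet into a new buffer of exactly len bytes.
 * Returns NULL when the packet is not small or the allocation fails;
 * the caller then takes the normal (non-copy) path or drops the packet.
 * On every return the receive buffer is owned by the device again.
 */
static unsigned char *copy_small_packet(struct rx_buf *b, size_t len)
{
	unsigned char *copy;

	if (len >= COPY_THRESHOLD)
		return NULL;

	copy = malloc(len ? len : 1);
	if (!copy)
		return NULL;

	sync_for_cpu(b);		/* CPU may now read the data */
	assert(b->owner == OWNER_CPU);
	memcpy(copy, b->data, len);
	sync_for_device(b);		/* hand the buffer back for reuse */

	return copy;
}
```

In this model the original buffer is never unmapped, so it can be reposted as-is, which is why the second sync is needed to match the first.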
From sean.hefty at intel.com Wed Jul 18 09:16:52 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 09:16:52 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <469DD3FA.305@mellanox.co.il> Message-ID: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com> >> With OFED 1.2 version of the code, right? >> >> >Yes. >But maybe they also used the new module - Sean? We actually use the OFED 1.2 version. So, this feature is in use, but not this specific implementation. - Sean From rdreier at cisco.com Wed Jul 18 09:20:21 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 09:20:21 -0700 Subject: [ofa-general] Re: Further 2.6.23 merge plans... In-Reply-To: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com> (Sean Hefty's message of "Wed, 18 Jul 2007 09:16:52 -0700") References: <000101c7c957$0e4e51e0$69cc180a@amr.corp.intel.com> Message-ID: > We actually use the OFED 1.2 version. So, this feature is in use, but not this > specific implementation. Hmm... how much testing has the implementation being proposed for merging actually had? It might still be OK if the answer is that it hasn't been tested at scale but that the basic code works and should behave the same as the code that was tested because the underlying design is the same... is at least that much true? From eitan at mellanox.co.il Wed Jul 18 09:28:04 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 18 Jul 2007 19:28:04 +0300 Subject: [ofa-general] [PATCH] opensm: Bug in coding of VL Arbitration tables Message-ID: <86ir8h3cdn.fsf@sw053.lab.mtl.com> Hi Sasha Discovered a bug in coding of the VL Arbitration table "index". 
According to spec should be: 1 for low part of low table 2 for high part of low table 3 for low part of high table 4 for high part of high table the patch below fixes it: Eitan Signed-off-by: Eitan Zahavi diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index bbb1608..413e200 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -116,14 +116,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_low[0], - len, 0)) != IB_SUCCESS) + len, 1)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_low[1], - len, 1)) != IB_SUCCESS) + len, 2)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_high_cap > 0) { @@ -131,14 +131,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_high[0], - len, 2)) != IB_SUCCESS) + len, 3)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_high[1], - len, 3)) != IB_SUCCESS) + len, 4)) != IB_SUCCESS) return status; } From eitan at mellanox.co.il Wed Jul 18 09:31:49 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 18 Jul 2007 19:31:49 +0300 Subject: [ofa-general] [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit Message-ID: <86hco13c7e.fsf@sw053.lab.mtl.com> Hi Sasha When QoS setup is done the code was trying to send updates of vl_arb_high_limit by req_set of PORT_INFO with the new data. However, at that stage the SM still did not assign LIDs to the ports. 
So the PortInfo.base_lid that was sent was still zero. The specification does not allow such LIDs (they are considered illegal). The patch below fixes this by storing the calculated value and using it later in the link and LID managers. Eitan Signed-off-by: Eitan Zahavi diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h index 54ebcfc..5032b1b 100644 --- a/opensm/include/opensm/osm_port.h +++ b/opensm/include/opensm/osm_port.h @@ -117,6 +117,7 @@ typedef struct _osm_physp struct _osm_node *p_node; struct _osm_physp *p_remote_physp; boolean_t healthy; + uint8_t vl_high_limit; osm_dr_path_t dr_path; osm_pkey_tbl_t pkeys; ib_vl_arb_table_t vl_arb[4]; diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index bc3f8b3..ed76382 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( ib_port_info_get_port_state(p_old_pi) ) send_set = TRUE; } + + /* provide the vl_high_limit from the qos mgr */ + if (p_mgr->p_subn->opt.no_qos == FALSE) + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) + { + send_set = TRUE; + p_pi->vl_high_limit = p_physp->vl_high_limit; + } } else { diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c index 25f0fc3..3781fd2 100644 --- a/opensm/opensm/osm_link_mgr.c +++ b/opensm/opensm/osm_link_mgr.c @@ -354,6 +354,15 @@ __osm_link_mgr_set_physp_pi( context.pi_context.active_transition = FALSE; } + /* provide the vl_high_limit from the qos mgr */ + if (p_mgr->p_subn->opt.no_qos == FALSE) + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) + { + send_set = TRUE; + p_pi->vl_high_limit = p_physp->vl_high_limit; + } + + context.pi_context.node_guid = osm_node_get_node_guid( p_node ); context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); context.pi_context.set_method = TRUE; diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index bbb1608..413e200 100644 --- a/opensm/opensm/osm_qos.c +++ 
b/opensm/opensm/osm_qos.c @@ -216,42 +216,6 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, return IB_SUCCESS; } -static ib_api_status_t vl_high_limit_update(osm_req_t * p_req, - osm_physp_t * p, - const struct qos_config *qcfg) -{ - uint8_t payload[IB_SMP_DATA_SIZE]; - osm_madw_context_t context; - ib_port_info_t *p_pi; - - p_pi = &p->port_info; - - if (p_pi->vl_high_limit == qcfg->vl_high_limit) - return IB_SUCCESS; - - memset(payload, 0, IB_SMP_DATA_SIZE); - memcpy(payload, p_pi, sizeof(ib_port_info_t)); - - p_pi = (ib_port_info_t *) payload; - ib_port_info_set_state_no_change(p_pi); - - p_pi->vl_high_limit = qcfg->vl_high_limit; - - context.pi_context.node_guid = - osm_node_get_node_guid(osm_physp_get_node_ptr(p)); - context.pi_context.port_guid = osm_physp_get_port_guid(p); - context.pi_context.set_method = TRUE; - context.pi_context.update_master_sm_base_lid = FALSE; - context.pi_context.ignore_errors = FALSE; - context.pi_context.light_sweep = FALSE; - context.pi_context.active_transition = FALSE; - - return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p), - payload, sizeof(payload), IB_MAD_ATTR_PORT_INFO, - cl_hton32(osm_physp_get_port_num(p)), - CL_DISP_MSGID_NONE, &context); -} - static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req, osm_port_t * p_port, osm_physp_t * p, uint8_t port_num, @@ -261,16 +225,8 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req, /* OpVLs should be ok at this moment - just use it */ - /* setup VL high limit */ - status = vl_high_limit_update(p_req, p, qcfg); - if (status != IB_SUCCESS) { - osm_log(p_log, OSM_LOG_ERROR, - "qos_physp_setup: ERR 6201 : " - "failed to update VLHighLimit " - "for port %" PRIx64 " #%d\n", - cl_ntoh64(p->port_guid), port_num); - return status; - } + /* setup VL high limit on the physp later to be updated by lid/link mgrs */ + p->vl_high_limit = qcfg->vl_high_limit; /* setup VLArbitration */ status = vlarb_update(p_req, p, 
port_num, qcfg); From hal.rosenstock at gmail.com Wed Jul 18 09:35:49 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Jul 2007 09:35:49 -0700 Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding of VL Arbitration tables In-Reply-To: <86ir8h3cdn.fsf@sw053.lab.mtl.com> References: <86ir8h3cdn.fsf@sw053.lab.mtl.com> Message-ID: Hi Eitan, On 7/18/07, Eitan Zahavi wrote: > > Hi Sasha > > Discovered a bug in coding of the VL Arbitration table "index". > According to spec should be: > 1 for low part of low table > 2 for high part of low table > 3 for low part of high table > 4 for high part of high table > > the patch below fixes it: > > Eitan > > Signed-off-by: Eitan Zahavi > > diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c > index bbb1608..413e200 100644 > --- a/opensm/opensm/osm_qos.c > +++ b/opensm/opensm/osm_qos.c > @@ -116,14 +116,14 @@ static ib_api_status_t vlarb_update(osm_req_t * > p_req, > p_pi->vl_arb_low_cap : > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; > if ((status = vlarb_update_table_block(p_req, p, port_num, > > &qcfg->vlarb_low[0], > - len, 0)) != > IB_SUCCESS) > + len, 1)) != > IB_SUCCESS) > return status; > } > if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { > len = p_pi->vl_arb_low_cap % > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; > if ((status = vlarb_update_table_block(p_req, p, port_num, > > &qcfg->vlarb_low[1], > - len, 1)) != > IB_SUCCESS) > + len, 2)) != > IB_SUCCESS) > return status; > } > if (p_pi->vl_arb_high_cap > 0) { > @@ -131,14 +131,14 @@ static ib_api_status_t vlarb_update(osm_req_t * > p_req, > p_pi->vl_arb_high_cap : > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; > if ((status = vlarb_update_table_block(p_req, p, port_num, > > &qcfg->vlarb_high[0], > - len, 2)) != > IB_SUCCESS) > + len, 3)) != > IB_SUCCESS) > return status; > } > if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { > len = p_pi->vl_arb_high_cap % > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; > if ((status = vlarb_update_table_block(p_req, p, 
port_num, > > &qcfg->vlarb_high[1], > - len, 3)) != > IB_SUCCESS) > + len, 4)) != > IB_SUCCESS) > return status; > } Are you sure ? It looks to me like this is already handled in > vlarb_update_table_block as follows: > if (!memcmp(&p->vl_arb[block_num], &block, block_length * sizeof(block.vl_entry[0]))) return IB_SUCCESS; but attr_mod = ((block_num + 1) << 16) | port_num; return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p), (uint8_t *) & block, sizeof(block), IB_MAD_ATTR_VL_ARBITRATION, cl_hton32(attr_mod), CL_DISP_MSGID_NONE, &context); -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Wed Jul 18 09:37:11 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 18 Jul 2007 19:37:11 +0300 Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding of VL Arbitration tables References: <86ir8h3cdn.fsf@sw053.lab.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73E68@mtlexch01.mtl.com> Thanks Hal. Good catch. Should have seen this. Sorry Eitan ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Wednesday, July 18, 2007 7:36 PM To: Eitan Zahavi Cc: OPENIB; sashak at voltaire.com; Yevgeny Kliteynik Subject: Re: [PATCH] opensm: Bug in coding of VL Arbitration tables Hi Eitan, On 7/18/07, Eitan Zahavi wrote: Hi Sasha Discovered a bug in coding of the VL Arbitration table "index". 
According to spec should be: 1 for low part of low table 2 for high part of low table 3 for low part of high table 4 for high part of high table the patch below fixes it: Eitan Signed-off-by: Eitan Zahavi diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index bbb1608..413e200 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -116,14 +116,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_low[0], - len, 0)) != IB_SUCCESS) + len, 1)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_low[1], - len, 1)) != IB_SUCCESS) + len, 2)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_high_cap > 0) { @@ -131,14 +131,14 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_high[0], - len, 2)) != IB_SUCCESS) + len, 3)) != IB_SUCCESS) return status; } if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, &qcfg->vlarb_high[1], - len, 3)) != IB_SUCCESS) + len, 4)) != IB_SUCCESS) return status; } Are you sure ? 
It looks to me like this is already handled in vlarb_update_table_block as follows: if (!memcmp(&p->vl_arb[block_num], &block, block_length * sizeof(block.vl_entry[0]))) return IB_SUCCESS; but attr_mod = ((block_num + 1) << 16) | port_num; return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p), (uint8_t *) & block, sizeof(block), IB_MAD_ATTR_VL_ARBITRATION, cl_hton32(attr_mod), CL_DISP_MSGID_NONE, &context); -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Jul 18 09:55:53 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 18 Jul 2007 09:55:53 -0700 Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com> References: <86hco13c7e.fsf@sw053.lab.mtl.com> Message-ID: Hi again Eitan, On 7/18/07, Eitan Zahavi wrote: > > Hi Sasha > > When QoS setup is done the code was trying to send updates of > vl_arb_high_limit by req_set of PORT_INFO with the new data. > However, at that stage the SM still did not assign LIDs to the ports. > So the sent PortInfo.base_lid was still zero. The specification does not > allow for such LIDs (they are considered ilegal). Doesn't that really depend on the PortState ? The LID (and SMLID) needs to be set by ARMED/ACTIVE. the patch below fixes this by storing the calculated value and later > using it in link and lid managers. It's probably better to defer the setting as this patch appears to do. 
-- Hal Eitan > > Signed-off-by: Eitan Zahavi > > diff --git a/opensm/include/opensm/osm_port.h > b/opensm/include/opensm/osm_port.h > index 54ebcfc..5032b1b 100644 > --- a/opensm/include/opensm/osm_port.h > +++ b/opensm/include/opensm/osm_port.h > @@ -117,6 +117,7 @@ typedef struct _osm_physp > struct _osm_node *p_node; > struct _osm_physp *p_remote_physp; > boolean_t healthy; > + uint8_t vl_high_limit; > osm_dr_path_t dr_path; > osm_pkey_tbl_t pkeys; > ib_vl_arb_table_t vl_arb[4]; > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c > index bc3f8b3..ed76382 100644 > --- a/opensm/opensm/osm_lid_mgr.c > +++ b/opensm/opensm/osm_lid_mgr.c > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > ib_port_info_get_port_state(p_old_pi) ) > send_set = TRUE; > } > + > + /* provide the vl_high_limit from the qos mgr */ > + if (p_mgr->p_subn->opt.no_qos == FALSE) > + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > + { > + send_set = TRUE; > + p_pi->vl_high_limit = p_physp->vl_high_limit; > + } > } > else > { > diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c > index 25f0fc3..3781fd2 100644 > --- a/opensm/opensm/osm_link_mgr.c > +++ b/opensm/opensm/osm_link_mgr.c > @@ -354,6 +354,15 @@ __osm_link_mgr_set_physp_pi( > context.pi_context.active_transition = FALSE; > } > > + /* provide the vl_high_limit from the qos mgr */ > + if (p_mgr->p_subn->opt.no_qos == FALSE) > + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > + { > + send_set = TRUE; > + p_pi->vl_high_limit = p_physp->vl_high_limit; > + } > + > + > context.pi_context.node_guid = osm_node_get_node_guid( p_node ); > context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > context.pi_context.set_method = TRUE; > diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c > index bbb1608..413e200 100644 > --- a/opensm/opensm/osm_qos.c > +++ b/opensm/opensm/osm_qos.c > @@ -216,42 +216,6 @@ static ib_api_status_t sl2vl_update(osm_req_t * > p_req, 
osm_port_t * p_port, > return IB_SUCCESS; > } > > -static ib_api_status_t vl_high_limit_update(osm_req_t * p_req, > - osm_physp_t * p, > - const struct qos_config *qcfg) > -{ > - uint8_t payload[IB_SMP_DATA_SIZE]; > - osm_madw_context_t context; > - ib_port_info_t *p_pi; > - > - p_pi = &p->port_info; > - > - if (p_pi->vl_high_limit == qcfg->vl_high_limit) > - return IB_SUCCESS; > - > - memset(payload, 0, IB_SMP_DATA_SIZE); > - memcpy(payload, p_pi, sizeof(ib_port_info_t)); > - > - p_pi = (ib_port_info_t *) payload; > - ib_port_info_set_state_no_change(p_pi); > - > - p_pi->vl_high_limit = qcfg->vl_high_limit; > - > - context.pi_context.node_guid = > - osm_node_get_node_guid(osm_physp_get_node_ptr(p)); > - context.pi_context.port_guid = osm_physp_get_port_guid(p); > - context.pi_context.set_method = TRUE; > - context.pi_context.update_master_sm_base_lid = FALSE; > - context.pi_context.ignore_errors = FALSE; > - context.pi_context.light_sweep = FALSE; > - context.pi_context.active_transition = FALSE; > - > - return osm_req_set(p_req, osm_physp_get_dr_path_ptr(p), > - payload, sizeof(payload), > IB_MAD_ATTR_PORT_INFO, > - cl_hton32(osm_physp_get_port_num(p)), > - CL_DISP_MSGID_NONE, &context); > -} > - > static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * > p_req, > osm_port_t * p_port, osm_physp_t * > p, > uint8_t port_num, > @@ -261,16 +225,8 @@ static ib_api_status_t qos_physp_setup(osm_log_t * > p_log, osm_req_t * p_req, > > /* OpVLs should be ok at this moment - just use it */ > > - /* setup VL high limit */ > - status = vl_high_limit_update(p_req, p, qcfg); > - if (status != IB_SUCCESS) { > - osm_log(p_log, OSM_LOG_ERROR, > - "qos_physp_setup: ERR 6201 : " > - "failed to update VLHighLimit " > - "for port %" PRIx64 " #%d\n", > - cl_ntoh64(p->port_guid), port_num); > - return status; > - } > + /* setup VL high limit on the physp later to be updated by > lid/link mgrs */ > + p->vl_high_limit = qcfg->vl_high_limit; > > /* setup VLArbitration */ > 
status = vlarb_update(p_req, p, port_num, qcfg); > > From mshefty at ichips.intel.com Wed Jul 18 10:05:58 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 10:05:58 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com> References: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com> Message-ID: <469E4876.7020805@ichips.intel.com> > We will have a better idea of the issues and possible solutions once the QoS > spec is released, and we can hold discussions on it. I will be working more > details on QoS enhancements starting in the next couple of weeks. Based on discussions so far, maybe the best path forward from here is to delay until 2.6.24. This will let us add this version to OFED 1.3 for more widespread testing, plus give us the time that we need to come up with a plan to integrate QoS with the local SA. I don't think we'll have a final implementation for QoS support by that time, but at least we'll have a better idea of the problems. These patches are based on the same design used with OFED 1.2, but a fair number of lines of code still changed, plus it added InformInfo registration. I don't believe anyone other than me has tested these patches with the local SA enabled. It's typically running on my systems, but because it automatically fails over to standard SA queries, it would be easy for me to miss problems. - Sean From pradeeps at linux.vnet.ibm.com Wed Jul 18 10:23:46 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 18 Jul 2007 10:23:46 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit Message-ID: <469E4CA2.2040708@linux.vnet.ibm.com> Resubmitting the 7th version of the patch. Changed the settings in my mail client, so I expect there should be no line wraps. Also white space mangling rectified. 
Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-17 19:21:46.000000000 -0400 @@ -95,11 +95,16 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (1ul << 16) #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) + +#define NOSRQ_INDEX_TABLE_SIZE 128 +#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_TABLE_SIZE -1) + #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +171,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by NOSRQ only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -NOSRQ only */ enum ipoib_cm_state state; }; @@ -215,6 +223,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long dev_kfree_skb_any(skb); } -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) { } - #endif #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-10 17:02:33.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 20:53:19.000000000 -0400 @@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; +int max_recv_buf = 
1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported"); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); + +atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */ + #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +92,20 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +115,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + 
ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +187,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -198,16 +251,21 @@ static struct ib_qp *ipoib_cm_create_rx_ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { - .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* For drain WR */ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + if (!priv->cm.srq) { + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; + attr.event_handler = NULL; + } else + attr.event_handler = ipoib_cm_rx_event_handler; return ib_create_qp(priv->pd, &attr); } @@ -282,12 +340,129 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static 
void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } + spin_unlock_irq(&priv->lock); +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 qp_num, index; + u64 i, recv_mem_used; + + qp_num = p->qp->qp_num; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the NOSRQ we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + + init_context_and_add_list(cm_id, p, priv); + spin_lock_irq(&priv->lock); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) + * CM_PACKET_SIZE; /* packets are 64K */ + if ((index == max_rc_qp) || + ( recv_mem_used >= max_recv_buf * (1ul << 20))) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "NOSRQ has reached the configurable limit " + "of either %d RC QPs or, max recv buf size of " + "0x%x MB\n", max_rc_qp, max_recv_buf); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. 
+ */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + + priv->cm.rx_index_table[index] = p; + spin_unlock_irq(&priv->lock); + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + if (post_receive_nosrq(dev, i << 32 | index)) { + ipoib_warn(priv, "post_receive_nosrq " + "failed for buf %ld\n", i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + kfree(p->rx_ring); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -298,13 +473,13 @@ static int ipoib_cm_req_handler(struct i ipoib_dbg(priv, "REQ arrived\n"); p = kzalloc(sizeof *p, GFP_KERNEL); - if (!p) + if (!p) { + printk(KERN_WARNING "Failed to allocate RX control block when " + "REQ arrived\n"); return -ENOMEM; + } p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -314,19 +489,21 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (!priv->cm.srq) { + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); + if (ret) + goto err_post_nosrq; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, 
psn); + if (ret) + goto err_modify; + } - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (priv->cm.srq) { + p->state = IPOIB_CM_RX_LIVE; + init_context_and_add_list(cm_id, p, priv); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -336,6 +513,9 @@ static int ipoib_cm_req_handler(struct i } return 0; +err_post_nosrq: + list_del_init(&p->list); + atomic_dec(&current_rc_qp); err_modify: ib_destroy_qp(p->qp); err_qp: @@ -399,29 +579,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> 0x%x)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -429,23 +640,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=0x%llx vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -457,13 +660,111 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %ld\n", + wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + + ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", + wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid 0x%llx (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ; + + /* This is the 
only place where rx_ptr could be a NULL - could + * have just received a packet from a connection that has become + * stale and so is going away. We will simply drop the packet and + * let the hardware (it's IB_QPT_RC) handle the dropped packet. + * In the timer_check() function below, p->jiffies is updated and + * hence the connection will not be stale after that. + */ + rx_ptr = priv->cm.rx_index_table[index]; + if (unlikely(!rx_ptr)) { + ipoib_warn(priv, "Received packet from a connection " + "that is going away. Hardware will handle it.\n"); + return; + } + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%ld vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. */ + timer_check_nosrq(priv, rx_ptr); + } + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + ipoib_dbg(priv, "failed to allocate receive buffer %ld\n", wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -483,10 +784,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %ld\n", + wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -678,6 +991,42 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + 
kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -692,6 +1041,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -737,6 +1091,7 @@ void ipoib_cm_dev_stop(struct net_device kfree(p); } + cancel_delayed_work(&priv->cm.stale_task); } @@ -815,7 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -855,7 +1212,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1200,6 +1557,9 @@ static void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) { + atomic_dec(&current_rc_qp); + } kfree(p); } } @@ -1218,12 +1578,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + 
list_del_init(&p->list); + priv->cm.rx_index_table[p->index] = NULL; + spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1277,16 +1644,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1303,20 +1694,34 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + ret = ib_query_device(priv->ca, &attr); + if (ret) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - 
printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + ret = create_srq(dev, priv); + if (ret) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n"); + return -ENOMEM; + } + + atomic_set(&current_rc_qp, 0); } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1329,17 +1734,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-10 18:30:10.000000000 -0400 @@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d do { n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-17 20:09:25.000000000 -0400 @@ -175,6 +175,15 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; + /* We increase the size of the CQ in the NOSRQ case to prevent CQ + * overflow. 
Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if (!priv->cm.srq) + size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size; + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); From pradeeps at linux.vnet.ibm.com Wed Jul 18 10:26:12 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 18 Jul 2007 10:26:12 -0700 Subject: [ofa-general] IPOIB CM (NOSRQ) extension [PATCH V2] patch resubmit Message-ID: <469E4D34.10903@linux.vnet.ibm.com> Resubmitting the 2nd version of the patch. Changed the settings in my mail client, so I expect there should be no line wraps. Also white space mangling rectified. Signed-off-by: Pradeep Satyanarayana --- --- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 21:08:38.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-18 12:49:06.000000000 -0400 @@ -1372,8 +1372,18 @@ static int ipoib_cm_tx_handler(struct ib ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); break; - case IB_CM_REQ_ERROR: case IB_CM_REJ_RECEIVED: + ipoib_warn(priv, "REJ received\n"); + spin_lock(&priv->lock); + neigh = tx->neigh; + spin_unlock(&priv->lock); + + if ((neigh) && (event->param.rej_rcvd.reason == + IB_CM_REJ_NO_QP)) { + clear_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags); + break; + } + case IB_CM_REQ_ERROR: case IB_CM_TIMEWAIT_EXIT: ipoib_dbg(priv, "CM error %d.\n", event->event); spin_lock_irq(&priv->tx_lock); --- c/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-18 12:50:05.000000000 -0400 @@ -679,11 +679,10 @@ static int ipoib_start_xmit(struct sk_bu neigh = *to_ipoib_neigh(skb->dst->neighbour); - if 
(ipoib_cm_get(neigh)) { - if (ipoib_cm_up(neigh)) { + if (ipoib_cm_get(neigh) && ipoib_cm_up(neigh) && + test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) { ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); goto out; - } } else if (neigh->ah) { if (unlikely(memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, From clark.tucker at gmail.com Wed Jul 18 10:42:11 2007 From: clark.tucker at gmail.com (Clark Tucker) Date: Wed, 18 Jul 2007 11:42:11 -0600 Subject: [ofa-general] rping / librdmacm deadlock question Message-ID: Hello all, First, the background: I am writing a linux device driver to provide IWarp device support for our hardware. I'm currently running kernel 2.6.20-rc4 and OFED-1.2-rc2. I realize these are somewhat old, but I have examined newer source, and haven't found any changes that seem immediately relevant. I am experiencing the following behavior: rping -s .... (server starts fine, loads proper user-space library, etc) rping -c ... (client starts fine, ... connects to server, and exchanges data successfully) So far so good. If I interrupt the rping client with CTRL-C, then the client hangs hard. I have, I believe, traced this to a deadlock between ib_destroy_qp() and ucma_close(). It looks like librdmacm has a ((destructor)) function defined that results in a call to ibv_device_close() and ultimately in ::destroy_qp(). That seems reasonable, and it all happens as the OS unloads the application. However, it is (I believe) happening before the "rdma_cm" device file descriptor is 'closed' by the OS as the application terminates. [rdma_destroy_event_channel() would normally do this, but it doesn't get called when the application is interrupted by SIGINT.] Our driver (as do all drivers I've seen) performs an atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'. Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet been called), a cm_id still has an active reference to the qp, and the wait_event() will end up 'wait'ing. 
So, the application cleanup process is blocked, essentially waiting for kernel::ucma_close() to be called ... which won't happen because the application unload code is blocked in destroy_qp() ==> deadlock. First, does my analysis make sense? Perhaps my device driver should do additional work in ib_destroy_qp() that will trigger the destruction of the cm_id... [but that doesn't seem consistent with other drivers I've seen.] Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm" device is closed before calling ibv_device_close()? I'm just not sure if this is a driver issue, an application issue, or something in between. Also, I don't have access to any other IWarp hardware, so I can't test this scenario in a different environment... Any help/advice would be greatly appreciated! Thanks for your time, --Clark Tucker -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Jul 18 10:58:35 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 10:58:35 -0700 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: (Clark Tucker's message of "Wed, 18 Jul 2007 11:42:11 -0600") References: Message-ID: > Our driver (as do all drivers I've seen) performs an > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in 'destroy_qp()'. > Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't yet > been called), a cm_id still has an active reference to the qp, and the > wait_event() will end up 'wait'ing. In the other drivers I know well (basically mthca and mlx4, since I wrote them), the qp->refcount being waited for is an internal driver refcount, and is used to make sure that the destroy QP operation waits until any active interrupt handlers are done with the QP. So I think the problem is that you are letting a cm_id bump the QP's reference count somehow. 
> Perhaps my device driver should do additional work in ib_destroy_qp() that > will trigger the destruction of the cm_id... [but that doesn't seem > consistent with other drivers I've seen.] That doesn't make sense. I think it's OK if upper layers are left with a stale pointer to your QP -- let them worry about it. Maybe it's an iWARP thing that I don't really understand (I'm much more familiar with the IB driver interface) but I don't think that the cxgb3 driver runs into this issue. > Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm" > device is closed before calling ibv_device_close()? No, because then some other (possibly malicious) app could still cause the deadlock and potentially create a bunch of unkillable processes. - R. From clark.tucker at gmail.com Wed Jul 18 11:18:10 2007 From: clark.tucker at gmail.com (Clark Tucker) Date: Wed, 18 Jul 2007 12:18:10 -0600 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: References: Message-ID: Thanks for the quick reply. Comments below. On 7/18/07, Roland Dreier wrote: > > > Our driver (as do all drivers I've seen) performs an > > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in > 'destroy_qp()'. > > Because the rdma_cm device hasn't been closed (i.e., ucma_close() hasn't > yet > > been called), a cm_id still has an active reference to the qp, and the > > wait_event() will end up 'wait'ing. > > In the other drivers I know well (basically mthca and mlx4, since I > wrote them), the qp->refcount being waited for is an internal driver > refcount, and is used to make sure that the destroy QP operation waits > until any active interrupt handlers are done with the QP. So I think > the problem is that you are letting a cm_id bump the QP's reference > count somehow. I guess this really is relevant only for IWarp. Other IWarp drivers I've seen do an atomic_inc(&qp->refcount) in ::qp_add_ref(). Called via cm_id->device->iwcm->add_ref()?. 
[For example see: iwcm.c::iw_cm_connect()]. This reference is removed by a call to cm_id->device->iwcm->rem_ref() [For example see: iwcm::destroy_cm_id()]. And, to avoid a deadlock, I still believe that this must happen _before_ ib_uverbs_close() [ and ultimately ib_destroy_qp()] is called. > Perhaps my device driver should do additional work in ib_destroy_qp() that > > will trigger the destruction of the cm_id... [but that doesn't seem > > consistent with other drivers I've seen.] > > That doesn't make sense. I think it's OK if upper layers are left > with a stale pointer to your QP -- let them worry about it. Maybe > it's an iWARP thing that I don't really understand (I'm much more > familiar with the IB driver interface) but I don't think that the > cxgb3 driver runs into this issue. > > > Perhaps the application (i.e., librdmacm) should make sure the "rdma_cm" > > device is closed before calling ibv_device_close()? > > No, because then some other (possibly malicious) app could still cause > the deadlock and potentially create a bunch of unkillable processes. Very true...good point. - R. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Wed Jul 18 11:49:30 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 11:49:30 -0700 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: References: Message-ID: <469E60BA.3070703@ichips.intel.com> > I have, I believe, traced this to a deadlock between ib_destroy_qp() and > ucma_close(). It looks like librdmacm has a ((destructor)) function > defined that results in a call to ibv_device_close() and ultimately in > ::destroy_qp(). That seems reasonable, and it all happens as > the OS unloads the application. > > However, it is (I believe) happening before the "rdma_cm" device file > descriptor is 'closed' by the OS as the application terminates. 
> [rdma_destroy_event_channel() would normally do this, but it doesn't get > called when the application is interrupted by SIGINT.] This seems like an iWarp specific issue caused by the following code in iw_cm_connect(): /* Get the ib_qp given the QPN */ qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); if (!qp) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); return -EINVAL; } cm_id->device->iwcm->add_ref(qp); I think the reference is normally removed in cm_close_handler: if (cm_id_priv->qp) { cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); cm_id_priv->qp = NULL; } The upstream iWarp drivers must already be able to handle this situation, or I'm sure we would have seen the problem before. I'm just not familiar enough with the iWarp drivers to see what they do to handle it. I'll continue reading through the code, but maybe Steve can explain how to avoid the problem. I wonder if it would be better if the iWarp CM acquired/released the QP reference on a per call basis, rather than holding a reference throughout the entire connection. - Sean From swise at opengridcomputing.com Wed Jul 18 12:03:20 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 18 Jul 2007 14:03:20 -0500 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: <469E60BA.3070703@ichips.intel.com> References: <469E60BA.3070703@ichips.intel.com> Message-ID: <469E63F8.90209@opengridcomputing.com> Sean Hefty wrote: >> I have, I believe, traced this to a deadlock between ib_destroy_qp() >> and ucma_close(). It looks like librdmacm has a ((destructor)) >> function defined that results in a call to ibv_device_close() and >> ultimately in ::destroy_qp(). That seems reasonable, and it >> all happens as the OS unloads the application. >> However, it is (I believe) happening before the "rdma_cm" device file >> descriptor is 'closed' by the OS as the application terminates. 
>> [rdma_destroy_event_channel() would normally do this, but it doesn't >> get called when the application is interrupted by SIGINT.] > > This seems like an iWarp specific issue caused by the following code in > iw_cm_connect(): > > /* Get the ib_qp given the QPN */ > qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); > if (!qp) { > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > return -EINVAL; > } > cm_id->device->iwcm->add_ref(qp); > > I think the reference is normally removed in cm_close_handler: > > if (cm_id_priv->qp) { > cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > cm_id_priv->qp = NULL; > } > > > The upstream iWarp drivers must already be able to handle this > situation, or I'm sure we would have seen the problem before. I'm just > not familiar enough with the iWarp drivers to see what they do to handle > it. I'll continue reading through the code, but maybe Steve can > explain how to avoid the problem. > > I wonder if it would be better if the iWarp CM acquired/released the QP > reference on a per call basis, rather than holding a reference > throughout the entire connection. > The design assumes the iwcm can hold this reference and cache the qp ptr. In the iwarp design, the cm_id (connection) and qp are tightly bound once the connection is transitioned into rdma mode. This is different than infiniband. I still don't see the deadlock? Steve. From tziporet at dev.mellanox.co.il Wed Jul 18 12:06:55 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 18 Jul 2007 22:06:55 +0300 Subject: [ofa-general] Re: [PATCH] mlx4: increase max outstanding rdma reads per qp In-Reply-To: References: <200707171311.43680.jackm@dev.mellanox.co.il> <469D17C2.3040403@mellanox.co.il> Message-ID: <469E64CF.8000607@mellanox.co.il> Roland Dreier wrote: > Have you tested this to know whether it matters? Increasing the limit > uses more memory per QP... 
> It gives some benefit but not as substantial as in ConnectX, so I guess we do not need this after all. > Does the rdma read latency test in OFED queue up enough work requests > to measure this? > yes Tziporet From sashak at voltaire.com Wed Jul 18 12:22:18 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 18 Jul 2007 22:22:18 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com> References: <86hco13c7e.fsf@sw053.lab.mtl.com> Message-ID: <20070718192217.GE27878@sashak.voltaire.com> Hi Eitan, On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > Hi Sasha > > When QoS setup is done the code was trying to send updates of > vl_arb_high_limit by req_set of PORT_INFO with the new data. > However, at that stage the SM still did not assign LIDs to the ports. > So the sent PortInfo.base_lid was still zero. The specification does not > allow for such LIDs (they are considered illegal). > > The patch below fixes this by storing the calculated value and later > using it in link and lid managers. Good, thanks (and this also saves one PortInfo update MAD). One question below: > > Eitan > > Signed-off-by: Eitan Zahavi > [snip...] > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c > index bc3f8b3..ed76382 100644 > --- a/opensm/opensm/osm_lid_mgr.c > +++ b/opensm/opensm/osm_lid_mgr.c > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > ib_port_info_get_port_state(p_old_pi) ) > send_set = TRUE; > } > + > + /* provide the vl_high_limit from the qos mgr */ > + if (p_mgr->p_subn->opt.no_qos == FALSE) > + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > + { > + send_set = TRUE; > + p_pi->vl_high_limit = p_physp->vl_high_limit; > + } This part of the code is for port_num != 0, so VLHighLimit setup will be skipped for switch enhanced port 0. Is that expected? If so, why? 
Sasha From jgunthorpe at obsidianresearch.com Wed Jul 18 12:27:45 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 18 Jul 2007 13:27:45 -0600 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> Message-ID: <20070718192745.GY13618@obsidianresearch.com> On Tue, Jul 17, 2007 at 10:39:11PM -0700, Roland Dreier wrote: > > IMHO, I still think that without some kind of SM/SA sourced > > invalidation mechanism all client side caching (including the ipoib > > stuff we have now) is a bad idea. > > But for IPoIB at least doing a path lookup for every packet is > obviously not feasible. And ARP table aging gives a way to recover > from stale cached data, eventually at least. Well, aside from Michael's points about the current implementation, even a perfect version relying only on ARP will still have annoying failure modes. ARP in Ethernet has a built-in means to revoke a bad MAC, and IB also will be able to revoke a bad GID - but since the path information in incoming ARP LRHs isn't used it doesn't fix changes in the network caused by the SM. ARP entry aging helps, but IIRC there are cases where aging can be slowed if the right packets are Rx'd. Also, I think I meant 'bad idea' ==> 'has annoying and subtle failure modes' - UD ipoib definitely needs to cache LRH data with ARP entries. Jason From jgunthorpe at obsidianresearch.com Wed Jul 18 12:27:50 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 18 Jul 2007 13:27:50 -0600 Subject: [ofa-general] Further 2.6.23 merge plans... 
In-Reply-To: <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com> References: <20070718050928.GA3103@obsidianresearch.com> <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com> Message-ID: <20070718192750.GA8931@obsidianresearch.com> On Tue, Jul 17, 2007 at 11:04:54PM -0700, Sean Hefty wrote: > >IMHO, I still think that without some kind of SM/SA sourced > >invalidation mechanism all client side caching (including the ipoib > >stuff we have now) is a bad idea. > Nothing precludes a user space daemon from updating the cache at > timed intervals, or from communicating with an SA in some vendor > defined way to maintain coherency. I'm only trying to provide the > kernel framework. (We can debate whether another framework would > have been better, and I've held this discussion on the list > before...) I do envision someone creating user space applications > to control refreshes and, with local SA extensions, allow > pre-loading of the cache, updates to specific paths, etc. So, my main concern is with the role of kernel caching and especially with how control is exported to user space. Clearly the kernel needs a fast lookup cache for things like ipoib and others. I don't think a kernel module needs or wants a full-on distributed SA. I personally think a simple in-kernel (small) fast lookup cache merged with the ipoib cache that has a netlink interface to userspace to add/delete/flush entries is a very good solution that will keep being useful in future. netlink would also carry cache miss queries to userspace. In the absence of a daemon the kernel could query on its own but cache very conservatively. A userspace version of the very aggressive cache you have now could also be created right away. This is because I firmly do not believe in caching as a solution to the scalability problems. It must be solved with some level of replication and distribution of the SA data and algorithms. 
With that view pre-loading a giant kernel cache is exactly the wrong kind of user<->kernel interface. Maybe you could summarise how the user/kernel interface works? The last I saw was something based on MADs that looked very inefficient compared with netlink. Jason From mshefty at ichips.intel.com Wed Jul 18 12:28:02 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 12:28:02 -0700 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: <469E63F8.90209@opengridcomputing.com> References: <469E60BA.3070703@ichips.intel.com> <469E63F8.90209@opengridcomputing.com> Message-ID: <469E69C2.6070404@ichips.intel.com> >> I wonder if it would be better if the iWarp CM acquired/released the >> QP reference on a per call basis, rather than holding a reference >> throughout the entire connection. >> > > The design assumes the iwcm can hold this reference and cache the qp ptr. > In the iwarp design, the cm_id (connection) and qp are tightly bound > once the connection is transitioned into rdma mode. This is different > than infiniband. I don't know if this tight binding is necessary in the implementation. The cm_id could store the qpn, rather than a pointer to the structure. When necessary, the qp pointer could be acquired using the qpn, then released at the end of the function call. I don't think we need to hold the reference on the qp structure for the entire connection. I'm just tossing this out as an idea. I'm not familiar enough with the details to claim that it's a better approach over what's currently done. > I still don't see the deadlock? What happens if a user calls destroy qp immediately after connecting it? 
- Sean From swise at opengridcomputing.com Wed Jul 18 12:33:33 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 18 Jul 2007 14:33:33 -0500 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: References: Message-ID: <469E6B0D.9080107@opengridcomputing.com> Clark Tucker wrote: > Hello all, > > First, the background: I am writing a linux device driver to provide > IWarp device support for our hardware. I'm currently running kernel > 2.6.20-rc4 and OFED-1.2-rc2. I realize these are somewhat old, but I > have examined newer source, and haven't found any changes that seem > immediately relevant. > > I am experiencing the following behavior: > > rping -s .... (server starts fine, loads proper user-space library, etc) > > rping -c ... (client starts fine, ... connects to server, and exchanges > data successfully) > So far so good. > > If I interrupt the rping client with CTRL-C, then the client hangs hard. > > I have, I believe, traced this to a deadlock between ib_destroy_qp() and > ucma_close(). It looks like librdmacm has a ((destructor)) function > defined that results in a call to ibv_device_close() and ultimately in > ::destroy_qp(). That seems reasonable, and it all happens as > the OS unloads the application. > > However, it is (I believe) happening before the "rdma_cm" device file > descriptor is 'closed' by the OS as the application terminates. > [rdma_destroy_event_channel() would normally do this, but it doesn't get > called when the application is interrupted by SIGINT.] > > Our driver (as do all drivers I've seen) performs an > atomic_dec(&qp->refcount) and wait_event(&qp->refcount) in > 'destroy_qp()'. Because the rdma_cm device hasn't been closed (i.e., > ucma_close() hasn't yet been called), a cm_id still has an active > reference to the qp, and the wait_event() will end up 'wait'ing. > Your destroy_qp() method must destroy the active rdma connection which will force the iwcm to release the reference on the qp. 
If you look at the chelsio driver, you'll see this is done before waiting on the refcnt to go to zero: from iwch_destroy_qp(): > attrs.next_state = IWCH_QP_STATE_ERROR; > iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); > wait_event(qhp->wait, !qhp->ep); Once the qhp->ep handle has been disassociated from the qp, the driver knows the iwcm has been given the CLOSE event and removed its reference on the qp. Here is the iwcm close event handler. Note it removes the ref: From cm_close_handler(): > if (cm_id_priv->qp) { > cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); > cm_id_priv->qp = NULL; > } It then can wait for any further references from interrupt handlers: > > atomic_dec(&qhp->refcnt); > wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); > Perhaps my device driver should do additional work in ib_destroy_qp() > that will trigger the destruction of the cm_id... [but that doesn't seem > consistent with other drivers I've seen.] > Are you looking at the chelsio or ammaso iwarp drivers? This code is all iwarp specific... Hope this helps... Steve From swise at opengridcomputing.com Wed Jul 18 12:34:17 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 18 Jul 2007 14:34:17 -0500 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: <469E63F8.90209@opengridcomputing.com> References: <469E60BA.3070703@ichips.intel.com> <469E63F8.90209@opengridcomputing.com> Message-ID: <469E6B39.1000206@opengridcomputing.com> Steve Wise wrote: > Sean Hefty wrote: >>> I have, I believe, traced this to a deadlock between ib_destroy_qp() >>> and ucma_close(). It looks like librdmacm has a ((destructor)) >>> function defined that results in a call to ibv_device_close() and >>> ultimately in ::destroy_qp(). That seems reasonable, and it >>> all happens as the OS unloads the application. >>> However, it is (I believe) happening before the "rdma_cm" device file >>> descriptor is 'closed' by the OS as the application terminates. 
>>> [rdma_destroy_event_channel() would normally do this, but it doesn't >>> get called when the application is interrupted by SIGINT.] >> >> This seems like an iWarp specific issue caused by the following code >> in iw_cm_connect(): >> >> /* Get the ib_qp given the QPN */ >> qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn); >> if (!qp) { >> spin_unlock_irqrestore(&cm_id_priv->lock, flags); >> return -EINVAL; >> } >> cm_id->device->iwcm->add_ref(qp); >> >> I think the reference is normally removed in cm_close_handler: >> >> if (cm_id_priv->qp) { >> cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp); >> cm_id_priv->qp = NULL; >> } >> >> >> The upstream iWarp drivers must already be able to handle this >> situation, or I'm sure we would have seen the problem before. I'm >> just not familiar enough with the iWarp drivers to see what they do to >> handle it. I'll continue reading through the code, but maybe Steve >> can explain how to avoid the problem. >> >> I wonder if it would be better if the iWarp CM acquired/released the >> QP reference on a per call basis, rather than holding a reference >> throughout the entire connection. >> > > The design assumes the iwcm can hold this reference and cache the qp ptr. > In the iwarp design, the cm_id (connection) and qp are tightly bound > once the connection is transitioned into rdma mode. This is different > than infiniband. > > I still don't see the deadlock? > I've re-read this thread and I think I've posted the answers for Clark... Steve. > > Steve. 
> From swise at opengridcomputing.com Wed Jul 18 12:36:10 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 18 Jul 2007 14:36:10 -0500 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: <469E69C2.6070404@ichips.intel.com> References: <469E60BA.3070703@ichips.intel.com> <469E63F8.90209@opengridcomputing.com> <469E69C2.6070404@ichips.intel.com> Message-ID: <469E6BAA.8020204@opengridcomputing.com> Sean Hefty wrote: >>> I wonder if it would be better if the iWarp CM acquired/released the >>> QP reference on a per call basis, rather than holding a reference >>> throughout the entire connection. >>> >> >> The design assumes the iwcm can hold this reference and cache the qp >> ptr. In the iwarp design, the cm_id (connection) and qp are tightly >> bound once the connection is transitioned into rdma mode. This is >> different than infiniband. > > I don't know if this tight binding is necessary in the implementation. > The cm_id could store the qpn, rather than a pointer to the structure. > When necessary, the qp pointer could be acquired using the qpn, then > released at the end of the function call. I don't think we need to hold > the reference on the qp structure for the entire connection. > > I'm just tossing this out as an idea. I'm not familiar enough with the > details to claim that it's a better approach over what's currently done. Maybe, but I'm not gonna change this code now. It was too painful to get working... ;-) > >> I still don't see the deadlock? > > What happens if a user calls destroy qp immediately after connecting it? > > See my reply to Clark. The iwarp provider _must_ disassociate the endpoint/cm_id from the qp in destroy_qp()... This involves aborting or closing the connection and passing a CLOSE event to the iwcm which removes its reference. Steve. 
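The teardown ordering described in this thread (abort the connection so the iwcm drops its QP reference, and only then wait for the refcount to reach zero) can be illustrated with a small userspace simulation. This is a sketch only: struct sim_qp, struct sim_cm_id and the sim_* helpers are hypothetical names invented for illustration, not the kernel's ib_qp or iw_cm_id, and the driver's wait_event() is replaced by a return code that says whether the wait could ever finish.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a provider QP; not a real kernel type. */
struct sim_qp {
    int refcnt;     /* 1 for the owner, +1 while the CM caches the QP */
    int connected;  /* nonzero while an endpoint is attached */
};

/* Hypothetical stand-in for the iWARP CM's per-connection state. */
struct sim_cm_id {
    struct sim_qp *qp;  /* cached QP pointer, held for the whole connection */
};

/* Models the add_ref() taken at connect time: the CM pins the QP. */
static void sim_cm_connect(struct sim_cm_id *id, struct sim_qp *qp)
{
    qp->refcnt++;
    qp->connected = 1;
    id->qp = qp;
}

/* Models the CLOSE event handler: the CM drops its reference. */
static void sim_cm_close_handler(struct sim_cm_id *id)
{
    if (id->qp) {
        assert(id->qp->refcnt > 0);
        id->qp->refcnt--;
        id->qp->connected = 0;
        id->qp = NULL;
    }
}

/*
 * Models destroy_qp(). Returns 0 if the refcount reaches zero, or -1 if
 * a real driver would block forever in wait_event() at this point.
 * abort_first selects whether the connection is torn down before waiting.
 */
static int sim_destroy_qp(struct sim_qp *qp, struct sim_cm_id *id,
                          int abort_first)
{
    if (abort_first && qp->connected)
        sim_cm_close_handler(id);   /* provider aborts; CM sees CLOSE */

    qp->refcnt--;                   /* drop the owner's reference */
    return qp->refcnt == 0 ? 0 : -1;
}
```

With abort_first set to 0 on a connected QP, sim_destroy_qp() returns -1, which corresponds to the hang reported at the start of the thread; with abort_first set to 1 it returns 0 because the CM's reference is gone before the wait.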
From clark.tucker at gmail.com Wed Jul 18 12:57:54 2007 From: clark.tucker at gmail.com (Clark Tucker) Date: Wed, 18 Jul 2007 13:57:54 -0600 Subject: [ofa-general] rping / librdmacm deadlock question In-Reply-To: <469E6B0D.9080107@opengridcomputing.com> References: <469E6B0D.9080107@opengridcomputing.com> Message-ID: Steve, Thank you. Looks like this was my problem. I should have looked more closely at the chelsio driver. Sorry for the interruption, and thanks again for your help. --clark On 7/18/07, Steve Wise wrote: > > Your destroy_qp() method must destroy the active rdma connection which > will force the iwcm to release the reference on the qp. If you look at > the chelsio driver, you'll see this is done before waiting on the refcnt > to go to zero: .... Hope this helps... > > > Steve > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jonathan.Robertson at 3leafnetworks.com Wed Jul 18 13:19:27 2007 From: Jonathan.Robertson at 3leafnetworks.com (Jonathan Robertson) Date: Wed, 18 Jul 2007 13:19:27 -0700 Subject: [ofa-general] libsdp in OFED 1.1 Message-ID: <7C1D552561AF0544ACC7CF6F10E4966ECB5353@chronus.3leafnetworks.corp> Hello, I have been using libsdp, and preloading it with the application. I would like to have it automatically preloaded, but am concerned about some error messages that seem harmless. So I don't want to have our client use the ld.so.preload if there are going to be messages. I see the following when I run a simple 'ls' # ls Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found . .. # Any suggestions? I have the following in libsdp.conf Log min-level 9 destination syslog Use both server netserver *:* Use both client netperf *:* Our client is interested in having weblogic communicate with the oracle DB using SDP, and the interface to oracle and weblogic being accessible via tcp/ip over Ethernet as well. Thanks! 
Jonathan From mshefty at ichips.intel.com Wed Jul 18 13:52:25 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 13:52:25 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <20070718192750.GA8931@obsidianresearch.com> References: <20070718050928.GA3103@obsidianresearch.com> <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com> <20070718192750.GA8931@obsidianresearch.com> Message-ID: <469E7D89.7040809@ichips.intel.com> > So, my main concern is with the role of kernel caching and especially with > how control is exported to user space. The only control currently exported by the local SA is a module parameter that allows a user to force a refresh of the entire cache. I do not want to extend this until we can get at least some basic PR caching functionality merged. I want something small that we can build on, and the local_sa patch is already 1300 lines of code, with another 1000 lines of code to support informinfo registration. > Clearly the kernel needs a fast lookup cache for things like ipoib and > others. I don't think a kernel module needs or wants a full on > distributed SA. We are talking about PR caching only at this point, with possible extensions to support QoS. Other SA information is not cached or needed. For all to all connections, current code does something like the following: 1. Resolves IP addresses to DGIDs using ARP. This results in IPoIB querying the SA and caching 1 PR per DGID. 2. Apps query the SA for PRs, with 1 PR query per DGID. Eventually we'll get back the same set of PRs that IPoIB already had cached. 3. Establish the connections. The IB CM stores the PR information with each connection in order to set the QP attributes properly. We end up with redundant queries and the PR being cached in multiple places. One optimization is to replace the N PR queries with a single, more efficient GetTable query. 
A second optimization is to centralize the PR caching. The local SA does the first, and starts us down the road of the second. > I personally think a simple in-kernel (small) fast lookup cache merged > with the ipoib cache that has a netlink interface to userspace to > add/delete/flush entries is a very good solution that will keep being > useful in future. netlink would also carry cache miss queries to > userspace. In the absence of a daemon the kernel could query on its own > but cache very conservatively. A userspace version of the very > aggressive cache you have now could also be created right away. I believe that the PR caching should be done outside of IPoIB. Other paths may exist that IPoIB does not use. > This is because I firmly do not believe in caching as a solution to the > scalability problems. It must be solved with some level of replication > and distribution of the SA data and algorithms. PR caching *is* replication of the SA data. The local SA works with all existing SAs. It is not tied to one vendor, nor does it require changes to the SAs. Sure, we can define vendor specific protocols to assist with/optimize synchronization, but I don't believe it is necessary in an initial submission. (In fact I think it's undesirable at this point, since it would require changes to the SA.) > Maybe you could summarise how the user/kernel interface works? The > last I saw was something based on MADs that looked very inefficient > compared with netlink. I suggested a MAD interface to the local SA as being the most extensible. It allows interacting with the cache from a local or remote node in a very IB fashion. The local SA is located over QP1, and any new protocols can re-use the existing SA MAD format. For example, the cache could be loaded using a 'SetTable PR' MAD. It doesn't matter if the MAD is sent from a local user space daemon, some distributed SA agent, or the master SA. Paths can be invalidated by sending 'Delete PR' MADs. 
It may also be possible to extend such an interface for QoS purposes. - Sean From mshefty at ichips.intel.com Wed Jul 18 13:54:35 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 13:54:35 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <469E4876.7020805@ichips.intel.com> References: <000001c7c8fc$360eec90$5dcc180a@amr.corp.intel.com> <469E4876.7020805@ichips.intel.com> Message-ID: <469E7E0B.7040703@ichips.intel.com> > Based on discussions so far, maybe the best path forward from here is to > delay until 2.6.24. This will let us add this version to OFED 1.3 for > more widespread testing, plus give us the time that we need to come up > with a plan to integrate QoS with the local SA. I spoke with Matt on this, and he agreed with this plan. - Sean From sashak at voltaire.com Wed Jul 18 14:00:46 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 00:00:46 +0300 Subject: [ofa-general] management master git repository Message-ID: <20070718210046.GH27878@sashak.voltaire.com> Hi All, Please note that due to maintainership transfer "master" of OFA management userspace tree (OpenSM, infiniband-diags) is located at: git://git.openfabrics.org/~sashak/management All OpenSM, Diags, libibumad and libibmad upstream changes will be committed into this repo. 'ofed_1_2' branch exists too, but since there were no changes in last days it is identical to one in ~halr/management. I updated OFA wiki pages accordingly. Sasha From jgunthorpe at obsidianresearch.com Wed Jul 18 14:32:43 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 18 Jul 2007 15:32:43 -0600 Subject: [ofa-general] Further 2.6.23 merge plans... 
In-Reply-To: <469E7D89.7040809@ichips.intel.com> References: <20070718050928.GA3103@obsidianresearch.com> <000101c7c901$900cb290$5dcc180a@amr.corp.intel.com> <20070718192750.GA8931@obsidianresearch.com> <469E7D89.7040809@ichips.intel.com> Message-ID: <20070718213243.GZ13618@obsidianresearch.com> On Wed, Jul 18, 2007 at 01:52:25PM -0700, Sean Hefty wrote: > 1. Resolves IP addresses to DGIDs using ARP. This results in IPoIB > querying the SA and caching 1 PR per DGID. > 2. Apps query the SA for PRs, with 1 PR query per DGID. Eventually > we'll get back the same set of PRs that IPoIB already had cached. > 3. Establish the connections. The IB CM stores the PR information with > each connection in order to set the QP attributes properly. So, since you flush the cache for your MPI jobs the gain you see is basically from re-using the data collected by ipoib? If this is the case, do you get the same first-order benefit by essentially using the ipoib cache for all PR queries? > >I personally think a simple in-kernel (small) fast lookup cache merged > >with the ipoib cache that has a netlink interface to userspace to > >add/delete/flush entries is a very good solution that will keep being > >useful in future. netlink would also carry cache miss queries to > >userspace. In the absence of a daemon the kernel could query on its own > >but cache very conservatively. A userspace version of the very > >aggressive cache you have now could also be created right away. > > I believe that the PR caching should be done outside of IPoIB. Other > paths may exist that IPoIB does not use. When I said merged, I was thinking of eliminating the ipoib cache component and using your new module. There doesn't seem to be much sense in caching twice, especially since ipoib already lacks anything to keep the cache coherent with the SA - and that is what the main work is. One PR record cache in the kernel, and it would be in roughly the same architectural spot as your local sa module. 
> >This is because I firmly do not believe in caching as a solution to the
> >scalability problems. It must be solved with some level of replication
> >and distribution of the SA data and algorithms.
>
> PR caching *is* replication of the SA data. The local SA works with all
> existing SAs. It is not tied to one vendor, nor does it require changes
> to the SAs. Sure, we can define vendor specific protocols to assist
> with/optimize synchronization, but I don't believe it is necessary in an
> initial submission. (In fact I think it's undesirable at this point,
> since it would require changes to the SA.)

Ok, so I draw a distinction between caching _some_ final end products
(ie PRs) without any coherency to the original source data, and
coherently replicating enough data to compute _any_ query on
demand. Some vs All and Pull vs Push.

I'm trying to say, I think a simple kernel cache itself is fine, but
there should be only 1 cache (get rid of ipoib) and it should have a
really good interface to userspace so that the really hard problems
can be solved through user space code.

Not suggesting cache to SA vendor specific hooks or anything like
that, just a well defined kernel module that lets user space co-opt
path resolution, which needs to include a kernel cache component due
to ipoib.

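[Editorial illustration] The single-cache idea argued for in this thread (entries added, deleted and flushed on request, lookups that report a miss, and conservative lifetimes so stale entries age out) can be sketched in plain user-space C. This is only a sketch under stated assumptions: every name below (pr_cache, pr_entry, PR_CACHE_SLOTS, PR_DEFAULT_TTL) is hypothetical, the record is trimmed to a DGID and a DLID, and the 30-second lifetime is an arbitrary placeholder. None of it is taken from the local SA patch or from ipoib.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PR_CACHE_SLOTS 64   /* hypothetical: small fixed-size table */
#define PR_DEFAULT_TTL 30   /* hypothetical: conservative lifetime, in seconds */

/* Hypothetical, heavily trimmed path record: just enough to show the idea. */
struct pr_entry {
    uint8_t  dgid[16];  /* destination GID, used as the lookup key */
    uint16_t dlid;      /* stand-in for the rest of the path record */
    uint64_t expires;   /* absolute time at which the entry becomes stale */
    int      valid;
};

struct pr_cache {
    struct pr_entry slot[PR_CACHE_SLOTS];
};

/* Linear search is fine for a table this small. */
static struct pr_entry *pr_find(struct pr_cache *c, const uint8_t dgid[16])
{
    for (size_t i = 0; i < PR_CACHE_SLOTS; i++)
        if (c->slot[i].valid && !memcmp(c->slot[i].dgid, dgid, 16))
            return &c->slot[i];
    return NULL;
}

/* add: insert or refresh an entry; 0 on success, -1 if the table is full. */
int pr_cache_add(struct pr_cache *c, const uint8_t dgid[16], uint16_t dlid,
                 uint64_t now, uint64_t ttl)
{
    struct pr_entry *e = pr_find(c, dgid);

    /* No existing entry: reuse the first free or expired slot. */
    for (size_t i = 0; !e && i < PR_CACHE_SLOTS; i++)
        if (!c->slot[i].valid || c->slot[i].expires <= now)
            e = &c->slot[i];
    if (!e)
        return -1;

    memcpy(e->dgid, dgid, 16);
    e->dlid = dlid;
    e->expires = now + ttl;
    e->valid = 1;
    return 0;
}

/* delete: drop one entry; 0 if it existed, -1 otherwise. */
int pr_cache_del(struct pr_cache *c, const uint8_t dgid[16])
{
    struct pr_entry *e = pr_find(c, dgid);

    if (!e)
        return -1;
    e->valid = 0;
    return 0;
}

/* flush: drop everything. */
void pr_cache_flush(struct pr_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* lookup: 0 and *dlid filled on a fresh hit; -1 on a miss or stale entry. */
int pr_cache_lookup(struct pr_cache *c, const uint8_t dgid[16],
                    uint64_t now, uint16_t *dlid)
{
    struct pr_entry *e = pr_find(c, dgid);

    if (!e || e->expires <= now)
        return -1;
    *dlid = e->dlid;
    return 0;
}
```

A miss (-1 from pr_cache_lookup) marks the point where a real implementation would have to forward the query, either to the SA or to a user-space agent; which of the two, and over what interface, is exactly what the thread is debating.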
Jason From rdreier at cisco.com Wed Jul 18 15:52:57 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 15:52:57 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get another batch of changes for 2.6.23, including the beginnings of cleaning up the work request posting code in mthca and mlx4: Dotan Barak (2): IB/mlx4: Take sizeof the correct pointer in call to memset() RDMA/cma: Remove local write permission from QP access flags Hoang-Nam Nguyen (7): IB/ehca: Fix memory leak in error path of ehca_get_dma_mr() IB/ehca: Use common error code mapping instead of specific ones IB/ehca: Use #define for "pages per register_rpage" instead of hardcoded value IB/ehca: Use macro to calculate number of chunks in a mem block IB/ehca: MR/MW structure refactoring IB/ehca: Restructure ehca_set_pagebuf() IB/ehca: Fix warnings issued by checkpatch.pl Jack Morgenstein (4): IB/mlx4: Fix flow label returned from query QP IB/mlx4: Fix port returned from query QP for QPs in INIT state mlx4_core: Reset device when internal error is detected IB/mlx4: Increase max outstanding RDMA reads as target Joachim Fenkes (1): IB/ehca: Fix HW level autodetection Roland Dreier (14): IB/mthca: Schedule MSI support for removal IB/mthca: Fix printk format used for firmware version in warning IB/iser: Make a couple of functions static IB/ipath: Make a few functions static IB/ipath: Remove ipath_get_user_pages_nocopy() IB/cm: Make internal function cm_get_ack_delay() static IB/mthca: Use uninitialized_var() for f0 IB/mlx4: Return receive queue sizes for userspace QPs from query QP IB/mthca: Factor out setting WQE data segment entries IB/mlx4: Factor out setting WQE data segment entries IB/mlx4: Factor out setting other 
WQE segments IB/mthca: Factor out setting WQE remote address and atomic segment entries IB/mthca: Factor out setting WQE UD segment entries IB/mthca: Simplify use of size0 in work request posting Steve Wise (1): RDMA/cxgb3: Remove cm_id reference on listen failures Documentation/feature-removal-schedule.txt | 10 + drivers/infiniband/core/cm.c | 2 +- drivers/infiniband/core/cma.c | 2 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 + drivers/infiniband/hw/ehca/ehca_av.c | 2 +- drivers/infiniband/hw/ehca/ehca_classes.h | 54 +- drivers/infiniband/hw/ehca/ehca_classes_pSeries.h | 156 ++-- drivers/infiniband/hw/ehca/ehca_cq.c | 2 +- drivers/infiniband/hw/ehca/ehca_eq.c | 3 +- drivers/infiniband/hw/ehca/ehca_hca.c | 28 +- drivers/infiniband/hw/ehca/ehca_irq.c | 56 +- drivers/infiniband/hw/ehca/ehca_iverbs.h | 7 +- drivers/infiniband/hw/ehca/ehca_main.c | 50 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 1087 ++++++++------------- drivers/infiniband/hw/ehca/ehca_mrmw.h | 21 +- drivers/infiniband/hw/ehca/ehca_qes.h | 22 +- drivers/infiniband/hw/ehca/ehca_qp.c | 39 +- drivers/infiniband/hw/ehca/ehca_reqs.c | 15 +- drivers/infiniband/hw/ehca/ehca_tools.h | 31 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 10 +- drivers/infiniband/hw/ehca/hcp_if.c | 8 +- drivers/infiniband/hw/ehca/hcp_phyp.c | 2 +- drivers/infiniband/hw/ehca/hipz_fns_core.h | 4 +- drivers/infiniband/hw/ehca/hipz_hw.h | 24 +- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 2 +- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 4 +- drivers/infiniband/hw/ipath/ipath_driver.c | 2 +- drivers/infiniband/hw/ipath/ipath_eeprom.c | 4 +- drivers/infiniband/hw/ipath/ipath_intr.c | 2 +- drivers/infiniband/hw/ipath/ipath_kernel.h | 2 - drivers/infiniband/hw/ipath/ipath_ruc.c | 2 +- drivers/infiniband/hw/ipath/ipath_user_pages.c | 26 - drivers/infiniband/hw/ipath/ipath_verbs.c | 2 +- drivers/infiniband/hw/ipath/ipath_verbs.h | 4 - drivers/infiniband/hw/mlx4/qp.c | 115 ++- drivers/infiniband/hw/mthca/mthca_main.c | 22 +- 
drivers/infiniband/hw/mthca/mthca_qp.c | 221 ++--- drivers/infiniband/hw/mthca/mthca_srq.c | 28 +- drivers/infiniband/hw/mthca/mthca_wqe.h | 15 + drivers/infiniband/ulp/iser/iscsi_iser.h | 5 - drivers/infiniband/ulp/iser/iser_memory.c | 4 +- drivers/infiniband/ulp/iser/iser_verbs.c | 47 +- drivers/net/mlx4/catas.c | 106 ++- drivers/net/mlx4/eq.c | 56 +- drivers/net/mlx4/intf.c | 2 + drivers/net/mlx4/main.c | 26 +- drivers/net/mlx4/mlx4.h | 13 +- 47 files changed, 1055 insertions(+), 1291 deletions(-) From sean.hefty at intel.com Wed Jul 18 15:53:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 15:53:36 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <20070718213243.GZ13618@obsidianresearch.com> Message-ID: <000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com> >So, since you flush the cache for your MPI jobs the gain you see is >basically by re-using the data collected by ipoib? I need to correct what I said before about our MPI jobs. On our production clusters, we're using the local SA in OFED 1.2, which updates automatically on a timer. This patch removes the timer updates and instead gives control of the update policy to a user space app. The local SA sits beneath the existing ib_sa interface, and would have the PR data available when ipoib requests it. >If this is the case, do you get the same first-order benifit by >essentially using the ipoib cache for all PR queries? There are a couple of benefits. The number of PR queries is reduced from O(n^2) to O(n). The queries can also be done once up front, even started at different times if needed, rather than all at once at job startup. The jobs are also able to make progress even if the SA dies or is unreachable. >I'm trying to say, I think a simple kernel cache itself is fine, but >there should be only 1 cache (get rid of ipoib) and it should have a >really good interface to userspace so that the really hard problems >can be solved through user space code. 
I don't disagree, but (for now anyway) I believe that the natural interface for communicating with an SA related agent is a MAD interface based on the SA management class for the reasons I mentioned earlier. But this is really talking about extensions to the local SA patch, rather than addressing anything fundamentally wrong with the current patch set. - Sean From rdreier at cisco.com Wed Jul 18 16:11:15 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 16:11:15 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: <469E4CA2.2040708@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 18 Jul 2007 10:23:46 -0700") References: <469E4CA2.2040708@linux.vnet.ibm.com> Message-ID: There's still some rather obvious problems with this patch. It would really help if you would read over your patch again I think... anyway: > +#define CM_PACKET_SIZE (1ul << 16) This duplicates IPOIB_CM_MTU I think... certainly it needs to be kept in sync with it somehow. > @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long > dev_kfree_skb_any(skb); > } > > -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) Why is this change here? (This is in the CONFIG_INFINIBAND_IPOIB_CM=n part of ipoib.h) > } > - > #endif Please try to avoid adding extraneous noise to your patch... it makes it harder to focus on the real content. 
> +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE;
> +int max_recv_buf = 1024; /* Default is 1024 MB */
> +
> +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644);
> +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported");
> +
> +module_param_named(max_receive_buffer, max_recv_buf, int, 0644);
> +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB");
> +
> +atomic_t current_rc_qp; /* Active number of RC QPs for NOSRQ */

everything here can be static I think ("make namespacecheck" might be
worth running). And you can use ATOMIC_INIT() instead of putting the
initialization into code.

> - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
> + ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret);

extra noise here (and still wrong -- id might be long long on some
architectures).

> - .event_handler = ipoib_cm_rx_event_handler,

why? seems harmless to just leave this alone for all QPs even if an
SRQ isn't attached.

> + recv_mem_used = (u64)ipoib_recvq_size *
> + (u64)atomic_inc_return(&current_rc_qp)
> + * CM_PACKET_SIZE; /* packets are 64K */

packets might not always be 64K ... just let CM_PACKET_SIZE document
itself (or pick a better name if you think it needs to be clearer).

> + if ((index == max_rc_qp) ||
> + ( recv_mem_used >= max_recv_buf * (1ul << 20))) {

formatting went awry here...

> + spin_unlock_irq(&priv->lock);
> + ipoib_warn(priv, "NOSRQ has reached the configurable limit "
> + "of either %d RC QPs or, max recv buf size of "
> + "0x%x MB\n", max_rc_qp, max_recv_buf);
> + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0);
> + ret = -EINVAL;
> + goto err_alloc_and_post;

there's a bug here... you never undo the atomic_inc() of the number of
RC QPs even though you exit without creating a new connection.

> - if (!p)
> + if (!p) {
> + printk(KERN_WARNING "Failed to allocate RX control block when "
> + "REQ arrived\n");
> return -ENOMEM;
> + }

more unrelated changes...
(feel free to send these as separate patches)

> kfree(p);
> }
>
> +
> cancel_delayed_work(&priv->cm.stale_task);
> }

extra noise in the patch

> + if (!priv->cm.srq) {
> + atomic_dec(&current_rc_qp);
> + }

no need for { } here

> + /* We increase the size of the CQ in the NOSRQ case to prevent CQ
> + * overflow. Every new REQ creates a new RX QP and each QP has an
> + * RX ring associated with it. Therefore we could have
> + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs
> + * in a CQ.
> + */
> + if (!priv->cm.srq)
> + size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size;

only need to do this if CM is enabled

space after - here please too.

that's just from a quick skim of the patch...

From akepner at sgi.com Wed Jul 18 16:22:32 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 18 Jul 2007 16:22:32 -0700
Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix
In-Reply-To:
References: <20070715212445.GG6921@sgi.com>
Message-ID: <20070718232232.GQ16538@sgi.com>

On Mon, Jul 16, 2007 at 09:57:52AM -0700, Roland Dreier wrote:
> Looks reasonable but I would prefer to see explicit tests of the abi
> version so that we use the old register MR ABI for old kernels rather
> than unconditionally passing the extra parameter.

How about the following? This is somewhat untidy, in that the
abi_version is exposed to verbs.c, but it seemed the best way to go.

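[Editorial illustration] The two ideas in the patch that follows, accepting a range of ABI versions and sending the longer command only to kernels that understand it, can be sketched in isolation. This is a hedged sketch, not the libmthca code: base_cmd and ext_cmd are hypothetical stand-ins for struct ibv_reg_mr and struct mthca_reg_mr_abi_ver_2, their fields are invented for the example, and the constants only mirror the values visible in the diff.

```c
#include <stddef.h>
#include <stdint.h>

#define ABI_MIN 1           /* mirrors MTHCA_UVERBS_MIN_ABI_VERSION in the diff */
#define ABI_MAX 2           /* mirrors MTHCA_UVERBS_MAX_ABI_VERSION in the diff */
#define MR_DMAFLUSH 0x1     /* mirrors MTHCA_MR_DMAFLUSH in the diff */

/* Hypothetical stand-in for the base register-MR command. */
struct base_cmd {
    uint64_t start;
    uint64_t length;
    uint64_t hca_va;
    uint32_t pd_handle;
    uint32_t access;
};

/* Hypothetical extended command: the base command comes first, so a
 * pointer to the extended struct is also a valid pointer to the base. */
struct ext_cmd {
    struct base_cmd base;
    uint32_t mr_attrs;
    uint32_t reserved;
};

/* Accept any ABI version inside the supported range. */
int abi_supported(int abi_ver)
{
    return abi_ver >= ABI_MIN && abi_ver <= ABI_MAX;
}

/* Fill the optional fields and return how many bytes of *cmd should be
 * passed to the kernel for the given ABI version. */
size_t prepare_reg_mr(struct ext_cmd *cmd, int abi_ver, int dmaflush)
{
    /* Start from a known state so no stale bits reach the kernel. */
    cmd->mr_attrs = 0;
    cmd->reserved = 0;

    if (abi_ver > 1) {
        if (dmaflush)
            cmd->mr_attrs |= MR_DMAFLUSH;
        return sizeof(struct ext_cmd);
    }
    /* Old kernels only see the base command; the flag is dropped. */
    return sizeof(struct base_cmd);
}
```

The design point is that the caller always allocates the larger structure and only the size handed to the kernel changes, so a single code path serves both ABI versions.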
mthca-abi.h | 11 ++++++++++- mthca.c | 19 +++++++++++++------ verbs.c | 29 ++++++++++++++++++++--------- 3 files changed, 43 insertions(+), 16 deletions(-) -- diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h --- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca-abi.h 2007-06-23 02:00:34.000000000 -0700 +++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca-abi.h 2007-07-18 10:58:07.903823741 -0700 @@ -36,7 +36,8 @@ #include -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_MIN_ABI_VERSION 1 +#define MTHCA_UVERBS_MAX_ABI_VERSION 2 struct mthca_alloc_ucontext_resp { struct ibv_get_context_resp ibv_resp; @@ -50,6 +51,14 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr_abi_ver_2 { + struct ibv_reg_mr ibv_cmd; + __u32 mr_attrs; +#define MTHCA_MR_DMAFLUSH 0x1 +/* flush in-flight DMA on a write to memory region (IA64_SGI_SN2 only) */ + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca.c --- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/mthca.c 2007-06-23 02:00:34.000000000 -0700 +++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/mthca.c 2007-07-18 15:50:07.174842760 -0700 @@ -56,6 +56,8 @@ #include "mthca.h" #include "mthca-abi.h" +int abi_ver = 0; + #ifndef PCI_VENDOR_ID_MELLANOX #define PCI_VENDOR_ID_MELLANOX 0x15b3 #endif @@ -282,11 +284,16 @@ static struct ibv_device *mthca_driver_i return NULL; found: - if (abi_version > MTHCA_UVERBS_ABI_VERSION) { - fprintf(stderr, PFX "Fatal: ABI version %d of %s is too new (expected %d)\n", - abi_version, uverbs_sys_path, MTHCA_UVERBS_ABI_VERSION); + if (abi_version < MTHCA_UVERBS_MIN_ABI_VERSION || + abi_version > MTHCA_UVERBS_MAX_ABI_VERSION) { + fprintf(stderr, PFX "Fatal: ABI version 
%d of %s is not supported " + "(min supported %d, max supported %d)\n", + abi_version, uverbs_sys_path, + MTHCA_UVERBS_MIN_ABI_VERSION, + MTHCA_UVERBS_MAX_ABI_VERSION); return NULL; } + abi_ver = abi_version; dev = malloc(sizeof *dev); if (!dev) { @@ -314,13 +321,13 @@ static __attribute__((constructor)) void */ struct ibv_device *openib_driver_init(struct sysfs_class_device *sysdev) { - int abi_ver = 0; + int abi_version = 0; char value[8]; if (ibv_read_sysfs_file(sysdev->path, "abi_version", value, sizeof value) > 0) - abi_ver = strtol(value, NULL, 10); + abi_version = strtol(value, NULL, 10); - return mthca_driver_init(sysdev->path, abi_ver); + return mthca_driver_init(sysdev->path, abi_version); } #endif /* HAVE_IBV_REGISTER_DRIVER */ diff -rup ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c --- ofa_1_2_user-20070623-0200.orig/src/userspace/libmthca/src/verbs.c 2007-06-23 02:00:34.000000000 -0700 +++ ofa_1_2_user-20070623-0200/src/userspace/libmthca/src/verbs.c 2007-07-18 15:43:13.230506881 -0700 @@ -45,6 +45,8 @@ #include "mthca.h" #include "mthca-abi.h" +extern int abi_ver; + int mthca_query_device(struct ibv_context *context, struct ibv_device_attr *attr) { struct ibv_query_device cmd; @@ -117,26 +119,35 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int dmaflush) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr_abi_ver_2 cmd; + size_t cmd_size; int ret; mr = malloc(sizeof *mr); if (!mr) return NULL; + if (abi_ver > 1) { + cmd.mr_attrs |= (__u32) dmaflush ? 
MTHCA_MR_DMAFLUSH : 0; + cmd_size = sizeof(struct mthca_reg_mr_abi_ver_2); + } else + cmd_size = sizeof(struct ibv_reg_mr); + #ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS { struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, cmd_size, &resp, + sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, cmd_size); #endif if (ret) { free(mr); @@ -149,7 +160,7 @@ static struct ibv_mr *__mthca_reg_mr(str struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +213,7 @@ struct ibv_cq *mthca_create_cq(struct ib cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -294,7 +305,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -402,7 +413,7 @@ struct ibv_srq *mthca_create_srq(struct if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -520,7 +531,7 @@ struct ibv_qp *mthca_create_qp(struct ib pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; -- Arthur From sashak at voltaire.com Wed Jul 18 16:29:47 2007 From: sashak at voltaire.com 
(Sasha Khapyorsky) Date: Thu, 19 Jul 2007 02:29:47 +0300 Subject: [ewg] Re: [ofa-general] Re: RFC OFED-1.3 installation In-Reply-To: <1184729415.5165.570.camel@firewall.xsintricity.com> References: <20070717171250.GD7479@mellanox.co.il> <1184693800.5165.480.camel@firewall.xsintricity.com> <20070717174526.GE7479@mellanox.co.il> <1184697799.5165.536.camel@firewall.xsintricity.com> <20070717202730.GA15990@mellanox.co.il> <20070717210935.GA17168@mellanox.co.il> <1184713907.5165.549.camel@firewall.xsintricity.com> <20070718021854.GD19243@mellanox.co.il> <1184729415.5165.570.camel@firewall.xsintricity.com> Message-ID: <20070718232947.GM27878@sashak.voltaire.com> Hi Doug, On 23:30 Tue 17 Jul , Doug Ledford wrote: > > For reference, I'll attach the updated script I made for spitting out a > buildable tarball. Small comment about the script. > > Hehehe...resending because the ofa list server ate my message due to the > script attachment :-D I'll inline it instead. > > I guess I'll also mention that this script exists in my ~/repos/upstream > directory, and also in that directory are all the git repos that I have > cloned from ofa (as well as other places). So, it's one level above all > the various git clones and spits everything out into dist/. The easiest > way to use this script for any given package you want to create a daily > snapshot of is to run ./make.dist repodir daily; scp > dist/repodir-git.tgz dist/repodir-daily.HEAD ofaserver:downloads. That > simple action would (assuming you create a reasonable reponame.spec.in > file in the repos that are missing one) spit out a tarball that can be > passed directly to rpmbuild --rebuild reponame-git.tgz and rpm will spit > out the packages, and the repodir-daily.HEAD file shows the HEAD of the > git repo so you know exactly what state the tarball represents and you > can always get to it in another more recent repo by just updating to > that commit as head of tree. 
> > #!/bin/bash > > usage() { > echo "$0 repo daily | release [ signed | ]" > echo > echo " You must specify the repo to make a distribution tarball in. This" > echo "script will not work with complex repos like the management repo that" > echo "builds more than one package. It expects a repo to be a single package" > echo "repo where the directory name and the package name are the same, and" > echo "where a properly formatted reponame.spec.in file exists." > echo > echo " You must specify either release or daily in order for this script" > echo "to make tarballs. If this is a daily release, the tarballs will" > echo "be named -git.tgz and will overwrite existing tarballs." > echo "If this is a release build, then the tarball will be named" > echo "-.tgz and must be a new file. In addition," > echo "the script will add a new set of symbolic tags to the git repo" > echo "that correspond to the - of each tarball." > echo > echo " If the script detects that the tag on any component already exists," > echo "it will abort the release and prompt you to update the version on" > echo "the already tagged component. This enforces the proper behavior of" > echo "treating any released tarball as set in stone so that in the future" > echo "you will always be able to get to any given release tarball by" > echo "checking out the git tag and know with certainty that it is the same" > echo "code as released before even if you no longer have the same tarball" > echo "around." > echo > echo " As part of this process, the script will parse the .spec.in" > echo "file and output a .spec file. Since this script isn't smart" > echo "enough to deal with other random changes that should have their own" > echo "checkin the script will refuse to run if the current repo state is not" > echo "clean." 
> echo > echo " NOTE: the script has no clue if you are tagging on the right branch," > echo "it will however show you the git branch output so you can confirm it" > echo "is on the right branch before proceeding with the release." > echo > echo " In addition to just tagging the git repo, whenever creating a release" > echo "there is an optional argument of either signed or a hex gpg key-id." > echo "If you do not pass an argument to release, then the tag will be a" > echo "simple git annotated tag. If you pass signed as the argument, the" > echo "git tag operation will use your default signing key to sign the tag." > echo "Or you can pass an actual gpg key id in hex format and git will sign" > echo "the tag with that key." > echo > } > > if [ -z "$1" -o -z "$2" ]; then usage; exit 1; fi > > if [ ! -d "$1" ]; then usage; exit 1; fi > > TMPDIR=dist > if [ ! -d $TMPDIR ]; then mkdir $TMPDIR; fi > > if [ "$2" = "daily" -o "$2" = "release" ]; then > if [ ! -f $TMPDIR/$1-$2.HEAD ]; then > touch $TMPDIR/$1-$2.HEAD > fi > NEWHEAD=`cat $TMPDIR/$1-$2.HEAD` > else > usage > exit 1 > fi > > cd "$1" > echo "Updating git repo..." > git pull > RESULT=$? > HEAD=`git log --pretty=oneline -1` > > if [ "$RESULT" -ne 0 ]; then > echo "Failed to update the git repo cleanly, manual intervention required" > exit 1 > fi > > if [ "$HEAD" = "$NEWHEAD" ]; then > echo "No new commits since last tarball creation, nothing to do." > cd .. > exit 0 > fi > > if [ "$2" = "release" ]; then > # Is the repo clean? > git status | grep modified > /dev/null 2>&1 > if [ $? = 0 ]; then > echo "There are modified files in the repo. Please check any" > echo "changes in before proceeding." > exit 4 > fi > # Since we will be tagging things, make sure we are on the right > # branch > git branch > echo -n "Is the active branch the right one to tag this release on [y/N]? " > read answer > if [ "$answer" = y -o "$answer" = Y ]; then > echo "Proceeding..." 
> else > echo "Please check out the right branch and run make.dist again" > exit 0 > fi > # Check versions to make sure that we can proceed > VERSION=`grep "AC_INIT.*$1" configure.in | cut -f 2 -d ',' | sed -e 's/ //g'` > TARBALL=$1-$VERSION.tgz > if [ -f ../$TMPDIR/$TARBALL ]; then > echo "Target $TARBALL already exists, please update the version of" > echo "$1" > exit 2 > fi > if [ ! -z "`git tag -l $1-$VERSION`" ]; then > echo "A git tag already exists for $1-$VERSION. Please change the version" > echo "of $1 so a tag replacement won't occur." > exit 3 > fi > # On a real release, this resets the daily release starting point, on the > # assumption that any new daily builds will have a version number that is > # incrementally higher than the last officially released tarball. > RELEASE=1 > echo $RELEASE > ../$TMPDIR/$1.release > else > DATE=`date +%Y%m%d` > if [ -f ../$TMPDIR/$1.release ]; then > RELEASE=`cat ../$TMPDIR/$1.release` > RELEASE=`expr $RELEASE + 1` > else > RELEASE=1 > fi > echo $RELEASE > ../$TMPDIR/$1.release > RELEASE=0.${RELEASE}.${DATE}git > TARBALL=$1-git.tgz > fi > > cd .. > cp -a $1 $1-$VERSION Instead of copying git-archive could be used. Something like this: GIT_DIR=$1 git-archive --format=tar --prefix=$1-$VERSION/ HEAD | tar xf - The advantage is that tree should not be clean and files generated by previous build will not be part of tarball (without using aggressive git-clean modes). Source files local modifications will be ignored as well. I think this could be useful when tarball is generated by maintainer from his/her working tree. Sasha > [ -f $1/$1.spec.in ] && sed -e 's/@VERSION@/'$VERSION'/;s/@RELEASE@/'$RELEASE'/;s/@TARBALL@/'$TARBALL'/' < $1/$1.spec.in > $1-$VERSION/$1.spec > if [ -f $1-$VERSION/autogen.sh ]; then > cd $1-$VERSION > ./autogen.sh > cd .. 
> fi > echo "Creating $TMPDIR/$TARBALL" > tar -czf $TMPDIR/$TARBALL --exclude=.git $1-$VERSION > rm -rf $1-$VERSION > echo "$HEAD" > $TMPDIR/$1-$2.HEAD > > if [ $2 = release ]; then > echo "Tagging release." > cd $1 > if [ ! -z "$3" ]; then > if [ $3 = "signed" ]; then > git tag -s -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > else > git tag -u "$3" -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > fi > else > git tag -a -m "Auto tag by make.dist on release tarball creation" $1-$VERSION > fi > cd .. > fi > > > > > > > > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From sashak at voltaire.com Wed Jul 18 16:42:03 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 02:42:03 +0300 Subject: [ofa-general] management.git spec files and ./autogen.sh Message-ID: <20070718234203.GN27878@sashak.voltaire.com> Hi Doug, For all management tarballs ./autogen.sh is called during generation (by make.dist). Is there any reason to call ./autogen.sh again under %build section of the spec file (it is common for *.spec.in)? And as result to have autoconf and automake in BuildRequires: list? Sasha From jgunthorpe at obsidianresearch.com Wed Jul 18 16:40:38 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 18 Jul 2007 17:40:38 -0600 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com> References: <20070718213243.GZ13618@obsidianresearch.com> <000001c7c98e$7dbe7000$ff0da8c0@amr.corp.intel.com> Message-ID: <20070718234038.GB13618@obsidianresearch.com> On Wed, Jul 18, 2007 at 03:53:36PM -0700, Sean Hefty wrote: > There are a couple of benefits. 
The number of PR queries is reduced > from O(n^2) to O(n). The queries can also be done once up front, > even started at different times if needed, rather than all at once > at job startup. The jobs are also able to make progress even if the > SA dies or is unreachable. Do you mean each node changes from O(local_cpus*nodes) -> O(nodes) ? Globally, from cold cache start you should still be O(n^2)? > >I'm trying to say, I think a simple kernel cache itself is fine, but > >there should be only 1 cache (get rid of ipoib) and it should have a > >really good interface to userspace so that the really hard problems > >can be solved through user space code. > > I don't disagree, but (for now anyway) I believe that the natural > interface for communicating with an SA related agent is a MAD > interface based on the SA management class for the reasons I > mentioned earlier. But this is really talking about extensions to > the local SA patch, rather than addressing anything fundamentally > wrong with the current patch set. OK - thats fine then. When you get around to doing the user space side I'll argue for netlink :) Having written both netlink user space code and mad code, I can say netlink is way better! Only other thing I'd see is to have the cache be on by default (ie included by default in distro kernels) it really needs a default short life time for cached entries as a work around for a coherence protocol.. Jason From sean.hefty at intel.com Wed Jul 18 17:12:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 18 Jul 2007 17:12:35 -0700 Subject: [ofa-general] Further 2.6.23 merge plans... In-Reply-To: <20070718234038.GB13618@obsidianresearch.com> Message-ID: <000201c7c999$823243e0$ff0da8c0@amr.corp.intel.com> >> There are a couple of benefits. The number of PR queries is reduced >> from O(n^2) to O(n). The queries can also be done once up front, >> even started at different times if needed, rather than all at once >> at job startup. 
The jobs are also able to make progress even if the >> SA dies or is unreachable. > >Do you mean each node changes from O(local_cpus*nodes) -> O(nodes) ? >Globally, from cold cache start you should still be O(n^2)? Each node goes from O(processes * nodes) -> O(1). The local SA does a single GetTable query to obtain all PRs. Whereas, applications do one PR query for each connection. >OK - thats fine then. When you get around to doing the user space side >I'll argue for netlink :) Having written both netlink user space code >and mad code, I can say netlink is way better! We can thumb wrestle. (I would never argue that the IB MAD interface is great.) I'm suggesting that we want an interface that allows an application running on a remote node to control local SA policy, and that the message format should be similar to SA MADs. My hope is that we can create an interface that will be usable for QoS purposes as well. I will start an open thread on this once the QoS is released, and I've had time to think about more of the details. - Sean From pradeeps at linux.vnet.ibm.com Wed Jul 18 17:55:48 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 18 Jul 2007 17:55:48 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: References: <469E4CA2.2040708@linux.vnet.ibm.com> Message-ID: <469EB694.7040408@linux.vnet.ibm.com> Roland Dreier wrote: > There's still some rather obvious problems with this patch. It would > really help if you would read over your patch again I think... anyway: > > > +#define CM_PACKET_SIZE (1ul << 16) > > This duplicates IPOIB_CM_MTU I think... certainly it needs to be kept > in sync with it somehow. They are not quite the same. How about: #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) This should keep the two in sync. 
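[Editorial illustration] The proposed definition rounds the CM MTU up to a whole number of pages. A small user-space sketch of that arithmetic is below. The assumptions are loud ones: ALIGN_UP is a local re-implementation of the usual power-of-two round-up rather than the kernel macro itself, and the values 0x10000 - 0x10 for the MTU and 4096 for the page size are example inputs chosen for illustration, not values stated in this thread.

```c
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two.
 * Local re-implementation for illustration only. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Assumed example values, not taken from the thread. */
#define EXAMPLE_PAGE_SIZE 4096UL
#define EXAMPLE_CM_MTU    (0x10000UL - 0x10UL)

/* What the proposed CM_PACKET_SIZE would evaluate to for these inputs. */
#define EXAMPLE_CM_PACKET_SIZE ALIGN_UP(EXAMPLE_CM_MTU, EXAMPLE_PAGE_SIZE)

/* Same computation as a function, so other MTU/page sizes can be tried. */
size_t cm_packet_size(size_t mtu, size_t page_size)
{
    return ALIGN_UP(mtu, page_size);
}
```

Under these assumed inputs the result equals 1ul << 16, the constant used in the patch, while staying tied to the MTU definition instead of duplicating it.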
> > > @@ -564,10 +574,9 @@ static inline void ipoib_cm_skb_too_long > > dev_kfree_skb_any(skb); > > } > > > > -static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > > +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > > Why is this change here? (This is in the CONFIG_INFINIBAND_IPOIB_CM=n > part of ipoib.h) > > > } > > - > > #endif Will do > > > - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); > > + ipoib_warn(priv, "post srq failed for buf %ld (%d)\n", id, ret); > > extra noise here (and still wrong -- id might be long long on some > architectures). Correct, it should have been %lld > > > - .event_handler = ipoib_cm_rx_event_handler, > > why? seems harmless to just leave this alone for all QPs even if an > SRQ isn't attached. > If memory serves me right, I tried that and ran into some inexplicable problems. Maybe it was hang or no traffic went through -don't exactly recollect what it was. After this change the problem went away. > > > + spin_unlock_irq(&priv->lock); > > + ipoib_warn(priv, "NOSRQ has reached the configurable limit " > > + "of either %d RC QPs or, max recv buf size of " > > + "0x%x MB\n", max_rc_qp, max_recv_buf); > > > + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); > > + ret = -EINVAL; > > + goto err_alloc_and_post; > > there's a bug here... you never undo the atomic_inc() of the number of > RC QPs even though you exit without creating a new connection. The atomic_dec() does happen, but that is in ipoib_cm_req_handler(). There are several places where allocate_and_post_rbuf_nosrq() could return an error after the atomic_inc(). So, there is an atomic_dec() in the calling routine. On the other hand I could move that to allocate_and_post_rbuf_nosrq() itself. > > > - if (!p) > > + if (!p) { > > + printk(KERN_WARNING "Failed to allocate RX control block when " > > + "REQ arrived\n"); > > return -ENOMEM; > > + } > > more unrelated changes... 
(feel free to send these as separate > patches) > OK > > kfree(p); > > } > > > > + > > cancel_delayed_work(&priv->cm.stale_task); > > } > > extra noise in the patch > > > + if (!priv->cm.srq) { > > + atomic_dec(&current_rc_qp); > > + } > > no need for { } here OK > > > + /* We increase the size of the CQ in the NOSRQ case to prevent CQ > > + * overflow. Every new REQ creates a new RX QP and each QP has an > > + * RX ring associated with it. Therefore we could have > > + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs > > + * in a CQ. > > + */ > > + if (!priv->cm.srq) > > + size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size; > > only need to do this if CM is enabled > > space after - here please too. > > that's just from a quick skim of the patch... > OK Pradeep From kliteyn at mellanox.co.il Wed Jul 18 21:45:53 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 19 Jul 2007 07:45:53 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-19:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14
FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From eitan at mellanox.co.il Wed Jul 18 21:51:30 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 19 Jul 2007 07:51:30 +0300 Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> Ohh your right. The Enh0 should get an update. I thought I got it right. Do you want me to provide an updated patch? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, July 18, 2007 10:22 PM > To: Eitan Zahavi > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > Subject: Re: [PATCH] opensm: Bug in coding trying to set > vl_arb_high_limit > > Hi Eitan, > > On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > > Hi Sasha > > > > When QoS setup is done the code was trying to send updates of > > vl_arb_high_limit by req_set of PORT_INFO with the new data. > > However, at that stage the SM still did not assign LIDs to > the ports. > > So the sent PortInfo.base_lid was still zero. The > specification does > > not allow for such LIDs (they are considered ilegal). > > > > the patch below fixes this by storing the calculated value > and later > > using it in link and lid managers. > > Good, Thanks (and this also saves one PortInfo update MAD). > One question below: > > > > > > Eitan > > > > Signed-off-by: Eitan Zahavi > > > > [snip...] 
> > > diff --git a/opensm/opensm/osm_lid_mgr.c > b/opensm/opensm/osm_lid_mgr.c > > index bc3f8b3..ed76382 100644 > > --- a/opensm/opensm/osm_lid_mgr.c > > +++ b/opensm/opensm/osm_lid_mgr.c > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > > ib_port_info_get_port_state(p_old_pi) ) > > send_set = TRUE; > > } > > + > > + /* provide the vl_high_limit from the qos mgr */ > > + if (p_mgr->p_subn->opt.no_qos == FALSE) > > + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > > + { > > + send_set = TRUE; > > + p_pi->vl_high_limit = p_physp->vl_high_limit; > > + } > > This part of code is for port_num != 0, so VLHighLimit setup > will be skipped for switch enhanced port 0. Is it something > expected? If so why? > > Sasha > From mst at dev.mellanox.co.il Wed Jul 18 21:58:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 07:58:41 +0300 Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix In-Reply-To: <20070718232232.GQ16538@sgi.com> References: <20070715212445.GG6921@sgi.com> <20070718232232.GQ16538@sgi.com> Message-ID: <20070719045840.GB30983@mellanox.co.il> > Quoting akepner at sgi.com : > Subject: Re: [RFC 1/1] libmthca: CQ/DMA race on Altix > > ... > > @@ -50,6 +51,14 @@ struct mthca_alloc_pd_resp { > __u32 reserved; > }; > > +struct mthca_reg_mr_abi_ver_2 { > + struct ibv_reg_mr ibv_cmd; > + __u32 mr_attrs; > +#define MTHCA_MR_DMAFLUSH 0x1 > +/* flush in-flight DMA on a write to memory region (IA64_SGI_SN2 only) */ > + __u32 reserved; > +}; > + > struct mthca_create_cq { > struct ibv_create_cq ibv_cmd; > __u32 lkey; Aren't there some unused bits in mr_attrs that we can use instead of breaking the ABI? 
-- MST From pradeeps at linux.vnet.ibm.com Wed Jul 18 22:15:58 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 18 Jul 2007 22:15:58 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: <469EB694.7040408@linux.vnet.ibm.com> References: <469E4CA2.2040708@linux.vnet.ibm.com> <469EB694.7040408@linux.vnet.ibm.com> Message-ID: <469EF38E.8000203@linux.vnet.ibm.com> > >> > + /* We increase the size of the CQ in the NOSRQ case to prevent CQ >> > + * overflow. Every new REQ creates a new RX QP and each QP has an >> > + * RX ring associated with it. Therefore we could have >> > + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs >> > + * in a CQ. >> > + */ >> > + if (!priv->cm.srq) >> > + size += (NOSRQ_INDEX_TABLE_SIZE -1) * ipoib_recvq_size; >> >> only need to do this if CM is enabled This happens during init in ipoib_transport_dev_init(). However, at this point IPOIB_FLAG_ADMIN_CM is not even set. So, it is not possible to do this conditionally only if CM is enabled. Any suggestions? Pradeep From rdreier at cisco.com Wed Jul 18 22:23:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 22:23:49 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: <469EF38E.8000203@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 18 Jul 2007 22:15:58 -0700") References: <469E4CA2.2040708@linux.vnet.ibm.com> <469EB694.7040408@linux.vnet.ibm.com> <469EF38E.8000203@linux.vnet.ibm.com> Message-ID: > This happens during init in ipoib_transport_dev_init(). However, at this > point IPOIB_FLAG_ADMIN_CM is not even set. So, it is not possible to do > this conditionally only if CM is enabled. Any suggestions? I meant only do it if CONFIG_INFINIBAND_IPOIB_CM is set. 
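A rough sketch of what "only do it if CONFIG_INFINIBAND_IPOIB_CM is set" could look like, written as a standalone function so it compiles outside the kernel. The base size (send ring + receive ring + 1), the table size of 128, and the name sketch_cq_size() are assumptions for illustration; they are not taken from the patch under review.

```c
#include <assert.h>

/* Hypothetical value for illustration; the patch defines its own constant. */
#define SKETCH_NOSRQ_INDEX_TABLE_SIZE 128

/*
 * Compute a CQ size. The NOSRQ enlargement is compiled in only when the
 * CM config symbol is defined, so non-CM builds never pay for it.
 */
static int sketch_cq_size(int recvq_size, int sendq_size, int has_srq)
{
	/* Assumed base size: one CQE per send and receive WQE, plus one. */
	int size = sendq_size + recvq_size + 1;

#ifdef CONFIG_INFINIBAND_IPOIB_CM
	/* Without an SRQ, every RX QP brings its own receive ring. */
	if (!has_srq)
		size += (SKETCH_NOSRQ_INDEX_TABLE_SIZE - 1) * recvq_size;
#else
	(void)has_srq; /* unused when CM support is compiled out */
#endif

	return size;
}
```

Building with -DCONFIG_INFINIBAND_IPOIB_CM enables the enlarged size for the no-SRQ case; without it the function returns the base size regardless of has_srq. This is a compile-time check, so it does not depend on IPOIB_FLAG_ADMIN_CM being set at init time.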
From rdreier at cisco.com Wed Jul 18 22:24:51 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 22:24:51 -0700 Subject: [ofa-general] Re: [RFC 1/1] libmthca: CQ/DMA race on Altix In-Reply-To: <20070719045840.GB30983@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 19 Jul 2007 07:58:41 +0300") References: <20070715212445.GG6921@sgi.com> <20070718232232.GQ16538@sgi.com> <20070719045840.GB30983@mellanox.co.il> Message-ID: > Aren't there some unused bits in mr_attrs that we can use instead of > breaking the ABI? That seems pretty fragile to me. Although maybe we could reserve a block of bits for provider-private use or something... From rdreier at cisco.com Wed Jul 18 22:28:10 2007 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 18 Jul 2007 22:28:10 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: <469EB694.7040408@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Wed, 18 Jul 2007 17:55:48 -0700") References: <469E4CA2.2040708@linux.vnet.ibm.com> <469EB694.7040408@linux.vnet.ibm.com> Message-ID: > They are not quite the same. How about: > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) That makes sense. > > > - .event_handler = ipoib_cm_rx_event_handler, > > > > why? seems harmless to just leave this alone for all QPs even if an > > SRQ isn't attached. > > If memory serves me right, I tried that and ran into some inexplicable problems. > Maybe it was hang or no traffic went through -don't exactly recollect what it was. > After this change the problem went away. Umm... I would like to get to the root cause of that. Because as far as I can see there is no problem if the event handler is called for a non-SRQ QP. The event will never be "last WQE reached" (since only a QP attached to an SRQ can generate that) and so the event handler will just return immediately and do nothing. > The atomic_dec() does happen, but that is in ipoib_cm_req_handler(). 
There are > several places where allocate_and_post_rbuf_nosrq() could return an error after > the atomic_inc(). So, there is an atomic_dec() in the calling routine. On the > other hand I could move that to allocate_and_post_rbuf_nosrq() itself. Got it. I guess that's OK although it does seem like it would be clearer if allocate_and_post_rbuf_nosrq() unwound everything on error. - R. From pradeeps at linux.vnet.ibm.com Wed Jul 18 22:55:33 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 18 Jul 2007 22:55:33 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: References: <469E4CA2.2040708@linux.vnet.ibm.com> <469EB694.7040408@linux.vnet.ibm.com> Message-ID: <469EFCD5.5050800@linux.vnet.ibm.com> Roland Dreier wrote: > > They are not quite the same. How about: > > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) > > That makes sense. > > > > > - .event_handler = ipoib_cm_rx_event_handler, > > > > > > why? seems harmless to just leave this alone for all QPs even if an > > > SRQ isn't attached. > > > > If memory serves me right, I tried that and ran into some inexplicable problems. > > Maybe it was hang or no traffic went through -don't exactly recollect what it was. > > After this change the problem went away. > > Umm... I would like to get to the root cause of that. Because as far > as I can see there is no problem if the event handler is called for a > non-SRQ QP. The event will never be "last WQE reached" (since only a > QP attached to an SRQ can generate that) and so the event handler will > just return immediately and do nothing. Since I do not recollect what the issue was, it might require some investigation, especially since we have a short window for the merge. Would it be okay if I submit a patch without this for the merge? Subsequently I will submit a patch to address this issue.
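The point quoted above, that a handler which only cares about "last WQE reached" is harmless on a non-SRQ QP, can be sketched as follows. The event names and the sketch_rx structure are hypothetical stand-ins for illustration; the real handler and event type live in the kernel and may differ in detail.

```c
#include <assert.h>

/* Hypothetical event codes standing in for the kernel's QP event types. */
enum sketch_event {
	SKETCH_EVENT_COMM_EST,
	SKETCH_EVENT_QP_FATAL,
	SKETCH_EVENT_LAST_WQE_REACHED
};

/* Minimal per-connection RX state for the sketch. */
struct sketch_rx {
	int last_wqe_seen;
};

/*
 * Returns 1 if the event was acted upon, 0 if it was ignored.
 * Anything other than "last WQE reached" falls through immediately, so
 * a QP that can never raise that event never changes any state here.
 */
static int sketch_rx_event_handler(enum sketch_event ev, struct sketch_rx *rx)
{
	if (ev != SKETCH_EVENT_LAST_WQE_REACHED)
		return 0;

	rx->last_wqe_seen = 1;
	return 1;
}
```

If the real handler has this shape, installing it on every QP should be a no-op for QPs without an SRQ, which is why the reported hang would need a different explanation.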
Pradeep From erezz at voltaire.com Thu Jul 19 01:31:41 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 19 Jul 2007 11:31:41 +0300 Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support In-Reply-To: References: <11845791213043-git-send-email-jens.axboe@oracle.com><1184579123437-git-send-email-jens.axboe@oracle.com> Message-ID: <469F216D.3060306@voltaire.com> Roland Dreier wrote: > [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland] > I would like to test that on iSER. Where can I download all 33 patches from? Thanks, Erez From jens.axboe at oracle.com Thu Jul 19 01:39:39 2007 From: jens.axboe at oracle.com (Jens Axboe) Date: Thu, 19 Jul 2007 10:39:39 +0200 Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support In-Reply-To: <469F216D.3060306@voltaire.com> References: <469F216D.3060306@voltaire.com> Message-ID: <20070719083939.GC11657@kernel.dk> On Thu, Jul 19 2007, Erez Zilber wrote: > Roland Dreier wrote: > > > [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland] > > > > I would like to test that on iSER. Where can I download all 33 patches from? I can provide a rolled-up patch for you; right now the patchset has been split into a series of 3 (core -> drivers -> arch bits are separate).
Here's one for current -git as-of this morning: http://brick.kernel.dk/sglist-chain-all-2.6.22-git-20070719 -- Jens Axboe From vlad at lists.openfabrics.org Thu Jul 19 01:45:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 19 Jul 2007 01:45:32 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070719-0100 daily build status Message-ID: <20070719084532.67CC6E60858@openfabrics.org> This email was generated automatically, please do not reply Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.5-7.244-smp Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 
Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: Build failed on i686 with linux-2.6.22-rc7 From mst at dev.mellanox.co.il Thu Jul 19 01:47:51 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 11:47:51 +0300 Subject: [ofa-general] oops on mlx4 modprobe Message-ID: <20070719084751.GC24018@mellanox.co.il> I got the following when loading mlx4_ib on git 589f1e81bde732dd0b1bc5d01b6bddd4bcb4527b [ 1350.668590] Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: [ 1350.674068] [] __kmalloc+0x51/0xaf [ 1350.682159] PGD 0 [ 1350.684378] Oops: 0000 [1] SMP [ 1350.687735] CPU 3 [ 1350.689950] Modules linked in: ib_ipoib ib_cm ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core piix ata_piix [ 1350.701777] Pid: 5391, comm: ipoib Not tainted 2.6.22-x86_64-git #119 [ 1350.708400] RIP: 0010:[] [] __kmalloc+0x51/0xaf [ 1350.716536] RSP: 0018:ffff81007c655ba0 EFLAGS: 00010046 [ 1350.722034] RAX: 0000000000000003 RBX: 0000000000000246 RCX: 0000000000000040 [ 1350.729352] RDX: ffff81007ed15000 RSI: 00000000000000d0 RDI: 0000000000000000 [ 1350.736669] RBP: ffff81007c655bc0 R08: 00000000fffffff0 R09: ffff810075779d80 [ 1350.743985] R10: 0000000000000001 R11: 0000000005b8d800 R12: 00000000000000d0 [ 1350.751302] R13: 0000000000000010 R14: ffff81007ed7cc78 
R15: ffff81007dbad800 [ 1350.758620] FS: 0000000000000000(0000) GS:ffff81007ff2b340(0000) knlGS:0000000000000000 [ 1350.767089] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 1350.773021] CR2: 0000000000000028 CR3: 0000000075ca6000 CR4: 00000000000006e0 [ 1350.780338] Process ipoib (pid: 5391, threadinfo ffff81007c654000, task ffff81007c5d8040) [ 1350.788895] Stack: ffff81007ed7cc00 0000000000000000 ffff81007ed7cc00 ffff81007ed7cd20 [ 1350.797331] ffff81007c655c40 ffffffff88063cb6 ffff81006ae20b80 000000006ae20c30 [ 1350.805151] ffff81007c655df0 ffff81007e3ba380 00000000000000d0 ffff81007ffa7c80 [ 1350.812587] Call Trace: [ 1350.815619] [] :mlx4_ib:create_qp_common+0x558/0x736 [ 1350.822421] [] :mlx4_ib:mlx4_ib_create_qp+0x62/0x11f [ 1350.829223] [] :ib_ipoib:ipoib_cm_tx_completion+0x0/0x2bb [ 1350.836461] [] :ib_core:ib_create_qp+0x18/0x94 [ 1350.842743] [] :ib_ipoib:ipoib_cm_tx_start+0x216/0x651 [ 1350.849714] [] queue_work+0x3f/0x4a [ 1350.855043] [] :ib_sa:ib_sa_join_multicast+0x292/0x2df [ 1350.862030] [] :ib_ipoib:ipoib_cm_tx_start+0x0/0x651 [ 1350.868829] [] run_workqueue+0x85/0x10f [ 1350.874501] [] worker_thread+0x0/0xe7 [ 1350.880000] [] worker_thread+0xdc/0xe7 [ 1350.885585] [] autoremove_wake_function+0x0/0x38 [ 1350.892036] [] kthread+0x49/0x77 [ 1350.897102] [] child_rip+0xa/0x12 [ 1350.902254] [] kthread+0x0/0x77 [ 1350.907231] [] child_rip+0x0/0x12 [ 1350.912384] [ 1350.914068] [ 1350.914068] Code: 49 8b 54 c5 00 83 3a 00 74 16 8b 02 c7 42 0c 01 00 00 00 ff [ 1350.923599] RIP [] __kmalloc+0x51/0xaf [ 1350.929195] RSP [ 1350.932873] CR2: 0000000000000028 -- MST From mst at dev.mellanox.co.il Thu Jul 19 01:49:27 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Jul 2007 11:49:27 +0300 Subject: [ofa-general] Re: ofa_1_2_kernel 20070719-0100 daily build status In-Reply-To: <20070719084532.67CC6E60858@openfabrics.org> References: <20070719084532.67CC6E60858@openfabrics.org> Message-ID: <20070719084927.GD24018@mellanox.co.il> > > Failed: > Build failed on i686 with linux-2.6.22-rc7 Why is it still failing? And shouldn't we switch to 2.6.22? -- MST From mst at dev.mellanox.co.il Thu Jul 19 02:40:39 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 12:40:39 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: fix oops in qp allocation for srq case Message-ID: <20070719094039.GF24018@mellanox.co.il> Don't pass 0 size to kmalloc if qp->rq.wqe_cnt == 0 (e.g. for SRQ). Note: initializing sq.wrid and rq.wrid to NULL at top helps keep error handling simple, and also fixes what seems like a bug in create_qp_common error handling: if srq is set for userspace, code at err_wrid would call kfree on wrid arrays even though these have not been initialized. Signed-off-by: Michael S. Tsirkin --- This patch fixes the oops I reported earlier. 
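The pattern described in the changelog above (start both pointers at NULL, skip the allocation when the count is zero, and let one error path free whatever was allocated) can be sketched in userspace C as follows. The types and function names are hypothetical stand-ins, and malloc()/free() stand in for kmalloc()/kfree(); this is an illustration of the idea, not the driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's work-queue bookkeeping. */
struct sketch_wq {
	unsigned int wqe_cnt;
	uint64_t *wrid;
};

struct sketch_qp {
	struct sketch_wq sq;
	struct sketch_wq rq;
};

/* Release both arrays; free(NULL) is a no-op, so no extra checks are needed. */
static void sketch_free_wrids(struct sketch_qp *qp)
{
	free(qp->sq.wrid);
	free(qp->rq.wrid);
	qp->sq.wrid = NULL;
	qp->rq.wrid = NULL;
}

/* Returns 0 on success, -1 on allocation failure. */
static int sketch_alloc_wrids(struct sketch_qp *qp)
{
	/* Start from NULL so the error path is safe at every stage. */
	qp->sq.wrid = NULL;
	qp->rq.wrid = NULL;

	qp->sq.wrid = malloc(qp->sq.wqe_cnt * sizeof(*qp->sq.wrid));
	if (!qp->sq.wrid)
		goto err;

	/* With an SRQ attached the receive queue is empty: allocate nothing. */
	if (qp->rq.wqe_cnt) {
		qp->rq.wrid = malloc(qp->rq.wqe_cnt * sizeof(*qp->rq.wrid));
		if (!qp->rq.wrid)
			goto err;
	}

	return 0;

err:
	sketch_free_wrids(qp);
	return -1;
}
```

The sketch assumes sq.wqe_cnt is nonzero, as the send queue is always allocated in the patch; only the receive side is guarded against a zero count.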
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index b5a24fb..79e50e5 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -315,6 +315,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, qp->rq.tail = 0; qp->sq.head = 0; qp->sq.tail = 0; + qp->sq.wrid = NULL; + qp->rq.wrid = NULL; err = set_rq_size(dev, &init_attr->cap, !!pd->uobject, !!init_attr->srq, qp); if (err) @@ -385,13 +387,18 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, if (err) goto err_mtt; - qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL); - qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL); - - if (!qp->sq.wrid || !qp->rq.wrid) { + qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), GFP_KERNEL); + if (!qp->sq.wrid) { err = -ENOMEM; goto err_wrid; } + if (qp->rq.wqe_cnt) { + qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), GFP_KERNEL); + if (!qp->rq.wrid) { + err = -ENOMEM; + goto err_wrid; + } + } } err = mlx4_qp_alloc(dev->dev, sqpn, &qp->mqp); -- MST From vlad at lists.openfabrics.org Thu Jul 19 02:45:34 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 19 Jul 2007 02:45:34 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070719-0200 daily build status Message-ID: <20070719094535.0DBC6E60870@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with 
linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22-rc7 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: From mst at dev.mellanox.co.il Thu Jul 19 02:50:12 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 12:50:12 +0300 Subject: [ofa-general] [PATCH] IB/mthca: enable MSI-X by default Message-ID: <20070719095012.GH24018@mellanox.co.il> Recover from MSI-X errors by automatically falling back on regular interrupt, instead of asking the user to do this manually. 
This makes it possible to enable MSI-X by default, and will make it possible to get rid of msi_x module option in the future. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 76fed75..0c8b954 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #ifdef CONFIG_PCI_MSI -static int msi_x = 0; +static int msi_x = 1; module_param(msi_x, int, 0444); MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); @@ -837,10 +837,7 @@ static int mthca_setup_hca(struct mthca_dev *dev) dev->mthca_flags & MTHCA_FLAG_MSI_X ? dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector : dev->pdev->irq); - if (dev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X)) - mthca_err(dev, "Try again with MSI/MSI-X disabled.\n"); - else - mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n"); + mthca_err(dev, "BIOS or ACPI interrupt routing problem?\n"); goto err_cmd_poll; } @@ -1115,24 +1112,6 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type) goto err_free_dev; } - if (msi_x && !mthca_enable_msi_x(mdev)) - mdev->mthca_flags |= MTHCA_FLAG_MSI_X; - else if (msi) { - static int warned; - - if (!warned) { - printk(KERN_WARNING PFX "WARNING: MSI support will be " - "removed from the ib_mthca driver in January 2008.\n"); - printk(KERN_WARNING " If you are using MSI and cannot " - "switch to MSI-X, please tell " - ".\n"); - ++warned; - } - - if (!pci_enable_msi(pdev)) - mdev->mthca_flags |= MTHCA_FLAG_MSI; - } - if (mthca_cmd_init(mdev)) { mthca_err(mdev, "Failed to init command interface, aborting.\n"); goto err_free_dev; @@ -1156,7 +1135,36 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type) mthca_warn(mdev, "If you have problems, try updating your HCA FW.\n"); } + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + else if 
(msi) { + static int warned; + + if (!warned) { + printk(KERN_WARNING PFX "WARNING: MSI support will be " + "removed from the ib_mthca driver in January 2008.\n"); + printk(KERN_WARNING " If you are using MSI and cannot " + "switch to MSI-X, please tell " + ".\n"); + ++warned; + } + + if (!pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + } + err = mthca_setup_hca(mdev); + if (err == -EBUSY && (mdev->mthca_flags & (MTHCA_FLAG_MSI | MTHCA_FLAG_MSI_X))) { + mthca_warn(mdev, "Trying again with MSI/MSI-X disabled.\n"); + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + mdev->mthca_flags &= ~(MTHCA_FLAG_MSI_X | MTHCA_FLAG_MSI); + + err = mthca_setup_hca(mdev); + } + if (err) goto err_close; @@ -1192,17 +1200,17 @@ err_cleanup: mthca_cleanup_uar_table(mdev); err_close: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + mthca_close_hca(mdev); err_cmd: mthca_cmd_cleanup(mdev); err_free_dev: - if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) - pci_disable_msix(pdev); - if (mdev->mthca_flags & MTHCA_FLAG_MSI) - pci_disable_msi(pdev); - ib_dealloc_device(&mdev->ib_dev); err_free_res: -- MST From mst at dev.mellanox.co.il Thu Jul 19 04:21:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 14:21:55 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: enable MSI-X by default Message-ID: <20070719112155.GJ24018@mellanox.co.il> Recover from MSI-X errors by automatically falling back on regular interrupt, instead of asking the user to do this manually. This makes it possible to enable MSI-X by default, and will make it possible to get rid of msi_x module option in the future. Signed-off-by: Michael S. 
Tsirkin --- diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 4dc9dc1..fee53b2 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #ifdef CONFIG_PCI_MSI -static int msi_x; +static int msi_x = 1; module_param(msi_x, int, 0444); MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); @@ -602,10 +602,7 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev) mlx4_err(dev, "NOP command failed to generate interrupt " "(IRQ %d), aborting.\n", priv->eq_table.eq[MLX4_EQ_ASYNC].irq); - if (dev->flags & MLX4_FLAG_MSI_X) - mlx4_err(dev, "Try again with MSI-X disabled.\n"); - else - mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n"); + mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n"); goto err_cmd_poll; } @@ -803,17 +800,26 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev, goto err_free_dev; } - mlx4_enable_msi_x(dev); - if (mlx4_cmd_init(dev)) { mlx4_err(dev, "Failed to init command interface, aborting.\n"); goto err_free_dev; } + mlx4_enable_msi_x(dev); + err = mlx4_init_hca(dev); + if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) { + mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n"); + dev->flags &= ~MLX4_FLAG_MSI_X; + pci_disable_msix(pdev); + err = mlx4_init_hca(dev); + } + if (err) goto err_cmd; + mlx4_enable_msi_x(dev); + err = mlx4_setup_hca(dev); if (err) goto err_close; @@ -838,15 +844,15 @@ err_cleanup: mlx4_cleanup_uar_table(dev); err_close: + if (dev->flags & MLX4_FLAG_MSI_X) + pci_disable_msix(pdev); + mlx4_close_hca(dev); err_cmd: mlx4_cmd_cleanup(dev); err_free_dev: - if (dev->flags & MLX4_FLAG_MSI_X) - pci_disable_msix(pdev); - kfree(priv); err_release_bar2: -- MST From mst at dev.mellanox.co.il Thu Jul 19 04:28:49 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Jul 2007 14:28:49 +0300 Subject: [ofa-general] [PATCH] IB/mthca: change command token on timeout Message-ID: <20070719112849.GK24018@mellanox.co.il> Command token is currently only updated on command event. This means that on command timeout, the same token will be reused for new command, which results in a mess if the timed out command *is* eventually completed. Signed-off-by: Michael S. Tsirkin --- This patch is in OFED 1.2, so I think we want it for 2.6.23 too. diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 7131446..26c42a1 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -355,9 +355,6 @@ void mthca_cmd_event(struct mthca_dev *dev, context->result = 0; context->status = status; context->out_param = out_param; - - context->token += dev->cmd.token_mask + 1; - complete(&context->done); } @@ -379,6 +376,7 @@ static int mthca_cmd_wait(struct mthca_dev *dev, spin_lock(&dev->cmd.context_lock); BUG_ON(dev->cmd.free_head < 0); context = &dev->cmd.context[dev->cmd.free_head]; + context->token += dev->cmd.token_mask + 1; dev->cmd.free_head = context->next; spin_unlock(&dev->cmd.context_lock); -- MST -- MST From sashak at voltaire.com Thu Jul 19 05:13:37 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 15:13:37 +0300 Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> Message-ID: <1184847217.21739.16.camel@localhost> On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote: > Ohh your right. The Enh0 should get an update. > I thought I got it right. Do you want me to provide an updated patch? 
I can update on my side - I think we could remove VLHighLimit update from osm_lid_mgr and have one only in osm_link_mgr. Sasha > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Wednesday, July 18, 2007 10:22 PM > > To: Eitan Zahavi > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > Subject: Re: [PATCH] opensm: Bug in coding trying to set > > vl_arb_high_limit > > > > Hi Eitan, > > > > On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > > > Hi Sasha > > > > > > When QoS setup is done the code was trying to send updates of > > > vl_arb_high_limit by req_set of PORT_INFO with the new data. > > > However, at that stage the SM still did not assign LIDs to > > the ports. > > > So the sent PortInfo.base_lid was still zero. The > > specification does > > > not allow for such LIDs (they are considered ilegal). > > > > > > the patch below fixes this by storing the calculated value > > and later > > > using it in link and lid managers. > > > > Good, Thanks (and this also saves one PortInfo update MAD). > > One question below: > > > > > > > > > > Eitan > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > [snip...] 
> > > > > diff --git a/opensm/opensm/osm_lid_mgr.c > > b/opensm/opensm/osm_lid_mgr.c > > > index bc3f8b3..ed76382 100644 > > > --- a/opensm/opensm/osm_lid_mgr.c > > > +++ b/opensm/opensm/osm_lid_mgr.c > > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > > > ib_port_info_get_port_state(p_old_pi) ) > > > send_set = TRUE; > > > } > > > + > > > + /* provide the vl_high_limit from the qos mgr */ > > > + if (p_mgr->p_subn->opt.no_qos == FALSE) > > > + if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > > > + { > > > + send_set = TRUE; > > > + p_pi->vl_high_limit = p_physp->vl_high_limit; > > > + } > > > > This part of code is for port_num != 0, so VLHighLimit setup > > will be skipped for switch enhanced port 0. Is it something > > expected? If so why? > > > > Sasha > > From eitan at mellanox.co.il Thu Jul 19 05:24:13 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 19 Jul 2007 15:24:13 +0300 Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> <1184847217.21739.16.camel@localhost> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com> Hi Sasha, I was not sure if there might be a case where the Link manager will not touch the port. So I placed it on both sides. Can't remember now if it is possible or not. Thanks for taking care of it. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, July 19, 2007 3:14 PM > To: Eitan Zahavi > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > Subject: RE: [PATCH] opensm: Bug in coding trying to set > vl_arb_high_limit > > On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote: > > Ohh your right. The Enh0 should get an update. > > I thought I got it right. Do you want me to provide an > updated patch? > > I can update on my side - I think we could remove VLHighLimit > update from osm_lid_mgr and have one only in osm_link_mgr. > > Sasha > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect Mellanox > Technologies > > LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Wednesday, July 18, 2007 10:22 PM > > > To: Eitan Zahavi > > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > > Subject: Re: [PATCH] opensm: Bug in coding trying to set > > > vl_arb_high_limit > > > > > > Hi Eitan, > > > > > > On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > > > > Hi Sasha > > > > > > > > When QoS setup is done the code was trying to send updates of > > > > vl_arb_high_limit by req_set of PORT_INFO with the new data. > > > > However, at that stage the SM still did not assign LIDs to > > > the ports. > > > > So the sent PortInfo.base_lid was still zero. The > > > specification does > > > > not allow for such LIDs (they are considered ilegal). > > > > > > > > the patch below fixes this by storing the calculated value > > > and later > > > > using it in link and lid managers. > > > > > > Good, Thanks (and this also saves one PortInfo update MAD). > > > One question below: > > > > > > > > > > > > > > Eitan > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > > > [snip...] 
> > > > > > > diff --git a/opensm/opensm/osm_lid_mgr.c > > > b/opensm/opensm/osm_lid_mgr.c > > > > index bc3f8b3..ed76382 100644 > > > > --- a/opensm/opensm/osm_lid_mgr.c > > > > +++ b/opensm/opensm/osm_lid_mgr.c > > > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > > > > ib_port_info_get_port_state(p_old_pi) ) > > > > send_set = TRUE; > > > > } > > > > + > > > > + /* provide the vl_high_limit from the qos mgr */ > > > > + if (p_mgr->p_subn->opt.no_qos == FALSE) > > > > + if (p_physp->vl_high_limit != > p_old_pi->vl_high_limit) > > > > + { > > > > + send_set = TRUE; > > > > + p_pi->vl_high_limit = > p_physp->vl_high_limit; > > > > + } > > > > > > This part of code is for port_num != 0, so VLHighLimit > setup will be > > > skipped for switch enhanced port 0. Is it something > expected? If so > > > why? > > > > > > Sasha > > > > From erezz at voltaire.com Thu Jul 19 05:28:53 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 19 Jul 2007 15:28:53 +0300 Subject: [ofa-general] Re: [PATCH 29/33] infiniband: sg chaining support In-Reply-To: <20070719083939.GC11657@kernel.dk> References: <469F216D.3060306@voltaire.com> <20070719083939.GC11657@kernel.dk> Message-ID: <469F5905.5020303@voltaire.com> Jens Axboe wrote: > On Thu, Jul 19 2007, Erez Zilber wrote: > >> Roland Dreier wrote: >> >> >>> [adding infinipath at qlogic.com and general at lists.openfabrics.org -- Roland] >>> >>> >> I would like to test that on iSER. Where can I download all 33 patches from? >> > > I can provide a rolled up patch for you, right now the patchset has been > split in a series of 3 (core -> drivers -> arch bits are seperate). > Here's one for current -git as-of this morning: > > http://brick.kernel.dk/sglist-chain-all-2.6.22-git-20070719 > > Looks ok with iSER. 
Erez From sashak at voltaire.com Thu Jul 19 06:00:56 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 16:00:56 +0300 Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com> References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> <1184847217.21739.16.camel@localhost> <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com> Message-ID: <1184850056.21739.20.camel@localhost> Hi Eitan, On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote: > Hi Sasha, > > I was not sure if there might be a case where the Link manager will not > touch the port. It should, at least with IB_LINK_NO_CHANGE call. So I moved VLHighLimit setup under this condition too (where most PortInfo fields are handled). Will push soon. Thanks for the patch. Sasha > So I placed it on both sides. Can't remember now if it is possible or > not. > Thanks for taking care of it. > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Thursday, July 19, 2007 3:14 PM > > To: Eitan Zahavi > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > Subject: RE: [PATCH] opensm: Bug in coding trying to set > > vl_arb_high_limit > > > > On Thu, 2007-07-19 at 07:51 +0300, Eitan Zahavi wrote: > > > Ohh your right. The Enh0 should get an update. > > > I thought I got it right. Do you want me to provide an > > updated patch? > > > > I can update on my side - I think we could remove VLHighLimit > > update from osm_lid_mgr and have one only in osm_link_mgr. 
> > > > Sasha > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect Mellanox > > Technologies > > > LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > > Sent: Wednesday, July 18, 2007 10:22 PM > > > > To: Eitan Zahavi > > > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > > > Subject: Re: [PATCH] opensm: Bug in coding trying to set > > > > vl_arb_high_limit > > > > > > > > Hi Eitan, > > > > > > > > On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > > > > > Hi Sasha > > > > > > > > > > When QoS setup is done the code was trying to send updates of > > > > > vl_arb_high_limit by req_set of PORT_INFO with the new data. > > > > > However, at that stage the SM still did not assign LIDs to > > > > the ports. > > > > > So the sent PortInfo.base_lid was still zero. The > > > > specification does > > > > > not allow for such LIDs (they are considered ilegal). > > > > > > > > > > the patch below fixes this by storing the calculated value > > > > and later > > > > > using it in link and lid managers. > > > > > > > > Good, Thanks (and this also saves one PortInfo update MAD). > > > > One question below: > > > > > > > > > > > > > > > > > > Eitan > > > > > > > > > > Signed-off-by: Eitan Zahavi > > > > > > > > > > > > > [snip...] 
> > > > > > > > > diff --git a/opensm/opensm/osm_lid_mgr.c > > > > b/opensm/opensm/osm_lid_mgr.c > > > > > index bc3f8b3..ed76382 100644 > > > > > --- a/opensm/opensm/osm_lid_mgr.c > > > > > +++ b/opensm/opensm/osm_lid_mgr.c > > > > > @@ -1182,6 +1182,14 @@ __osm_lid_mgr_set_physp_pi( > > > > > ib_port_info_get_port_state(p_old_pi) ) > > > > > send_set = TRUE; > > > > > } > > > > > + > > > > > + /* provide the vl_high_limit from the qos mgr */ > > > > > + if (p_mgr->p_subn->opt.no_qos == FALSE) > > > > > + if (p_physp->vl_high_limit != > > p_old_pi->vl_high_limit) > > > > > + { > > > > > + send_set = TRUE; > > > > > + p_pi->vl_high_limit = > > p_physp->vl_high_limit; > > > > > + } > > > > > > > > This part of code is for port_num != 0, so VLHighLimit > > setup will be > > > > skipped for switch enhanced port 0. Is it something > > expected? If so > > > > why? > > > > > > > > Sasha > > > > > > From sashak at voltaire.com Thu Jul 19 06:24:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 16:24:07 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <1184850056.21739.20.camel@localhost> References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> <1184847217.21739.16.camel@localhost> <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com> <1184850056.21739.20.camel@localhost> Message-ID: <20070719132407.GA16597@sashak.voltaire.com> On 16:00 Thu 19 Jul , Sasha Khapyorsky wrote: > Hi Eitan, > > On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote: > > Hi Sasha, > > > > I was not sure if there might be a case where the Link manager will not > > touch the port. > > It should, at least with IB_LINK_NO_CHANGE call. So I moved VLHighLimit > setup under this condition too (where most PortInfo fields are handled). > Will push soon. Thanks for the patch. 
Actually this is what I meant:

commit 464a00b94e77d5f753a01569f19166e115eb90e5
Author: Sasha Khapyorsky
Date: Thu Jul 19 16:03:55 2007 +0300

    opensm: VLHighLimit update during initial (in sweep) link_mgr call

    Update PortInfo:VLHighLimit during initial (in sweep) link_mgr call
    (which is with IB_LINK_NO_CHANGE).

    Signed-off-by: Sasha Khapyorsky

diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index b2b43ed..196942c 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -334,6 +334,14 @@ __osm_link_mgr_set_physp_pi(
           ib_port_info_get_op_vls(p_old_pi) )
         send_set = TRUE;

+      /* provide the vl_high_limit from the qos mgr */
+      if (p_mgr->p_subn->opt.no_qos == FALSE &&
+          p_physp->vl_high_limit != p_old_pi->vl_high_limit)
+      {
+        send_set = TRUE;
+        p_pi->vl_high_limit = p_physp->vl_high_limit;
+      }
+
       /* also the context can flag the need to check for errors. */
       context.pi_context.ignore_errors = FALSE;
     }
@@ -360,15 +368,6 @@ __osm_link_mgr_set_physp_pi(
     context.pi_context.active_transition = FALSE;
   }

-  /* provide the vl_high_limit from the qos mgr */
-  if (p_mgr->p_subn->opt.no_qos == FALSE)
-    if (p_physp->vl_high_limit != p_old_pi->vl_high_limit)
-    {
-      send_set = TRUE;
-      p_pi->vl_high_limit = p_physp->vl_high_limit;
-    }
-
-
   context.pi_context.node_guid = osm_node_get_node_guid( p_node );
   context.pi_context.port_guid = osm_physp_get_port_guid( p_physp );
   context.pi_context.set_method = TRUE;

Sasha

From eitan at mellanox.co.il Thu Jul 19 06:18:00 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 19 Jul 2007 16:18:00 +0300
Subject: [ofa-general] RE: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit
References: <86hco13c7e.fsf@sw053.lab.mtl.com> <20070718192217.GE27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901E73F06@mtlexch01.mtl.com> <1184847217.21739.16.camel@localhost> <6C2C79E72C305246B504CBA17B5500C901ED5748@mtlexch01.mtl.com> <1184850056.21739.20.camel@localhost>
<20070719132407.GA16597@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED57A2@mtlexch01.mtl.com> Looks good. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, July 19, 2007 4:24 PM > To: Eitan Zahavi > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > Subject: Re: [PATCH] opensm: Bug in coding trying to set > vl_arb_high_limit > > On 16:00 Thu 19 Jul , Sasha Khapyorsky wrote: > > Hi Eitan, > > > > On Thu, 2007-07-19 at 15:24 +0300, Eitan Zahavi wrote: > > > Hi Sasha, > > > > > > I was not sure if there might be a case where the Link > manager will > > > not touch the port. > > > > It should, at least with IB_LINK_NO_CHANGE call. So I moved > > VLHighLimit setup under this condition too (where most > PortInfo fields are handled). > > Will push soon. Thanks for the patch. > > Actually this is what I meant: > > > commit 464a00b94e77d5f753a01569f19166e115eb90e5 > Author: Sasha Khapyorsky > Date: Thu Jul 19 16:03:55 2007 +0300 > > opensm: VLHighLimit update during initial (in sweep) link_mgr call > > Update PortInfo:VLHighLimit during initial (in sweep) > link_mgr call > (which is with IB_LINK_NO_CHANGE). > > Signed-off-by: Sasha Khapyorsky > > diff --git a/opensm/opensm/osm_link_mgr.c > b/opensm/opensm/osm_link_mgr.c index b2b43ed..196942c 100644 > --- a/opensm/opensm/osm_link_mgr.c > +++ b/opensm/opensm/osm_link_mgr.c > @@ -334,6 +334,14 @@ __osm_link_mgr_set_physp_pi( > ib_port_info_get_op_vls(p_old_pi) ) > send_set = TRUE; > > + /* provide the vl_high_limit from the qos mgr */ > + if (p_mgr->p_subn->opt.no_qos == FALSE && > + p_physp->vl_high_limit != p_old_pi->vl_high_limit) > + { > + send_set = TRUE; > + p_pi->vl_high_limit = p_physp->vl_high_limit; > + } > + > /* also the context can flag the need to check for errors. 
*/ > context.pi_context.ignore_errors = FALSE; > } > @@ -360,15 +368,6 @@ __osm_link_mgr_set_physp_pi( > context.pi_context.active_transition = FALSE; > } > > - /* provide the vl_high_limit from the qos mgr */ > - if (p_mgr->p_subn->opt.no_qos == FALSE) > - if (p_physp->vl_high_limit != p_old_pi->vl_high_limit) > - { > - send_set = TRUE; > - p_pi->vl_high_limit = p_physp->vl_high_limit; > - } > - > - > context.pi_context.node_guid = osm_node_get_node_guid( p_node ); > context.pi_context.port_guid = osm_physp_get_port_guid( p_physp ); > context.pi_context.set_method = TRUE; > > > Sasha > From sashak at voltaire.com Thu Jul 19 06:45:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 19 Jul 2007 16:45:33 +0300 Subject: [ofa-general] Re: [PATCH] opensm: Bug in coding trying to set vl_arb_high_limit In-Reply-To: <86hco13c7e.fsf@sw053.lab.mtl.com> References: <86hco13c7e.fsf@sw053.lab.mtl.com> Message-ID: <20070719134533.GD16597@sashak.voltaire.com> On 19:31 Wed 18 Jul , Eitan Zahavi wrote: > Hi Sasha > > When QoS setup is done the code was trying to send updates of > vl_arb_high_limit by req_set of PORT_INFO with the new data. > However, at that stage the SM still did not assign LIDs to the ports. > So the sent PortInfo.base_lid was still zero. The specification does not > allow for such LIDs (they are considered ilegal). > > the patch below fixes this by storing the calculated value and later > using it in link and lid managers. > > Eitan > > Signed-off-by: Eitan Zahavi Applied (with changes discussed in this thread). Thanks. Sasha From mst at dev.mellanox.co.il Thu Jul 19 07:31:29 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Jul 2007 17:31:29 +0300 Subject: [ofa-general] Re: The low level driver of mlx4 kmalloc 0 bytes in QP creation In-Reply-To: References: <46821FDA.5030900@dev.mellanox.co.il> Message-ID: <20070719143129.GB28640@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: The low level driver of mlx4 kmalloc 0 bytes in QP creation > > > If one creates a QP with 0 WR in the RQ in the kernel level, the low > > level driver of the mlx4 > > will kmalloc 0 bytes (for the WR IDs of the RQ). > > (for example, the IPoIB CM creates such a QP) > > > > Is this is an error? > > The consensus seems to be that kmalloc(0) is OK, although various > 2.6.22-rc kernels printed big tracebacks when it happens. I think > getting rid of the kmalloc(0) in mlx4 would make the code more > complicated for no real gain. Hmm, seems to crash with recent git kernels. -- MST From rdreier at cisco.com Thu Jul 19 07:36:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Jul 2007 07:36:46 -0700 Subject: [ofa-general] Re: oops on mlx4 modprobe References: <20070719084751.GC24018@mellanox.co.il> Message-ID: Is this with CONFIG_SLAB or CONFIG_SLUB? From rdreier at cisco.com Thu Jul 19 07:46:47 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Jul 2007 07:46:47 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq case References: <20070719094039.GF24018@mellanox.co.il> Message-ID: kmalloc(0) is fine to do. This must be a bug introduced recently into one of the allocators -- which one are you using? 
From rdreier at cisco.com Thu Jul 19 07:46:46 2007 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 19 Jul 2007 07:46:46 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default References: <20070719112155.GJ24018@mellanox.co.il> Message-ID: > + mlx4_enable_msi_x(dev); > + > err = mlx4_init_hca(dev); > + if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) { > + mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n"); > + dev->flags &= ~MLX4_FLAG_MSI_X; > + pci_disable_msix(pdev); > + err = mlx4_init_hca(dev); > + } > + > if (err) > goto err_cmd; > > + mlx4_enable_msi_x(dev); > + Am I misreading the code or is that last mlx4_enable_msi_x() wrong? - R. From HNGUYEN at de.ibm.com Thu Jul 19 08:03:26 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Thu, 19 Jul 2007 17:03:26 +0200 Subject: [ofa-general] Re: [PATCH 01/10] IB/ehca: Support for multiple event queues In-Reply-To: Message-ID: Roland Dreier wrote on 17.07.2007 19:52:55: > At a higher level, I'm left wondering why nobody talked about multiple > EQs during the last months of the 2.6.22 process and now all of a > sudden it becomes urgent in the last few days of the 2.6.23 merge > window. That's not really how I like to merge features.... OK, let keep multiple eqs for next release with a more stable verbs def. For the other patch to support MR with large pages we'll resend it (without deps on multiple eqs patch) to you soon. Regards Nam From mst at dev.mellanox.co.il Thu Jul 19 09:12:01 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 19:12:01 +0300 Subject: [ofa-general] Re: oops on mlx4 modprobe In-Reply-To: References: <20070719084751.GC24018@mellanox.co.il> Message-ID: <20070719161201.GA31246@mellanox.co.il> CONFIG_SLAB Quoting Roland Dreier : Subject: Re: oops on mlx4 modprobe Is this with CONFIG_SLAB or CONFIG_SLUB? -- MST From mst at dev.mellanox.co.il Thu Jul 19 09:15:44 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Jul 2007 19:15:44 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq case In-Reply-To: References: <20070719094039.GF24018@mellanox.co.il> Message-ID: <20070719161543.GC31246@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq case > > kmalloc(0) is fine to do. This must be a bug introduced recently into > one of the allocators -- which one are you using? the bug in error handling is real though, isn't it? -- MST From mst at dev.mellanox.co.il Thu Jul 19 09:18:20 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 19 Jul 2007 19:18:20 +0300 Subject: [ofa-general] Re: socket buffer accounting with UDP/ipoib In-Reply-To: References: <1183643723.25031.262.camel@mtls03> <20070718074632.GF1115@mellanox.co.il> Message-ID: <20070719161820.GD31246@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: socket buffer accounting with UDP/ipoib > > > > + ib_dma_sync_single_for_cpu(priv->ca, addr, IPOIB_BUF_SIZE, > > > + DMA_FROM_DEVICE); > > > + skb_copy_from_linear_data_offset(skb, IB_GRH_BYTES, new_skb->data, > > > + wc->byte_len - IB_GRH_BYTES); > > > + ib_dma_sync_single_for_device(priv->ca, addr, IPOIB_BUF_SIZE, > > > + DMA_FROM_DEVICE); > > > > BTW, why is ib_dma_sync_single_for_device necessary here? > > Not sure what you're asking exactly. The sync for device is needed to > match the previous sync for the cpu obviously. That's what I'm missing: must each sync_for_cpu be paired with sync_for_device? Is there documentation for this somewhere? > We need both syncs for > the same reason we need the unmap when we don't copy -- we're copying > data out of the skb we gave to the device earlier, so we need to make > sure the cpu sees the right data. Right, but device never reads the buffer, and CPU never modifies it. -- MST From mst at dev.mellanox.co.il Thu Jul 19 09:18:52 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 19 Jul 2007 19:18:52 +0300 Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default In-Reply-To: References: <20070719112155.GJ24018@mellanox.co.il> Message-ID: <20070719161852.GE31246@mellanox.co.il> > Quoting Roland Dreier : > Subject: Re: [PATCH] IB/mlx4: enable MSI-X by default > > > + mlx4_enable_msi_x(dev); > > + > > err = mlx4_init_hca(dev); > > + if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) { > > + mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n"); > > + dev->flags &= ~MLX4_FLAG_MSI_X; > > + pci_disable_msix(pdev); > > + err = mlx4_init_hca(dev); > > + } > > + > > if (err) > > goto err_cmd; > > > > + mlx4_enable_msi_x(dev); > > + > > Am I misreading the code or is that last mlx4_enable_msi_x() wrong? Hmm, looks like it is .. -- MST From Jonathan.Robertson at 3leafnetworks.com Thu Jul 19 09:31:17 2007 From: Jonathan.Robertson at 3leafnetworks.com (Jonathan Robertson) Date: Thu, 19 Jul 2007 09:31:17 -0700 Subject: FW: [ofa-general] libsdp in OFED 1.1 Message-ID: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp> Hi Jim, We are actually using OFED 1.1. Hopefully we'll move to 1.2 in a few weeks. The systems using it are SLES 9 SP3. Uname -a: Linux oracle 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux I have added alias net-pf-27 ib_sdp to modprobe.conf.local I have modified /usr/local/ofed/etc/libsdp.conf to have the following lines: log min-level 9 destination syslog use both server "/usr/local/bin/netserver" *:* use both client "/usr/local/bin/netperf" *:* And I created /etc/ld.so.preload and have: /usr/local/ofed/lib64/libsdp.so Is there a close function in ofed 1.1? Perhaps I should try to add that to port.c for 1.1? My reply to your email bounced... 5.1.0 - Unknown address error 550-'5.1.1 unknown or illegal alias: @austin.rr.com' Thanks! 
Jonathan

-----Original Message-----
From: Jim Mott [mailto:jimmmott at austin.rr.com]
Sent: Wednesday, July 18, 2007 3:42 PM
To: Jonathan Robertson
Subject: RE: [ofa-general] libsdp in OFED 1.1

Hi,

I have just taken over support for libsdp and am feeling my way here. Probably I should have replied to the list, but this works too.

I assume you are using OFED 1.2 version of the code. There is a close() function in that code (port.c), so there is something fishy here.

Could you send a little more info please. Stuff like distro, 32/64, and perhaps the script/commands you use to automate the preload process. Something like:

# uname -a
Linux sw106.lab.mtl.com 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT \
2006 x86_64 x86_64 x86_64 GNU/Linux
# export LD_LIBRARY_PATH=/usr/local/ofed/lib64:/usr/local/ofed/lib
# export LD_PRELOAD=libsdp.so
# export LIBSDP_CONFIG_FILE=/etc/infiniband/libsdp.conf
# ls
config_parser.c   config_scanner.c   libsdp.la  Makefile     match.c   port.lo
config_parser.h   config_scanner.lo  log.c      Makefile.am  match.lo  sdp_inet.h
config_parser.lo  libsdp.h           log.lo     Makefile.in  port.c    socket.c

Thanks,
Jim

=========================
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan Robertson
Sent: Wednesday, July 18, 2007 3:19 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] libsdp in OFED 1.1

Hello,

I have been using libsdp, and preloading it with the application. I would like to have it automatically preloaded, but am concerned about some error messages that seem harmless. So I don't want to have our client use the ld.so.preload if there are going to be messages.

I see the following when I run a simple 'ls'

# ls
Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found
. ..
#

Any suggestions?
I have the following in libsdp.conf Log min-level 9 destination syslog Use both server netserver *:* Use both client netperf *:* Our client is interested in having weblogic communicate with the oracle DB using SDP, and the interface to oracle and weblogic being accessible via tcp/ip over Ethernet as well. Thanks! Jonathan From afriedle at open-mpi.org Thu Jul 19 10:13:15 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Thu, 19 Jul 2007 10:13:15 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <468426B6.3060602@ichips.intel.com> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> Message-ID: <469F9BAB.4080504@open-mpi.org> Finally was able to have the SM switched over from Cisco on the switch to OpenSM on a node. Responses inline below.. Sean Hefty wrote: >> Now the more interesting part. I'm now able to run on a 128 node >> machine using open SM running on a node (before, I was running on an 8 >> node machine which I'm told is running the Cisco SM on a Topspin >> switch).
On this machine, if I run my benchmark with two processes >> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able >> to join > 750 groups simultaneously from one QP on each process. To >> make this stranger, I can join only 4 groups running the same thing on >> the 8-node machine. > > Are the switches and HCAs in the two setups the same? If you run the > same SM on both clusters, do you see the same results? The switches are different. The 8 node machine uses a Topspin switch, the 128 node machine uses a Mellanox switch. Looking at `ibstat` the HCAs appear to be the same (MT23108), though HCAs on the 128 node machine have firmware 3.2.0, where 3.5.0 is on the 8 node machine. Does this matter? Running OpenSM now, I still do not see the same results. Behavior is now the same as the 128 node machine, except when running two processes per node (in which case I can join as many groups as I like on the 128 node machine). On the 8 node machine I am still limited to 4 groups in this case. This makes me think the switch is involved, is this correct? > >> While doing so I noticed that the time from calling >> rdma_join_multicast() to the event arrival stayed fairly constant (in >> the .001sec range), while the time from the join call to actually >> receiving messages on the group steadily increased from around .1 secs >> to around 2.7 secs with 750+ groups. Furthermore, this time does not >> drop back to .1 secs if I stop the benchmark and run it (or any of my >> other multicast code) again. This is understandable within a single >> program run, but the fact that behavior persists across runs concerns >> me -- feels like a bug, but I don't have much concrete here. > > Even after all nodes leave all multicast groups, I don't believe that > there's a requirement for the SA to reprogram the switches immediately. > So if the switches or the configuration of the swtiches are part of the > problem, I can imagine seeing issues between runs. 
> > When rdma_join_multicast() reports the join event, it means either: the > SA has been notified of the join request, or, if the port has already > joined the group, that a reference count on the group has been > incremented. The SA may still require time to program the switch > forwarding tables. OK this makes sense, but I still don't see where all the time is going. Should the fact that the switches haven't been reprogrammed since leaving the groups really effect how long it takes to do a subsequent join? I'm not convinced. Is this time being consumed by the switches when the are asked to reprogram their tables (I assume some sort of routing table is used internally)? What could they be doing that takes so long to do that? Is it something that a firmware change on the switch could alleviate? Andrew From hal.rosenstock at gmail.com Thu Jul 19 10:32:03 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 19 Jul 2007 10:32:03 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <469F9BAB.4080504@open-mpi.org> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> Message-ID: Andrew, On 7/19/07, Andrew Friedley wrote: > > Finally was able to have the SM switched over from Cisco on the switch > to OpenSM on a node. Responses inline below.. > > Sean Hefty wrote: > >> Now the more interesting part. I'm now able to run on a 128 node > >> machine using open SM running on a node (before, I was running on an 8 > >> node machine which I'm told is running the Cisco SM on a Topspin > >> switch). On this machine, if I run my benchmark with two processes > >> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able > >> to join > 750 groups simultaneously from one QP on each process. To > >> make this stranger, I can join only 4 groups running the same thing on > >> the 8-node machine. 
> > > > Are the switches and HCAs in the two setups the same? If you run the > > same SM on both clusters, do you see the same results? > > The switches are different. The 8 node machine uses a Topspin switch, > the 128 node machine uses a Mellanox switch. Looking at `ibstat` the > HCAs appear to be the same (MT23108), though HCAs on the 128 node > machine have firmware 3.2.0, where 3.5.0 is on the 8 node machine. Does > this matter? > > Running OpenSM now, I still do not see the same results. Behavior is > now the same as the 128 node machine, except when running two processes > per node (in which case I can join as many groups as I like on the 128 > node machine). On the 8 node machine I am still limited to 4 groups in > this case. I'm not quite parsing what is the same with what is different in the results (and I presume the only variable is SM). This makes me think the switch is involved, is this correct? I doubt it. It is either end station, SM, or a combination of the two. > > >> While doing so I noticed that the time from calling > >> rdma_join_multicast() to the event arrival stayed fairly constant (in > >> the .001sec range), while the time from the join call to actually > >> receiving messages on the group steadily increased from around .1 secs > >> to around 2.7 secs with 750+ groups. Furthermore, this time does not > >> drop back to .1 secs if I stop the benchmark and run it (or any of my > >> other multicast code) again. This is understandable within a single > >> program run, but the fact that behavior persists across runs concerns > >> me -- feels like a bug, but I don't have much concrete here. > > > > Even after all nodes leave all multicast groups, I don't believe that > > there's a requirement for the SA to reprogram the switches immediately. > > So if the switches or the configuration of the swtiches are part of the > > problem, I can imagine seeing issues between runs. 
> > > > When rdma_join_multicast() reports the join event, it means either: the > > SA has been notified of the join request, or, if the port has already > > joined the group, that a reference count on the group has been > > incremented. The SA may still require time to program the switch > > forwarding tables. > > OK this makes sense, but I still don't see where all the time is going. > Should the fact that the switches haven't been reprogrammed since > leaving the groups really affect how long it takes to do a subsequent > join? I'm not convinced. It takes time for the SM to recalculate the multicast tree. While leaves can be lazy, I forget whether joins are synchronous or not. > Is this time being consumed by the switches when they are asked to > reprogram their tables (I assume some sort of routing table is used > internally)? This is relatively quick compared to the policy for the SM rerouting of multicast based on joins/leaves/group creation/deletion. -- Hal > What could they be doing that takes so long to do that? > Is it something that a firmware change on the switch could alleviate? > > Andrew > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From afriedle at open-mpi.org Thu Jul 19 10:58:42 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Thu, 19 Jul 2007 10:58:42 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined?
In-Reply-To: References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> Message-ID: <469FA652.4060909@open-mpi.org> Hal Rosenstock wrote: > I'm not quite parsing what is the same with what is different in the > results > (and I presume the only variable is SM). Yes; this is confusing, I'll try to summarize the various behaviors I'm getting. First, there are two machines. One has 8 nodes and runs a Topspin switch with the Cisco SM on it. The other is 128 nodes and runs a Mellanox switch with Open SM on a compute node. OFED v1.2 is used on both. Below is how many groups I can join using my test program (described elsewhere in the thread) On the 8 node machine: 8 procs (one per node) -- 14 groups. 16 procs (two per node) -- 4 groups. On the 128 node machine: 8 procs (one per node, 8 nodes used) -- 14 groups. 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750. Some peculiarities complicate this. On either machine, I've noticed that if I haven't been doing anything using IB multicast in say a day (haven't tried to figure out exactly how long), in any run scenario listed above, I can join 4 groups. I do a couple runs where I hit errors after 4 groups, and then I consistently get the group counts above for the rest of the work day. Second, in the cases in which I am able to join 14 groups, if I run my test program twice simultaneously on the same nodes, I am able to join a maximum of 14 groups total between the two running tests (as opposed to 14 per test run). Running the test twice simultaneously using a disjoint set of nodes is not an issue. >> This makes me think the switch is involved, is this correct? > > > I doubt it. It is either end station, SM, or a combination of the two. OK. >> OK this makes sense, but I still don't see where all the time is going. 
>> Should the fact that the switches haven't been reprogrammed since >> leaving the groups really affect how long it takes to do a subsequent >> join? I'm not convinced. > > > It takes time for the SM to recalculate the multicast tree. While leaves > can > be lazy, I forget whether joins are synchronous or not. Is the algorithm for recalculating the tree documented at all? Or, where is the code for it (assuming I have access)? I feel like I'm missing something here that explains why it's so costly. Andrew > > Is this time being consumed by the switches when they are asked to >> reprogram their tables (I assume some sort of routing table is used >> internally)? > > > This is relatively quick compared to the policy for the SM rerouting of > multicast based on joins/leaves/group creation/deletion. OK. Thanks for the insight. Andrew From afriedle at open-mpi.org Thu Jul 19 11:14:00 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Thu, 19 Jul 2007 11:14:00 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <469FA652.4060909@open-mpi.org> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> <469FA652.4060909@open-mpi.org> Message-ID: <469FA9E8.90609@open-mpi.org> Andrew Friedley wrote: > Hal Rosenstock wrote: >> I'm not quite parsing what is the same with what is different in the >> results >> (and I presume the only variable is SM). > > Yes; this is confusing, I'll try to summarize the various behaviors I'm > getting. > > First, there are two machines. One has 8 nodes and runs a Topspin > switch with the Cisco SM on it. The other is 128 nodes and runs a > Mellanox switch with Open SM on a compute node. OFED v1.2 is used on > both. Below is how many groups I can join using my test program > (described elsewhere in the thread) > > On the 8 node machine: > 8 procs (one per node) -- 14 groups.
> 16 procs (two per node) -- 4 groups. > > On the 128 node machine: > 8 procs (one per node, 8 nodes used) -- 14 groups. > 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750. > > Some peculiarities complicate this. On either machine, I've noticed > that if I haven't been doing anything using IB multicast in say a day > (haven't tried to figure out exactly how long), in any run scenario > listed above, I can join 4 groups. I do a couple runs where I hit > errors after 4 groups, and then I consistently get the group counts > above for the rest of the work day. > > Second, in the cases in which I am able to join 14 groups, if I run my > test program twice simultaneously on the same nodes, I am able to join a > maximum of 14 groups total between the two running tests (as opposed to > 14 per test run). Running the test twice simultaneously using a > disjoint set of nodes is not an issue. So I sent that last email before I meant to :) Need to eat.. I've managed to confuse myself a little here too -- it looks like changing from the Cisco SM to the OpenSM did not change behavior on the 8 node machine. At least, I'm still getting the same results above now that it's back on the Cisco SM. Also some newer results. I had a long run going on the 128 node machine to see how many groups I really could join, and it just errored out after joining 892 groups successfully. Specifically, I got an RDMA_CM_EVENT_MULTICAST_ERROR event containing status -22 ('Unknown error' according to strerror). errno is still cleared to 'Success'. I don't have time to go look at the code to see where this came from right now, but does anyone know what it means? Andrew > >>> This makes me think the switch is involved, is this correct? >> >> >> I doubt it. It is either end station, SM, or a combination of the two. > > OK. > >>> OK this makes sense, but I still don't see where all the time is going.
>>> Should the fact that the switches haven't been reprogrammed since >>> leaving the groups really affect how long it takes to do a subsequent >>> join? I'm not convinced. >> >> >> It takes time for the SM to recalculate the multicast tree. While >> leaves can >> be lazy, I forget whether joins are synchronous or not. > > Is the algorithm for recalculating the tree documented at all? Or, > where is the code for it (assuming I have access)? I feel like I'm > missing something here that explains why it's so costly. > > Andrew > >> >> Is this time being consumed by the switches when they are asked to >>> reprogram their tables (I assume some sort of routing table is used >>> internally)? >> >> >> This is relatively quick compared to the policy for the SM rerouting of >> multicast based on joins/leaves/group creation/deletion. > > OK. Thanks for the insight. > > Andrew From hal.rosenstock at gmail.com Thu Jul 19 11:14:12 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 19 Jul 2007 11:14:12 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <469FA652.4060909@open-mpi.org> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> <469FA652.4060909@open-mpi.org> Message-ID: Andrew, On 7/19/07, Andrew Friedley wrote: > > Hal Rosenstock wrote: > > I'm not quite parsing what is the same with what is different in the > > results > > (and I presume the only variable is SM). > > Yes; this is confusing, I'll try to summarize the various behaviors I'm > getting. > > First, there are two machines. One has 8 nodes and runs a Topspin > switch with the Cisco SM on it.
The other is 128 nodes and runs a > Mellanox switch with Open SM on a compute node. OFED v1.2 is used on > both. Below is how many groups I can join using my test program > (described elsewhere in the thread) > > On the 8 node machine: > 8 procs (one per node) -- 14 groups. > 16 procs (two per node) -- 4 groups. > > On the 128 node machine: > 8 procs (one per node, 8 nodes used) -- 14 groups. > 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750. > > Some peculiarities complicate this. On either machine, I've noticed > that if I haven't been doing anything using IB multicast in say a day > (haven't tried to figure out exactly how long), in any run scenario > listed above, I can join 4 groups. I do a couple runs where I hit > errors after 4 groups, and then I consistently get the group counts > above for the rest of the work day. > > Second, in the cases in which I am able to join 14 groups, if I run my > test program twice simultaneously on the same nodes, I am able to join a > maximum of 14 groups total between the two running tests (as opposed to > 14 per test run). Running the test twice simultaneously using a > disjoint set of nodes is not an issue. Thanks. I can only comment on the OpenSM configuration and in general on SMs so I'm still not sure what limits you are hitting; it may be multiple but not sure. Some seemed to be end node (HCA) related based on a previous email. >> This makes me think the switch is involved, is this correct? > > > > > > I doubt it. It is either end station, SM, or a combination of the two. > > OK. > > >> OK this makes sense, but I still don't see where all the time is going. > >> Should the fact that the switches haven't been reprogrammed since > >> leaving the groups really affect how long it takes to do a subsequent > >> join? I'm not convinced. > > > > > > It takes time for the SM to recalculate the multicast tree. While leaves > > can > > be lazy, I forget whether joins are synchronous or not.
> > Is the algorithm for recalculating the tree documented at all? Or, > where is the code for it (assuming I have access)? I feel like I'm > missing something here that explains why it's so costly. I'm afraid it is just the code AFAIK :-( -- Hal > Andrew > > > > > Is this time being consumed by the switches when they are asked to > >> reprogram their tables (I assume some sort of routing table is used > >> internally)? > > > > > > This is relatively quick compared to the policy for the SM rerouting of > > multicast based on joins/leaves/group creation/deletion. > > OK. Thanks for the insight. > > Andrew > From afriedle at open-mpi.org Thu Jul 19 11:18:00 2007 From: afriedle at open-mpi.org (Andrew Friedley) Date: Thu, 19 Jul 2007 11:18:00 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> <469FA652.4060909@open-mpi.org> Message-ID: <469FAAD8.8050505@open-mpi.org> Hal Rosenstock wrote: > Thanks. I can only comment on the OpenSM configuration and in general on > SMs > so I'm still not sure what limits you are hitting; it may be multiple but > not sure. Some seemed to be end node (HCA) related based on a previous > email. Thanks for your help. Yes I'm thinking the same thing, though what I'm seeing seemingly contradicts the limits that I'm told are in place (and have now been changed post-v1.2). >> Is the algorithm for recalculating the tree documented at all? Or, >> where is the code for it (assuming I have access)? I feel like I'm >> missing something here that explains why it's so costly. > > > I'm afraid it is just the code AFAIK :-( OK, do you know where it is in the OpenSM code base?
Andrew From arthur.jones at qlogic.com Thu Jul 19 11:32:49 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Thu, 19 Jul 2007 11:32:49 -0700 Subject: [ofa-general] is ipath_layer.c dead code? In-Reply-To: References: Message-ID: <20070719183249.GA20240@bauxite.pathscale.com> hi roland, your patch was the right idea, but i think the attached patch is more complete... btw: this patch is avail via git pull from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur On Mon, Jul 16, 2007 at 10:43:02AM -0700, Roland Dreier wrote: > My kernel seems to build and link fine with the patch below. Is > ipath_layer.c being used for anything, or can we just kill it? > > - R. -------------- next part -------------- IB/ipath - remove ipath_layer, the former network/verbs layer From: Arthur Jones The ipath_layer.[ch] code was an attempt to provide a single interface for the ipath verbs and ipath_ether code to use. As verbs functionality increased, the layer's functionality became insufficient and the verbs code broke away to interface directly to the driver. The failed attempt to get ipath_ether upstream was the final nail in the coffin and now it sits quietly in a dark kernel.org corner waiting for someone to notice the smell and send it along to its final resting place. Roland Dreier was that someone -- this patch expands on his work...
Signed-off-by: Arthur Jones --- drivers/infiniband/hw/ipath/Makefile | 1 drivers/infiniband/hw/ipath/ipath_layer.c | 365 ----------------------------- drivers/infiniband/hw/ipath/ipath_layer.h | 71 ------ drivers/infiniband/hw/ipath/ipath_verbs.h | 2 4 files changed, 0 insertions(+), 439 deletions(-) diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile index ec2e603..fe67388 100644 --- a/drivers/infiniband/hw/ipath/Makefile +++ b/drivers/infiniband/hw/ipath/Makefile @@ -14,7 +14,6 @@ ib_ipath-y := \ ipath_init_chip.o \ ipath_intr.o \ ipath_keys.o \ - ipath_layer.o \ ipath_mad.o \ ipath_mmap.o \ ipath_mr.o \ diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c deleted file mode 100644 index 82616b7..0000000 --- a/drivers/infiniband/hw/ipath/ipath_layer.c +++ /dev/null @@ -1,365 +0,0 @@ -/* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - */ - -/* - * These are the routines used by layered drivers, currently just the - * layered ethernet driver and verbs layer. - */ - -#include -#include - -#include "ipath_kernel.h" -#include "ipath_layer.h" -#include "ipath_verbs.h" -#include "ipath_common.h" - -/* Acquire before ipath_devs_lock. */ -static DEFINE_MUTEX(ipath_layer_mutex); - -u16 ipath_layer_rcv_opcode; - -static int (*layer_intr)(void *, u32); -static int (*layer_rcv)(void *, void *, struct sk_buff *); -static int (*layer_rcv_lid)(void *, void *); - -static void *(*layer_add_one)(int, struct ipath_devdata *); -static void (*layer_remove_one)(void *); - -int __ipath_layer_intr(struct ipath_devdata *dd, u32 arg) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_intr) - ret = layer_intr(dd->ipath_layer.l_arg, arg); - - return ret; -} - -int ipath_layer_intr(struct ipath_devdata *dd, u32 arg) -{ - int ret; - - mutex_lock(&ipath_layer_mutex); - - ret = __ipath_layer_intr(dd, arg); - - mutex_unlock(&ipath_layer_mutex); - - return ret; -} - -int __ipath_layer_rcv(struct ipath_devdata *dd, void *hdr, - struct sk_buff *skb) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_rcv) - ret = layer_rcv(dd->ipath_layer.l_arg, hdr, skb); - - return ret; -} - -int __ipath_layer_rcv_lid(struct ipath_devdata *dd, void *hdr) -{ - int ret = -ENODEV; - - if (dd->ipath_layer.l_arg && layer_rcv_lid) - ret = layer_rcv_lid(dd->ipath_layer.l_arg, hdr); - - return ret; -} - -void ipath_layer_lid_changed(struct 
ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (dd->ipath_layer.l_arg && layer_intr) - layer_intr(dd->ipath_layer.l_arg, IPATH_LAYER_INT_LID); - - mutex_unlock(&ipath_layer_mutex); -} - -void ipath_layer_add(struct ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (layer_add_one) - dd->ipath_layer.l_arg = - layer_add_one(dd->ipath_unit, dd); - - mutex_unlock(&ipath_layer_mutex); -} - -void ipath_layer_remove(struct ipath_devdata *dd) -{ - mutex_lock(&ipath_layer_mutex); - - if (dd->ipath_layer.l_arg && layer_remove_one) { - layer_remove_one(dd->ipath_layer.l_arg); - dd->ipath_layer.l_arg = NULL; - } - - mutex_unlock(&ipath_layer_mutex); -} - -int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *), - void (*l_remove)(void *), - int (*l_intr)(void *, u32), - int (*l_rcv)(void *, void *, struct sk_buff *), - u16 l_rcv_opcode, - int (*l_rcv_lid)(void *, void *)) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - - layer_add_one = l_add; - layer_remove_one = l_remove; - layer_intr = l_intr; - layer_rcv = l_rcv; - layer_rcv_lid = l_rcv_lid; - ipath_layer_rcv_opcode = l_rcv_opcode; - - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - if (!(dd->ipath_flags & IPATH_INITTED)) - continue; - - if (dd->ipath_layer.l_arg) - continue; - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - dd->ipath_layer.l_arg = l_add(dd->ipath_unit, dd); - spin_lock_irqsave(&ipath_devs_lock, flags); - } - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - mutex_unlock(&ipath_layer_mutex); - - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_register); - -void ipath_layer_unregister(void) -{ - struct ipath_devdata *dd, *tmp; - unsigned long flags; - - mutex_lock(&ipath_layer_mutex); - spin_lock_irqsave(&ipath_devs_lock, flags); - - list_for_each_entry_safe(dd, tmp, &ipath_dev_list, ipath_list) { - if (dd->ipath_layer.l_arg && layer_remove_one) 
{ - spin_unlock_irqrestore(&ipath_devs_lock, flags); - layer_remove_one(dd->ipath_layer.l_arg); - spin_lock_irqsave(&ipath_devs_lock, flags); - dd->ipath_layer.l_arg = NULL; - } - } - - spin_unlock_irqrestore(&ipath_devs_lock, flags); - - layer_add_one = NULL; - layer_remove_one = NULL; - layer_intr = NULL; - layer_rcv = NULL; - layer_rcv_lid = NULL; - - mutex_unlock(&ipath_layer_mutex); -} - -EXPORT_SYMBOL_GPL(ipath_layer_unregister); - -int ipath_layer_open(struct ipath_devdata *dd, u32 * pktmax) -{ - int ret; - u32 intval = 0; - - mutex_lock(&ipath_layer_mutex); - - if (!dd->ipath_layer.l_arg) { - ret = -EINVAL; - goto bail; - } - - ret = ipath_setrcvhdrsize(dd, IPATH_HEADER_QUEUE_WORDS); - - if (ret < 0) - goto bail; - - *pktmax = dd->ipath_ibmaxlen; - - if (*dd->ipath_statusp & IPATH_STATUS_IB_READY) - intval |= IPATH_LAYER_INT_IF_UP; - if (dd->ipath_lid) - intval |= IPATH_LAYER_INT_LID; - if (dd->ipath_mlid) - intval |= IPATH_LAYER_INT_BCAST; - /* - * do this on open, in case low level is already up and - * just layered driver was reloaded, etc. - */ - if (intval) - layer_intr(dd->ipath_layer.l_arg, intval); - - ret = 0; -bail: - mutex_unlock(&ipath_layer_mutex); - - return ret; -} - -EXPORT_SYMBOL_GPL(ipath_layer_open); - -u16 ipath_layer_get_lid(struct ipath_devdata *dd) -{ - return dd->ipath_lid; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_lid); - -/** - * ipath_layer_get_mac - get the MAC address - * @dd: the infinipath device - * @mac: the MAC is put here - * - * This is the EUID-64 OUI octets (top 3), then - * skip the next 2 (which should both be zero or 0xff). - * The returned MAC is in network order - * mac points to at least 6 bytes of buffer - * We assume that by the time the LID is set, that the GUID is as valid - * as it's ever going to be, rather than adding yet another status bit. 
- */ - -int ipath_layer_get_mac(struct ipath_devdata *dd, u8 * mac) -{ - u8 *guid; - - guid = (u8 *) &dd->ipath_guid; - - mac[0] = guid[0]; - mac[1] = guid[1]; - mac[2] = guid[2]; - mac[3] = guid[5]; - mac[4] = guid[6]; - mac[5] = guid[7]; - if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff)) - ipath_dbg("Warning, guid bytes 3 and 4 not 0 or 0xffff: " - "%x %x\n", guid[3], guid[4]); - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_mac); - -u16 ipath_layer_get_bcast(struct ipath_devdata *dd) -{ - return dd->ipath_mlid; -} - -EXPORT_SYMBOL_GPL(ipath_layer_get_bcast); - -int ipath_layer_send_hdr(struct ipath_devdata *dd, struct ether_header *hdr) -{ - int ret = 0; - u32 __iomem *piobuf; - u32 plen, *uhdr; - size_t count; - __be16 vlsllnh; - - if (!(dd->ipath_flags & IPATH_RCVHDRSZ_SET)) { - ipath_dbg("send while not open\n"); - ret = -EINVAL; - } else - if ((dd->ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) || - dd->ipath_lid == 0) { - /* - * lid check is for when sma hasn't yet configured - */ - ret = -ENETDOWN; - ipath_cdbg(VERBOSE, "send while not ready, " - "mylid=%u, flags=0x%x\n", - dd->ipath_lid, dd->ipath_flags); - } - - vlsllnh = *((__be16 *) hdr); - if (vlsllnh != htons(IPATH_LRH_BTH)) { - ipath_dbg("Warning: lrh[0] wrong (%x, not %x); " - "not sending\n", be16_to_cpu(vlsllnh), - IPATH_LRH_BTH); - ret = -EINVAL; - } - if (ret) - goto done; - - /* Get a PIO buffer to use. 
*/ - piobuf = ipath_getpiobuf(dd, NULL); - if (piobuf == NULL) { - ret = -EBUSY; - goto done; - } - - plen = (sizeof(*hdr) >> 2); /* actual length */ - ipath_cdbg(EPKT, "0x%x+1w pio %p\n", plen, piobuf); - - writeq(plen+1, piobuf); /* len (+1 for pad) to pbc, no flags */ - ipath_flush_wc(); - piobuf += 2; - uhdr = (u32 *)hdr; - count = plen-1; /* amount we can copy before trigger word */ - __iowrite32_copy(piobuf, uhdr, count); - ipath_flush_wc(); - __raw_writel(uhdr[count], piobuf + count); - ipath_flush_wc(); /* ensure it's sent, now */ - - ipath_stats.sps_ether_spkts++; /* ether packet sent */ - -done: - return ret; -} - -EXPORT_SYMBOL_GPL(ipath_layer_send_hdr); - -int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd) -{ - set_bit(IPATH_S_PIOINTBUFAVAIL, &dd->ipath_sendctrl); - - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl); - return 0; -} - -EXPORT_SYMBOL_GPL(ipath_layer_set_piointbufavail_int); diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h deleted file mode 100644 index 415709c..0000000 --- a/drivers/infiniband/hw/ipath/ipath_layer.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. - * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. 
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - */ - -#ifndef _IPATH_LAYER_H -#define _IPATH_LAYER_H - -/* - * This header file is for symbols shared between the infinipath driver - * and drivers layered upon it (such as ipath). - */ - -struct sk_buff; -struct ipath_devdata; -struct ether_header; - -int ipath_layer_register(void *(*l_add)(int, struct ipath_devdata *), - void (*l_remove)(void *), - int (*l_intr)(void *, u32), - int (*l_rcv)(void *, void *, - struct sk_buff *), - u16 rcv_opcode, - int (*l_rcv_lid)(void *, void *)); -void ipath_layer_unregister(void); -int ipath_layer_open(struct ipath_devdata *, u32 * pktmax); -u16 ipath_layer_get_lid(struct ipath_devdata *dd); -int ipath_layer_get_mac(struct ipath_devdata *dd, u8 *); -u16 ipath_layer_get_bcast(struct ipath_devdata *dd); -int ipath_layer_send_hdr(struct ipath_devdata *dd, - struct ether_header *hdr); -int ipath_layer_set_piointbufavail_int(struct ipath_devdata *dd); - -/* ipath_ether interrupt values */ -#define IPATH_LAYER_INT_IF_UP 0x2 -#define IPATH_LAYER_INT_IF_DOWN 0x4 -#define IPATH_LAYER_INT_LID 0x8 -#define IPATH_LAYER_INT_SEND_CONTINUE 0x10 -#define IPATH_LAYER_INT_BCAST 0x40 - -extern unsigned ipath_debug; /* debugging bit mask */ - -#endif /* _IPATH_LAYER_H */ diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h 
b/drivers/infiniband/hw/ipath/ipath_verbs.h index f3d1f2c..0a233f5 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -42,8 +42,6 @@ #include #include -#include "ipath_layer.h" - #define IPATH_MAX_RDMA_ATOMIC 4 #define QPN_MAX (1 << 24) From pradeeps at linux.vnet.ibm.com Thu Jul 19 11:55:39 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Thu, 19 Jul 2007 11:55:39 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V8] patch Message-ID: <469FB3AB.6080304@linux.vnet.ibm.com> Addressed Roland's comments and more (hope this passes muster :)). The event_handler issue pointed out will be addressed in another patch. Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-19 11:17:39.000000000 -0400 @@ -95,11 +95,15 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, }; +#define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) #define IPOIB_OP_RECV (1ul << 31) #ifdef CONFIG_INFINIBAND_IPOIB_CM -#define IPOIB_CM_OP_SRQ (1ul << 30) +#define IPOIB_CM_OP_RECV (1ul << 30) + +#define NOSRQ_INDEX_TABLE_SIZE 128 +#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_TABLE_SIZE -1) #else -#define IPOIB_CM_OP_SRQ (0) +#define IPOIB_CM_OP_RECV (0) #endif /* structs */ @@ -166,11 +170,14 @@ enum ipoib_cm_state { }; struct ipoib_cm_rx { - struct ib_cm_id *id; - struct ib_qp *qp; - struct list_head list; - struct net_device *dev; - unsigned long jiffies; + struct ib_cm_id *id; + struct ib_qp *qp; + struct ipoib_cm_rx_buf *rx_ring; /* Used by NOSRQ only */ + struct list_head list; + struct net_device *dev; + unsigned long jiffies; + u32 index; /* wr_ids are distinguished by index + * to identify the QP -NOSRQ only */ enum ipoib_cm_state state; }; @@ -215,6 +222,8 @@ struct ipoib_cm_dev_priv { struct ib_wc ibwc[IPOIB_NUM_WC]; struct ib_sge rx_sge[IPOIB_CM_RX_SG]; struct ib_recv_wr rx_wr; + struct 
ipoib_cm_rx **rx_index_table; /* See ipoib_cm_dev_init() + *for usage of this element */ }; /* --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-10 17:02:33.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-19 13:55:59.000000000 -0400 @@ -49,6 +49,17 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" +static int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; +static int max_recv_buf = 1024; /* Default is 1024 MB */ + +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported"); + +module_param_named(max_receive_buffer, max_recv_buf, int, 0644); +MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); + +static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for NOSRQ */ + #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -81,20 +92,21 @@ static void ipoib_cm_dma_unmap_rx(struct ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE); } -static int ipoib_cm_post_receive(struct net_device *dev, int id) +static int post_receive_srq(struct net_device *dev, u64 id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_recv_wr *bad_wr; int i, ret; - priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_SRQ; + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; for (i = 0; i < IPOIB_CM_RX_SG; ++i) priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); if (unlikely(ret)) { - ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_warn(priv, "post srq failed for buf %lld (%d)\n", + (unsigned long long)id, ret); ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[id].mapping); dev_kfree_skb_any(priv->cm.srq_ring[id].skb); @@ -104,12 +116,47 @@ static int ipoib_cm_post_receive(struct return ret; } -static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int 
frags, +static int post_receive_nosrq(struct net_device *dev, u64 id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + u32 index; + u32 wr_id; + struct ipoib_cm_rx *rx_ptr; + + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + + rx_ptr = priv->cm.rx_index_table[index]; + + priv->cm.rx_wr.wr_id = id | IPOIB_CM_OP_RECV; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = rx_ptr->rx_ring[wr_id].mapping[i]; + + ret = ib_post_recv(rx_ptr->qp, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post recv failed for buf %d (%d)\n", + wr_id, ret); + ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1, + rx_ptr->rx_ring[wr_id].mapping); + dev_kfree_skb_any(rx_ptr->rx_ring[wr_id].skb); + rx_ptr->rx_ring[wr_id].skb = NULL; + } + + return ret; +} + +static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, u64 id, + int frags, u64 mapping[IPOIB_CM_RX_SG]) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; int i; + struct ipoib_cm_rx *rx_ptr; + u32 index, wr_id; skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); if (unlikely(!skb)) @@ -141,7 +188,14 @@ static struct sk_buff *ipoib_cm_alloc_rx goto partial_error; } - priv->cm.srq_ring[id].skb = skb; + if (priv->cm.srq) + priv->cm.srq_ring[id].skb = skb; + else { + index = id & NOSRQ_INDEX_MASK ; + wr_id = id >> 32; + rx_ptr = priv->cm.rx_index_table[index]; + rx_ptr->rx_ring[wr_id].skb = skb; + } return skb; partial_error: @@ -198,16 +252,21 @@ static struct ib_qp *ipoib_cm_create_rx_ { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_init_attr attr = { - .event_handler = ipoib_cm_rx_event_handler, .send_cq = priv->cq, /* For drain WR */ .recv_cq = priv->cq, .srq = priv->cm.srq, .cap.max_send_wr = 1, /* For drain WR */ + .cap.max_recv_wr = ipoib_recvq_size + 1, .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ .sq_sig_type = IB_SIGNAL_ALL_WR, .qp_type = IB_QPT_RC, .qp_context = p, }; + if (!priv->cm.srq) 
{ + attr.cap.max_recv_sge = IPOIB_CM_RX_SG; + attr.event_handler = NULL; + } else + attr.event_handler = ipoib_cm_rx_event_handler; return ib_create_qp(priv->pd, &attr); } @@ -282,12 +341,129 @@ static int ipoib_cm_send_rep(struct net_ rep.flow_control = 0; rep.rnr_retry_count = req->rnr_retry_count; rep.target_ack_delay = 20; /* FIXME */ - rep.srq = 1; rep.qp_num = qp->qp_num; rep.starting_psn = psn; + rep.srq = !!priv->cm.srq; return ib_send_cm_rep(cm_id, &rep); } +static void init_context_and_add_list(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, + struct ipoib_dev_priv *priv) +{ + cm_id->context = p; + p->jiffies = jiffies; + spin_lock_irq(&priv->lock); + if (list_empty(&priv->cm.passive_ids)) + queue_delayed_work(ipoib_workqueue, + &priv->cm.stale_task, IPOIB_CM_RX_DELAY); + if (priv->cm.srq) { + /* Add this entry to passive ids list head, but do not re-add + * it if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush + * list. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + } + spin_unlock_irq(&priv->lock); +} + +static int allocate_and_post_rbuf_nosrq(struct ib_cm_id *cm_id, + struct ipoib_cm_rx *p, unsigned psn) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u32 qp_num, index; + u64 i, recv_mem_used; + + qp_num = p->qp->qp_num; + + /* In the SRQ case there is a common rx buffer called the srq_ring. + * However, for the NOSRQ we create an rx_ring for every + * struct ipoib_cm_rx. 
+ */ + p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring, GFP_KERNEL); + if (!p->rx_ring) { + printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n", + qp_num); + return -ENOMEM; + } + + spin_lock_irq(&priv->lock); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irq(&priv->lock); + + init_context_and_add_list(cm_id, p, priv); + spin_lock_irq(&priv->lock); + + for (index = 0; index < max_rc_qp; index++) + if (priv->cm.rx_index_table[index] == NULL) + break; + + recv_mem_used = (u64)ipoib_recvq_size * + (u64)atomic_inc_return(&current_rc_qp) * CM_PACKET_SIZE; + if ((index == max_rc_qp) || + (recv_mem_used >= max_recv_buf * (1ul << 20))) { + spin_unlock_irq(&priv->lock); + ipoib_warn(priv, "NOSRQ has reached the configurable limit " + "of either %d RC QPs or, max recv buf size of " + "0x%x MB\n", max_rc_qp, max_recv_buf); + + /* We send a REJ to the remote side indicating that we + * have no more free RC QPs and leave it to the remote side + * to take appropriate action. This should leave the + * current set of QPs unaffected and any subsequent REQs + * will be able to use RC QPs if they are available. 
+ */ + ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP, NULL, 0, NULL, 0); + ret = -EINVAL; + goto err_alloc_and_post; + } + + priv->cm.rx_index_table[index] = p; + spin_unlock_irq(&priv->lock); + + /* We will subsequently use this stored pointer while freeing + * resources in stale task + */ + p->index = index; + + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) { + ipoib_warn(priv, "ipoib_cm_modify_rx_qp() failed %d\n", ret); + ipoib_cm_dev_cleanup(dev); + goto err_alloc_and_post; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive " + "buffer %d\n", (int)i); + ipoib_cm_dev_cleanup(dev); + ret = -ENOMEM; + goto err_alloc_and_post; + } + + if (post_receive_nosrq(dev, i << 32 | index)) { + ipoib_warn(priv, "post_receive_nosrq " + "failed for buf %lld\n", (unsigned long long)i); + ipoib_cm_dev_cleanup(dev); + ret = -EIO; + goto err_alloc_and_post; + } + } + + return 0; + +err_alloc_and_post: + atomic_dec(&current_rc_qp); + kfree(p->rx_ring); + return ret; +} + static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { struct net_device *dev = cm_id->context; @@ -302,9 +478,6 @@ static int ipoib_cm_req_handler(struct i return -ENOMEM; p->dev = dev; p->id = cm_id; - cm_id->context = p; - p->state = IPOIB_CM_RX_LIVE; - p->jiffies = jiffies; INIT_LIST_HEAD(&p->list); p->qp = ipoib_cm_create_rx_qp(dev, p); @@ -314,19 +487,21 @@ static int ipoib_cm_req_handler(struct i } psn = random32() & 0xffffff; - ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); - if (ret) - goto err_modify; + if (!priv->cm.srq) { + ret = allocate_and_post_rbuf_nosrq(cm_id, p, psn); + if (ret) + goto err_post_nosrq; + } else { + p->rx_ring = NULL; + ret = ipoib_cm_modify_rx_qp(dev, cm_id, p->qp, psn); + if (ret) + goto err_modify; + } - spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, - &priv->cm.stale_task, 
IPOIB_CM_RX_DELAY); - /* Add this entry to passive ids list head, but do not re-add it - * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ - p->jiffies = jiffies; - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irq(&priv->lock); + if (priv->cm.srq) { + p->state = IPOIB_CM_RX_LIVE; + init_context_and_add_list(cm_id, p, priv); + } ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd, psn); if (ret) { @@ -336,6 +511,8 @@ static int ipoib_cm_req_handler(struct i } return 0; +err_post_nosrq: + list_del_init(&p->list); err_modify: ib_destroy_qp(p->qp); err_qp: @@ -399,29 +576,60 @@ static void skb_put_frags(struct sk_buff } } -void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +static void timer_check_srq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. + */ + if (p->state == IPOIB_CM_RX_LIVE) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static void timer_check_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + unsigned long flags; + + if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { + spin_lock_irqsave(&priv->lock, flags); + p->jiffies = jiffies; + /* Move this entry to list head, but do + * not re-add it if it has been removed. 
*/ + if (!list_empty(&p->list)) + list_move(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +void handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned int wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ; + u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV; struct sk_buff *skb, *newskb; struct ipoib_cm_rx *p; unsigned long flags; u64 mapping[IPOIB_CM_RX_SG]; - int frags; + int frags, ret; - ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", - wr_id, wc->status); + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); if (unlikely(wr_id >= ipoib_recvq_size)) { - if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_SRQ)) { + if (wr_id == (IPOIB_CM_RX_DRAIN_WRID & ~IPOIB_CM_OP_RECV)) { spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else - ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", - wr_id, ipoib_recvq_size); + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); return; } @@ -429,23 +637,15 @@ void ipoib_cm_handle_rx_wc(struct net_de if (unlikely(wc->status != IB_WC_SUCCESS)) { ipoib_dbg(priv, "cm recv error " - "(status=%d, wrid=%d vend_err %x)\n", - wc->status, wr_id, wc->vendor_err); + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); ++priv->stats.rx_dropped; - goto repost; + goto repost_srq; } if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { p = wc->qp->qp_context; - if (p && time_after_eq(jiffies, p->jiffies + IPOIB_CM_RX_UPDATE_TIME)) { - spin_lock_irqsave(&priv->lock, flags); - p->jiffies = jiffies; - /* Move this entry to list head, but do not re-add it - * if it has been moved out of list. 
*/ - if (p->state == IPOIB_CM_RX_LIVE) - list_move(&p->list, &priv->cm.passive_ids); - spin_unlock_irqrestore(&priv->lock, flags); - } + timer_check_srq(priv, p); } frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, @@ -457,13 +657,113 @@ void ipoib_cm_handle_rx_wc(struct net_de * If we can't allocate a new RX buffer, dump * this packet and reuse the old buffer. */ - ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); + ++priv->stats.rx_dropped; + goto repost_srq; + } + + ipoib_cm_dma_unmap_rx(priv, frags, + priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb_reset_mac_header(skb); + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_receive_skb(skb); + +repost_srq: + ret = post_receive_srq(dev, wr_id); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_srq failed for buf %lld\n", + (unsigned long long)wr_id); + +} + +static void handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb, *newskb; + u64 mapping[IPOIB_CM_RX_SG], wr_id = wc->wr_id >> 32; + u32 index; + struct ipoib_cm_rx *rx_ptr; + int frags, ret; + + + ipoib_dbg_data(priv, "cm recv completion: id %lld, status: %d\n", + (unsigned long long)wr_id, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n", + (unsigned long long)wr_id, ipoib_recvq_size); + return; + } + + index = (wc->wr_id 
& ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK ; + + /* This is the only place where rx_ptr could be a NULL - could + * have just received a packet from a connection that has become + * stale and so is going away. We will simply drop the packet and + * let the hardware (it's IB_QPT_RC) handle the dropped packet. + * In the timer_check() function below, p->jiffies is updated and + * hence the connection will not be stale after that. + */ + rx_ptr = priv->cm.rx_index_table[index]; + if (unlikely(!rx_ptr)) { + ipoib_warn(priv, "Received packet from a connection " + "that is going away. Hardware will handle it.\n"); + return; + } + + skb = rx_ptr->rx_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%lld vend_err %x)\n", + wc->status, (unsigned long long)wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost_nosrq; + } + + if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) { + /* There are no guarantees that wc->qp is not NULL for HCAs + * that do not support SRQ. */ + timer_check_nosrq(priv, rx_ptr); + } + + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, + (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; + + newskb = ipoib_cm_alloc_rx_skb(dev, wr_id << 32 | index, frags, + mapping); + if (unlikely(!newskb)) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + ipoib_dbg(priv, "failed to allocate receive buffer %lld\n", + (unsigned long long)wr_id); ++priv->stats.rx_dropped; - goto repost; + goto repost_nosrq; } - ipoib_cm_dma_unmap_rx(priv, frags, priv->cm.srq_ring[wr_id].mapping); - memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, (frags + 1) * sizeof *mapping); + ipoib_cm_dma_unmap_rx(priv, frags, rx_ptr->rx_ring[wr_id].mapping); + memcpy(rx_ptr->rx_ring[wr_id].mapping, mapping, + (frags + 1) * sizeof *mapping); ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); @@ -483,10 +783,22 @@ void ipoib_cm_handle_rx_wc(struct net_de skb->pkt_type = PACKET_HOST; netif_receive_skb(skb); -repost: - if (unlikely(ipoib_cm_post_receive(dev, wr_id))) - ipoib_warn(priv, "ipoib_cm_post_receive failed " - "for buf %d\n", wr_id); +repost_nosrq: + ret = post_receive_nosrq(dev, wr_id << 32 | index); + + if (unlikely(ret)) + ipoib_warn(priv, "post_receive_nosrq failed for buf %lld\n", + (unsigned long long)wr_id); +} + +void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->cm.srq) + handle_rx_wc_srq(dev, wc); + else + handle_rx_wc_nosrq(dev, wc); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -678,6 +990,42 @@ err_cm: return ret; } +static void free_resources_nosrq(struct ipoib_dev_priv *priv, struct ipoib_cm_rx *p) +{ + int i; + + for (i = 0; i < ipoib_recvq_size; ++i) + if (p->rx_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, + IPOIB_CM_RX_SG - 1, + p->rx_ring[i].mapping); + dev_kfree_skb_any(p->rx_ring[i].skb); + p->rx_ring[i].skb = NULL; + } + kfree(p->rx_ring); +} + +void dev_stop_nosrq(struct ipoib_dev_priv *priv) +{ + struct ipoib_cm_rx *p; + + spin_lock_irq(&priv->lock); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + free_resources_nosrq(priv, p); + list_del(&p->list); + spin_unlock_irq(&priv->lock); + ib_destroy_cm_id(p->id); + 
ib_destroy_qp(p->qp); + atomic_dec(&current_rc_qp); + kfree(p); + spin_lock_irq(&priv->lock); + } + spin_unlock_irq(&priv->lock); + + cancel_delayed_work(&priv->cm.stale_task); +} + void ipoib_cm_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -692,6 +1040,11 @@ void ipoib_cm_dev_stop(struct net_device ib_destroy_cm_id(priv->cm.id); priv->cm.id = NULL; + if (!priv->cm.srq) { + dev_stop_nosrq(priv); + return; + } + spin_lock_irq(&priv->lock); while (!list_empty(&priv->cm.passive_ids)) { p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_recv_wr = 1; attr.cap.max_send_sge = 1; + attr.cap.max_recv_sge = 1; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -855,7 +1210,7 @@ static int ipoib_cm_send_req(struct net_ req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ req.max_cm_retries = 15; - req.srq = 1; + req.srq = !!priv->cm.srq; return ib_send_cm_req(id, &req); } @@ -1200,6 +1555,8 @@ static void ipoib_cm_rx_reap(struct work list_for_each_entry_safe(p, n, &list, list) { ib_destroy_cm_id(p->id); ib_destroy_qp(p->qp); + if (!priv->cm.srq) + atomic_dec(&current_rc_qp); kfree(p); } } @@ -1218,12 +1575,19 @@ static void ipoib_cm_stale_task(struct w p = list_entry(priv->cm.passive_ids.prev, typeof(*p), list); if (time_before_eq(jiffies, p->jiffies + IPOIB_CM_RX_TIMEOUT)) break; - list_move(&p->list, &priv->cm.rx_error_list); - p->state = IPOIB_CM_RX_ERROR; - spin_unlock_irq(&priv->lock); - ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); - if (ret) - ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + if (!priv->cm.srq) { + free_resources_nosrq(priv, p); + list_del_init(&p->list); + priv->cm.rx_index_table[p->index] = NULL; + 
spin_unlock_irq(&priv->lock); + } else { + list_move(&p->list, &priv->cm.rx_error_list); + p->state = IPOIB_CM_RX_ERROR; + spin_unlock_irq(&priv->lock); + ret = ib_modify_qp(p->qp, &ipoib_cm_err_attr, IB_QP_STATE); + if (ret) + ipoib_warn(priv, "unable to move qp to error state: %d\n", ret); + } spin_lock_irq(&priv->lock); } @@ -1277,16 +1641,40 @@ int ipoib_cm_add_mode_attr(struct net_de return device_create_file(&dev->dev, &dev_attr_mode); } +static int create_srq(struct net_device *dev, struct ipoib_dev_priv *priv) +{ + struct ib_srq_init_attr srq_init_attr; + int ret; + + srq_init_attr.attr.max_wr = ipoib_recvq_size; + srq_init_attr.attr.max_sge = IPOIB_CM_RX_SG; + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * + sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring " + "(%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + return 0; +} + int ipoib_cm_dev_init(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_srq_init_attr srq_init_attr = { - .attr = { - .max_wr = ipoib_recvq_size, - .max_sge = IPOIB_CM_RX_SG - } - }; int ret, i; + struct ib_device_attr attr; INIT_LIST_HEAD(&priv->cm.passive_ids); INIT_LIST_HEAD(&priv->cm.reap_list); @@ -1303,20 +1691,32 @@ int ipoib_cm_dev_init(struct net_device skb_queue_head_init(&priv->cm.skb_queue); - priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); - if (IS_ERR(priv->cm.srq)) { - ret = PTR_ERR(priv->cm.srq); - priv->cm.srq = NULL; + ret = ib_query_device(priv->ca, &attr); + if (ret) return ret; - } - priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, - GFP_KERNEL); - if (!priv->cm.srq_ring) { - printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", - 
priv->ca->name, ipoib_recvq_size); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; + if (attr.max_srq) { + /* This device supports SRQ */ + ret = create_srq(dev, priv); + if (ret) + return ret; + priv->cm.rx_index_table = NULL; + } else { + priv->cm.srq = NULL; + priv->cm.srq_ring = NULL; + + /* Every new REQ that arrives creates a struct ipoib_cm_rx. + * These structures form a link list starting with the + * passive_ids. For quick and easy access we maintain a table + * of pointers to struct ipoib_cm_rx called the rx_index_table + */ + priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); + if (!priv->cm.rx_index_table) { + printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n"); + return -ENOMEM; + } } for (i = 0; i < IPOIB_CM_RX_SG; ++i) @@ -1329,17 +1729,24 @@ int ipoib_cm_dev_init(struct net_device priv->cm.rx_wr.sg_list = priv->cm.rx_sge; priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; - for (i = 0; i < ipoib_recvq_size; ++i) { - if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, + /* One can post receive buffers even before the RX QP is created + * only in the SRQ case. 
Therefore for NOSRQ we skip the rest of init + * and do that in ipoib_cm_req_handler() + */ + + if (priv->cm.srq) { + for (i = 0; i < ipoib_recvq_size; ++i) { + if (!ipoib_cm_alloc_rx_skb(dev, i, IPOIB_CM_RX_SG - 1, priv->cm.srq_ring[i].mapping)) { - ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -ENOMEM; - } - if (ipoib_cm_post_receive(dev, i)) { - ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); - ipoib_cm_dev_cleanup(dev); - return -EIO; + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (post_receive_srq(dev, i)) { + ipoib_warn(priv, "post_receive_srq failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } } } --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-10 18:30:10.000000000 -0400 @@ -299,7 +299,7 @@ int ipoib_poll(struct net_device *dev, i for (i = 0; i < n; ++i) { struct ib_wc *wc = priv->ibwc + i; - if (wc->wr_id & IPOIB_CM_OP_SRQ) { + if (wc->wr_id & IPOIB_CM_OP_RECV) { ++done; --max; ipoib_cm_handle_rx_wc(dev, wc); @@ -557,7 +557,7 @@ void ipoib_drain_cq(struct net_device *d do { n = ib_poll_cq(priv->cq, IPOIB_NUM_WC, priv->ibwc); for (i = 0; i < n; ++i) { - if (priv->ibwc[i].wr_id & IPOIB_CM_OP_SRQ) + if (priv->ibwc[i].wr_id & IPOIB_CM_OP_RECV) ipoib_cm_handle_rx_wc(dev, priv->ibwc + i); else if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-05-30 14:56:25.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-19 02:55:24.000000000 -0400 @@ -175,6 +175,18 @@ int ipoib_transport_dev_init(struct net_ if (!ret) size += ipoib_recvq_size + 1 /* 1 extra for rx_drain_qp */; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + + /* We increase the size of the CQ in the NOSRQ 
case to prevent CQ + * overflow. Every new REQ creates a new RX QP and each QP has an + * RX ring associated with it. Therefore we could have + * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs + * in a CQ. + */ + if (!priv->cm.srq) + size += (NOSRQ_INDEX_TABLE_SIZE - 1) * ipoib_recvq_size; +#endif + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); if (IS_ERR(priv->cq)) { printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); From hal.rosenstock at gmail.com Thu Jul 19 12:32:26 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 19 Jul 2007 12:32:26 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <469FAAD8.8050505@open-mpi.org> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> <469FA652.4060909@open-mpi.org> <469FAAD8.8050505@open-mpi.org> Message-ID: On 7/19/07, Andrew Friedley wrote: > > > > Hal Rosenstock wrote: > > Thanks. I can only comment on the OpenSM configuration and in general on > > SMs > > so I'm still not sure what limits you are hitting; it may be multiple > but > > not sure. Some seemed to be end node (HCA) related based on a previous > > email. > > Thanks for your help. Yes I'm thinking the same thing, though what I'm > seeing seemingly contradicts the limits that I'm told are in place (and > have now been changed post-v1.2). > > >> Is the algorithm for recalculating the tree documented at all? Or, > >> where is the code for it (assuming I have access)? I feel like I'm > >> missing something here that explains why it's so costly. > > > > > > I'm afraid it is just the code AFAIK :-( > > OK, do you know where it is in the OpenSM code base? 
Start with osm_sa_mcmember_record.c and work towards: osm_mcast_mgr.c osm_mcm_info.c osm_mtree.c osm_mcast_tbl.c osm_mcm_port.c osm_multicast.c -- Hal Andrew > From mshefty at ichips.intel.com Thu Jul 19 12:38:04 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 19 Jul 2007 12:38:04 -0700 Subject: [ofa-general] Limited number of multicasts groups that can be joined? In-Reply-To: <469FA9E8.90609@open-mpi.org> References: <46699A6D.4070300@open-mpi.org> <4683D7D6.50402@open-mpi.org> <468426B6.3060602@ichips.intel.com> <469F9BAB.4080504@open-mpi.org> <469FA652.4060909@open-mpi.org> <469FA9E8.90609@open-mpi.org> Message-ID: <469FBD9C.3020104@ichips.intel.com> > Also some newer results. I had a long run going on the 128 node machine > to see how many groups I really could join, and it just errored out > after joining 892 groups successfully. Specifically, I got an > RDMA_CM_EVENT_MULTICAST_ERROR event containing status -22 ('Unknown > error' according to strerror). errno is still cleared to 'Success'. I > don't have time to go look at the code to see where this came from right > now, but does anyone know what it means? This is EINVAL (errno 22) and is coming from librdmacm. Unfortunately, that doesn't really help narrow down what the actual cause is. And I don't understand the behavior that you're seeing at all. - Sean From jim at mellanox.com Thu Jul 19 14:26:38 2007 From: jim at mellanox.com (Jim Mott) Date: Thu, 19 Jul 2007 14:26:38 -0700 Subject: [ofa-general] libsdp in OFED 1.1 In-Reply-To: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp> References: <7C1D552561AF0544ACC7CF6F10E4966ECB541A@chronus.3leafnetworks.corp> Message-ID: With the setup you describe, I have no problems under OFED 1.2. SDP does get used automatically, and ls does not complain. I do not have any experience with SDP under OFED 1.1. I will try to look at it soon. 
The OFED 1.1 library code in port.c includes a close() function, so the easy answer is not going to do it. All my testing has been with the same libsdp.conf setup you are using (the 1.2 default), so I do not expect your setup to cause any problems. The entry in modprobe.conf.local is the normal thing. I would not expect it to be causing you any problems. All my testing has been with the local environment (LD_LIBRARY_PATH, LD_PRELOAD) overrides instead of /etc/ld.so.preload. Note that because you put the fully qualified path of the 64-bit library in /etc/ld.so.preload, there will be issues with 32-bit executables. Not sure if that is getting you with ls, but I have seen strange problems with other things. Could you remove your entry from /etc/ld.so.preload, set the environment variables as described in my original note by hand, and retry the ls command? Jim -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan Robertson Sent: Thursday, July 19, 2007 11:31 AM To: general at lists.openfabrics.org Subject: FW: [ofa-general] libsdp in OFED 1.1 Hi Jim, We are actually using OFED 1.1. Hopefully we'll move to 1.2 in a few weeks. The systems using it are SLES 9 SP3. uname -a: Linux oracle 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux I have added alias net-pf-27 ib_sdp to modprobe.conf.local I have modified /usr/local/ofed/etc/libsdp.conf to have the following lines: log min-level 9 destination syslog use both server "/usr/local/bin/netserver" *:* use both client "/usr/local/bin/netperf" *:* And I created /etc/ld.so.preload and have: /usr/local/ofed/lib64/libsdp.so Is there a close function in OFED 1.1? Perhaps I should try to add that to port.c for 1.1? My reply to your email bounced... 5.1.0 - Unknown address error 550-'5.1.1 unknown or illegal alias: @austin.rr.com' Thanks! 
Jonathan -----Original Message----- From: Jim Mott [mailto:jimmmott at austin.rr.com] Sent: Wednesday, July 18, 2007 3:42 PM To: Jonathan Robertson Subject: RE: [ofa-general] libsdp in OFED 1.1 Hi, I have just taken over support for libsdp and am feeling my way here. Probably I should have replied to the list, but this works too. I assume you are using OFED 1.2 version of the code. There is a close() function in that code (port.c), so there is something fishy here. Could you send a little more info please. Stuff like distro, 32/64, and perhaps the script/commands you use to automate the preload process. Something like: # uname -a Linux sw106.lab.mtl.com 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT \ 2006 x86_64 x86_64 x86_64 GNU/Linux # export LD_LIBRARY_PATH=/usr/local/ofed/lib64:/usr/local/ofed/lib # export LD_PRELOAD=libsdp.so # export LIBSDP_CONFIG_FILE=/etc/infiniband/libsdp.conf # ls config_parser.c config_scanner.c libsdp.la Makefile match.c port.lo config_parser.h config_scanner.lo log.c Makefile.am match.lo sdp_inet.h config_parser.lo libsdp.h log.lo Makefile.in port.c socket.c Thanks, Jim ========================= From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jonathan Robertson Sent: Wednesday, July 18, 2007 3:19 PM To: general at lists.openfabrics.org Subject: [ofa-general] libsdp in OFED 1.1 Hello, I have been using libsdp, and preloading it with the application. I would like to have it automatically preloaded, but am concerned about some error messages that seem harmless. So I don't want to have our client use the ld.so.preload if there are going to be messages. I see the following when I run a simple 'ls' # ls Wed Jul 18 06:11:09 2007 ls[8105] libsdp Error close: no implementation for close found  . .. # Any suggestions? 
I have the following in libsdp.conf Log min-level 9 destination syslog Use both server netserver *:* Use both client netperf *:* Our client is interested in having weblogic communicate with the oracle DB using SDP, and the interface to oracle and weblogic being accessible via tcp/ip over Ethernet as well. Thanks! Jonathan _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Jul 19 14:44:36 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Jul 2007 14:44:36 -0700 Subject: [ofa-general] latest libipathverbs.git tree Message-ID: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com> Is the git tree on openfabrics: git://git.openfabrics.org/~bos/libipathverbs.git the most recent version of user space verbs available for the ipath cards? - Sean From arthur.jones at qlogic.com Thu Jul 19 14:47:45 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Thu, 19 Jul 2007 14:47:45 -0700 Subject: [ofa-general] latest libipathverbs.git tree In-Reply-To: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com> References: <000001c7ca4e$00550460$9c98070a@amr.corp.intel.com> Message-ID: <20070719214745.GB20240@bauxite.pathscale.com> hi sean, ... On Thu, Jul 19, 2007 at 02:44:36PM -0700, Sean Hefty wrote: > Is the git tree on openfabrics: > > git://git.openfabrics.org/~bos/libipathverbs.git > > the most recent version of user space verbs available for the ipath cards? 
no, the canonical libipathverbs is now: git://git.openfabrics.org/~ralphc/libipathverbs arthur From sean.hefty at intel.com Thu Jul 19 14:55:30 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Jul 2007 14:55:30 -0700 Subject: [ofa-general] latest libipathverbs.git tree In-Reply-To: <20070719214745.GB20240@bauxite.pathscale.com> Message-ID: <000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com> >no, the canonical libipathverbs is now: > >git://git.openfabrics.org/~ralphc/libipathverbs Thanks. I believe if you create /home/ralphc/public_html directory, and place symbolic links in it to the git tree, then it will be visible on http://www.openfabrics.org/git. I don't remember if additional setup on the server is required. - Sean From arthur.jones at qlogic.com Thu Jul 19 15:09:05 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Thu, 19 Jul 2007 15:09:05 -0700 Subject: [ofa-general] latest libipathverbs.git tree In-Reply-To: <000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com> References: <20070719214745.GB20240@bauxite.pathscale.com> <000101c7ca4f$86743ba0$9c98070a@amr.corp.intel.com> Message-ID: <20070719220905.GN12489@bauxite.pathscale.com> hi sean, ... On Thu, Jul 19, 2007 at 02:55:30PM -0700, Sean Hefty wrote: > >no, the canonical libipathverbs is now: > > > >git://git.openfabrics.org/~ralphc/libipathverbs > > Thanks. > > I believe if you create /home/ralphc/public_html directory, and place symbolic > links in it to the git tree, then it will be visible on > http://www.openfabrics.org/git. I don't remember if additional setup on the > server is required. thanks, i tried it, but it doesn't seem to be sufficient... 
arthur From sean.hefty at intel.com Thu Jul 19 15:13:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 19 Jul 2007 15:13:35 -0700 Subject: [ofa-general] latest libipathverbs.git tree In-Reply-To: <20070719220905.GN12489@bauxite.pathscale.com> Message-ID: <000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com> Jeff/Vlad, Do either of you know the missing step to adding Ralph's git tree to the http view? (See below.) - Sean >> I believe if you create /home/ralphc/public_html directory, and place >symbolic >> links in it to the git tree, then it will be visible on >> http://www.openfabrics.org/git. I don't remember if additional setup on the >> server is required. > >thanks, i tried it, but it doesn't seem to be sufficient... > >arthur From weikuan.yu at gmail.com Thu Jul 19 16:01:10 2007 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Thu, 19 Jul 2007 19:01:10 -0400 Subject: [ofa-general] IEEE Hot Interconnect 2007: Registration Now Open Message-ID: <469FED36.9080000@gmail.com> **** Conference Dates: August 22-24, 2007, ********* CALL FOR PARTICIPATION: HOT Interconnect 2007 -- Registration Now Open 15th Annual IEEE Symposium on High-Performance Interconnects August 22nd-24th, 2007, Stanford University, Palo Alto, California William R. Hewlett Teaching Center http://www.hoti.org/ We cordially invite you to attend the 15th Annual IEEE Symposium on High-Performance Interconnects. IEEE Hot Interconnects brings together architects and designers of high performance chips, software, and systems at the University and global business levels. Presentations focus on up-to-the-minute developments demonstrating leading-edge designs by engineers and researchers throughout the world. Two days of technical sessions led by John Lockwood and Fabrizio Petrini, our 2007 General Co-Chairs followed by one day of tutorials to keep you on top of the latest industry developments and academic laboratories. Our objective is to address the Networking and SuperComputing families. 
This year we are proud to have Ron Brightwell with Sandia National Laboratories and Dhabaleswar Panda from Ohio State University as our 2007 IEEE Hot Interconnects Program Co-Chairs. They are putting together a combined 'HOT' program that includes interconnects in Supercomputing.

Highlights include:
-------------------

* Keynote talks:
   o Alex Dickinson, Co-Founder, President & CEO, Luxtera
     "CMOS Photonics - Bringing Moore's Law to Optical Interconnect"
   o Dr. Tryggve Fossum, Intel Fellow and Director of Microarchitecture Development
     "On-Die Interconnect and Other Challenges for Chip-Level Multi-Processing"

* Panel: Multi-Multicore Interconnect: Scale-Up or Melt-Down?
   Panelists:
   -- Charlie Janac, President and CEO, Arteris
   -- Arun Sharma, Performance Engineering, Google
   -- Manu Thapar, Vice President, Platform Engineering, Yahoo
   -- Drew Wingard, CTO, Sonics
   Moderator: Dan Pitt, Director, Vquence Pty. Ltd.

* Tutorials
   o NetFPGA (Full day)
     Nick McKeown and John Lockwood, Stanford University
   o Introduction to Programming High Performance Applications on the CELL Broadband Engine (Half-day)
     Dr. Jakub Kurzak and Dr. Alfredo Buttari
     Innovative Computing Laboratory, University of Tennessee at Knoxville
   o Design of Interconnection Networks (Half-day)
     John Kim, Stanford University and Dennis Abts, Cray

* Technical Program
   o A strong, single-track program featuring 16 research papers on cutting-edge interconnect technologies

Full details are on the conference Web site: http://www.hoti.org/hoti15/program/

Important dates:
----------------

* Registration NOW open, please take advantage of advanced registration. (http://www.hoti.org/hoti15/2007reg/)
* Advanced Registration Deadline: Midnight Aug 15th, 2007
* Attendees make their own choices of Hotel.
For further info: (http://www.hoti.org/hoti15/attendee/) * Main Symposium: August 22-23, 2007 * Panel: 7pm-8pm, August 22nd , 2007 * Tutorials: August 24th, 2007
From kliteyn at mellanox.co.il Thu Jul 19 21:46:57 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 20 Jul 2007 07:46:57 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-20:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From krkumar2 at in.ibm.com Thu Jul 19 23:32:01 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:02:01 +0530 Subject: [ofa-general] [PATCH 01/10] HOWTO documentation for Batching SKB. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063201.26341.79273.sendpatchset@localhost.localdomain> Add HOWTO documentation on what batching is, how to implement drivers to use it, and how users can enable/disable batching.
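The driver opt-in that the HOWTO describes can be sketched in userland roughly as follows. This is only an illustrative model, not kernel code: struct model_netdev, model_register() and model_xmit_batch() are hypothetical stand-ins, the two constants are the values used in the netdevice.h part of this series, and the sketch assumes the batch list allocation always succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in the netdevice.h changes of this series. */
#define NETIF_F_BATCH_SKBS	8192UL
#define MIN_QUEUE_LEN_BATCH	16

/* Hypothetical stand-in for struct net_device; only the fields used here. */
struct model_netdev {
	unsigned long features;
	unsigned long tx_queue_len;
	int (*hard_start_xmit_batch)(struct model_netdev *dev);
	int blist_ready;	/* models "dev->skb_blist was allocated" */
};

/* Hypothetical batch handler a driver would supply. */
static int model_xmit_batch(struct model_netdev *dev)
{
	(void)dev;
	return 0;
}

/*
 * Models the registration-time check: batching stays enabled only if the
 * driver set the feature bit, supplied a batch handler, and has a queue of
 * at least MIN_QUEUE_LEN_BATCH entries. Assumption: allocation of the
 * batch list never fails in this model.
 */
static void model_register(struct model_netdev *dev)
{
	assert(dev != NULL);
	dev->blist_ready = 0;

	if (!(dev->features & NETIF_F_BATCH_SKBS))
		return;

	if (!dev->hard_start_xmit_batch ||
	    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
		/* No handler or queue too short: drop the feature bit. */
		dev->features &= ~NETIF_F_BATCH_SKBS;
		return;
	}

	dev->blist_ready = 1;
	dev->tx_queue_len >>= 1;	/* the series halves the queue length */
}
```

A device that sets the bit with a handler and a queue length of 1000 ends up with batching ready and a queue length of 500; one with a queue length of 8, or with no handler, simply loses the feature bit.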
Signed-off-by: Krishna Kumar --- Batching_skb_API.txt | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 91 insertions(+) diff -ruNp org/Documentation/networking/Batching_skb_API.txt new/Documentation/networking/Batching_skb_API.txt --- org/Documentation/networking/Batching_skb_API.txt 1970-01-01 05:30:00.000000000 +0530 +++ new/Documentation/networking/Batching_skb_API.txt 2007-07-20 08:30:22.000000000 +0530 @@ -0,0 +1,91 @@ + HOWTO for batching skb API support + ----------------------------------- + +Section 1: What is batching skb API ? +Section 2: How batching API works vs the original API ? +Section 3: How drivers can support this API ? +Section 4: How users can work with this API ? + + +Introduction: Kernel support for batching skb +----------------------------------------------- + +An extended API is supported in the netdevice layer, which is very similar +to the existing hard_start_xmit() API. Drivers which wish to take advantage +of this new API should implement this routine similar to how the +hard_start_xmit handler is written. The difference between these API's is +that while the existing hard_start_xmit processes one skb, the new API can +process multiple skbs (or even one) in a single call. It is also possible +for the driver writer to re-use most of the code from the existing API in +the new API without having code duplication. + + +Section 1: What is batching skb API ? +------------------------------------- + + This is a new API that is optionally exported by a driver. The pre- + requisite for a driver to use this API is that it should have a + reasonably sized hardware queue that can process multiple skbs. + + +Section 2: How batching API works vs the original API ? +------------------------------------------------------- + + The networking stack normally gets called from upper layer protocols + with a single skb to xmit. This skb is first enqueue'd and an + attempt is next made to transmit it immediately (via qdisc_run). 
+ However, events like driver lock contention, queue stopped, etc, can + result in the skb not getting sent out, and it remains in the queue. + When a new xmit is called or when the queue is re-enabled, qdisc_run + could potentially find multiple packets in the queue, and have to + send them all out one by one iteratively. + + The batching skb API case was added to exploit this situation where + if there are multiple skbs, all of them can be sent to the device in + one shot. This reduces driver processing, locking at the driver (or + in stack for ~LLTX drivers) gets amortized over multiple skbs, and + in case of specific drivers where every xmit results in a completion + processing (like IPoIB), optimizations could be made in the driver + to get a completion for only the last skb that was sent which will + result in saving interrupts for every (but the last) skb that was + sent in the same batch. + + This batching can result in significant performance gains for + systems that have multiple data stream paths over the same network + interface card. + + +Section 3: How drivers can support this API ? +--------------------------------------------- + + The new API - dev->hard_start_xmit_batch(struct net_device *dev), + simplistically, can be written almost identically to the regular + xmit API (hard_start_xmit), except that all skbs on dev->skb_blist + should be processed by the driver instead of just one skb. The new + API doesn't get any skb as argument to process, instead it picks up + all the skbs from dev->skb_blist, where it was added by the stack, + and tries to send them out. + + Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in + dev->features, and dev->hard_start_xmit_batch should point to the + new API implemented for that driver. + + +Section 4: How users can work with this API ? +--------------------------------------------- + + Batching could be disabled for a particular device, e.g. 
on desktop + systems if only one stream of network activity for that device is + taking place, since performance could be slightly affected due to + extra processing that batching adds. Batching can be enabled if + more than one stream of network activity per device is being done, + e.g. on servers, or even desktop usage with multiple browser, chat, + file transfer sessions, etc. + + Per device batching can be enabled/disabled using: + + echo 1 > /sys/class/net//tx_batch_skbs (enable) + echo 0 > /sys/class/net//tx_batch_skbs (disable) + + E.g. to enable batching on eth0, run: + echo 1 > /sys/class/net/eth0/tx_batch_skbs From krkumar2 at in.ibm.com Thu Jul 19 23:31:49 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:01:49 +0530 Subject: [ofa-general] [PATCH 00/10] Implement batching skb API Message-ID: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Hi Dave, Roland, everyone, In May, I had proposed creating an API for sending 'n' skbs to a driver to reduce lock overhead, DMA operations, and specific to drivers that have completion notification like IPoIB - reduce completion handling ("[RFC] New driver API to speed up small packets xmits" @ http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent initial test results for E1000 which showed minor improvements (but also got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2. After fine-tuning qdisc and other changes, I modified IPoIB to use this API, and now get good gains. Summary for TCP & No Delay: 1 process improves for all cases from 1.4% to 49.5%; 4 process has almost identical improvements from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP was tested with 1 process netperf with small increase in BW but big improvement in Service Demand. Netperf latency tests show small drop in transaction rate (results in separate attachment). 
To verify that performance does not degrade with batching turned off (as is the case for all existing drivers), I ran tests with tx_batch_skbs=0 against the original code and saw no real degradation. I also enabled all kernel debug options to catch panics, warnings, use of freed memory, etc., and simulated driver errors to get coverage on core & IPoIB error paths. Testing was on 2-CPU X-series systems and 8-CPU PPC64 Power5 systems using IPoIB over mthca, and E1000 (using the driver that Jamal had converted, but without getting improvement). On i386, the size of the kernel (drivers are modules) increased by: text: 0.007% data: 0.007% bss: 0% total: 0.03%.

There is a parallel WIP by Jamal but the two implementations are completely different since the code bases from the start were separate. Key changes:

- Use a single qdisc interface to avoid code duplication and reduce the maintenance burden (sch_generic.c size reduces by ~9%).
- Has per device configurable parameter to turn on/off batching.
- qdisc_restart gets slightly modified while looking simple without any checks for batching vs regular code (in fact only two lines have changed: 1. instead of dev_dequeue_skb, a new batch-aware function is called; and 2. an extra call to hard_start_xmit_batch).
- Batching algo/processing is different (e.g. if qdisc_restart() finds one skb in the batch list, it will try to batch more (up to a limit) instead of sending that out and batching the rest in the next call).
- No change in __qdisc_run other than a new argument (from DM's idea).
- Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.
- Jamal's code has a separate hw prep handler called from the stack, and results are accessed in driver during xmit later.
- Jamal's code has dev->xmit_win which is cached by the driver. Mine has dev->xmit_slots but this is used only by the driver while the core has a different mechanism to find how many skbs to batch.
- Completely different structure/design & coding styles.
(This patch will work with drivers updated by Jamal, Matt & Michael Chan with minor modifications - rename xmit_win to xmit_slots & rename batch handler) Patches are described as: Mail 0/10 : This mail. Mail 1/10 : HOWTO documentation. Mail 2/10 : Networking include file changes. Mail 3/10 : dev.c changes. Mail 4/10 : net-sysfs.c changes. Mail 5/10 : sch_generic.c changes. Mail 6/10 : IPoIB include file changes. Mail 7/10 : IPoIB verbs changes Mail 8/10 : IPoIB multicast, CM changes Mail 9/10 : IPoIB xmit API addition Mail 10/10 : IPoIB xmit internals changes (ipoib_ib.c) I am also sending separately an attachment with results (across 10 run cycle), test scripts and a script to analyze results. Thanks to Sridhar & Shirley Ma for code reviews; Evgeniy, Jamal & Sridhar for suggesting to put driver skb list on netdev instead of on skb to avoid requeue; and David Miller for explanation on using batching only when the queue is woken up. Please review and provide feedback/ideas; and consider for inclusion. Thanks, - KK From krkumar2 at in.ibm.com Thu Jul 19 23:33:01 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:03:01 +0530 Subject: [ofa-general] [PATCH 06/10] IPoIB header file changes. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063301.26341.70540.sendpatchset@localhost.localdomain> IPoIB header file changes. 
Signed-off-by: Krishna Kumar --- ipoib.h | 9 ++++++--- 1 files changed, 6 insertions(+), 3 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h --- org/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 07:49:28.000000000 +0530 +++ new/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 08:30:22.000000000 +0530 @@ -269,8 +269,8 @@ struct ipoib_dev_priv { struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; - struct ib_sge tx_sge; - struct ib_send_wr tx_wr; + struct ib_sge *tx_sge; + struct ib_send_wr *tx_wr; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb, + struct ipoib_dev_priv *priv, int snum, int tx_index, + struct ipoib_ah *address, u32 qpn); void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ipoib_ah *address, u32 qpn); + struct ipoib_ah *address, u32 qpn, int num_skbs); void ipoib_reap_ah(struct work_struct *work); void ipoib_flush_paths(struct net_device *dev); From krkumar2 at in.ibm.com Thu Jul 19 23:32:16 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:02:16 +0530 Subject: [ofa-general] [PATCH 02/10] Networking include file changes. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063216.26341.80316.sendpatchset@localhost.localdomain> Networking include file changes for batching. 
Signed-off-by: Krishna Kumar --- linux/netdevice.h | 10 ++++++++++ net/pkt_sched.h | 6 +++--- 2 files changed, 13 insertions(+), 3 deletions(-) diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h --- org/include/linux/netdevice.h 2007-07-20 07:49:28.000000000 +0530 +++ new/include/linux/netdevice.h 2007-07-20 08:30:55.000000000 +0530 @@ -264,6 +264,8 @@ enum netdev_state_t __LINK_STATE_QDISC_RUNNING, }; +/* Minimum length of device hardware queue for batching to work */ +#define MIN_QUEUE_LEN_BATCH 16 /* * This structure holds at boot time configured netdevice settings. They @@ -340,6 +342,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_GSO 2048 /* Enable software GSO. */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_BATCH_SKBS 8192 /* Driver supports batch skbs API */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ /* Segmentation offload features */ @@ -452,6 +455,8 @@ struct net_device struct Qdisc *qdisc_sleeping; struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ + unsigned long xmit_slots; /* Device free slots */ + struct sk_buff_head *skb_blist; /* List of batch skbs */ /* Partially transmitted GSO packet. */ struct sk_buff *gso_skb; @@ -472,6 +477,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + int (*hard_start_xmit_batch) (struct net_device + *dev); + /* These may be needed for future network-power-down code. 
*/ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -832,6 +840,8 @@ extern int dev_set_mac_address(struct n struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_add_skb_to_blist(struct sk_buff *skb, + struct net_device *dev); extern void dev_init(void); diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.000000000 +0530 @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); -extern void __qdisc_run(struct net_device *dev); +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); -static inline void qdisc_run(struct net_device *dev) +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { if (!netif_queue_stopped(dev) && !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) - __qdisc_run(dev); + __qdisc_run(dev, blist); } extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp, From krkumar2 at in.ibm.com Thu Jul 19 23:32:49 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:02:49 +0530 Subject: [ofa-general] [PATCH 05/10] sch_generic.c changes. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063249.26341.125.sendpatchset@localhost.localdomain> net/sched/sch_generic.c changes to support batching. Adds a batch aware function (get_skb) to get skbs to send. 
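The choice that get_skb() makes between the regular and the batch xmit path can be modelled in userland roughly as below. This is a simplified sketch and not the kernel code: struct model_dev, enum xmit_path and model_get_skb() are hypothetical names, queues are reduced to plain counters, and every skb is assumed to add exactly one entry to the batch list (the real dev_add_skb_to_blist() may add several after GSO segmentation).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: queues are modelled as simple counters. */
struct model_dev {
	int batching;		/* nonzero when a batch list was passed in */
	int blist_len;		/* skbs already on the batch list */
	int qdisc_len;		/* skbs waiting in the qdisc queue */
	int tx_queue_len;	/* upper limit on batch list length */
};

enum xmit_path {
	XMIT_NONE,	/* nothing to send */
	XMIT_SINGLE,	/* one skb dequeued, use the regular xmit API */
	XMIT_BATCH	/* skbs moved to the batch list, use the batch API */
};

static enum xmit_path model_get_skb(struct model_dev *d)
{
	int max;

	assert(d != NULL);

	/*
	 * No batching, or the batch list is empty and at most one skb is
	 * queued: dequeue a single skb for the regular API.
	 */
	if (!d->batching || (d->blist_len == 0 && d->qdisc_len <= 1)) {
		if (d->qdisc_len == 0)
			return XMIT_NONE;
		d->qdisc_len--;
		return XMIT_SINGLE;
	}

	/* Otherwise keep moving skbs to the batch list, up to the limit. */
	max = d->tx_queue_len - d->blist_len;
	while (max > 0 && d->qdisc_len > 0) {
		d->qdisc_len--;
		d->blist_len++;
		max--;	/* assumption: one list entry per skb (no GSO) */
	}
	return XMIT_BATCH;
}
```

With five skbs queued and an empty batch list, the model moves all five to the batch list and selects the batch path; with a single skb queued it falls back to the regular path, which matches the behaviour described for non-batching drivers.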
Signed-off-by: Krishna Kumar --- sch_generic.c | 94 +++++++++++++++++++++++++++++++++++++++++++--------------- 1 files changed, 71 insertions(+), 23 deletions(-) diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c --- org/net/sched/sch_generic.c 2007-07-20 07:49:28.000000000 +0530 +++ new/net/sched/sch_generic.c 2007-07-20 08:30:22.000000000 +0530 @@ -9,6 +9,11 @@ * Authors: Alexey Kuznetsov, * Jamal Hadi Salim, 990601 * - Ingress support + * + * New functionality: + * Krishna Kumar, , July 2007 + * - Support for sending multiple skbs to devices that support + * new api - dev->hard_start_xmit_batch() */ #include @@ -59,10 +64,12 @@ static inline int qdisc_qlen(struct Qdis static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { - if (unlikely(skb->next)) - dev->gso_skb = skb; - else - q->ops->requeue(skb, q); + if (likely(skb)) { + if (unlikely(skb->next)) + dev->gso_skb = skb; + else + q->ops->requeue(skb, q); + } netif_schedule(dev); return 0; @@ -91,18 +98,23 @@ static inline int handle_dev_cpu_collisi /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. We - * detect it by checking xmit owner and drop the packet when - * deadloop is detected. Return OK to try the next skb. + * detect it by checking xmit owner and drop skb (or all + * skbs in batching case) when deadloop is detected. Return + * OK to try the next skb. */ - kfree_skb(skb); + if (likely(skb)) + kfree_skb(skb); + else if (!skb_queue_empty(dev->skb_blist)) + skb_queue_purge(dev->skb_blist); + if (net_ratelimit()) printk(KERN_WARNING "Dead loop on netdevice %s, " "fix it urgently!\n", dev->name); ret = qdisc_qlen(q); } else { /* - * Another cpu is holding lock, requeue & delay xmits for - * some time. + * Another cpu is holding lock. Requeue skb and delay xmits + * for some time. 
*/ __get_cpu_var(netdev_rx_stat).cpu_collision++; ret = dev_requeue_skb(skb, dev, q); @@ -112,6 +124,39 @@ static inline int handle_dev_cpu_collisi } /* + * Algorithm to get skb(s) is: + * - Non batching drivers, or if the batch list is empty and there is 1 + * skb in the queue - dequeue skb and put it in *skbp to tell the + * caller to use the regular API. + * - Batching drivers where the batch list already contains atleast one + * skb or if there are multiple skbs in the queue: keep dequeue'ing + * skb's upto a limit and set *skbp to NULL to tell the caller to use + * the new API. + * + * Returns: + * 1 - atleast one skb is to be sent out, *skbp contains skb or NULL + * (in case >1 skbs present in blist for batching) + * 0 - no skbs to be sent. + */ +static inline int get_skb(struct net_device *dev, struct Qdisc *q, + struct sk_buff_head *blist, + struct sk_buff **skbp) +{ + if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) { + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL); + } else { + int max = dev->tx_queue_len - skb_queue_len(blist); + struct sk_buff *skb; + + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL) + max -= dev_add_skb_to_blist(skb, dev); + + *skbp = NULL; + return 1; /* we have atleast one skb in blist */ + } +} + +/* * NOTE: Called under dev->queue_lock with locally disabled BH. * * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this @@ -130,27 +175,28 @@ static inline int handle_dev_cpu_collisi * >0 - queue is not empty. 
* */ -static inline int qdisc_restart(struct net_device *dev) +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *blist) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; - unsigned lockless; + unsigned getlock; /* whether we need to get lock or not */ int ret; /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) + if (unlikely(!get_skb(dev, q, blist, &skb))) return 0; /* * When the driver has LLTX set, it does its own locking in - * start_xmit. These checks are worth it because even uncongested + * start_xmit. These checks are worth it because even uncontested * locks can be quite expensive. The driver can do a trylock, as * is being done here; in case of lock contention it should return * NETDEV_TX_LOCKED and the packet will be requeued. */ - lockless = (dev->features & NETIF_F_LLTX); + getlock = !(dev->features & NETIF_F_LLTX); - if (!lockless && !netif_tx_trylock(dev)) { + if (getlock && !netif_tx_trylock(dev)) { /* Another CPU grabbed the driver tx lock */ return handle_dev_cpu_collision(skb, dev, q); } @@ -158,9 +204,12 @@ static inline int qdisc_restart(struct n /* And release queue */ spin_unlock(&dev->queue_lock); - ret = dev_hard_start_xmit(skb, dev); + if (likely(skb)) + ret = dev_hard_start_xmit(skb, dev); + else + ret = dev->hard_start_xmit_batch(dev); - if (!lockless) + if (getlock) netif_tx_unlock(dev); spin_lock(&dev->queue_lock); @@ -168,7 +217,7 @@ static inline int qdisc_restart(struct n switch (ret) { case NETDEV_TX_OK: - /* Driver sent out skb successfully */ + /* Driver sent out skb (or entire skb_blist) successfully */ ret = qdisc_qlen(q); break; @@ -179,10 +228,9 @@ static inline int qdisc_restart(struct n default: /* Driver returned NETDEV_TX_BUSY - requeue skb */ - if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit())) - printk(KERN_WARNING "BUG %s code %d qlen %d\n", + if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit()) + printk(KERN_WARNING " %s: BUG. 
code %d qlen %d\n", dev->name, ret, q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); break; } @@ -190,10 +238,10 @@ static inline int qdisc_restart(struct n return ret; } -void __qdisc_run(struct net_device *dev) +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, blist)) break; } while (!netif_queue_stopped(dev)); From krkumar2 at in.ibm.com Thu Jul 19 23:32:27 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:02:27 +0530 Subject: [ofa-general] [PATCH 03/10] dev.c changes. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063227.26341.91868.sendpatchset@localhost.localdomain> Changes in dev.c to support batching : add dev_add_skb_to_blist, register_netdev recognizes batch aware drivers, and net_tx_action is the sole user of batching. Signed-off-by: Krishna Kumar --- dev.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 files changed, 74 insertions(+), 3 deletions(-) diff -ruNp org/net/core/dev.c new/net/core/dev.c --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530 +++ new/net/core/dev.c 2007-07-20 08:31:35.000000000 +0530 @@ -1414,6 +1414,45 @@ static int dev_gso_segment(struct sk_buf return 0; } +/* + * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are + * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs + * are dequeued from this list when we call the driver, so the list is safe + * from simultaneous deletes too. + * + * Returns count of successful skb(s) added to skb_blist. 
+ */ +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev) +{ + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree(skb); + return 0; + } + + if (skb->next) { + int count = 0; + + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + __skb_queue_tail(dev->skb_blist, nskb); + count++; + } while (skb->next); + + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); + return count; + } + } + __skb_queue_tail(dev->skb_blist, skb); + return 1; +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { if (likely(!skb->next)) { @@ -1566,7 +1605,7 @@ gso: /* reset queue_mapping to zero */ skb->queue_mapping = 0; rc = q->enqueue(skb, q); - qdisc_run(dev); + qdisc_run(dev, NULL); spin_unlock(&dev->queue_lock); rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; @@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq clear_bit(__LINK_STATE_SCHED, &dev->state); if (spin_trylock(&dev->queue_lock)) { - qdisc_run(dev); + /* + * Try to send out all skbs if batching is + * enabled. + */ + qdisc_run(dev, dev->skb_blist); spin_unlock(&dev->queue_lock); } else { netif_schedule(dev); @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device } } + if (dev->features & NETIF_F_BATCH_SKBS) { + if (!dev->hard_start_xmit_batch || + dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) { + /* + * Batch TX requires API support in driver plus have + * a minimum sized queue. + */ + printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS " + "since no API support or queue len " + "is smaller than %d.\n", + dev->name, MIN_QUEUE_LEN_BATCH); + dev->features &= ~NETIF_F_BATCH_SKBS; + } else { + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, + GFP_KERNEL); + if (dev->skb_blist) { + skb_queue_head_init(dev->skb_blist); + dev->tx_queue_len >>= 1; + } + } + } + /* * nil rebuild_header routine, * that should be never called and used as just bug trap. 
@@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev

 	synchronize_net();

+	/* Deallocate batching structure */
+	if (dev->skb_blist) {
+		skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);

-
 	/* Notify protocols, that we are about to destroy
 	   this device. They should clean all the things.
 	*/

From krkumar2 at in.ibm.com Thu Jul 19 23:33:13 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:13 +0530
Subject: [ofa-general] [PATCH 07/10] IPoIB verb changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063313.26341.75017.sendpatchset@localhost.localdomain>

IPoIB verb changes to support batching.

Signed-off-by: Krishna Kumar
---
 ipoib_verbs.c | 23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 08:30:22.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;

 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;

-	priv->tx_sge.lkey = priv->mr->lkey;
-
-	priv->tx_wr.opcode = IB_WR_SEND;
-	priv->tx_wr.sg_list = &priv->tx_sge;
-	priv->tx_wr.num_sge = 1;
-	priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey = priv->mr->lkey;
+		priv->tx_wr[i].opcode = IB_WR_SEND;
+		priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge = 1;
+		priv->tx_wr[i].send_flags = 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next = next_wr;
+		next_wr = &priv->tx_wr[i];
+	}

 	return 0;


From krkumar2 at in.ibm.com Thu Jul 19 23:33:26 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:26 +0530
Subject: [ofa-general] [PATCH 08/10] IPoIB multicast/CM changes.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063326.26341.24459.sendpatchset@localhost.localdomain>

IPoIB Multicast and CM changes for batching support.

Signed-off-by: Krishna Kumar
---
 ipoib_cm.c        | 13 +++++++++----
 ipoib_multicast.c |  4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 08:30:22.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;

-	priv->tx_sge.addr = addr;
-	priv->tx_sge.length = len;
+	priv->tx_sge[0].addr = addr;
+	priv->tx_sge[0].length = len;
+
+	priv->tx_wr[0].wr_id = wr_id;

-	priv->tx_wr.wr_id = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];

-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }

 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 07:49:28.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 08:30:22.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}

 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}

-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}

 unlock:

From krkumar2 at in.ibm.com Thu Jul 19 23:33:36 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:36 +0530
Subject: [ofa-general] [PATCH 09/10] IPoIB batching xmit handler support.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
Message-ID: <20070720063336.26341.2955.sendpatchset@localhost.localdomain>

Add an IPoIB batching xmit handler.

Signed-off-by: Krishna Kumar --- ipoib_main.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 210 insertions(+), 5 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c --- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-20 07:49:28.000000000 +0530 +++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-20 08:30:22.000000000 +0530 @@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf goto err_drop; } } else - ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, path->ah, + IPOIB_QPN(skb->dst->neighbour->ha), 1); } else { neigh->ah = NULL; @@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); + ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1); } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ @@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu goto out; } - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, neigh->ah, + IPOIB_QPN(skb->dst->neighbour->ha), 1); goto out; } @@ -753,6 +755,177 @@ out: return NETDEV_TX_OK; } +#define XMIT_QUEUED_SKBS() \ + do { \ + if (num_skbs) { \ + ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \ + num_skbs); \ + num_skbs = 0; \ + } \ + } while (0) + +/* + * TODO: Merge with ipoib_start_xmit to use the same code and have a + * transparent wrapper caller to xmit's, etc. 
+ */ +static int ipoib_start_xmit_frames(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + struct sk_buff_head *blist; + int max_skbs, num_skbs = 0, tx_ring_index = -1; + u32 qpn, old_qpn = 0; + struct ipoib_neigh *neigh, *old_neigh = NULL; + unsigned long flags; + + if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags))) + return NETDEV_TX_LOCKED; + + blist = dev->skb_blist; + + /* + * Send atmost xmit_slots skbs. This also prevents the device getting + * full as ipoib_send modifies the xmit_slots and we use the same + * value to figure how many skbs to send. + */ + max_skbs = dev->xmit_slots; + + while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) { + /* + * From here on, ipoib_send() cannot stop the queue as it + * uses the same initialization as 'max_skbs'. So we can + * optimize to not check for queue stopped for every skb. + */ + if (likely(skb->dst && skb->dst->neighbour)) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + XMIT_QUEUED_SKBS(); + ipoib_path_lookup(skb, dev); + continue; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (ipoib_cm_get(neigh)) { + if (ipoib_cm_up(neigh)) { + XMIT_QUEUED_SKBS(); + ipoib_cm_send(dev, skb, + ipoib_cm_get(neigh)); + continue; + } + } else if (neigh->ah) { + if (unlikely(memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid)))) { + spin_lock(&priv->lock); + /* + * It's safe to call ipoib_put_ah() + * inside priv->lock here, because we + * know that path->ah will always hold + * one more reference, so ipoib_put_ah() + * will never do more than decrement + * the ref count. 
+ */ + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock(&priv->lock); + XMIT_QUEUED_SKBS(); + ipoib_path_lookup(skb, dev); + continue; + } + + qpn = IPOIB_QPN(skb->dst->neighbour->ha); + if (neigh != old_neigh || qpn != old_qpn) { + /* + * Sending to a different destination + * from earlier skb's - send all + * existing skbs (if any). + */ + if (tx_ring_index == -1) { + /* + * First time, find where to + * store skb. + */ + tx_ring_index = priv->tx_head & + (ipoib_sendq_size - 1); + } else { + /* Some skbs to send */ + XMIT_QUEUED_SKBS(); + } + old_neigh = neigh; + old_qpn = IPOIB_QPN(skb->dst->neighbour->ha); + } + + if (ipoib_process_skb(dev, skb, priv, num_skbs, + tx_ring_index, neigh->ah, + qpn)) + continue; + + num_skbs++; + + /* Queue'd one skb, get index for next skb */ + if (max_skbs) + tx_ring_index = (tx_ring_index + 1) & + (ipoib_sendq_size - 1); + continue; + } + + if (skb_queue_len(&neigh->queue) < + IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + ++max_skbs; + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + XMIT_QUEUED_SKBS(); + ipoib_mcast_send(dev, phdr->hwaddr + 4, skb); + } else { + /* unicast GID -- should be ARP or RARP reply */ + + if ((be16_to_cpup((__be16 *) skb->data) != + ETH_P_ARP) && + (be16_to_cpup((__be16 *) skb->data) != + ETH_P_RARP)) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? 
"neigh" : "dst", + be16_to_cpup((__be16 *) + skb->data), + IPOIB_QPN(phdr->hwaddr), + IPOIB_GID_RAW_ARG(phdr->hwaddr + + 4)); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + ++max_skbs; + continue; + } + XMIT_QUEUED_SKBS(); + unicast_arp_send(skb, dev, phdr); + } + } + } + + /* Send out last packets (if any) */ + XMIT_QUEUED_SKBS(); + + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY; +} + static struct net_device_stats *ipoib_get_stats(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -898,11 +1071,35 @@ int ipoib_dev_init(struct net_device *de /* priv->tx_head & tx_tail are already 0 */ - if (ipoib_ib_dev_init(dev, ca, port)) + /* Allocate tx_sge */ + priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge, + GFP_KERNEL); + if (!priv->tx_sge) { + printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n", + ca->name, ipoib_sendq_size); goto out_tx_ring_cleanup; + } + + /* Allocate tx_wr */ + priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr, + GFP_KERNEL); + if (!priv->tx_wr) { + printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n", + ca->name, ipoib_sendq_size); + goto out_tx_sge_cleanup; + } + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_wr_cleanup; return 0; +out_tx_wr_cleanup: + kfree(priv->tx_wr); + +out_tx_sge_cleanup: + kfree(priv->tx_sge); + out_tx_ring_cleanup: kfree(priv->tx_ring); @@ -930,9 +1127,13 @@ void ipoib_dev_cleanup(struct net_device kfree(priv->rx_ring); kfree(priv->tx_ring); + kfree(priv->tx_sge); + kfree(priv->tx_wr); priv->rx_ring = NULL; priv->tx_ring = NULL; + priv->tx_sge = NULL; + priv->tx_wr = NULL; } static void ipoib_setup(struct net_device *dev) @@ -943,6 +1144,7 @@ static void ipoib_setup(struct net_devic dev->stop = ipoib_stop; dev->change_mtu = ipoib_change_mtu; dev->hard_start_xmit = ipoib_start_xmit; + dev->hard_start_xmit_batch = ipoib_start_xmit_frames; dev->get_stats = 
ipoib_get_stats; dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; @@ -963,7 +1165,10 @@ static void ipoib_setup(struct net_devic dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; - dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX | + NETIF_F_BATCH_SKBS; + + dev->xmit_slots = ipoib_sendq_size; /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; From krkumar2 at in.ibm.com Thu Jul 19 23:32:38 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 20 Jul 2007 12:02:38 +0530 Subject: [ofa-general] [PATCH 04/10] net-sysfs.c changes. In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063238.26341.41474.sendpatchset@localhost.localdomain> Support to turn on/off batching from /sys. Signed-off-by: Krishna Kumar --- net-sysfs.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 70 insertions(+) diff -ruNp org/net/core/net-sysfs.c new/net/core/net-sysfs.c --- org/net/core/net-sysfs.c 2007-07-20 07:49:28.000000000 +0530 +++ new/net/core/net-sysfs.c 2007-07-20 08:34:45.000000000 +0530 @@ -230,6 +230,74 @@ static ssize_t store_weight(struct devic return netdev_store(dev, attr, buf, len, change_weight); } +static ssize_t show_tx_batch_skbs(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + return sprintf(buf, fmt_dec, netdev->skb_blist ? 
1 : 0); +} + +static int change_tx_batch_skbs(struct net_device *net, + unsigned long new_tx_batch_skbs) +{ + int ret = 0; + struct sk_buff_head *blist; + + if (!(net->features & NETIF_F_BATCH_SKBS) || + (new_tx_batch_skbs && net->tx_queue_len < MIN_QUEUE_LEN_BATCH)) { + /* + * Driver doesn't support batching SKBS, or the queue len + * is insufficient. TODO: Add similar check to disable + * batching in change_tx_queue_len() if queue_len becomes + * smaller than MIN_QUEUE_LEN_BATCH. + */ + ret = -ENOTSUPP; + goto out; + } + + /* Handle invalid argument */ + if (new_tx_batch_skbs < 0) { + ret = -EINVAL; + goto out; + } + + /* Check if new value is same as the current */ + new_tx_batch_skbs = !!new_tx_batch_skbs; + if (!!net->skb_blist == new_tx_batch_skbs) + goto out; + + if (new_tx_batch_skbs && + (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) { + ret = -ENOMEM; + goto out; + } + + spin_lock(&net->queue_lock); + if (new_tx_batch_skbs) { + skb_queue_head_init(blist); + net->skb_blist = blist; + net->tx_queue_len >>= 1; + } else { + if (!skb_queue_empty(net->skb_blist)) + skb_queue_purge(net->skb_blist); + kfree(net->skb_blist); + net->skb_blist = NULL; + net->tx_queue_len <<= 1; + } + spin_unlock(&net->queue_lock); + +out: + return ret; +} + +static ssize_t store_tx_batch_skbs(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + return netdev_store(dev, attr, buf, len, change_tx_batch_skbs); +} + static struct device_attribute net_class_attributes[] = { __ATTR(addr_len, S_IRUGO, show_addr_len, NULL), __ATTR(iflink, S_IRUGO, show_iflink, NULL), @@ -246,6 +314,8 @@ static struct device_attribute net_class __ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags), __ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len, store_tx_queue_len), + __ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skbs, + store_tx_batch_skbs), __ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight), {} }; From shemminger at 
linux-foundation.org Fri Jul 20 00:18:48 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Fri, 20 Jul 2007 08:18:48 +0100 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720081848.7cc652fb@oldman> On Fri, 20 Jul 2007 12:01:49 +0530 Krishna Kumar wrote: > Hi Dave, Roland, everyone, > > In May, I had proposed creating an API for sending 'n' skbs to a driver to > reduce lock overhead, DMA operations, and specific to drivers that have > completion notification like IPoIB - reduce completion handling ("[RFC] New > driver API to speed up small packets xmits" @ > http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent > initial test results for E1000 which showed minor improvements (but also > got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2. > > After fine-tuning qdisc and other changes, I modified IPoIB to use this API, > and now get good gains. Summary for TCP & No Delay: 1 process improves for > all cases from 1.4% to 49.5%; 4 process has almost identical improvements > from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP > was tested with 1 process netperf with small increase in BW but big > improvement in Service Demand. Netperf latency tests show small drop in > transaction rate (results in separate attachment). > You may see worse performance with batching in the real world when running over WAN's. Like TSO, batching will generate back to back packet trains that are subject to multi-packet synchronized loss. The problem is that intermediate router queues are often close to full, and when a long string of packets arrives back to back only the first ones will get in, the rest get dropped. 
Normal sends have at least minimal pacing so they are less likely to get synchronized drop.

From krkumar2 at in.ibm.com Fri Jul 20 00:20:09 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 12:50:09 +0530
Subject: [ofa-general] Results & Scripts for : "[PATCH 00/10] Implement batching skb API"
Message-ID:

Attached file contains scripts for running tests and parsing results :

(See attached file: scripts.tar)

The result of a 10 run (average) TCP iperf (and 1 netperf for UDP) is given below.

Thanks,

- KK

-----------------------------------------------------------------------------
Test configuration : Single cross-over cable for MTHCA cards (MT23108) on two PPC64 systems, both systems are 8-CPU P5 1.5 GHz processors with 8GB memory.

A. TCP results for a 10 run average are as follows (using iperf, could not run netperf in parallel as it is not synchronized):

First number : Orig BW in KB/s.
Second number : New BW in KB/s.
Third number : Percentage change.

IPoIB was configured with 512 sendq size while default configuration (128) gave positives for most test cases but more negatives for 512 and 4K buffer sizes.

Buffer Size 32
TCP Threads:1             :    3126    3169    1.4
TCP Threads:4             :    9739   10889   11.8
TCP Threads:16            :   35383   47218   33.4
TCP Threads:64            :   85147   84196   -1.1
Average : 9.05%

TCP No Delay: Threads:1   :    1990    2976   49.5
TCP No Delay: Threads:4   :    8137    8770    7.7
TCP No Delay: Threads:16  :   31714   37308   17.63
TCP No Delay: Threads:64  :   72830   81892   12.44
Average : 14.19%

Buffer Size 128
TCP Threads:1             :   12674   13339    5.2
TCP Threads:4             :   37889   40816    7.7
TCP Threads:16            :  141342  165935   17.3
TCP Threads:64            :  199813  196283   -1.7
Average : 6.29%

TCP No Delay: Threads:1   :    7732   11272   45.7
TCP No Delay: Threads:4   :   33348   35222    5.6
TCP No Delay: Threads:16  :  120507  143960   19.5
TCP No Delay: Threads:64  :  195459  193875   -0.8
Average : 7.64%

Buffer Size 512
TCP Threads:1             :   42256   55735   31.9
TCP Threads:4             :  161237  161777    0.3
TCP Threads:16            :  227911  231781    1.7
TCP Threads:64            :  229779  223152   -2.9
Average : 1.70%

TCP No Delay: Threads:1   :   30065   42500   41.3
TCP No Delay: Threads:4   :   79076  125848   59.1
TCP No Delay: Threads:16  :  225725  224155   -0.7
TCP No Delay: Threads:64  :  231220  223664   -3.26
Average : 8.84%

Buffer Size 4096
TCP Threads:1             :  119364  135445   13.5
TCP Threads:4             :  261301  256754   -1.7
TCP Threads:16            :  246889  247065    0.07
TCP Threads:64            :  237613  234185   -1.4
Average : 0.95%

TCP No Delay: Threads:1   :  102187  104087    1.9
TCP No Delay: Threads:4   :  204139  243169   19.1
TCP No Delay: Threads:16  :  245529  242519   -1.2
TCP No Delay: Threads:64  :  236826  233382   -1.4
Average : 4.37%

-----------------------------------------------------------------------------
B. Using netperf to run 1 process UDP (1 run, measured with 128 sendq size, will be re-doing with 512 sendq and for 10 runs average) :

----------------------------------------------------------
       Org                 New                Perc
   BW      Service     BW      Service     BW     Service
----------------------------------------------------------
    6.40   1277.64      6.50   1272.41    1.56      -.40
   24.80    663.01     25.80    318.13    4.03    -52.01
  101.80     81.02    101.90     80.63     .09      -.48
  395.70     20.77    395.90     20.74     .05      -.14
 1172.90      7.00   1156.80      7.10   -1.37      1.42
---------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scripts.tar
Type: application/octet-stream
Size: 10240 bytes
Desc: not available
URL:

From krkumar2 at in.ibm.com Fri Jul 20 00:30:25 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 13:00:25 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
Message-ID:

Stephen Hemminger wrote on 07/20/2007 12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WAN's. Like TSO, batching will generate back to back packet
> trains that are subject to multi-packet synchronized loss. The problem is that
> intermediate router queues are often close to full, and when a long string
> of packets arrives back to back only the first ones will get in, the rest
> get dropped. Normal sends have at least minimal pacing so they are less
> likely to get synchronized drop.

Hi Stephen,

OK. The difference that I could see is that in existing code, the "minimal pacing" also could lead to (possibly slightly lesser) loss since sends are quick iterations at the IP layer, while in batching sends are iterative at the driver layer.

Is it an issue? Any suggestions?

Thanks,

- KK

From shemminger at linux-foundation.org Fri Jul 20 00:57:37 2007
From: shemminger at linux-foundation.org (Stephen Hemminger)
Date: Fri, 20 Jul 2007 08:57:37 +0100
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To:
References: <20070720081848.7cc652fb@oldman>
Message-ID: <20070720085737.5319d3d4@oldman>

On Fri, 20 Jul 2007 13:00:25 +0530
Krishna Kumar2 wrote:

> Stephen Hemminger wrote on 07/20/2007 12:48:48 PM:
>
> > You may see worse performance with batching in the real world when
> > running over WAN's. Like TSO, batching will generate back to back packet
> > trains that are subject to multi-packet synchronized loss. The problem is that
> > intermediate router queues are often close to full, and when a long string
> > of packets arrives back to back only the first ones will get in, the rest
> > get dropped. Normal sends have at least minimal pacing so they are less
> > likely to get synchronized drop.
>
> Hi Stephen,
>
> OK. The difference that I could see is that in existing code, the "minimal
> pacing" also could lead to (possibly slightly lesser) loss since sends are
> quick iterations at the IP layer, while in batching sends are iterative at
> the driver layer.
>
> Is it an issue? Any suggestions?

Not an immediate issue, but it is the kind of thing that could cause performance regression reports if it was used on every interface by default.

From krkumar2 at in.ibm.com Fri Jul 20 00:47:40 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Fri, 20 Jul 2007 13:17:40 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
Message-ID:

Stephen Hemminger wrote on 07/20/2007 12:48:48 PM:

> You may see worse performance with batching in the real world when
> running over WAN's. Like TSO, batching will generate back to back packet
> trains that are subject to multi-packet synchronized loss.
The problem is that > intermediate router queues are often close to full, and when a long string > of packets arrives back to back only the first ones will get in, the rest > get dropped. Normal sends have at least minimal pacing so they are less > likely do get synchronized drop. Also forgot to mention in the previous mail, if performance is seen to be dipping, batching can be disabled on WAN's by: echo 0 > /sys/class/net//tx_batch_skbs and use batching on local/site networks in that case. From vlad at lists.openfabrics.org Fri Jul 20 01:38:42 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 20 Jul 2007 01:38:42 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070720-0100 daily build status Message-ID: <20070720083842.479C7E608CA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on 
ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: From vlad at lists.openfabrics.org Fri Jul 20 02:43:16 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 20 Jul 2007 02:43:16 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070720-0200 daily build status Message-ID: <20070720094316.E11CAE608C8@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on 
x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From kaber at trash.net Fri Jul 20 02:59:35 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 11:59:35 +0200 Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes. 
In-Reply-To: <20070720063216.26341.80316.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063216.26341.80316.sendpatchset@localhost.localdomain>
Message-ID: <46A08787.8040501@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> --- org/include/linux/netdevice.h	2007-07-20 07:49:28.000000000 +0530
> +++ new/include/linux/netdevice.h	2007-07-20 08:30:55.000000000 +0530
> @@ -264,6 +264,8 @@ enum netdev_state_t
>  	__LINK_STATE_QDISC_RUNNING,
>  };
>
> +/* Minimum length of device hardware queue for batching to work */
> +#define MIN_QUEUE_LEN_BATCH 16

Is there any downside in using batching with smaller queue sizes?

From kaber at trash.net Fri Jul 20 03:04:30 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 12:04:30 +0200
Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes.
In-Reply-To: <20070720063227.26341.91868.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063227.26341.91868.sendpatchset@localhost.localdomain>
Message-ID: <46A088AE.1090702@trash.net>

Krishna Kumar wrote:
> @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device
>  		}
>  	}
>
> +	if (dev->features & NETIF_F_BATCH_SKBS) {
> +		if (!dev->hard_start_xmit_batch ||
> +		    dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) {
> +			/*
> +			 * Batch TX requires API support in driver plus have
> +			 * a minimum sized queue.
> +			 */
> +			printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS "
> +					"since no API support or queue len "
> +					"is smaller than %d.\n",
> +					dev->name, MIN_QUEUE_LEN_BATCH);
> +			dev->features &= ~NETIF_F_BATCH_SKBS;

The queue length can be changed through multiple interfaces, if that really is important you need to catch these cases too.

> +		} else {
> +			dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
> +						 GFP_KERNEL);

Why not simply put the head in struct net_device?
It seems to me that this could also be used for gso_skb. > + if (dev->skb_blist) { > + skb_queue_head_init(dev->skb_blist); > + dev->tx_queue_len >>= 1; > + } > + } > + } > + > /* > * nil rebuild_header routine, > * that should be never called and used as just bug trap. > @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev > > synchronize_net(); > > + /* Deallocate batching structure */ > + if (dev->skb_blist) { > + skb_queue_purge(dev->skb_blist); > + kfree(dev->skb_blist); > + dev->skb_blist = NULL; > + } > + Queue purging should be done in dev_deactivate. From kaber at trash.net Fri Jul 20 03:07:20 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 12:07:20 +0200 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. 
In-Reply-To: <20070720063238.26341.41474.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063238.26341.41474.sendpatchset@localhost.localdomain>
Message-ID: <46A08958.3090509@trash.net>

Krishna Kumar wrote:
> Support to turn on/off batching from /sys.

rtnetlink support seems more important than sysfs to me.

From kaber at trash.net Fri Jul 20 03:11:01 2007
From: kaber at trash.net (Patrick McHardy)
Date: Fri, 20 Jul 2007 12:11:01 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To: <20070720063249.26341.125.sendpatchset@localhost.localdomain>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain>
	<20070720063249.26341.125.sendpatchset@localhost.localdomain>
Message-ID: <46A08A35.5090104@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> --- org/net/sched/sch_generic.c	2007-07-20 07:49:28.000000000 +0530
> +++ new/net/sched/sch_generic.c	2007-07-20 08:30:22.000000000 +0530
> @@ -9,6 +9,11 @@
>   * Authors: Alexey Kuznetsov,
>   * Jamal Hadi Salim, 990601
>   * - Ingress support
> + *
> + * New functionality:
> + * Krishna Kumar, , July 2007
> + * - Support for sending multiple skbs to devices that support
> + * new api - dev->hard_start_xmit_batch()

No new changelogs in source code please, git keeps track of that.

> -static inline int qdisc_restart(struct net_device *dev)
> +static inline int qdisc_restart(struct net_device *dev,
> +				struct sk_buff_head *blist)
>  {
>  	struct Qdisc *q = dev->qdisc;
>  	struct sk_buff *skb;
> -	unsigned lockless;
> +	unsigned getlock; /* whether we need to get lock or not */

Unrelated rename, please get rid of this to reduce the noise.

From krkumar2 at in.ibm.com Thu Jul 19 23:33:48 2007
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Fri, 20 Jul 2007 12:03:48 +0530
Subject: [ofa-general] [PATCH 10/10] IPoIB batching in internal xmit/handler routines.
In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720063348.26341.73753.sendpatchset@localhost.localdomain> Add batching support to IPoIB post_send and TX completion handler. Signed-off-by: Krishna Kumar --- ipoib_ib.c | 233 ++++++++++++++++++++++++++++++++++++++++++++++++------------- 1 files changed, 187 insertions(+), 46 deletions(-) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.000000000 +0530 +++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 08:30:22.000000000 +0530 @@ -242,8 +242,9 @@ repost: static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); + int i = 0, num_completions; + int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1); unsigned int wr_id = wc->wr_id; - struct ipoib_tx_buf *tx_req; unsigned long flags; ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", @@ -255,23 +256,60 @@ static void ipoib_ib_handle_tx_wc(struct return; } - tx_req = &priv->tx_ring[wr_id]; + num_completions = wr_id - tx_ring_index + 1; + if (num_completions <= 0) + num_completions += ipoib_sendq_size; + + /* + * Handle skbs completion from tx_tail to wr_id. It is possible to + * handle WC's from earlier post_sends (possible multiple) in this + * iteration as we move from tx_tail to wr_id, since if the last + * WR (which is the one which had a completion request) failed to be + * sent for any of those earlier request(s), no completion + * notification is generated for successful WR's of those earlier + * request(s). 
+ */ + while (1) { + /* + * Could use while (i < num_completions), but it is costly + * since in most cases there is 1 completion, and we end up + * doing an extra "index = (index+1) & (ipoib_sendq_size-1)" + */ + struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index]; + + if (likely(tx_req->skb)) { + ib_dma_unmap_single(priv->ca, tx_req->mapping, + tx_req->skb->len, DMA_TO_DEVICE); - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + dev_kfree_skb_any(tx_req->skb); + } + /* + * else this skb failed synchronously when posted and was + * freed immediately. + */ + + if (++i == num_completions) + break; - dev_kfree_skb_any(tx_req->skb); + /* More WC's to handle */ + tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1); + } spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; + + priv->tx_tail += num_completions; if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); netif_wake_queue(dev); } + + /* Make more slots available for posts */ + dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && @@ -340,78 +378,181 @@ void ipoib_ib_completion(struct ib_cq *c netif_rx_schedule(dev_ptr); } -static inline int post_send(struct ipoib_dev_priv *priv, - unsigned int wr_id, - struct ib_ah *address, u32 qpn, - u64 addr, int len) +/* + * post_send : Post WR(s) to the device. + * + * num_skbs is the number of WR's, 'start_index' is the first slot in + * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a + * previous post_send returned error and we are trying to send the untried + * WR's, in which case start_index will point to the first untried WR. 
+ * + * We also break the WR link before posting so that the driver knows how + * many WR's to process, and this is set back after the post. + */ +static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn, + int start_index, int num_skbs, + struct ib_send_wr **bad_wr) { - struct ib_send_wr *bad_wr; + int ret; + struct ib_send_wr *last_wr, *next_wr; + + last_wr = &priv->tx_wr[start_index + num_skbs - 1]; + + /* Set Completion Notification for last WR */ + last_wr->send_flags = IB_SEND_SIGNALED; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; + /* Terminate the last WR */ + next_wr = last_wr->next; + last_wr->next = NULL; - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + /* Send all the WR's in one doorbell */ + ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr); - return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); + /* Restore send_flags & WR chain */ + last_wr->send_flags = 0; + last_wr->next = next_wr; + + return ret; } -void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ipoib_ah *address, u32 qpn) +/* + * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr + * to pass to the driver. + * + * Returns : + * - 0 on successful processing of the skb + * - 1 if the skb was freed. 
+ */ +int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb, + struct ipoib_dev_priv *priv, int wr_num, + int tx_ring_index, struct ipoib_ah *address, u32 qpn) { - struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_tx_buf *tx_req; u64 addr; + struct ipoib_tx_buf *tx_req; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { - ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + ipoib_warn(priv, "packet len %d (> %d) too long to " + "send, dropping\n", skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); ++priv->stats.tx_dropped; ++priv->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); - return; + return 1; } - ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + ipoib_dbg_data(priv, "sending packet, length=%d address=%p " + "qpn=0x%06x\n", skb->len, address, qpn); /* * We put the skb into the tx_ring _before_ we call post_send() * because it's entirely possible that the completion handler will - * run before we execute anything after the post_send(). That + * run before we execute anything after the post_send(). That * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
*/ - tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; - tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); - return; + return 1; } + + tx_req = &priv->tx_ring[tx_ring_index]; + tx_req->skb = skb; tx_req->mapping = addr; + priv->tx_sge[wr_num].addr = addr; + priv->tx_sge[wr_num].length = skb->len; + priv->tx_wr[wr_num].wr_id = tx_ring_index; + priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn; + priv->tx_wr[wr_num].wr.ud.ah = address->ah; - if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { - ipoib_warn(priv, "post_send failed\n"); - ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); - dev_kfree_skb_any(skb); - } else { - dev->trans_start = jiffies; + return 0; +} - address->last_send = priv->tx_head; - ++priv->tx_head; +/* + * If an skb is passed to this function, it is the single, unprocessed skb + * send case. Otherwise if skb is NULL, it means that all skbs are already + * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc. + */ +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn, int num_skbs) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int start_index = 0; + + if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head & + (ipoib_sendq_size - 1), address, qpn)) + return; + + /* Send out all the skb's in one post */ + while (num_skbs) { + struct ib_send_wr *bad_wr; + + if (unlikely((post_send(priv, qpn, start_index, num_skbs, + &bad_wr)))) { + int done; + + /* + * Better error handling can be done here, like free + * all untried skbs if err == -ENOMEM. 
However at this + * time, we re-try all the skbs, all of which will + * likely fail anyway (unless device finished sending + * some out in the meantime). This is not a regression + * since the earlier code is not doing this either. + */ + ipoib_warn(priv, "post_send failed\n"); - if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { - ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); - netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + /* Get #WR's that finished successfully */ + done = bad_wr - &priv->tx_wr[start_index]; + + /* Handle 1 error */ + priv->stats.tx_errors++; + ib_dma_unmap_single(priv->ca, + priv->tx_sge[start_index + done].addr, + priv->tx_sge[start_index + done].length, + DMA_TO_DEVICE); + + /* Handle 'n' successes */ + if (done) { + dev->trans_start = jiffies; + address->last_send = priv->tx_head; + } + + /* Free failed WR & reset for WC handler to recognize */ + dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb); + priv->tx_ring[bad_wr->wr_id].skb = NULL; + + /* Move head to first untried WR */ + priv->tx_head += (done + 1); + /* + 1 for WR that was tried & failed */ + + /* Get count of skbs that were not tried */ + num_skbs -= (done + 1); + + /* Get start index for next iteration */ + start_index += (done + 1); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + priv->tx_head += num_skbs; + num_skbs = 0; } } + + if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) { + /* + * Not accurate as some intermediate slots could have been + * freed on error, but no harm - only queue stopped earlier. 
+ */ + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + } + + /* Reduce the number of slots for sends */ + dev->xmit_slots = ipoib_sendq_size - (priv->tx_head - priv->tx_tail); } static void __ipoib_reap_ah(struct net_device *dev) From krkumar2 at in.ibm.com Fri Jul 20 03:28:49 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 15:58:49 +0530 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. In-Reply-To: <46A08958.3090509@trash.net> Message-ID: Patrick McHardy wrote on 07/20/2007 03:37:20 PM: > Krishna Kumar wrote: > > Support to turn on/off batching from /sys. > > > rtnetlink support seems more important than sysfs to me. Thanks, I will add that as a patch. The reason to add to sysfs is that it is easier to change for a user (and similar to tx_queue_len). - KK From krkumar2 at in.ibm.com Fri Jul 20 03:27:37 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 15:57:37 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: <46A088AE.1090702@trash.net> Message-ID: Hi Patrick, Thanks for your comments. Patrick McHardy wrote on 07/20/2007 03:34:30 PM: > The queue length can be changed through multiple interfaces, if that > really is important you need to catch these cases too. I have a TODO comment in net-sysfs.c which is to catch this case. > > + } else { > > + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, > > + GFP_KERNEL); > > > Why not simply put the head in struct net_device? It seems to me that > this could also be used for gso_skb. Without going into GSO, it is wasting some 32 bytes on i386 since most drivers don't export this API. > Queue purging should be done in dev_deactivate. I originally had it in dev_deactivate, but when I did a ifdown eth0, ifup eth0, the system panic'd. 
The first solution I thought was to initialize the skb_blist in dev_change_flags() rather than in register_netdev(), but then felt that a series of ifup/ifdown will unnecessarily check stuff/malloc/free/initialize stuff, and so thought of putting it in unregister_netdev (where it is balanced with register_netdev). Is there any reason to move this ? Thanks, - KK From krkumar2 at in.ibm.com Fri Jul 20 03:32:42 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 16:02:42 +0530 Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes. In-Reply-To: <46A08A35.5090104@trash.net> Message-ID: Patrick McHardy wrote on 07/20/2007 03:41:01 PM: > Krishna Kumar wrote: > > diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c > > --- org/net/sched/sch_generic.c 2007-07-20 07:49:28.000000000 +0530 > > +++ new/net/sched/sch_generic.c 2007-07-20 08:30:22.000000000 +0530 > > @@ -9,6 +9,11 @@ > > * Authors: Alexey Kuznetsov, > > * Jamal Hadi Salim, 990601 > > * - Ingress support > > + * > > + * New functionality: > > + * Krishna Kumar, , July 2007 > > + * - Support for sending multiple skbs to devices that support > > + * new api - dev->hard_start_xmit_batch() > > > No new changelogs in source code please, git keeps track of that. Ah, didn't know this, thanks for letting me know. > > -static inline int qdisc_restart(struct net_device *dev) > > +static inline int qdisc_restart(struct net_device *dev, > > + struct sk_buff_head *blist) > > { > > struct Qdisc *q = dev->qdisc; > > struct sk_buff *skb; > > - unsigned lockless; > > + unsigned getlock; /* whether we need to get lock or not */ > > > Unrelated rename, please get rid of this to reduce the noise. OK, I guess I should have sent that change earlier :) The reason to change the name is to avoid (double-negative) checks like : if (!lockless) to if (getlock). I will remove these changes. 
thanks, - KK From kaber at trash.net Fri Jul 20 04:20:37 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 13:20:37 +0200 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: References: Message-ID: <46A09A85.7020500@trash.net> Krishna Kumar2 wrote: > Hi Patrick, > > Thanks for your comments. > > Patrick McHardy wrote on 07/20/2007 03:34:30 PM: > > >> The queue length can be changed through multiple interfaces, if that >> really is important you need to catch these cases too. >> > > I have a TODO comment in net-sysfs.c which is to catch this case. > I noticed that. Still wondering why it is important at all though. > >>> + } else { >>> + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, >>> + GFP_KERNEL); >>> >> Why not simply put the head in struct net_device? It seems to me that >> this could also be used for gso_skb. >> > > Without going into GSO, it is wasting some 32 bytes on i386 since most > drivers don't export this API. > 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste. If you'd use it for gso_skb it would come down to 8 bytes. struct net_device is a pig already, and there are better ways to reduce this than starting to allocating single members with a few bytes IMO. > >> Queue purging should be done in dev_deactivate. >> > > I originally had it in dev_deactivate, but when I did a ifdown eth0, ifup > eth0, > the system panic'd. The first solution I thought was to initialize the > skb_blist > in dev_change_flags() rather than in register_netdev(), but then felt that > a > series of ifup/ifdown will unnecessarily check stuff/malloc/free/initialize > stuff, > and so thought of putting it in unregister_netdev (where it is balanced > with > register_netdev). > > Is there any reason to move this ? > Yes, packets can be holding references to various stuff and these should be released on device down. 
As I said above I don't really like the allocation, but even if you want to keep it, just do the purging and dev_deactivate and keep the freeing in unregister_netdev (actually I guess it should be free_netdev to handle register_netdevice errors). From kaber at trash.net Fri Jul 20 04:21:51 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 13:21:51 +0200 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. In-Reply-To: References: Message-ID: <46A09ACF.20805@trash.net> Krishna Kumar2 wrote: > Patrick McHardy wrote on 07/20/2007 03:37:20 PM: > > > >> rtnetlink support seems more important than sysfs to me. >> > > Thanks, I will add that as a patch. The reason to add to sysfs is that > it is easier to change for a user (and similar to tx_queue_len). > Thanks. From kaber at trash.net Fri Jul 20 04:24:01 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 13:24:01 +0200 Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes. In-Reply-To: References: Message-ID: <46A09B51.6030301@trash.net> Krishna Kumar2 wrote: > Patrick McHardy wrote on 07/20/2007 03:41:01 PM: > >>> -static inline int qdisc_restart(struct net_device *dev) >>> +static inline int qdisc_restart(struct net_device *dev, >>> + struct sk_buff_head *blist) >>> { >>> struct Qdisc *q = dev->qdisc; >>> struct sk_buff *skb; >>> - unsigned lockless; >>> + unsigned getlock; /* whether we need to get lock or not */ >>> >> Unrelated rename, please get rid of this to reduce the noise. >> > > OK, I guess I should have sent that change earlier :) The reason to change > the name is to avoid (double-negative) checks like : > > if (!lockless) > to > if (getlock). > > I will remove these changes. > I guess you could put it in another patch. 
But frankly, I think the biggest uglyness is the conditional locking, not naming or double negation, so it won't really make the code any nicer :) From krkumar2 at in.ibm.com Fri Jul 20 04:52:05 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 17:22:05 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: <46A09A85.7020500@trash.net> Message-ID: Hi Patrick, Patrick McHardy wrote on 07/20/2007 04:50:37 PM: > > I have a TODO comment in net-sysfs.c which is to catch this case. > > > > I noticed that. Still wondering why it is important at all though. I saw another mail of yours on the marc list on this same topic (which still hasn't come to me in the mail), so I will answer both : > Is there any downside in using batching with smaller queue sizes? I think there is, but as yet I don't have any data (and 16 is probably higher than reqd) to show it. If the queue size is very small (like 4), the extra processing to maintain this list may take more cycles than the performance gains for sending out few skbs, esp since most xmits will send out 1 skb and skb batching takes places less often (when tx lock fails or queue gets full). OTOH, there might be a gain to even send out 2 skbs, the problem is in doing the extra processing before xmit and not at the time of xmit. Does this sound OK ? If so, I will add the code to implement the TODO for tx_queue_len checking too. > > Without going into GSO, it is wasting some 32 bytes on i386 since most > > drivers don't export this API. > > 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste. > If you'd use it for gso_skb it would come down to 8 bytes. struct > net_device is a pig already, and there are better ways to reduce this > than starting to allocating single members with a few bytes IMO. Sorry, I wanted to say 12 bytes on 32 bit system but mixed it up and said 32 bytes. 
So I guess static allocation is better then, and it will also help in performance as memory access is not required (offsetof should work). > Yes, packets can be holding references to various stuff and > these should be released on device down. As I said above I > don't really like the allocation, but even if you want to > keep it, just do the purging and dev_deactivate and keep the > freeing in unregister_netdev (actually I guess it should be > free_netdev to handle register_netdevice errors). Right, that makes it clean to do (and avoid stale packets on down). I will make both these changes now. Thanks for these suggestions, - KK From kaber at trash.net Fri Jul 20 04:55:37 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 13:55:37 +0200 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: References: Message-ID: <46A0A2B9.1050504@trash.net> Krishna Kumar2 wrote: > Patrick McHardy wrote on 07/20/2007 04:50:37 PM >> Is there any downside in using batching with smaller queue sizes? >> > > I think there is, but as yet I don't have any data (and 16 is probably > higher > than reqd) to show it. If the queue size is very small (like 4), the extra > processing to maintain this list may take more cycles than the performance > gains for sending out few skbs, esp since most xmits will send out 1 skb > and > skb batching takes places less often (when tx lock fails or queue gets > full). > > OTOH, there might be a gain to even send out 2 skbs, the problem is in > doing > the extra processing before xmit and not at the time of xmit. > > Does this sound OK ? If so, I will add the code to implement the TODO for > tx_queue_len checking too. > I can't really argue about the numbers, but it seems to me that only devices which *usually* have a sufficient queue length will support this, and anyone setting the queue length of a gbit device to <16 is begging for trouble anyway. 
So it doesn't really seem worth to bloat the code for handling an insane configuration as long as it doesn't break. From krkumar2 at in.ibm.com Fri Jul 20 05:09:18 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 17:39:18 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: <46A09A85.7020500@trash.net> Message-ID: Patrick McHardy wrote on 07/20/2007: > I can't really argue about the numbers, but it seems to me that only > devices which *usually* have a sufficient queue length will support > this, and anyone setting the queue length of a gbit device to <16 is > begging for trouble anyway. So it doesn't really seem worth to bloat > the code for handling an insane configuration as long as it doesn't > break. Ah, I get your point now. So if driver sets BATCHING and user then sets queue_len to (say) 4, then poor results are expected (and kernel doesn't need to try fix it). Same for driver setting BATCHING when it's queue is small in the first place, which no driver writer should do anyway. I think it makes the code a lot easier too. Will update. thanks, - KK From krkumar2 at in.ibm.com Fri Jul 20 05:25:18 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 17:55:18 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: <46A09A85.7020500@trash.net> Message-ID: Patrick McHardy wrote on 07/20/2007 04:50:37 PM: > 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste. > If you'd use it for gso_skb it would come down to 8 bytes. struct > net_device is a pig already, and there are better ways to reduce this > than starting to allocating single members with a few bytes IMO. Currently, this allocated pointer is an indication to let kernel users (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching is enabled or disabled. Removing the pointer and making it static means those users cannot figure out this information . 
Adding another field to netdev may be a bad idea, so I am thinking of overloading dev->features to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS bit. Does this approach sound OK ? Thanks, - KK From kaber at trash.net Fri Jul 20 05:37:06 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 14:37:06 +0200 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: References: Message-ID: <46A0AC72.7090707@trash.net> Krishna Kumar2 wrote: > Patrick McHardy wrote on 07/20/2007 04:50:37 PM: > > >> 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste. >> If you'd use it for gso_skb it would come down to 8 bytes. struct >> net_device is a pig already, and there are better ways to reduce this >> than starting to allocating single members with a few bytes IMO. >> > > Currently, this allocated pointer is an indication to let kernel users > (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching > is enabled or disabled. Removing the pointer and making it static means > those users cannot figure out this information . Adding another field to > netdev may be a bad idea, so I am thinking of overloading dev->features > to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver > capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS > bit. Does this approach sound OK ? > I guess so. It would be more consistent with things like HW checksumming etc. though to handle this through ethtool and have the ethtool callbacks set or clear just the one feature bit. That would mean you don't need to provide further indication of the device's capabilities to the stack since only the driver enables or disables the feature. From krkumar2 at in.ibm.com Fri Jul 20 05:33:56 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 18:03:56 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. 
Message-ID: (My Notes crashed when I hit the Send button, so not sure if this went out). __________________ Patrick McHardy wrote on 07/20/2007 04:50:37 PM: > 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste. > If you'd use it for gso_skb it would come down to 8 bytes. struct > net_device is a pig already, and there are better ways to reduce this > than starting to allocating single members with a few bytes IMO. Currently, this allocated pointer is an indication to let kernel users (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching is enabled or disabled. Removing the pointer and making it static means those users cannot figure out this information . Adding another field to netdev may be a bad idea, so I am thinking of overloading dev->features to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver capabilities flag) which can be set/cleared based on NETIF_F_BATCH_SKBS bit. Does this approach sound OK ? Thanks, - KK
From johnpol at 2ka.mipt.ru Fri Jul 20 05:54:23 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Fri, 20 Jul 2007 16:54:23 +0400 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <20070720125423.GB13468@2ka.mipt.ru> Hi Krishna. On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > After fine-tuning qdisc and other changes, I modified IPoIB to use this API, > and now get good gains. Summary for TCP & No Delay: 1 process improves for > all cases from 1.4% to 49.5%; 4 process has almost identical improvements > from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP > was tested with 1 process netperf with small increase in BW but big > improvement in Service Demand. Netperf latency tests show small drop in > transaction rate (results in separate attachment). What about round-robin tcp time and latency test? In theory such batching mode should not change that timings, but practice can show new aspects. I will review code later this week (likely tomorrow) and if there will be some issues return back.
-- Evgeniy Polyakov From krkumar2 at in.ibm.com Fri Jul 20 06:02:50 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 20 Jul 2007 18:32:50 +0530 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <20070720125423.GB13468@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 07/20/2007 06:24:23 PM: > > After fine-tuning qdisc and other changes, I modified IPoIB to use this API, > > and now get good gains. Summary for TCP & No Delay: 1 process improves for > > all cases from 1.4% to 49.5%; 4 process has almost identical improvements > > from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to > > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP > > was tested with 1 process netperf with small increase in BW but big > > improvement in Service Demand. Netperf latency tests show small drop in > > transaction rate (results in separate attachment). > > What about round-robin tcp time and latency test? In theory such batching > mode should not change that timings, but practice can show new aspects. > I will review code later this week (likely tomorrow) and if there will > be some issues return back. I had run RR test quite some time back and don't have the result at this time, other than remembering it was almost the same as the original. As I am running some tests on those systems at this time, I can send the results of RR tomorrow. Thanks, - KK From hnguyen at linux.vnet.ibm.com Fri Jul 20 06:48:35 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 15:48:35 +0200 Subject: [ofa-general] [PATCH 0/5] ehca: MR large page, small queue and fixes Message-ID: <200707201548.36047.hnguyen@linux.vnet.ibm.com> Here is a patch set against Roland's git, branch for-2.6.23 for ehca. It adds support for MR large page and small queues. In addition of that it also contains various small fixes from previous comments and what we found. 
They are in details: [1/5] adds support for MR large page [2/5] generates event when SRQ limit reached [3/5] makes ehca2ib_return_code() non inline [4/5] makes internal_create/destroy_qp() static [5/5] adds support for small queues The patches should apply cleanly, in order, against Roland's git. Please review the changes and apply the patches if they are okay. Regards, Nam & Stefan From hnguyen at linux.vnet.ibm.com Fri Jul 20 07:01:51 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 16:01:51 +0200 Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs Message-ID: <200707201601.52277.hnguyen@linux.vnet.ibm.com> From: Hoang-Nam Nguyen Date: Thu, 19 Jul 2007 20:48:04 +0200 Subject: [PATCH 1/5] IB/ehca: Support large page MRs Add support for MR pages larger than 4K on eHCA2. This reduces firmware memory consumption. If enabled via the mr_largepage module parameter, the MR page size will be determined based on the MR length and the hardware capabilities - if the MR is >= 16M, 16M pages are used, for example. 
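[Editorial sketch, not part of the original posting] The selection rule described above can be illustrated roughly as follows, assuming it amounts to "use the largest page size that the adapter reports as supported and that does not exceed the MR length". The capability bit values are the ones defined in the ehca_classes.h hunk of this patch; the helper name pick_mr_hwpage_size, the lookup table, and the 4K fallback are hypothetical and are not taken from the driver, whose real logic lives in ehca_mrmw.c and may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Capability bits as defined in the patch (ehca_classes.h). */
#define HCA_CAP_MR_PGSIZE_4K   1
#define HCA_CAP_MR_PGSIZE_64K  2
#define HCA_CAP_MR_PGSIZE_1M   4
#define HCA_CAP_MR_PGSIZE_16M  8

/*
 * Hypothetical helper (an assumption for illustration only):
 * return the largest hardware page size, in bytes, that is both
 * advertised in cap_mr_pgsize and not larger than the MR length.
 * Falls back to 4K when no larger size qualifies.
 */
static uint64_t pick_mr_hwpage_size(uint64_t mr_len, uint32_t cap_mr_pgsize)
{
	/* Candidates ordered from largest to smallest. */
	static const struct {
		uint32_t cap;
		uint64_t size;
	} candidates[] = {
		{ HCA_CAP_MR_PGSIZE_16M, 16ULL << 20 },
		{ HCA_CAP_MR_PGSIZE_1M,   1ULL << 20 },
		{ HCA_CAP_MR_PGSIZE_64K, 64ULL << 10 },
	};
	size_t i;

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		if ((cap_mr_pgsize & candidates[i].cap) &&
		    mr_len >= candidates[i].size)
			return candidates[i].size;
	}

	return 4096; /* default: plain 4K pages */
}
```

Under these assumptions, a 32M region on an adapter advertising all four sizes would be mapped with 16M pages, while the same region on an adapter advertising only 4K and 64K would fall back to 64K pages. Fewer, larger pages mean fewer page entries to register with firmware, which is presumably where the reduced firmware memory consumption mentioned above comes from.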
Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_classes.h | 9 + drivers/infiniband/hw/ehca/ehca_main.c | 18 ++- drivers/infiniband/hw/ehca/ehca_mrmw.c | 371 ++++++++++++++++++++++++----- drivers/infiniband/hw/ehca/ehca_mrmw.h | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 20 ++- 5 files changed, 357 insertions(+), 63 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 043e4fb..63b8b9f 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -100,6 +100,11 @@ struct ehca_sport { struct ehca_sma_attr saved_attr; }; +#define HCA_CAP_MR_PGSIZE_4K 1 +#define HCA_CAP_MR_PGSIZE_64K 2 +#define HCA_CAP_MR_PGSIZE_1M 4 +#define HCA_CAP_MR_PGSIZE_16M 8 + struct ehca_shca { struct ib_device ib_device; struct ibmebus_dev *ibmebus_dev; @@ -115,6 +120,8 @@ struct ehca_shca { struct h_galpas galpas; struct mutex modify_mutex; u64 hca_cap; + /* MR pgsize: bit 0-3 means 4K, 64K, 1M, 16M respectively */ + u32 hca_cap_mr_pgsize; int max_mtu; }; @@ -206,6 +213,7 @@ struct ehca_mr { enum ehca_mr_flag flags; u32 num_kpages; /* number of kernel pages */ u32 num_hwpages; /* number of hw pages to form MR */ + u64 hwpage_size; /* hw page size used for this MR */ int acl; /* ACL (stored here for usage in reregister) */ u64 *start; /* virtual start address (stored here for */ /* usage in reregister) */ @@ -240,6 +248,7 @@ struct ehca_mr_pginfo { enum ehca_mr_pgi_type type; u64 num_kpages; u64 kpage_cnt; + u64 hwpage_size; /* hw page size used for this MR */ u64 num_hwpages; /* number of hw pages */ u64 hwpage_cnt; /* counter for hw pages */ u64 next_hwpage; /* next hw page in buffer/chunk/listelem */ diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 36377c6..34661c3 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -63,6 +63,7 @@ int ehca_port_act_time = 30; int 
ehca_poll_all_eqs = 1; int ehca_static_rate = -1; int ehca_scaling_code = 0; +int ehca_mr_largepage = 0; module_param_named(open_aqp1, ehca_open_aqp1, int, 0); module_param_named(debug_level, ehca_debug_level, int, 0); @@ -72,7 +73,8 @@ module_param_named(use_hp_mr, ehca_use_hp_mr, int, 0); module_param_named(port_act_time, ehca_port_act_time, int, 0); module_param_named(poll_all_eqs, ehca_poll_all_eqs, int, 0); module_param_named(static_rate, ehca_static_rate, int, 0); -module_param_named(scaling_code, ehca_scaling_code, int, 0); +module_param_named(scaling_code, ehca_scaling_code, int, 0); +module_param_named(mr_largepage, ehca_mr_largepage, int, 0); MODULE_PARM_DESC(open_aqp1, "AQP1 on startup (0: no (default), 1: yes)"); @@ -95,6 +97,9 @@ MODULE_PARM_DESC(static_rate, "set permanent static rate (default: disabled)"); MODULE_PARM_DESC(scaling_code, "set scaling code (0: disabled/default, 1: enabled)"); +MODULE_PARM_DESC(mr_largepage, + "use large page for MR (0: use PAGE_SIZE (default), " + "1: use large page depending on MR size)"); DEFINE_RWLOCK(ehca_qp_idr_lock); DEFINE_RWLOCK(ehca_cq_idr_lock); @@ -295,6 +300,8 @@ int ehca_sense_attributes(struct ehca_shca *shca) if (EHCA_BMASK_GET(hca_cap_descr[i].mask, shca->hca_cap)) ehca_gen_dbg(" %s", hca_cap_descr[i].descr); + shca->hca_cap_mr_pgsize = rblock->memory_page_size_supported; + port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { @@ -590,6 +597,14 @@ static ssize_t ehca_show_adapter_handle(struct device *dev, } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); +static ssize_t ehca_show_mr_largepage(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "%d\n", ehca_mr_largepage); +} +static DEVICE_ATTR(mr_largepage, S_IRUGO, ehca_show_mr_largepage, NULL); + static struct attribute *ehca_dev_attrs[] = { &dev_attr_adapter_handle.attr, &dev_attr_num_ports.attr, @@ -606,6 +621,7
@@ static struct attribute *ehca_dev_attrs[] = { &dev_attr_cur_mw.attr, &dev_attr_max_pd.attr, &dev_attr_max_ah.attr, + &dev_attr_mr_largepage.attr, NULL }; diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 6262c54..ba28783 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -5,6 +5,7 @@ * * Authors: Dietmar Decker * Christoph Raisch + * Hoang-Nam Nguyen * * Copyright (c) 2005 IBM Corporation * @@ -56,6 +57,37 @@ static struct kmem_cache *mr_cache; static struct kmem_cache *mw_cache; +enum ehca_mr_pgsize { + EHCA_MR_PGSIZE4K = 0x1000L, + EHCA_MR_PGSIZE64K = 0x10000L, + EHCA_MR_PGSIZE1M = 0x100000L, + EHCA_MR_PGSIZE16M = 0x1000000L +}; + +extern int ehca_mr_largepage; + +static u32 ehca_encode_hwpage_size(u32 pgsize) +{ + u32 idx = 0; + pgsize >>= 12; + /* + * map mr page size into hw code: + * 0, 1, 2, 3 for 4K, 64K, 1M, 16M + */ + while (!(pgsize & 1)) { + idx++; + pgsize >>= 4; + } + return idx; +} + +static u64 ehca_get_max_hwpage_size(struct ehca_shca *shca) +{ + if (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M) + return EHCA_MR_PGSIZE16M; + return EHCA_MR_PGSIZE4K; +} + static struct ehca_mr *ehca_mr_new(void) { struct ehca_mr *me; @@ -207,19 +239,23 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, struct ehca_mr_pginfo pginfo; u32 num_kpages; u32 num_hwpages; + u64 hw_pgsize; num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + - size, EHCA_PAGESIZE); + /* for kernel space we try the largest possible pgsize */ + hw_pgsize = ehca_get_max_hwpage_size(shca); + num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size, + hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; + pginfo.hwpage_size = hw_pgsize; pginfo.num_hwpages = num_hwpages; pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array =
phys_buf_array; - pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, @@ -259,6 +295,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, int ret; u32 num_kpages; u32 num_hwpages; + u64 hwpage_size; if (!pd) { ehca_gen_err("bad pd=%p", pd); @@ -309,16 +346,32 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, /* determine number of MR pages */ num_kpages = NUM_CHUNKS((virt % PAGE_SIZE) + length, PAGE_SIZE); - num_hwpages = NUM_CHUNKS((virt % EHCA_PAGESIZE) + length, - EHCA_PAGESIZE); + /* select proper hw_pgsize */ + if (ehca_mr_largepage && + (shca->hca_cap_mr_pgsize & HCA_CAP_MR_PGSIZE_16M)) { + if (length <= EHCA_MR_PGSIZE4K + && PAGE_SIZE == EHCA_MR_PGSIZE4K) + hwpage_size = EHCA_MR_PGSIZE4K; + else if (length <= EHCA_MR_PGSIZE64K) + hwpage_size = EHCA_MR_PGSIZE64K; + else if (length <= EHCA_MR_PGSIZE1M) + hwpage_size = EHCA_MR_PGSIZE1M; + else + hwpage_size = EHCA_MR_PGSIZE16M; + } else + hwpage_size = EHCA_MR_PGSIZE4K; + ehca_dbg(pd->device, "hwpage_size=%lx", hwpage_size); +reg_user_mr_fallback: + num_hwpages = NUM_CHUNKS((virt % hwpage_size) + length, hwpage_size); /* register MR on HCA */ memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_USER; + pginfo.hwpage_size = hwpage_size; pginfo.num_kpages = num_kpages; pginfo.num_hwpages = num_hwpages; pginfo.u.usr.region = e_mr->umem; - pginfo.next_hwpage = e_mr->umem->offset / EHCA_PAGESIZE; + pginfo.next_hwpage = e_mr->umem->offset / hwpage_size; pginfo.u.usr.next_chunk = list_prepare_entry(pginfo.u.usr.next_chunk, (&e_mr->umem->chunk_list), list); @@ -326,6 +379,18 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, 
&e_mr->ib.ib_mr.rkey); + if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) { + ehca_warn(pd->device, "failed to register mr " + "with hwpage_size=%lx", hwpage_size); + ehca_info(pd->device, "try to register mr with " + "kpage_size=%lx", PAGE_SIZE); + /* + * this means kpages are not contiguous for a hw page + * try kernel page size as fallback solution + */ + hwpage_size = PAGE_SIZE; + goto reg_user_mr_fallback; + } if (ret) { ib_mr = ERR_PTR(ret); goto reg_user_mr_exit2; @@ -452,6 +517,8 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, new_pd = container_of(mr->pd, struct ehca_pd, ib_pd); if (mr_rereg_mask & IB_MR_REREG_TRANS) { + u64 hw_pgsize = ehca_get_max_hwpage_size(shca); + new_start = iova_start; /* change address */ /* check physical buffer list and calculate size */ ret = ehca_mr_chk_buf_and_calc_size(phys_buf_array, @@ -468,16 +535,17 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, } num_kpages = NUM_CHUNKS(((u64)new_start % PAGE_SIZE) + new_size, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)new_start % EHCA_PAGESIZE) + - new_size, EHCA_PAGESIZE); + num_hwpages = NUM_CHUNKS(((u64)new_start % hw_pgsize) + + new_size, hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; + pginfo.hwpage_size = hw_pgsize; pginfo.num_hwpages = num_hwpages; pginfo.u.phy.num_phys_buf = num_phys_buf; pginfo.u.phy.phys_buf_array = phys_buf_array; - pginfo.next_hwpage = (((u64)iova_start & ~PAGE_MASK) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + ((u64)iova_start & ~(hw_pgsize - 1)) / hw_pgsize; } if (mr_rereg_mask & IB_MR_REREG_ACCESS) new_acl = mr_access_flags; @@ -709,6 +777,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, int ret; u32 tmp_lkey, tmp_rkey; struct ehca_mr_pginfo pginfo; + u64 hw_pgsize; /* check other parameters */ if (((mr_access_flags & IB_ACCESS_REMOTE_WRITE) && @@ -738,8 +807,8 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, ib_fmr = ERR_PTR(-EINVAL); goto alloc_fmr_exit0; } - if (((1 << 
fmr_attr->page_shift) != EHCA_PAGESIZE) && - ((1 << fmr_attr->page_shift) != PAGE_SIZE)) { + hw_pgsize = ehca_get_max_hwpage_size(shca); + if ((1 << fmr_attr->page_shift) != hw_pgsize) { ehca_err(pd->device, "unsupported fmr_attr->page_shift=%x", fmr_attr->page_shift); ib_fmr = ERR_PTR(-EINVAL); @@ -755,6 +824,10 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, /* register MR on HCA */ memset(&pginfo, 0, sizeof(pginfo)); + /* + * pginfo.num_hwpages==0, ie register_rpages() will not be called + * but deferred to map_phys_fmr() + */ ret = ehca_reg_mr(shca, e_fmr, NULL, fmr_attr->max_pages * (1 << fmr_attr->page_shift), mr_access_flags, e_pd, &pginfo, @@ -765,6 +838,7 @@ struct ib_fmr *ehca_alloc_fmr(struct ib_pd *pd, } /* successful */ + e_fmr->hwpage_size = hw_pgsize; e_fmr->fmr_page_size = 1 << fmr_attr->page_shift; e_fmr->fmr_max_pages = fmr_attr->max_pages; e_fmr->fmr_max_maps = fmr_attr->max_maps; @@ -822,10 +896,12 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_FMR; pginfo.num_kpages = list_len; - pginfo.num_hwpages = list_len * (e_fmr->fmr_page_size / EHCA_PAGESIZE); + pginfo.hwpage_size = e_fmr->hwpage_size; + pginfo.num_hwpages = + list_len * e_fmr->fmr_page_size / pginfo.hwpage_size; pginfo.u.fmr.page_list = page_list; - pginfo.next_hwpage = ((iova & (e_fmr->fmr_page_size-1)) / - EHCA_PAGESIZE); + pginfo.next_hwpage = + (iova & (e_fmr->fmr_page_size-1)) / pginfo.hwpage_size; pginfo.u.fmr.fmr_pgsize = e_fmr->fmr_page_size; ret = ehca_rereg_mr(shca, e_fmr, (u64 *)iova, @@ -964,7 +1040,7 @@ int ehca_reg_mr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl); if (ehca_use_hp_mr == 1) hipz_acl |= 0x00000001; @@ -987,6 +1063,7 @@ int ehca_reg_mr(struct ehca_shca *shca, /* successful registration */ e_mr->num_kpages = pginfo->num_kpages; 
e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->hwpage_size = pginfo->hwpage_size; e_mr->start = iova_start; e_mr->size = size; e_mr->acl = acl; @@ -1029,6 +1106,9 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, u32 i; u64 *kpage; + if (!pginfo->num_hwpages) /* in case of fmr */ + return 0; + kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); @@ -1036,7 +1116,7 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, goto ehca_reg_mr_rpages_exit0; } - /* max 512 pages per shot */ + /* max MAX_RPAGES ehca mr pages per register call */ for (i = 0; i < NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES); i++) { if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { @@ -1049,8 +1129,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = ehca_set_pagebuf(pginfo, rnum, kpage); if (ret) { ehca_err(&shca->ib_device, "ehca_set_pagebuf " - "bad rc, ret=%x rnum=%x kpage=%p", - ret, rnum, kpage); + "bad rc, ret=%x rnum=%x kpage=%p", + ret, rnum, kpage); goto ehca_reg_mr_rpages_exit1; } @@ -1065,9 +1145,10 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, } else rpage = *kpage; - h_ret = hipz_h_register_rpage_mr(shca->ipz_hca_handle, e_mr, - 0, /* pagesize 4k */ - 0, rpage, rnum); + h_ret = hipz_h_register_rpage_mr( + shca->ipz_hca_handle, e_mr, + ehca_encode_hwpage_size(pginfo->hwpage_size), + 0, rpage, rnum); if (i == NUM_CHUNKS(pginfo->num_hwpages, MAX_RPAGES) - 1) { /* @@ -1131,7 +1212,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(pginfo->hwpage_size, &hipz_acl); kpage = ehca_alloc_fw_ctrlblock(GFP_KERNEL); if (!kpage) { @@ -1182,6 +1263,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, */ e_mr->num_kpages = pginfo->num_kpages; e_mr->num_hwpages = pginfo->num_hwpages; + e_mr->hwpage_size = pginfo->hwpage_size; e_mr->start = iova_start; e_mr->size = 
size; e_mr->acl = acl; @@ -1268,13 +1350,14 @@ int ehca_rereg_mr(struct ehca_shca *shca, /* set some MR values */ e_mr->flags = save_mr.flags; + e_mr->hwpage_size = save_mr.hwpage_size; e_mr->fmr_page_size = save_mr.fmr_page_size; e_mr->fmr_max_pages = save_mr.fmr_max_pages; e_mr->fmr_max_maps = save_mr.fmr_max_maps; e_mr->fmr_map_cnt = save_mr.fmr_map_cnt; ret = ehca_reg_mr(shca, e_mr, iova_start, size, acl, - e_pd, pginfo, lkey, rkey); + e_pd, pginfo, lkey, rkey); if (ret) { u32 offset = (u64)(&e_mr->flags) - (u64)e_mr; memcpy(&e_mr->flags, &(save_mr.flags), @@ -1355,6 +1438,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, /* set some MR values */ e_fmr->flags = save_fmr.flags; + e_fmr->hwpage_size = save_fmr.hwpage_size; e_fmr->fmr_page_size = save_fmr.fmr_page_size; e_fmr->fmr_max_pages = save_fmr.fmr_max_pages; e_fmr->fmr_max_maps = save_fmr.fmr_max_maps; @@ -1363,8 +1447,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_FMR; - pginfo.num_kpages = 0; - pginfo.num_hwpages = 0; ret = ehca_reg_mr(shca, e_fmr, NULL, (e_fmr->fmr_max_pages * e_fmr->fmr_page_size), e_fmr->acl, e_pd, &pginfo, &tmp_lkey, @@ -1373,7 +1455,6 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, u32 offset = (u64)(&e_fmr->flags) - (u64)e_fmr; memcpy(&e_fmr->flags, &(save_mr.flags), sizeof(struct ehca_mr) - offset); - goto ehca_unmap_one_fmr_exit0; } ehca_unmap_one_fmr_exit0: @@ -1401,7 +1482,7 @@ int ehca_reg_smr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl); h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, (u64)iova_start, hipz_acl, e_pd->fw_pd, @@ -1420,6 +1501,7 @@ int ehca_reg_smr(struct ehca_shca *shca, /* successful registration */ e_newmr->num_kpages = e_origmr->num_kpages; e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->hwpage_size 
= e_origmr->hwpage_size; e_newmr->start = iova_start; e_newmr->size = e_origmr->size; e_newmr->acl = acl; @@ -1452,6 +1534,7 @@ int ehca_reg_internal_maxmr( struct ib_phys_buf ib_pbuf; u32 num_kpages; u32 num_hwpages; + u64 hw_pgsize; e_mr = ehca_mr_new(); if (!e_mr) { @@ -1468,13 +1551,15 @@ int ehca_reg_internal_maxmr( ib_pbuf.size = size_maxmr; num_kpages = NUM_CHUNKS(((u64)iova_start % PAGE_SIZE) + size_maxmr, PAGE_SIZE); - num_hwpages = NUM_CHUNKS(((u64)iova_start % EHCA_PAGESIZE) + size_maxmr, - EHCA_PAGESIZE); + hw_pgsize = ehca_get_max_hwpage_size(shca); + num_hwpages = NUM_CHUNKS(((u64)iova_start % hw_pgsize) + size_maxmr, + hw_pgsize); memset(&pginfo, 0, sizeof(pginfo)); pginfo.type = EHCA_MR_PGI_PHYS; pginfo.num_kpages = num_kpages; pginfo.num_hwpages = num_hwpages; + pginfo.hwpage_size = hw_pgsize; pginfo.u.phy.num_phys_buf = 1; pginfo.u.phy.phys_buf_array = &ib_pbuf; @@ -1523,7 +1608,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, struct ehca_mr_hipzout_parms hipzout; ehca_mrmw_map_acl(acl, &hipz_acl); - ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); + ehca_mrmw_set_pgsize_hipz_acl(e_origmr->hwpage_size, &hipz_acl); h_ret = hipz_h_register_smr(shca->ipz_hca_handle, e_newmr, e_origmr, (u64)iova_start, hipz_acl, e_pd->fw_pd, @@ -1539,6 +1624,7 @@ int ehca_reg_maxmr(struct ehca_shca *shca, /* successful registration */ e_newmr->num_kpages = e_origmr->num_kpages; e_newmr->num_hwpages = e_origmr->num_hwpages; + e_newmr->hwpage_size = e_origmr->hwpage_size; e_newmr->start = iova_start; e_newmr->size = e_origmr->size; e_newmr->acl = acl; @@ -1684,6 +1770,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, u64 pgaddr; u32 i = 0; u32 j = 0; + int hwpages_per_kpage = PAGE_SIZE / pginfo->hwpage_size; /* loop over desired chunk entries */ chunk = pginfo->u.usr.next_chunk; @@ -1695,7 +1782,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, << PAGE_SHIFT ; *kpage = phys_to_abs(pgaddr + (pginfo->next_hwpage * - EHCA_PAGESIZE)); + 
pginfo->hwpage_size)); if ( !(*kpage) ) { ehca_gen_err("pgaddr=%lx " "chunk->page_list[i]=%lx " @@ -1708,8 +1795,7 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; kpage++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) { + if (pginfo->next_hwpage % hwpages_per_kpage == 0) { (pginfo->kpage_cnt)++; (pginfo->u.usr.next_nmap)++; pginfo->next_hwpage = 0; @@ -1738,6 +1824,143 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, return ret; } +/* + * check given pages for contiguous layout + * last page addr is returned in prev_pgaddr for further check + */ +static int ehca_check_kpages_per_ate(struct scatterlist *page_list, + int start_idx, int end_idx, + u64 *prev_pgaddr) +{ + int t; + for (t = start_idx; t <= end_idx; t++) { + u64 pgaddr = page_to_pfn(page_list[t].page) << PAGE_SHIFT; + ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr, + *(u64 *)abs_to_virt(phys_to_abs(pgaddr))); + if (pgaddr - PAGE_SIZE != *prev_pgaddr) { + ehca_gen_err("uncontiguous page found pgaddr=%lx " + "prev_pgaddr=%lx page_list_i=%x", + pgaddr, *prev_pgaddr, t); + return -EINVAL; + } + *prev_pgaddr = pgaddr; + } + return 0; +} + +/* PAGE_SIZE < pginfo->hwpage_size */ +static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, + u32 number, + u64 *kpage) +{ + int ret = 0; + struct ib_umem_chunk *prev_chunk; + struct ib_umem_chunk *chunk; + u64 pgaddr, prev_pgaddr; + u32 i = 0; + u32 j = 0; + int kpages_per_hwpage = pginfo->hwpage_size / PAGE_SIZE; + int nr_kpages = kpages_per_hwpage; + + /* loop over desired chunk entries */ + chunk = pginfo->u.usr.next_chunk; + prev_chunk = pginfo->u.usr.next_chunk; + list_for_each_entry_continue( + chunk, (&(pginfo->u.usr.region->chunk_list)), list) { + for (i = pginfo->u.usr.next_nmap; i < chunk->nmap; ) { + if (nr_kpages == kpages_per_hwpage) { + pgaddr = ( page_to_pfn(chunk->page_list[i].page) + << PAGE_SHIFT ); + *kpage = phys_to_abs(pgaddr); + if 
( !(*kpage) ) { + ehca_gen_err("pgaddr=%lx i=%x", + pgaddr, i); + ret = -EFAULT; + return ret; + } + /* + * The first page in a hwpage must be aligned; + * the first MR page is exempt from this rule. + */ + if (pgaddr & (pginfo->hwpage_size - 1)) { + if (pginfo->hwpage_cnt) { + ehca_gen_err( + "invalid alignment " + "pgaddr=%lx i=%x " + "mr_pgsize=%lx", + pgaddr, i, + pginfo->hwpage_size); + ret = -EFAULT; + return ret; + } + /* first MR page */ + pginfo->kpage_cnt = + (pgaddr & + (pginfo->hwpage_size - 1)) >> + PAGE_SHIFT; + nr_kpages -= pginfo->kpage_cnt; + *kpage = phys_to_abs( + pgaddr & + ~(pginfo->hwpage_size - 1)); + } + ehca_gen_dbg("kpage=%lx chunk_page=%lx " + "value=%016lx", *kpage, pgaddr, + *(u64 *)abs_to_virt( + phys_to_abs(pgaddr))); + prev_pgaddr = pgaddr; + i++; + pginfo->kpage_cnt++; + pginfo->u.usr.next_nmap++; + nr_kpages--; + if (!nr_kpages) + goto next_kpage; + continue; + } + if (i + nr_kpages > chunk->nmap) { + ret = ehca_check_kpages_per_ate( + chunk->page_list, i, + chunk->nmap - 1, &prev_pgaddr); + if (ret) return ret; + pginfo->kpage_cnt += chunk->nmap - i; + pginfo->u.usr.next_nmap += chunk->nmap - i; + nr_kpages -= chunk->nmap - i; + break; + } + + ret = ehca_check_kpages_per_ate(chunk->page_list, i, + i + nr_kpages - 1, + &prev_pgaddr); + if (ret) return ret; + i += nr_kpages; + pginfo->kpage_cnt += nr_kpages; + pginfo->u.usr.next_nmap += nr_kpages; +next_kpage: + nr_kpages = kpages_per_hwpage; + (pginfo->hwpage_cnt)++; + kpage++; + j++; + if (j >= number) break; + } + if ((pginfo->u.usr.next_nmap >= chunk->nmap) && + (j >= number)) { + pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + break; + } else if (pginfo->u.usr.next_nmap >= chunk->nmap) { + pginfo->u.usr.next_nmap = 0; + prev_chunk = chunk; + } else if (j >= number) + break; + else + prev_chunk = chunk; + } + pginfo->u.usr.next_chunk = + list_prepare_entry(prev_chunk, + (&(pginfo->u.usr.region->chunk_list)), + list); + return ret; +} + int ehca_set_pagebuf_phys(struct 
ehca_mr_pginfo *pginfo, u32 number, u64 *kpage) @@ -1750,9 +1973,10 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, /* loop over desired phys_buf_array entries */ while (i < number) { pbuf = pginfo->u.phy.phys_buf_array + pginfo->u.phy.next_buf; - num_hw = NUM_CHUNKS((pbuf->addr % EHCA_PAGESIZE) + - pbuf->size, EHCA_PAGESIZE); - offs_hw = (pbuf->addr & ~PAGE_MASK) / EHCA_PAGESIZE; + num_hw = NUM_CHUNKS((pbuf->addr % pginfo->hwpage_size) + + pbuf->size, pginfo->hwpage_size); + offs_hw = (pbuf->addr & ~(pginfo->hwpage_size - 1)) / + pginfo->hwpage_size; while (pginfo->next_hwpage < offs_hw + num_hw) { /* sanity check */ if ((pginfo->kpage_cnt >= pginfo->num_kpages) || @@ -1768,21 +1992,23 @@ int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, return -EFAULT; } *kpage = phys_to_abs( - (pbuf->addr & EHCA_PAGEMASK) - + (pginfo->next_hwpage * EHCA_PAGESIZE)); + (pbuf->addr & ~(pginfo->hwpage_size - 1)) + + (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) && pbuf->addr ) { - ehca_gen_err("pbuf->addr=%lx " - "pbuf->size=%lx " + ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx " "next_hwpage=%lx", pbuf->addr, - pbuf->size, - pginfo->next_hwpage); + pbuf->size, pginfo->next_hwpage); return -EFAULT; } (pginfo->hwpage_cnt)++; (pginfo->next_hwpage)++; - if (pginfo->next_hwpage % - (PAGE_SIZE / EHCA_PAGESIZE) == 0) - (pginfo->kpage_cnt)++; + if (PAGE_SIZE >= pginfo->hwpage_size) { + if (pginfo->next_hwpage % + (PAGE_SIZE / pginfo->hwpage_size) == 0) + (pginfo->kpage_cnt)++; + } else + pginfo->kpage_cnt += pginfo->hwpage_size / + PAGE_SIZE; kpage++; i++; if (i >= number) break; @@ -1806,8 +2032,8 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, /* loop over desired page_list entries */ fmrlist = pginfo->u.fmr.page_list + pginfo->u.fmr.next_listelem; for (i = 0; i < number; i++) { - *kpage = phys_to_abs((*fmrlist & EHCA_PAGEMASK) + - pginfo->next_hwpage * EHCA_PAGESIZE); + *kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) + + 
pginfo->next_hwpage * pginfo->hwpage_size); if ( !(*kpage) ) { ehca_gen_err("*fmrlist=%lx fmrlist=%p " "next_listelem=%lx next_hwpage=%lx", @@ -1817,15 +2043,38 @@ int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, return -EFAULT; } (pginfo->hwpage_cnt)++; - (pginfo->next_hwpage)++; - kpage++; - if (pginfo->next_hwpage % - (pginfo->u.fmr.fmr_pgsize / EHCA_PAGESIZE) == 0) { - (pginfo->kpage_cnt)++; - (pginfo->u.fmr.next_listelem)++; - fmrlist++; - pginfo->next_hwpage = 0; + if (pginfo->u.fmr.fmr_pgsize >= pginfo->hwpage_size) { + if (pginfo->next_hwpage % + (pginfo->u.fmr.fmr_pgsize / + pginfo->hwpage_size) == 0) { + (pginfo->kpage_cnt)++; + (pginfo->u.fmr.next_listelem)++; + fmrlist++; + pginfo->next_hwpage = 0; + } else + (pginfo->next_hwpage)++; + } else { + unsigned int cnt_per_hwpage = pginfo->hwpage_size / + pginfo->u.fmr.fmr_pgsize; + unsigned int j; + u64 prev = *kpage; + /* check if adrs are contiguous */ + for (j = 1; j < cnt_per_hwpage; j++) { + u64 p = phys_to_abs(fmrlist[j] & + ~(pginfo->hwpage_size - 1)); + if (prev + pginfo->u.fmr.fmr_pgsize != p) { + ehca_gen_err("uncontiguous fmr pages " + "found prev=%lx p=%lx " + "idx=%x", prev, p, i + j); + return -EINVAL; + } + prev = p; + } + pginfo->kpage_cnt += cnt_per_hwpage; + pginfo->u.fmr.next_listelem += cnt_per_hwpage; + fmrlist += cnt_per_hwpage; } + kpage++; } return ret; } @@ -1842,7 +2091,9 @@ int ehca_set_pagebuf(struct ehca_mr_pginfo *pginfo, ret = ehca_set_pagebuf_phys(pginfo, number, kpage); break; case EHCA_MR_PGI_USER: - ret = ehca_set_pagebuf_user1(pginfo, number, kpage); + ret = PAGE_SIZE >= pginfo->hwpage_size ? + ehca_set_pagebuf_user1(pginfo, number, kpage) : + ehca_set_pagebuf_user2(pginfo, number, kpage); break; case EHCA_MR_PGI_FMR: ret = ehca_set_pagebuf_fmr(pginfo, number, kpage); @@ -1895,9 +2146,9 @@ void ehca_mrmw_map_acl(int ib_acl, /*----------------------------------------------------------------------*/ /* sets page size in hipz access control for MR/MW. 
*/ -void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl) /*INOUT*/ +void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl) /*INOUT*/ { - return; /* HCA supports only 4k */ + *hipz_acl |= (ehca_encode_hwpage_size(pgsize) << 24); } /* end ehca_mrmw_set_pgsize_hipz_acl() */ /*----------------------------------------------------------------------*/ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.h b/drivers/infiniband/hw/ehca/ehca_mrmw.h index 24f13fe..bc8f4e3 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.h +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.h @@ -111,7 +111,7 @@ int ehca_mr_is_maxmr(u64 size, void ehca_mrmw_map_acl(int ib_acl, u32 *hipz_acl); -void ehca_mrmw_set_pgsize_hipz_acl(u32 *hipz_acl); +void ehca_mrmw_set_pgsize_hipz_acl(u32 pgsize, u32 *hipz_acl); void ehca_mrmw_reverse_map_acl(const u32 *hipz_acl, int *ib_acl); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 3394e05..358796c 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -427,7 +427,8 @@ u64 hipz_h_register_rpage(const struct ipz_adapter_handle adapter_handle, { return ehca_plpar_hcall_norets(H_REGISTER_RPAGES, adapter_handle.handle, /* r4 */ - queue_type | pagesize << 8, /* r5 */ + (u64)queue_type | ((u64)pagesize) << 8, + /* r5 */ resource_handle, /* r6 */ logical_address_of_page, /* r7 */ count, /* r8 */ @@ -724,6 +725,9 @@ u64 hipz_h_alloc_resource_mr(const struct ipz_adapter_handle adapter_handle, u64 ret; u64 outs[PLPAR_HCALL9_BUFSIZE]; + ehca_gen_dbg("kernel PAGE_SIZE=%x access_ctrl=%016x " + "vaddr=%lx length=%lx", + (u32)PAGE_SIZE, access_ctrl, vaddr, length); ret = ehca_plpar_hcall9(H_ALLOC_RESOURCE, outs, adapter_handle.handle, /* r4 */ 5, /* r5 */ @@ -746,8 +750,22 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, const u64 logical_address_of_page, const u64 count) { + extern int ehca_debug_level; u64 ret; + if (unlikely(ehca_debug_level >= 2)) { + if (count 
> 1) { + u64 *kpage; + int i; + kpage = (u64 *)abs_to_virt(logical_address_of_page); + for (i = 0; i < count; i++) + ehca_gen_dbg("kpage[%d]=%p", + i, (void *)kpage[i]); + } else + ehca_gen_dbg("kpage=%p", + (void *)logical_address_of_page); + } + if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) { ehca_gen_err("logical_address_of_page not on a 4k boundary " "adapter_handle=%lx mr=%p mr_handle=%lx " -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Jul 20 07:02:46 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 16:02:46 +0200 Subject: [ofa-general] [PATCH 3/5] ehca: Make ehca2ib_return_code() non-inline Message-ID: <200707201602.46415.hnguyen@linux.vnet.ibm.com> From: Joachim Fenkes Date: Thu, 19 Jul 2007 21:13:57 +0200 Subject: [PATCH 3/5] IB/ehca: Make ehca2ib_return_code() non-inline It's nowhere in the main path and making it non-inline saves ~1.5K of code. Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_main.c | 17 +++++++++++++++++ drivers/infiniband/hw/ehca/ehca_tools.h | 19 +------------------ 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 34661c3..3bd7afb 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -130,6 +130,23 @@ void ehca_free_fw_ctrlblock(void *ptr) } #endif +int ehca2ib_return_code(u64 ehca_rc) +{ + switch (ehca_rc) { + case H_SUCCESS: + return 0; + case H_RESOURCE: /* Resource in use */ + case H_BUSY: + return -EBUSY; + case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ + case H_CONSTRAINED: /* resource constraint */ + case H_NO_MEM: + return -ENOMEM; + default: + return -EINVAL; + } +} + static int ehca_create_slab_caches(void) { int ret; diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 678b813..57c77a7 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ 
b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -154,24 +154,7 @@ extern int ehca_debug_level; #define EHCA_BMASK_GET(mask, value) \ (EHCA_BMASK_MASK(mask) & (((u64)(value)) >> EHCA_BMASK_SHIFTPOS(mask))) - /* Converts ehca to ib return code */ -static inline int ehca2ib_return_code(u64 ehca_rc) -{ - switch (ehca_rc) { - case H_SUCCESS: - return 0; - case H_RESOURCE: /* Resource in use */ - case H_BUSY: - return -EBUSY; - case H_NOT_ENOUGH_RESOURCES: /* insufficient resources */ - case H_CONSTRAINED: /* resource constraint */ - case H_NO_MEM: - return -ENOMEM; - default: - return -EINVAL; - } -} - +int ehca2ib_return_code(u64 ehca_rc); #endif /* EHCA_TOOLS_H */ -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Jul 20 07:02:18 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 16:02:18 +0200 Subject: [ofa-general] [PATCH 2/5] ehca: Generate event when SRQ limit reached Message-ID: <200707201602.19142.hnguyen@linux.vnet.ibm.com> From: Joachim Fenkes Date: Thu, 19 Jul 2007 20:51:43 +0200 Subject: [PATCH 2/5] IB/ehca: Generate event when SRQ limit reached Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_irq.c | 42 ++++++++++++++++++++++----------- 1 files changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 4fb01fc..71c0799 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -175,9 +175,8 @@ error_data1: } -static void qp_event_callback(struct ehca_shca *shca, - u64 eqe, - enum ib_event_type event_type) +static void qp_event_callback(struct ehca_shca *shca, u64 eqe, + enum ib_event_type event_type, int fatal) { struct ib_event event; struct ehca_qp *qp; @@ -191,16 +190,26 @@ static void qp_event_callback(struct ehca_shca *shca, if (!qp) return; - ehca_error_data(shca, qp, qp->ipz_qp_handle.handle); + if (fatal) + ehca_error_data(shca, qp, qp->ipz_qp_handle.handle); - if 
(!qp->ib_qp.event_handler) - return; + event.device = &shca->ib_device; - event.device = &shca->ib_device; - event.event = event_type; - event.element.qp = &qp->ib_qp; + if (qp->ext_type == EQPT_SRQ) { + if (!qp->ib_srq.event_handler) + return; - qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context); + event.event = fatal ? IB_EVENT_SRQ_ERR : event_type; + event.element.srq = &qp->ib_srq; + qp->ib_srq.event_handler(&event, qp->ib_srq.srq_context); + } else { + if (!qp->ib_qp.event_handler) + return; + + event.event = event_type; + event.element.qp = &qp->ib_qp; + qp->ib_qp.event_handler(&event, qp->ib_qp.qp_context); + } return; } @@ -234,17 +243,17 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe) switch (identifier) { case 0x02: /* path migrated */ - qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG); + qp_event_callback(shca, eqe, IB_EVENT_PATH_MIG, 0); break; case 0x03: /* communication established */ - qp_event_callback(shca, eqe, IB_EVENT_COMM_EST); + qp_event_callback(shca, eqe, IB_EVENT_COMM_EST, 0); break; case 0x04: /* send queue drained */ - qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED); + qp_event_callback(shca, eqe, IB_EVENT_SQ_DRAINED, 0); break; case 0x05: /* QP error */ case 0x06: /* QP error */ - qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL); + qp_event_callback(shca, eqe, IB_EVENT_QP_FATAL, 1); break; case 0x07: /* CQ error */ case 0x08: /* CQ error */ @@ -278,6 +287,11 @@ static void parse_identifier(struct ehca_shca *shca, u64 eqe) ehca_err(&shca->ib_device, "Interface trace stopped."); break; case 0x14: /* first error capture info available */ + ehca_info(&shca->ib_device, "First error capture available"); + break; + case 0x15: /* SRQ limit reached */ + qp_event_callback(shca, eqe, IB_EVENT_SRQ_LIMIT_REACHED, 0); + break; default: ehca_err(&shca->ib_device, "Unknown identifier: %x on %s.", identifier, shca->ib_device.name); -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Jul 20 07:03:09 2007 From: hnguyen at linux.vnet.ibm.com 
(Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 16:03:09 +0200 Subject: [ofa-general] [PATCH 4/5] ehca: Make internal_create/destroy_qp() static Message-ID: <200707201603.10321.hnguyen@linux.vnet.ibm.com> From: Joachim Fenkes Date: Thu, 19 Jul 2007 21:40:00 +0200 Subject: [PATCH 4/5] IB/ehca: Make internal_{create,destroy}_qp() static They're only used in ehca_qp.c Signed-off-by: Joachim Fenkes --- drivers/infiniband/hw/ehca/ehca_qp.c | 17 +++++++++-------- 1 files changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 48e9cea..b916d9c 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -363,10 +363,11 @@ init_qp_queue1: * the value of the is_srq parameter. If init_attr and srq_init_attr share * fields, the field out of init_attr is used. */ -struct ehca_qp *internal_create_qp(struct ib_pd *pd, - struct ib_qp_init_attr *init_attr, - struct ib_srq_init_attr *srq_init_attr, - struct ib_udata *udata, int is_srq) +static struct ehca_qp *internal_create_qp( + struct ib_pd *pd, + struct ib_qp_init_attr *init_attr, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata, int is_srq) { struct ehca_qp *my_qp; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); @@ -752,8 +753,8 @@ struct ib_qp *ehca_create_qp(struct ib_pd *pd, return IS_ERR(ret) ? 
(struct ib_qp *)ret : &ret->ib_qp; } -int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, - struct ib_uobject *uobject); +static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, + struct ib_uobject *uobject); struct ib_srq *ehca_create_srq(struct ib_pd *pd, struct ib_srq_init_attr *srq_init_attr, @@ -1669,8 +1670,8 @@ query_srq_exit1: return ret; } -int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, - struct ib_uobject *uobject) +static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, + struct ib_uobject *uobject) { struct ehca_shca *shca = container_of(dev, struct ehca_shca, ib_device); struct ehca_pd *my_pd = container_of(my_qp->ib_qp.pd, struct ehca_pd, -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Jul 20 07:04:17 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 20 Jul 2007 16:04:17 +0200 Subject: [ofa-general] [PATCH 5/5] ehca: Support small QP queues Message-ID: <200707201604.17991.hnguyen@linux.vnet.ibm.com> From: Stefan Roscher Date: Fri, 20 Jul 2007 13:59:14 +0200 Subject: [PATCH 5/5] IB/ehca: Small QP queues eHCA2 supports QP queues that can be as small as 512 bytes. This greatly reduces memory overhead for consumers that use lots of QPs with small queues (e.g. RDMA-only QPs). Apart from dealing with firmware, this code needs to manage bite-sized chunks of kernel pages, making sure that no kernel page is shared between different protection domains. 
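To illustrate the "bite-sized chunks" idea described above, here is a minimal userspace sketch, not the driver code itself. It assumes a 4096-byte page carved into 512-byte chunks and tracked with a bitmap, with one such page belonging to a single protection domain. The names small_page, small_page_init, small_page_alloc and small_page_free are hypothetical and exist only for this illustration. The real patch keeps per-PD free/full lists and takes a mutex, both of which are left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define KPAGE_SIZE 4096u                    /* assumed kernel page size */
#define CHUNK_SIZE 512u                     /* smallest queue size named above */
#define CHUNKS     (KPAGE_SIZE / CHUNK_SIZE) /* 8 chunks per page */

/* Hypothetical stand-in for one kernel page owned by a single PD. */
struct small_page {
	unsigned char *mem; /* backing storage, KPAGE_SIZE bytes */
	uint8_t bitmap;     /* bit i set => chunk i is in use */
	int fill;           /* number of chunks currently in use */
};

/* Allocate the zeroed backing page; returns 0 on success, -1 on failure. */
static int small_page_init(struct small_page *p)
{
	p->mem = calloc(1, KPAGE_SIZE);
	p->bitmap = 0;
	p->fill = 0;
	return p->mem ? 0 : -1;
}

/* Hand out the first free chunk, or NULL when the page is full. */
static void *small_page_alloc(struct small_page *p)
{
	for (unsigned i = 0; i < CHUNKS; i++) {
		if (!(p->bitmap & (1u << i))) {
			p->bitmap |= (uint8_t)(1u << i);
			p->fill++;
			return p->mem + i * CHUNK_SIZE;
		}
	}
	return NULL;
}

/*
 * Give a chunk back. Returns 1 when the page has become empty, which is
 * the point at which a caller could release the whole page.
 */
static int small_page_free(struct small_page *p, void *chunk)
{
	size_t off = (size_t)((unsigned char *)chunk - p->mem);
	unsigned i = (unsigned)(off / CHUNK_SIZE);

	assert(off % CHUNK_SIZE == 0 && i < CHUNKS);
	assert(p->bitmap & (1u << i));

	p->bitmap &= (uint8_t)~(1u << i);
	p->fill--;
	return p->fill == 0;
}
```

Because each small_page is owned by exactly one PD in this sketch, two PDs never receive chunks from the same page, which is the isolation property the description asks for.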
Signed-off-by: Hoang-Nam Nguyen --- drivers/infiniband/hw/ehca/ehca_classes.h | 41 ++++-- drivers/infiniband/hw/ehca/ehca_cq.c | 8 +- drivers/infiniband/hw/ehca/ehca_eq.c | 8 +- drivers/infiniband/hw/ehca/ehca_main.c | 14 ++- drivers/infiniband/hw/ehca/ehca_pd.c | 25 +++- drivers/infiniband/hw/ehca/ehca_qp.c | 163 +++++++++++++--------- drivers/infiniband/hw/ehca/ehca_uverbs.c | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 30 +++-- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 222 ++++++++++++++++++++++------- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 26 +++- 10 files changed, 379 insertions(+), 160 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 63b8b9f..3725aa8 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -43,7 +43,6 @@ #ifndef __EHCA_CLASSES_H__ #define __EHCA_CLASSES_H__ - struct ehca_module; struct ehca_qp; struct ehca_cq; @@ -129,6 +128,10 @@ struct ehca_pd { struct ib_pd ib_pd; struct ipz_pd fw_pd; u32 ownpid; + /* small queue mgmt */ + struct mutex lock; + struct list_head free[2]; + struct list_head full[2]; }; enum ehca_ext_qp_type { @@ -307,6 +310,8 @@ int ehca_init_av_cache(void); void ehca_cleanup_av_cache(void); int ehca_init_mrmw_cache(void); void ehca_cleanup_mrmw_cache(void); +int ehca_init_small_qp_cache(void); +void ehca_cleanup_small_qp_cache(void); extern rwlock_t ehca_qp_idr_lock; extern rwlock_t ehca_cq_idr_lock; @@ -324,7 +329,7 @@ struct ipzu_queue_resp { u32 queue_length; /* queue length allocated in bytes */ u32 pagesize; u32 toggle_state; - u32 dummy; /* padding for 8 byte alignment */ + u32 offset; /* save offset within a page for small_qp */ }; struct ehca_create_cq_resp { @@ -366,15 +371,29 @@ enum ehca_ll_comp_flags { LLQP_COMP_MASK = 0x60, }; +struct ehca_alloc_queue_parms { + /* input parameters */ + int max_wr; + int max_sge; + int page_size; + int is_small; + + /* output parameters */ + u16 act_nr_wqes; + u8 
act_nr_sges; + u32 queue_size; /* bytes for small queues, pages otherwise */ +}; + struct ehca_alloc_qp_parms { -/* input parameters */ + struct ehca_alloc_queue_parms squeue; + struct ehca_alloc_queue_parms rqueue; + + /* input parameters */ enum ehca_service_type servicetype; + int qp_storage; int sigtype; enum ehca_ext_qp_type ext_type; enum ehca_ll_comp_flags ll_comp_flags; - - int max_send_wr, max_recv_wr; - int max_send_sge, max_recv_sge; int ud_av_l_key_ctl; u32 token; @@ -384,18 +403,10 @@ struct ehca_alloc_qp_parms { u32 srq_qpn, srq_token, srq_limit; -/* output parameters */ + /* output parameters */ u32 real_qp_num; struct ipz_qp_handle qp_handle; struct h_galpas galpas; - - u16 act_nr_send_wqes; - u16 act_nr_recv_wqes; - u8 act_nr_recv_sges; - u8 act_nr_send_sges; - - u32 nr_rq_pages; - u32 nr_sq_pages; }; int ehca_cq_assign_qp(struct ehca_cq *cq, struct ehca_qp *qp); diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 9e87883..5746787 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -190,8 +190,8 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, goto create_cq_exit2; } - ipz_rc = ipz_queue_ctor(&my_cq->ipz_queue, param.act_pages, - EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0); + ipz_rc = ipz_queue_ctor(NULL, &my_cq->ipz_queue, param.act_pages, + EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0, 0); if (!ipz_rc) { ehca_err(device, "ipz_queue_ctor() failed ipz_rc=%x device=%p", ipz_rc, device); @@ -285,7 +285,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, return cq; create_cq_exit4: - ipz_queue_dtor(&my_cq->ipz_queue); + ipz_queue_dtor(NULL, &my_cq->ipz_queue); create_cq_exit3: h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); @@ -359,7 +359,7 @@ int ehca_destroy_cq(struct ib_cq *cq) "ehca_cq=%p cq_num=%x", h_ret, my_cq, cq_num); return ehca2ib_return_code(h_ret); } - 
ipz_queue_dtor(&my_cq->ipz_queue); + ipz_queue_dtor(NULL, &my_cq->ipz_queue); kmem_cache_free(cq_cache, my_cq); return 0; diff --git a/drivers/infiniband/hw/ehca/ehca_eq.c b/drivers/infiniband/hw/ehca/ehca_eq.c index 4825975..1d41faa 100644 --- a/drivers/infiniband/hw/ehca/ehca_eq.c +++ b/drivers/infiniband/hw/ehca/ehca_eq.c @@ -86,8 +86,8 @@ int ehca_create_eq(struct ehca_shca *shca, return -EINVAL; } - ret = ipz_queue_ctor(&eq->ipz_queue, nr_pages, - EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0); + ret = ipz_queue_ctor(NULL, &eq->ipz_queue, nr_pages, + EHCA_PAGESIZE, sizeof(struct ehca_eqe), 0, 0); if (!ret) { ehca_err(ib_dev, "Can't allocate EQ pages eq=%p", eq); goto create_eq_exit1; @@ -145,7 +145,7 @@ int ehca_create_eq(struct ehca_shca *shca, return 0; create_eq_exit2: - ipz_queue_dtor(&eq->ipz_queue); + ipz_queue_dtor(NULL, &eq->ipz_queue); create_eq_exit1: hipz_h_destroy_eq(shca->ipz_hca_handle, eq); @@ -181,7 +181,7 @@ int ehca_destroy_eq(struct ehca_shca *shca, struct ehca_eq *eq) ehca_err(&shca->ib_device, "Can't free EQ resources."); return -EINVAL; } - ipz_queue_dtor(&eq->ipz_queue); + ipz_queue_dtor(NULL, &eq->ipz_queue); return 0; } diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 3bd7afb..e09a2ae 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -181,6 +181,12 @@ static int ehca_create_slab_caches(void) goto create_slab_caches5; } + ret = ehca_init_small_qp_cache(); + if (ret) { + ehca_gen_err("Cannot create small queue SLAB cache."); + goto create_slab_caches6; + } + #ifdef CONFIG_PPC_64K_PAGES ctblk_cache = kmem_cache_create("ehca_cache_ctblk", EHCA_PAGESIZE, H_CB_ALIGNMENT, @@ -188,12 +194,15 @@ static int ehca_create_slab_caches(void) NULL, NULL); if (!ctblk_cache) { ehca_gen_err("Cannot create ctblk SLAB cache."); - ehca_cleanup_mrmw_cache(); - goto create_slab_caches5; + ehca_cleanup_small_qp_cache(); + goto create_slab_caches6; } #endif return 0; 
+create_slab_caches6: + ehca_cleanup_mrmw_cache(); + create_slab_caches5: ehca_cleanup_av_cache(); @@ -211,6 +220,7 @@ create_slab_caches2: static void ehca_destroy_slab_caches(void) { + ehca_cleanup_small_qp_cache(); ehca_cleanup_mrmw_cache(); ehca_cleanup_av_cache(); ehca_cleanup_qp_cache(); diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c index 79d0591..79d5bc8 100644 --- a/drivers/infiniband/hw/ehca/ehca_pd.c +++ b/drivers/infiniband/hw/ehca/ehca_pd.c @@ -49,6 +49,7 @@ struct ib_pd *ehca_alloc_pd(struct ib_device *device, struct ib_ucontext *context, struct ib_udata *udata) { struct ehca_pd *pd; + int i; pd = kmem_cache_zalloc(pd_cache, GFP_KERNEL); if (!pd) { @@ -58,6 +59,11 @@ struct ib_pd *ehca_alloc_pd(struct ib_device *device, } pd->ownpid = current->tgid; + for (i = 0; i < 2; i++) { + INIT_LIST_HEAD(&pd->free[i]); + INIT_LIST_HEAD(&pd->full[i]); + } + mutex_init(&pd->lock); /* * Kernel PD: when device = -1, 0 @@ -81,6 +87,9 @@ int ehca_dealloc_pd(struct ib_pd *pd) { u32 cur_pid = current->tgid; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); + int i, leftovers = 0; + extern struct kmem_cache *small_qp_cache; + struct ipz_small_queue_page *page, *tmp; if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && my_pd->ownpid != cur_pid) { @@ -89,8 +98,20 @@ int ehca_dealloc_pd(struct ib_pd *pd) return -EINVAL; } - kmem_cache_free(pd_cache, - container_of(pd, struct ehca_pd, ib_pd)); + for (i = 0; i < 2; i++) { + list_splice(&my_pd->full[i], &my_pd->free[i]); + list_for_each_entry_safe(page, tmp, &my_pd->free[i], list) { + leftovers = 1; + free_page(page->page); + kmem_cache_free(small_qp_cache, page); + } + } + + if (leftovers) + ehca_warn(pd->device, + "Some small queue pages were not freed"); + + kmem_cache_free(pd_cache, my_pd); return 0; } diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index b916d9c..6c6f9d9 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -275,34 +275,39 @@ static inline void queue2resp(struct ipzu_queue_resp *resp, resp->toggle_state = queue->toggle_state; } -static inline int ll_qp_msg_size(int nr_sge) -{ - return 128 << nr_sge; -} - /* * init_qp_queue initializes/constructs r/squeue and registers queue pages. */ static inline int init_qp_queue(struct ehca_shca *shca, + struct ehca_pd *pd, struct ehca_qp *my_qp, struct ipz_queue *queue, int q_type, u64 expected_hret, - int nr_q_pages, - int wqe_size, - int nr_sges) + struct ehca_alloc_queue_parms *parms, + int wqe_size) { - int ret, cnt, ipz_rc; + int ret, cnt, ipz_rc, nr_q_pages; void *vpage; u64 rpage, h_ret; struct ib_device *ib_dev = &shca->ib_device; struct ipz_adapter_handle ipz_hca_handle = shca->ipz_hca_handle; - if (!nr_q_pages) + if (!parms->queue_size) return 0; - ipz_rc = ipz_queue_ctor(queue, nr_q_pages, EHCA_PAGESIZE, - wqe_size, nr_sges); + if (parms->is_small) { + nr_q_pages = 1; + ipz_rc = ipz_queue_ctor(pd, queue, nr_q_pages, + 128 << parms->page_size, + wqe_size, parms->act_nr_sges, 1); + } else { + nr_q_pages = parms->queue_size; + ipz_rc = ipz_queue_ctor(pd, queue, nr_q_pages, + EHCA_PAGESIZE, wqe_size, + parms->act_nr_sges, 0); + } + if (!ipz_rc) { ehca_err(ib_dev, "Cannot allocate page for queue. ipz_rc=%x", ipz_rc); @@ -323,7 +328,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, h_ret = hipz_h_register_rpage_qp(ipz_hca_handle, my_qp->ipz_qp_handle, NULL, 0, q_type, - rpage, 1, + rpage, parms->is_small ? 0 : 1, my_qp->galpas.kernel); if (cnt == (nr_q_pages - 1)) { /* last page! 
*/ if (h_ret != expected_hret) { @@ -354,10 +359,45 @@ static inline int init_qp_queue(struct ehca_shca *shca, return 0; init_qp_queue1: - ipz_queue_dtor(queue); + ipz_queue_dtor(pd, queue); return ret; } +static inline int ehca_calc_wqe_size(int act_nr_sge, int is_llqp) +{ + if (is_llqp) + return 128 << act_nr_sge; + else + return offsetof(struct ehca_wqe, + u.nud.sg_list[act_nr_sge]); +} + +static void ehca_determine_small_queue(struct ehca_alloc_queue_parms *queue, + int req_nr_sge, int is_llqp) +{ + u32 wqe_size, q_size; + int act_nr_sge = req_nr_sge; + + if (!is_llqp) + /* round up #SGEs so WQE size is a power of 2 */ + for (act_nr_sge = 4; act_nr_sge <= 252; + act_nr_sge = 4 + 2 * act_nr_sge) + if (act_nr_sge >= req_nr_sge) + break; + + wqe_size = ehca_calc_wqe_size(act_nr_sge, is_llqp); + q_size = wqe_size * (queue->max_wr + 1); + + if (q_size <= 512) + queue->page_size = 2; + else if (q_size <= 1024) + queue->page_size = 3; + else + queue->page_size = 0; + + queue->is_small = (queue->page_size != 0); +} + /* * Create an ib_qp struct that is either a QP or an SRQ, depending on * the value of the is_srq parameter. 
If init_attr and srq_init_attr share @@ -553,10 +593,20 @@ static struct ehca_qp *internal_create_qp( if (my_qp->recv_cq) parms.recv_cq_handle = my_qp->recv_cq->ipz_cq_handle; - parms.max_send_wr = init_attr->cap.max_send_wr; - parms.max_recv_wr = init_attr->cap.max_recv_wr; - parms.max_send_sge = max_send_sge; - parms.max_recv_sge = max_recv_sge; + parms.squeue.max_wr = init_attr->cap.max_send_wr; + parms.rqueue.max_wr = init_attr->cap.max_recv_wr; + parms.squeue.max_sge = max_send_sge; + parms.rqueue.max_sge = max_recv_sge; + + if (EHCA_BMASK_GET(HCA_CAP_MINI_QP, shca->hca_cap) + && !(context && udata)) { /* no small QP support in userspace ATM */ + ehca_determine_small_queue( + &parms.squeue, max_send_sge, is_llqp); + ehca_determine_small_queue( + &parms.rqueue, max_recv_sge, is_llqp); + parms.qp_storage = + (parms.squeue.is_small || parms.rqueue.is_small); + } h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms); if (h_ret != H_SUCCESS) { @@ -570,50 +620,33 @@ static struct ehca_qp *internal_create_qp( my_qp->ipz_qp_handle = parms.qp_handle; my_qp->galpas = parms.galpas; + swqe_size = ehca_calc_wqe_size(parms.squeue.act_nr_sges, is_llqp); + rwqe_size = ehca_calc_wqe_size(parms.rqueue.act_nr_sges, is_llqp); + switch (qp_type) { case IB_QPT_RC: - if (!is_llqp) { - swqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[ - (parms.act_nr_send_sges)]); - rwqe_size = offsetof(struct ehca_wqe, u.nud.sg_list[ - (parms.act_nr_recv_sges)]); - } else { /* for LLQP we need to use msg size, not wqe size */ - swqe_size = ll_qp_msg_size(max_send_sge); - rwqe_size = ll_qp_msg_size(max_recv_sge); - parms.act_nr_send_sges = 1; - parms.act_nr_recv_sges = 1; - } - break; - case IB_QPT_UC: - swqe_size = offsetof(struct ehca_wqe, - u.nud.sg_list[parms.act_nr_send_sges]); - rwqe_size = offsetof(struct ehca_wqe, - u.nud.sg_list[parms.act_nr_recv_sges]); + if (is_llqp) { + parms.squeue.act_nr_sges = 1; + parms.rqueue.act_nr_sges = 1; + } break; - case IB_QPT_UD: case IB_QPT_GSI: 
case IB_QPT_SMI: + /* UD circumvention */ if (is_llqp) { - swqe_size = ll_qp_msg_size(parms.act_nr_send_sges); - rwqe_size = ll_qp_msg_size(parms.act_nr_recv_sges); - parms.act_nr_send_sges = 1; - parms.act_nr_recv_sges = 1; + parms.squeue.act_nr_sges = 1; + parms.rqueue.act_nr_sges = 1; } else { - /* UD circumvention */ - parms.act_nr_send_sges -= 2; - parms.act_nr_recv_sges -= 2; - swqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[ - parms.act_nr_send_sges]); - rwqe_size = offsetof(struct ehca_wqe, u.ud_av.sg_list[ - parms.act_nr_recv_sges]); + parms.squeue.act_nr_sges -= 2; + parms.rqueue.act_nr_sges -= 2; } if (IB_QPT_GSI == qp_type || IB_QPT_SMI == qp_type) { - parms.act_nr_send_wqes = init_attr->cap.max_send_wr; - parms.act_nr_recv_wqes = init_attr->cap.max_recv_wr; - parms.act_nr_send_sges = init_attr->cap.max_send_sge; - parms.act_nr_recv_sges = init_attr->cap.max_recv_sge; + parms.squeue.act_nr_wqes = init_attr->cap.max_send_wr; + parms.rqueue.act_nr_wqes = init_attr->cap.max_recv_wr; + parms.squeue.act_nr_sges = init_attr->cap.max_send_sge; + parms.rqueue.act_nr_sges = init_attr->cap.max_recv_sge; ib_qp_num = (qp_type == IB_QPT_SMI) ? 0 : 1; } @@ -626,10 +659,9 @@ static struct ehca_qp *internal_create_qp( /* initialize r/squeue and register queue pages */ if (HAS_SQ(my_qp)) { ret = init_qp_queue( - shca, my_qp, &my_qp->ipz_squeue, 0, + shca, my_pd, my_qp, &my_qp->ipz_squeue, 0, HAS_RQ(my_qp) ? 
H_PAGE_REGISTERED : H_SUCCESS, - parms.nr_sq_pages, swqe_size, - parms.act_nr_send_sges); + &parms.squeue, swqe_size); if (ret) { ehca_err(pd->device, "Couldn't initialize squeue " "and pages ret=%x", ret); @@ -639,9 +671,8 @@ static struct ehca_qp *internal_create_qp( if (HAS_RQ(my_qp)) { ret = init_qp_queue( - shca, my_qp, &my_qp->ipz_rqueue, 1, - H_SUCCESS, parms.nr_rq_pages, rwqe_size, - parms.act_nr_recv_sges); + shca, my_pd, my_qp, &my_qp->ipz_rqueue, 1, + H_SUCCESS, &parms.rqueue, rwqe_size); if (ret) { ehca_err(pd->device, "Couldn't initialize rqueue " "and pages ret=%x", ret); @@ -671,10 +702,10 @@ static struct ehca_qp *internal_create_qp( } init_attr->cap.max_inline_data = 0; /* not supported yet */ - init_attr->cap.max_recv_sge = parms.act_nr_recv_sges; - init_attr->cap.max_recv_wr = parms.act_nr_recv_wqes; - init_attr->cap.max_send_sge = parms.act_nr_send_sges; - init_attr->cap.max_send_wr = parms.act_nr_send_wqes; + init_attr->cap.max_recv_sge = parms.rqueue.act_nr_sges; + init_attr->cap.max_recv_wr = parms.rqueue.act_nr_wqes; + init_attr->cap.max_send_sge = parms.squeue.act_nr_sges; + init_attr->cap.max_send_wr = parms.squeue.act_nr_wqes; my_qp->init_attr = *init_attr; /* NOTE: define_apq0() not supported yet */ @@ -708,6 +739,8 @@ static struct ehca_qp *internal_create_qp( resp.ext_type = my_qp->ext_type; resp.qkey = my_qp->qkey; resp.real_qp_num = my_qp->real_qp_num; + resp.ipz_rqueue.offset = my_qp->ipz_rqueue.offset; + resp.ipz_squeue.offset = my_qp->ipz_squeue.offset; if (HAS_SQ(my_qp)) queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue); if (HAS_RQ(my_qp)) @@ -724,11 +757,11 @@ static struct ehca_qp *internal_create_qp( create_qp_exit4: if (HAS_RQ(my_qp)) - ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue); create_qp_exit3: if (HAS_SQ(my_qp)) - ipz_queue_dtor(&my_qp->ipz_squeue); + ipz_queue_dtor(my_pd, &my_qp->ipz_squeue); create_qp_exit2: hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); @@ -1735,9 +1768,9 @@ static 
int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, } if (HAS_RQ(my_qp)) - ipz_queue_dtor(&my_qp->ipz_rqueue); + ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue); if (HAS_SQ(my_qp)) - ipz_queue_dtor(&my_qp->ipz_squeue); + ipz_queue_dtor(my_pd, &my_qp->ipz_squeue); kmem_cache_free(qp_cache, my_qp); return 0; } diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 05c4157..4bc687f 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -149,7 +149,7 @@ static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue, ehca_gen_err("vm_insert_page() failed rc=%x", ret); return ret; } - start += PAGE_SIZE; + start += PAGE_SIZE; } vma->vm_private_data = mm_count; (*mm_count)++; diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 358796c..fdbfebe 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -52,10 +52,13 @@ #define H_ALL_RES_QP_ENHANCED_OPS EHCA_BMASK_IBM(9, 11) #define H_ALL_RES_QP_PTE_PIN EHCA_BMASK_IBM(12, 12) #define H_ALL_RES_QP_SERVICE_TYPE EHCA_BMASK_IBM(13, 15) +#define H_ALL_RES_QP_STORAGE EHCA_BMASK_IBM(16, 17) #define H_ALL_RES_QP_LL_RQ_CQE_POSTING EHCA_BMASK_IBM(18, 18) #define H_ALL_RES_QP_LL_SQ_CQE_POSTING EHCA_BMASK_IBM(19, 21) #define H_ALL_RES_QP_SIGNALING_TYPE EHCA_BMASK_IBM(22, 23) #define H_ALL_RES_QP_UD_AV_LKEY_CTRL EHCA_BMASK_IBM(31, 31) +#define H_ALL_RES_QP_SMALL_SQ_PAGE_SIZE EHCA_BMASK_IBM(32, 35) +#define H_ALL_RES_QP_SMALL_RQ_PAGE_SIZE EHCA_BMASK_IBM(36, 39) #define H_ALL_RES_QP_RESOURCE_TYPE EHCA_BMASK_IBM(56, 63) #define H_ALL_RES_QP_MAX_OUTST_SEND_WR EHCA_BMASK_IBM(0, 15) @@ -299,6 +302,11 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, | EHCA_BMASK_SET(H_ALL_RES_QP_PTE_PIN, 0) | EHCA_BMASK_SET(H_ALL_RES_QP_SERVICE_TYPE, parms->servicetype) | EHCA_BMASK_SET(H_ALL_RES_QP_SIGNALING_TYPE, parms->sigtype) + 
| EHCA_BMASK_SET(H_ALL_RES_QP_STORAGE, parms->qp_storage) + | EHCA_BMASK_SET(H_ALL_RES_QP_SMALL_SQ_PAGE_SIZE, + parms->squeue.page_size) + | EHCA_BMASK_SET(H_ALL_RES_QP_SMALL_RQ_PAGE_SIZE, + parms->rqueue.page_size) | EHCA_BMASK_SET(H_ALL_RES_QP_LL_RQ_CQE_POSTING, !!(parms->ll_comp_flags & LLQP_RECV_COMP)) | EHCA_BMASK_SET(H_ALL_RES_QP_LL_SQ_CQE_POSTING, @@ -309,13 +317,13 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, max_r10_reg = EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_SEND_WR, - parms->max_send_wr + 1) + parms->squeue.max_wr + 1) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_OUTST_RECV_WR, - parms->max_recv_wr + 1) + parms->rqueue.max_wr + 1) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_SEND_SGE, - parms->max_send_sge) + parms->squeue.max_sge) | EHCA_BMASK_SET(H_ALL_RES_QP_MAX_RECV_SGE, - parms->max_recv_sge); + parms->rqueue.max_sge); r11 = EHCA_BMASK_SET(H_ALL_RES_QP_SRQ_QP_TOKEN, parms->srq_token); @@ -335,17 +343,17 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, parms->qp_handle.handle = outs[0]; parms->real_qp_num = (u32)outs[1]; - parms->act_nr_send_wqes = + parms->squeue.act_nr_wqes = (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_SEND_WR, outs[2]); - parms->act_nr_recv_wqes = + parms->rqueue.act_nr_wqes = (u16)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_OUTST_RECV_WR, outs[2]); - parms->act_nr_send_sges = + parms->squeue.act_nr_sges = (u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_SEND_SGE, outs[3]); - parms->act_nr_recv_sges = + parms->rqueue.act_nr_sges = (u8)EHCA_BMASK_GET(H_ALL_RES_QP_ACT_RECV_SGE, outs[3]); - parms->nr_sq_pages = + parms->squeue.queue_size = (u32)EHCA_BMASK_GET(H_ALL_RES_QP_SQUEUE_SIZE_PAGES, outs[4]); - parms->nr_rq_pages = + parms->rqueue.queue_size = (u32)EHCA_BMASK_GET(H_ALL_RES_QP_RQUEUE_SIZE_PAGES, outs[4]); if (ret == H_SUCCESS) @@ -497,7 +505,7 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, const u64 count, const struct h_galpa galpa) { - if (count != 1) { + if (count > 1) { 
ehca_gen_err("Page counter=%lx", count); return H_PARAMETER; } diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c index 9606f13..6506501 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c @@ -40,6 +40,11 @@ #include "ehca_tools.h" #include "ipz_pt_fn.h" +#include "ehca_classes.h" + +#define PAGES_PER_KPAGE (PAGE_SIZE >> EHCA_PAGESHIFT) + +struct kmem_cache *small_qp_cache; void *ipz_qpageit_get_inc(struct ipz_queue *queue) { @@ -49,7 +54,7 @@ void *ipz_qpageit_get_inc(struct ipz_queue *queue) queue->current_q_offset -= queue->pagesize; ret = NULL; } - if (((u64)ret) % EHCA_PAGESIZE) { + if (((u64)ret) % queue->pagesize) { ehca_gen_err("ERROR!! not at PAGE-Boundary"); return NULL; } @@ -83,80 +88,195 @@ int ipz_queue_abs_to_offset(struct ipz_queue *queue, u64 addr, u64 *q_offset) return -EINVAL; } -int ipz_queue_ctor(struct ipz_queue *queue, - const u32 nr_of_pages, - const u32 pagesize, const u32 qe_size, const u32 nr_of_sg) +#if PAGE_SHIFT < EHCA_PAGESHIFT +#error Kernel pages must be at least as large than eHCA pages (4K) ! +#endif + +/* + * allocate pages for queue: + * outer loop allocates whole kernel pages (page aligned) and + * inner loop divides a kernel page into smaller hca queue pages + */ +static int alloc_queue_pages(struct ipz_queue *queue, const u32 nr_of_pages) { - int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT; - int f; + int k, f = 0; + u8 *kpage; - if (pagesize > PAGE_SIZE) { - ehca_gen_err("FATAL ERROR: pagesize=%x is greater " - "than kernel page size", pagesize); - return 0; - } - if (!pages_per_kpage) { - ehca_gen_err("FATAL ERROR: invalid kernel page size. " - "pages_per_kpage=%x", pages_per_kpage); - return 0; - } - queue->queue_length = nr_of_pages * pagesize; - queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *)); - if (!queue->queue_pages) { - ehca_gen_err("ERROR!! 
didn't get the memory"); - return 0; - } - memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *)); - /* - * allocate pages for queue: - * outer loop allocates whole kernel pages (page aligned) and - * inner loop divides a kernel page into smaller hca queue pages - */ - f = 0; while (f < nr_of_pages) { - u8 *kpage = (u8 *)get_zeroed_page(GFP_KERNEL); - int k; + kpage = (u8 *)get_zeroed_page(GFP_KERNEL); if (!kpage) - goto ipz_queue_ctor_exit0; /*NOMEM*/ - for (k = 0; k < pages_per_kpage && f < nr_of_pages; k++) { - (queue->queue_pages)[f] = (struct ipz_page *)kpage; + goto out; + + for (k = 0; k < PAGES_PER_KPAGE && f < nr_of_pages; k++) { + queue->queue_pages[f] = (struct ipz_page *)kpage; kpage += EHCA_PAGESIZE; f++; } } + return 1; - queue->current_q_offset = 0; +out: + for (f = 0; f < nr_of_pages && queue->queue_pages[f]; + f += PAGES_PER_KPAGE) + free_page((unsigned long)(queue->queue_pages)[f]); + return 0; +} + +static int alloc_small_queue_page(struct ipz_queue *queue, struct ehca_pd *pd) +{ + int order = ilog2(queue->pagesize) - 9; + struct ipz_small_queue_page *page; + unsigned long bit; + + mutex_lock(&pd->lock); + + if (!list_empty(&pd->free[order])) + page = list_entry(pd->free[order].next, + struct ipz_small_queue_page, list); + else { + page = kmem_cache_zalloc(small_qp_cache, GFP_KERNEL); + if (!page) + goto out; + + page->page = get_zeroed_page(GFP_KERNEL); + if (!page->page) { + kmem_cache_free(small_qp_cache, page); + goto out; + } + + list_add(&page->list, &pd->free[order]); + } + + bit = find_first_zero_bit(page->bitmap, IPZ_SPAGE_PER_KPAGE >> order); + __set_bit(bit, page->bitmap); + page->fill++; + + if (page->fill == IPZ_SPAGE_PER_KPAGE >> order) + list_move(&page->list, &pd->full[order]); + + mutex_unlock(&pd->lock); + + queue->queue_pages[0] = (void *)(page->page | (bit << (order + 9))); + queue->small_page = page; + return 1; + +out: + ehca_err(pd->ib_pd.device, "failed to allocate small queue page"); + return 0; +} + +static void 
free_small_queue_page(struct ipz_queue *queue, struct ehca_pd *pd) +{ + int order = ilog2(queue->pagesize) - 9; + struct ipz_small_queue_page *page = queue->small_page; + unsigned long bit; + int free_page = 0; + + bit = ((unsigned long)queue->queue_pages[0] & PAGE_MASK) + >> (order + 9); + + mutex_lock(&pd->lock); + + __clear_bit(bit, page->bitmap); + page->fill--; + + if (page->fill == 0) { + list_del(&page->list); + free_page = 1; + } + + if (page->fill == (IPZ_SPAGE_PER_KPAGE >> order) - 1) + /* the page was full until we freed the chunk */ + list_move_tail(&page->list, &pd->free[order]); + + mutex_unlock(&pd->lock); + + if (free_page) { + free_page(page->page); + kmem_cache_free(small_qp_cache, page); + } +} + +int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue, + const u32 nr_of_pages, const u32 pagesize, + const u32 qe_size, const u32 nr_of_sg, + int is_small) +{ + if (pagesize > PAGE_SIZE) { + ehca_gen_err("FATAL ERROR: pagesize=%x " + "is greater than kernel page size", pagesize); + return 0; + } + + /* init queue fields */ + queue->queue_length = nr_of_pages * pagesize; + queue->pagesize = pagesize; queue->qe_size = qe_size; queue->act_nr_of_sg = nr_of_sg; - queue->pagesize = pagesize; + queue->current_q_offset = 0; queue->toggle_state = 1; - return 1; + queue->small_page = NULL; - ipz_queue_ctor_exit0: - ehca_gen_err("Couldn't get alloc pages queue=%p f=%x nr_of_pages=%x", - queue, f, nr_of_pages); - for (f = 0; f < nr_of_pages; f += pages_per_kpage) { - if (!(queue->queue_pages)[f]) - break; - free_page((unsigned long)(queue->queue_pages)[f]); + /* allocate queue page pointers */ + queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *)); + if (!queue->queue_pages) { + ehca_gen_err("Couldn't allocate queue page list"); + return 0; } + memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *)); + + /* allocate actual queue pages */ + if (is_small) { + if (!alloc_small_queue_page(queue, pd)) + goto ipz_queue_ctor_exit0; + } else + if 
(!alloc_queue_pages(queue, nr_of_pages)) + goto ipz_queue_ctor_exit0; + + return 1; + +ipz_queue_ctor_exit0: + ehca_gen_err("Couldn't alloc pages queue=%p " + "nr_of_pages=%x", queue, nr_of_pages); + vfree(queue->queue_pages); + return 0; } -int ipz_queue_dtor(struct ipz_queue *queue) +int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue) { - int pages_per_kpage = PAGE_SIZE >> EHCA_PAGESHIFT; - int g; - int nr_pages; + int i, nr_pages; if (!queue || !queue->queue_pages) { ehca_gen_dbg("queue or queue_pages is NULL"); return 0; } - nr_pages = queue->queue_length / queue->pagesize; - for (g = 0; g < nr_pages; g += pages_per_kpage) - free_page((unsigned long)(queue->queue_pages)[g]); + + if (queue->small_page) + free_small_queue_page(queue, pd); + else { + nr_pages = queue->queue_length / queue->pagesize; + for (i = 0; i < nr_pages; i += PAGES_PER_KPAGE) + free_page((unsigned long)queue->queue_pages[i]); + } + vfree(queue->queue_pages); return 1; } + +int ehca_init_small_qp_cache(void) +{ + small_qp_cache = kmem_cache_create("ehca_cache_small_qp", + sizeof(struct ipz_small_queue_page), + 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + if (!small_qp_cache) + return -ENOMEM; + + return 0; +} + +void ehca_cleanup_small_qp_cache(void) +{ + kmem_cache_destroy(small_qp_cache); +} diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index 39a4f64..c6937a0 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -51,11 +51,25 @@ #include "ehca_tools.h" #include "ehca_qes.h" +struct ehca_pd; +struct ipz_small_queue_page; + /* struct generic ehca page */ struct ipz_page { u8 entries[EHCA_PAGESIZE]; }; +#define IPZ_SPAGE_PER_KPAGE (PAGE_SIZE / 512) + +struct ipz_small_queue_page { + unsigned long page; + unsigned long bitmap[IPZ_SPAGE_PER_KPAGE / BITS_PER_LONG]; + int fill; + void *mapped_addr; + u32 mmap_count; + struct list_head list; +}; + /* struct generic queue in linux kernel virtual memory 
(kv) */ struct ipz_queue { u64 current_q_offset; /* current queue entry */ @@ -66,7 +80,8 @@ struct ipz_queue { u32 queue_length; /* queue length allocated in bytes */ u32 pagesize; u32 toggle_state; /* toggle flag - per page */ - u32 dummy3; /* 64 bit alignment */ + u32 offset; /* save offset within page for small_qp */ + struct ipz_small_queue_page *small_page; }; /* @@ -188,9 +203,10 @@ struct ipz_qpt { * see ipz_qpt_ctor() * returns true if ok, false if out of memory */ -int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages, - const u32 pagesize, const u32 qe_size, - const u32 nr_of_sg); +int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue, + const u32 nr_of_pages, const u32 pagesize, + const u32 qe_size, const u32 nr_of_sg, + int is_small); /* * destructor for a ipz_queue_t @@ -198,7 +214,7 @@ int ipz_queue_ctor(struct ipz_queue *queue, const u32 nr_of_pages, * see ipz_queue_ctor() * returns true if ok, false if queue was NULL-ptr of free failed */ -int ipz_queue_dtor(struct ipz_queue *queue); +int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue); /* * constructor for a ipz_qpt_t, -- 1.5.2 From mst at dev.mellanox.co.il Fri Jul 20 07:17:59 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 20 Jul 2007 17:17:59 +0300 Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch In-Reply-To: <469FB3AB.6080304@linux.vnet.ibm.com> References: <469FB3AB.6080304@linux.vnet.ibm.com> Message-ID: <20070720141759.GF31246@mellanox.co.il> > @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > attr.recv_cq = priv->cq; > attr.srq = priv->cm.srq; > attr.cap.max_send_wr = ipoib_sendq_size; > + attr.cap.max_recv_wr = 1; > attr.cap.max_send_sge = 1; > + attr.cap.max_recv_sge = 1; > attr.sq_sig_type = IB_SIGNAL_ALL_WR; > attr.qp_type = IB_QPT_RC; > attr.send_cq = cq; You never post a receive WR on this QP, do you? So 1. What's magic about 1 as max recv wr? Why not 0? 2. 
If the remote sends a packet on this QP, it'll get closed, won't it? Looks like a spec violation. -- MST From davem at systemfabricworks.com Fri Jul 20 08:09:50 2007 From: davem at systemfabricworks.com (davem at systemfabricworks.com) Date: Fri, 20 Jul 2007 10:09:50 -0500 Subject: [ofa-general] [PATCH] infiniband-diags/scripts: Handle new and old topology file format Message-ID: <46A0D03E.mail35T1S2JP6@systemfabricworks.com> Fix infiniband-diags scripts to handle changed ibnetdiscover topology file format and remain backward compatible with old file format. Signed-off-by: David A. McMillen --- infiniband-diags/scripts/ibcheckerrors.in | 4 +++- infiniband-diags/scripts/ibchecknet.in | 4 +++- infiniband-diags/scripts/ibcheckstate.in | 4 +++- infiniband-diags/scripts/ibcheckwidth.in | 4 +++- infiniband-diags/scripts/ibclearcounters.in | 4 +++- infiniband-diags/scripts/ibclearerrors.in | 4 +++- infiniband-diags/scripts/ibdatacounters.in | 4 +++- 7 files changed, 21 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in index 8a6c012..e08eba3 100644 --- a/infiniband-diags/scripts/ibcheckerrors.in +++ b/infiniband-diags/scripts/ibcheckerrors.in @@ -91,13 +91,15 @@ function check_node(lid) nports++ port = $1 if (!nodechecked) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) check_node(lid) } if (badnode) { print "\n# " ntype ": nodeguid 0x" nodeguid " failed" next } + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (nodeerr) if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " " port)) { diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in index 3154d9e..9f36742 100644 --- a/infiniband-diags/scripts/ibchecknet.in +++ b/infiniband-diags/scripts/ibchecknet.in @@ -84,13 +84,15 @@ function check_node(lid) nports++ port = $1 if (!nodechecked) { - lid = $5 + lid = substr($0, index($0, "
lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) check_node(lid) } if (badnode) { print "\n# " ntype ": nodeguid 0x" nodeguid " failed" next } + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (system("'$IBPATH'/ibcheckport '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in index 9268670..30b5513 100644 --- a/infiniband-diags/scripts/ibcheckstate.in +++ b/infiniband-diags/scripts/ibcheckstate.in @@ -83,13 +83,15 @@ function check_node(lid) nports++ port = $1 if (!nodechecked) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) check_node(lid) } if (badnode) { print "\n# " ntype ": nodeguid 0x" nodeguid " failed" next } + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (system("'$IBPATH'/ibcheckportstate '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in index 7a8e7e0..072d433 100644 --- a/infiniband-diags/scripts/ibcheckwidth.in +++ b/infiniband-diags/scripts/ibcheckwidth.in @@ -83,13 +83,15 @@ function check_node(lid) nports++ port = $1 if (!nodechecked) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) check_node(lid) } if (badnode) { print "\n# " ntype ": nodeguid 0x" nodeguid " failed" next } + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (system("'$IBPATH'/ibcheckportwidth '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in index fa6ab83..54551b3 100644 --- a/infiniband-diags/scripts/ibclearcounters.in +++ b/infiniband-diags/scripts/ibclearcounters.in @@ -73,9 +73,11 @@ function clear_port_counters(lid, port) /^\[/ { port = $1 + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if 
(!nodecleared) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) clear_port_counters(lid, port) } } diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in index bce8f83..4a086ae 100644 --- a/infiniband-diags/scripts/ibclearerrors.in +++ b/infiniband-diags/scripts/ibclearerrors.in @@ -66,9 +66,11 @@ function clear_errors(lid, port) /^\[/ { port = $1 + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (!nodecleared) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) clear_errors(lid, port) } } diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in index ce8c71a..d27149e 100644 --- a/infiniband-diags/scripts/ibdatacounters.in +++ b/infiniband-diags/scripts/ibdatacounters.in @@ -91,13 +91,15 @@ function check_node(lid) nports++ port = $1 if (!nodechecked) { - lid = $5 + lid = substr($0, index($0, " lid ") + 5) + lid = substr(lid, 1, index(lid, " ") - 1) check_node(lid) } if (badnode) { print "\n# " ntype ": nodeguid 0x" nodeguid " failed" next } + sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (nodeerr) if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " " port)) { From davem at systemfabricworks.com Fri Jul 20 08:12:03 2007 From: davem at systemfabricworks.com (davem at systemfabricworks.com) Date: Fri, 20 Jul 2007 10:12:03 -0500 Subject: [ofa-general] [PATCH] infiniband-diags/ibnetdiscover: Fix DDR link speed decode Message-ID: <46A0D0C3.mail37511HD5H@systemfabricworks.com> Fix ibnetdiscover DDR link speed decode by moving string from [3] to [2]. Signed-off-by: David A. 
McMillen --- infiniband-diags/src/ibnetdiscover.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index c321d59..ccd70cb 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -78,8 +78,8 @@ static char *linkwidth_str[] = { static char *linkspeed_str[] = { "???", "SDR", - "???", "DDR", + "???", "QDR" }; From shemminger at linux-foundation.org Fri Jul 20 09:22:03 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Fri, 20 Jul 2007 17:22:03 +0100 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. In-Reply-To: <46A09ACF.20805@trash.net> References: <46A09ACF.20805@trash.net> Message-ID: <20070720172203.0eaeea86@oldman> On Fri, 20 Jul 2007 13:21:51 +0200 Patrick McHardy wrote: > Krishna Kumar2 wrote: > > Patrick McHardy wrote on 07/20/2007 03:37:20 PM: > > > > > > > >> rtnetlink support seems more important than sysfs to me. > >> > > > > Thanks, I will add that as a patch. The reason to add to sysfs is that > > it is easier to change for a user (and similar to tx_queue_len). > > > But since batching is so similar to TSO, it really should be part of the flags and controlled by ethtool like other offload flags.
From pradeeps at linux.vnet.ibm.com Fri Jul 20 09:31:12 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 20 Jul 2007 09:31:12 -0700 Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch In-Reply-To: <20070720141759.GF31246@mellanox.co.il> References: <469FB3AB.6080304@linux.vnet.ibm.com> <20070720141759.GF31246@mellanox.co.il> Message-ID: <46A0E350.5060207@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_ >> attr.recv_cq = priv->cq; >> attr.srq = priv->cm.srq; >> attr.cap.max_send_wr = ipoib_sendq_size; >> + attr.cap.max_recv_wr = 1; >> attr.cap.max_send_sge = 1; >> + attr.cap.max_recv_sge = 1; >> attr.sq_sig_type = IB_SIGNAL_ALL_WR; >> attr.qp_type = IB_QPT_RC; >> attr.send_cq = cq; > > You never post a receive WR on this QP, do you? > So > 1. What's magic about 1 as max recv wr? Why not 0? > 2. If the remote sends a packet on this QP, it'll get closed, > won't it? Looks like a spec violation. > > Good catch. I can probably set max_recv_sge to 0 too, right? I can do that in a separate patch later on. However, I see nothing in table 46 of the IB spec that tells me that it is a violation of the spec. Which section are you referring to? Pradeep From xma at us.ibm.com Fri Jul 20 09:39:24 2007 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 20 Jul 2007 09:39:24 -0700 Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2? Message-ID: I downloaded OFED-1.2.tgz. It doesn't include source code openib-*tgz as OFED-1.1. Where I can find the source code without installing any RPMs?? Thanks Shirley From sri at us.ibm.com Fri Jul 20 10:25:05 2007 From: sri at us.ibm.com (Sridhar Samudrala) Date: Fri, 20 Jul 2007 10:25:05 -0700 Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes.
In-Reply-To: <20070720063216.26341.80316.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> <20070720063216.26341.80316.sendpatchset@localhost.localdomain> Message-ID: <1184952305.12431.16.camel@localhost.localdomain> On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote: > Networking include file changes for batching. > > Signed-off-by: Krishna Kumar > --- > linux/netdevice.h | 10 ++++++++++ > net/pkt_sched.h | 6 +++--- > 2 files changed, 13 insertions(+), 3 deletions(-) > > diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h > --- org/include/linux/netdevice.h 2007-07-20 07:49:28.000000000 +0530 > +++ new/include/linux/netdevice.h 2007-07-20 08:30:55.000000000 +0530 > @@ -264,6 +264,8 @@ enum netdev_state_t > __LINK_STATE_QDISC_RUNNING, > }; > > +/* Minimum length of device hardware queue for batching to work */ > +#define MIN_QUEUE_LEN_BATCH 16 > > /* > * This structure holds at boot time configured netdevice settings. They > @@ -340,6 +342,7 @@ struct net_device > #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ > #define NETIF_F_GSO 2048 /* Enable software GSO. */ > #define NETIF_F_LLTX 4096 /* LockLess TX */ > +#define NETIF_F_BATCH_SKBS 8192 /* Driver supports batch skbs API */ > #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ > > /* Segmentation offload features */ > @@ -452,6 +455,8 @@ struct net_device > struct Qdisc *qdisc_sleeping; > struct list_head qdisc_list; > unsigned long tx_queue_len; /* Max frames per queue allowed */ > + unsigned long xmit_slots; /* Device free slots */ > + struct sk_buff_head *skb_blist; /* List of batch skbs */ > > /* Partially transmitted GSO packet. 
*/ > struct sk_buff *gso_skb; > @@ -472,6 +477,9 @@ struct net_device > void *priv; /* pointer to private data */ > int (*hard_start_xmit) (struct sk_buff *skb, > struct net_device *dev); > + int (*hard_start_xmit_batch) (struct net_device > + *dev); > + > /* These may be needed for future network-power-down code. */ > unsigned long trans_start; /* Time (in jiffies) of last Tx */ > > @@ -832,6 +840,8 @@ extern int dev_set_mac_address(struct n > struct sockaddr *); > extern int dev_hard_start_xmit(struct sk_buff *skb, > struct net_device *dev); > +extern int dev_add_skb_to_blist(struct sk_buff *skb, > + struct net_device *dev); > > extern void dev_init(void); > > diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h > --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 > +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.000000000 +0530 > @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge > struct rtattr *tab); > extern void qdisc_put_rtab(struct qdisc_rate_table *tab); > > -extern void __qdisc_run(struct net_device *dev); > +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); Why do we need this additional 'blist' argument? Is this different from dev->skb_blist? > > -static inline void qdisc_run(struct net_device *dev) > +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist) > { > if (!netif_queue_stopped(dev) && > !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) > - __qdisc_run(dev); > + __qdisc_run(dev, blist); > } > > extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp, From sri at us.ibm.com Fri Jul 20 10:44:19 2007 From: sri at us.ibm.com (Sridhar Samudrala) Date: Fri, 20 Jul 2007 10:44:19 -0700 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. 
In-Reply-To: <20070720063227.26341.91868.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> <20070720063227.26341.91868.sendpatchset@localhost.localdomain> Message-ID: <1184953459.12431.21.camel@localhost.localdomain> On Fri, 2007-07-20 at 12:02 +0530, Krishna Kumar wrote: > Changes in dev.c to support batching : add dev_add_skb_to_blist, > register_netdev recognizes batch aware drivers, and net_tx_action is > the sole user of batching. > > Signed-off-by: Krishna Kumar > --- > dev.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- > 1 files changed, 74 insertions(+), 3 deletions(-) > > diff -ruNp org/net/core/dev.c new/net/core/dev.c > --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530 > +++ new/net/core/dev.c 2007-07-20 08:31:35.000000000 +0530 > @@ -1566,7 +1605,7 @@ gso: > /* reset queue_mapping to zero */ > skb->queue_mapping = 0; > rc = q->enqueue(skb, q); > - qdisc_run(dev); > + qdisc_run(dev, NULL); OK. So you are passing a NULL blist here. However, i am not sure why batching is not used in this situation. Thanks Sridhar > spin_unlock(&dev->queue_lock); > > rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; > @@ -1763,7 +1802,11 @@ static void net_tx_action(struct softirq > clear_bit(__LINK_STATE_SCHED, &dev->state); > > if (spin_trylock(&dev->queue_lock)) { > - qdisc_run(dev); > + /* > + * Try to send out all skbs if batching is > + * enabled. > + */ > + qdisc_run(dev, dev->skb_blist); > spin_unlock(&dev->queue_lock); > } else { > netif_schedule(dev); > @@ -3397,6 +3440,28 @@ int register_netdevice(struct net_device > } > } > > + if (dev->features & NETIF_F_BATCH_SKBS) { > + if (!dev->hard_start_xmit_batch || > + dev->tx_queue_len < MIN_QUEUE_LEN_BATCH) { > + /* > + * Batch TX requires API support in driver plus have > + * a minimum sized queue. 
> + */ > + printk(KERN_ERR "%s: Dropping NETIF_F_BATCH_SKBS " > + "since no API support or queue len " > + "is smaller than %d.\n", > + dev->name, MIN_QUEUE_LEN_BATCH); > + dev->features &= ~NETIF_F_BATCH_SKBS; > + } else { > + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, > + GFP_KERNEL); > + if (dev->skb_blist) { > + skb_queue_head_init(dev->skb_blist); > + dev->tx_queue_len >>= 1; > + } > + } > + } > + > /* > * nil rebuild_header routine, > * that should be never called and used as just bug trap. > @@ -3732,10 +3797,16 @@ void unregister_netdevice(struct net_dev > > synchronize_net(); > > + /* Deallocate batching structure */ > + if (dev->skb_blist) { > + skb_queue_purge(dev->skb_blist); > + kfree(dev->skb_blist); > + dev->skb_blist = NULL; > + } > + > /* Shutdown queueing discipline. */ > dev_shutdown(dev); > > - > /* Notify protocols, that we are about to destroy > this device. They should clean all the things. > */ From ttelford.groups at gmail.com Fri Jul 20 10:51:49 2007 From: ttelford.groups at gmail.com (Troy Telford) Date: Fri, 20 Jul 2007 11:51:49 -0600 Subject: [ofa-general] OFED Release tarballs Message-ID: <200707201151.50178.ttelford.groups@gmail.com> I've been tracking OFED development for quite a while, pulling the occasional snapshot from the git repositories. And after looking at what is in the git repositories, and then comparing it to what is in the OFED release (in this case the 1.2 release), I've been unable to find how one gets from what's released in the various git repositories to what you get in the OFED release. The main reason is because before OFED 1.2 was 'released', I was building experimental RPMs from the git sources, and more than anything, I'd like to know how the OFED distribution went from what's in the git repositories to its final release state; simply checking out the OFED-1.2 tag doesn't seem to be sufficient to get everything I see in the official release. 
Obviously, I've already looked inside the src.rpm's; many of the files that are 'missing' in git are generated by ./autogen.sh in the various git repositories, so I'm less concerned with those differences. But there are a few things that I haven't been able to find so far (at least in the repositories named 'ofed_1_2/'). Are there other repositories that have 'stuff' that made it into the OFED 1.2 distribution that aren't from the 'ofed_1_2/*' repositories? -- Troy Telford From kaber at trash.net Fri Jul 20 11:16:36 2007 From: kaber at trash.net (Patrick McHardy) Date: Fri, 20 Jul 2007 20:16:36 +0200 Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes. In-Reply-To: <20070720063249.26341.125.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> <20070720063249.26341.125.sendpatchset@localhost.localdomain> Message-ID: <46A0FC04.1000006@trash.net> Krishna Kumar wrote: > +static inline int get_skb(struct net_device *dev, struct Qdisc *q, > + struct sk_buff_head *blist, > + struct sk_buff **skbp) > +{ > + if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) { > + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL); > + } else { > + int max = dev->tx_queue_len - skb_queue_len(blist); I'm assuming the driver will simply leave excess packets in the blist for the next run. The check for tx_queue_len is wrong though, it's only a default which can be overridden and some qdiscs don't care for it at all. > + struct sk_buff *skb; > + > + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL) > + max -= dev_add_skb_to_blist(skb, dev); > + > + *skbp = NULL; > + return 1; /* we have atleast one skb in blist */ > + } > +} > -void __qdisc_run(struct net_device *dev) > +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist) And the patches should really be restructured so this change is in the same patch changing the header and the caller, for example.
> { > do { > - if (!qdisc_restart(dev)) > + if (!qdisc_restart(dev, blist)) > break; > } while (!netif_queue_stopped(dev)); > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > From pradeeps at linux.vnet.ibm.com Fri Jul 20 12:00:33 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Fri, 20 Jul 2007 12:00:33 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V7] patch resubmit In-Reply-To: <469EFCD5.5050800@linux.vnet.ibm.com> References: <469E4CA2.2040708@linux.vnet.ibm.com> <469EB694.7040408@linux.vnet.ibm.com> <469EFCD5.5050800@linux.vnet.ibm.com> Message-ID: <46A10651.1060205@linux.vnet.ibm.com> Pradeep Satyanarayana wrote: > Roland Dreier wrote: >> > They are not quite the same. How about: >> > #define CM_PACKET_SIZE (ALIGN(IPOIB_CM_MTU, PAGE_SIZE)) >> >> That makes sense. >> >> > > > - .event_handler = ipoib_cm_rx_event_handler, >> > > >> > > why? seems harmless to just leave this alone for all QPs even if an >> > > SRQ isn't attached. >> > >> > If memory serves me right, I tried that and ran into some inexplicable problems. >> > Maybe it was hang or no traffic went through -don't exactly recollect what it was. >> > After this change the problem went away. >> >> Umm... I would like to get to the root cause of that. Because as far >> as I can see there is no problem if the event handler is called for a >> non-SRQ QP. The event will never be "last WQE reached" (since only a >> QP attached to an SRQ can generate that) and so the event handler will >> just return immediately and do nothing. > > Since I do not recollect what the issue was it was it might require some investigation > -especially since we have a short window for the merge. Would it be okay if I submit a > patch without this for the merge? Subsequently I will submit a patch to address this issue. 
> > Pradeep > There appear to be no problems with the 2.6.22 git tree if I leave the event_handler the same for all QPs. However, I see some ehca initialization errors with a slightly older kernel. I will work with the ehca folks (in Germany) and track this down and let you know. Pradeep From sashak at voltaire.com Fri Jul 20 14:11:06 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 21 Jul 2007 00:11:06 +0300 Subject: [ofa-general] latest libipathverbs.git tree In-Reply-To: <000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com> References: <20070719220905.GN12489@bauxite.pathscale.com> <000201c7ca52$0d364d20$9c98070a@amr.corp.intel.com> Message-ID: <20070720211106.GN16597@sashak.voltaire.com> On 15:13 Thu 19 Jul, Sean Hefty wrote: > Jeff/Vlad, > > Do either of you know the missing step to adding Ralph's git tree to the http > view? (See below.) I did. Actually, a symbolic link to Ralph's scm directory was needed (for gitweb): ln -s ~ralphc/scm /pub/scm/'~ralphc' Sasha > > - Sean > > >> I believe if you create /home/ralphc/public_html directory, and place > >symbolic > >> links in it to the git tree, then it will be visible on > >> http://www.openfabrics.org/git. I don't remember if additional setup on the > >> server is required. > > > >thanks, i tried it, but it doesn't seem to be sufficient... > > > >arthur From sweitzen at cisco.com Fri Jul 20 15:20:03 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 20 Jul 2007 15:20:03 -0700 Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2? In-Reply-To: References: Message-ID: Use rpm2cpio | cpio -iv to extract the source tarballs from the source RPMs.
The OFED 1.2 structure is less confusing, because the 1.1 openib-*.tgz file was never actually used to compile the code; it was only a redundant copy of the code. Scott ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shirley Ma Sent: Friday, July 20, 2007 9:39 AM To: openib-general at openib.org Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2? I downloaded OFED-1.2.tgz. It doesn't include source code openib-*tgz as OFED-1.1. Where I can find the source code without installing any RPMs?? Thanks Shirley From xma at us.ibm.com Fri Jul 20 15:56:24 2007 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 20 Jul 2007 15:56:24 -0700 Subject: [ofa-general] Where is openib-1.2.tgz in OFED-1.2? In-Reply-To: Message-ID: Thanks Scott for the tip. Shirley From rdreier at cisco.com Fri Jul 20 20:32:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 20:32:52 -0700 Subject: [ofa-general] is ipath_layer.c dead code? In-Reply-To: <20070719183249.GA20240@bauxite.pathscale.com> (Arthur Jones's message of "Thu, 19 Jul 2007 11:32:49 -0700") References: <20070719183249.GA20240@bauxite.pathscale.com> Message-ID: thanks, applied. I did indeed miss the header file being dead too. BTW... > The failed attempt to get ipath_ether upstream was the final nail in the coffin I don't think that the attempt to get ipath_ether upstream was ever that vigorous -- I don't see much demand for it, but if you guys feel that it has advantages for users then I wouldn't rule out merging it. - R.
From rdreier at cisco.com Fri Jul 20 20:41:23 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 20:41:23 -0700 Subject: [ofa-general] Re: [PATCH] IB/mthca: change command token on timeout In-Reply-To: <20070719112849.GK24018@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 19 Jul 2007 14:28:49 +0300") References: <20070719112849.GK24018@mellanox.co.il> Message-ID: thanks, I applied this and also did the same thing for mlx4: commit 8a7bc1f72356a1f7dc67a168067c3942e8db395a Author: Roland Dreier Date: Fri Jul 20 20:39:31 2007 -0700 mlx4_core: Change command token on timeout The FW command token is currently only updated on a command completion event. This means that on command timeout, the same token will be reused for new command, which results in a mess if the timed out command *does* eventually complete. This is the same change as the patch for mthca from Michael S. Tsirkin that was just merged. It seems sensible to avoid gratuitous differences in FW command processing between mthca and mlx4. 
Signed-off-by: Roland Dreier diff --git a/drivers/net/mlx4/cmd.c b/drivers/net/mlx4/cmd.c index c1f81a9..5d791e4 100644 --- a/drivers/net/mlx4/cmd.c +++ b/drivers/net/mlx4/cmd.c @@ -246,8 +246,6 @@ void mlx4_cmd_event(struct mlx4_dev *dev, u16 token, u8 status, u64 out_param) context->result = mlx4_status_to_errno(status); context->out_param = out_param; - context->token += priv->cmd.token_mask + 1; - complete(&context->done); } @@ -264,6 +262,7 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param, spin_lock(&cmd->context_lock); BUG_ON(cmd->free_head < 0); context = &cmd->context[cmd->free_head]; + context->token += priv->cmd.token_mask + 1; cmd->free_head = context->next; spin_unlock(&cmd->context_lock); From rdreier at cisco.com Fri Jul 20 20:55:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 20:55:39 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: fix oops in qp allocation for srq case In-Reply-To: <20070719161543.GC31246@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 19 Jul 2007 19:15:44 +0300") References: <20070719094039.GF24018@mellanox.co.il> <20070719161543.GC31246@mellanox.co.il> Message-ID: (BTW, the kmalloc(0) crash should be fixed in Linus's latest git) > the bug in error handling is real though, isn't it? yes, quite right. I queued this up: commit 597869e4dafbb05a69f571e5109f06245807ed6c Author: Roland Dreier Date: Fri Jul 20 20:54:30 2007 -0700 IB/mlx4: Fix error path in create_qp_common() The error handling code at err_wrid in create_qp_common() does not handle a userspace QP attached to an SRQ correctly, since it ends up in the else clause of the if statement. This means it tries to kfree() the uninitialized qp->sq.wrid and qp->rq.wrid pointers. Fix this so we only free the wrid arrays for kernel QPs. Pointed out by Michael S. Tsirkin . 
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 5456bc4..f6315df 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -415,9 +415,11 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, return 0; err_wrid: - if (pd->uobject && !init_attr->srq) - mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db); - else { + if (pd->uobject) { + if (!init_attr->srq) + mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), + &qp->db); + } else { kfree(qp->sq.wrid); kfree(qp->rq.wrid); } From rdreier at cisco.com Fri Jul 20 21:02:11 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:02:11 -0700 Subject: [ofa-general] IPOB CM (NOSRQ) [PATCH V8] patch In-Reply-To: <469FB3AB.6080304@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Thu, 19 Jul 2007 11:55:39 -0700") References: <469FB3AB.6080304@linux.vnet.ibm.com> Message-ID: I just noticed another bug here I think: here you search up to max_rc_qp: > + for (index = 0; index < max_rc_qp; index++) > + if (priv->cm.rx_index_table[index] == NULL) > + break; but here > + priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * > + sizeof *priv->cm.rx_index_table, > + GFP_KERNEL); the table is allocated with a fixed size of NOSRQ_INDEX_TABLE_SIZE. (BTW, kcalloc might be slightly preferred here, since you are actually allocating an array). If max_rc_qp is going to be a module parameter: > +module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644); (and, I just noticed, one you allow to be changed at runtime ?!) then rx_index_table has to be allocated with the right size. - R. 
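[Editorial sketch] The sizing mismatch described in the message above (a search loop bounded by max_rc_qp over a table allocated with a fixed NOSRQ_INDEX_TABLE_SIZE) can be illustrated in plain userspace C. This is only a sketch under stated assumptions, not the IPoIB code: `struct rx_table`, `rx_table_alloc()`, `rx_table_find_free()` and `rx_table_free()` are hypothetical names, and `calloc()` stands in for the kernel's `kcalloc()`. The idea shown is simply that the bound used by the search is the size recorded at allocation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for priv->cm.rx_index_table: an array of
 * pointers whose length is recorded when the table is allocated. */
struct rx_table {
	void **slots;
	int size;	/* number of entries actually allocated */
};

/* Allocate the table with exactly max_rc_qp zeroed entries, so the
 * bound used by the search below always matches the allocation.
 * Returns 0 on success, -1 on failure.  calloc() is used here in
 * place of the kernel's kcalloc(). */
int rx_table_alloc(struct rx_table *t, int max_rc_qp)
{
	assert(t != NULL);
	if (max_rc_qp <= 0)
		return -1;
	t->slots = calloc((size_t)max_rc_qp, sizeof *t->slots);
	if (!t->slots)
		return -1;
	t->size = max_rc_qp;
	return 0;
}

/* Return the first free index, or -1 if the table is full.  The loop
 * is bounded by the size stored at allocation time, not by a tunable
 * that may have been changed afterwards. */
int rx_table_find_free(const struct rx_table *t)
{
	int index;

	assert(t != NULL);
	for (index = 0; index < t->size; index++)
		if (t->slots[index] == NULL)
			return index;
	return -1;
}

/* Release the table and reset the recorded size. */
void rx_table_free(struct rx_table *t)
{
	assert(t != NULL);
	free(t->slots);
	t->slots = NULL;
	t->size = 0;
}
```

Recording the size next to the pointer is one possible design; it also sidesteps the runtime-writable parameter concern raised above, because a later change to the tunable cannot move the loop bound past the end of the allocation.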
From rdreier at cisco.com Fri Jul 20 21:07:13 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:07:13 -0700 Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs In-Reply-To: <200707201601.52277.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Fri, 20 Jul 2007 16:01:51 +0200") References: <200707201601.52277.hnguyen@linux.vnet.ibm.com> Message-ID: I applied this, but I agree with checkpatch.pl: > WARNING: externs should be avoided in .c files > #227: FILE: drivers/infiniband/hw/ehca/ehca_mrmw.c:67: > +extern int ehca_mr_largepage; > > WARNING: externs should be avoided in .c files > #949: FILE: drivers/infiniband/hw/ehca/hcp_if.c:753: > + extern int ehca_debug_level; if you need to use a variable in more than one .c file, put the extern declaration in a common header that's included everywhere you use the variable, including the .c file that it is defined in. That way the compiler can see if you get confused about the type of the variable. When you get a chance, please post a follow-on patch to fix this. - R. From rdreier at cisco.com Fri Jul 20 21:12:52 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:12:52 -0700 Subject: [ofa-general] Re: [PATCH 2/5] ehca: Generate event when SRQ limit reached In-Reply-To: <200707201602.19142.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Fri, 20 Jul 2007 16:02:18 +0200") References: <200707201602.19142.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, applied. BTW, does your SRQ-capable hardware support generating the "last WQE reached" event? There's not any reliable way to avoid problems when destroying QPs attached to an SRQ without it, and the IB spec requires CAs that support SRQs to generate it (o11-5.2.5 in chapter 11 of vol 1). I don't see any code in ehca to generate the event, and IPoIB CM at least will be very unhappy when using SRQs if the event is not generated. - R. 
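[Editorial sketch] The checkpatch.pl advice in the reply to patch 1/5 above (declare the variable `extern` in a common header that the defining .c file also includes) can be sketched in a single translation unit. The three sections below stand in for three files; the header name `ehca_shared.h` and the helper `ehca_debug_enabled()` are hypothetical and exist only for illustration, while `ehca_debug_level` is the variable named in the warning. This is a minimal sketch, not the ehca driver's actual layout.

```c
#include <assert.h>

/* ---- would live in a shared header (hypothetical ehca_shared.h) ---- */
extern int ehca_debug_level;		/* declaration only, no storage */
int ehca_debug_enabled(int level);	/* hypothetical helper */

/* ---- would live in the one .c file that defines the variable ---- */
/* Because the defining file also sees the declaration above, the
 * compiler checks that the two agree.  Writing
 * "long ehca_debug_level = 0;" here would be rejected as a
 * conflicting type instead of silently linking. */
int ehca_debug_level = 0;

/* ---- would live in any other .c file that includes the header ---- */
int ehca_debug_enabled(int level)
{
	assert(level >= 0);
	return ehca_debug_level >= level;
}
```

With a bare `extern` repeated inside individual .c files, a type mismatch between the declaration and the definition goes unnoticed at compile time, which is the confusion the header-based layout is meant to catch.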
From rdreier at cisco.com Fri Jul 20 21:14:28 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:14:28 -0700 Subject: [ofa-general] Re: [PATCH 3/5] ehca: Make ehca2ib_return_code() non-inline In-Reply-To: <200707201602.46415.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Fri, 20 Jul 2007 16:02:46 +0200") References: <200707201602.46415.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, applied From rdreier at cisco.com Fri Jul 20 21:20:49 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:20:49 -0700 Subject: [ofa-general] [PATCH 5/5] ehca: Support small QP queues In-Reply-To: <200707201604.17991.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Fri, 20 Jul 2007 16:04:17 +0200") References: <200707201604.17991.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, applied. I fixed this up myself to work with commit 20c2df83, which got rid of the destructor argument to kmem_cache_create() -- you probably want to check my tree to make sure it's OK. 
Also the same as I said before about checkpatch.pl's warning: WARNING: externs should be avoided in .c files #337: FILE: drivers/infiniband/hw/ehca/ehca_pd.c:91: + extern struct kmem_cache *small_qp_cache; please fix that up when you get a chance From kliteyn at mellanox.co.il Fri Jul 20 21:43:03 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 21 Jul 2007 07:43:03 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-21:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From rdreier at cisco.com Fri Jul 20 21:54:27 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:54:27 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default In-Reply-To: <20070719112155.GJ24018@mellanox.co.il> (Michael S. 
Tsirkin's message of "Thu, 19 Jul 2007 14:21:55 +0300") References: <20070719112155.GJ24018@mellanox.co.il> Message-ID: > - mlx4_enable_msi_x(dev); > - > if (mlx4_cmd_init(dev)) { > mlx4_err(dev, "Failed to init command interface, aborting.\n"); > goto err_free_dev; > } > > + mlx4_enable_msi_x(dev); Why this change? I don't see anything in mlx4_cmd_init() that seems to matter in terms of coming before or after enabling MSI-X. > err = mlx4_init_hca(dev); > + if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) { > + mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n"); > + dev->flags &= ~MLX4_FLAG_MSI_X; > + pci_disable_msix(pdev); > + err = mlx4_init_hca(dev); > + } > + > if (err) > goto err_cmd; > > + mlx4_enable_msi_x(dev); > + > err = mlx4_setup_hca(dev); Have you actually tested this on a system where MSI-X fails? Because I don't see how it could work-- we don't actually try interrupts until mlx4_setup_hca() (in fact we don't even create any EQs until then). So I don't see how mlx4_init_hca() could tell if MSI-X is OK... - R. From rdreier at cisco.com Fri Jul 20 21:56:20 2007 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 20 Jul 2007 21:56:20 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get another small batch of changes for 2.6.23: Arthur Jones (1): IB/ipath: Remove ipath_layer dead code Florin Malita (1): IB/mlx4: Fix leaks in __mlx4_ib_modify_qp Hoang-Nam Nguyen (3): IB/ehca: Support large page MRs IB/ehca: Generate async event when SRQ limit reached IB/ehca: Move ehca2ib_return_code() out of line Joachim Fenkes (1): IB/ehca: Make internal_create/destroy_qp() static Michael S. 
Tsirkin (1): IB/mthca: Change command token on timeout Roland Dreier (2): mlx4_core: Change command token on timeout IB/mlx4: Fix error path in create_qp_common() Stefan Roscher (1): IB/ehca: Support small QP queues drivers/infiniband/hw/ehca/ehca_classes.h | 50 +++-- drivers/infiniband/hw/ehca/ehca_cq.c | 8 +- drivers/infiniband/hw/ehca/ehca_eq.c | 8 +- drivers/infiniband/hw/ehca/ehca_irq.c | 42 +++- drivers/infiniband/hw/ehca/ehca_main.c | 49 ++++- drivers/infiniband/hw/ehca/ehca_mrmw.c | 371 ++++++++++++++++++++++++----- drivers/infiniband/hw/ehca/ehca_mrmw.h | 2 +- drivers/infiniband/hw/ehca/ehca_pd.c | 25 ++- drivers/infiniband/hw/ehca/ehca_qp.c | 178 ++++++++------ drivers/infiniband/hw/ehca/ehca_tools.h | 19 +-- drivers/infiniband/hw/ehca/ehca_uverbs.c | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 50 +++- drivers/infiniband/hw/ehca/ipz_pt_fn.c | 222 +++++++++++++---- drivers/infiniband/hw/ehca/ipz_pt_fn.h | 26 ++- drivers/infiniband/hw/ipath/Makefile | 1 - drivers/infiniband/hw/ipath/ipath_layer.c | 365 ---------------------------- drivers/infiniband/hw/ipath/ipath_layer.h | 71 ------ drivers/infiniband/hw/ipath/ipath_verbs.h | 2 - drivers/infiniband/hw/mlx4/qp.c | 20 +- drivers/infiniband/hw/mthca/mthca_cmd.c | 3 +- drivers/net/mlx4/cmd.c | 3 +- 21 files changed, 802 insertions(+), 715 deletions(-) delete mode 100644 drivers/infiniband/hw/ipath/ipath_layer.c delete mode 100644 drivers/infiniband/hw/ipath/ipath_layer.h From HNGUYEN at de.ibm.com Sat Jul 21 01:22:54 2007 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Sat, 21 Jul 2007 10:22:54 +0200 Subject: [ofa-general] [PATCH 1/5] ehca: Supports large page MRs In-Reply-To: Message-ID: Hi Roland! 
> I applied this, but I agree with checkpatch.pl: > > > WARNING: externs should be avoided in .c files > > #227: FILE: drivers/infiniband/hw/ehca/ehca_mrmw.c:67: > > +extern int ehca_mr_largepage; > > > > WARNING: externs should be avoided in .c files > > #949: FILE: drivers/infiniband/hw/ehca/hcp_if.c:753: > > + extern int ehca_debug_level; > > if you need to use a variable in more than one .c file, put the extern > declaration in a common header that's included everywhere you use the > variable, including the .c file that it is defined in. That way the > compiler can see if you get confused about the type of the variable. That's true. > When you get a chance, please post a follow-on patch to fix this. Sure thing. Will do that for rc2. Thanks! Nam From vlad at lists.openfabrics.org Sat Jul 21 01:38:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 21 Jul 2007 01:38:32 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070721-0100 daily build status Message-ID: <20070721083832.AFC96E60838@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with 
linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: From krkumar2 at in.ibm.com Fri Jul 20 23:46:30 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sat, 21 Jul 2007 12:16:30 +0530 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. In-Reply-To: <20070720172203.0eaeea86@oldman> Message-ID: Stephen Hemminger wrote on 07/20/2007 09:52:03 PM: > Patrick McHardy wrote: > > > Krishna Kumar2 wrote: > > > Patrick McHardy wrote on 07/20/2007 03:37:20 PM: > > > > > > > > > > > >> rtnetlink support seems more important than sysfs to me. > > >> > > > > > > Thanks, I will add that as a patch. The reason to add to sysfs is that > > > it is easier to change for a user (and similar to tx_queue_len). 
> > > > > > > But since batching is so similar to TSO, i really should be part of the > flags and controlled by ethtool like other offload flags. So should I add all three interfaces (or which ones) : 1. /sys (like for tx_queue_len) 2. netlink 3. ethtool. Or only 2 & 3 are enough ? thanks, - KK From krkumar2 at in.ibm.com Fri Jul 20 23:44:12 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sat, 21 Jul 2007 12:14:12 +0530 Subject: [ofa-general] Re: [PATCH 03/10] dev.c changes. In-Reply-To: <1184953459.12431.21.camel@localhost.localdomain> Message-ID: Hi Sridhar, Sridhar Samudrala wrote on 07/20/2007 11:14:19 PM: > > @@ -1566,7 +1605,7 @@ gso: > > /* reset queue_mapping to zero */ > > skb->queue_mapping = 0; > > rc = q->enqueue(skb, q); > > - qdisc_run(dev); > > + qdisc_run(dev, NULL); > > OK. So you are passing a NULL blist here. However, i am > not sure why batching is not used in this situation. Actually it could be used, but in most cases there will be only one skb. If I pass the blist here, the result (for batching case) will be to put one single skb into the blist and call the new xmit API. That wastes cycles as we take a skb out from the queue (as in regular code) and then add it to the blist (different in the new code) and then the driver has to remove this skb from the blist (different in the new code). I could try batching but then require there are more than 1 skbs before adding to the blist (or the blist doesn't already have skbs, in which case adding even one skb makes sense). Also, it will have a slight impact for regular drivers where for each xmit, one extra dereference for dev->skb_blist (which is always NULL) is made, which was another reason to always pass NULL. I will check what the results are by giving passing blist here too and make the above change. I will run tests for that (as well as NETPERF RR test as asked by Evgeniy). 
Thanks, - KK From krkumar2 at in.ibm.com Fri Jul 20 23:56:23 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sat, 21 Jul 2007 12:26:23 +0530 Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes. In-Reply-To: <46A0FC04.1000006@trash.net> Message-ID: Hi Patrick, Patrick McHardy wrote on 07/20/2007 11:46:36 PM: > Krishna Kumar wrote: > > +static inline int get_skb(struct net_device *dev, struct Qdisc *q, > > + struct sk_buff_head *blist, > > + struct sk_buff **skbp) > > +{ > > + if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) { > > + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL); > > + } else { > > + int max = dev->tx_queue_len - skb_queue_len(blist); > > > I'm assuming the driver will simply leave excess packets in the > blist for the next run. Yes, and the next run will be scheduled even if no more xmits are called either due to qdisc_restart()'s call to driver returning : BUSY : driver failed to send all, net_tx_action will handle this later (the case you mentioned) OK : and qlen is > 0, return 1 and __qdisc_run() will re-retry (where blist len will become zero as driver processed EVERYTHING on blist) > The check for tx_queue_len is wrong though, > its only a default which can be overriden and some qdiscs don't > care for it at all. I think it should not matter whether qdiscs use this or not, or even if it is modified (unless it is made zero in which case this breaks). The intention behind this check is to make sure that not more than tx_queue_len skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise the blist can become too large and breaks the idea of tx_queue_len. Is that a good justification ? > > -void __qdisc_run(struct net_device *dev) > > +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist) > > > And the patches should really be restructured so this change is > in the same patch changing the header and the caller, for example. Ah, OK. 
Thanks, - KK From krkumar2 at in.ibm.com Sat Jul 21 00:24:08 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sat, 21 Jul 2007 12:54:08 +0530 Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes. In-Reply-To: Message-ID: Krishna Kumar2/India/IBM wrote on 07/21/2007 12:26:23 PM: > Hi Patrick, > > Patrick McHardy wrote on 07/20/2007 11:46:36 PM: > > > The check for tx_queue_len is wrong though, > > its only a default which can be overriden and some qdiscs don't > > care for it at all. > I think it should not matter whether qdiscs use this or not, or even if it > is modified (unless it is made zero in which case this breaks). The > intention behind this check is to make sure that not more than tx_queue_len > skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise > the blist can become too large and breaks the idea of tx_queue_len. Is that > a good justification ? Also, if tx_queue_len is set to zero, I think my code will not execute and the existing code will break at rc = q->enqueue() (for sched's checking queue limits). From krkumar2 at in.ibm.com Fri Jul 20 23:30:10 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sat, 21 Jul 2007 12:00:10 +0530 Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes. In-Reply-To: <1184952305.12431.16.camel@localhost.localdomain> Message-ID: Hi Sridhar, Sridhar Samudrala wrote on 07/20/2007 10:55:05 PM: > > diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h > > --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 > > +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.000000000 +0530 > > @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge > > struct rtattr *tab); > > extern void qdisc_put_rtab(struct qdisc_rate_table *tab); > > > > -extern void __qdisc_run(struct net_device *dev); > > +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); > > Why do we need this additional 'blist' argument? 
> Is this different from dev->skb_blist? It is the same, but I want to call it mostly with NULL and rarely with the batch list pointer (so it is related to your other question). My original code didn't have this and was trying batching in all cases. But in most xmit's (probably almost all), there will be only one packet in the queue to send and batching will never happen. When there is a lock contention or if the queue is stopped, then the next iteration will find >1 packets. But I still will try no batching for the lock failure case as there be probably 2 packets (one from previous time and 1 from this time, or 3 if two failures, etc), and try batching only when queue was stopped from net_tx_action (this was based on Dave Miller's idea). Thanks, - KK From vlad at lists.openfabrics.org Sat Jul 21 02:43:54 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 21 Jul 2007 02:43:54 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070721-0200 daily build status Message-ID: <20070721094354.1086BE60873@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with 
linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Failed: From hadi at cyberus.ca Sat Jul 21 06:18:41 2007 From: hadi at cyberus.ca (jamal) Date: Sat, 21 Jul 2007 09:18:41 -0400 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> Message-ID: <1185023921.5192.45.camel@localhost> I am (have been) under extreme travel mode - so i will have high latency in follow ups. 
On Fri, 2007-20-07 at 12:01 +0530, Krishna Kumar wrote:
> Hi Dave, Roland, everyone,
>
> In May, I had proposed creating an API for sending 'n' skbs to a driver to
> reduce lock overhead, DMA operations, and specific to drivers that have
> completion notification like IPoIB - reduce completion handling ("[RFC] New
> driver API to speed up small packets xmits" @
> http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
> initial test results for E1000 which showed minor improvements (but also
> got degradations) @http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
>

Add to that context: that i have been putting out patches on this over
the last 3+ years as well as several public presentations - last one
being: http://vger.kernel.org/jamal_netconf2006.sxi

My main problem (and obstacles to submitting the patches) has been a
result of not doing the appropriate testing - i had been testing the
forwarding path (in all my results post the latest patches) when i
should really have been testing the improvement of the tx path.

> There is a parallel WIP by Jamal but the two implementations are completely
> different since the code bases from the start were separate. Key changes:
> - Use a single qdisc interface to avoid code duplication and reduce
> maintainability (sch_generic.c size reduces by ~9%).
> - Has per device configurable parameter to turn on/off batching.
> - qdisc_restart gets slightly modified while looking simple without
> any checks for batching vs regular code (in fact only two lines have
> changed - 1. instead of dev_dequeue_skb, a new batch-aware function
> is called; and 2. an extra call to hard_start_xmit_batch.
> - No change in __qdisc_run other than a new argument (from DM's idea).
> - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

All the above are cosmetic differences. To me the highest priority is
making sure that batching is useful and what the limitations are.
At some point, when all looks good - i don't mind adding an ethtool
interface to turn off/on batching, merge with the new qdisc restart path
instead of having a parallel path, solicit feedback on naming, where to
allocate structs etc etc. All that is low prio if batching across a
variety of hardware and applications doesn't prove useful. At the moment,
i am unsure there's consistency to justify pushing batching in.

Having said that, below are the main architectural differences we have,
which is what we really need to discuss and see what proves useful:

> - Batching algo/processing is different (eg. if
> qdisc_restart() finds
> one skb in the batch list, it will try to batch more (up to a limit)
> instead of sending that out and batching the rest in the next call.

This sounds a little more aggressive but maybe useful.
I have experimented with setting upper bound limits (current patches
have a pktgen interface to set the max to send) and have concluded that
it is unneeded. Probing by letting the driver tell you what space is
available has proven to be the best approach. I have been meaning to
remove the code in pktgen which allows these limits.

> - Jamal's code has a separate hw prep handler called from the stack,
> and results are accessed in driver during xmit later.

I have explained the reasoning to this a few times. A recent response to
Michael Chan is here:
http://marc.info/?l=linux-netdev&m=118346921316657&w=2
And here's a response to you that i haven't heard back on:
http://marc.info/?l=linux-netdev&m=118355539503924&w=2

My tests so far indicate this interface is useful. It doesn't apply well
to some drivers (for example i don't use it in tun) - which makes it
optional but useful nevertheless. I will be more than happy to kill this
if i can find cases where it proves to be a bad idea.

> - Jamal's code has dev->xmit_win which is cached by the driver. Mine
> has dev->xmit_slots but this is used only by the driver while the
> core has a different mechanism to find how many skbs to batch.

This is related to the first item.

> - Completely different structure/design & coding styles.
> (This patch will work with drivers updated by Jamal, Matt & Michael Chan with
> minor modifications - rename xmit_win to xmit_slots & rename batch handler)

Again, cosmetics (and indication you are morphing towards me).

So if i was to sum this up (it would be a useful discussion to have on
these), the real differences are:

a) you have an extra check on refilling the skb list when you find that
it has a single skb. I tagged this as being potentially useful.

b) You have a check for some upper bound on the number of skbs to send
to the driver. I tagged this as unnecessary - the interface is still on
in my current code, so it shouldn't be hard to show one way or other.

c) You don't have prep_xmit()

Add to that list any other architectural differences i may have missed
and let's discuss and hopefully make some good progress.

cheers,
jamal

From hadi at cyberus.ca Sat Jul 21 06:46:19 2007
From: hadi at cyberus.ca (jamal)
Date: Sat, 21 Jul 2007 09:46:19 -0400
Subject: [ofa-general] TCP and batching WAS(Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <20070720081848.7cc652fb@oldman>
References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> <20070720081848.7cc652fb@oldman>
Message-ID: <1185025579.5192.68.camel@localhost>

On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:

> You may see worse performance with batching in the real world when
> running over WAN's. Like TSO, batching will generate back to back packet
> trains that are subject to multi-packet synchronized loss.

Has someone done any study on TSO effect? Doesn't ECN with a RED router
help on something like this?
I find it surprising that a single flow doing TSO would overwhelm a
router's buffer. I actually think the value of batching as far as TCP is
concerned is proportional to the number of flows, i.e. the more flows you
have the more batching you will end up doing. And if TCP's fairness is
the legend talk it has been made to be, then i don't see this as
problematic.

BTW, something i noticed with regard to GSO when testing batching:
For TCP packets slightly above MDU (up to 2K), GSO gives worse
performance than non-GSO. Actually has nothing to do with batching,
rather it works the same way with or without batching changes.

Another oddity:
Looking at the flow rate from a purely packets/second perspective (I
know that's a router centric view, but i found it strange nevertheless) -
you see that as packet size goes up, the pps also goes up. I tried
mucking around with nagle etc, but saw no observable changes. Any
insight?
My expectation was that the pps would stay at least the same or get
better with smaller packets (assuming there's less data to push around).

cheers,
jamal

From mst at dev.mellanox.co.il Sat Jul 21 12:48:51 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 21 Jul 2007 22:48:51 +0300
Subject: [ofa-general] Re: IPOB CM (NOSRQ) [PATCH V8] patch
In-Reply-To: <46A0E350.5060207@linux.vnet.ibm.com>
References: <469FB3AB.6080304@linux.vnet.ibm.com> <20070720141759.GF31246@mellanox.co.il> <46A0E350.5060207@linux.vnet.ibm.com>
Message-ID: <20070721194851.GA20438@mellanox.co.il>

> Quoting Pradeep Satyanarayana :
> Subject: Re: IPOB CM (NOSRQ) [PATCH V8] patch
>
> Michael S. Tsirkin wrote:
> >> @@ -815,7 +1168,9 @@ static struct ib_qp *ipoib_cm_create_tx_
> >>   attr.recv_cq = priv->cq;
> >>   attr.srq = priv->cm.srq;
> >>   attr.cap.max_send_wr = ipoib_sendq_size;
> >> + attr.cap.max_recv_wr = 1;
> >>   attr.cap.max_send_sge = 1;
> >> + attr.cap.max_recv_sge = 1;
> >>   attr.sq_sig_type = IB_SIGNAL_ALL_WR;
> >>   attr.qp_type = IB_QPT_RC;
> >>   attr.send_cq = cq;
> >
> > You never post a receive WR on this QP, do you?
> > So
> > 1. What's magic about 1 as max recv wr?
Why not 0?
> > 2. If the remote sends a packet on this QP, it'll get closed,
> > won't it? Looks like a spec violation.
> >
> >
> Good catch. I can probably set max_recv_sge to 0 too - right?
> I can do that in a separate patch later on.
> However, I see nothing in table 46 of the IB spec that tells me
> that it is a violation of the spec. Which section are you
> referring to?

The IPoIB RFC.

--
MST

From kliteyn at dev.mellanox.co.il Sat Jul 21 15:07:50 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 22 Jul 2007 01:07:50 +0300
Subject: [ofa-general] QoS RFC
Message-ID: <46A283B6.1070105@dev.mellanox.co.il>

Hi All

Please find the attached RFC describing how QoS policy support could be
implemented in the OpenFabrics stack.
Your comments are welcome.

-- Yevgeny

RFC: OpenFabrics Enhancements for QoS Support
===============================================

Authors:  Eitan Zahavi
          Yevgeny Kliteynik
Date:     Jul 2007
Revision: 0.2

Table of contents:
1. Overview
2. Architecture
3. Supported Policy
4. CMA functionality
5. IPoIB functionality
6. SDP functionality
7. SRP functionality
8. iSER functionality
9. OpenSM functionality

1. Overview
------------

Quality of Service requirements stem from the realization of I/O
consolidation over an IB network: as multiple applications and ULPs share
the same fabric, means to control their use of the network resources are
becoming a must. The basic need is to differentiate the service levels
provided to different traffic flows, such that a policy can be enforced
to control each flow's utilization of the fabric resources.

The IBTA specification defines several hardware features and management
interfaces to support QoS:
* Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
* Arbitration between traffic of different VLs is performed by a weighted
  round robin arbiter with 2 priority levels. The arbiter is programmable
  with a sequence of (VL, weight) pairs and a maximal number of high
  priority credits to be processed before low priority is served
* Packets carry a class of service marking in the range 0 to 15 in their
  header SL field
* Each switch can map an incoming packet by its SL to a particular output
  VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
* The Subnet Administrator controls the parameters of each communication
  flow by providing them as a response to Path Record (PR) or
  MultiPathRecord (MPR) queries

The IB QoS features provide the means to implement a DiffServ-like
architecture. The DiffServ architecture (IETF RFC 2474 and RFC 2475) is
widely used today in highly dynamic fabrics.

This proposal provides the detailed functional definition for the various
software elements that are required to enable a DiffServ-like
architecture over the OpenFabrics software stack.

2. Architecture
----------------

This proposal splits the QoS functionality between the SM/SA, the CMA and
the various ULPs. We take the "chronology approach" to describe how the
overall system works:

2.1. The network manager (human) provides a set of rules (policy) that
defines how the network is being configured and how its resources are
split between different QoS-Levels. The policy also defines how to decide
which QoS-Level each application, ULP or service uses.

2.2. The SM analyzes the provided policy to see if it is realizable and
performs the necessary fabric setup. The SM may continuously monitor the
policy and adapt to changes in it. Part of this policy defines the
default QoS-Level of each partition. The SA is being enhanced to match
the requested Source, Destination, QoS-Class, Service-ID (and optionally
SL and priority) against the policy, so clients (ULPs, programs) can
obtain a policy-enforced QoS. The SM is also enhanced to support setting
up partitions with an appropriate IPoIB broadcast group. This broadcast
group carries its QoS attributes: SL, MTU and RATE.

2.3. IPoIB is being set up. IPoIB uses the SL, MTU and RATE available on
the multicast group which forms the broadcast group of this partition.

2.4. MPI, which provides non-IB-based connection management, should be
configured to run using hard-coded SLs. It uses these SLs for every QP
being opened.

2.5. ULPs that use the CM interface (like SRP) should have their own
pre-assigned Service-ID and use it while obtaining PR/MPR for
establishing connections. The SA receiving the PR/MPR should match it
against the policy and return the appropriate PR/MPR including SL, MTU
and RATE.

2.6. ULPs and programs using the CMA to establish an RC connection should
provide the CMA with the target IP and Service-ID. Some of the ULPs might
also provide a QoS-Class (e.g. for SDP sockets that are provided the TOS
socket option). The CMA should then use the provided Service-ID and
optional QoS-Class and pass them in the PR/MPR request. The resulting
PR/MPR should be used for configuring the connection QP.

PathRecord and MultiPathRecord enhancement for QoS:
As mentioned above, the PathRecord and MultiPathRecord attributes should
be enhanced to carry the Service-ID, which is a 64-bit value that has
been standardized by the IBTA. A new field, QoS-Class, is also provided.
A new capability bit should describe the SM QoS support in the SA class
port info. This approach provides an easy migration path for the existing
access layer and ULPs by not introducing a new set of PR/MPR attributes.

3. Supported Policy
--------------------

The QoS policy supported by this proposal is divided into 4 subsections:

I) Port Group: a set of CAs, Routers or Switches that share the same
settings. A port group might be a partition defined by the partition
manager policy in terms of GUIDs. Future implementations might provide
support for NodeDescription-based definition of port groups.

II) Fabric Setup: Defines how the SL2VL and VLArb tables should be set up.
This policy definition assumes the computation of overall end to end network behavior should be performed outside of OpenSM. III) QoS-Levels Definition: This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS). IV) Matching Rules: A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed in order such as the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields ** SRC and DST to lists of port groups ** Service-ID to a list of Service-ID or Service-ID ranges ** QoS-Class to a list of QoS-Class values or ranges QoS Policy file syntax * Empty lines are ignored * Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability * Comments are started with the pound sign (#) and terminated by EOL * Comments may appear only in a separate line * Keywords that denote section/subsection start have matching closing keywords * Any keyword should be the first non-blank in the line QoS Policy file example # Port Groups define sets of ports to be used later in the settings port-groups # using port GUIDs port-group name: Storage # "use" is just a description that is used for logging. # Other than that, it is just a commentary use: our SRP storage targets port-guid: 0x1000000000000001 port-guid: 0x1000000000000002 end-port-group port-group name: Virtual Servers use: node desc and IB port num # The syntax of the port name is as follows: "hostname/CA-num/Pnum". # "hostname" and "CA-num" are compared to the first 2 words of # NodeDescription, and "Pnum" is a port number on that node. 
        port-name: vs1/HCA-1/P1
        port-name: vs3/HCA-1/P1
        port-name: vs3/HCA-2/P2
    end-port-group

    # using partitions defined in the partition policy
    port-group
        name: Group for Partition 1
        use: default settings
        partition: Part1
    end-port-group

    # using node types CA|ROUTER|SWITCH
    port-group
        name: Routers
        use: all routers
        node-type: ROUTER
    end-port-group

end-port-groups

qos-setup

    # define all types of VLArb tables. The length of the tables should
    # match the physically supported tables by their target ports
    vlarb-tables

        # scope defines the exact ports the VLArb tables apply to
        vlarb-scope
            # defining VLArb tables on all the ports that belong to
            # port group 'Storage', and on all the ports connected
            # to ports of port group 'Storage'
            group: Storage
            # "across" means all the ports that are connected to ports
            # that belong to the specified port group
            across: Storage

            # VLArb table holds VL and weight pairs
            vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
            vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
            vl-high-limit: 10
        end-vlarb-scope

        # There can be several scopes
    end-vlarb-tables

    sl2vl-tables

        # Scope defines the exact devices and in/out ports tables apply to.
        # Note: if the same port is matching several rules the *FIRST* one applies.
        sl2vl-scope
            # SL2VL tables are organized as SL2VL(in-port,out-port)
            # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
            # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
            #
            # The following example specifies that all the SL2VL tables
            # entries should be defined for all the ports of group Part1:
            group: Part1
            from: *
            to: *

            # SL2VL table has to have 16 values at max - one for each SL.
            # If the user specifies less than 16 values, all the missing
            # VL values will be implicitly set to 0
            sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
        end-sl2vl-scope

        sl2vl-scope
            # "across-to" is a combination of "across" keyword (definition can be found
            # in VLArb tables section) and "to" keyword.
            # "across: PortGroupName" refers to all the ports that are connected
            # to ports that belong to PortGroupName.
            #
            # Example of "across-to" usage:
            # A user has a set of 'special' nodes (e.g. storage nodes), and all
            # the traffic to these nodes has to get specific VL.
            # The solution is to define port group (e.g. "Storage") that will
            # include all the ports of these nodes, and then to configure SL2VL
            # tables on all the switch ports that are connected to the Storage
            # port group by specifying "across-to: Storage".
            #
            across-to: Storage2

            # Similar to "across-to", "across-from" is a combination of "across"
            # and "from" keywords
            across-from: Storage1

            sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
        end-sl2vl-scope

    end-sl2vl-tables
end-qos-setup

qos-levels

    # the first one is just setting SL
    qos-level
        use: for the lowest priority communication
        sl: 15
        packet-life: 16
    end-qos-level

    # the second sets SL and QoS Class
    qos-level
        use: low latency best bandwidth
        sl: 0
    end-qos-level

    # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
    qos-level
        use: just an example
        sl: 0
        mtu-limit: 1
        rate-limit: 1
        packet-life: 12
        # Path Bits can be used e.g. to provide different routes through the
        # subnet to a particular port
        path-bits: 2,4,8-32
    end-qos-level

end-qos-levels

# Match rules are scanned in a first-fit manner (like firewall rules table)
qos-match-rules

    # matching by a single criterion: class (list of values and ranges)
    qos-match-rule
        # just a description
        use: low latency by class 7-9 or 11
        qos-class: 7-9,11
        # number of qos-level to apply to the matching PR/MPR
        qos-level-sn: 1
    end-qos-match-rule

    # show matching by destination group AND service-ids
    qos-match-rule
        use: Storage targets connection
        destination: Storage
        service-id: 22,4719-5000
        qos-level-sn: 2
    end-qos-match-rule

    # show matching by source group only
    qos-match-rule
        use: bla bla
        source: Storage
        qos-level-sn: 3
    end-qos-match-rule

end-qos-match-rules

4.
IPoIB ---------

IPoIB already queries the SA for its broadcast group information. The additional functionality required is for IPoIB to provide the broadcast group SL, MTU, and RATE in every following PathRecord query performed when a new UDAV is needed by IPoIB. We could assign a special Service-ID for IPoIB use, but since all communication on the same IPoIB interface shares the same QoS-Level without the ability to differentiate it by target service, we can ignore it for simplicity.

5. CMA features
----------------

The CMA interface supports Service-ID through the notion of port space as a prefix to the port_num which is part of the sockaddr provided to rdma_resolve_addr(). What is missing is the explicit request for a QoS-Class that should allow the ULP (like SDP) to propagate a specific request for a class of service. A mechanism for providing the QoS-Class is available in the IPv6 address, so we could use that address field. Another option is to implement a special connection options API for CMA.

Missing functionality by CMA is the usage of the provided QoS-Class and Service-ID in the sent PR/MPR. When a response is obtained it is an existing requirement for the CMA to use the PR/MPR from the response in setting up the QP address vector.

6. SDP
-------

SDP uses CMA for building its connections. The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. SDP might be provided with SO_PRIORITY socket option. In that case the value provided should be sent to the CMA as the TClass option of that connection.

7. SRP
-------

Current SRP implementation uses its own CM callbacks (not CMA). So SRP should fill in the Service-ID in the PR/MPR by itself and use that information in setting up the QP. The T10 SRP standard defines the SRP Service-ID to be defined by the SRP target I/O Controller (but they should also comply with IBTA Service-ID rules).
Anyway, the Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

8. iSER
--------

iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER should be TBD.

9. OpenSM features
-------------------

The QoS related functionality to be provided by OpenSM can be split into two main parts:

3.1. Fabric Setup

During fabric initialization the SM should parse the policy and apply its settings to the discovered fabric elements. The following actions should be performed:
* Parsing of policy
* Node Group identification. Warning should be provided for each node not specified but found.
* SL2VL settings should be validated:
  + A warning will be provided if there are no matching targets for the SL2VL setting statement.
  + An error message will be printed to the log file if an invalid setting is found. A setting is invalid if it refers to:
    - Non existing port numbers of the target devices
    - Unsupported VLs for the target device. In the latter case the map to non existing VLs should be replaced with a map to VL15, i.e. packets will be dropped.
* SL2VL setting is to be performed
* VL Arbitration table settings should be validated according to the following rules:
  + A warning will be provided if there are no matching targets for the setting statement
  + An error will be provided if the port number exceeds the target ports
  + An error will be generated if the table length exceeds device capabilities
  + A warning will be generated if the table quotes a VL that is not supported by the target device
* VL Arbitration tables will be set on the appropriate targets

3.2. PR/MPR query handling:

OpenSM should be able to enforce the provided policy on client request. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS-Level definition is found.
Given the QoS-Level a path(s) search is performed with the given restrictions imposed by that level. The following two sections describe these steps. How Service-ID is carried in the PathRecord and MultiPathRecord attributes is now standardized by the IBTA.

3.2.1. Matching rule search:

A rule is "matching" a PR/MPR request using the following criteria:
* Matching rules provide values in a list of either single values or ranges of values. A PR/MPR field is "matching" the rule field if it is explicitly noted in the list of values or is one of the values covered by a range included in the field values list.
* Only PR/MPR fields that have their component mask bit set should be compared.
* For a rule to be "matching" a PR/MPR request all the rule fields should be "matching" their PR/MPR fields. Consequently, a PR/MPR request that does not have a component mask bit set for one of the rule defined fields can not match that rule.
* A PR/MPR request that has a component mask bit set for one of the fields that is not defined by the rule can match the rule.

The algorithm to be used for searching for a rule match might be as simple as a sequential search through all rules or enhanced for better performance.

The semantics of every rule field and its matching PR/MPR field are described below:
* Source: the SGID or SLID should be part of this group
* Destination: the DGID or DLID should be part of this group
* Service-ID: check if the requested Service-ID (available in the PR/MPR old SM-Key field) is matching any of this rule Service-IDs
* TClass: check if the PR/MPR TClass field is matching

3.2.2 PR/MPR response generation:

The QoS-Level pointed by the first rule that matches the PR/MPR request should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is matching the query.
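
The matching semantics of 3.2.1, together with the default-level fallback above, might be sketched in C as follows. This is only an illustration under stated assumptions, not OpenSM code: the struct layouts, the F_* mask bits and the function names are all hypothetical, the real PR/MPR component mask is laid out differently, and source/destination port-group matching is left out for brevity. The example rules mirror the qos-class and service-id rules of the policy file example, without the destination group.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical component-mask bits; NOT the real PR/MPR mask layout. */
#define F_SERVICE_ID (1u << 0)
#define F_QOS_CLASS  (1u << 1)

struct range { uint64_t lo, hi; };      /* a single value has lo == hi */

struct request {                        /* simplified PR/MPR query */
    unsigned comp_mask;                 /* fields the client filled in */
    uint64_t service_id;
    uint64_t qos_class;
};

struct rule {                           /* simplified qos-match-rule */
    unsigned fields;                    /* fields this rule constrains */
    const struct range *service_ids; size_t n_service_ids;
    const struct range *qos_classes; size_t n_qos_classes;
    int qos_level;                      /* qos-level-sn to apply on match */
};

/* A value matches if it is listed explicitly or covered by a range. */
static bool in_list(uint64_t v, const struct range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (v >= r[i].lo && v <= r[i].hi)
            return true;
    return false;
}

static bool rule_matches(const struct rule *r, const struct request *q)
{
    /* Every field the rule defines must have its mask bit set in the
     * request; extra request fields the rule ignores are allowed. */
    if ((r->fields & q->comp_mask) != r->fields)
        return false;
    if ((r->fields & F_SERVICE_ID) &&
        !in_list(q->service_id, r->service_ids, r->n_service_ids))
        return false;
    if ((r->fields & F_QOS_CLASS) &&
        !in_list(q->qos_class, r->qos_classes, r->n_qos_classes))
        return false;
    return true;
}

/* First-fit sequential search; falls back to the default QoS-Level. */
int find_qos_level(const struct rule *rules, size_t n,
                   const struct request *q, int default_level)
{
    for (size_t i = 0; i < n; i++)
        if (rule_matches(&rules[i], q))
            return rules[i].qos_level;
    return default_level;
}

/* Example data loosely following the policy file example. */
static const struct range example_classes[] = { {7, 9}, {11, 11} };
static const struct range example_sids[]    = { {22, 22}, {4719, 5000} };
static const struct rule example_rules[] = {
    { .fields = F_QOS_CLASS,
      .qos_classes = example_classes, .n_qos_classes = 2, .qos_level = 1 },
    { .fields = F_SERVICE_ID,
      .service_ids = example_sids, .n_service_ids = 2, .qos_level = 2 },
};
```

With these rules, a query that sets only the QoS-Class bit with class 8 would map to level 1, while a query carrying Service-ID 4800 without the Service-ID mask bit set would fall through to the default level. A sequential scan keeps the first-fit ordering obvious; an implementation could index the rules for speed as long as the ordering is preserved.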
The efficient algorithm for finding paths that meet the QoS-Level criteria is beyond the scope of this RFC and left for the implementer to provide. However the criteria by which the paths match the QoS-Level are described below: * SL: The paths found should all use the given SL. For that sake PR/MPR algorithm should traverse the path from source to destination only through ports that carry a valid VL (not VL15) by the SL2VL map (should consider input and output ports and SL). * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit * Rate-Limit: The resulting paths RATE should not exceed the given RATE-Limit (rate limit is given in units of link BW = Width*Speed according to IBTA Specification Vol-1 table-205 p-901 l-24). * Path-Bits: define the target LID lowest bits (number of bits defined by the target port PortInfo.LMC field). The path should traverse the LFT using the target port LID with the path-bits set. * QoS-Class: should be returned in the result PR/MPR. When routing is going to be supported by OpenSM we might use this field in selecting the target router too in a TBD way. From sashak at voltaire.com Sat Jul 21 15:44:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 22 Jul 2007 01:44:20 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts: Handle new and old topology file format In-Reply-To: <46A0D03E.mail35T1S2JP6@systemfabricworks.com> References: <46A0D03E.mail35T1S2JP6@systemfabricworks.com> Message-ID: <20070721224419.GP16597@sashak.voltaire.com> On 10:09 Fri 20 Jul , davem at systemfabricworks.com wrote: > > > Fix infiniband-diags scripts to handle changed ibnetdiscover topology file > format and remain backward compatible with old file format. > > Signed-off-by: David A. McMillen Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Jul 21 15:44:46 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 22 Jul 2007 01:44:46 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibnetdiscover: Fix DDR link speed decode In-Reply-To: <46A0D0C3.mail37511HD5H@systemfabricworks.com> References: <46A0D0C3.mail37511HD5H@systemfabricworks.com> Message-ID: <20070721224446.GQ16597@sashak.voltaire.com> On 10:12 Fri 20 Jul , davem at systemfabricworks.com wrote: > > > Fix ibnetdiscover DDR link speed decode by moving string from [3] to [2]. > > Signed-off-by: David A. McMillen Applied. Thanks. Sasha From pradeeps at linux.vnet.ibm.com Sat Jul 21 15:46:15 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sat, 21 Jul 2007 15:46:15 -0700 Subject: [ofa-general] NOSRQ misc patch [PATCH V1] Message-ID: <46A28CB7.1040509@linux.vnet.ibm.com> This patch is to be applied on top of the IPOIB CM (NOSRQ) [PATCH V8]. This fixes the issues that Roland and Michael pointed out and more. 
Signed-off-by: Pradeep Satyanarayana --- --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-21 17:50:47.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-21 18:20:29.000000000 -0400 @@ -101,7 +101,6 @@ enum { #define IPOIB_CM_OP_RECV (1ul << 30) #define NOSRQ_INDEX_TABLE_SIZE 128 -#define NOSRQ_INDEX_MASK (NOSRQ_INDEX_TABLE_SIZE -1) #else #define IPOIB_CM_OP_RECV (0) #endif @@ -447,6 +446,7 @@ void ipoib_drain_cq(struct net_device *d /* We don't support UC connections at the moment */ #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC)) +extern int max_rc_qp ; static inline int ipoib_cm_admin_enabled(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-21 17:50:47.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-21 18:08:15.000000000 -0400 @@ -49,17 +49,18 @@ MODULE_PARM_DESC(cm_data_debug_level, #include "ipoib.h" -static int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; +int max_rc_qp = NOSRQ_INDEX_TABLE_SIZE; static int max_recv_buf = 1024; /* Default is 1024 MB */ module_param_named(nosrq_max_rc_qp, max_rc_qp, int, 0644); -MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported"); +MODULE_PARM_DESC(nosrq_max_rc_qp, "Max number of NOSRQ RC QPs supported; must be a power of 2"); module_param_named(max_receive_buffer, max_recv_buf, int, 0644); MODULE_PARM_DESC(max_receive_buffer, "Max Receive Buffer Size in MB"); static atomic_t current_rc_qp = ATOMIC_INIT(0); /* Active number of RC QPs for NOSRQ */ +#define NOSRQ_INDEX_MASK (max_rc_qp -1) #define IPOIB_CM_IETF_ID 0x1000000000000000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -1024,6 +1025,7 @@ void dev_stop_nosrq(struct ipoib_dev_pri spin_unlock_irq(&priv->lock); cancel_delayed_work(&priv->cm.stale_task); + kfree(priv->cm.rx_index_table); } void ipoib_cm_dev_stop(struct net_device *dev) @@ -1168,9 +1170,9 @@ static struct ib_qp 
*ipoib_cm_create_tx_ attr.recv_cq = priv->cq; attr.srq = priv->cm.srq; attr.cap.max_send_wr = ipoib_sendq_size; - attr.cap.max_recv_wr = 1; + attr.cap.max_recv_wr = 0; attr.cap.max_send_sge = 1; - attr.cap.max_recv_sge = 1; + attr.cap.max_recv_sge = 0; attr.sq_sig_type = IB_SIGNAL_ALL_WR; attr.qp_type = IB_QPT_RC; attr.send_cq = cq; @@ -1710,11 +1712,11 @@ int ipoib_cm_dev_init(struct net_device * passive_ids. For quick and easy access we maintain a table * of pointers to struct ipoib_cm_rx called the rx_index_table */ - priv->cm.rx_index_table = kzalloc(NOSRQ_INDEX_TABLE_SIZE * - sizeof *priv->cm.rx_index_table, - GFP_KERNEL); + priv->cm.rx_index_table = kcalloc(max_rc_qp, + sizeof *priv->cm.rx_index_table, + GFP_KERNEL); if (!priv->cm.rx_index_table) { - printk(KERN_WARNING "Failed to allocate NOSRQ_INDEX_TABLE\n"); + printk(KERN_WARNING "Failed to allocate rx_index_table\n"); return -ENOMEM; } } --- a/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-21 17:50:47.000000000 -0400 +++ b/linux-2.6.22/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-21 18:09:26.000000000 -0400 @@ -180,11 +180,11 @@ int ipoib_transport_dev_init(struct net_ /* We increase the size of the CQ in the NOSRQ case to prevent CQ * overflow. Every new REQ creates a new RX QP and each QP has an * RX ring associated with it. Therefore we could have - * NOSRQ_INDEX_TABLE_SIZE*ipoib_recvq_size + ipoib_sendq_size CQEs + * max_rc_qp*ipoib_recvq_size + ipoib_sendq_size CQEs * in a CQ. 
*/ if (!priv->cm.srq) - size += (NOSRQ_INDEX_TABLE_SIZE - 1) * ipoib_recvq_size; + size += (max_rc_qp - 1) * ipoib_recvq_size; #endif priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0); From kliteyn at mellanox.co.il Sat Jul 21 21:40:22 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 22 Jul 2007 07:40:22 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-22:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=559 Fail=1 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 13 LidMgr IS3-128.topo Failures: 1 LidMgr IS3-128.topo From mst at dev.mellanox.co.il Sat Jul 21 23:05:57 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 22 Jul 2007 09:05:57 +0300 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <46A28CB7.1040509@linux.vnet.ibm.com> References: <46A28CB7.1040509@linux.vnet.ibm.com> Message-ID: <20070722060557.GB20438@mellanox.co.il> > @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > attr.recv_cq = priv->cq; > attr.srq = priv->cm.srq; > attr.cap.max_send_wr = ipoib_sendq_size; > - attr.cap.max_recv_wr = 1; > + attr.cap.max_recv_wr = 0; > attr.cap.max_send_sge = 1; > - attr.cap.max_recv_sge = 1; > + attr.cap.max_recv_sge = 0; > attr.sq_sig_type = IB_SIGNAL_ALL_WR; > attr.qp_type = IB_QPT_RC; > attr.send_cq = cq; I don't see how does this fix things. This line > attr.srq = priv->cm.srq; connected the TX QP to SRQ, making it possible to get packets on this QP. But if cm.srq is NULL, and a remote sends a packet on this connection, the connection will get closed. Which is a quality of implementation issue. -- MST From ogerlitz at voltaire.com Sat Jul 21 23:17:58 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Jul 2007 09:17:58 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <469DD7BB.6060009@voltaire.com> References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> <20070718072841.GC1115@mellanox.co.il> <469DD7BB.6060009@voltaire.com> Message-ID: <46A2F696.4060007@voltaire.com> Or Gerlitz wrote: > Michael S. Tsirkin wrote: >>> And ARP table aging gives a way to recover >>> from stale cached data, eventually at least. >> Does it? 
>> $ grep path_list drivers/infiniband/ulp/ipoib/*c >> drivers/infiniband/ulp/ipoib/ipoib_main.c: list_add_tail(&path->list, &priv->path_list); >> drivers/infiniband/ulp/ipoib/ipoib_main.c: list_splice(&priv->path_list, &remove_list); >> drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); >> drivers/infiniband/ulp/ipoib/ipoib_main.c: INIT_LIST_HEAD(&priv->path_list); >> In other words we add paths to ipoib specific cache, but we never seem >> to *remove* individual paths from cache - we only know how to do >> full cache invalidates on events such as port state change. > this seems like a bug, if the stack decided to delete OR change a > neighbour, the path associated with it must not be re-used to create the > address handle or to establish the connection, same for multicast > neighbours. Roland, Can you provide your take here? Do you agree that using cached IB L2 info where the net stack wants to renew its IPoIB L2 (which is IB L3 && L4) info is a bug? Or. From krkumar2 at in.ibm.com Sat Jul 21 23:27:54 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Sun, 22 Jul 2007 11:57:54 +0530 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <1185023921.5192.45.camel@localhost> Message-ID: Hi Jamal, J Hadi Salim wrote on 07/21/2007 06:48:41 PM: > > - Use a single qdisc interface to avoid code duplication and reduce > > maintainability (sch_generic.c size reduces by ~9%). > > - Has per device configurable parameter to turn on/off batching. > > - qdisc_restart gets slightly modified while looking simple without > > any checks for batching vs regular code (infact only two lines have > > changed - 1. instead of dev_dequeue_skb, a new batch-aware function > > is called; and 2. an extra call to hard_start_xmit_batch. > > > - No change in__qdisc_run other than a new argument (from DM's idea). > > - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code. > > All the above are cosmetic differences. 
To me is the highest priority > is making sure that batching is useful and what the limitations are. > At some point, when all looks good - i dont mind adding an ethtool > interface to turn off/on batching, merge with the new qdisc restart path > instead of having a parallel path, solicit feedback on naming, where to > allocate structs etc etc. All that is low prio if batching across a > variety of hardware and applications doesnt prove useful. At the moment, > i am unsure theres consistency to justify push batching in. Batching need not be useful for every hardware. If there is hardware that is useful to exploit batching (like clearly IPoIB is a good candidate as both the TX and the TX completion path can handle multiple skb processing, and I haven't looked at other drivers to see if any of them can do something similar), then IMHO it makes sense to enable batching for that hardware. It is upto the other drivers to determine whether converting to the batching API makes sense or not. And as indicated, the total size increase for adding the kernel support is also insignificant - 0.03%, or 1164 Bytes (using the 'size' command). > Having said that below are the main architectural differences we have > which is what we really need to discuss and see what proves useful: > > > - Batching algo/processing is different (eg. if > > qdisc_restart() finds > > one skb in the batch list, it will try to batch more (upto a limit) > > instead of sending that out and batching the rest in the next call. > > This sounds a little more aggressive but maybe useful. > I have experimented with setting upper bound limits (current patches > have a pktgen interface to set the max to send) and have concluded that > it is unneeded. Probing by letting the driver tell you what space is > available has proven to be the best approach. I have been meaning to > remove the code in pktgen which allows these limits. 
I don't quite agree with that approach, eg, if the blist is empty and the driver tells there is space for one packet, you will add one packet and the driver sends it out and the device is stopped (with potentially lot of skbs on dev->q). Then no packets are added till the queue is enabled, at which time a flood of skbs will be processed increasing latency and holding lock for a single longer duration. My approach will mitigate holding lock for longer times and instead send skbs to the device as long as we are within the limits. Infact in my rev2 patch (being today or tomorrow after handling Patrick's and Stephen's comments), I am even removing the driver specific xmit_slots as I find it is adding bloat and requires more cycles than calculating the value each time xmit is done (ofcourse in your approach it is required since the stack uses it). > > - Jamal's code has a separate hw prep handler called from the stack, > > and results are accessed in driver during xmit later. > > I have explained the reasoning to this a few times. A recent response to > Michael Chan is here: > http://marc.info/?l=linux-netdev&m=118346921316657&w=2 Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find it), I feel having prep will not help as no other cpu can execute the queue/xmit code anyway (E1000 is also a LLTX driver). Other driver that hold tx lock could get improvement however. > And heres a response to you that i havent heard back on: > http://marc.info/?l=linux-netdev&m=118355539503924&w=2 That is because it answered my query :) It is what I was expecting, but thanks for the explanation. > My tests so far indicate this interface is useful. It doesnt apply well I wonder if you tried enabling/disabling 'prep' on E1000 to see how the performance is affected. If it helps, I guess you could send me a patch to add that and I can also test it to see what the effect is. 
I didn't add it since IPoIB wouldn't be able to exploit it (unless someone is kind enough to show me how to). > So if i was to sum up this, (it would be useful discussion to have on > these) the real difference is: > > a) you have an extra check on refilling the skb list when you find that > it has a single skb. I tagged this as being potentially useful. It is very useful since extra processing is not required for one skb case - you remove it from list and unnecessarily add it to a different list and then delete it immediately in the driver when all that was required is to pass the skb directly to the driver using it's original API (ofcourse the caveat is that I also have a check to add that *single* skb to the blist in case there are already earlier skbs on the blist, this helps in batching and more importantly - to send skbs in order). > b) You have a check for some upper bound on the number of skbs to send > to the driver. I tagged this as unnecessary - the interface is still on > in my current code, so it shouldnt be hard to show one way or other. Explained earlier wrt latency. > c) You dont have prep_xmit() > > Add to that list any other architectural differences i may have missed > and lets discuss and hopefully make some good progress. I think the code I have is ready and stable, and the issues pointed out so far is also incorporated and to be sent out today. Please let me know if you want to add something to it. Thanks for your review/comments, - KK From eitan at mellanox.co.il Sat Jul 21 23:36:06 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 22 Jul 2007 09:36:06 +0300 Subject: [ofa-general] opensm: a bug in heavy sweep? - no LFT re-configuration Message-ID: <863azhrlm1.fsf@sw053.lab.mtl.com> Hi Sasha I am running some tests manually and apparently it looks like I found a bug. Here is the sequence of things: 1. SM sweeps the fabric assign LFTs 2. I manually modify some LFTs (single entry now marked UNREACHABLE 3. 
I force some switch change bit to 1 or issue kill -HUP 4. The SM reports SUBNET UP 5. The modified LFT entry is still UNREACHABLE and the path is broken It looks to me some optimization of routing does not fully reroute unless some condition is met - but that condition does not include the above triggers listed in step 3. Thanks Eitan From pradeeps at linux.vnet.ibm.com Sun Jul 22 00:16:10 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sun, 22 Jul 2007 00:16:10 -0700 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <20070722060557.GB20438@mellanox.co.il> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> Message-ID: <46A3043A.3030200@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ >> attr.recv_cq = priv->cq; >> attr.srq = priv->cm.srq; >> attr.cap.max_send_wr = ipoib_sendq_size; >> - attr.cap.max_recv_wr = 1; >> + attr.cap.max_recv_wr = 0; >> attr.cap.max_send_sge = 1; >> - attr.cap.max_recv_sge = 1; >> + attr.cap.max_recv_sge = 0; >> attr.sq_sig_type = IB_SIGNAL_ALL_WR; >> attr.qp_type = IB_QPT_RC; >> attr.send_cq = cq; > > I don't see how does this fix things. > This line >> attr.srq = priv->cm.srq; > connected the TX QP to SRQ, making it possible to get packets on this QP. > But if cm.srq is NULL, and a remote sends a packet on this connection, > the connection will get closed. Which is a quality of implementation issue. > When the QP numbers are exchanged correctly, then it should not receive a packet on this QP in the first place. That is an error case and so should be a rare event. Assuming that still happens, that should be setup again because it is an RC connection. Pradeep From mst at dev.mellanox.co.il Sun Jul 22 00:20:43 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 22 Jul 2007 10:20:43 +0300 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <46A3043A.3030200@linux.vnet.ibm.com> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> Message-ID: <20070722072043.GB7188@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: NOSRQ misc patch [PATCH V1] > > Michael S. Tsirkin wrote: > >> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > >> attr.recv_cq = priv->cq; > >> attr.srq = priv->cm.srq; > >> attr.cap.max_send_wr = ipoib_sendq_size; > >> - attr.cap.max_recv_wr = 1; > >> + attr.cap.max_recv_wr = 0; > >> attr.cap.max_send_sge = 1; > >> - attr.cap.max_recv_sge = 1; > >> + attr.cap.max_recv_sge = 0; > >> attr.sq_sig_type = IB_SIGNAL_ALL_WR; > >> attr.qp_type = IB_QPT_RC; > >> attr.send_cq = cq; > > > > I don't see how does this fix things. > > This line > >> attr.srq = priv->cm.srq; > > connected the TX QP to SRQ, making it possible to get packets on this QP. > > But if cm.srq is NULL, and a remote sends a packet on this connection, > > the connection will get closed. Which is a quality of implementation issue. > > > When the QP numbers are exchanged correctly, then it should not receive > a packet on this QP in the first place. Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting packets. We don't do this currently but we might in the future. > That is an error case and so should > be a rare event. Assuming that still happens, that should be setup again > because it is an RC connection. Won't it closed immediately again once remote tries to use it? 
-- MST From vlad at lists.openfabrics.org Sun Jul 22 01:37:11 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 22 Jul 2007 01:37:11 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070722-0100 daily build status Message-ID: <20070722083711.537DDE60825@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with 
linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with linux-2.6.18-8.el5 Failed: From krkumar2 at in.ibm.com Sun Jul 22 02:04:57 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:34:57 +0530 Subject: [ofa-general] [PATCH 00/12 -Rev2] Implement batching skb API Message-ID: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> This set of patches implements the batching API, and makes the following changes resulting from the review of the first set: Changes : --------- 1. Changed skb_blist from pointer to static as it saves only 12 bytes (i386), but bloats the code. 2. Removed requirement for driver to set "features & NETIF_F_BATCH_SKBS" in register_netdev to enable batching as it is redundant. Changed this flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other user-changeable calls can modify this bit to enable/disable batching. 3. Added ethtool support to enable/disable batching (not tested). 4. Added rtnetlink support to enable/disable batching (not tested). 5. Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers should not have a small queue anyway (adding bloat). 6. skbs are purged from dev_deactivate instead of from unregister_netdev to drop all references to the device. 7. Removed changelog in source code in sch_generic.c, and unrelated renames from sch_generic.c (lockless, comments). 8.
Removed xmit_slots entirely, as it was adding bloat (code and header) and not adding value (it is calculated and set twice in internal send routine and handle work completion, and referenced once in batch xmit; and can instead be calculated once in xmit). Issues : -------- 1. Remove /sysfs support completely ? 2. Whether rtnetlink support is required as GSO has only ethtool ? Patches are described as: Mail 0/12 : This mail. Mail 1/12 : HOWTO documentation. Mail 2/12 : Changes to netdevice.h Mail 3/12 : dev.c changes. Mail 4/12 : Ethtool changes. Mail 5/12 : sysfs changes. Mail 6/12 : rtnetlink changes. Mail 7/12 : Change in qdisc_run & qdisc_restart API, modify callers to use this API. Mail 8/12 : IPoIB include file changes. Mail 9/12 : IPoIB verbs changes Mail 10/12 : IPoIB multicast, CM changes Mail 11/12 : IPoIB xmit API addition Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c) I have started a 10 run test for various buffer sizes and processes, and will post the results on Monday. Please review and provide feedback/ideas; and consider for inclusion. Thanks, - KK From krkumar2 at in.ibm.com Sun Jul 22 02:05:06 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:06 +0530 Subject: [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB. In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090506.7787.69681.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/Documentation/networking/Batching_skb_API.txt rev2/Documentation/networking/Batching_skb_API.txt --- org/Documentation/networking/Batching_skb_API.txt 1970-01-01 05:30:00.000000000 +0530 +++ rev2/Documentation/networking/Batching_skb_API.txt 2007-07-20 16:09:45.000000000 +0530 @@ -0,0 +1,91 @@ + HOWTO for batching skb API support + ----------------------------------- + +Section 1: What is batching skb API ? 
+Section 2: How batching API works vs the original API ? +Section 3: How drivers can support this API ? +Section 4: How users can work with this API ? + + +Introduction: Kernel support for batching skb +----------------------------------------------- + +An extended API is supported in the netdevice layer, which is very similar +to the existing hard_start_xmit() API. Drivers which wish to take advantage +of this new API should implement this routine similar to how the +hard_start_xmit handler is written. The difference between these API's is +that while the existing hard_start_xmit processes one skb, the new API can +process multiple skbs (or even one) in a single call. It is also possible +for the driver writer to re-use most of the code from the existing API in +the new API without having code duplication. + + +Section 1: What is batching skb API ? +------------------------------------- + + This is a new API that is optionally exported by a driver. The pre- + requisite for a driver to use this API is that it should have a + reasonably sized hardware queue that can process multiple skbs. + + +Section 2: How batching API works vs the original API ? +------------------------------------------------------- + + The networking stack normally gets called from upper layer protocols + with a single skb to xmit. This skb is first enqueue'd and an + attempt is next made to transmit it immediately (via qdisc_run). + However, events like driver lock contention, queue stopped, etc, can + result in the skb not getting sent out, and it remains in the queue. + When a new xmit is called or when the queue is re-enabled, qdisc_run + could potentially find multiple packets in the queue, and have to + send them all out one by one iteratively. + + The batching skb API case was added to exploit this situation where + if there are multiple skbs, all of them can be sent to the device in + one shot. 
This reduces driver processing, locking at the driver (or + in stack for ~LLTX drivers) gets amortized over multiple skbs, and + in case of specific drivers where every xmit results in a completion + processing (like IPoIB), optimizations could be made in the driver + to get a completion for only the last skb that was sent which will + result in saving interrupts for every (but the last) skb that was + sent in the same batch. + + This batching can result in significant performance gains for + systems that have multiple data stream paths over the same network + interface card. + + +Section 3: How drivers can support this API ? +--------------------------------------------- + + The new API - dev->hard_start_xmit_batch(struct net_device *dev), + simplistically, can be written almost identically to the regular + xmit API (hard_start_xmit), except that all skbs on dev->skb_blist + should be processed by the driver instead of just one skb. The new + API doesn't get any skb as argument to process, instead it picks up + all the skbs from dev->skb_blist, where it was added by the stack, + and tries to send them out. + + Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in + dev->features, and dev->hard_start_xmit_batch should point to the + new API implemented for that driver. + + +Section 4: How users can work with this API ? +--------------------------------------------- + + Batching could be disabled for a particular device, e.g. on desktop + systems if only one stream of network activity for that device is + taking place, since performance could be slightly affected due to + extra processing that batching adds. Batching can be enabled if + more than one stream of network activity per device is being done, + e.g. on servers, or even desktop usage with multiple browser, chat, + file transfer sessions, etc. 
+ + Per device batching can be enabled/disabled using: + + echo 1 > /sys/class/net//tx_batch_skbs (enable) + echo 0 > /sys/class/net//tx_batch_skbs (disable) + + E.g. to enable batching on eth0, run: + echo 1 > /sys/class/net/eth0/tx_batch_skbs From krkumar2 at in.ibm.com Sun Jul 22 02:05:16 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:16 +0530 Subject: [ofa-general] [PATCH 02/12 -Rev2] Changes to netdevice.h In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/include/linux/netdevice.h rev2/include/linux/netdevice.h --- org/include/linux/netdevice.h 2007-07-20 07:49:28.000000000 +0530 +++ rev2/include/linux/netdevice.h 2007-07-22 13:20:16.000000000 +0530 @@ -340,6 +340,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_GSO 2048 /* Enable software GSO. */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_BATCH_ON 8192 /* Batching skbs xmit API is enabled */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ /* Segmentation offload features */ @@ -452,6 +453,7 @@ struct net_device struct Qdisc *qdisc_sleeping; struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ + struct sk_buff_head skb_blist; /* List of batch skbs */ /* Partially transmitted GSO packet. */ struct sk_buff *gso_skb; @@ -472,6 +474,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + int (*hard_start_xmit_batch) (struct net_device + *dev); + /* These may be needed for future network-power-down code. 
*/ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -582,6 +587,8 @@ struct net_device #define NETDEV_ALIGN 32 #define NETDEV_ALIGN_CONST (NETDEV_ALIGN - 1) +#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0) + static inline void *netdev_priv(const struct net_device *dev) { return dev->priv; @@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_add_skb_to_blist(struct sk_buff *skb, + struct net_device *dev); extern void dev_init(void); @@ -1104,6 +1113,8 @@ extern void dev_set_promiscuity(struct extern void dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); extern void netdev_features_change(struct net_device *dev); +extern int dev_change_tx_batching(struct net_device *dev, + unsigned long new_batch_skb); /* Load a device via the kmod */ extern void dev_load(const char *name); extern void dev_mcast_init(void); From krkumar2 at in.ibm.com Sun Jul 22 02:05:25 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:25 +0530 Subject: [ofa-general] [PATCH 03/12 -Rev2] dev.c changes. In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/net/core/dev.c rev2/net/core/dev.c --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530 @@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi } } +/* + * dev_change_tx_batching - Enable or disable batching for a driver that + * supports batching. 
+ */ +int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb) +{ + int ret; + + if (!dev->hard_start_xmit_batch) { + /* Driver doesn't support skb batching */ + ret = -ENOTSUPP; + goto out; + } + + /* Handle invalid argument */ + if (new_batch_skb < 0) { + ret = -EINVAL; + goto out; + } + + ret = 0; + + /* Check if new value is same as the current */ + if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb) + goto out; + + spin_lock(&dev->queue_lock); + if (new_batch_skb) { + dev->features |= NETIF_F_BATCH_ON; + dev->tx_queue_len >>= 1; + } else { + if (!skb_queue_empty(&dev->skb_blist)) + skb_queue_purge(&dev->skb_blist); + dev->features &= ~NETIF_F_BATCH_ON; + dev->tx_queue_len <<= 1; + } + spin_unlock(&dev->queue_lock); + +out: + return ret; +} + /** * dev_load - load a network module * @name: name of interface @@ -1414,6 +1456,45 @@ static int dev_gso_segment(struct sk_buf return 0; } +/* + * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are + * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs + * are dequeued from this list when we call the driver, so the list is safe + * from simultaneous deletes too. + * + * Returns count of successful skb(s) added to skb_blist. 
+ */ +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev) +{ + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree(skb); + return 0; + } + + if (skb->next) { + int count = 0; + + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + __skb_queue_tail(&dev->skb_blist, nskb); + count++; + } while (skb->next); + + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); + return count; + } + } + __skb_queue_tail(&dev->skb_blist, skb); + return 1; +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { if (likely(!skb->next)) { @@ -3397,6 +3483,12 @@ int register_netdevice(struct net_device } } + if (dev->hard_start_xmit_batch) { + dev->features |= NETIF_F_BATCH_ON; + skb_queue_head_init(&dev->skb_blist); + dev->tx_queue_len >>= 1; + } + /* * nil rebuild_header routine, * that should be never called and used as just bug trap. From krkumar2 at in.ibm.com Sun Jul 22 02:05:35 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:35 +0530 Subject: [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090534.7787.8673.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/include/linux/ethtool.h rev2/include/linux/ethtool.h --- org/include/linux/ethtool.h 2007-07-21 13:39:50.000000000 +0530 +++ rev2/include/linux/ethtool.h 2007-07-21 13:40:57.000000000 +0530 @@ -414,6 +414,8 @@ struct ethtool_ops { #define ETHTOOL_SUFO 0x00000022 /* Set UFO enable (ethtool_value) */ #define ETHTOOL_GGSO 0x00000023 /* Get GSO enable (ethtool_value) */ #define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */ +#define ETHTOOL_GBTX 0x00000025 /* Get Batching (ethtool_value) */ +#define ETHTOOL_SBTX 0x00000026 /* Set Batching 
(ethtool_value) */ /* compatibility with older code */ #define SPARC_ETH_GSET ETHTOOL_GSET diff -ruNp org/net/core/ethtool.c rev2/net/core/ethtool.c --- org/net/core/ethtool.c 2007-07-21 13:37:17.000000000 +0530 +++ rev2/net/core/ethtool.c 2007-07-21 22:55:38.000000000 +0530 @@ -648,6 +648,26 @@ static int ethtool_set_gso(struct net_de return 0; } +static int ethtool_get_batch(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata = { ETHTOOL_GBTX }; + + edata.data = BATCHING_ON(dev); + if (copy_to_user(useraddr, &edata, sizeof(edata))) + return -EFAULT; + return 0; +} + +static int ethtool_set_batch(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata; + + if (copy_from_user(&edata, useraddr, sizeof(edata))) + return -EFAULT; + + return dev_change_tx_batching(dev, edata.data); +} + static int ethtool_self_test(struct net_device *dev, char __user *useraddr) { struct ethtool_test test; @@ -959,6 +979,12 @@ int dev_ethtool(struct ifreq *ifr) case ETHTOOL_SGSO: rc = ethtool_set_gso(dev, useraddr); break; + case ETHTOOL_GBTX: + rc = ethtool_get_batch(dev, useraddr); + break; + case ETHTOOL_SBTX: + rc = ethtool_set_batch(dev, useraddr); + break; default: rc = -EOPNOTSUPP; } From krkumar2 at in.ibm.com Sun Jul 22 02:05:44 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:44 +0530 Subject: [ofa-general] [PATCH 05/12 -Rev2] sysfs changes. 
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090544.7787.87947.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/net/core/net-sysfs.c rev2/net/core/net-sysfs.c --- org/net/core/net-sysfs.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/net/core/net-sysfs.c 2007-07-21 22:56:32.000000000 +0530 @@ -230,6 +230,21 @@ static ssize_t store_weight(struct devic return netdev_store(dev, attr, buf, len, change_weight); } +static ssize_t show_tx_batch_skb(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + return sprintf(buf, fmt_dec, BATCHING_ON(netdev)); +} + +static ssize_t store_tx_batch_skb(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + return netdev_store(dev, attr, buf, len, dev_change_tx_batching); +} + static struct device_attribute net_class_attributes[] = { __ATTR(addr_len, S_IRUGO, show_addr_len, NULL), __ATTR(iflink, S_IRUGO, show_iflink, NULL), @@ -246,6 +261,8 @@ static struct device_attribute net_class __ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags), __ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len, store_tx_queue_len), + __ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skb, + store_tx_batch_skb), __ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight), {} }; From krkumar2 at in.ibm.com Sun Jul 22 02:05:53 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:35:53 +0530 Subject: [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes. 
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h --- org/include/linux/if_link.h 2007-07-20 16:33:35.000000000 +0530 +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.000000000 +0530 @@ -78,6 +78,8 @@ enum IFLA_LINKMODE, IFLA_LINKINFO, #define IFLA_LINKINFO IFLA_LINKINFO + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */ +#define IFLA_TXBTHSKB IFLA_TXBTHSKB __IFLA_MAX }; diff -ruNp org/net/core/rtnetlink.c rev2/net/core/rtnetlink.c --- org/net/core/rtnetlink.c 2007-07-20 16:31:59.000000000 +0530 +++ rev2/net/core/rtnetlink.c 2007-07-21 22:27:10.000000000 +0530 @@ -634,6 +634,7 @@ static int rtnl_fill_ifinfo(struct sk_bu NLA_PUT_STRING(skb, IFLA_IFNAME, dev->name); NLA_PUT_U32(skb, IFLA_TXQLEN, dev->tx_queue_len); + NLA_PUT_U32(skb, IFLA_TXBTHSKB, BATCHING_ON(dev)); NLA_PUT_U32(skb, IFLA_WEIGHT, dev->weight); NLA_PUT_U8(skb, IFLA_OPERSTATE, netif_running(dev) ? 
dev->operstate : IF_OPER_DOWN); @@ -833,7 +834,8 @@ static int do_setlink(struct net_device if (tb[IFLA_TXQLEN]) dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); - + if (tb[IFLA_TXBTHSKB]) + dev_change_tx_batching(dev, nla_get_u32(tb[IFLA_TXBTHSKB])); if (tb[IFLA_WEIGHT]) dev->weight = nla_get_u32(tb[IFLA_WEIGHT]); @@ -1072,6 +1074,9 @@ replay: nla_len(tb[IFLA_BROADCAST])); if (tb[IFLA_TXQLEN]) dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); + if (tb[IFLA_TXBTHSKB]) + dev_change_tx_batching(dev, + nla_get_u32(tb[IFLA_TXBTHSKB])); if (tb[IFLA_WEIGHT]) dev->weight = nla_get_u32(tb[IFLA_WEIGHT]); if (tb[IFLA_OPERSTATE]) From krkumar2 at in.ibm.com Sun Jul 22 02:06:02 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:02 +0530 Subject: [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090602.7787.50560.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/include/net/pkt_sched.h rev2/include/net/pkt_sched.h --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 +++ rev2/include/net/pkt_sched.h 2007-07-20 16:09:45.000000000 +0530 @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); -extern void __qdisc_run(struct net_device *dev); +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); -static inline void qdisc_run(struct net_device *dev) +static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { if (!netif_queue_stopped(dev) && !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) - __qdisc_run(dev); + __qdisc_run(dev, blist); } extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp, diff -ruNp org/net/sched/sch_generic.c rev2/net/sched/sch_generic.c --- 
org/net/sched/sch_generic.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/net/sched/sch_generic.c 2007-07-22 12:11:10.000000000 +0530 @@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { - if (unlikely(skb->next)) - dev->gso_skb = skb; - else - q->ops->requeue(skb, q); + if (likely(skb)) { + if (unlikely(skb->next)) + dev->gso_skb = skb; + else + q->ops->requeue(skb, q); + } netif_schedule(dev); return 0; @@ -91,18 +93,23 @@ static inline int handle_dev_cpu_collisi /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. We - * detect it by checking xmit owner and drop the packet when - * deadloop is detected. Return OK to try the next skb. + * detect it by checking xmit owner and drop skb (or all + * skbs in batching case) when deadloop is detected. Return + * OK to try the next skb. */ - kfree_skb(skb); + if (likely(skb)) + kfree_skb(skb); + else if (!skb_queue_empty(&dev->skb_blist)) + skb_queue_purge(&dev->skb_blist); + if (net_ratelimit()) printk(KERN_WARNING "Dead loop on netdevice %s, " "fix it urgently!\n", dev->name); ret = qdisc_qlen(q); } else { /* - * Another cpu is holding lock, requeue & delay xmits for - * some time. + * Another cpu is holding lock. Requeue skb and delay xmits + * for some time. */ __get_cpu_var(netdev_rx_stat).cpu_collision++; ret = dev_requeue_skb(skb, dev, q); @@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi } /* + * Algorithm to get skb(s) is: + * - Non batching drivers, or if the batch list is empty and there is + * atmost one skb in the queue - dequeue skb and put it in *skbp to + * tell the caller to use the single xmit API. 
+ * - Batching drivers where the batch list already contains atleast one + * skb or if there are multiple skbs in the queue: keep dequeue'ing + * skb's upto a limit and set *skbp to NULL to tell the caller to use + * the multiple xmit API. + * + * Returns: + * 1 - atleast one skb is to be sent out, *skbp contains skb or NULL + * (in case >1 skbs present in blist for batching) + * 0 - no skbs to be sent. + */ +static inline int get_skb(struct net_device *dev, struct Qdisc *q, + struct sk_buff_head *blist, struct sk_buff **skbp) +{ + if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) { + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL); + } else { + int max = dev->tx_queue_len - skb_queue_len(blist); + struct sk_buff *skb; + + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL) + max -= dev_add_skb_to_blist(skb, dev); + + *skbp = NULL; + return 1; /* we have atleast one skb in blist */ + } +} + +/* * NOTE: Called under dev->queue_lock with locally disabled BH. * * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this @@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi * >0 - queue is not empty. 
* */ -static inline int qdisc_restart(struct net_device *dev) +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *blist) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; @@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n int ret; /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) + if (unlikely(!get_skb(dev, q, blist, &skb))) return 0; /* @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n /* And release queue */ spin_unlock(&dev->queue_lock); - ret = dev_hard_start_xmit(skb, dev); + if (likely(skb)) + ret = dev_hard_start_xmit(skb, dev); + else + ret = dev->hard_start_xmit_batch(dev); if (!lockless) netif_tx_unlock(dev); @@ -168,7 +211,7 @@ static inline int qdisc_restart(struct n switch (ret) { case NETDEV_TX_OK: - /* Driver sent out skb successfully */ + /* Driver sent out skb (or entire skb_blist) successfully */ ret = qdisc_qlen(q); break; @@ -179,8 +222,8 @@ static inline int qdisc_restart(struct n default: /* Driver returned NETDEV_TX_BUSY - requeue skb */ - if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit())) - printk(KERN_WARNING "BUG %s code %d qlen %d\n", + if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit()) + printk(KERN_WARNING " %s: BUG. 
code %d qlen %d\n", dev->name, ret, q->q.qlen); ret = dev_requeue_skb(skb, dev, q); @@ -190,10 +233,10 @@ static inline int qdisc_restart(struct n return ret; } -void __qdisc_run(struct net_device *dev) +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist) { do { - if (!qdisc_restart(dev)) + if (!qdisc_restart(dev, blist)) break; } while (!netif_queue_stopped(dev)); @@ -567,6 +610,13 @@ void dev_deactivate(struct net_device *d skb = dev->gso_skb; dev->gso_skb = NULL; + + if (BATCHING_ON(dev)) { + /* Free skbs on batch list */ + if (!skb_queue_empty(&dev->skb_blist)) + skb_queue_purge(&dev->skb_blist); + } + spin_unlock_bh(&dev->queue_lock); kfree_skb(skb); diff -ruNp org/net/core/dev.c rev2/net/core/dev.c --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530 @@ -1647,7 +1647,7 @@ gso: /* reset queue_mapping to zero */ skb->queue_mapping = 0; rc = q->enqueue(skb, q); - qdisc_run(dev); + qdisc_run(dev, NULL); spin_unlock(&dev->queue_lock); rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc; @@ -1844,7 +1844,12 @@ static void net_tx_action(struct softirq clear_bit(__LINK_STATE_SCHED, &dev->state); if (spin_trylock(&dev->queue_lock)) { - qdisc_run(dev); + /* + * Try to send out all skbs if batching is + * enabled. + */ + qdisc_run(dev, BATCHING_ON(dev) ? + &dev->skb_blist : NULL); spin_unlock(&dev->queue_lock); } else { netif_schedule(dev); From krkumar2 at in.ibm.com Sun Jul 22 02:06:17 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:17 +0530 Subject: [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes. 
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090612.7787.63282.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h rev2/drivers/infiniband/ulp/ipoib/ipoib.h --- org/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 16:09:45.000000000 +0530 @@ -269,8 +269,8 @@ struct ipoib_dev_priv { struct ipoib_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; - struct ib_sge tx_sge; - struct ib_send_wr tx_wr; + struct ib_sge *tx_sge; + struct ib_send_wr *tx_wr; struct ib_wc ibwc[IPOIB_NUM_WC]; @@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); +int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb, + struct ipoib_dev_priv *priv, int snum, int tx_index, + struct ipoib_ah *address, u32 qpn); void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ipoib_ah *address, u32 qpn); + struct ipoib_ah *address, u32 qpn, int num_skbs); void ipoib_reap_ah(struct work_struct *work); void ipoib_flush_paths(struct net_device *dev); From krkumar2 at in.ibm.com Sun Jul 22 02:06:26 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:26 +0530 Subject: [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090626.7787.25000.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c --- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-20 
16:09:45.000000000 +0530 @@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_ .max_send_sge = 1, .max_recv_sge = 1 }, - .sq_sig_type = IB_SIGNAL_ALL_WR, + .sq_sig_type = IB_SIGNAL_REQ_WR, /* 11.2.4.1 */ .qp_type = IB_QPT_UD }; - - int ret, size; + struct ib_send_wr *next_wr = NULL; + int i, ret, size; priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { @@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_ priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; - priv->tx_sge.lkey = priv->mr->lkey; - - priv->tx_wr.opcode = IB_WR_SEND; - priv->tx_wr.sg_list = &priv->tx_sge; - priv->tx_wr.num_sge = 1; - priv->tx_wr.send_flags = IB_SEND_SIGNALED; + for (i = ipoib_sendq_size - 1; i >= 0; i--) { + priv->tx_sge[i].lkey = priv->mr->lkey; + priv->tx_wr[i].opcode = IB_WR_SEND; + priv->tx_wr[i].sg_list = &priv->tx_sge[i]; + priv->tx_wr[i].num_sge = 1; + priv->tx_wr[i].send_flags = 0; + + /* Link the list properly for provider to use */ + priv->tx_wr[i].next = next_wr; + next_wr = &priv->tx_wr[i]; + } return 0; From krkumar2 at in.ibm.com Sun Jul 22 02:06:40 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:40 +0530 Subject: [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090640.7787.17578.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c --- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-20 16:09:45.000000000 +0530 @@ -493,14 +493,19 @@ static inline int post_send(struct ipoib unsigned int wr_id, u64 addr, int len) { + int ret; struct ib_send_wr *bad_wr; - priv->tx_sge.addr = addr; - 
priv->tx_sge.length = len; + priv->tx_sge[0].addr = addr; + priv->tx_sge[0].length = len; + + priv->tx_wr[0].wr_id = wr_id; - priv->tx_wr.wr_id = wr_id; + priv->tx_wr[0].next = NULL; + ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr); + priv->tx_wr[0].next = &priv->tx_wr[1]; - return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); + return ret; } void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c --- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-20 16:09:45.000000000 +0530 @@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid))) { priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); - priv->tx_wr.wr.ud.remote_qkey = priv->qkey; + priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey; } if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { @@ -736,7 +736,7 @@ out: } } - ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1); } unlock: From krkumar2 at in.ibm.com Sun Jul 22 02:06:49 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:49 +0530 Subject: [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-22 00:08:37.000000000 +0530 @@ -242,8 +242,9 @@ repost: static void 
ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { struct ipoib_dev_priv *priv = netdev_priv(dev); + int i = 0, num_completions; + int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1); unsigned int wr_id = wc->wr_id; - struct ipoib_tx_buf *tx_req; unsigned long flags; ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct return; } - tx_req = &priv->tx_ring[wr_id]; + num_completions = wr_id - tx_ring_index + 1; + if (num_completions <= 0) + num_completions += ipoib_sendq_size; - ib_dma_unmap_single(priv->ca, tx_req->mapping, - tx_req->skb->len, DMA_TO_DEVICE); + /* + * Handle skbs completion from tx_tail to wr_id. It is possible to + * handle WC's from earlier post_sends (possible multiple) in this + * iteration as we move from tx_tail to wr_id, since if the last + * WR (which is the one which had a completion request) failed to be + * sent for any of those earlier request(s), no completion + * notification is generated for successful WR's of those earlier + * request(s). + */ + while (1) { + /* + * Could use while (i < num_completions), but it is costly + * since in most cases there is 1 completion, and we end up + * doing an extra "index = (index+1) & (ipoib_sendq_size-1)" + */ + struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index]; + + if (likely(tx_req->skb)) { + ib_dma_unmap_single(priv->ca, tx_req->mapping, + tx_req->skb->len, DMA_TO_DEVICE); - ++priv->stats.tx_packets; - priv->stats.tx_bytes += tx_req->skb->len; + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; - dev_kfree_skb_any(tx_req->skb); + dev_kfree_skb_any(tx_req->skb); + } + /* + * else this skb failed synchronously when posted and was + * freed immediately. 
+ */ + + if (++i == num_completions) + break; + + /* More WC's to handle */ + tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1); + } spin_lock_irqsave(&priv->tx_lock, flags); - ++priv->tx_tail; + + priv->tx_tail += num_completions; if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) && priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); netif_wake_queue(dev); } + spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && @@ -340,78 +375,178 @@ void ipoib_ib_completion(struct ib_cq *c netif_rx_schedule(dev_ptr); } -static inline int post_send(struct ipoib_dev_priv *priv, - unsigned int wr_id, - struct ib_ah *address, u32 qpn, - u64 addr, int len) +/* + * post_send : Post WR(s) to the device. + * + * num_skbs is the number of WR's, 'start_index' is the first slot in + * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a + * previous post_send returned error and we are trying to send the untried + * WR's, in which case start_index will point to the first untried WR. + * + * We also break the WR link before posting so that the driver knows how + * many WR's to process, and this is set back after the post. 
+ */ +static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn, + int start_index, int num_skbs, + struct ib_send_wr **bad_wr) { - struct ib_send_wr *bad_wr; + int ret; + struct ib_send_wr *last_wr, *next_wr; + + last_wr = &priv->tx_wr[start_index + num_skbs - 1]; + + /* Set Completion Notification for last WR */ + last_wr->send_flags = IB_SEND_SIGNALED; - priv->tx_sge.addr = addr; - priv->tx_sge.length = len; + /* Terminate the last WR */ + next_wr = last_wr->next; + last_wr->next = NULL; - priv->tx_wr.wr_id = wr_id; - priv->tx_wr.wr.ud.remote_qpn = qpn; - priv->tx_wr.wr.ud.ah = address; + /* Send all the WR's in one doorbell */ + ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr); - return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr); + /* Restore send_flags & WR chain */ + last_wr->send_flags = 0; + last_wr->next = next_wr; + + return ret; } -void ipoib_send(struct net_device *dev, struct sk_buff *skb, - struct ipoib_ah *address, u32 qpn) +/* + * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr + * to pass to the driver. + * + * Returns : + * - 0 on successful processing of the skb + * - 1 if the skb was freed. 
+ */ +int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb, + struct ipoib_dev_priv *priv, int wr_num, + int tx_ring_index, struct ipoib_ah *address, u32 qpn) { - struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_tx_buf *tx_req; u64 addr; + struct ipoib_tx_buf *tx_req; if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) { - ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + ipoib_warn(priv, "packet len %d (> %d) too long to " + "send, dropping\n", skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN); ++priv->stats.tx_dropped; ++priv->stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu); - return; + return 1; } - ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + ipoib_dbg_data(priv, "sending packet, length=%d address=%p " + "qpn=0x%06x\n", skb->len, address, qpn); /* * We put the skb into the tx_ring _before_ we call post_send() * because it's entirely possible that the completion handler will - * run before we execute anything after the post_send(). That + * run before we execute anything after the post_send(). That * means we have to make sure everything is properly recorded and * our state is consistent before we call post_send(). 
*/ - tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; - tx_req->skb = skb; - addr = ib_dma_map_single(priv->ca, skb->data, skb->len, - DMA_TO_DEVICE); + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); - return; + return 1; } + + tx_req = &priv->tx_ring[tx_ring_index]; + tx_req->skb = skb; tx_req->mapping = addr; + priv->tx_sge[wr_num].addr = addr; + priv->tx_sge[wr_num].length = skb->len; + priv->tx_wr[wr_num].wr_id = tx_ring_index; + priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn; + priv->tx_wr[wr_num].wr.ud.ah = address->ah; - if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), - address->ah, qpn, addr, skb->len))) { - ipoib_warn(priv, "post_send failed\n"); - ++priv->stats.tx_errors; - ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); - dev_kfree_skb_any(skb); - } else { - dev->trans_start = jiffies; + return 0; +} + +/* + * If an skb is passed to this function, it is the single, unprocessed skb + * send case. Otherwise if skb is NULL, it means that all skbs are already + * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc. 
+ */ +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn, int num_skbs) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int start_index = 0; - address->last_send = priv->tx_head; - ++priv->tx_head; + if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head & + (ipoib_sendq_size - 1), address, qpn)) + return; - if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { - ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); - netif_stop_queue(dev); - set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + /* Send out all the skb's in one post */ + while (num_skbs) { + struct ib_send_wr *bad_wr; + + if (unlikely((post_send(priv, qpn, start_index, num_skbs, + &bad_wr)))) { + int done; + + /* + * Better error handling can be done here, like free + * all untried skbs if err == -ENOMEM. However at this + * time, we re-try all the skbs, all of which will + * likely fail anyway (unless device finished sending + * some out in the meantime). This is not a regression + * since the earlier code is not doing this either. 
+ */ + ipoib_warn(priv, "post_send failed\n"); + + /* Get #WR's that finished successfully */ + done = bad_wr - &priv->tx_wr[start_index]; + + /* Handle 1 error */ + priv->stats.tx_errors++; + ib_dma_unmap_single(priv->ca, + priv->tx_sge[start_index + done].addr, + priv->tx_sge[start_index + done].length, + DMA_TO_DEVICE); + + /* Handle 'n' successes */ + if (done) { + dev->trans_start = jiffies; + address->last_send = priv->tx_head; + } + + /* Free failed WR & reset for WC handler to recognize */ + dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb); + priv->tx_ring[bad_wr->wr_id].skb = NULL; + + /* Move head to first untried WR */ + priv->tx_head += (done + 1); + /* + 1 for WR that was tried & failed */ + + /* Get count of skbs that were not tried */ + num_skbs -= (done + 1); + + /* Get start index for next iteration */ + start_index += (done + 1); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + priv->tx_head += num_skbs; + num_skbs = 0; } } + + if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) { + /* + * Not accurate as some intermediate slots could have been + * freed on error, but no harm - only queue stopped earlier. 
+ */ + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + } } static void __ipoib_reap_ah(struct net_device *dev) From krkumar2 at in.ibm.com Sun Jul 22 02:06:59 2007 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Sun, 22 Jul 2007 14:36:59 +0530 Subject: [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c) In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070722090659.7787.47401.sendpatchset@K50wks273871wss.in.ibm.com> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c --- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-20 07:49:28.000000000 +0530 +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-22 00:08:28.000000000 +0530 @@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf goto err_drop; } } else - ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, path->ah, + IPOIB_QPN(skb->dst->neighbour->ha), 1); } else { neigh->ah = NULL; @@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); + ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1); } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ @@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu goto out; } - ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + ipoib_send(dev, skb, neigh->ah, + IPOIB_QPN(skb->dst->neighbour->ha), 1); goto out; } @@ -753,6 +755,175 @@ out: return NETDEV_TX_OK; } +#define XMIT_QUEUED_SKBS() \ + do { \ + if (num_skbs) { \ + ipoib_send(dev, NULL, old_neigh->ah, 
old_qpn, \ + num_skbs); \ + num_skbs = 0; \ + } \ + } while (0) + +/* + * TODO: Merge with ipoib_start_xmit to use the same code and have a + * transparent wrapper caller to xmit's, etc. + */ +static int ipoib_start_xmit_frames(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + struct sk_buff_head *blist; + int max_skbs, num_skbs = 0, tx_ring_index = -1; + u32 qpn, old_qpn = 0; + struct ipoib_neigh *neigh, *old_neigh = NULL; + unsigned long flags; + + if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags))) + return NETDEV_TX_LOCKED; + + blist = &dev->skb_blist; + + /* + * Send atmost 'max_skbs' skbs. This also prevents the device getting + * full. + */ + max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail); + while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) { + /* + * From here on, ipoib_send() cannot stop the queue as it + * uses the same initialization as 'max_skbs'. So we can + * optimize to not check for queue stopped for every skb. + */ + if (likely(skb->dst && skb->dst->neighbour)) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + XMIT_QUEUED_SKBS(); + ipoib_path_lookup(skb, dev); + continue; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (ipoib_cm_get(neigh)) { + if (ipoib_cm_up(neigh)) { + XMIT_QUEUED_SKBS(); + ipoib_cm_send(dev, skb, + ipoib_cm_get(neigh)); + continue; + } + } else if (neigh->ah) { + if (unlikely(memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid)))) { + spin_lock(&priv->lock); + /* + * It's safe to call ipoib_put_ah() + * inside priv->lock here, because we + * know that path->ah will always hold + * one more reference, so ipoib_put_ah() + * will never do more than decrement + * the ref count. 
+ */ + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock(&priv->lock); + XMIT_QUEUED_SKBS(); + ipoib_path_lookup(skb, dev); + continue; + } + + qpn = IPOIB_QPN(skb->dst->neighbour->ha); + if (neigh != old_neigh || qpn != old_qpn) { + /* + * Sending to a different destination + * from earlier skb's - send all + * existing skbs (if any). + */ + if (tx_ring_index == -1) { + /* + * First time, find where to + * store skb. + */ + tx_ring_index = priv->tx_head & + (ipoib_sendq_size - 1); + } else { + /* Some skbs to send */ + XMIT_QUEUED_SKBS(); + } + old_neigh = neigh; + old_qpn = IPOIB_QPN(skb->dst->neighbour->ha); + } + + if (ipoib_process_skb(dev, skb, priv, num_skbs, + tx_ring_index, neigh->ah, + qpn)) + continue; + + num_skbs++; + + /* Queue'd one skb, get index for next skb */ + if (max_skbs) + tx_ring_index = (tx_ring_index + 1) & + (ipoib_sendq_size - 1); + continue; + } + + if (skb_queue_len(&neigh->queue) < + IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + ++max_skbs; + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + XMIT_QUEUED_SKBS(); + ipoib_mcast_send(dev, phdr->hwaddr + 4, skb); + } else { + /* unicast GID -- should be ARP or RARP reply */ + + if ((be16_to_cpup((__be16 *) skb->data) != + ETH_P_ARP) && + (be16_to_cpup((__be16 *) skb->data) != + ETH_P_RARP)) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? 
"neigh" : "dst", + be16_to_cpup((__be16 *) + skb->data), + IPOIB_QPN(phdr->hwaddr), + IPOIB_GID_RAW_ARG(phdr->hwaddr + + 4)); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + ++max_skbs; + continue; + } + XMIT_QUEUED_SKBS(); + unicast_arp_send(skb, dev, phdr); + } + } + } + + /* Send out last packets (if any) */ + XMIT_QUEUED_SKBS(); + + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY; +} + static struct net_device_stats *ipoib_get_stats(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -898,11 +1069,35 @@ int ipoib_dev_init(struct net_device *de /* priv->tx_head & tx_tail are already 0 */ - if (ipoib_ib_dev_init(dev, ca, port)) + /* Allocate tx_sge */ + priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge, + GFP_KERNEL); + if (!priv->tx_sge) { + printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n", + ca->name, ipoib_sendq_size); goto out_tx_ring_cleanup; + } + + /* Allocate tx_wr */ + priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr, + GFP_KERNEL); + if (!priv->tx_wr) { + printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n", + ca->name, ipoib_sendq_size); + goto out_tx_sge_cleanup; + } + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_wr_cleanup; return 0; +out_tx_wr_cleanup: + kfree(priv->tx_wr); + +out_tx_sge_cleanup: + kfree(priv->tx_sge); + out_tx_ring_cleanup: kfree(priv->tx_ring); @@ -930,9 +1125,13 @@ void ipoib_dev_cleanup(struct net_device kfree(priv->rx_ring); kfree(priv->tx_ring); + kfree(priv->tx_sge); + kfree(priv->tx_wr); priv->rx_ring = NULL; priv->tx_ring = NULL; + priv->tx_sge = NULL; + priv->tx_wr = NULL; } static void ipoib_setup(struct net_device *dev) @@ -943,6 +1142,7 @@ static void ipoib_setup(struct net_devic dev->stop = ipoib_stop; dev->change_mtu = ipoib_change_mtu; dev->hard_start_xmit = ipoib_start_xmit; + dev->hard_start_xmit_batch = ipoib_start_xmit_frames; dev->get_stats = 
ipoib_get_stats; dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; From ogerlitz at voltaire.com Sun Jul 22 02:11:52 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 22 Jul 2007 12:11:52 +0300 Subject: [ofa-general] QoS RFC In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> Message-ID: <46A31F58.3010209@voltaire.com> Yevgeny Kliteynik wrote: > Please find the attached RFC describing how QoS policy support could be > implemented in the OpenFabrics stack. > Your comments are welcome. Hi Yevgeny, Some quick comments from first re-read 1) IPoIB - just to make sure I am on the right page: the intention is that the QoS params would be per --partition-- and hence the IPv4 broadcast multicast group sl, rate etc params would be used for each address handle created by this IPoIB device 2) RDMA CM (CMA) based ULPs - Assuming the rdma cm api would be enhanced for the consumer to optionally provide the "qos class", why have a dedicated section at the doc for iSER? there are bunch of other rdma cm based ULPs (eg Lustre/rNFS/RDS/etc/etc) which would be able to get QoS through the IB sys admin configuration of QoS policy at the SM/SA 3) RC based ULPs - I was thinking that the SL should be derived from the sid AND the pkey, I wonder if the IBTA related annex addresses this. 4) at some cases, the SID to be used is not known in advance: specifically the somehow canonical example is MPI implementations that request for the CM to allocate SID per rank per job, which means that you want huge dynamic bunch of SIDs to be mapped by the SA to the same SL. At the past my thinking to handle this was to change the CM such that users can ask for a --SID in a range-- and have this range be mapped to a specific SID in the SM/SA (same here maybe the IBTA annex says something re that) Or. From mst at dev.mellanox.co.il Sun Jul 22 02:13:26 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Sun, 22 Jul 2007 12:13:26 +0300
Subject: [ofa-general] Re: [PATCH] IB/mlx4: enable MSI-X by default
In-Reply-To:
References: <20070719112155.GJ24018@mellanox.co.il>
Message-ID: <20070722091326.GA7800@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [PATCH] IB/mlx4: enable MSI-X by default
>
> > -	mlx4_enable_msi_x(dev);
> > -
> >  	if (mlx4_cmd_init(dev)) {
> >  		mlx4_err(dev, "Failed to init command interface, aborting.\n");
> >  		goto err_free_dev;
> >  	}
> >
> > +	mlx4_enable_msi_x(dev);
>
> Why this change? I don't see anything in mlx4_cmd_init() that seems
> to matter in terms of coming before or after enabling MSI-X.
>
> >  	err = mlx4_init_hca(dev);
> > +	if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) {
> > +		mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n");
> > +		dev->flags &= ~MLX4_FLAG_MSI_X;
> > +		pci_disable_msix(pdev);
> > +		err = mlx4_init_hca(dev);
> > +	}
> > +
> >  	if (err)
> >  		goto err_cmd;
> >
> > +	mlx4_enable_msi_x(dev);
> > +
> >  	err = mlx4_setup_hca(dev);

You are right. I tried to copy the working mthca code as closely as
possible, but it looks like I made a mistake there.

> Have you actually tested this on a system where MSI-X fails? Because
> I don't see how it could work-- we don't actually try interrupts until
> mlx4_setup_hca() (in fact we don't even create any EQs until then).
> So I don't see how mlx4_init_hca() could tell if MSI-X is OK...

I only have a box with buggy PCI-X chipset - I'm not sure there are
PCI-Express chipsets with broken MSI out there. So while I did test
that my patch breaks nothing, the recovery code was untested.
I will patch in code to simulate failure before reposting.

-- 
MST

From mst at dev.mellanox.co.il Sun Jul 22 02:15:44 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 22 Jul 2007 12:15:44 +0300 Subject: [ofa-general] [TEST] test code to make msi-x fail Message-ID: <20070722091544.GB7800@mellanox.co.il> Here's a patch I used to test MSI-X failure recovery code in mlx4 and mthca. Posted in case it's useful to someone. Signed-off-by: Michael S. Tsirkin --- Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_eq.c 2007-07-19 09:36:11.000000000 +0300 +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_eq.c 2007-07-22 12:02:17.000000000 +0300 @@ -436,7 +436,8 @@ static irqreturn_t mthca_tavor_msi_x_int struct mthca_eq *eq = eq_ptr; struct mthca_dev *dev = eq->dev; - mthca_eq_int(dev, eq); + if (0) + mthca_eq_int(dev, eq); tavor_set_eq_ci(dev, eq, eq->cons_index); tavor_eq_req_not(dev, eq->eqn); Index: linux-2.6/drivers/net/mlx4/eq.c =================================================================== --- linux-2.6.orig/drivers/net/mlx4/eq.c 2007-07-19 09:30:35.000000000 +0300 +++ linux-2.6/drivers/net/mlx4/eq.c 2007-07-22 12:01:35.000000000 +0300 @@ -273,7 +273,8 @@ static irqreturn_t mlx4_msi_x_interrupt( struct mlx4_eq *eq = eq_ptr; struct mlx4_dev *dev = eq->dev; - mlx4_eq_int(dev, eq); + if (0) + mlx4_eq_int(dev, eq); /* MSI-X vectors always belong to us */ return IRQ_HANDLED; -- MST From mst at dev.mellanox.co.il Sun Jul 22 02:19:44 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 22 Jul 2007 12:19:44 +0300 Subject: [ofa-general] [PATCH V2] IB/mlx4: enable MSI-X by default Message-ID: <20070722091944.GC7800@mellanox.co.il> Recover from MSI-X errors by automatically falling back on regular interrupt, instead of asking the user to do this manually. This makes it possible to enable MSI-X by default, and will make it possible to get rid of msi_x module option in the future. Signed-off-by: Michael S. 
Tsirkin --- While the previous version worked fine in the good case, it turns out it didn't actually recover from errors as intended. This version was tested by patching the MSI-X handler routine. diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 4dc9dc1..b01d543 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -61,7 +61,7 @@ MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #ifdef CONFIG_PCI_MSI -static int msi_x; +static int msi_x = 1; module_param(msi_x, int, 0444); MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); @@ -602,10 +602,7 @@ static int __devinit mlx4_setup_hca(struct mlx4_dev *dev) mlx4_err(dev, "NOP command failed to generate interrupt " "(IRQ %d), aborting.\n", priv->eq_table.eq[MLX4_EQ_ASYNC].irq); - if (dev->flags & MLX4_FLAG_MSI_X) - mlx4_err(dev, "Try again with MSI-X disabled.\n"); - else - mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n"); + mlx4_err(dev, "BIOS or ACPI interrupt routing problem?\n"); goto err_cmd_poll; } @@ -803,8 +800,6 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev, goto err_free_dev; } - mlx4_enable_msi_x(dev); - if (mlx4_cmd_init(dev)) { mlx4_err(dev, "Failed to init command interface, aborting.\n"); goto err_free_dev; @@ -814,7 +809,16 @@ static int __devinit mlx4_init_one(struct pci_dev *pdev, if (err) goto err_cmd; + mlx4_enable_msi_x(dev); + err = mlx4_setup_hca(dev); + if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X)) { + mlx4_warn(dev, "Trying again with MSI/MSI-X disabled.\n"); + dev->flags &= ~MLX4_FLAG_MSI_X; + pci_disable_msix(pdev); + err = mlx4_setup_hca(dev); + } + if (err) goto err_close; @@ -838,15 +842,15 @@ err_cleanup: mlx4_cleanup_uar_table(dev); err_close: + if (dev->flags & MLX4_FLAG_MSI_X) + pci_disable_msix(pdev); + mlx4_close_hca(dev); err_cmd: mlx4_cmd_cleanup(dev); err_free_dev: - if (dev->flags & MLX4_FLAG_MSI_X) - pci_disable_msix(pdev); - kfree(priv); err_release_bar2: -- MST From mst at dev.mellanox.co.il Sun 
Jul 22 02:41:36 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 22 Jul 2007 12:41:36 +0300
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070722094136.GD7800@mellanox.co.il>

> +	/*
> +	 * Handle skbs completion from tx_tail to wr_id. It is possible to
> +	 * handle WC's from earlier post_sends (possible multiple) in this
> +	 * iteration as we move from tx_tail to wr_id, since if the last
> +	 * WR (which is the one which had a completion request) failed to be
> +	 * sent for any of those earlier request(s), no completion
> +	 * notification is generated for successful WR's of those earlier
> +	 * request(s).
> +	 */

AFAIK a signalled WR will always generate a completion.
What am I missing?

>
> +			/*
> +			 * Better error handling can be done here, like free
> +			 * all untried skbs if err == -ENOMEM. However at this
> +			 * time, we re-try all the skbs, all of which will
> +			 * likely fail anyway (unless device finished sending
> +			 * some out in the meantime). This is not a regression
> +			 * since the earlier code is not doing this either.
> +			 */

Are you retrying posting skbs? Why is this a good idea?
AFAIK, earlier code did not retry posting WRs at all.
The comment seems to imply that post send fails as a result of SQ overflow -
do you see SQ overflow errors in your testing?
AFAIK, IPoIB should never overflow the SQ.

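For reference, the "from tx_tail to wr_id" walk that the first quoted comment describes can be looked at in isolation. What follows is only a minimal user-space sketch, not code from the patch: SENDQ_SIZE and num_completions() are hypothetical stand-ins for ipoib_sendq_size and for the inline arithmetic in the patched ipoib_ib_handle_tx_wc(), and the sketch assumes the ring size is a power of two.

```c
#include <assert.h>

/* Hypothetical stand-in for ipoib_sendq_size; assumed to be a power of two. */
#define SENDQ_SIZE 8

/*
 * Count the ring slots covered by one signalled completion: every slot
 * from the one tx_tail maps to, up to and including wr_id, wrapping
 * around the end of the ring if wr_id sits "behind" the tail slot.
 */
static int num_completions(unsigned int tx_tail, unsigned int wr_id)
{
	int tx_ring_index = tx_tail & (SENDQ_SIZE - 1);
	int n;

	/* wr_id is a ring index in this scheme, so it must be in range. */
	assert(wr_id < SENDQ_SIZE);

	n = (int)wr_id - tx_ring_index + 1;
	if (n <= 0)
		n += SENDQ_SIZE;	/* the walk wrapped past the last slot */

	return n;
}
```

With a tail of 14 (slot 6 in an 8-entry ring) and a completion for wr_id 1, the walk covers slots 6, 7, 0 and 1, so the function returns 4. Whether a single completion should ever cover WRs from more than one post is exactly the question raised above; the sketch only shows the index arithmetic.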
-- MST From vlad at lists.openfabrics.org Sun Jul 22 02:44:31 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 22 Jul 2007 02:44:31 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070722-0200 daily build status Message-ID: <20070722094431.17323E60825@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.22 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on powerpc with 
linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From sashak at voltaire.com Sun Jul 22 03:22:09 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 22 Jul 2007 13:22:09 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <863azhrlm1.fsf@sw053.lab.mtl.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> Message-ID: <20070722102209.GR16597@sashak.voltaire.com> Hi Eitan, On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > Hi Sasha > > I am running some tests manually and apparently it looks like > I found a bug. Here is the sequence of things: > 1. SM sweeps the fabric assign LFTs > 2. I manually modify some LFTs (single entry now marked UNREACHABLE > 3. I force some switch change bit to 1 or issue kill -HUP > 4. The SM reports SUBNET UP > 5. The modified LFT entry is still UNREACHABLE and the path is broken Right, in most cases (unless OpenSM has its own changes in the same LFT block) OpenSM will refer its own LFT image for "need to update" decision, so _manual_ changes will not trigger new update. Rerunning OpenSM should help however. > It looks to me some optimization of routing does not fully reroute > unless some condition is met - but that condition does not include the > above triggers listed in step 3. Rereading all fabrics LFTs by default seems to be too expensive operations. 
At least by default, if it is real requirement this could be enforced manually, for example when kill -HUP is used. Thoughts? Sasha From eitan at mellanox.co.il Sun Jul 22 04:59:23 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 22 Jul 2007 14:59:23 +0300 Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT re-configuration References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> Hi Sasha Let's assume someone has reset a switch on the fabric. What would cause the SM to re-assign the LFT of that switch? I assumed that there is a mechanism to do that. Anyway, kill -HUP should flush out the state and restart from scratch. Eitan > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Sunday, July 22, 2007 1:22 PM > To: Eitan Zahavi > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration > > Hi Eitan, > > On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > > Hi Sasha > > > > I am running some tests manually and apparently it looks > like I found > > a bug. Here is the sequence of things: > > 1. SM sweeps the fabric assign LFTs > > 2. I manually modify some LFTs (single entry now marked > UNREACHABLE 3. > > I force some switch change bit to 1 or issue kill -HUP 4. The SM > > reports SUBNET UP 5. The modified LFT entry is still > UNREACHABLE and > > the path is broken > > Right, in most cases (unless OpenSM has its own changes in > the same LFT > block) OpenSM will refer its own LFT image for "need to update" > decision, so _manual_ changes will not trigger new update. > Rerunning OpenSM should help however. > > > It looks to me some optimization of routing does not fully reroute > > unless some condition is met - but that condition does not > include the > > above triggers listed in step 3. 
>
> Rereading all fabrics LFTs by default seems to be too
> expensive operations. At least by default, if it is real
> requirement this could be enforced manually, for example when
> kill -HUP is used. Thoughts?
>
> Sasha
>

From hadi at cyberus.ca Sun Jul 22 05:51:09 2007
From: hadi at cyberus.ca (jamal)
Date: Sun, 22 Jul 2007 08:51:09 -0400
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To:
References:
Message-ID: <1185108670.5192.122.camel@localhost>

KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:

> Batching need not be useful for every hardware.

My concern is there is no consistency in results. I see improvements on
something which you say don't. You see improvement in something that
Evgeniy doesn't etc.
There are many knobs and we need in the minimal to find out how those
play.

> I don't quite agree with that approach, eg, if the blist is empty and the
> driver tells there is space for one packet, you will add one packet and
> the driver sends it out and the device is stopped (with potentially lot of
> skbs on dev->q). Then no packets are added till the queue is enabled, at
> which time a flood of skbs will be processed increasing latency and holding
> lock for a single longer duration. My approach will mitigate holding lock
> for longer times and instead send skbs to the device as long as we are
> within the limits.

Just as a side note _I do have this feature_ in the pktgen piece.
In fact, you can tell pktgen what that bound is as opposed to the hard
coding (look at the pktgen "batchl" parameter). I have not found it to be
useful experimentally; actually, i should say i could not "zone" on a
useful value by experimenting and it was better to turn it off.
I never tried adding it to the qdisc path - but this is something i
could try and as i said it may prove useful.

> Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find > it), > I feel having prep will not help as no other cpu can execute the queue/xmit > code anyway (E1000 is also a LLTX driver). My experiments show it is useful (in a very visible way using pktgen) for e1000 to have the prep() interface. > Other driver that hold tx lock could get improvement however. So you do see the value then with non LLTX drivers, right? ;-> The value is also there in LLTX drivers even if in just formating a skb ready for transmit. If this is not clear i could do a much longer writeup on my thought evolution towards adding prep(). > I wonder if you tried enabling/disabling 'prep' on E1000 to see how the > performance is affected. Absolutely. And regardless of whether its beneficial or not for e1000, theres clear benefit in the tg3 for example. > If it helps, I guess you could send me a patch to > add that and I can also test it to see what the effect is. I didn't add it > since IPoIB wouldn't be able to exploit it (unless someone is kind enough > to show me how to). Such core code should not just be focussed on IPOIB. > I think the code I have is ready and stable, I am not sure how to intepret that - are you saying all-is-good and we should just push your code in? It sounds disingenuous but i may have misread you. cheers, jamal From pradeeps at linux.vnet.ibm.com Sun Jul 22 07:13:11 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sun, 22 Jul 2007 07:13:11 -0700 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <20070722072043.GB7188@mellanox.co.il> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> <20070722072043.GB7188@mellanox.co.il> Message-ID: <46A365F7.7090001@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Quoting Pradeep Satyanarayana : >> Subject: Re: NOSRQ misc patch [PATCH V1] >> >> Michael S. 
Tsirkin wrote: >>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ >>>> attr.recv_cq = priv->cq; >>>> attr.srq = priv->cm.srq; >>>> attr.cap.max_send_wr = ipoib_sendq_size; >>>> - attr.cap.max_recv_wr = 1; >>>> + attr.cap.max_recv_wr = 0; >>>> attr.cap.max_send_sge = 1; >>>> - attr.cap.max_recv_sge = 1; >>>> + attr.cap.max_recv_sge = 0; >>>> attr.sq_sig_type = IB_SIGNAL_ALL_WR; >>>> attr.qp_type = IB_QPT_RC; >>>> attr.send_cq = cq; >>> I don't see how does this fix things. >>> This line >>>> attr.srq = priv->cm.srq; >>> connected the TX QP to SRQ, making it possible to get packets on this QP. >>> But if cm.srq is NULL, and a remote sends a packet on this connection, >>> the connection will get closed. Which is a quality of implementation issue. >>> >> When the QP numbers are exchanged correctly, then it should not receive >> a packet on this QP in the first place. > > Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting > packets. We don't do this currently but we might in the future. I presume you mean passive side for receiving. Let us revisit the issue when there is a need. At this point it is not relevant. Pradeep From mst at dev.mellanox.co.il Sun Jul 22 07:25:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 22 Jul 2007 17:25:02 +0300 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <46A365F7.7090001@linux.vnet.ibm.com> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> <20070722072043.GB7188@mellanox.co.il> <46A365F7.7090001@linux.vnet.ibm.com> Message-ID: <20070722142502.GA8102@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: NOSRQ misc patch [PATCH V1] > > Michael S. Tsirkin wrote: > >> Quoting Pradeep Satyanarayana : > >> Subject: Re: NOSRQ misc patch [PATCH V1] > >> > >> Michael S. 
Tsirkin wrote: > >>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > >>>> attr.recv_cq = priv->cq; > >>>> attr.srq = priv->cm.srq; > >>>> attr.cap.max_send_wr = ipoib_sendq_size; > >>>> - attr.cap.max_recv_wr = 1; > >>>> + attr.cap.max_recv_wr = 0; > >>>> attr.cap.max_send_sge = 1; > >>>> - attr.cap.max_recv_sge = 1; > >>>> + attr.cap.max_recv_sge = 0; > >>>> attr.sq_sig_type = IB_SIGNAL_ALL_WR; > >>>> attr.qp_type = IB_QPT_RC; > >>>> attr.send_cq = cq; > >>> I don't see how does this fix things. > >>> This line > >>>> attr.srq = priv->cm.srq; > >>> connected the TX QP to SRQ, making it possible to get packets on this QP. > >>> But if cm.srq is NULL, and a remote sends a packet on this connection, > >>> the connection will get closed. Which is a quality of implementation issue. > >>> > >> When the QP numbers are exchanged correctly, then it should not receive > >> a packet on this QP in the first place. > > > > Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting > > packets. We don't do this currently but we might in the future. > > I presume you mean passive side for receiving. A passive side is the one that gets a REQ (look in IB spec section 12.9.6). Under IPoIB passive side can perform post send on the QP created. To make this work, I connect the QP to the SRQ on the active side: > attr.srq = priv->cm.srq; However, with your patch, priv->cm.srq might be NULL, which means that the QP won't be attached to SRQ. This is a quality of implementation issue that your patch is introducing. -- MST From monisonlists at gmail.com Sun Jul 22 07:49:27 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Sun, 22 Jul 2007 17:49:27 +0300 Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking for a P_Key in the table Message-ID: <46A36E77.5020307@gmail.com> IPoIB turns on the P_Key membership bit of limited membership P_Keys when creating a child interface. 
After that IPoIB looks for the full membership P_key in the table to make the
interface "RUNNING". This patch fixes the pkey lookup in order to match full
and partial membership keys that belong to the same partition.

 device.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: infiniband/drivers/infiniband/core/device.c
===================================================================
--- infiniband.orig/drivers/infiniband/core/device.c	2007-07-08 12:45:07.000000000 +0300
+++ infiniband/drivers/infiniband/core/device.c	2007-07-22 17:43:32.440829619 +0300
@@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic
 		if (ret)
 			return ret;
 
-		if (pkey == tmp_pkey) {
+		if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
 			*index = i;
 			return 0;
 		}

From kaber at trash.net Sun Jul 22 10:03:01 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:03:01 +0200
Subject: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
In-Reply-To:
References:
Message-ID: <46A38DC5.4040800@trash.net>

Krishna Kumar2 wrote:
> Patrick McHardy wrote on 07/20/2007 11:46:36 PM:
>
>>The check for tx_queue_len is wrong though,
>>its only a default which can be overriden and some qdiscs don't
>>care for it at all.
>
>
> I think it should not matter whether qdiscs use this or not, or even if it
> is modified (unless it is made zero in which case this breaks). The
> intention behind this check is to make sure that not more than tx_queue_len
> skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
> the blist can become too large and breaks the idea of tx_queue_len. Is that
> a good justification ?

It's a good justification, but on second thought the entire idea of
a single queue after qdiscs that is refilled independently of
transmission times etc. makes me worry a bit. By changing the
timing you're effectively changing the qdisc's behaviour, at least
in some cases. SFQ is a good example, but I believe it affects most
work-conserving qdiscs.
Think of this situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets
alternating between flow 1 and 2, then 50 packets of flow 2.

With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:

100 packets of flow 1
100 packets of flow 2

So SFQ is without any effect. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as many packets as the driver can take and have the driver
stop the queue afterwards.

From kaber at trash.net Sun Jul 22 10:06:51 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:06:51 +0200
Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
In-Reply-To: <20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090516.7787.79695.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <46A38EAB.6050300@trash.net>

Krishna Kumar wrote:
> @@ -472,6 +474,9 @@ struct net_device
>  	void			*priv;	/* pointer to private data */
>  	int			(*hard_start_xmit) (struct sk_buff *skb,
>  						    struct net_device *dev);
> +	int			(*hard_start_xmit_batch) (struct net_device
> +							  *dev);
> +

Is this function really needed? Can't you just call hard_start_xmit with
a NULL skb and have the driver use dev->blist?

>  	/* These may be needed for future network-power-down code.
*/
>  	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
>
> @@ -582,6 +587,8 @@ struct net_device
>  #define	NETDEV_ALIGN		32
>  #define	NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)
>
> +#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
> +
>  static inline void *netdev_priv(const struct net_device *dev)
>  {
>  	return dev->priv;
> @@ -832,6 +839,8 @@ extern int		dev_set_mac_address(struct n
>  					    struct sockaddr *);
>  extern int		dev_hard_start_xmit(struct sk_buff *skb,
>  					    struct net_device *dev);
> +extern int		dev_add_skb_to_blist(struct sk_buff *skb,
> +					     struct net_device *dev);

Again, function signatures should be introduced in the same patch
that contains the function. Splitting by file doesn't make sense.

From kaber at trash.net Sun Jul 22 10:10:37 2007
From: kaber at trash.net (Patrick McHardy)
Date: Sun, 22 Jul 2007 19:10:37 +0200
Subject: [ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
In-Reply-To: <20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
	<20070722090553.7787.28728.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <46A38F8D.6080109@trash.net>

Krishna Kumar wrote:
> diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> --- org/include/linux/if_link.h	2007-07-20 16:33:35.000000000 +0530
> +++ rev2/include/linux/if_link.h	2007-07-20 16:35:08.000000000 +0530
> @@ -78,6 +78,8 @@ enum
>  	IFLA_LINKMODE,
>  	IFLA_LINKINFO,
>  #define IFLA_LINKINFO IFLA_LINKINFO
> +	IFLA_TXBTHSKB,		/* Driver support for Batch'd skbs */
> +#define IFLA_TXBTHSKB IFLA_TXBTHSKB

Ughh what a name :) I prefer pronounceable names since they are
much easier to remember and don't need comments explaining what
they mean.

But I actually think offering just an ethtool interface would
be better, at least for now.
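
The reordering scenario from the [PATCH 05/10] sch_generic.c reply above
can be checked with a small user-space model. This is only a sketch under
stated assumptions: rr_qdisc is a hypothetical two-flow round-robin queue
standing in for SFQ (it is not the kernel's SFQ implementation), and the
extra_fifo flag models a batching list that drains the qdisc as soon as
packets arrive. None of these names come from the patches under review.

```c
#include <assert.h>

#define MAX_PKTS 256

/* Hypothetical toy qdisc: plain round-robin over two flows (not real SFQ). */
struct rr_qdisc {
	int backlog[2];	/* packets queued per flow; index 0 is flow 1 */
	int next;	/* index of the flow to try first on the next dequeue */
};

/* Dequeue one packet; returns the flow id (1 or 2), or 0 if empty. */
static int rr_dequeue(struct rr_qdisc *q)
{
	int tries;

	for (tries = 0; tries < 2; tries++) {
		int f = q->next;

		q->next = 1 - f;
		if (q->backlog[f] > 0) {
			q->backlog[f]--;
			return f + 1;
		}
	}
	return 0;
}

/* Move up to 'limit' packets from the qdisc into out[], starting at index n. */
static int pull(struct rr_qdisc *q, int *out, int n, int limit)
{
	int flow;

	while (limit-- > 0 && (flow = rr_dequeue(q)) != 0) {
		assert(n < MAX_PKTS);
		out[n++] = flow;
	}
	return n;
}

/*
 * Replay the scenario: 100 packets of flow 1 arrive, 50 packets are sent,
 * 100 packets of flow 2 arrive, the rest is sent. With extra_fifo set, every
 * arrival is pulled out of the qdisc immediately into a FIFO (assumed
 * behaviour of an unbounded batching list). Returns the number of packets
 * written to wire[].
 */
int simulate(int extra_fifo, int *wire)
{
	struct rr_qdisc q = { { 0, 0 }, 0 };
	int fifo[MAX_PKTS];
	int head = 0, tail = 0;	/* fifo[head..tail) is waiting for the driver */
	int n = 0;

	q.backlog[0] = 100;			/* flow 1 arrives */
	if (extra_fifo)
		tail = pull(&q, fifo, tail, MAX_PKTS);

	if (extra_fifo)				/* 50 packets are sent */
		while (n < 50 && head < tail)
			wire[n++] = fifo[head++];
	else
		n = pull(&q, wire, n, 50);

	q.backlog[1] = 100;			/* flow 2 arrives */
	if (extra_fifo)
		tail = pull(&q, fifo, tail, MAX_PKTS);

	if (extra_fifo)				/* remaining packets are sent */
		while (head < tail)
			wire[n++] = fifo[head++];
	else
		n = pull(&q, wire, n, MAX_PKTS);

	return n;
}
```

In this model simulate(0, wire) yields 50 packets of flow 1, then 100
packets alternating between the two flows, then 50 packets of flow 2, while
simulate(1, wire) yields 100 packets of flow 1 followed by 100 packets of
flow 2, so the interleaving is lost once the FIFO sits behind the qdisc.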
From sashak at voltaire.com Sun Jul 22 10:40:48 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 22 Jul 2007 20:40:48 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> Message-ID: <20070722174048.GO27878@sashak.voltaire.com> On 14:59 Sun 22 Jul , Eitan Zahavi wrote: > Hi Sasha > > Let's assume someone has reset a switch on the fabric. > What would cause the SM to re-assign the LFT of that switch? OpenSM will sweep and drop this switch and when switch will back it will be initialized again. But if the reset was too fast (relative to discovery), we can be in trouble (and maybe not only with LFTs). > I assumed that there is a mechanism to do that. Not for "fast" switch reboot. Hmm, I think we could try to detect this case by comparing SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even by seeing that PortInfo:LID is not set. Something like below: diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 5b2b19e..62c072f 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -112,6 +112,7 @@ typedef struct _osm_switch osm_fwd_tbl_t fwd_tbl; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; + unsigned update_ft; void *priv; } osm_switch_t; /* @@ -152,6 +153,10 @@ typedef struct _osm_switch * during the current fabric sweep. This number is reset * to zero at the start of a sweep. 
* +* update_ft +* When set fwd tables will be updated regardless to entry +* values locally stored in fwd tables images +* * SEE ALSO * Switch object *********/ diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index adece65..8bbbcac 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port( break; } } + else if (port_num == 0 && p_node->sw && + (!p_pi->base_lid || !p_pi->master_sm_base_lid)) + p_node->sw->update_ft = 1; /* Update the PortInfo attribute. diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index b44a3ba..03516ae 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table( osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ; block_id_ho++ ) { - if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) + if (!p_sw->update_ft && + !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) continue; if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) @@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table( } } + p_sw->update_ft = 0; OSM_LOG_EXIT( p_mgr->p_log ); } BTW what do you think is the best way to detect switch power up? I didn't really find a strong requirement for at powerup initialization of any suitable component. > Anyway, kill -HUP should flush out the state and restart from scratch. Thinking more about it I'm not sure. Similar flush will be required for another "stored" components like pkey, sl2vl tables etc.. So it is more than just "regular" heavy sweep, another signal or option could be used for this, but OTOH it becomes very close to OpenSM restarting.. Sasha > > > Eitan > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Sunday, July 22, 2007 1:22 PM > > To: Eitan Zahavi > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > Subject: Re: opensm: a bug in heavy sweep? 
- no LFT re-configuration > > > > Hi Eitan, > > > > On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > > > Hi Sasha > > > > > > I am running some tests manually and apparently it looks > > like I found > > > a bug. Here is the sequence of things: > > > 1. SM sweeps the fabric assign LFTs > > > 2. I manually modify some LFTs (single entry now marked > > UNREACHABLE 3. > > > I force some switch change bit to 1 or issue kill -HUP 4. The SM > > > reports SUBNET UP 5. The modified LFT entry is still > > UNREACHABLE and > > > the path is broken > > > > Right, in most cases (unless OpenSM has its own changes in > > the same LFT > > block) OpenSM will refer its own LFT image for "need to update" > > decision, so _manual_ changes will not trigger new update. > > Rerunning OpenSM should help however. > > > > > It looks to me some optimization of routing does not fully reroute > > > unless some condition is met - but that condition does not > > include the > > > above triggers listed in step 3. > > > > Rereading all fabrics LFTs by default seems to be too > > expensive operations. At least by default, if it is real > > requirement this could be enforced manually, for example when > > kill -HUP is used. Thoughts? > > > > Sasha > > From pradeeps at linux.vnet.ibm.com Sun Jul 22 11:39:00 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sun, 22 Jul 2007 11:39:00 -0700 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <20070722142502.GA8102@mellanox.co.il> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> <20070722072043.GB7188@mellanox.co.il> <46A365F7.7090001@linux.vnet.ibm.com> <20070722142502.GA8102@mellanox.co.il> Message-ID: <46A3A444.5050802@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Quoting Pradeep Satyanarayana : >> Subject: Re: NOSRQ misc patch [PATCH V1] >> >> Michael S. 
Tsirkin wrote: >>>> Quoting Pradeep Satyanarayana : >>>> Subject: Re: NOSRQ misc patch [PATCH V1] >>>> >>>> Michael S. Tsirkin wrote: >>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ >>>>>> attr.recv_cq = priv->cq; >>>>>> attr.srq = priv->cm.srq; >>>>>> attr.cap.max_send_wr = ipoib_sendq_size; >>>>>> - attr.cap.max_recv_wr = 1; >>>>>> + attr.cap.max_recv_wr = 0; >>>>>> attr.cap.max_send_sge = 1; >>>>>> - attr.cap.max_recv_sge = 1; >>>>>> + attr.cap.max_recv_sge = 0; >>>>>> attr.sq_sig_type = IB_SIGNAL_ALL_WR; >>>>>> attr.qp_type = IB_QPT_RC; >>>>>> attr.send_cq = cq; >>>>> I don't see how does this fix things. >>>>> This line >>>>>> attr.srq = priv->cm.srq; >>>>> connected the TX QP to SRQ, making it possible to get packets on this QP. >>>>> But if cm.srq is NULL, and a remote sends a packet on this connection, >>>>> the connection will get closed. Which is a quality of implementation issue. >>>>> >>>> When the QP numbers are exchanged correctly, then it should not receive >>>> a packet on this QP in the first place. >>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting >>> packets. We don't do this currently but we might in the future. >> I presume you mean passive side for receiving. > > A passive side is the one that gets a REQ (look in IB spec section 12.9.6). > Under IPoIB passive side can perform post send on the QP created. > To make this work, I connect the QP to the SRQ on the active side: >> attr.srq = priv->cm.srq; > > However, with your patch, priv->cm.srq might be NULL, which > means that the QP won't be attached to SRQ. This is > a quality of implementation issue that your patch is introducing. > I do not understand -for one you mention transmitting packets and, on the other hand you mention SRQ. Are you hinting at Shared Send Queues (which may be in the future as you state)? I have already tested the series of NOSRQ patches for interoperability between IBM and Mellanox adapters and it works. 
I do not see the quality of implementation issues that you keep referring to. I believe this should not pose issues for merging this immediately into 2.6.23. If you have ideas that can be implemented in the future, we can discuss that but outside the context of this patch. Pradeep From mst at dev.mellanox.co.il Sun Jul 22 14:02:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jul 2007 00:02:55 +0300 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <46A3A444.5050802@linux.vnet.ibm.com> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> <20070722072043.GB7188@mellanox.co.il> <46A365F7.7090001@linux.vnet.ibm.com> <20070722142502.GA8102@mellanox.co.il> <46A3A444.5050802@linux.vnet.ibm.com> Message-ID: <20070722210255.GA25023@mellanox.co.il> > Quoting Pradeep Satyanarayana : > Subject: Re: NOSRQ misc patch [PATCH V1] > > Michael S. Tsirkin wrote: > >> Quoting Pradeep Satyanarayana : > >> Subject: Re: NOSRQ misc patch [PATCH V1] > >> > >> Michael S. Tsirkin wrote: > >>>> Quoting Pradeep Satyanarayana : > >>>> Subject: Re: NOSRQ misc patch [PATCH V1] > >>>> > >>>> Michael S. Tsirkin wrote: > >>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ > >>>>>> attr.recv_cq = priv->cq; > >>>>>> attr.srq = priv->cm.srq; > >>>>>> attr.cap.max_send_wr = ipoib_sendq_size; > >>>>>> - attr.cap.max_recv_wr = 1; > >>>>>> + attr.cap.max_recv_wr = 0; > >>>>>> attr.cap.max_send_sge = 1; > >>>>>> - attr.cap.max_recv_sge = 1; > >>>>>> + attr.cap.max_recv_sge = 0; > >>>>>> attr.sq_sig_type = IB_SIGNAL_ALL_WR; > >>>>>> attr.qp_type = IB_QPT_RC; > >>>>>> attr.send_cq = cq; > >>>>> I don't see how does this fix things. > >>>>> This line > >>>>>> attr.srq = priv->cm.srq; > >>>>> connected the TX QP to SRQ, making it possible to get packets on this QP. 
> >>>>> But if cm.srq is NULL, and a remote sends a packet on this connection, > >>>>> the connection will get closed. Which is a quality of implementation issue. > >>>>> > >>>> When the QP numbers are exchanged correctly, then it should not receive > >>>> a packet on this QP in the first place. > >>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting > >>> packets. We don't do this currently but we might in the future. > >> I presume you mean passive side for receiving. > > > > A passive side is the one that gets a REQ (look in IB spec section 12.9.6). > > Under IPoIB passive side can perform post send on the QP created. > > To make this work, I connect the QP to the SRQ on the active side: > >> attr.srq = priv->cm.srq; > > > > However, with your patch, priv->cm.srq might be NULL, which > > means that the QP won't be attached to SRQ. This is > > a quality of implementation issue that your patch is introducing. > > > > I do not understand -for one you mention transmitting packets and, on the > other hand you mention SRQ. > I have already tested the series of NOSRQ patches for interoperability between > IBM and Mellanox adapters and it works. I do not see the quality of implementation > issues that you keep referring to. Do you understand why is this line there? attr.srq = priv->cm.srq; I'll try to explain. The snippet above creates a QP (that will then be connected to remote side). According to IPoIB CM RFC, once the connection is set up, and if the remote wants to send us some packets, it could send them over this existing connection. And with current code, this would work correctly, because all RC QP we create are connected to the common SRQ and a common CQ, thus such packets get receive WCs and get sent up the stack. But with your patch, priv->cm.srq == NULL so you create RC QPs that are not connected to SRQ, and you never post any receive WRs, so if the remote sends even a single packet on this QP, the QP will transfer to error state. 
This is a regression: QPs are supposed to have receive WRs
preposted. If you consider e.g. TCP, it's easy to imagine that
the packet the remote was sending was an ACK, so it won't
retry - until we destroy the connection, create a new one,
resend the packet - and an ACK will kill the QP again.

This just happens to never occur for you because you have a recent linux kernel
on both sides of the link, and because linux is currently not smart enough
to reuse an existing connection - so it creates a new one and
your bug is hidden from view.

--
MST

From sashak at voltaire.com Sun Jul 22 14:48:09 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 23 Jul 2007 00:48:09 +0300
Subject: [ofa-general] [PATCH] opensm: set PortInfo:LinkSpeed in link_mgr only
Message-ID: <20070722214809.GP27878@sashak.voltaire.com>

PortInfo:LinkSpeed setup is performed (in accordance with link_speed
option value) in link_mgr and not in lid_mgr.

Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm/osm_lid_mgr.c |   16 ----------------
 1 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index 79a0ea8..f1f4707 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1108,22 +1108,6 @@ __osm_lid_mgr_set_physp_pi(
                 sizeof(p_pi->link_width_enabled) ))
       send_set = TRUE;
 
-    if ( p_mgr->p_subn->opt.force_link_speed )
-    {
-      if ( p_mgr->p_subn->opt.force_link_speed == 15 ) /* LinkSpeedSupported */
-      {
-        if (ib_port_info_get_link_speed_enabled( p_old_pi ) != ib_port_info_get_link_speed_sup( p_pi ))
-          ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK );
-        else
-          ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled( p_old_pi ));
-      }
-      else
-        ib_port_info_set_link_speed_enabled( p_pi, p_mgr->p_subn->opt.force_link_speed );
-      if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed,
-                  sizeof(p_pi->link_speed) ))
-        send_set = TRUE;
-    }
-
     /* M_KeyProtectBits are always zero */
p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc; /* Check to see if the value we are setting is different than -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Sun Jul 22 14:51:54 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Jul 2007 00:51:54 +0300 Subject: [ofa-general] [PATCH] management/gen_chlog.sh: simple ChangeLog generator Message-ID: <20070722215154.GQ27878@sashak.voltaire.com> This gen_chlog.sh scripts generates ChangeLog (from git logs) for specified subdirectory or for whole tree if "." is used. This supports ChangleLog and spec file formats. The script can be used during tarballs generation by make.dist or 'make dist'. Signed-off-by: Sasha Khapyorsky --- gen_chlog.sh | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 68 insertions(+), 0 deletions(-) create mode 100755 gen_chlog.sh diff --git a/gen_chlog.sh b/gen_chlog.sh new file mode 100755 index 0000000..9d60081 --- /dev/null +++ b/gen_chlog.sh @@ -0,0 +1,68 @@ +#!/bin/sh + +usage() +{ + echo "Usage: $0 [--spec] " + exit 2 +} + +test -z "$1" && usage + +if [ "$1" = "--spec" ] ; then + spec_format=1 + shift + test -z "$1" && usage +fi + +TARGET=$1 + +GIT_DIR=`git-rev-parse --git-dir 2>/dev/null` + +test -z "$GIT_DIR" && usage + + +export GIT_DIR +export GIT_PAGER="" +export PAGER="" + + +mkchlog() +{ + target=$1 + format=$2 + + prev_tag="" + + for tag in `git-tag -l $target` ; do + obj=`git-cat-file tag $tag | awk '/^object /{print $2}'` + base=`git-merge-base $obj HEAD` + if [ -z "$base" -o "$base" != $obj ] ; then + continue + fi + all_vers="$prev_tag$tag $all_vers" + prev_tag=$tag.. + done + + if [ -z "$prev_tag" ] ; then + all_vers=HEAD + else + all_vers="${prev_tag}HEAD $all_vers" + fi + + for ver in $all_vers ; do + ver_name=`echo $ver | sed -e 's/^.*\.\.//'` + echo "* Version: $ver_name" + echo "" + git-log --no-merges "${format}" $ver -- $target + prev_t=$tag.. 
+ done +} + + +if [ -z "$spec_format" ] ; then + mkchlog $TARGET --pretty=format:"commit %H%n%ad %an%n%n %s%n" +else + echo "%changelog" + mkchlog $TARGET --pretty=format:"- %ad %an: %s" + echo "" +fi -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Sun Jul 22 15:14:55 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Jul 2007 01:14:55 +0300 Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer to opensm-coding-style.txt Message-ID: <20070722221455.GR27878@sashak.voltaire.com> This updates the script according to recent doc/opensm-coding-style.txt (in short K&R, tabs, etc.). Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_indent | 57 +++------------------------------------------ 1 files changed, 4 insertions(+), 53 deletions(-) diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent index bed2ba1..621184b 100755 --- a/opensm/opensm/osm_indent +++ b/opensm/opensm/osm_indent @@ -1,6 +1,6 @@ #!/bin/bash # -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. # Copyright (c) 1996-2003 Intel Corporation. All rights reserved. # @@ -40,56 +40,7 @@ # Environment: # Linux User Mode # -# $Revision: 1.4 $ -# -# -# This is the indent format used for OpenSM. 
-# -# format the source code according to the ACD standard -# -bad Blank line after declarations -# -bap Blank line after Procedures -# -bbb Blank line before block comments -# -nbbo Break after Boolean operator -# -bl Break after if line -# -bli0 Indent for braces is 0 -# -bls Break after struct declarations -# -cbi0 Case break indent 0 -# -ci3 Continue indent 3 spaces -# -cli0 Case label indent 0 spaces -# -ncs No space after cast operator -# -hnl Honor existing newlines on long lines -# -i3 Substitute indent with 3 spaces -# -npcs No space after procedure calls -# -prs Space after parenthesis -# -nsai No space after if keyword - removed -# -nsaw No space after while keyword - removed -# -sc Put * at left of comments in a block comment style -# -nsob Don't swallow unnecessary blank lines -# -ts3 Tab size is 3 -# -psl Type of procedure return in a separate line -# -bfda Function declaration arguments in a separate line. -# -nut No tabs as we allow spaces -# -######################################################################### - -# indent the world -for sourcefile in $*; do - if test -f "$sourcefile"; then - # first, string DOS style linefeeds - perl -piW -e's/\x0D//' "$sourcefile" - echo Processing $sourcefile - indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \ - -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile - - rm ${sourcefile}W +# This is the indent format used for OpenSM (similar to one used in +# linux/scripts/Lindent). 
- # the -bb also affect the first line in each file - so clean it up - if test `head -1 $sourcefile | egrep -v '^$' | wc -l` = 0; then - echo Cleaning up first empty line of $sourcefile - awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W - mv -f ${sourcefile}W $sourcefile - fi - else - echo Could not find file:$sourcefile - fi -done +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@" -- 1.5.3.rc2.29.gc4640f From pradeeps at linux.vnet.ibm.com Sun Jul 22 17:06:10 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Sun, 22 Jul 2007 17:06:10 -0700 Subject: [ofa-general] Re: NOSRQ misc patch [PATCH V1] In-Reply-To: <20070722210255.GA25023@mellanox.co.il> References: <46A28CB7.1040509@linux.vnet.ibm.com> <20070722060557.GB20438@mellanox.co.il> <46A3043A.3030200@linux.vnet.ibm.com> <20070722072043.GB7188@mellanox.co.il> <46A365F7.7090001@linux.vnet.ibm.com> <20070722142502.GA8102@mellanox.co.il> <46A3A444.5050802@linux.vnet.ibm.com> <20070722210255.GA25023@mellanox.co.il> Message-ID: <46A3F0F2.7080500@linux.vnet.ibm.com> Michael S. Tsirkin wrote: >> Quoting Pradeep Satyanarayana : >> Subject: Re: NOSRQ misc patch [PATCH V1] >> >> Michael S. Tsirkin wrote: >>>> Quoting Pradeep Satyanarayana : >>>> Subject: Re: NOSRQ misc patch [PATCH V1] >>>> >>>> Michael S. Tsirkin wrote: >>>>>> Quoting Pradeep Satyanarayana : >>>>>> Subject: Re: NOSRQ misc patch [PATCH V1] >>>>>> >>>>>> Michael S. Tsirkin wrote: >>>>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_ >>>>>>>> attr.recv_cq = priv->cq; >>>>>>>> attr.srq = priv->cm.srq; >>>>>>>> attr.cap.max_send_wr = ipoib_sendq_size; >>>>>>>> - attr.cap.max_recv_wr = 1; >>>>>>>> + attr.cap.max_recv_wr = 0; >>>>>>>> attr.cap.max_send_sge = 1; >>>>>>>> - attr.cap.max_recv_sge = 1; >>>>>>>> + attr.cap.max_recv_sge = 0; >>>>>>>> attr.sq_sig_type = IB_SIGNAL_ALL_WR; >>>>>>>> attr.qp_type = IB_QPT_RC; >>>>>>>> attr.send_cq = cq; >>>>>>> I don't see how does this fix things. 
>>>>>>> This line >>>>>>>> attr.srq = priv->cm.srq; >>>>>>> connected the TX QP to SRQ, making it possible to get packets on this QP. >>>>>>> But if cm.srq is NULL, and a remote sends a packet on this connection, >>>>>>> the connection will get closed. Which is a quality of implementation issue. >>>>>>> >>>>>> When the QP numbers are exchanged correctly, then it should not receive >>>>>> a packet on this QP in the first place. >>>>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting >>>>> packets. We don't do this currently but we might in the future. >>>> I presume you mean passive side for receiving. >>> A passive side is the one that gets a REQ (look in IB spec section 12.9.6). >>> Under IPoIB passive side can perform post send on the QP created. >>> To make this work, I connect the QP to the SRQ on the active side: >>>> attr.srq = priv->cm.srq; >>> However, with your patch, priv->cm.srq might be NULL, which >>> means that the QP won't be attached to SRQ. This is >>> a quality of implementation issue that your patch is introducing. >>> >> I do not understand -for one you mention transmitting packets and, on the >> other hand you mention SRQ. >> I have already tested the series of NOSRQ patches for interoperability between >> IBM and Mellanox adapters and it works. I do not see the quality of implementation >> issues that you keep referring to. > > Do you understand why is this line there? > attr.srq = priv->cm.srq; > > I'll try to explain. > > The snippet above creates a QP (that will then be connected to remote side). > According to IPoIB CM RFC, once the connection is set up, and if the remote > wants to send us some packets, it could send them over this > existing connection. > > And with current code, this would work correctly, because > all RC QP we create are connected to the common SRQ and a common CQ, > thus such packets get receive WCs and get sent up the stack. 
> > But with your patch, priv->cm.srq == NULL so you create RC QPs that > are not connected to SRQ, and you never post any receive WRs, > so if the remote sends even a single packet on this QP, the QP > will transfer to error state. > I do not post any WRs because I do not expect any packets to be received. If it does receive any packets an RNR will be returned (as expected). The Queues in the Queue Pairs are not being used symmetrically that is all. Also the priv->cm.srq is to NULL only in the non-SRQ case. The SRQ case is as before. > This is a regression: QPs are supposed to have receive WRs > preposed. If you consider e.g. TCP, it's easy to imagine that > the packet remote was sending was an ACK, so it won't > retry - until we destroy the connection, create a new one, > resend the packet - and an ACK will kill the QP again. > There is nothing about asymmetric usage of the Queues. And hence I see no problems. If in TCP one sends to the wrong port, the packet gets dropped. This is similar to that. > This just happens to never occur for you because you have recent linux kernel > on both sides of the link, and because linux is currently not smart enough > to reuse an existing connection - so it creates a new one and > your bug is hidden from view. > This code has been there since day one. I do not understand the reasoning for raising issues on the eve of the acceptance of this patch. Why bring it up now? Pradeep From sashak at voltaire.com Sun Jul 22 17:20:11 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Jul 2007 03:20:11 +0300 Subject: [ofa-general] QoS RFC In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> Message-ID: <20070723002010.GU27878@sashak.voltaire.com> Hi Yevgeny, Some initial comments. On 01:07 Sun 22 Jul , Yevgeny Kliteynik wrote: > Hi All > > Please find the attached RFC describing how QoS policy support could be > implemented in the OpenFabrics stack. > Your comments are welcome. 
> > -- Yevgeny > > RFC: OpenFabrics Enhancements for QoS Support > =============================================== > > Authors: . Eitan Zahavi > Authors: . Yevgeny Kliteynik > Date: .... Jul 2007. > Revision: 0.2 > > Table of contents: > 1. Overview > 2. Architecture > 3. Supported Policy > 4. CMA functionality > 5. IPoIB functionality > 6. SDP functionality > 7. SRP functionality > 8. iSER functionality > 9. OpenSM functionality > > 1. Overview > ------------ > Quality of Service requirements stem from the realization of I/O > consolidation > over IB network: As multiple applications and ULPs share the same fabric, > means > to control their use of the network resources are becoming a must. The basic > need is to differentiate the service levels provided to different traffic > flows, > such that a policy could be enforced and control each flow utilization of > the > fabric resources. > > IBTA specification defined several hardware features and management > interfaces > to support QoS: > * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner > * Arbitration between traffic of different VLs is performed by a 2 priority > levels weighted round robin arbiter. The arbiter is programmable with > a sequence of (VL, weight) pairs and maximal number of high priority > credits > to be processed before low priority is served > * Packets carry class of service marking in the range 0 to 15 in their > header SL field > * Each switch can map the incoming packet by its SL to a particular output > VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) > * The Subnet Administrator controls each communication flow parameters > by providing them as a response to Path Record (PR) or MultiPathRecord > (MPR) > queries > > The IB QoS features provide the means to implement a DiffServ like > architecture. > DiffServ architecture (IETF RFC2474 2475) is widely used today in highly > dynamic > fabrics. 
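As a rough illustration of the SL-to-VL mapping listed in the overview, the sketch below models VL = SL-to-VL-MAP(in-port, out-port, SL) as a per-switch lookup table. The names Sl2VlMap, set_table and lookup are hypothetical and not part of OpenSM or any IB API. Padding missing entries with VL 0 is an assumption taken from the comments in the policy file example of this RFC.

```python
# Hypothetical model of a switch SL2VL table; not an OpenSM or verbs API.
NUM_SLS = 16  # the SL field in the packet header ranges over 0..15


class Sl2VlMap:
    def __init__(self):
        # (in_port, out_port) -> list of 16 VLs, indexed by SL
        self._tables = {}

    def set_table(self, in_port, out_port, vls):
        if len(vls) > NUM_SLS:
            raise ValueError("at most 16 VL values, one per SL")
        # Assumption from the policy-file comments: missing values become VL 0
        padded = list(vls) + [0] * (NUM_SLS - len(vls))
        self._tables[(in_port, out_port)] = padded

    def lookup(self, in_port, out_port, sl):
        # VL = SL-to-VL-MAP(in-port, out-port, SL)
        return self._tables[(in_port, out_port)][sl]


sl2vl = Sl2VlMap()
sl2vl.set_table(1, 2, [0, 1, 2, 3])   # only SL 0..3 given explicitly
print(sl2vl.lookup(1, 2, 3))          # explicitly configured entry
print(sl2vl.lookup(1, 2, 9))          # entry filled in by the padding rule
```

A real switch holds one such table per (in-port, out-port) pair, which is why the policy syntax needs the "from" and "to" selectors.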
> > This proposal provides the detailed functional definition for the various > software elements that are required to enable a DiffServ like architecture > over > the OpenFabrics software stack. > > > > 2. Architecture > ---------------- > This proposal split the QoS functionality between the SM/SA, CMA and the > various > ULPS. We take the "chronology approach" to describe how the overall system > works: > > 2.1. The network manager (human) provides a set of rules (policy) that > defines > how the network is being configured and how its resources are split to > different > QoS-Levels. The policy also define how to decide which QoS-Level each > application or ULP or service use. > > 2.2. The SM analyzes the provided policy to see if it is realizable and > performs > the necessary fabric setup. The SM may continuously monitor the policy and > adapt > to changes in it. Part of this policy defines the default QoS-Level of each > partition. The SA is being enhanced to match the requested Source, > Destination, > QoS-Class, Service-ID (and optionally SL and priority) against the policy. > So > clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also > enhanced to support setting up partitions with appropriate IPoIB broadcast > group. This broadcast group carries its QoS attributes: SL, MTU and > RATE. > > 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the > multicast group which forms the broadcast group of this partition. > > 2.4. MPI which provides non IB based connection management should be > configured > to run using hard coded SLs. It uses these SLs for every QP being opened. > > 2.5. ULPs that use CM interface (like SRP) should have their own > pre-assigned > Service-ID and use it while obtaining PR/MPR for establishing connections. > The SA receiving the PR/MPR should match it against the policy and return > the appropriate PR/MPR including SL, MTU and RATE. > > 2.6. 
ULPs and programs using CMA to establish RC connection should provide > the > CMA the target IP and Service-ID. Some of the ULPs might also provide > QoS-Class > (E.g. for SDP sockets that are provided the TOS socket option). The CMA > should > then use the provided Service-ID and optional QoS-Class and pass them in the > PR/MPR request. The resulting PR/MPR should be used for configuring the > connection QP. > > PathRecord and MultiPathRecord enhancement for QoS: > As mentioned above the PathRecord and MultiPathRecord attributes should be > enhanced to carry the Service-ID which is a 64bit value, which has been > standardized by the IBTA. A new field QoS-Class is also provided. > A new capability bit should describe the SM QoS support in the SA class port > info. This approach provides an easy migration path for existing access > layer > and ULPs by not introducing new set of PR/MPR attribute. > > > 3. Supported Policy > -------------------- > > The QoS policy supported by this proposal is divided into 4 sub sections: > > I) Port Group: a set of CAs, Routers or Switches that share the same > settings. > A port group might be a partition defined by the partition manager policy in > terms of GUIDs. Future implementations might provide support for > NodeDescription > based definition of port groups. Isn't it better to have port group definitions in separate file? So groups could be shared with other OpenSM components (as discussed). Even if such group sharing is not high priority functionality this should save us from redoing things later. > II) Fabric Setup: > Defines how the SL2VL and VLArb tables should be setup. This policy > definition > assumes the computation of overall end to end network behavior should be > performed > outside of OpenSM. > > III) QoS-Levels Definition: > This section defines the possible sets of parameters for QoS that a client > might be mapped to. 
Each set holds: SL and optionally: Max MTU, Max Rate, > Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS). > > IV) Matching Rules: > A list of rules that match an incoming PR/MPR request to a QoS-Level. The > rules are processed in order such as the first match is applied. Each rule > is > built out of a set of match expressions which should all match for the rule > to > apply. The matching expressions are defined for the following fields > ** SRC and DST to lists of port groups > ** Service-ID to a list of Service-ID or Service-ID ranges > ** QoS-Class to a list of QoS-Class values or ranges > > QoS Policy file syntax > > * Empty lines are ignored > * Leading and trailing blanks, as well as empty lines, are ignored, so the > indentation in the example is just for better readability > * Comments are started with the pound sign (#) and terminated by EOL > * Comments may appear only in a separate line Why? What is wrong with: port-name: vs1/HCA-1/P1 # my best port > * Keywords that denote section/subsection start have matching closing > keywords > * Any keyword should be the first non-blank in the line > > QoS Policy file example > > # Port Groups define sets of ports to be used later in the settings > port-groups > # using port GUIDs > port-group > name: Storage > # "use" is just a description that is used for logging. > # Other than that, it is just a commentary > use: our SRP storage targets > port-guid: 0x1000000000000001 > port-guid: 0x1000000000000002 > end-port-group > > port-group > name: Virtual Servers > use: node desc and IB port num > # The syntax of the port name is as follows: > "hostname/CA-num/Pnum". > # "hostname" and "CA-num" are compared to the first 2 words of > # NodeDescription, and "Pnum" is a port number on that node. > port-name: vs1/HCA-1/P1 > port-name: vs3/HCA-1/P1 > port-name: vs3/HCA-2/P2 What about wild carding here, like vs1/*/* or just vs1? 
> end-port-group > > # using partitions defined in the partition policy > port-group > name: Group for Partition 1 > use: default settings > partition: Part1 > end-port-group > > # using node types CA|ROUTER|SWITCH Probably also ALL (for all ports), SELF (for SM port)? > port-group > name: Routers > use: all routers > node-type: ROUTER > end-port-group > > end-port-groups I agree that proposed syntax has better for human readability than pure XML, but isn't stuff like this will be more user-friendly? Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ; , or Storage "Free Text description" { 0x10001, 0x10002, 0x10003 }; , or Storage "Free Text description": ROUTERS, CAS ; > > qos-setup > > # define all types of VLArb tables. The length of the tables should > # match the physically supported tables by their target ports > vlarb-tables > # scope defines the exact ports the VLArb tables apply to > vlarb-scope > # defining VLArb tables on all the ports that belong to > # port group 'Storage', and on all the ports connected > # to ports of port group 'Storage' > group: Storage So "group" is only for ports that belong to 'Storage'? > # "across" means all the ports that are connected to ports > # that belong to the specified port group > across: Storage > # VLArb table holds VL and weight pairs > vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1 > vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3 > vl-high-limit: 10 > end-vlarb-scope > # There can be several scopes > end-vlarb-tables > > sl2vl-tables > # Scope defines the exact devices and in/out ports tables apply > to. > # Note: if the same port is matching several rules the *FIRST* > one applies. 
> sl2vl-scope > # SL2VL tables are orgnized as SL2VL(in-port,out-port) > # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*) > # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m) > # > # The following example specifies that all the SL2VL tables > # entries should be defined for all the ports of group > Part1: > group: Part1 > from: * > to: * > # SL2VL table has to have 16 values at max - one for each > SL. > # If the user specifies less than 16 values, all the missing > # VL values will be implicitly set to 0 > sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > end-sl2vl-scope > > sl2vl-scope > # "across-to" is a combination of "across" keyword > (definition can be found > # in VLArb tables section) and "to" keyword. > # "across: PortGroupName" refers to all the ports that are > connected > # to ports that belong to PortGroupName. > # > # Example of "across-to" usage: > # A user has a set of 'special' nodes (e.g. storage > nodes), and all > # the traffic to these nodes has to get specific VL. > # The solution is to define port group (i.g. "Storage") > that will > # include all the ports of these nodes, and then to > configure SL2VL > # tables on all the switch ports that are connected to the > Storage > # port group by specifying "across-to: Storage". > # > across-to: Storage2 > # Similar to "across-to", "across-from" is a combination of > "across" > # and "to" keywords > across-from: Storage1 > sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 > end-sl2vl-scope > end-sl2vl-tables > > end-qos-setup > > > qos-levels > > # the first one is just setting SL > qos-level > use: for the lowest priority communication > sl: 15 > packet-life: 16 > end-qos-level > # the second sets SL and QoS Class > qos-level > use: low latency best bandwidth > sl: 0 > end-qos-level > # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path > Bits > qos-level > use: just an example > sl: 0 > mtu-limit: 1 > rate-limit: 1 > packet-life: 12 > # Path Bits can be used e.g. 
to provide a different routes > through the > # subnet to a particular port > path-bits: 2,4,8-32 > end-qos-level > > end-qos-levels > > > # Match rules are scanned in a first-fit manner (like firewall rules > table) > qos-match-rules > > # matching by single criteria: class (list of values and ranges) > qos-match-rule > # just a description > use: low latency by class 7-9 or 11 > qos-class: 7-9,11 > # number of qos-level to apply to the matching PR/MPR > qos-level-sn: 1 Isn't it better and less error prone to match qos_level by name and not by sequential number? > end-qos-match-rule > # show matching by destination group AND service-ids > qos-match-rule > use: Storage targets connection > destination: Storage > service-id: 22,4719-5000 > qos-level-sn: 2 > end-qos-match-rule > # show matching by source group only > qos-match-rule > use: bla bla > source: Storage > qos-level-sn: 3 > end-qos-match-rule > > end-qos-match-rules > > > 4. IPoIB > --------- > > IPoIB already query the SA for its broadcast group information. The > additional > functionality required is for IPoIB to provide the broadcast group SL, MTU, > and RATE in every following PathRecord query performed when a new UDAV is > needed by IPoIB. > We could assign a special Service-ID for IPoIB use but since all > communication > on the same IPoIB interface shares the same QoS-Level without the ability to > differentiate it by target service we can ignore it for simplicity. > > 5. CMA features > ---------------- > > The CMA interface supports Service-ID through the notion of port space as a > prefixes to the port_num which is part of the sockaddr provided to > rdma_resolve_add(). What is missing is the explicit request for a QoS-Class > that > should allow the ULP (like SDP) to propagate a specific request for a class > of > service. A mechanism for providing the QoS-Class is available in the IPv6 > address, > so we could use that address field. 
Another option is to implement a special > connection options API for CMA. > > Missing functionality by CMA is the usage of the provided QoS-Class and > Service-ID > in the sent PR/MPR. When a response is obtained it is an existing > requirement for > the CMA to use the PR/MPR from the response in setting up the QP address > vector. > > > 6. SDP > ------- > > SDP uses CMA for building its connections. > The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits > holding the remote TCP/IP Port Number to connect to. > SDP might be provided with SO_PRIORITY socket option. In that case the value > provided should be sent to the CMA as the TClass option of that connection. > > 7. SRP > ------- > > Current SRP implementation uses its own CM callbacks (not CMA). So SRP > should > fill in the Service-ID in the PR/MPR by itself and use that information in > setting up the QP. The T10 SRP standard defines the SRP Service-ID to be > defined > by the SRP target I/O Controller (but they should also comply with IBTA > Service- > ID rules). Anyway, the Service-ID is reported by the I/O Controller in the > ServiceEntries DMA attribute and should be used in the PR/MPR if the SA > reports its ability to handle QoS PR/MPRs. > > 8. iSER > -------- > iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER > should be TBD. > > > 9. OpenSM features > ------------------- > The QoS related functionality to be provided by OpenSM can be split into two > main parts: > > 3.1. Fabric Setup > During fabric initialization the SM should parse the policy and apply its > settings to the discovered fabric elements. The following actions should be > performed: > * Parsing of policy > * Node Group identification. Warning should be provided for each node not > specified but found. > * SL2VL settings validation should be checked: > + A warning will be provided if there are no matching targets for the > SL2VL > setting statement. 
> + An error message will be printed to the log file if an invalid setting > is > found. A setting is invalid if it refers to: > - Non existing port numbers of the target devices > - Unsupported VLs for the target device. In the later case the map to > non > existing VLs should be replaced to VL15 i.e. packets will be dropped. I'm not sure it is optimal. We could have well documented or even configurable mapping rule instead, then this will not limit devices with higher capabilities. > * SL2VL setting is to be performed > * VL Arbitration table settings should be validated according to the > following > rules: > + A warning will be provided if there are no matching targets for the > setting > statement > + An error will be provided if the port number exceeds the target ports > + An error will be generated if the table length exceeds device > capabilities Ditto. > + A warning will be generated if the table quote a VL that is not > supported > by the target device What is "table quote" here? > * VL Arbitration tables will be set on the appropriate targets > > 3.2. PR/MPR query handling: > OpenSM should be able to enforce the provided policy on client request. > The overall flow for such requests is: first the request is matched against > the > defined match rules such that the target QoS-Level definition is found. > Given > the QoS-Level a path(s) search is performed with the given restrictions > imposed > by that level. The following two sections describe these steps. > > How Service-ID is carried in the PathRecord and MultiPathRecord attributes > is > now standardized by the IBTA. > > > 3.2.1. Matching rule search: > A rule is "matching" a PR/MPR request using the following criteria: > * Matching rules provide values in a list of either single value, or range > of > values. A PR/MPR field is "matching" the rule field if it is explicitly > noted in the list of values or is one of the values covered by a range > included in the field values list. 
> * Only PR/MPR fields that have their component mask bit set should be > compared. > * For a rule to be "matching" a PR/MPR request all the rule fields should be > "matching" their PR/MPR fields. Such that a PR/MPR request that does > not have a component mask field set for one of the rule defined fields > can > not match that rule. > * A PR/MPR request that have a component mask bit set for one of the fields > that is not defined by the rule can match the rule. Aren't last two too restrictive? SA can just to filter-out paths in response to match rest of the rule. No? > The algorithm to be used for searching for a rule match might be as simple > as a > sequential search through all rules or enhanced for better performance. The > semantics of every rule field and its matching PR/MPR field are described > below: > * Source: the SGID or SLID should be part of this group > * Destination: the DGID or DLID should be part of this group > * Service-ID: check if the requested Service-ID (available in the PR/MPR old > SM-Key field) is matching any of this rule Service-IDs > * TClass: check if the PR/MPR TClass field is matching > > 3.2.2 PR/MPR response generation: > The QoS-Level pointed by the first rule that matches the PR/MPR request > should be used for obtaining the response SL, MTU-Limit, RATE-Limit, > Path-Bits > and QoS-Class. A default QoS-Level should be used if no rule is matching the > query. Where this default should be defined? Sasha > The efficient algorithm for finding paths that meet the QoS-Level criteria > is > beyond the scope of this RFC and left for the implementer to provide. > However > the criteria by which the paths match the QoS-Level are described below: > > * SL: The paths found should all use the given SL. For that sake PR/MPR > algorithm should traverse the path from source to destination only through > ports that carry a valid VL (not VL15) by the SL2VL map (should consider > input > and output ports and SL). 
> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit > * Rate-Limit: The resulting paths RATE should not exceed the given > RATE-Limit > (rate limit is given in units of link BW = Width*Speed according to IBTA > Specification Vol-1 table-205 p-901 l-24). > * Path-Bits: define the target LID lowest bits (number of bits defined by > the > target port PortInfo.LMC field). The path should traverse the LFT using > the > target port LID with the path-bits set. > * QoS-Class: should be returned in the result PR/MPR. When routing is going > to > be supported by OpenSM we might use this field in selecting the target > router too in a TBD way. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From krkumar2 at in.ibm.com Sun Jul 22 19:53:25 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 08:23:25 +0530 Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition In-Reply-To: <20070722094136.GD7800@mellanox.co.il> Message-ID: Hi Micheal, "Michael S. Tsirkin" wrote on 07/22/2007 03:11:36 PM: > > + /* > > + * Handle skbs completion from tx_tail to wr_id. It is possible to > > + * handle WC's from earlier post_sends (possible multiple) in this > > + * iteration as we move from tx_tail to wr_id, since if the last > > + * WR (which is the one which had a completion request) failed to be > > + * sent for any of those earlier request(s), no completion > > + * notification is generated for successful WR's of those earlier > > + * request(s). > > + */ > > AFAIK a signalled WR will always generate a completion. > What am I missing? Yes, signalled WR will generate a completion. I am trying to catch the case where, say, I send 64 skbs and set signalling for only the last skb and the others are set to NO signalling. 
Now if the driver found the last WR was bad for some reason, it will
synchronously fail the send for that WR (which happens to be the only one
that is signalled). So after skbs 1 to 63 are finished, no completion will
be generated. That was my understanding of how this works, and I coded it
that way so that the next post will clean up the previous one's completion.

> >
> > +			/*
> > +			 * Better error handling can be done here, like free
> > +			 * all untried skbs if err == -ENOMEM. However at this
> > +			 * time, we re-try all the skbs, all of which will
> > +			 * likely fail anyway (unless device finished sending
> > +			 * some out in the meantime). This is not a regression
> > +			 * since the earlier code is not doing this either.
> > +			 */
>
> Are you retrying posting skbs? Why is this a good idea?
> AFAIK, earlier code did not retry posting WRs at all.

Not exactly. If I send 64 skbs to the device and the provider returned a
bad WR at skb# 50, then I will have to try skb# 51-64 again, since the
provider has not attempted to send those out: it bails out at the first
failure. The provider of course has already sent out skb# 1-49 before
returning failure at skb# 50. So it is not strictly a retry, just xmit of
the next skbs, which is what the current code also does. I tested this part
by simulating errors in mthca_post_send and verified that the next
iteration clears up the remaining skbs.

> The comment seems to imply that post send fails as a result of SQ overflow -

Correct.

> do you see SQ overflow errors in your testing?

No.

> AFAIK, IPoIB should never overflow the SQ.

Correct. It should never happen unless IPoIB has a bug :) I guess the
comment should be removed?

Thanks,

- KK

From krkumar2 at in.ibm.com Sun Jul 22 19:54:58 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 08:24:58 +0530
Subject: [ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
In-Reply-To: <46A38F8D.6080109@trash.net> Message-ID: Hi Patrick, Patrick McHardy wrote on 07/22/2007 10:40:37 PM: > Krishna Kumar wrote: > > diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h > > --- org/include/linux/if_link.h 2007-07-20 16:33:35.000000000 +0530 > > +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.000000000 +0530 > > @@ -78,6 +78,8 @@ enum > > IFLA_LINKMODE, > > IFLA_LINKINFO, > > #define IFLA_LINKINFO IFLA_LINKINFO > > + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */ > > +#define IFLA_TXBTHSKB IFLA_TXBTHSKB > > > Ughh what a name :) I prefer pronouncable names since they are > much easier to remember and don't need comments explaining > what they mean. > > But I actually think offering just an ethtool interface would > be better, at least for now. Great, I will remove /sys and rtnetlink and keep the Ethtool i/f. Thanks, - KK From krkumar2 at in.ibm.com Sun Jul 22 19:57:53 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 08:27:53 +0530 Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h In-Reply-To: <46A38EAB.6050300@trash.net> Message-ID: Hi Patrick, Patrick McHardy wrote on 07/22/2007 10:36:51 PM: > Krishna Kumar wrote: > > @@ -472,6 +474,9 @@ struct net_device > > void *priv; /* pointer to private data */ > > int (*hard_start_xmit) (struct sk_buff *skb, > > struct net_device *dev); > > + int (*hard_start_xmit_batch) (struct net_device > > + *dev); > > + > > > Os this function really needed? Can't you just call hard_start_xmit with > a NULL skb and have the driver use dev->blist? Probably not. I will see how to do it this way and get back to you. > > /* These may be needed for future network-power-down code. 
*/ > > unsigned long trans_start; /* Time (in jiffies) of last Tx */ > > > > @@ -582,6 +587,8 @@ struct net_device > > #define NETDEV_ALIGN 32 > > #define NETDEV_ALIGN_CONST (NETDEV_ALIGN - 1) > > > > +#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0) > > + > > static inline void *netdev_priv(const struct net_device *dev) > > { > > return dev->priv; > > @@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n > > struct sockaddr *); > > extern int dev_hard_start_xmit(struct sk_buff *skb, > > struct net_device *dev); > > +extern int dev_add_skb_to_blist(struct sk_buff *skb, > > + struct net_device *dev); > > > Again, function signatures should be introduced in the same patch > that contains the function. Splitting by file doesn't make sense. Right. I did it for some but missed this. Sorry, will redo. thanks, - KK From rdreier at cisco.com Sun Jul 22 20:48:39 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 22 Jul 2007 20:48:39 -0700 Subject: [ofa-general] Merge window for 2.6.23 closed Message-ID: Linus has released 2.6.23-rc1, and so the merge window for new features has closed. Of course fixes are always accepted at any time, and it definitely makes sense to submit new features early -- I will happily queue things up for 2.6.24 as soon as they are ready. Several things missed the merge window: - Sean's local SA changes. I guess I scared everyone into going slow, so I didn't have to make a hard choice here. However let's try to keep the discussion going so that we can finish this for 2.6.24. - IPoIB CM without SRQ. Pradeep, I'm sorry this missed the window but the patch quality really doesn't look up to par to me, and your being in a rush to get this merged I think has actually slowed things up. I think the basic idea is OK, but I have doubts about a static array as a data structure, and MST's comments about not dealing with remote implementations that send packets on passive connections looks quite serious as well. 
I would like to close this for 2.6.24 so (as above) please let's keep working this and not wait for the 2.6.24 merge window. - MST's "MSI-X by default" patches. The idea seems fine but I found a few minor issues and just ran out of time to review it. My fault-- sorry. And now I'm going on vacation for a week, so talk amongst yourselves... - Roland From rdreier at cisco.com Sun Jul 22 20:50:06 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 22 Jul 2007 20:50:06 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A2F696.4060007@voltaire.com> (Or Gerlitz's message of "Sun, 22 Jul 2007 09:17:58 +0300") References: <4696D1F3.2040507@ichips.intel.com> <15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com> <20070718050928.GA3103@obsidianresearch.com> <20070718072841.GC1115@mellanox.co.il> <469DD7BB.6060009@voltaire.com> <46A2F696.4060007@voltaire.com> Message-ID: > Do you agree that using cached IB L2 info where the net stack wants to > renew its IPoIB L2 (which is IB L3 && L4) info is a bug? Yes, looks that way. Also your point that there's no reason for IPoIB to keep the path info once it has created the AH makes sense to me. I haven't had a chance to look at the code but it seems we could kill off a lot of stuff by just creating AHs immediately and then dumping the path record. - R. From krkumar2 at in.ibm.com Sun Jul 22 21:23:29 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 09:53:29 +0530 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <20070720125423.GB13468@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 07/20/2007 06:24:23 PM: > Hi Krishna. > > On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > > After fine-tuning qdisc and other changes, I modified IPoIB to use this API, > > and now get good gains. 
Summary for TCP & No Delay: 1 process improves for
> > all cases from 1.4% to 49.5%; 4 process has almost identical improvements
> > from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to
> > 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP
> > was tested with 1 process netperf with small increase in BW but big
> > improvement in Service Demand. Netperf latency tests show small drop in
> > transaction rate (results in separate attachment).
>
> What about round-robin tcp time and latency test? In theory such batching
> mode should not change that timings, but practice can show new aspects.

The TCP RR results show a slight impact; however, the service demand shows
good improvement. The results are (I did TCP RR - 1 process, 1,8,32,128,512
buffer sizes; and UDP RR - 1 process, 1 byte buffer size):

Results for TCP RR (1 process) ORG code:
Size    R-R          CPU%    S.Demand
------------------------------------------------------------
1       521346.02    5.48    1346.145
8       129463.14    6.74    418.370
32      128899.73    7.51    467.106
128     127230.15    5.42    340.876
512     119605.68    6.48    435.650

Results for TCP RR (1 process) NEW code (and change%):
Size    R-R                    CPU%    S.Demand
--------------------------------------------------------------------
1       516596.62 (-0.91%)     5.74    1423.819 (5.77%)
8       129184.46 (-.22%)      5.43    336.747 (-19.51%)
32      128238.35 (-.51%)      5.43    339.213 (-27.38%)
128     126545.79 (-.54%)      5.36    339.188 (-0.50%)
512     119297.49 (-.26%)      5.16    346.185 (-20.54%)

Results for UDP RR 1 process ORG & NEW code:
Code    Size    R-R                  CPU%    S.Demand
----------------------------------------------------------------------
ORG     1       539327.86            5.68    1348.985
NEW     1       540669.33 (0.25%)    6.05    1434.180 (6.32%)

> I will review code later this week (likely tomorrow) and if there will
> be some issues return back.

Thanks! I had just submitted Rev2 on Sunday, please let me know what you
find.
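For readers who want to recheck the change% columns, a minimal sketch of the arithmetic follows. It assumes change% means (NEW - ORG) / ORG * 100, which is consistent with the figures in the tables. The helper name pct_change is hypothetical, and the rows are the TCP RR service demand values copied from above.

```python
def pct_change(org, new):
    """Relative change of new versus org, in percent."""
    return (new - org) / org * 100.0


# (buffer size, ORG service demand, NEW service demand) from the TCP RR tables
rows = [
    (1, 1346.145, 1423.819),
    (8, 418.370, 336.747),
    (32, 467.106, 339.213),
    (128, 340.876, 339.188),
    (512, 435.650, 346.185),
]

for size, org, new in rows:
    # A negative value means lower service demand, i.e. less CPU per transaction
    print(f"{size:>4}  {pct_change(org, new):+.2f}%")
```

With this definition, a positive change in the R-R column is an improvement, while a positive change in the S.Demand column is a regression.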
Regards, - KK From kliteyn at mellanox.co.il Sun Jul 22 21:43:44 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 23 Jul 2007 07:43:44 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-23:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From krkumar2 at in.ibm.com Sun Jul 22 21:49:53 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 10:19:53 +0530 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <1185108670.5192.122.camel@localhost> Message-ID: Hi Jamal, J Hadi Salim wrote on 07/22/2007 06:21:09 PM: > My concern is there is no consistency in results. I see improvements on > something which you say dont. You see improvement in something that > Evgeniy doesnt etc. Hmmm ? 
Evgeniy has not even tested my code to find some regression :) And you may possibly not find much improvement in E1000 when you run iperf (which is what I do) compared to pktgen. I can re-run and confirm this since my last E1000 run was quite some time back. My point is that batching not being viable for E1000 (or tg3) need not be the sole criterea for inclusion. If IPoIB or other drivers can take advantage of it and get better results, then batching can be considered. Maybe E1000 too can get improvements if some one with more expertise tries to add this API (not judging your driver writing capabilities - just stating that driver writers will know more knobs to exploit a complex device like E1000). > > Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't find > > it), > > I feel having prep will not help as no other cpu can execute the queue/xmit > > code anyway (E1000 is also a LLTX driver). > > My experiments show it is useful (in a very visible way using pktgen) > for e1000 to have the prep() interface. I meant : have you compared results of batching with prep on vs prep off, and what is the difference in BW ? > > Other driver that hold tx lock could get improvement however. > > So you do see the value then with non LLTX drivers, right? ;-> No. I see value only in non-LLTX drivers which also gets the same TX lock in the RX path. If different locks are got by TX/RX, then since you are holding queue_lock before calling 'prep', this excludes other TX from running at the same time. In that case, pre-poning the get of the tx_lock to do the 'prep' will not cause any degradation (since no other tx can run anyway, while rx can run as it gets a different lock). > The value is also there in LLTX drivers even if in just formating a skb > ready for transmit. If this is not clear i could do a much longer > writeup on my thought evolution towards adding prep(). 
In LLTX drivers, the driver does the 'prep' without holding the tx_lock in any case, so there should be no improvement. Could you send the write-up since I really don't see the value in prep unless the driver is non-LLTX *and* TX/RX holds the same TX lock. I think that is the sole criterea, right ? > > If it helps, I guess you could send me a patch to > > add that and I can also test it to see what the effect is. I didn't add it > > since IPoIB wouldn't be able to exploit it (unless someone is kind enough > > to show me how to). > > Such core code should not just be focussed on IPOIB. There is *nothing* IPoIB specific or focus in my code. I said adding prep doesn't work for IPoIB and so it is pointless to add bloat to the code until some code can actually take advantage of this feature (I am sure you will agree). Which is why I also mentioned to please send me a patch if you find it useful for any driver rather than rejecting this idea. > > I think the code I have is ready and stable, > > I am not sure how to intepret that - are you saying all-is-good and we > should just push your code in? I am only too well aware that Dave will not accept any code (having experienced with Mobile IPv6 a long time back when he said to move most of it to userspace and he was absolutely correct :). What I meant to say is that there isn't much point in saying that your code is not ready or you are using old code base, or has multiple restart functions, or is not tested enough, etc, and then say let's re-do/rethink the whole implementation when my code is already working and giving good results. Unless you have some design issues with it, or code is written badly, is not maintainable, not linux style compliant, is buggy, will not handle some case/workload, type of issues. OTOH, if you find some cases that are better handled with : 1. prep handler 2. xmit_win (which I don't have now), then please send me patches and I will also test out and incorporate. 
> It sounds disingenuous but i may have misread you. ("lacking in frankness, candor, or sincerity; falsely or hypocritically ingenuous; insincere") ???? Sorry, no response to personal comments and have a flame-war :) Thanks, - KK From sri at us.ibm.com Sun Jul 22 22:59:39 2007 From: sri at us.ibm.com (Sridhar Samudrala) Date: Sun, 22 Jul 2007 22:59:39 -0700 Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes. In-Reply-To: References: Message-ID: <46A443CB.6060200@us.ibm.com> Krishna Kumar2 wrote: > Hi Sridhar, > > Sridhar Samudrala wrote on 07/20/2007 10:55:05 PM: >>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h >>> --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 >>> +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.000000000 +0530 >>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge >>> struct rtattr *tab); >>> extern void qdisc_put_rtab(struct qdisc_rate_table *tab); >>> >>> -extern void __qdisc_run(struct net_device *dev); >>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head > *blist); >> Why do we need this additional 'blist' argument? >> Is this different from dev->skb_blist? > > It is the same, but I want to call it mostly with NULL and rarely with the > batch list pointer (so it is related to your other question). My original > code didn't have this and was trying batching in all cases. But in most > xmit's (probably almost all), there will be only one packet in the queue to > send and batching will never happen. When there is a lock contention or if > the queue is stopped, then the next iteration will find >1 packets. But I > still will try no batching for the lock failure case as there be probably > 2 packets (one from previous time and 1 from this time, or 3 if two > failures, > etc), and try batching only when queue was stopped from net_tx_action (this > was based on Dave Miller's idea). Is this right to say that the above change is to get this behavior? 
If qdisc_run() is called from dev_queue_xmit() don't use batching. If qdisc_run() is called from net_tx_action(), do batching. Isn't it possible to have multiple skb's in the qdisc queue in the first case? If this additional argument is used to indicate if we should do batching or not, then passing a flag may be much more cleaner than passing the blist. Thanks Sridhar From krkumar2 at in.ibm.com Sun Jul 22 23:27:07 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 11:57:07 +0530 Subject: [ofa-general] Re: [PATCH 02/10] Networking include file changes. In-Reply-To: <46A443CB.6060200@us.ibm.com> Message-ID: Hi Sridhar, Sridhar Samudrala wrote on 07/23/2007 11:29:39 AM: > Krishna Kumar2 wrote: > > Hi Sridhar, > > > > Sridhar Samudrala wrote on 07/20/2007 10:55:05 PM: > >>> diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h > >>> --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530 > >>> +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.000000000 +0530 > >>> @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge > >>> struct rtattr *tab); > >>> extern void qdisc_put_rtab(struct qdisc_rate_table *tab); > >>> > >>> -extern void __qdisc_run(struct net_device *dev); > >>> +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head > > *blist); > >> Why do we need this additional 'blist' argument? > >> Is this different from dev->skb_blist? > > > > It is the same, but I want to call it mostly with NULL and rarely with the > > batch list pointer (so it is related to your other question). My original > > code didn't have this and was trying batching in all cases. But in most > > xmit's (probably almost all), there will be only one packet in the queue to > > send and batching will never happen. When there is a lock contention or if > > the queue is stopped, then the next iteration will find >1 packets. 
But I > > still will try no batching for the lock failure case as there be probably > > 2 packets (one from previous time and 1 from this time, or 3 if two > > failures, > > etc), and try batching only when queue was stopped from net_tx_action (this > > was based on Dave Miller's idea). > > Is this right to say that the above change is to get this behavior? > If qdisc_run() is called from dev_queue_xmit() don't use batching. > If qdisc_run() is called from net_tx_action(), do batching. Correct. > Isn't it possible to have multiple skb's in the qdisc queue in the > first case? It is possible but rarer (so unnecessary checking most of the time). From net_tx_action you are guaranteed to have multiple skbs, but from xmit you will almost always get one skb (since most send of 1 skb will go out OK). And also in the xmit path, it is more likely to have few skbs compared to possibly hundreds in the net_tx_action path. > If this additional argument is used to indicate if we should do batching > or not, then passing a flag may be much more cleaner than passing the blist. OK, I will add this as another action item to check (along with Patrick's suggestion to use single API) and will get back. - KK From eli at mellanox.co.il Sun Jul 22 23:32:19 2007 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 23 Jul 2007 09:32:19 +0300 Subject: [ofa-general] ipoib question Message-ID: <1185172339.5513.11.camel@mtls03> Roland, can you explain why you add 1 to the size of the CQ in ipoib_transport_dev_init()? 
	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
	if (IS_ERR(priv->mr)) {
		printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name);
		goto out_free_pd;
	}

	size = ipoib_sendq_size + ipoib_recvq_size + 1;
	ret = ipoib_cm_dev_init(dev);

From monisonlists at gmail.com Mon Jul 23 00:07:42 2007
From: monisonlists at gmail.com (Moni Shoua)
Date: Mon, 23 Jul 2007 10:07:42 +0300
Subject: [ofa-general] PATCH] IB/core: ignore membership bit when looking for a P_Key in the table
In-Reply-To: <46A36E77.5020307@gmail.com>
References: <46A36E77.5020307@gmail.com>
Message-ID: <46A453BE.3030408@gmail.com>

I am resending the patch with a Signed-off-by line. Sorry.
------------------------------------------------------------------

IPoIB turns on the P_Key membership bit of limited membership P_Keys when
creating a child interface. After that IPoIB looks for the full membership
P_Key in the table to make the interface "RUNNING". This patch fixes the
pkey lookup so that it matches full and partial membership keys that belong
to the same partition.

Signed-off-by: Moni Shoua
---
 device.c | 2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: infiniband/drivers/infiniband/core/device.c
===================================================================
--- infiniband.orig/drivers/infiniband/core/device.c 2007-07-08 12:45:07.000000000 +0300
+++ infiniband/drivers/infiniband/core/device.c 2007-07-22 17:43:32.440829619 +0300
@@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic
 		if (ret)
 			return ret;
 
-		if (pkey == tmp_pkey) {
+		if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
 			*index = i;
 			return 0;
 		}

From dotanb at dev.mellanox.co.il Mon Jul 23 00:36:44 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 23 Jul 2007 10:36:44 +0300
Subject: [ofa-general] I think that there is a resource leak in the core file mad_rmpp.c
Message-ID: <46A45A8C.2090800@dev.mellanox.co.il>

Hi.

I reviewed the file mad_rmpp.c and it seems that there is a leak of the
Address Handle.
The AH that is being created in the function "alloc_response_msg" is never being destroyed. This thing causes to resource (AH) and memory leak. thanks Dotan From eitan at mellanox.co.il Mon Jul 23 00:31:14 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jul 2007 10:31:14 +0300 Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer toopensm-coding-style.txt In-Reply-To: <20070722221455.GR27878@sashak.voltaire.com> References: <20070722221455.GR27878@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com> Hi Sasha, So we will finally have a common enforced coding style! When do you plan to run it on all the files? Or should we just make sure every new committed file will first pass this indent? Thanks Eitan Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Sasha Khapyorsky > Sent: Monday, July 23, 2007 1:15 AM > To: general at lists.openfabrics.org > Cc: Yevgeny Kliteynik > Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go > closer toopensm-coding-style.txt > > > This updates the script according to recent > doc/opensm-coding-style.txt (in short K&R, tabs, etc.). > > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_indent | 57 > +++------------------------------------------ > 1 files changed, 4 insertions(+), 53 deletions(-) > > diff --git a/opensm/opensm/osm_indent > b/opensm/opensm/osm_indent index bed2ba1..621184b 100755 > --- a/opensm/opensm/osm_indent > +++ b/opensm/opensm/osm_indent > @@ -1,6 +1,6 @@ > #!/bin/bash > # > -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > # Copyright (c) 2002-2005 Mellanox Technologies LTD. All > rights reserved. 
> # Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > # > @@ -40,56 +40,7 @@ > # Environment: > # Linux User Mode > # > -# $Revision: 1.4 $ > -# > -# > -# This is the indent format used for OpenSM. > -# > -# format the source code according to the ACD standard > -# -bad Blank line after declarations > -# -bap Blank line after Procedures > -# -bbb Blank line before block comments > -# -nbbo Break after Boolean operator > -# -bl Break after if line > -# -bli0 Indent for braces is 0 > -# -bls Break after struct declarations > -# -cbi0 Case break indent 0 > -# -ci3 Continue indent 3 spaces > -# -cli0 Case label indent 0 spaces > -# -ncs No space after cast operator > -# -hnl Honor existing newlines on long lines > -# -i3 Substitute indent with 3 spaces > -# -npcs No space after procedure calls > -# -prs Space after parenthesis > -# -nsai No space after if keyword - removed > -# -nsaw No space after while keyword - removed > -# -sc Put * at left of comments in a block comment style > -# -nsob Don't swallow unnecessary blank lines > -# -ts3 Tab size is 3 > -# -psl Type of procedure return in a separate line > -# -bfda Function declaration arguments in a separate line. > -# -nut No tabs as we allow spaces > -# > -############################################################# > ############ > - > -# indent the world > -for sourcefile in $*; do > - if test -f "$sourcefile"; then > - # first, string DOS style linefeeds > - perl -piW -e's/\x0D//' "$sourcefile" > - echo Processing $sourcefile > - indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 > -ci3 -cli0 -ncs \ > - -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl > -bfda -nut $sourcefile > - > - rm ${sourcefile}W > +# This is the indent format used for OpenSM (similar to one > used in # > +linux/scripts/Lindent). 
> > - # the -bb also affect the first line in each file - > so clean it up > - if test `head -1 $sourcefile | egrep -v '^$' | wc > -l` = 0; then > - echo Cleaning up first empty line of $sourcefile > - awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W > - mv -f ${sourcefile}W $sourcefile > - fi > - else > - echo Could not find file:$sourcefile > - fi > -done > +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@" > -- > 1.5.3.rc2.29.gc4640f > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Mon Jul 23 00:35:25 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jul 2007 10:35:25 +0300 Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <20070722174048.GO27878@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com> Hi Sasha, > On 14:59 Sun 22 Jul , Eitan Zahavi wrote: > > Hi Sasha > > > > Let's assume someone has reset a switch on the fabric. > > What would cause the SM to re-assign the LFT of that switch? > > OpenSM will sweep and drop this switch and when switch will > back it will be initialized again. But if the reset was too > fast (relative to discovery), we can be in trouble (and maybe > not only with LFTs). > > > I assumed that there is a mechanism to do that. > > Not for "fast" switch reboot. So we have a problem with these fast resetting devices. > > Hmm, I think we could try to detect this case by comparing > SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even > by seeing that PortInfo:LID is not set. 
Something like below: > I think we should have a predicate that will be used to mark a port/device as needing a full update. Not just LFT but everything (SL2VL, VLArb, LID, PKey ... If a device was reset then it probably lost everything). Another approach is to mark it for the entire fabric. The original intention of kill -HUP was to force a new heavy sweep and setup. I this another signal is acceptible but not required. Thanks Eitan From mst at dev.mellanox.co.il Mon Jul 23 00:57:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jul 2007 10:57:54 +0300 Subject: [ofa-general] commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 Message-ID: <20070723075754.GC20614@mellanox.co.il> Hi! commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 includes this snippet: @@ -468,20 +465,8 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd, req->fmr = NULL; } - /* - * This handling of non-SG commands can be killed when the - * SCSI midlayer no longer generates non-SG commands. - */ - if (likely(scmnd->use_sg)) { - nents = scmnd->use_sg; - scat = scmnd->request_buffer; - } else { - nents = 1; - scat = &req->fake_sg; - } - - ib_dma_unmap_sg(target->srp_host->dev->dev, scat, nents, - scmnd->sc_data_direction); + ib_dma_unmap_sg(target->srp_host->dev->dev, scsi_sglist(scmnd), + scsi_sg_count(scmnd), scmnd->sc_data_direction); Since scsi_sg_count is simply use_sg, and scsi_sglist is simply request_buffer, why is this the right things to do? Is there a reason to believe that scsi_sg_count is never 0 here? 
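The difference being asked about can be modelled in user space. The sketch below assumes, as stated above, that scsi_sg_count() is simply use_sg and scsi_sglist() is simply request_buffer; the struct and function names are simplified stand-ins invented for illustration, not the kernel's definitions, and only the argument selection is modelled, not ib_dma_unmap_sg() itself.

```c
#include <stddef.h>

/* Stand-in for struct scsi_cmnd: only the two fields discussed above. */
struct model_cmnd {
	unsigned short use_sg;   /* number of scatter/gather entries */
	void *request_buffer;    /* SG list when use_sg != 0 */
};

/* Stand-in for the SRP request; fake_sg models req->fake_sg. */
struct model_req {
	int fake_sg;
};

/* The (list, count) pair that would be handed to the unmap call. */
struct unmap_args {
	void *scat;
	unsigned int nents;
};

/* Selection done by the code the commit removes. */
struct unmap_args old_unmap_args(struct model_cmnd *c, struct model_req *r)
{
	struct unmap_args a;

	if (c->use_sg) {
		a.nents = c->use_sg;
		a.scat = c->request_buffer;
	} else {
		/* non-SG command: one entry, taken from the fake SG */
		a.nents = 1;
		a.scat = &r->fake_sg;
	}
	return a;
}

/* Selection done by the code the commit adds (under the assumption above). */
struct unmap_args new_unmap_args(struct model_cmnd *c)
{
	struct unmap_args a = { c->request_buffer, c->use_sg };

	return a;
}
```

In this model the two versions agree whenever use_sg is non-zero; with use_sg == 0 the old code selects one entry from the fake SG while the new code selects zero entries, which is exactly the case the question is about.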
-- MST From ogerlitz at voltaire.com Mon Jul 23 01:26:31 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 23 Jul 2007 11:26:31 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: References: <4696D1F3.2040507@ichips.intel.com><15ddcffd0707172020j5b68fcb2v7d3ca77863998020@mail.gmail.com><20070718050928.GA3103@obsidianresearch.com> <20070718072841.GC1115@mellanox.co.il><469DD7BB.6060009@voltaire.com> <46A2F696.4060007@voltaire.com> Message-ID: <46A46637.3080104@voltaire.com> Roland Dreier wrote: > > > Do you agree that using cached IB L2 info where the net stack wants to > > renew its IPoIB L2 (which is IB L3 && L4) info is a bug? > > Yes, looks that way. > > Also your point that there's no reason for IPoIB to keep the path info > once it has created the AH makes sense to me. I haven't had a chance > to look at the code but it seems we could kill off a lot of stuff by > just creating AHs immediately and then dumping the path record. Indeed. It does make sense to keep the path info for admin / debugging purposes, eg printing them through debugfs etc, but no more. In the context of the local sa, this seems to be another requirement namely: provide the consumer with an API to specify if it is willing to get from the ib_sa module a cached IB L2 info (path) or not. As I said above, if the network stack decides to renew its IPoIB L2 info, the IB stack must provide it with non-cached IB L2 info Or. From mst at dev.mellanox.co.il Mon Jul 23 01:30:20 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 23 Jul 2007 11:30:20 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A46637.3080104@voltaire.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> Message-ID: <20070723083020.GD20614@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [ofa-general] Re: IPoIB path caching > > Roland Dreier wrote: > > > > > Do you agree that using cached IB L2 info where the net stack wants to > > > renew its IPoIB L2 (which is IB L3 && L4) info is a bug? > > > >Yes, looks that way. > > > >Also your point that there's no reason for IPoIB to keep the path info > >once it has created the AH makes sense to me. I haven't had a chance > >to look at the code but it seems we could kill off a lot of stuff by > >just creating AHs immediately and then dumping the path record. > > Indeed. > > It does make sense to keep the path info for admin / debugging purposes, > eg printing them through debugfs etc, but no more. > > In the context of the local sa, this seems to be another requirement > namely: provide the consumer with an API to specify if it is willing to > get from the ib_sa module a cached IB L2 info (path) or not. > > As I said above, if the network stack decides to renew its IPoIB L2 > info, the IB stack must provide it with non-cached IB L2 info If what you have in mind is keeping local sa cache in sync with IPoIB cache, wouldn't it be better to have an API to invalidate a cache entry? 
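The two interface shapes discussed here (a per-query "do not use the cache" flag versus an explicit call that invalidates a cache entry) can be compared with a toy path cache. This is only a user-space sketch under stated assumptions: every name in it (path_cache, path_lookup, path_invalidate, PATH_QUERY_NO_CACHE, query_sa) is hypothetical and does not correspond to the ib_sa module's actual API.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 8

/* One cached "path record", reduced to a generation number. */
struct path_entry {
	uint16_t dlid;
	uint32_t path_gen;
	bool valid;
};

struct path_cache {
	struct path_entry slot[CACHE_SLOTS];
	uint32_t sa_queries;     /* how many times the SA was really asked */
};

/* Hypothetical flag: the consumer refuses a cached answer. */
enum { PATH_QUERY_NO_CACHE = 1 };

/* Stand-in for a real SA path query: every call yields a fresh result. */
static uint32_t query_sa(struct path_cache *pc)
{
	return ++pc->sa_queries;
}

/* Option 1: the caller chooses per query whether a cached path is acceptable. */
uint32_t path_lookup(struct path_cache *pc, uint16_t dlid, int flags)
{
	struct path_entry *e = &pc->slot[dlid % CACHE_SLOTS];

	if (!(flags & PATH_QUERY_NO_CACHE) && e->valid && e->dlid == dlid)
		return e->path_gen;      /* served from the cache */

	e->dlid = dlid;
	e->path_gen = query_sa(pc);      /* refresh the entry from the SA */
	e->valid = true;
	return e->path_gen;
}

/* Option 2: the caller drops the entry; the next lookup must go to the SA. */
void path_invalidate(struct path_cache *pc, uint16_t dlid)
{
	struct path_entry *e = &pc->slot[dlid % CACHE_SLOTS];

	if (e->valid && e->dlid == dlid)
		e->valid = false;
}
```

In this toy model both options end up issuing a fresh query; the difference is who carries the policy. With the flag, a consumer such as IPoIB states its requirement on every request; with invalidation, it has to know when its own view became stale and tell the cache.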
-- MST From vlad at lists.openfabrics.org Mon Jul 23 01:39:39 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 23 Jul 2007 01:39:39 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070723-0100 daily build status Message-ID: <20070723083940.1743EE603BD@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on 
ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From ogerlitz at voltaire.com Mon Jul 23 01:43:09 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 23 Jul 2007 11:43:09 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070723083020.GD20614@mellanox.co.il> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> Message-ID: <46A46A1D.6040000@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> Roland Dreier wrote: >>>> Do you agree that using cached IB L2 info where the net stack wants to >>>> renew its IPoIB L2 (which is IB L3 && L4) info is a bug? >>> Yes, looks that way. >>> Also your point that there's no reason for IPoIB to keep the path info >>> once it has created the AH makes sense to me. I haven't had a chance >>> to look at the code but it seems we could kill off a lot of stuff by >>> just creating AHs immediately and then dumping the path record. >> Indeed. >> As I said above, if the network stack decides to renew its IPoIB L2 >> info, the IB stack must provide it with non-cached IB L2 info > If what you have in mind is keeping local sa cache in sync > with IPoIB cache, wouldn't it be better to have an API to > invalidate a cache entry? What I have in mind is that IPoIB must not use cached IB path info. If the IB stack has path caching which is in the default flow of requesting a path record, it should provide an API (eg flag to the function through which one does path query) to request a non cached path. 
The design I was thinking to suggest for IPoIB is to almost always use this API since this policy makes the implementation consistent with the decisions made by the network stack neighbour cache Or. From shemminger at linux-foundation.org Mon Jul 23 02:44:08 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Mon, 23 Jul 2007 10:44:08 +0100 Subject: [ofa-general] Re: TCP and batching WAS(Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <1185025579.5192.68.camel@localhost> References: <20070720063149.26341.84076.sendpatchset@localhost.localdomain> <20070720081848.7cc652fb@oldman> <1185025579.5192.68.camel@localhost> Message-ID: <20070723104408.169b0724@oldman.hamilton.local> On Sat, 21 Jul 2007 09:46:19 -0400 jamal wrote: > On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote: > > > You may see worse performance with batching in the real world when > > running over WAN's. Like TSO, batching will generate back to back packet > > trains that are subject to multi-packet synchronized loss. > > Has someone done any study on TSO effect? Not that I have seen, TCP research tends to turn of NAPI and TSO because it causes other effects which are too confusing for measurement.
The discussion of TSO usually shows up in discussions of pacing. I have seen argument both pro and con for pacing. The most convincing arguments are that pacing doesn't help in the general case (and therefore TSO would be ok). > Doesnt ECN with a RED router > help on something like this? Yes, but RED is not deployed on backbone, and ECN only slightly. Most common is over sized FIFO queues. > I find it suprising that a single flow doing TSO would overwhelm a > routers buffer. I actually think the value of batching as far as TCP is > concerned is propotional to the number of flows. i.e the more flows you > have the more batching you will end up doing. And if TCPs fairness is > the legend talk it has been made to be, then i dont see this as > problematic. It is not that TSO would overwhelm the router by itself, just that any congested link will have periods when there is only a small number of available slots left. When this happens a TSO burst will get truncated. The argument against pacing, and for TSO; is that the busy sender with large congestion window is the one most likely to have send large bursts. For fairness, the system works better if the busy sender gets penalized more, and dropping the latter part of the burst does that. With pacing, the sender may be able to saturate the router more and not detect that it is monopolizing the bandwidth. > BTW, something i noticed regards to GSO when testing batching: > For TCP packets slightly above MDU (upto 2K), GSO gives worse > performance than non-GSO. Actually has nothing to do with batching, > rather it works the same way with or without batching changes. > > Another oddity: > Looking at the flow rate from a purely packets/second (I know thats a > router centric view, but i found it strange nevertheless) - you see that > as packet size goes up, the pps also goes up. I tried mucking around > with nagle etc, but saw no observable changes. Any insight? 
> My expectation was that the pps would stay at least the same or get
> better with smaller packets (assuming theres less data to push around).
>
> cheers,
> jamal
>
>
>

From krkumar2 at in.ibm.com Mon Jul 23 02:53:27 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 23 Jul 2007 15:23:27 +0530
Subject: [ofa-general] Re: [PATCH 00/12 -Rev2] Implement batching skb API
In-Reply-To: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID:

> I have started a 10 run test for various buffer sizes and processes, and
> will post the results on Monday.

The 10 iteration run results for Rev2 are (average) :

----------------------------------------------------------------------------------
Test Case                        Org         New      %Change
----------------------------------------------------------------------------------
TCP 1 Process
    Size:32                     2703        3063        13.31
    Size:128                   12948       12217        -5.64
    Size:512                   48108       55384        15.12
    Size:4096                 129089      132586         2.70
    Average:                  192848      203250         5.39

TCP 4 Processes
    Size:32                    10389       10768         3.64
    Size:128                   39694       42265         6.47
    Size:512                  159563      156373        -1.99
    Size:4096                 268094      256008        -4.50
    Average:                  477740      465414        -2.58

TCP No Delay 1 Process
    Size:32                     2606        2950        13.20
    Size:128                    8115       11864        46.19
    Size:512                   39113       42608         8.93
    Size:4096                 103966      105333         1.31
    Average:                  153800      162755         5.82

TCP No Delay 4 Processes
    Size:32                     4213        8727       107.14
    Size:128                   17579       35143        99.91
    Size:512                   70803      123936        75.04
    Size:4096                 203541      225259        10.67
    Average:                  296136      393065        32.73
--------------------------------------------------------------------------
Average:                     1120524     1224484        9.28%

There are three cases that degrade a little (upto -5.6%), but there are 13
cases that improve, and many of those are in the 13% to over 100% (7 cases).

Thanks,

- KK

Krishna Kumar2/India/IBM at IBMIN wrote on 07/22/2007 02:34:57 PM:

> This set of patches implements the batching API, and makes the following
> changes resulting from the review of the first set:
>
> Changes :
> ---------
> 1.
Changed skb_blist from pointer to static as it saves only 12 bytes > (i386), but bloats the code. > 2. Removed requirement for driver to set "features & NETIF_F_BATCH_SKBS" > in register_netdev to enable batching as it is redundant. Changed this > flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other > user changable calls can modify this bit to enable/disable batching. > 3. Added ethtool support to enable/disable batching (not tested). > 4. Added rtnetlink support to enable/disable batching (not tested). > 5. Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers > should not have a small queue anyway (adding bloat). > 6. skbs are purged from dev_deactivate instead of from unregister_netdev > to drop all references to the device. > 7. Removed changelog in source code in sch_generic.c, and unrelated renames > from sch_generic.c (lockless, comments). > 8. Removed xmit_slots entirely, as it was adding bloat (code and header) > and not adding value (it is calculated and set twice in internal send > routine and handle work completion, and referenced once in batch xmit; > and can instead be calculated once in xmit). > > Issues : > -------- > 1. Remove /sysfs support completely ? > 2. Whether rtnetlink support is required as GSO has only ethtool ? > > Patches are described as: > Mail 0/12 : This mail. > Mail 1/12 : HOWTO documentation. > Mail 2/12 : Changes to netdevice.h > Mail 3/12 : dev.c changes. > Mail 4/12 : Ethtool changes. > Mail 5/12 : sysfs changes. > Mail 6/12 : rtnetlink changes. > Mail 7/12 : Change in qdisc_run & qdisc_restart API, modify callers > to use this API. > Mail 8/12 : IPoIB include file changes. > Mail 9/12 : IPoIB verbs changes > Mail 10/12 : IPoIB multicast, CM changes > Mail 11/12 : IPoIB xmit API addition > Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c) > > I have started a 10 run test for various buffer sizes and processes, and > will post the results on Monday. 
> > Please review and provide feedback/ideas; and consider for inclusion. > > Thanks, > > - KK From ogerlitz at voltaire.com Mon Jul 23 02:56:12 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 23 Jul 2007 12:56:12 +0300 (IDT) Subject: [ofa-general] 20% latency increase between UD to RC latency Message-ID: OK,its always good to start with facts on the ground... before commiting this test, my original thinking was that for messages whose size=X is less then the IB Link level MTU it holds that: latency(X,UD) <= latency(X,UC) <= latency(X,RC) Running the latency test provided with the perftest package on my systems (*) I get the below results. Does anyone has insight why the --minimal-- and typical UD latency is 1us ( = 20%) worse then the --minimal-- and typical RC latency??? Or. (*) the system spec is: HW : 4 way Intel Xeon 1.6GHz 4GB RAM IB HW: Arbel memfull (25208) DDR running in SDR mode HCA FW: 4.8.200 IB SW: OFED 1.2 OS : RH4 U3 i386 smp [root at rain5 ~]# /usr/bin/ib_send_lat -c RC -n 100000 172.30.8.61 ------------------------------------------------------------------ Send Latency Test Inline data is used up to 400 bytes message Connection type : RC local address: LID 0x26 QPN 0x330407 PSN 0xf6ba57 remote address: LID 0x28 QPN 0x40407 PSN 0xc2c9f9 Mtu : 2048 ------------------------------------------------------------------ #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 2 100000 4.75 38.53 4.82 ------------------------------------------------------------------ [root at rain5 ~]# /usr/bin/ib_send_lat -c UC -n 100000 172.30.8.61 ------------------------------------------------------------------ Send Latency Test Inline data is used up to 400 bytes message Connection type : UC local address: LID 0x26 QPN 0x340407 PSN 0xbb4a0e remote address: LID 0x28 QPN 0x50407 PSN 0xb916a9 Mtu : 2048 ------------------------------------------------------------------ #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 2 100000 4.71 42.03 4.77 
------------------------------------------------------------------ [root at rain5 ~]# /usr/bin/ib_send_lat -c UD -n 100000 172.30.8.61 ------------------------------------------------------------------ Send Latency Test Inline data is used up to 400 bytes message Connection type : UD local address: LID 0x26 QPN 0x350407 PSN 0xfdc2c0 remote address: LID 0x28 QPN 0x60407 PSN 0x63c30e ------------------------------------------------------------------ #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 2 100000 5.71 44.51 5.81 ------------------------------------------------------------------ From shemminger at linux-foundation.org Mon Jul 23 02:56:29 2007 From: shemminger at linux-foundation.org (Stephen Hemminger) Date: Mon, 23 Jul 2007 10:56:29 +0100 Subject: [ofa-general] Re: [PATCH 04/10] net-sysfs.c changes. In-Reply-To: References: <20070720172203.0eaeea86@oldman> Message-ID: <20070723105629.278fcce3@oldman.hamilton.local> On Sat, 21 Jul 2007 12:16:30 +0530 Krishna Kumar2 wrote: > Stephen Hemminger wrote on 07/20/2007 > 09:52:03 PM: > > Patrick McHardy wrote: > > > > > Krishna Kumar2 wrote: > > > > Patrick McHardy wrote on 07/20/2007 03:37:20 PM: > > > > > > > > > > > > > > > >> rtnetlink support seems more important than sysfs to me. > > > >> > > > > > > > > Thanks, I will add that as a patch. The reason to add to sysfs is > that > > > > it is easier to change for a user (and similar to tx_queue_len). > > > > > > > > > > > But since batching is so similar to TSO, i really should be part of the > > flags and controlled by ethtool like other offload flags. > > So should I add all three interfaces (or which ones) : > > 1. /sys (like for tx_queue_len) > 2. netlink > 3. ethtool. > > Or only 2 & 3 are enough ? > Yes, please do #3 and maybe #2. Sysfs api's are a long term ABI problem. 
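The suggestion in this exchange, that batching should be one more bit in the device feature flags and be switched on and off like the other offload flags, can be sketched in a few lines. This is a minimal user-space illustration under stated assumptions, not kernel code: `struct fake_dev`, `set_feature()`, and both bit values are hypothetical and are not taken from netdevice.h or from the patch set.

```c
#include <stdbool.h>

/* Hypothetical feature bits, chosen for illustration only. */
#define FEAT_TSO       (1UL << 0)
#define FEAT_BATCH_ON  (1UL << 1)

/* Stand-in for the features word of a network device. */
struct fake_dev {
	unsigned long features;
};

/*
 * Turn one feature bit on or off.
 * Returns true if the device state changed, false if the bit was
 * already in the requested state.
 */
static bool set_feature(struct fake_dev *dev, unsigned long bit, bool enable)
{
	/* Reduce the masked word to a boolean before comparing. */
	bool currently_on = (dev->features & bit) != 0;

	if (currently_on == enable)
		return false;

	if (enable)
		dev->features |= bit;
	else
		dev->features &= ~bit;
	return true;
}
```

A caller would use `set_feature(&dev, FEAT_BATCH_ON, true)` to enable batching and check the return value to decide whether any follow-up work (such as purging queued skbs) is needed. Reducing both sides to a boolean first keeps the "already in this state" check readable.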
From vlad at lists.openfabrics.org Mon Jul 23 03:06:20 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 23 Jul 2007 03:06:20 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070723-0220 daily build status Message-ID: <20070723100620.84D6EE60814@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 
Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From johnpol at 2ka.mipt.ru Mon Jul 23 03:44:28 2007 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Mon, 23 Jul 2007 14:44:28 +0400 Subject: [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes. In-Reply-To: <20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com> References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> <20070722090525.7787.10432.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <20070723104428.GC22877@2ka.mipt.ru> Hi Krishna. On Sun, Jul 22, 2007 at 02:35:25PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote: > diff -ruNp org/net/core/dev.c rev2/net/core/dev.c > --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530 > +++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530 > @@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi > } > } > > +/* > + * dev_change_tx_batching - Enable or disable batching for a driver that > + * supports batching. 
> + */
> +int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
> +{
> +        int ret;
> +
> +        if (!dev->hard_start_xmit_batch) {
> +                /* Driver doesn't support skb batching */
> +                ret = -ENOTSUPP;
> +                goto out;
> +        }
> +
> +        /* Handle invalid argument */
> +        if (new_batch_skb < 0) {
> +                ret = -EINVAL;
> +                goto out;
> +        }
> +
> +        ret = 0;
> +
> +        /* Check if new value is same as the current */
> +        if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> +                goto out;

o_O

Scratched head for too long before understood what it means :)

> +        spin_lock(&dev->queue_lock);
> +        if (new_batch_skb) {
> +                dev->features |= NETIF_F_BATCH_ON;
> +                dev->tx_queue_len >>= 1;
> +        } else {
> +                if (!skb_queue_empty(&dev->skb_blist))
> +                        skb_queue_purge(&dev->skb_blist);
> +                dev->features &= ~NETIF_F_BATCH_ON;
> +                dev->tx_queue_len <<= 1;
> +        }
> +        spin_unlock(&dev->queue_lock);

Hmm, should this also stop interrupts?

-- 
	Evgeniy Polyakov

From johnpol at 2ka.mipt.ru  Mon Jul 23 03:48:26 2007
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Mon, 23 Jul 2007 14:48:26 +0400
Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
In-Reply-To: <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
References: <20070722090457.7787.4601.sendpatchset@K50wks273871wss.in.ibm.com> <20070722090649.7787.47960.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <20070723104826.GD22877@2ka.mipt.ru>

On Sun, Jul 22, 2007 at 02:36:49PM +0530, Krishna Kumar (krkumar2 at in.ibm.com) wrote:
> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.000000000 +0530
> +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-22 00:08:37.000000000 +0530
> @@ -242,8 +242,9 @@ repost:
>  static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>  {
>          struct ipoib_dev_priv *priv = netdev_priv(dev);
> +        int i = 0, num_completions;
> +
int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1); > unsigned int wr_id = wc->wr_id; > - struct ipoib_tx_buf *tx_req; > unsigned long flags; > > ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", > @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct > return; > } > > - tx_req = &priv->tx_ring[wr_id]; > + num_completions = wr_id - tx_ring_index + 1; > + if (num_completions <= 0) > + num_completions += ipoib_sendq_size; Can this still be less than zero? -- Evgeniy Polyakov From krkumar2 at in.ibm.com Mon Jul 23 04:17:45 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 16:47:45 +0530 Subject: [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition In-Reply-To: <20070723104826.GD22877@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 07/23/2007 04:18:26 PM: > > static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > + int i = 0, num_completions; > > + int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1); > > unsigned int wr_id = wc->wr_id; > > - struct ipoib_tx_buf *tx_req; > > unsigned long flags; > > > > ipoib_dbg_data(priv, "send completion: id %d, status: %d\n", > > @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct > > return; > > } > > > > - tx_req = &priv->tx_ring[wr_id]; > > + num_completions = wr_id - tx_ring_index + 1; > > + if (num_completions <= 0) > > + num_completions += ipoib_sendq_size; > > Can this still be less than zero? Should never happen, otherwise the TX code wrote on bad/unallocated memory and would have crashed first. Thanks, - KK From krkumar2 at in.ibm.com Mon Jul 23 04:17:25 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 23 Jul 2007 16:47:25 +0530 Subject: [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes. 
In-Reply-To: <20070723104428.GC22877@2ka.mipt.ru> Message-ID: Hi Evgeniy, Evgeniy Polyakov wrote on 07/23/2007 04:14:28 PM: > > +/* > > + * dev_change_tx_batching - Enable or disable batching for a driver that > > + * supports batching. > > + /* Check if new value is same as the current */ > > + if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb) > > + goto out; > > o_O > > Scratched head for too long before understood what it means :) Is there a easy way to do this ? > > + spin_lock(&dev->queue_lock); > > + if (new_batch_skb) { > > + dev->features |= NETIF_F_BATCH_ON; > > + dev->tx_queue_len >>= 1; > > + } else { > > + if (!skb_queue_empty(&dev->skb_blist)) > > + skb_queue_purge(&dev->skb_blist); > > + dev->features &= ~NETIF_F_BATCH_ON; > > + dev->tx_queue_len <<= 1; > > + } > > + spin_unlock(&dev->queue_lock); > > Hmm, should this also stop interrupts? That is a good question, and I am not sure. I thought it is not required, though adding it doesn't affect code either. Can someone tell if disabling bh is required and why (couldn't figure out the intention of bh for dev_queue_xmit either, is this to disable preemption) ? Thanks, - KK From fujita.tomonori at lab.ntt.co.jp Mon Jul 23 04:20:55 2007 From: fujita.tomonori at lab.ntt.co.jp (FUJITA Tomonori) Date: Mon, 23 Jul 2007 20:20:55 +0900 Subject: [ofa-general] Re: commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 In-Reply-To: <20070723075754.GC20614@mellanox.co.il> References: <20070723075754.GC20614@mellanox.co.il> Message-ID: <20070723202055P.fujita.tomonori@lab.ntt.co.jp> From: "Michael S. Tsirkin" Subject: commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 Date: Mon, 23 Jul 2007 10:57:54 +0300 > Hi! > commit bb350d1decd9c48ffaa7f7e263df3056df9f4f21 includes this snippet: > > @@ -468,20 +465,8 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd, > req->fmr = NULL; > } > > - /* > - * This handling of non-SG commands can be killed when the > - * SCSI midlayer no longer generates non-SG commands. 
> - */ > - if (likely(scmnd->use_sg)) { > - nents = scmnd->use_sg; > - scat = scmnd->request_buffer; > - } else { > - nents = 1; > - scat = &req->fake_sg; > - } > - > - ib_dma_unmap_sg(target->srp_host->dev->dev, scat, nents, > - scmnd->sc_data_direction); > + ib_dma_unmap_sg(target->srp_host->dev->dev, scsi_sglist(scmnd), > + scsi_sg_count(scmnd), scmnd->sc_data_direction); > > Since scsi_sg_count is simply use_sg, and scsi_sglist is simply request_buffer, > why is this the right things to do? That will change shortly. http://marc.info/?l=linux-scsi&m=118364319919621&w=2 > Is there a reason to believe that scsi_sg_count is never 0 here? Yeah, scsi-ml doesn't send non-SG commands now. From ogerlitz at voltaire.com Mon Jul 23 04:30:21 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 23 Jul 2007 14:30:21 +0300 (IDT) Subject: [ofa-general] Re: 20% latency increase between UD to RC latency In-Reply-To: References: Message-ID: On Mon, 23 Jul 2007, Or Gerlitz wrote: > my original thinking was that for messages > whose size=X is less then the IB Link level MTU it holds that: > > latency(X,UD) <= latency(X,UC) <= latency(X,RC) > > Running the latency test provided with the perftest package on my systems (*) > I get the below results. Does anyone has insight why the --minimal-- and typical > UD latency is 1us ( = 20%) worse then the --minimal-- and typical RC latency??? running the latecy test on a similar system (RH5 smp / four way Xeon 1.9GHz / 4GB RAM) but this time with the memfree --Hermon-- HCA (25418 / FW 2.1.0) the minimal AND typical UD latecy is very much the same as the RC and UC ones which is ~1.5us So it both fixes the UD issue on Arbel and improves the latency from 4.5us to 1.5us nice, Or. 
[root at iris6 ~]# ib_send_lat -c RC 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
  local address:  LID 0x06 QPN 0xb004a PSN 0xe6014a
  remote address: LID 0x05 QPN 0xb004a PSN 0xbe437f
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.39           8.08             1.42
------------------------------------------------------------------

[root at iris6 ~]# ib_send_lat -c UC 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UC
  local address:  LID 0x06 QPN 0xc004a PSN 0xb4281e
  remote address: LID 0x05 QPN 0xc004a PSN 0xb14013
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.37           8.17             1.42
------------------------------------------------------------------

[root at iris6 ~]# ib_send_lat -c UD 172.30.3.252
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : UD
  local address:  LID 0x06 QPN 0xd004a PSN 0xf63264
  remote address: LID 0x05 QPN 0xd004a PSN 0xf7821
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.47           7.66             1.51
------------------------------------------------------------------

[root at iris6 ~]# ibv_devinfo
hca_id: mlx4_0
        fw_ver:                 2.1.000
        node_guid:              0002:c903:0000:0434
        sys_image_guid:         0002:c903:0000:0437
        vendor_id:              0x02c9
        vendor_part_id:         25418
        hw_ver:                 0xA0
        board_id:               MT_04A0110002
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       6
                        port_lmc:       0x00

                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

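The UD-versus-RC gap reported so far in this thread can be put in relative terms with a few lines of arithmetic. The sketch below is only an illustration: the `t_typical` values are copied from the Arbel and Hermon `ib_send_lat` runs posted above, while `struct lat_sample`, `samples`, and `ud_overhead_pct()` are names made up for this example.

```c
/* One pair of typical latencies, in microseconds, for a single HCA. */
struct lat_sample {
	const char *hca;     /* label used in the thread */
	double rc_typical;   /* t_typical[usec] for -c RC */
	double ud_typical;   /* t_typical[usec] for -c UD */
};

/* Figures taken from the ib_send_lat output quoted in the thread. */
static const struct lat_sample samples[] = {
	{ "Arbel (25208)",  4.82, 5.81 },
	{ "Hermon (25418)", 1.42, 1.51 },
};

/* How much slower UD is than RC, as a percentage of the RC latency. */
static double ud_overhead_pct(const struct lat_sample *s)
{
	return (s->ud_typical - s->rc_typical) / s->rc_typical * 100.0;
}
```

With these inputs the Arbel figure comes out a little above 20 percent, in line with the "1us ( = 20%)" estimate, and the Hermon figure is between 6 and 7 percent.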
From ogerlitz at voltaire.com Mon Jul 23 04:44:00 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 23 Jul 2007 14:44:00 +0300 (IDT) Subject: [ofa-general] Re: 20% latency increase between UD to RC latency In-Reply-To: References: Message-ID: On Mon, 23 Jul 2007, Or Gerlitz wrote: >> Running the latency test provided with the perftest package on my systems (*) >> I get the below results. Does anyone has insight why the --minimal-- and typical >> UD latency is 1us ( = 20%) worse then the --minimal-- and typical RC latency??? > running the latecy test on a similar system (RH5 smp / four way Xeon 1.9GHz / > 4GB RAM) but this time with the memfree --Hermon-- HCA (25418 / FW 2.1.0) the minimal > AND typical UD latecy is very much the same as the RC and UC ones which is ~1.5us > So it both fixes the UD issue on Arbel and improves the latency from 4.5us to 1.5us A third run, now over a memfree Sinai HCA (25204 / FW 1.2.0) the UD and RC latency are quite the same, around 5.3us but the result is worse then the Arbel one in about 0.7us ... Or. 
[root at src1 ~]# ib_send_lat -c RC storm7 ------------------------------------------------------------------ Send Latency Test Inline data is used up to 400 bytes message Connection type : RC local address: LID 0x09 QPN 0xd50407 PSN 0x7aaaf1 remote address: LID 0x0b QPN 0x0405 PSN 0x292565 Mtu : 2048 ------------------------------------------------------------------ #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 2 1000 5.27 63.27 5.32 ------------------------------------------------------------------ [root at src1 ~]# ib_send_lat -c UD storm7 ------------------------------------------------------------------ Send Latency Test Inline data is used up to 400 bytes message Connection type : UD local address: LID 0x09 QPN 0xd60407 PSN 0xcc70ba remote address: LID 0x0b QPN 0x10405 PSN 0x6794b5 ------------------------------------------------------------------ #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] 2 1000 5.38 38.88 5.45 ------------------------------------------------------------------ From hadi at cyberus.ca Mon Jul 23 05:32:01 2007 From: hadi at cyberus.ca (jamal) Date: Mon, 23 Jul 2007 08:32:01 -0400 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: References: Message-ID: <1185193921.26013.37.camel@localhost> KK, On Mon, 2007-23-07 at 10:19 +0530, Krishna Kumar2 wrote: > Hmmm ? Evgeniy has not even tested my code to find some regression :) And > you may possibly not find much improvement in E1000 when you run iperf > (which is what I do) compared to pktgen. Pktgen is the correct test (or the closest to correct) because it tests the driver tx path. iperf/netperf test the effect of batching on tcp/udp. Infact i would start with udp first. What you need to do if testing end-2-end is see where the effects occur. For example, it is feasible that batching is a little too aggressive and the receiver cant keep up (netstat -s before and after will be helpful). Maybe by such insight we can improve things. 
> > My experiments show it is useful (in a very visible way using pktgen) > > for e1000 to have the prep() interface. > > I meant : have you compared results of batching with prep on vs prep off, > and > what is the difference in BW ? Yes, and these results were sent to you as well a while back. When i get the time when i get back i will look em up in my test machine and resend. > No. I see value only in non-LLTX drivers which also gets the same TX lock > in the RX path. So _which_ non-LLTX driver doesnt do that? ;-> > > The value is also there in LLTX drivers even if in just formating a skb > > ready for transmit. If this is not clear i could do a much longer > > writeup on my thought evolution towards adding prep(). > > In LLTX drivers, the driver does the 'prep' without holding the tx_lock in > any case, so there should be no improvement. Could you send the write-up I will - please give me sometime; i am overloaded at the moment. > There is *nothing* IPoIB specific or focus in my code. > I said adding prep > doesn't > work for IPoIB and so it is pointless to add bloat to the code until some > code can tun driver doesnt use it either - but i doubt that makes it "bloat" > What I meant to say > is that there isn't much point in saying that your code is not ready or > you are using old code base, or has multiple restart functions, or is not > tested enough, etc, and then say let's re-do/rethink the whole > implementation when my code is already working and giving good results. The suggestive hand gesturing is the kind of thing that bothers me. What do you think: Would i be submitting patches in baed on 2.6.22-rc4? Would it make sense to include parallel qdisc paths? For heavens sake, i have told you i would be fine with accepting such changes when the qdisc restart changes went in first. You waltz in, have the luxury of looking at my code, presentations, many discussions with me etc ... 
When i ask for differences to code you produced, they now seem to sum up to the two below. You dont think theres some honest issue with this picture? > OTOH, if you find some cases that are better handled with : > 1. prep handler > 2. xmit_win (which I don't have now), > then please send me patches and I will also test out and incorporate. > And then of course you will end up adding those because they are both useful, just calling them some other name. And then you will end up incorporating all the drivers i invested many hours (as a gratitous volunteer) to change and test - maybe you will change varibale names or rearrange some function. I am a very compromising person; i have no problem coauthoring these patches if you actually invest useful time like fixing things up and doing proper tests. But you are not doing that - instead you are being extremely aggressive and hijacking the whole thing. It is courteous if you find somebody else has a patch you point out whats wrong preferably with some proof. > > It sounds disingenuous but i may have misread you. > > ("lacking in frankness, candor, or sincerity; falsely or hypocritically > ingenuous; insincere") ???? Sorry, no response to personal comments and > have a flame-war :) Give me a better description. cheers, jamal From hal.rosenstock at gmail.com Mon Jul 23 05:54:22 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 23 Jul 2007 05:54:22 -0700 Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking for a P_Key in the table In-Reply-To: <46A36E77.5020307@gmail.com> References: <46A36E77.5020307@gmail.com> Message-ID: On 7/22/07, Moni Shoua wrote: > > IPoIB turns on the P_Key membership bit of limited membership P_Keys > when creating a child interface. After that IPoIB looks for the full > membership P_key in the table to make the interface "RUNNING". This > patch fixes the pkey lookup in order to match full and partial membership > keys that belong of the same partition. 
> > device.c | 2 +- > 1 files changed, 1 insertion(+), 1 deletion(-) > > Index: infiniband/drivers/infiniband/core/device.c > =================================================================== > --- infiniband.orig/drivers/infiniband/core/device.c 2007-07-08 12:45: > 07.000000000 +0300 > +++ infiniband/drivers/infiniband/core/device.c 2007-07-22 17:43: > 32.440829619 +0300 > @@ -702,7 +702,7 @@ int ib_find_pkey(struct ib_device *devic > if (ret) > return ret; > > - if (pkey == tmp_pkey) { > + if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) { Wouldn't this allow 2 limited PKeys to match though ? -- Hal *index = i; > return 0; > } > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Mon Jul 23 05:59:28 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Jul 2007 15:59:28 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5EFE@mtlexch01.mtl.com> Message-ID: <20070723125928.GU16597@sashak.voltaire.com> Hi Eitan, On 10:35 Mon 23 Jul , Eitan Zahavi wrote: > Hi Sasha, > > > On 14:59 Sun 22 Jul , Eitan Zahavi wrote: > > > Hi Sasha > > > > > > Let's assume someone has reset a switch on the fabric. > > > What would cause the SM to re-assign the LFT of that switch? > > > > OpenSM will sweep and drop this switch and when switch will > > back it will be initialized again. 
But if the reset was too > > fast (relative to discovery), we can be in trouble (and maybe > > not only with LFTs). > > > > > I assumed that there is a mechanism to do that. > > > > Not for "fast" switch reboot. > So we have a problem with these fast resetting devices. > > > > Hmm, I think we could try to detect this case by comparing > > SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even > > by seeing that PortInfo:LID is not set. Something like below: > > > I think we should have a predicate that will be used to mark a > port/device as needing a full update. Agreed, but what is the best criteria? LID == 0 will work in many cases, but LID initialization is not required by spec. The only strong requirement I found is Port State. Another ideas? > Not just LFT but everything (SL2VL, VLArb, LID, PKey ... If a device was > reset then it probably lost everything). Right, all incrementally updated data should be flushed - osm_physp and osm_switch are affected objects. > Another approach is to mark it for the entire fabric. It is too expensive IMO, and not much easier to implement. Sasha > > The original intention of kill -HUP was to force a new heavy sweep and > setup. > I this another signal is acceptible but not required. > > Thanks > > Eitan From sashak at voltaire.com Mon Jul 23 06:09:12 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 23 Jul 2007 16:09:12 +0300 Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go closer toopensm-coding-style.txt In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com> References: <20070722221455.GR27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com> Message-ID: <20070723130912.GV16597@sashak.voltaire.com> On 10:31 Mon 23 Jul , Eitan Zahavi wrote: > > So we will finally have a common enforced coding style! > When do you plan to run it on all the files? In the "spare" time :). 
I'm thinking about doing this in steps by subdirectories starting from header files. Also would be nice to not do huge styling updates during OFED 1.3 cycle. > Or should we just make sure every new committed file will first pass > this indent? This is the good option, however would be nice to not mix style fixing patches with functional ones (more or the less as described in opensm/doc/opensm-coding-style.txt). Sasha > > Thanks > > Eitan > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > > Sasha Khapyorsky > > Sent: Monday, July 23, 2007 1:15 AM > > To: general at lists.openfabrics.org > > Cc: Yevgeny Kliteynik > > Subject: [ofa-general] [PATCH resend] opensm/osm_indent: go > > closer toopensm-coding-style.txt > > > > > > This updates the script according to recent > > doc/opensm-coding-style.txt (in short K&R, tabs, etc.). > > > > Signed-off-by: Sasha Khapyorsky > > --- > > opensm/opensm/osm_indent | 57 > > +++------------------------------------------ > > 1 files changed, 4 insertions(+), 53 deletions(-) > > > > diff --git a/opensm/opensm/osm_indent > > b/opensm/opensm/osm_indent index bed2ba1..621184b 100755 > > --- a/opensm/opensm/osm_indent > > +++ b/opensm/opensm/osm_indent > > @@ -1,6 +1,6 @@ > > #!/bin/bash > > # > > -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > > # Copyright (c) 2002-2005 Mellanox Technologies LTD. All > > rights reserved. > > # Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > # > > @@ -40,56 +40,7 @@ > > # Environment: > > # Linux User Mode > > # > > -# $Revision: 1.4 $ > > -# > > -# > > -# This is the indent format used for OpenSM. 
> > -# > > -# format the source code according to the ACD standard > > -# -bad Blank line after declarations > > -# -bap Blank line after Procedures > > -# -bbb Blank line before block comments > > -# -nbbo Break after Boolean operator > > -# -bl Break after if line > > -# -bli0 Indent for braces is 0 > > -# -bls Break after struct declarations > > -# -cbi0 Case break indent 0 > > -# -ci3 Continue indent 3 spaces > > -# -cli0 Case label indent 0 spaces > > -# -ncs No space after cast operator > > -# -hnl Honor existing newlines on long lines > > -# -i3 Substitute indent with 3 spaces > > -# -npcs No space after procedure calls > > -# -prs Space after parenthesis > > -# -nsai No space after if keyword - removed > > -# -nsaw No space after while keyword - removed > > -# -sc Put * at left of comments in a block comment style > > -# -nsob Don't swallow unnecessary blank lines > > -# -ts3 Tab size is 3 > > -# -psl Type of procedure return in a separate line > > -# -bfda Function declaration arguments in a separate line. > > -# -nut No tabs as we allow spaces > > -# > > -############################################################# > > ############ > > - > > -# indent the world > > -for sourcefile in $*; do > > - if test -f "$sourcefile"; then > > - # first, string DOS style linefeeds > > - perl -piW -e's/\x0D//' "$sourcefile" > > - echo Processing $sourcefile > > - indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 > > -ci3 -cli0 -ncs \ > > - -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl > > -bfda -nut $sourcefile > > - > > - rm ${sourcefile}W > > +# This is the indent format used for OpenSM (similar to one > > used in # > > +linux/scripts/Lindent). 
> > > > - # the -bb also affect the first line in each file - > > so clean it up > > - if test `head -1 $sourcefile | egrep -v '^$' | wc > > -l` = 0; then > > - echo Cleaning up first empty line of $sourcefile > > - awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W > > - mv -f ${sourcefile}W $sourcefile > > - fi > > - else > > - echo Could not find file:$sourcefile > > - fi > > -done > > +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@" > > -- > > 1.5.3.rc2.29.gc4640f > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > From mst at dev.mellanox.co.il Mon Jul 23 07:31:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 23 Jul 2007 17:31:28 +0300 Subject: [ofa-general] open-isci patches updated Message-ID: <20070723143128.GL20614@mellanox.co.il> Hi! I have updated the ofed_kernel tree to 2.6.23-rc1. I had to update the following backport patches because of conflicts: kernel_patches/backport/2.6.16_sles10/open-iscsi-tx-hash-fixes.patch kernel_patches/backport/2.6.16_sles10_sp1/open-iscsi-tx-hash-fixes.patch kernel_patches/backport/2.6.18_FC6/open-iscsi-tx-hash-fixes.patch kernel_patches/backport/2.6.18/open-iscsi-tx-hash-fixes.patch Erez, could you please check that I did the right thing there? 
The code is here:
git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel

Thanks,
MST

--
MST

From hal.rosenstock at gmail.com Mon Jul 23 07:44:50 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 10:44:50 -0400
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking for a P_Key in the table
In-Reply-To: <46A4BE3E.4080606@gmail.com>
References: <46A36E77.5020307@gmail.com> <46A4BE3E.4080606@gmail.com>
Message-ID:

Hi Moni,

On 7/23/07, Moni Shoua wrote:
>
> Hal Rosenstock wrote:
> >
> > - if (pkey == tmp_pkey) {
> > + if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
> >
> >
> > Wouldn't this allow 2 limited PKeys to match though ?
> Hi Hal,
> Can you please explain what do you mean? Perhaps by example?

Two Pkeys which both have their full membership bit (0x8000) off. Two limited members are not allowed to talk with each other.

-- Hal

>
> > -- Hal
> >
> > *index = i;
> > return 0;
> > }
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >
> >
>

From davem at systemfabricworks.com Mon Jul 23 07:52:55 2007
From: davem at systemfabricworks.com (David McMillen)
Date: Mon, 23 Jul 2007 09:52:55 -0500
Subject: [ofa-general] Command specification of ca_name and ca_port
Message-ID: <46A4C0C7.7020107@systemfabricworks.com>

There is a standard set of command line options that allow specification of the CA to use for sending the requests. I'm adding these to programs that don't have them, since they are very useful when diagnosing a node connected to multiple subnets.
Even if you discount multiple subnets on purpose, sometimes this happens when the hardware connecting all of the CA ports to the same place gets broken, and that is when you need diagnostics that can help figure out what is where.

The standard options are:

  -C  use the specified ca_name.
  -P  use the specified ca_port.
  -t  override the default timeout for the solicited mads.

My problem is that saquery already uses -C and -P for other purposes, although -t exists for the expected purpose. Also, ibcheckerrs already uses -t for specifying the threshold file. Changing the timeout for ibcheckerrs isn't critical, but not being able to do it doesn't seem right. However, the saquery command could be really handy for figuring out split fabrics, and is useful to those of us that connect to multiple subnets.

Does anybody have a useful suggestion?

Thanks,
Dave McMillen

From hal.rosenstock at gmail.com Mon Jul 23 08:30:31 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 23 Jul 2007 11:30:31 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration
In-Reply-To: <20070722174048.GO27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com>
Message-ID:

Hi Sasha,

On 7/22/07, Sasha Khapyorsky wrote:
> On 14:59 Sun 22 Jul , Eitan Zahavi wrote:
> > Hi Sasha
> >
> > Let's assume someone has reset a switch on the fabric.
> > What would cause the SM to re-assign the LFT of that switch?
>
> OpenSM will sweep and drop this switch and when switch will back it will
> be initialized again. But if the reset was too fast (relative to
> discovery), we can be in trouble (and maybe not only with LFTs).
>
> > I assumed that there is a mechanism to do that.
>
> Not for "fast" switch reboot.
> > Hmm, I think we could try to detect this by comparing > SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even by seeing > that PortInfo:LID is not set. Not sure about checking PortInfo:LID. Wouldn't that approach need to be qualified by PortState (armed or active) ? LFTTop seems better to me or perhaps a combination of the two but I may be missing something. > Something like below: > > > diff --git a/opensm/include/opensm/osm_switch.h > b/opensm/include/opensm/osm_switch.h > index 5b2b19e..62c072f 100644 > --- a/opensm/include/opensm/osm_switch.h > +++ b/opensm/include/opensm/osm_switch.h > @@ -112,6 +112,7 @@ typedef struct _osm_switch > osm_fwd_tbl_t fwd_tbl; > osm_mcast_tbl_t mcast_tbl; > uint32_t discovery_count; > + unsigned update_ft; > void *priv; > } osm_switch_t; > /* > @@ -152,6 +153,10 @@ typedef struct _osm_switch > * during the current fabric sweep. This number is reset > * to zero at the start of a sweep. > * > +* update_ft > +* When set fwd tables will be updated regardless to entry > +* values locally stored in fwd tables images > +* > * SEE ALSO > * Switch object > *********/ > diff --git a/opensm/opensm/osm_port_info_rcv.c > b/opensm/opensm/osm_port_info_rcv.c > index adece65..8bbbcac 100644 > --- a/opensm/opensm/osm_port_info_rcv.c > +++ b/opensm/opensm/osm_port_info_rcv.c > @@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port( > break; > } > } > + else if (port_num == 0 && p_node->sw && > + (!p_pi->base_lid || !p_pi->master_sm_base_lid)) > + p_node->sw->update_ft = 1; > > /* > Update the PortInfo attribute. 
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index b44a3ba..03516ae 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table( > osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ; > block_id_ho++ ) > { > - if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) > + if (!p_sw->update_ft && > + !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) > continue; > > if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > @@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table( > } > } > > + p_sw->update_ft = 0; > OSM_LOG_EXIT( p_mgr->p_log ); > } > > > > BTW what do you think is the best way to detect switch power up? I > didn't really find a strong requirement for at powerup initialization of > any suitable component. Peer switch link state change is insufficient to differentiate switch reboot from "normal" link up/down. There is no IB standard indication of this. > > Anyway, kill -HUP should flush out the state and restart from scratch. > > Thinking more about it I'm not sure. Similar flush will be required for > another "stored" components like pkey, sl2vl tables etc.. So it is more > than just "regular" heavy sweep, another signal or option could be used > for this, but OTOH it becomes very close to OpenSM restarting.. Shouldn't this be automatic rather than requiring the admin to issue a signal somehow ? -- Hal Sasha > > > > > > > Eitan > > > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Sunday, July 22, 2007 1:22 PM > > > To: Eitan Zahavi > > > Cc: OPENIB; hal.rosenstock at gmail.com; Yevgeny Kliteynik > > > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration > > > > > > Hi Eitan, > > > > > > On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > > > > Hi Sasha > > > > > > > > I am running some tests manually and apparently it looks > > > like I found > > > > a bug. 
Here is the sequence of things: > > > > 1. SM sweeps the fabric assign LFTs > > > > 2. I manually modify some LFTs (single entry now marked > > > UNREACHABLE 3. > > > > I force some switch change bit to 1 or issue kill -HUP 4. The SM > > > > reports SUBNET UP 5. The modified LFT entry is still > > > UNREACHABLE and > > > > the path is broken > > > > > > Right, in most cases (unless OpenSM has its own changes in > > > the same LFT > > > block) OpenSM will refer its own LFT image for "need to update" > > > decision, so _manual_ changes will not trigger new update. > > > Rerunning OpenSM should help however. > > > > > > > It looks to me some optimization of routing does not fully reroute > > > > unless some condition is met - but that condition does not > > > include the > > > > above triggers listed in step 3. > > > > > > Rereading all fabrics LFTs by default seems to be too > > > expensive operations. At least by default, if it is real > > > requirement this could be enforced manually, for example when > > > kill -HUP is used. Thoughts? > > > > > > Sasha > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Mon Jul 23 08:33:50 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 23 Jul 2007 11:33:50 -0400 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> Message-ID: On 7/23/07, Hal Rosenstock wrote: > > Hi Sasha, > > On 7/22/07, Sasha Khapyorsky wrote: > > > On 14:59 Sun 22 Jul , Eitan Zahavi wrote: > > > Hi Sasha > > > > > > Let's assume someone has reset a switch on the fabric. > > > What would cause the SM to re-assign the LFT of that switch? 
> > > > OpenSM will sweep and drop this switch and when switch will back it will > > be initialized again. But if the reset was too fast (relative to > > discovery), we can be in trouble (and maybe not only with LFTs). > > > > > I assumed that there is a mechanism to do that. > > > > Not for "fast" switch reboot. > > > > Hmm, I think we could try to detect this by comparing > > SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even by seeing > > that PortInfo:LID is not set. > > > Not sure about checking PortInfo:LID. Wouldn't that approach need to be > qualified by PortState (armed or active) ? LFTTop seems better to me or > perhaps a combination of the two but I may be missing something. > Another thought on this :-( Not sure that resetting either LID or LFTTop is required by the spec either so this is relying on "beyond the spec" behavior and may not be true for all switch implementations. -- Hal > > > Something like below: > > > > > > diff --git a/opensm/include/opensm/osm_switch.h > > b/opensm/include/opensm/osm_switch.h > > index 5b2b19e..62c072f 100644 > > --- a/opensm/include/opensm/osm_switch.h > > +++ b/opensm/include/opensm/osm_switch.h > > @@ -112,6 +112,7 @@ typedef struct _osm_switch > > osm_fwd_tbl_t fwd_tbl; > > osm_mcast_tbl_t mcast_tbl; > > uint32_t discovery_count; > > + unsigned update_ft; > > void *priv; > > } osm_switch_t; > > /* > > @@ -152,6 +153,10 @@ typedef struct _osm_switch > > * during the current fabric sweep. This number is reset > > * to zero at the start of a sweep. 
> > * > > +* update_ft > > +* When set fwd tables will be updated regardless to entry > > +* values locally stored in fwd tables images > > +* > > * SEE ALSO > > * Switch object > > *********/ > > diff --git a/opensm/opensm/osm_port_info_rcv.c > > b/opensm/opensm/osm_port_info_rcv.c > > index adece65..8bbbcac 100644 > > --- a/opensm/opensm/osm_port_info_rcv.c > > +++ b/opensm/opensm/osm_port_info_rcv.c > > @@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port( > > break; > > } > > } > > + else if (port_num == 0 && p_node->sw && > > + (!p_pi->base_lid || !p_pi->master_sm_base_lid)) > > + p_node->sw->update_ft = 1; > > > > /* > > Update the PortInfo attribute. > > diff --git a/opensm/opensm/osm_ucast_mgr.c > > b/opensm/opensm/osm_ucast_mgr.c > > index b44a3ba..03516ae 100644 > > --- a/opensm/opensm/osm_ucast_mgr.c > > +++ b/opensm/opensm/osm_ucast_mgr.c > > @@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table( > > osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ; > > block_id_ho++ ) > > { > > - if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) > > + if (!p_sw->update_ft && > > + !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) > > continue; > > > > if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > > @@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table( > > } > > } > > > > + p_sw->update_ft = 0; > > OSM_LOG_EXIT( p_mgr->p_log ); > > } > > > > > > > > BTW what do you think is the best way to detect switch power up? I > > didn't really find a strong requirement for at powerup initialization of > > any suitable component. > > > Peer switch link state change is insufficient to differentiate switch > reboot from "normal" link up/down. There is no IB standard indication of > this. > > > > > > Anyway, kill -HUP should flush out the state and restart from scratch. > > > > Thinking more about it I'm not sure. Similar flush will be required for > > another "stored" components like pkey, sl2vl tables etc.. 
So it is more > > than just "regular" heavy sweep, another signal or option could be used > > for this, but OTOH it becomes very close to OpenSM restarting.. > > > Shouldn't this be automatic rather than requiring the admin to issue a > signal somehow ? > > -- Hal > > > Sasha > > > > > > > > > > > Eitan > > > > > > > -----Original Message----- > > > > From: Sasha Khapyorsky [mailto: sashak at voltaire.com] > > > > Sent: Sunday, July 22, 2007 1:22 PM > > > > To: Eitan Zahavi > > > > Cc: OPENIB; hal.rosenstock at gmail.com ; Yevgeny Kliteynik > > > > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration > > > > > > > > Hi Eitan, > > > > > > > > On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > > > > > Hi Sasha > > > > > > > > > > I am running some tests manually and apparently it looks > > > > like I found > > > > > a bug. Here is the sequence of things: > > > > > 1. SM sweeps the fabric assign LFTs > > > > > 2. I manually modify some LFTs (single entry now marked > > > > UNREACHABLE 3. > > > > > I force some switch change bit to 1 or issue kill -HUP 4. The SM > > > > > reports SUBNET UP 5. The modified LFT entry is still > > > > UNREACHABLE and > > > > > the path is broken > > > > > > > > Right, in most cases (unless OpenSM has its own changes in > > > > the same LFT > > > > block) OpenSM will refer its own LFT image for "need to update" > > > > decision, so _manual_ changes will not trigger new update. > > > > Rerunning OpenSM should help however. > > > > > > > > > It looks to me some optimization of routing does not fully reroute > > > > > > > unless some condition is met - but that condition does not > > > > include the > > > > > above triggers listed in step 3. > > > > > > > > Rereading all fabrics LFTs by default seems to be too > > > > expensive operations. At least by default, if it is real > > > > requirement this could be enforced manually, for example when > > > > kill -HUP is used. Thoughts? 
> > > > > > > > Sasha > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Mon Jul 23 10:59:21 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 23 Jul 2007 20:59:21 +0300 Subject: [ofa-general] RE: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> Hi Sasha, Hal, I think I have an idea: Since this is a specific switch that reported ChangeBit or Trap why can't we just qualify that there was no change in the switch setup? We could send PortInfo, SwitchInfo, LFT, MFT, SL2VL, VLArb, PKey queries and make sure no change from previous state. Or we could simply enforce last state by sending it over again ... Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, July 23, 2007 6:31 PM To: Sasha Khapyorsky Cc: Eitan Zahavi; OPENIB; Yevgeny Kliteynik Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration Hi Sasha, On 7/22/07, Sasha Khapyorsky wrote: On 14:59 Sun 22 Jul , Eitan Zahavi wrote: > Hi Sasha > > Let's assume someone has reset a switch on the fabric. > What would cause the SM to re-assign the LFT of that switch? OpenSM will sweep and drop this switch and when switch will back it will be initialized again. But if the reset was too fast (relative to discovery), we can be in trouble (and maybe not only with LFTs). > I assumed that there is a mechanism to do that. Not for "fast" switch reboot. 
Hmm, I think we could try to detect this by comparing SwitchInfo:LinerFDBTop with current p_sw->max_lid_ho or even by seeing that PortInfo:LID is not set. Not sure about checking PortInfo:LID. Wouldn't that approach need to be qualified by PortState (armed or active) ? LFTTop seems better to me or perhaps a combination of the two but I may be missing something. Something like below: diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 5b2b19e..62c072f 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -112,6 +112,7 @@ typedef struct _osm_switch osm_fwd_tbl_t fwd_tbl; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; + unsigned update_ft; void *priv; } osm_switch_t; /* @@ -152,6 +153,10 @@ typedef struct _osm_switch * during the current fabric sweep. This number is reset * to zero at the start of a sweep. * +* update_ft +* When set fwd tables will be updated regardless to entry +* values locally stored in fwd tables images +* * SEE ALSO * Switch object *********/ diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index adece65..8bbbcac 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -336,6 +336,9 @@ __osm_pi_rcv_process_switch_port( break; } } + else if (port_num == 0 && p_node->sw && + (!p_pi->base_lid || !p_pi->master_sm_base_lid)) + p_node->sw->update_ft = 1; /* Update the PortInfo attribute. 
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index b44a3ba..03516ae 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table( osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ; block_id_ho++ ) { - if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) + if (!p_sw->update_ft && + !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) continue; if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) @@ -850,6 +851,7 @@ osm_ucast_mgr_set_fwd_table( } } + p_sw->update_ft = 0; OSM_LOG_EXIT( p_mgr->p_log ); } BTW what do you think is the best way to detect switch power up? I didn't really find a strong requirement for at powerup initialization of any suitable component. Peer switch link state change is insufficient to differentiate switch reboot from "normal" link up/down. There is no IB standard indication of this. > Anyway, kill -HUP should flush out the state and restart from scratch. Thinking more about it I'm not sure. Similar flush will be required for another "stored" components like pkey, sl2vl tables etc.. So it is more than just "regular" heavy sweep, another signal or option could be used for this, but OTOH it becomes very close to OpenSM restarting.. Shouldn't this be automatic rather than requiring the admin to issue a signal somehow ? -- Hal Sasha > > > Eitan > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto: sashak at voltaire.com] > > Sent: Sunday, July 22, 2007 1:22 PM > > To: Eitan Zahavi > > Cc: OPENIB; hal.rosenstock at gmail.com ; Yevgeny Kliteynik > > Subject: Re: opensm: a bug in heavy sweep? - no LFT re-configuration > > > > Hi Eitan, > > > > On 09:36 Sun 22 Jul , Eitan Zahavi wrote: > > > Hi Sasha > > > > > > I am running some tests manually and apparently it looks > > like I found > > > a bug. Here is the sequence of things: > > > 1. SM sweeps the fabric assign LFTs > > > 2. 
I manually modify some LFTs (single entry now marked
> > UNREACHABLE 3.
> > > I force some switch change bit to 1 or issue kill -HUP 4. The SM
> > > reports SUBNET UP 5. The modified LFT entry is still
> > UNREACHABLE and
> > > the path is broken
> >
> > Right, in most cases (unless OpenSM has its own changes in
> > the same LFT
> > block) OpenSM will refer its own LFT image for "need to update"
> > decision, so _manual_ changes will not trigger new update.
> > Rerunning OpenSM should help however.
> >
> > > It looks to me some optimization of routing does not fully reroute
> > > unless some condition is met - but that condition does not
> > include the
> > > above triggers listed in step 3.
> >
> > Rereading all fabrics LFTs by default seems to be too
> > expensive operations. At least by default, if it is real
> > requirement this could be enforced manually, for example when
> > kill -HUP is used. Thoughts?
> >
> > Sasha
> >

From eitan at mellanox.co.il Mon Jul 23 11:05:23 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 23 Jul 2007 21:05:23 +0300
Subject: [ofa-general] [PATCH resend] opensm/osm_indent:
In-Reply-To: <20070723130912.GV16597@sashak.voltaire.com>
References: <20070722221455.GR27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com> <20070723130912.GV16597@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com>

Hi Sasha,

I read the new coding style doc after this last mail. I thought you only defined new "indentation rules" and I am for doing this step as it is automatic and safe. But rewriting the code with shorter names and replacing all variables and functions seems a little too risky in my mind.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Monday, July 23, 2007 4:09 PM > To: Eitan Zahavi > Cc: general at lists.openfabrics.org; Yevgeny Kliteynik > Subject: Re: [ofa-general] [PATCH resend] opensm/osm_indent: > go closertoopensm-coding-style.txt > > On 10:31 Mon 23 Jul , Eitan Zahavi wrote: > > > > So we will finally have a common enforced coding style! > > When do you plan to run it on all the files? > > In the "spare" time :). I'm thinking about doing this in > steps by subdirectories starting from header files. Also > would be nice to not do huge styling updates during OFED 1.3 cycle. > > > Or should we just make sure every new committed file will > first pass > > this indent? > > This is the good option, however would be nice to not mix > style fixing patches with functional ones (more or the less > as described in opensm/doc/opensm-coding-style.txt). > > Sasha > > > > > Thanks > > > > Eitan > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect Mellanox > Technologies > > LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: general-bounces at lists.openfabrics.org > > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha > > > Khapyorsky > > > Sent: Monday, July 23, 2007 1:15 AM > > > To: general at lists.openfabrics.org > > > Cc: Yevgeny Kliteynik > > > Subject: [ofa-general] [PATCH resend] opensm/osm_indent: > go closer > > > toopensm-coding-style.txt > > > > > > > > > This updates the script according to recent > > > doc/opensm-coding-style.txt (in short K&R, tabs, etc.). 
> > > > > > Signed-off-by: Sasha Khapyorsky > > > --- > > > opensm/opensm/osm_indent | 57 > > > +++------------------------------------------ > > > 1 files changed, 4 insertions(+), 53 deletions(-) > > > > > > diff --git a/opensm/opensm/osm_indent b/opensm/opensm/osm_indent > > > index bed2ba1..621184b 100755 > > > --- a/opensm/opensm/osm_indent > > > +++ b/opensm/opensm/osm_indent > > > @@ -1,6 +1,6 @@ > > > #!/bin/bash > > > # > > > -# Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > > > +# Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > > > # Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights > > > reserved. > > > # Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > > > # > > > @@ -40,56 +40,7 @@ > > > # Environment: > > > # Linux User Mode > > > # > > > -# $Revision: 1.4 $ > > > -# > > > -# > > > -# This is the indent format used for OpenSM. > > > -# > > > -# format the source code according to the ACD standard > > > -# -bad Blank line after declarations > > > -# -bap Blank line after Procedures > > > -# -bbb Blank line before block comments > > > -# -nbbo Break after Boolean operator > > > -# -bl Break after if line > > > -# -bli0 Indent for braces is 0 > > > -# -bls Break after struct declarations > > > -# -cbi0 Case break indent 0 > > > -# -ci3 Continue indent 3 spaces > > > -# -cli0 Case label indent 0 spaces > > > -# -ncs No space after cast operator > > > -# -hnl Honor existing newlines on long lines > > > -# -i3 Substitute indent with 3 spaces > > > -# -npcs No space after procedure calls > > > -# -prs Space after parenthesis > > > -# -nsai No space after if keyword - removed > > > -# -nsaw No space after while keyword - removed > > > -# -sc Put * at left of comments in a block comment style > > > -# -nsob Don't swallow unnecessary blank lines > > > -# -ts3 Tab size is 3 > > > -# -psl Type of procedure return in a separate line > > > -# -bfda Function declaration arguments in a separate line. 
> > > -# -nut No tabs as we allow spaces > > > -# > > > -############################################################# > > > ############ > > > - > > > -# indent the world > > > -for sourcefile in $*; do > > > - if test -f "$sourcefile"; then > > > - # first, string DOS style linefeeds > > > - perl -piW -e's/\x0D//' "$sourcefile" > > > - echo Processing $sourcefile > > > - indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 > > > -ci3 -cli0 -ncs \ > > > - -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl > > > -bfda -nut $sourcefile > > > - > > > - rm ${sourcefile}W > > > +# This is the indent format used for OpenSM (similar to one > > > used in # > > > +linux/scripts/Lindent). > > > > > > - # the -bb also affect the first line in each file - > > > so clean it up > > > - if test `head -1 $sourcefile | egrep -v '^$' | wc > > > -l` = 0; then > > > - echo Cleaning up first empty line of $sourcefile > > > - awk '{if(n){print};n++}' $sourcefile > ${sourcefile}W > > > - mv -f ${sourcefile}W $sourcefile > > > - fi > > > - else > > > - echo Could not find file:$sourcefile > > > - fi > > > -done > > > +indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs "$@" > > > -- > > > 1.5.3.rc2.29.gc4640f > > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > From mshefty at ichips.intel.com Mon Jul 23 11:10:08 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 23 Jul 2007 11:10:08 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A46A1D.6040000@voltaire.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> Message-ID: <46A4EF00.9070305@ichips.intel.com> > What I have in mind is that IPoIB must not use cached IB path info. 
>
> If the IB stack has path caching which is in the default flow of
> requesting a path record, it should provide an API (eg flag to the
> function through which one does path query) to request a non cached path.

Argh! This was the original design. I believe the current design is a better approach. The ULP shouldn't care whether the PR is cached or not - only that it's usable.

> The design I was thinking to suggest for IPoIB is to almost always use
> this API since this policy makes the implementation consistent with the
> decisions made by the network stack neighbour cache

This defeats one of the benefits of caching, which is using a single GetTable query, versus literally hundreds or thousands of Get queries. Consider that constant all-to-all communication using IPoIB between 1024 ports, with a 15 minute ARP table timeout, would hit the SA with close to 600 queries per second.

I agree with Michael that it would be better for a ULP to invalidate cache entries.

While I agree that there's the potential for a problem, given that IPoIB has always cached PRs and no one has reported problems, I think we're overstating the likelihood of issues occurring in practice. Even the SA caches the path data -- getting a PR from the SA doesn't provide any additional guarantees.

- Sean

From mshefty at ichips.intel.com Mon Jul 23 11:38:26 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 23 Jul 2007 11:38:26 -0700
Subject: [ofa-general] Re: I think that there is a resource leak in the core file mad_rmpp.c
In-Reply-To: <46A45A8C.2090800@dev.mellanox.co.il>
References: <46A45A8C.2090800@dev.mellanox.co.il>
Message-ID: <46A4F5A2.2020508@ichips.intel.com>

> I reviewed the file mad_rmpp.c and it seems that there is a leak of the
> Address Handle.
> The AH that is being created in the function "alloc_response_msg" is
> never being destroyed.

The AH is destroyed in ib_rmpp_send_handler().

- Sean

From jgunthorpe at obsidianresearch.com Mon Jul 23 11:41:05 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Mon, 23 Jul 2007 12:41:05 -0600
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A4EF00.9070305@ichips.intel.com>
References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com>
Message-ID: <20070723184105.GB19768@obsidianresearch.com>

On Mon, Jul 23, 2007 at 11:10:08AM -0700, Sean Hefty wrote:
> >The design I was thinking to suggest for IPoIB is to almost always use
> >this API since this policy makes the implementation consistent with the
> >decisions made by the network stack neighbour cache
>
> This defeats one of the benefit of caching, which is using a single
> GetTable query, versus literally hundreds or thousands of Get queries.
> Consider that constant all-to-all communication using IPoIB between 1024
> ports, with a 15 minute ARP table timeout would hit the SA with close to
> 600 queries per second.
>
> I agree with Michael that it would be better for a ULP to invalidate
> cache entries.

Well, in my view, this is exactly the sort of thing you should not do. ULPs have no better idea what is going on. ARP expiry doesn't give you any special information about the cached PR.

If kernel caching is used it must be viewed as authoritative and kept current through some kind of external mechanism. Something like re-doing the big GetTable query prior to starting a job is a fine interim way to do this. Ideally updating the kernel sa cache would also push updated data into the neighbor AH structures as appropriate. Then there is one single source of PR data, one source of IP -> GID mappings, etc.

These problems are just an unavoidable part of trying to use caching - build the mechanism to support coherent replication and just deal with these downsides.
Sean is basically doing non-coherent replication today with his big GetTable query and that sounds like what is speeding things up, not caching indivudal PRs. > overstating the likelihood of issues occurring in practice. Even > the SA Well, any time your renumber your network you will get burned and have to restart ipoib on every node with the way things are today. Something like a SM upgrade or changing to a new vendor SM, or increasing LMC could do this to you. > caches the path data -- getting a PR from the SA doesn't provide any > additional guarantees. Erm, any SA that returns a PR that is invalid in the network outside the time the network is being updated is seriously busted, IMHO. Jason From sean.hefty at intel.com Mon Jul 23 12:39:09 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Jul 2007 12:39:09 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070723184105.GB19768@obsidianresearch.com> Message-ID: <000001c7cd61$24301080$9c98070a@amr.corp.intel.com> >If kernel caching is used it must be viewed as authorative and kept >current through some kind of external mechanism. Something like >re-doing the big GetTable query prior to starting a job is a fine >interm way to do this. Ideally updating the kernel sa cache would also >push updated data into the neighbor AH structures as appropriate. Then >there is one single source of PR data, one source of IP -> GID >mapppings, etc. IB does not define anything to standardize distributing SA data. The proposed solution works with any SM, and supports SA events to the degree that they've been defined. The local SA does not define vendor-specific SA extensions, though such extensions are supported at a high level by manual cache refreshes. Someone could provide finer control to refresh specific cache entries if needed and used. >These problems are just an unavoidable part of trying to use caching - >build the mechanism to support coherent replication and just deal with >these downsides . 
Sean is basically doing non-coherent >replication today with his big GetTable query and that sounds like >what is speeding things up, not caching individual PRs. The caching provides the speedup on the client side. The GetTable provides the scalability on the SA side. Whether we cache PR or PR data in the form of AHs, caching is in use and required by the software today. No one would suggest that IPoIB issue a PR query per packet. I feel that we're trying to come up with the ideal solution at the start. Let's start with what we have today and expand. Currently IPoIB caches PR data, does not share it, and doesn't update it. The local SA collects data more efficiently*, shares the data, and provides ways for updating it. It is refreshed in response to specific events, and scripts could be used to refresh the cache periodically. * If communication is only to a few nodes in a large cluster, then multiple Get queries may be more efficient than using GetTable. The local SA could be expanded to cache query responses, rather than issuing its own in this case. >Well, any time you renumber your network you will get burned and have >to restart ipoib on every node with the way things are >today. Something like an SM upgrade or changing to a new vendor SM, or >increasing LMC could do this to you. The local SA responds to these types of SA changes by refreshing the cache. >Erm, any SA that returns a PR that is invalid in the network outside >the time the network is being updated is seriously busted, IMHO. The SA isn't guaranteed to know all links that are down at the time it returns a PR. There's a delay between when a path becomes unusable, and when the SA detects it. In fact, an end node could detect that a path is unusable before the SA does, which could be the reason for it requesting a new path. The SA cannot sweep the fabric looking for changes before responding to every PR query.
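A rough check of the "close to 600 queries per second" figure quoted earlier in this thread can be sketched in C. The thread does not say how the number was derived; the sketch below assumes one PathRecord Get per unordered pair of ports per ARP timeout period, and the function name is hypothetical.

```c
/* Hypothetical helper: average SA Get rate in queries per second, assuming
 * every unordered pair of ports re-resolves its path once per timeout. */
double sa_get_rate(unsigned long ports, unsigned long timeout_sec)
{
    /* Number of unordered port pairs in an all-to-all pattern. */
    unsigned long pairs = ports * (ports - 1) / 2;

    return (double)pairs / (double)timeout_sec;
}
```

Under that assumption, sa_get_rate(1024, 15 * 60) is 523776 / 900, or roughly 582 queries per second, which is in line with the figure in the thread. A single GetTable per node would replace each node's share of those Gets.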
- Sean From harms at alcf.anl.gov Mon Jul 23 13:05:17 2007 From: harms at alcf.anl.gov (Kevin Harms) Date: Mon, 23 Jul 2007 15:05:17 -0500 Subject: [ofa-general] openibd / srp question Message-ID: is there a reason that starting up the srp_daemon is bound to the SRPHA_ENABLE variable? I would like to propose that either the daemon is started up if the ib_srp module is loaded on boot or a second dependent variable is created that controls the srp_daemon startup.

ofed_1_2/linux-2.6.git/ofed_scripts/openibd : line 844

    ib_srp)
        /sbin/modprobe $mod > /dev/null 2>&1
        if [ "X${SRPHA_ENABLE}" == "Xyes" ]; then
            if [ ! -x /sbin/multipath ]; then
                echo "/sbin/multipath is required to enable SRP HA."
            else
                # Create 91-srp.rules file
                mkdir -p /etc/udev/rules.d
                if [ "$DISTRIB" == "SuSE" ]; then
                    cat > /etc/udev/rules.d/91-srp.rules << EOF
ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
EOF
                fi
                /sbin/modprobe dm_multipath > /dev/null 2>&1
                SRPD_ENABLE=yes
            fi
        fi
        if [ "X${SRPD_ENABLE}" = "Xyes" ]; then
            srp_daemon.sh &
            srp_daemon_pid=$!
            echo ${srp_daemon_pid} > ${srp_daemon_pidfile}
        fi
        ;;

From arthur.jones at qlogic.com Mon Jul 23 13:06:40 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 23 Jul 2007 13:06:40 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070612084108.GK6470@mellanox.co.il> References: <20070612084108.GK6470@mellanox.co.il> Message-ID: <20070723200640.GA13117@bauxite.pathscale.com> hi michael, ... On Tue, Jun 12, 2007 at 11:41:08AM +0300, Michael S. Tsirkin wrote: > For whom it may concern, > I have created an ofed git tree updated with kernel bits from 2.6.22-rc4, > and put that up at git://git.openfabrics.org/~mst/ofed_kernel.git > [...] > In particular, there were a ton of ipath patches that it seems were > for the most part applied. > Qlogic maintainers, please help double check that I did not miss something > of value.
thanks for setting this up, i'm still looking at the diffs to make sure things got setup correctly for the ipath stuff... i have found it difficult to navigate the source having to run: ./ofed_scripts/configure --kernel-version=2.6.xxx --without-quilt everytime to check against our tree. so, rather than spending the better part of the afternoon running these scripts by hand, i created a shell script to populate a bunch of branches with the backports in each branch. at qlogic we now keep the backports as branches in our git tree and this, i find, is much easier to handle. because: * viewing and navigating backport source becomes _much_ easier. * merges are easier -- patches are much more fragile than branches. * comparisons are easier -- checking for differences between backports and between a backport and the canonical source is faster and more convenient... * changesets are readable. trying to decipher diffs to patches is medically proven to take months, if not years, off your life. anyway, what do you think? is there anyway i could convince you to dump the backport patches and put all the backports in branches? i'm willing to do the legwork if you see value... arthur From sweitzen at cisco.com Mon Jul 23 14:27:56 2007 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 23 Jul 2007 14:27:56 -0700 Subject: [ofa-general] created version "1.3" in bugzilla Message-ID: This allows me to REOPEN some RESOLVED LATER bugs from 1.2. Scott -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ardavis at ichips.intel.com Mon Jul 23 16:17:00 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 23 Jul 2007 16:17:00 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46968448.2000401@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> Message-ID: <46A536EC.4060201@ichips.intel.com> Maintainers: please review the following proposal regarding new public download locations/website links and respond. This request originated from xwg. http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html Thanks. > Arlin Davis wrote: > >> The proposal was attempting to come up with a method to automatically >> link to a package and description file from the download webpage. I >> have no problem >> targeting http://openfabrics.org/downloads as long as we come up with >> a way for the webpage to correlate a description with a package >> without hand coding the links everytime. We need to come up with a >> method for automatic links to keep our download webpage updated and >> complete. >> >> What if we add a directory for each project under downloads and >> provide a README for a description? Other suggestions? 
>> > Here is a stab at what we have today for discussion purposes: > > Linux Libraries: > - libibverbs -http://www.openfabrics.org/downloads/ - > librdmacm - http://www.openfabrics.org/~shefty/ > - dapl - http://www.openfabrics.org/~ardavis/ > - management -http://www.openfabrics.org/~halr/ OFED Linux: > - OFED 1.2 release - > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2.tgz > - OFED 1.2 binary RPMs for SLES 9.0, SLES 10 SP1, RHEL 4.0 U5 and > RHEL 5.0 > > http://www.openfabrics.org/builds/ofed-1.2/release/OFED-1.2-RPMS/ > - OFED connectx release - > _http://www.openfabrics.org/builds/connectx/release/_ > OFED Linux Archives: > - SLES 10 OFED 1.0 RPMS - http://www.openfabrics.org/downloads/ > - OFED 1.1 release - > https://svn.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/releases/ > - OFED 1.0 release - > https://svn.openfabrics.org/svn/openib/gen2/branches/1.0/ofed/releases/ > WinOF for windows: > WinOF 1.0 release - http://www.oprnfabrics.org/~ardavis/WinOF > 1.0/WinOF_1-0.zip > WinOF source - svn://openib.tc.cornell.edu > WinOF faq - > https://wiki.openfabrics.org/tiki-index.php?page=OpenIB+Windows > > I would like to propose adding project directories under > http://www.openfabrics.org/downloads/ where appropriate and give > maintainers access. For example: > > http://www.openfabrics.org/downloads/verbs (rdreier) > http://www.openfabrics.org/downloads/rdmacm (shefty) > http://www.openfabrics.org/downloads/dapl (ardavis) > http://www.openfabrics.org/downloads/management (sashak) > http://www.openfabrics.org/downloads/OFED (vlad) > http://www.openfabrics.org/downloads/WinOF (ardavis) > http://www.openfabrics.org/downloads/archives (vlad) ?? > etc... > > Each of these would contain a README that details the contents of the > directory along with WEB_README that provides a short description for > the webpage. 
Jeff could then automatically parse for directories under > downloads and if it contains WEB_README add a webpage link to the > directory along with the short description. > > Jeff, is this possible? > > comments? > > -arlin > > > > > > > > > > > From sean.hefty at intel.com Mon Jul 23 16:32:31 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 23 Jul 2007 16:32:31 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46A536EC.4060201@ichips.intel.com> Message-ID: <000901c7cd81$be2ffe00$9c98070a@amr.corp.intel.com> >> http://www.openfabrics.org/downloads/verbs (rdreier) >> http://www.openfabrics.org/downloads/rdmacm (shefty) >> http://www.openfabrics.org/downloads/dapl (ardavis) >> http://www.openfabrics.org/downloads/management (sashak) >> http://www.openfabrics.org/downloads/OFED (vlad) >> http://www.openfabrics.org/downloads/WinOF (ardavis) >> http://www.openfabrics.org/downloads/archives (vlad) ?? >> etc... These seem fine to me. We will need a place for the hardware specific libraries. (.../downloads/hw/xxx ?) Having the web page automatically update would be nice. - Sean
From mshefty at ichips.intel.com Mon Jul 23 17:22:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 23 Jul 2007 17:22:49 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> Message-ID: <46A54659.8010608@ichips.intel.com> > 2.5. ULPs that use CM interface (like SRP) should have their own > pre-assigned Service-ID and use it while obtaining PR/MPR for > establishing connections. The SA receiving the PR/MPR should match it > against the policy and return the appropriate PR/MPR including SL, > MTU and RATE. We need to ensure that this can work without pre-assigned service IDs, or at least service IDs that are assigned within a fairly wide range, such as locally assigned IDs. > 2.6. ULPs and programs using CMA to establish RC connection should > provide the CMA the target IP and Service-ID. Some of the ULPs might > also provide QoS-Class (E.g. for SDP sockets that are provided the > TOS socket option). The CMA should then use the provided Service-ID > and optional QoS-Class and pass them in the PR/MPR request. The > resulting PR/MPR should be used for configuring the connection QP. The interface to the CMA needs to remain as transport independent as possible, and I am unsure of the transport independence of tying QoS to the destination port number. (I'm not disagreeing; I'm just not sure at the moment it's the right approach.) > PathRecord and MultiPathRecord enhancement for QoS: As mentioned > above the PathRecord and MultiPathRecord attributes should be > enhanced to carry the Service-ID which is a 64bit value, which has > been standardized by the IBTA. A new field QoS-Class is also > provided. A new capability bit should describe the SM QoS support in > the SA class port info. This approach provides an easy migration path > for existing access layer and ULPs by not introducing new set of > PR/MPR attribute. Has any thought been given to how to make this scale?
> 5. CMA features ---------------- > > The CMA interface supports Service-ID through the notion of port > space as a prefix to the port_num which is part of the sockaddr > provided to rdma_resolve_addr(). What is missing is the explicit > request for a QoS-Class that should allow the ULP (like SDP) to > propagate a specific request for a class of service. A mechanism for > providing the QoS-Class is available in the IPv6 address, so we could > use that address field. Another option is to implement a special > connection options API for CMA. > > Missing functionality by CMA is the usage of the provided QoS-Class > and Service-ID in the sent PR/MPR. When a response is obtained it is > an existing requirement for the CMA to use the PR/MPR from the > response in setting up the QP address vector. The most natural function to specify additional QoS parameters would be rdma_resolve_route. - Sean From sashak at voltaire.com Mon Jul 23 17:33:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 03:33:02 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> Message-ID: <20070724003302.GC11674@sashak.voltaire.com> Hi Hal, On 11:33 Mon 23 Jul , Hal Rosenstock wrote: > > > > > > Not for "fast" switch reboot. > > > > > > Hmm, I think we could try to detect this by comparing > > > SwitchInfo:LinearFDBTop with current p_sw->max_lid_ho or even by seeing > > > that PortInfo:LID is not set. > > > > > > Not sure about checking PortInfo:LID. Wouldn't that approach need to be > > qualified by PortState (armed or active) ? LFTTop seems better to me or > > perhaps a combination of the two but I may be missing something.
> > > > Another thought on this :-( > > Not sure that resetting either LID or LFTTop is required by the spec either > so this is relying on "beyond the spec" behavior and may not be true for all > switch implementations. Yes, it is similar to my findings. Suggested in your previous email PortState check seems only reliable reboot detection criteria (for ports and switches). Sasha From sashak at voltaire.com Mon Jul 23 17:51:53 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 03:51:53 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> Message-ID: <20070724005153.GD11674@sashak.voltaire.com> Hi Eitan, On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > Hi Sasha, Hal, > > I think I have an idea: > > Since this is a specific switch that reported ChangeBit or Trap why > can't we just qualify that there was no change in the switch setup? The ChangeBit seems to be good start point - then OpenSM will query all switch ports PortInfo anyway and if for all ports PortState is <= INIT (and at least for one port it is = INIT), it means that this switch was rebooted/reinitialized. And for single port PortState drop to = INIT should indicate reinitialization. Seems correct? > We could send PortInfo, SwitchInfo, SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is set. Guess we are ok with it even now. > LFT, MFT, SL2VL, VLArb, PKey queries > and make sure no change from previous state. Or we could simply enforce > last state by sending it over again ... 
I think we could want to re-read PKey tables in order to preserve existing PKey indices and just to flush (overwrite with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable? Sasha From sashak at voltaire.com Mon Jul 23 18:08:38 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 04:08:38 +0300 Subject: [ofa-general] [PATCH resend] opensm/osm_indent: In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com> References: <20070722221455.GR27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5EF1@mtlexch01.mtl.com> <20070723130912.GV16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61C2@mtlexch01.mtl.com> Message-ID: <20070724010838.GE11674@sashak.voltaire.com> On 21:05 Mon 23 Jul , Eitan Zahavi wrote: > Hi Sasha, > > I read the new coding style doc after this last mail. It was under RFC subject on the list couple of months ago... > I thought you only defined new "indentation rules" and I am for doing > this step as it is automatic and safe. > But rewriting the code with shorter names and replacing all variables > and functions seems a little too risky in my mind. Yes, the script enforces "indentation rules" only. I didn't think we will be able to deal with rest stuff shortly (<= OFED 1.3). So currently it is (1) OpenSM style definition and (2) recommendations for new code/files style. Sasha From sashak at voltaire.com Mon Jul 23 18:16:31 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 04:16:31 +0300 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46A536EC.4060201@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A536EC.4060201@ichips.intel.com> Message-ID: <20070724011631.GG11674@sashak.voltaire.com> On 16:17 Mon 23 Jul , Arlin Davis wrote: > > > > I would like to propose adding project directories under > > http://www.openfabrics.org/downloads/ where appropriate and give > > maintainers access. 
For example: > > > > http://www.openfabrics.org/downloads/verbs (rdreier) > > http://www.openfabrics.org/downloads/rdmacm (shefty) > > http://www.openfabrics.org/downloads/dapl (ardavis) > > http://www.openfabrics.org/downloads/management (sashak) > > http://www.openfabrics.org/downloads/OFED (vlad) > > http://www.openfabrics.org/downloads/WinOF (ardavis) > > http://www.openfabrics.org/downloads/archives (vlad) ?? > > etc... > > > > Each of these would contain a README that details the contents of the > > directory along with WEB_README that provides a short description for the > > webpage. Jeff could then automatically parse for directories under > > downloads and if it contains WEB_README add a webpage link to the directory > > along with the short description. Looks fine for me. Sasha From sashak at voltaire.com Mon Jul 23 18:33:06 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 04:33:06 +0300 Subject: [ofa-general] Command specification of ca_name and ca_port In-Reply-To: <46A4C0C7.7020107@systemfabricworks.com> References: <46A4C0C7.7020107@systemfabricworks.com> Message-ID: <20070724013306.GH11674@sashak.voltaire.com> Hi David, On 09:52 Mon 23 Jul , David McMillen wrote: > > There are a standard set of command line options that allow specification of > the CA to use for sending the requests. I'm adding these to programs that > don't have them, since they are very useful when diagnosing a node connected > to multiple subnets. Even if you discount multiple subnets on purpose, > sometimes this happens when the hardware connecting all of the CA ports to > the same place gets broken, and that is when you need diagnostics that can > help figure out what is where. > > The standard options are: > > -C use the specified ca_name. > > -P use the specified ca_port. > > -t override the default timeout for the solicited mads. > > My problem is that saquery already uses -C and -P, although the -t exists > for the expected purpose. 
Also, ibcheckerrs already uses -t for specifying > the threshold file. I think unified command line options over diags are good thing, so I guess reasonable renaming should be acceptable. > > Changing the timeout for ibcheckerrs isn't critical, but not being able to > do it doesn't seem right. However, the saquery command could be really > handy for figuring out split fabrics, and is useful to those of us that > connect to multiple subnets. > > Does anybody have a useful suggestion? '-T' for the threshold file? But it is easy part - saquery renames are less intuitive :(. Probably just lower case? Or special query option (-q or -Q), so queries could be specified as -qP, -qC? Sasha From mst at dev.mellanox.co.il Mon Jul 23 20:03:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 06:03:41 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070723200640.GA13117@bauxite.pathscale.com> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> Message-ID: <20070724030318.GA7589@mellanox.co.il> >Quoting Arthur Jones : >Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > >hi michael, ... > >On Tue, Jun 12, 2007 at 11:41:08AM +0300, Michael S. Tsirkin wrote: >> For whom it may concern, >> I have created an ofed git tree updated with kernel bits from 2.6.22-rc4, >> and put that up at git://git.openfabrics.org/~mst/ofed_kernel.git >> [...] >> In particular, there were a ton of ipath patches that it seems were >> for the most part applied. >> Qlogic maintainers, please help double check that I did not miss something >> of value. > >thanks for setting this up, i'm still looking >at the diffs to make sure things got setup >correctly for the ipath stuff... > >i have found it difficult to navigate the >source having to run: > >./ofed_scripts/configure --kernel-version=2.6.xxx --without-quilt > >everytime to check against our tree. 
so, rather >than spending the better part of the afternoon >running these scripts by hand, i created a shell >script to populate a bunch of branches with the >backports in each branch. > >at qlogic we now keep the backports as branches in >our git tree and this, i find, is much easier to >handle. because: > >* viewing and navigating backport source becomes > _much_ easier. >* merges are easier -- patches are much more fragile > than branches. >* comparisons are easier -- checking for differences > between backports and between a backport and the > canonical source is faster and more convenient... >* changesets are readable. trying to decipher diffs > to patches is medically proven to take months, if not > years, off your life. Sigh. I wish it were possible to do everything through addons tricks. I see the advantages of the "bush of branches" - for example it's possible to add a backport patch to a recent kernel, and then merge this into other kernel branches. But I also see a serious problem with addressing: basically git tracks content. It's not designed to track a bush of branches taken together. For example, take tagging: tag namespace is global, so you can not have the same tag point at multiple branches at the same time. >anyway, what do you think? is there anyway i could >convince you to dump the backport patches and put >all the backports in branches? i'm willing to do the >legwork if you see value... Can you publish the scripts and/or the tree? I think we can start by just running the scripts nightly, making it possible for people to view backport history with gitview. -- MST From krkumar2 at in.ibm.com Mon Jul 23 20:44:46 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 24 Jul 2007 09:14:46 +0530 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: <1185193921.26013.37.camel@localhost> Message-ID: Hi Jamal, J Hadi Salim wrote on 07/23/2007 06:02:01 PM: > Yes, and these results were sent to you as well a while back. 
> When i get the time when i get back i will look em up in my test machine > and resend. Actually you have not sent netperf results with prep and without prep. > > No. I see value only in non-LLTX drivers which also gets the same TX lock > > in the RX path. > > So _which_ non-LLTX driver doesnt do that? ;-> I have no idea since I haven't looked at all drivers. Can you tell which all non-LLTX drivers does that ? I stated this as the sole criterea. > tun driver doesnt use it either - but i doubt that makes it "bloat" Adding extra code that is currently not usable (esp from a submission point) is bloat. > You waltz in, have the luxury of looking at my code, presentations, many > discussions with me etc ... "luxury" ? I had implemented the entire thing even before knowing that you are working on something similar! and I had sent the first proposal to netdev, *after* which you told that you have your own code and presentations (which I had never seen earlier - I joined netdev a few months back, earlier I was working on RDMA, Infiniband as you know). And it didn't give me any great ideas either, remember I had posted results for E1000 at the time of sending the proposals. However I do give credit in my proposal to you for what ideas that your provided (without actual code), and the same I did for other people who did the same, like Dave, Sridhar. BTW, you too had discussions with me, and I sent some patches to improve your code too, so it looks like a two way street to me (and that is how open source works and should). > When i ask for differences to code you produced, they now seem to sum up > to the two below. You dont think theres some honest issue with this > picture? Two changes ? That's it ? I gave a big list of changes between our implementations but you twist my words to conclude there is just two (by conveniently labelling everything else "cosmetic", or "potentially useful"!)! 
Even my restart routine used a single API from the first day, I would never imagine using multiple API's. Our codes probably doesn't have even one line that look remotely similar! To clarify : I suggested that you could send patches for the two *missing* items if you can show they add value (and not the rest, as I consider those will not improve the code/logic/algo). > > ("lacking in frankness, candor, or sincerity; falsely or hypocritically > > ingenuous; insincere") ???? Sorry, no response to personal comments and > > have a flame-war :) > > Give me a better description. Sorry, no personal comments. Infact I will avoid responding to baits and innuendoes from now on. Thanks, - KK From donour at cs.unm.edu Mon Jul 23 21:20:17 2007 From: donour at cs.unm.edu (Donour Sizemore) Date: Mon, 23 Jul 2007 22:20:17 -0600 Subject: [ofa-general] correct buffer init for multiple receives Message-ID: <46A57E01.6080109@cs.unm.edu> Hi everybody. I'm having a bit of trouble setting up multiple receive buffers for verbs. I'm using the ud pingpong example in ofed1.2 as an outline, but that example posts the same buffer for all receives. I'm trying to do something like:

--
for(i=0; i < IB_RXDEPTH; i++){
    posix_memalign((void**)&(conn->bufs[i]),1024, (IB_MTU + 40));
    memset(conn->bufs, 0, (IB_MTU+40));
}

conn->pd = ibv_alloc_pd(conn->context);
for(i=0; i < nbufs; i++)
    conn->mr = ibv_reg_mr(conn->pd, (conn->bufs[i]), (IB_MTU+40),
                          IBV_ACCESS_LOCAL_WRITE);
--

Then I'm trying to do a bunch of ibv_post_recv()'s with each buf[i] as the address in the ibv_sge. Is this what I should be doing? It seems to be causing a big mess, corrupting memory, and giving unrepeatable results. thanks, Donour Sizemore University of New Mexico From eitan at mellanox.co.il Mon Jul 23 21:56:31 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 24 Jul 2007 07:56:31 +0300 Subject: [ofa-general] RE: opensm: a bug in heavy sweep?
- no LFT re-configuration In-Reply-To: <20070724005153.GD11674@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > Hi Sasha, Hal, > > > > I think I have an idea: > > > > Since this is a specific switch that reported ChangeBit or Trap why > > can't we just qualify that there was no change in the switch setup? > > The ChangeBit seems to be good start point - then OpenSM will > query all switch ports PortInfo anyway and if for all ports > PortState is <= INIT (and at least for one port it is = > INIT), it means that this switch was rebooted/reinitialized. > > And for single port PortState drop to = INIT should indicate > reinitialization. > > Seems correct? Yes. > > > We could send PortInfo, SwitchInfo, > > SwitchInfo is queried at each light sweep, PortInfo's if > ChangeBit is set. Guess we are ok with it even now. I will double check that... Well - even setting one port state to INIT did not cause the switch to be reconfigured. Seems the code does not enforce this condition yet. > > > LFT, MFT, SL2VL, VLArb, PKey queries > > and make sure no change from previous state. Or we could simply > > enforce last state by sending it over again ... > > I think we could want to re-read PKey tables in order to > preserve existing PKey indices and just to flush (overwrite > with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable? Correct. 
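The reboot criterion agreed on above (every port at or below INIT, and at least one port exactly at INIT) can be sketched in C as follows. This is an illustration only, not OpenSM code: the enum and function names are hypothetical, and the numeric values assume the usual PortInfo PortState encoding (1 = Down, 2 = Initialize, 3 = Armed, 4 = Active).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed PortInfo:PortState encoding; names are hypothetical. */
enum port_state { PS_DOWN = 1, PS_INIT = 2, PS_ARMED = 3, PS_ACTIVE = 4 };

/* Returns true when no port is above INIT and at least one port is
 * exactly INIT, i.e. the switch looks rebooted or reinitialized. */
bool switch_looks_rebooted(const enum port_state *ports, size_t n)
{
    bool saw_init = false;

    for (size_t i = 0; i < n; i++) {
        if (ports[i] > PS_INIT)
            return false;       /* an ARMED or ACTIVE port rules it out */
        if (ports[i] == PS_INIT)
            saw_init = true;
    }
    return saw_init;
}
```

A switch whose ports are all Down is deliberately not reported, since the thread asks for at least one port in INIT.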
> > Sasha
>

From dotanb at dev.mellanox.co.il Mon Jul 23 23:35:46 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 24 Jul 2007 09:35:46 +0300
Subject: [ofa-general] correct buffer init for multiple receives
In-Reply-To: <46A57E01.6080109@cs.unm.edu>
References: <46A57E01.6080109@cs.unm.edu>
Message-ID: <46A59DC2.10608@dev.mellanox.co.il>

Hi.

Donour Sizemore wrote:
> Hi everybody.
>
> I'm having a bit of trouble setting up multiple receive buffers for
> verbs. I'm using the ud pingpong example in ofed1.2 as an outline, but
> that example posts the same buffer for all receives.
>
> I'm trying to do something like:
>
> --
> for(i=0; i < IB_RXDEPTH; i++){
>     posix_memalign((void**)&(conn->bufs[i]),1024, (IB_MTU + 40));
>     memset(conn->bufs, 0, (IB_MTU+40));
> }

Which value are you using for IB_MTU?
(the maximum supported IB MTU value is 4K)

>
> conn->pd = ibv_alloc_pd(conn->context);
> for(i=0; i < nbufs; i++)
>     conn->mr = ibv_reg_mr(conn->pd, (conn->bufs[i]), (IB_MTU+40),
>                           IBV_ACCESS_LOCAL_WRITE);

EVERY memory registration gives you a Memory Region handle (with a different lkey+rkey, for this reason).
In this example you should have an array which will be filled with nbufs MRs,
or
you can create one big buffer and handle the memory alignment yourself when posting the WR (if you wish).

> --
>
> Then I'm trying to do a bunch of ibv_post_recv()'s with each buf[i] as
> the address in the ibv_sge.

No, because you will use the lkey of the last MR that you created with the memory addresses of different buffers.

>
> Is this what I should be doing? It seems to be causing a big mess,
> corrupting memory, and giving unrepeatable results.

When you post a send/recv request for any memory buffer that you registered, you need to use the lkey of that buffer's own region.

I hope that this helped you...
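
A minimal sketch of the first option described above: one memory region per receive buffer, with each scatter/gather entry carrying the lkey of its own buffer. To keep it runnable without an HCA or libibverbs, fake_mr, fake_sge and fake_reg_mr() are hypothetical stand-ins for struct ibv_mr, struct ibv_sge and ibv_reg_mr(); RX_DEPTH and BUF_SIZE are assumed values, not taken from the thread. It also zeroes bufs[i] rather than the bufs pointer array, which the memset in the quoted loop appears to overwrite.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RX_DEPTH 4             /* assumed receive queue depth */
#define BUF_SIZE (2048 + 40)   /* assumed payload size + 40 bytes for the GRH on UD */

/* Hypothetical stand-in for struct ibv_mr (only the fields used here). */
struct fake_mr { void *addr; size_t length; uint32_t lkey; };

/* Hypothetical stand-in for struct ibv_sge. */
struct fake_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

struct rx_ring {
    void *bufs[RX_DEPTH];          /* one buffer per posted receive */
    struct fake_mr mrs[RX_DEPTH];  /* one registration per buffer, NOT a single mr */
};

/* Stand-in for ibv_reg_mr(): every registration gets its own lkey. */
static struct fake_mr fake_reg_mr(void *addr, size_t length)
{
    static uint32_t next_lkey = 0x1000;
    struct fake_mr mr = { addr, length, next_lkey++ };
    return mr;
}

static void rx_ring_destroy(struct rx_ring *ring)
{
    for (int i = 0; i < RX_DEPTH; i++) {
        free(ring->bufs[i]);       /* free(NULL) is a no-op */
        ring->bufs[i] = NULL;
    }
}

static int rx_ring_init(struct rx_ring *ring)
{
    memset(ring, 0, sizeof(*ring));
    for (int i = 0; i < RX_DEPTH; i++) {
        if (posix_memalign(&ring->bufs[i], 1024, BUF_SIZE) != 0) {
            ring->bufs[i] = NULL;
            rx_ring_destroy(ring);
            return -1;
        }
        memset(ring->bufs[i], 0, BUF_SIZE);   /* zero the buffer, not the pointer array */
        ring->mrs[i] = fake_reg_mr(ring->bufs[i], BUF_SIZE);
    }
    return 0;
}

/* Build the scatter/gather entry for buffer i using that buffer's own lkey. */
static struct fake_sge rx_ring_sge(const struct rx_ring *ring, int i)
{
    struct fake_sge sge = {
        .addr   = (uint64_t)(uintptr_t)ring->bufs[i],
        .length = (uint32_t)ring->mrs[i].length,
        .lkey   = ring->mrs[i].lkey,
    };
    return sge;
}
```

In real code each fake_reg_mr() call would be ibv_reg_mr(pd, buf, BUF_SIZE, IBV_ACCESS_LOCAL_WRITE), and the entry built by rx_ring_sge() would be copied into the struct ibv_sge referenced by the struct ibv_recv_wr handed to ibv_post_recv().
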
Dotan

From ogerlitz at voltaire.com Tue Jul 24 00:50:47 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 24 Jul 2007 10:50:47 +0300
Subject: [ofa-general] PATCH] IB/ipoib: ignore membership bit when looking for a P_Key in the table
In-Reply-To:
References: <46A36E77.5020307@gmail.com> <46A4BE3E.4080606@gmail.com>
Message-ID: <46A5AF57.6040702@voltaire.com>

Hal Rosenstock wrote:
> On 7/23/07, *Moni Shoua* Hal Rosenstock wrote:
> >
> > - if (pkey == tmp_pkey) {
> > + if ((pkey & 0x7fff) == (tmp_pkey & 0x7fff)) {
> >
> >
> > Wouldn't this allow 2 limited PKeys to match though ?
> Hi Hal,
> Can you please explain what do you mean? Perhaps by example?
> Two Pkeys which have their full membership bit off (0x8000). Two
> limited members are not allowed to talk with each other.

Hal, ib_find_pkey() is the buddy of ib_find_cached_pkey(), which has been in the stack from day one. Now, ib_find_cached_pkey does some abstraction where it masks out the membership bit, so pkeys are matched in 15-bit fashion.

Indeed, the overall design of the IB stack with respect to partial membership in a partition is neither perfect nor final. I don't see why this masking makes things worse than they would have been without it.

As you know, since some changes need to be made in the IB spec and the IPoIB RFC, I am personally holding off on suggesting changes/fixes until the spec is done; this is per the approach expressed by you and Sean.

Or.
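
The membership-bit question in this thread can be made concrete with a small sketch. It is not the kernel's ib_find_pkey() or ib_find_cached_pkey(); the helper names below (pkey_same_partition, pkey_may_communicate, pkey_table_find) are hypothetical and only illustrate the 15-bit match from the patch next to the rule Hal cites, that two limited members of a partition may not talk to each other. It assumes the usual P_Key layout: bit 15 is the membership bit (set = full member) and the low 15 bits identify the partition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u  /* bit 15: set = full member, clear = limited */
#define PKEY_BASE_MASK   0x7fffu  /* low 15 bits: partition number */

/* 15-bit comparison, as in the patch: the membership bit is ignored. */
static bool pkey_same_partition(uint16_t a, uint16_t b)
{
    return (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK);
}

static bool pkey_is_full(uint16_t pkey)
{
    return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Same partition AND at least one side is a full member. */
static bool pkey_may_communicate(uint16_t a, uint16_t b)
{
    return pkey_same_partition(a, b) && (pkey_is_full(a) || pkey_is_full(b));
}

/* Hypothetical table lookup that ignores the membership bit.
 * Returns the first matching index, or -1 if the partition is absent. */
static int pkey_table_find(const uint16_t *table, size_t len, uint16_t pkey)
{
    assert(table != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (pkey_same_partition(table[i], pkey))
            return (int)i;
    return -1;
}
```

With a table of {0xffff, 0x8001, 0x0002}, looking up 0x7fff finds index 0 because the lookup ignores bit 15. pkey_may_communicate(0x0002, 0x0002) is false: that is the limited-to-limited case Hal raises, which a masked table lookup alone does not reject.
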
From vlad at lists.openfabrics.org Tue Jul 24 02:02:34 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 24 Jul 2007 02:02:34 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070724-0100 daily build status Message-ID: <20070724090234.335D0E60821@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on 
ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Failed: From ogerlitz at voltaire.com Tue Jul 24 02:39:50 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 24 Jul 2007 12:39:50 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A4EF00.9070305@ichips.intel.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> Message-ID: <46A5C8E6.5020906@voltaire.com> Sean Hefty wrote: >> What I have in mind is that IPoIB must not use cached IB path info. >> If the IB stack has path caching which is in the default flow of >> requesting a path record, it should provide an API (eg flag to the >> function through which one does path query) to request a non cached path. > Argh! This was the original design. I believe the current design is a > better approach. The ULP shouldn't care whether the PR is cached or not > - only that it's usable. Linux has a quite sophisticated mechanism to maintain / cache / probe / invalidate / update the network stack L2 neighbour info.
Stating that although the neighbour cache state machine decided to update/delete a neighbour it is just correct by design for IPoIB to use cached IB L2 info is somehow moving too fast I think, some discussion is needed here. My basic thought is that for IPoIB its better to never use cached path then to always use cached path. But! maybe there's a way in the middle here, lets think. This is what I was referring to when saying "almost always". For example, in the Voltaire gen1 stack we had an ib arp module which was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). This module managed some sort of path cache, were IPoIB was always asking for non-cached path and other ULPs were willing to get cached path. >> The design I was thinking to suggest for IPoIB is to almost always use >> this API since this policy makes the implementation consistent with >> the decisions made by the network stack neighbour cache > This defeats one of the benefit of caching, which is using a single > GetTable query, versus literally hundreds or thousands of Get queries. > Consider that constant all-to-all communication using IPoIB between 1024 > ports, with a 15 minute ARP table timeout would hit the SA with close to > 600 queries per second. If the cache comes to serve all-to-all MPI jobs and practically with IB, to get MPI performance (specifically latency) people would --not-- be using IPoIB for their MPI jobs since they want kernel AND net-stack bypass, it does make sense to use non-cached path in IPoIB if we agree that design-wise its the the correct approach. > While I agree that there's the potential for a problem, given that IPoIB > has always cached PRs and no one has reported problems, I think we're > overstating the likelihood of issues occurring in practice. Even the SA > caches the path data -- getting a PR from the SA doesn't provide any > additional guarantees. I am not with you... 
I would expect an SA implementation to invalid / recompute the relevant data structures associated with each change in the fabric and get a trap for each change. Or. From vlad at lists.openfabrics.org Tue Jul 24 02:43:32 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 24 Jul 2007 02:43:32 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070724-0200 daily build status Message-ID: <20070724094332.4E8D3E60857@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 
Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From hal.rosenstock at gmail.com Tue Jul 24 04:08:28 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 24 Jul 2007 07:08:28 -0400 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <20070724005153.GD11674@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> Message-ID: On 7/23/07, Sasha Khapyorsky wrote: > > Hi Eitan, > > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > Hi Sasha, Hal, > > > > I think I have an idea: > > > > Since this is a specific switch that reported ChangeBit or Trap why > > can't we just qualify that there was no change in the switch setup? > > The ChangeBit seems to be good start point - then OpenSM will query all > switch ports PortInfo anyway and if for all ports PortState is <= INIT > (and at least for one port it is = INIT), it means that this switch was > rebooted/reinitialized. > > And for single port PortState drop to = INIT should indicate > reinitialization. 
> > Seems correct? Wouldn't this be all ports in INIT indicate reset of switch ? -- Hal > We could send PortInfo, SwitchInfo, > > SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is > set. Guess we are ok with it even now. > > > LFT, MFT, SL2VL, VLArb, PKey queries > > and make sure no change from previous state. Or we could simply enforce > > last state by sending it over again ... > > I think we could want to re-read PKey tables in order to preserve > existing PKey indices and just to flush (overwrite with new settings) > LFT, MFT, SL2VL, VLArb tables. Reasonable? > > Sasha > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dotanb at dev.mellanox.co.il Tue Jul 24 04:32:14 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 24 Jul 2007 14:32:14 +0300 Subject: [ofa-general] [PATCH] libibmad: Fixed a name of a field in SwitchInfo to the right name Message-ID: <200707241432.14848.dotanb@dev.mellanox.co.il> Fixed a name of a field in SwitchInfo to the right name. 
Signed-off-by: Dotan Barak
---
Index: connectx_user/src/userspace/management/libibmad/src/fields.c
===================================================================
--- connectx_user.orig/src/userspace/management/libibmad/src/fields.c 2007-07-22 16:34:02.000000000 +0300
+++ connectx_user/src/userspace/management/libibmad/src/fields.c 2007-07-24 13:58:41.000000000 +0300
@@ -193,7 +193,7 @@ ib_field_t ib_mad_f [] = {
     [IB_SW_PARTITION_ENF_INB_F] {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint},
     [IB_SW_PARTITION_ENF_OUTB_F] {BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint},
     [IB_SW_FILTER_RAW_INB_F] {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint},
-    [IB_SW_FILTER_RAW_OUTB_F] {BITSOFFS(131, 1), "FilterRawInbound", mad_dump_uint},
+    [IB_SW_FILTER_RAW_OUTB_F] {BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint},
     [IB_SW_ENHANCED_PORT0_F] {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint},
 
     /*

From glebn at voltaire.com Tue Jul 24 05:14:40 2007
From: glebn at voltaire.com (Gleb Natapov)
Date: Tue, 24 Jul 2007 15:14:40 +0300
Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4
Message-ID: <20070724121440.GA2775@minantech.com>

Hi,

There is a bug in mlx4_post_send(). Data that is sent inline and consists of multiple small sges isn't copied properly into the wqe. The following patch fixes it for me.

Signed-off-by: Gleb Natapov

diff --git a/src/qp.c b/src/qp.c
index 66ee309..83a4fd4 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -288,6 +288,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
     memcpy(wqe, addr, len);
     wqe += len;
     seg_len += len;
+    off += len;
     }
 
     if (seg_len) {
--
Gleb.

From hal.rosenstock at gmail.com Tue Jul 24 06:22:19 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 09:22:19 -0400
Subject: [ofa-general] Re: opensm: a bug in heavy sweep?
- no LFT re-configuration In-Reply-To: References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> Message-ID: On 7/24/07, Hal Rosenstock wrote: > > > > On 7/23/07, Sasha Khapyorsky wrote: > > > > Hi Eitan, > > > > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > > Hi Sasha, Hal, > > > > > > I think I have an idea: > > > > > > Since this is a specific switch that reported ChangeBit or Trap why > > > can't we just qualify that there was no change in the switch setup? > > > > The ChangeBit seems to be good start point - then OpenSM will query all > > switch ports PortInfo anyway and if for all ports PortState is <= INIT > > (and at least for one port it is = INIT), it means that this switch was > > rebooted/reinitialized. > > > > And for single port PortState drop to = INIT should indicate > > reinitialization. > > > > Seems correct? > > > Wouldn't this be all ports in INIT indicate reset of switch ? > for ports which are LinkUp. This is pretty dicey :-( I don't see a good way to determine this. -- Hal > -- Hal > > > We could send PortInfo, SwitchInfo, > > > > SwitchInfo is queried at each light sweep, PortInfo's if ChangeBit is > > set. Guess we are ok with it even now. > > > > > LFT, MFT, SL2VL, VLArb, PKey queries > > > and make sure no change from previous state. Or we could simply > > enforce > > > last state by sending it over again ... > > > > I think we could want to re-read PKey tables in order to preserve > > existing PKey indices and just to flush (overwrite with new settings) > > LFT, MFT, SL2VL, VLArb tables. Reasonable? > > > > Sasha > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at mellanox.co.il Tue Jul 24 06:58:42 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 24 Jul 2007 16:58:42 +0300 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46A536EC.4060201@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A536EC.4060201@ichips.intel.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65B2@mtlexch01.mtl.com> > > > > I would like to propose adding project directories under > > http://www.openfabrics.org/downloads/ where appropriate and give > > maintainers access. For example: > > > > http://www.openfabrics.org/downloads/verbs (rdreier) > > http://www.openfabrics.org/downloads/rdmacm (shefty) > > http://www.openfabrics.org/downloads/dapl (ardavis) > > http://www.openfabrics.org/downloads/management (sashak) > > http://www.openfabrics.org/downloads/OFED (vlad) > > http://www.openfabrics.org/downloads/WinOF (ardavis) > > http://www.openfabrics.org/downloads/archives (vlad) ?? > > etc... > > > > Each of these would contain a README that details the contents of the > > directory along with WEB_README that provides a short description for > > the webpage. Jeff could then automatically parse for directories under > > downloads and if it contains WEB_README add a webpage link to the > > directory along with the short description. > > Looks good for me. Regards, Vladimir From sashak at voltaire.com Tue Jul 24 07:04:50 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 17:04:50 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? 
- no LFT re-configuration In-Reply-To: References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> Message-ID: <20070724140450.GV27878@sashak.voltaire.com> On 07:08 Tue 24 Jul , Hal Rosenstock wrote: > On 7/23/07, Sasha Khapyorsky wrote: > > > > Hi Eitan, > > > > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > > Hi Sasha, Hal, > > > > > > I think I have an idea: > > > > > > Since this is a specific switch that reported ChangeBit or Trap why > > > can't we just qualify that there was no change in the switch setup? > > > > The ChangeBit seems to be good start point - then OpenSM will query all > > switch ports PortInfo anyway and if for all ports PortState is <= INIT > > (and at least for one port it is = INIT), it means that this switch was > > rebooted/reinitialized. > > > > And for single port PortState drop to = INIT should indicate > > reinitialization. > > > > Seems correct? > > > Wouldn't this be all ports in INIT indicate reset of switch ? It includes not connected ports too, so I guess it should be <= INIT . Sasha From jackm at dev.mellanox.co.il Tue Jul 24 07:04:31 2007 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 24 Jul 2007 17:04:31 +0300 Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4 In-Reply-To: <20070724121440.GA2775@minantech.com> References: <20070724121440.GA2775@minantech.com> Message-ID: <200707241704.31831.jackm@dev.mellanox.co.il> On Tuesday 24 July 2007 15:14, Gleb Natapov wrote: > Hi, > > There is a bug in mlx4_post_send(). A data that is sent inline and > consists from multiple small sges isn't copied properly into wqe. > The following patch fixes it for me. 
>
> Signed-off-by: Gleb Natapov
>
> diff --git a/src/qp.c b/src/qp.c
> index 66ee309..83a4fd4 100644
> --- a/src/qp.c
> +++ b/src/qp.c
> @@ -288,6 +288,7 @@ int mlx4_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
>      memcpy(wqe, addr, len);
>      wqe += len;
>      seg_len += len;
> +    off += len;
>      }
>
>      if (seg_len) {

Good catch! This patch is correct.

Roland?

- Jack

From sashak at voltaire.com Tue Jul 24 07:12:20 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 17:12:20 +0300
Subject: [ofa-general] Re: [PATCH] libibmad: Fixed a name of a field in SwitchInfo to the right name
In-Reply-To: <200707241432.14848.dotanb@dev.mellanox.co.il>
References: <200707241432.14848.dotanb@dev.mellanox.co.il>
Message-ID: <20070724141220.GW27878@sashak.voltaire.com>

On 14:32 Tue 24 Jul , Dotan Barak wrote:
> Fixed a name of a field in SwitchInfo to the right name.
>
> Signed-off-by: Dotan Barak

Applied. Thanks.

Sasha

From hal.rosenstock at gmail.com Tue Jul 24 07:30:41 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 10:30:41 -0400
Subject: [ofa-general] OpenSM detection of duplicated GUIDs on loopback
Message-ID:

Hi,

This starts off as a "minor" issue, and I know it has been discussed somewhat in the past:

Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows:

__osm_ni_rcv_set_links
{
...
    /*
      When there are only two nodes with exact same guids (connected back
      to back) - the previous check for duplicated guid will not catch
      them. But the link will be from the port to itself...
      Enhanced Port 0 is an exception to this
    */
    if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
        (port_num == p_ni_context->port_num) &&
        (port_num != 0))
    {
      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
               "__osm_ni_rcv_set_links: ERR 0D18: "
               "Duplicate GUID found by link from a port to itself:"
               "node 0x%" PRIx64 ", port number 0x%X\n",
               cl_ntoh64( osm_node_get_node_guid( p_node ) ),
               port_num );
...

So this occurs over and over and over and fills the log with the same spew. This should be improved IMO.

Is this really a fatal condition ? Doesn't seem like it should be to me.

Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ?

Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is no longer itself.

Also, is there a relationship of this with the 12x/duplicated GUID code ?

Thanks.

-- Hal

From eitan at mellanox.co.il Tue Jul 24 07:44:22 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 17:44:22 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To:
References:
Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>

Hi Hal,

What is this "loopback" connector used for?
It does not seem to me like a very useful thing to do.

Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:31 PM To: OpenFabrics General Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik Subject: OpenSM detection of duplicated GUIDs on loopback Hi, This is what starts off as a "minor" issue and I know it has been discussed it somewhat in the past: Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows: __osm_ni_rcv_set_links { ... /* When there are only two nodes with exact same guids (connected back to back) - the previous check for duplicated guid will not catch them. But the link will be from the port to itself... Enhanced Port 0 is an exception to this */ if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) && (port_num == p_ni_context->port_num) && (port_num != 0)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D18: " "Duplicate GUID found by link from a port to itself:" "node 0x%" PRIx64 ", port number 0x%X\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); ... So this occurs over and over and over and fills the log with the same spew. This should be improved IMO. Is this really a fatal condition ? Doesn't seem like it should be to me. Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ? Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is now longer itself. Also, is there a relationship of this with the 12x/duplicated GUID code ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Tue Jul 24 07:53:00 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 24 Jul 2007 10:53:00 -0400 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> Message-ID: Hi Eitan, On 7/24/07, Eitan Zahavi wrote: > > *Hi Hal,* > ** > *What is this "loopback" connector used for?* > *Does not seem to me like a very useful thing to do.* > Perhaps not but no reason OpenSM can't handle this more gracefully. *Anyway, if it is not a production environment we could add a "debug mode" > (-d flag option) to ignore this check.* > Why would a separate flag be needed ? -- Hal > > *Eitan Zahavi*** > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > ------------------------------ > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Tuesday, July 24, 2007 5:31 PM > *To:* OpenFabrics General > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > Hi, > > This is what starts off as a "minor" issue and I know it has been > discussed it somewhat in the past: > > Putting a loopback connector on a (switch) link causes OpenSM to indicate > duplicated GUID error 0D18 as follows: > > __osm_ni_rcv_set_links > { > ... > /* > When there are only two nodes with exact same guids > (connected back > to back) - the previous check for duplicated guid will not > catch > them. But the link will be from the port to itself... 
> Enhanced Port 0 is an exception to this > */ > if ((osm_node_get_node_guid( p_node ) == > p_ni_context->node_guid) && > (port_num == p_ni_context->port_num) && > (port_num != 0)) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_ni_rcv_set_links: ERR 0D18: " > "Duplicate GUID found by link from a port to itself:" > "node 0x%" PRIx64 ", port number 0x%X\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > port_num ); > ... > > So this occurs over and over and over and fills the log with the same > spew. This should be improved IMO. > > Is this really a fatal condition ? Doesn't seem like it should be to me. > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe > for this condition ? > > Seems like something like an extra loopback bit should be added to some > port structure which should cause these links to be ignored. This bit would > then be reset when the peer is now longer itself. > > Also, is there a relationship of this with the 12x/duplicated GUID code ? > > Thanks. > > -- Hal > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arthur.jones at qlogic.com Tue Jul 24 07:53:35 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 07:53:35 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724030318.GA7589@mellanox.co.il> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> Message-ID: <20070724145335.GF16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote: > [...] > But I also see a serious problem with addressing: basically > git tracks content. It's not designed to track a bush > of branches taken together. For example, take tagging: > tag namespace is global, so you can not have the same > tag point at multiple branches at the same time. agreed. 
however, the way we use git, with the location of the git DB as the "tag", it's not really a problem in practice. but tagging each branch separately is indeed a PITA... > >anyway, what do you think? is there anyway i could > >convince you to dump the backport patches and put > >all the backports in branches? i'm willing to do the > >legwork if you see value... > > Can you publish the scripts and/or the tree? > I think we can start by just running the scripts nightly, > making it possible for people to view backport history > with gitview. i've attached the script that i'm using to compare the trees, but it's a total hack. it doesn't keep the patch history. that would not be too hard to do i guess -- if there's interest... to run the script: $ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel $ cd ofed_kernel $ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done now you'll have a bunch of backport-2.6.xxx branches... arthur -------------- next part -------------- 2.6.5_sles9_sp3 2.6.9_U2 2.6.9_U3 2.6.9_U4 2.6.9_U5 2.6.11_FC4 2.6.11 2.6.12 2.6.13_suse10_0_u 2.6.13 2.6.14 2.6.15_ubuntu606 2.6.15 2.6.16_sles10 2.6.16_sles10_sp1 2.6.16 2.6.17 2.6.18_FC6 2.6.18 2.6.19 2.6.20 2.6.21 2.6.22 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: create-backport.sh Type: application/x-sh Size: 265 bytes Desc: not available URL: From eitan at mellanox.co.il Tue Jul 24 07:52:33 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 24 Jul 2007 17:52:33 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:53 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback Hi Eitan, On 7/24/07, Eitan Zahavi wrote: Hi Hal, What is this "loopback" connector used for? Does not seem to me like a very useful thing to do. Perhaps not but no reason OpenSM can't handle this more gracefully. Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check. Why would a separate flag be needed ? [EZ] Since I do not see any other solution for the SM to know it is really a loop back plug rather then two devices with same GUID connected back to back ... -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:31 PM To: OpenFabrics General Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik Subject: OpenSM detection of duplicated GUIDs on loopback Hi, This is what starts off as a "minor" issue and I know it has been discussed it somewhat in the past: Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows: __osm_ni_rcv_set_links { ... 
/* When there are only two nodes with exact same guids (connected back to back) - the previous check for duplicated guid will not catch them. But the link will be from the port to itself... Enhanced Port 0 is an exception to this */ if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) && (port_num == p_ni_context->port_num) && (port_num != 0)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D18: " "Duplicate GUID found by link from a port to itself:" "node 0x%" PRIx64 ", port number 0x%X\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); ... So this occurs over and over and over and fills the log with the same spew. This should be improved IMO. Is this really a fatal condition ? Doesn't seem like it should be to me. Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ? Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is no longer itself. Also, is there a relationship of this with the 12x/duplicated GUID code ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Tue Jul 24 08:03:30 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 24 Jul 2007 11:03:30 -0400 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> Message-ID: On 7/24/07, Eitan Zahavi wrote: > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Tuesday, July 24, 2007 5:53 PM > *To:* Eitan Zahavi > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > Hi Eitan, > > On 7/24/07, Eitan Zahavi wrote: > > > > *Hi Hal,* > > ** > > *What is this "loopback" connector used for?* > > *Does not seem to me like a very useful thing to do.* > > > ** > Perhaps not but no reason OpenSM can't handle this more gracefully. > > *Anyway, if it is not a production environment we could add a "debug > > mode" (-d flag option) to ignore this check.* > > > ** > Why would a separate flag be needed ? > *[EZ] Since I do not see any other solution for the SM to know it is > really a loop back plug rather then two devices with same GUID connected > back to back ...* > > "Technically", this should only occur when looped back and not two devices with same GUID as GUID == globally unique and a duplication indicates a "manufacturing" issue. Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag ? -- Hal > -- Hal > > ** > > > > *Eitan Zahavi*** > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > ------------------------------ > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > *Sent: *Tuesday, July 24, 2007 5:31 PM > > *To:* OpenFabrics General > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > > > > Hi, > > > > This is what starts off as a "minor" issue and I know it has been > > discussed it somewhat in the past: > > > > Putting a loopback connector on a (switch) link causes OpenSM to > > indicate duplicated GUID error 0D18 as follows: > > > > __osm_ni_rcv_set_links > > { > > ... > > /* > > When there are only two nodes with exact same guids > > (connected back > > to back) - the previous check for duplicated guid will not > > catch > > them. But the link will be from the port to itself... > > Enhanced Port 0 is an exception to this > > */ > > if ((osm_node_get_node_guid( p_node ) == > > p_ni_context->node_guid) && > > (port_num == p_ni_context->port_num) && > > (port_num != 0)) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_ni_rcv_set_links: ERR 0D18: " > > "Duplicate GUID found by link from a port to > > itself:" > > "node 0x%" PRIx64 ", port number 0x%X\n", > > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > > port_num ); > > ... > > > > So this occurs over and over and over and fills the log with the same > > spew. This should be improved IMO. > > > > Is this really a fatal condition ? Doesn't seem like it should be to me. > > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that > > safe for this condition ? > > > > Seems like something like an extra loopback bit should be added to some > > port structure which should cause these links to be ignored. This bit would > > then be reset when the peer is now longer itself. > > > > Also, is there a relationship of this with the 12x/duplicated GUID code > > ? > > > > Thanks. 
> > > > -- Hal > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue Jul 24 08:09:09 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 18:09:09 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724145335.GF16727@bauxite.pathscale.com> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> Message-ID: <20070724150909.GL4359@mellanox.co.il> > Quoting Arthur Jones : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > hi michael, ... > > On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote: > > [...] > > But I also see a serious problem with addressing: basically > > git tracks content. It's not designed to track a bush > > of branches taken together. For example, take tagging: > > tag namespace is global, so you can not have the same > > tag point at multiple branches at the same time. > > agreed. however, the way we use git, with the > location of the git DB as the "tag", it's not > really a problem in practice. who uses git this way? > but tagging each > branch separately is indeed a PITA... This is just one problem. For example, git pull can only merge one branch at a time. > > >anyway, what do you think? is there anyway i could > > >convince you to dump the backport patches and put > > >all the backports in branches? i'm willing to do the > > >legwork if you see value... > > > > can you publish the scripts and/or the tree? > > i think we can start by just running the scripts nightly, > > making it possible for people to view backport history > > with gitview. > > i've attached the script that i'm using to compare > the trees, but it's a total hack. it doesn't keep > the patch history. that would not be too hard to > do i guess -- if there's interest... 
> > to run the script: > > > $ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel > $ cd ofed_kernel > $ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done > > now you'll have a bunch of backport-2.6.xxx branches... So, would you like to have this script run nightly on ofed trees? -- MST From arthur.jones at qlogic.com Tue Jul 24 08:23:05 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 08:23:05 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724150909.GL4359@mellanox.co.il> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> Message-ID: <20070724152305.GG16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 06:09:09PM +0300, Michael S. Tsirkin wrote: > > Quoting Arthur Jones : > > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > > > hi michael, ... > > > > On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote: > > > [...] > > > But I also see a serious problem with addressing: basically > > > git tracks content. It's not designed to track a bush > > > of branches taken together. For example, take tagging: > > > tag namespace is global, so you can not have the same > > > tag point at multiple branches at the same time. > > > > agreed. however, the way we use git, with the > > location of the git DB as the "tag", it's not > > really a problem in practice. > > who uses git this way? i do. > > but tagging each > > branch separately is indeed a PITA... > > This is just one problem. > For example, git pull can only merge one branch at a time. how is this a problem? the way i use git, i use a script to "reflow" the changes into the dependent branches. over the last few months, anyway, it has worked fine... 
> > > >anyway, what do you think? is there anyway i could > > > >convince you to dump the backport patches and put > > > >all the backports in branches? i'm willing to do the > > > >legwork if you see value... > > > > > > can you publish the scripts and/or the tree? > > > i think we can start by just running the scripts nightly, > > > making it possible for people to view backport history > > > with gitview. > > > > i've attached the script that i'm using to compare > > the trees, but it's a total hack. it doesn't keep > > the patch history. that would not be too hard to > > do i guess -- if there's interest... > > > > to run the script: > > > > > > $ git clone git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel > > $ cd ofed_kernel > > $ for b in `cat ../ofed-backports.txt`; do ../create-backport.sh $b; done > > > > now you'll have a bunch of backport-2.6.xxx branches... > > So, would you like to have this script run nightly on ofed trees? if someone finds that useful. my main motivation is getting rid of all the patches in ofed, if running this script nightly helps us to get there, then i'm all for it. if it's just for me, it's easy enough to run the scripts by hand... arthur From mst at dev.mellanox.co.il Tue Jul 24 08:32:28 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 18:32:28 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724152305.GG16727@bauxite.pathscale.com> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> Message-ID: <20070724152833.GN4359@mellanox.co.il> > Quoting Arthur Jones : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > hi michael, ... > > On Tue, Jul 24, 2007 at 06:09:09PM +0300, Michael S. 
Tsirkin wrote: > > > Quoting Arthur Jones : > > > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > > > > > hi michael, ... > > > > > > On Tue, Jul 24, 2007 at 06:03:41AM +0300, Michael S. Tsirkin wrote: > > > > [...] > > > > But I also see a serious problem with addressing: basically > > > > git tracks content. It's not designed to track a bush > > > > of branches taken together. For example, take tagging: > > > > tag namespace is global, so you can not have the same > > > > tag point at multiple branches at the same time. > > > > > > agreed. however, the way we use git, with the > > > location of the git DB as the "tag", it's not > > > really a problem in practice. > > > > who uses git this way? > > i do. > > > > but tagging each > > > branch separately is indeed a PITA... > > > > This is just one problem. > > For example, git pull can only merge one branch at a time. > > how is this a problem? the way i use git, > i use a script to "reflow" the changes into > the dependent branches. over the last few > months, anyway, it has worked fine... Precisely because no one developed on these branches, so you are re-generating them from patches - not a problem, but as you point out not too useful either. If people start developing on these branches, then eventually you will need to merge them - and git only merges them one at a time. 
-- MST From arthur.jones at qlogic.com Tue Jul 24 08:41:51 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 08:41:51 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724152833.GN4359@mellanox.co.il> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> Message-ID: <20070724154151.GH16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 06:32:28PM +0300, Michael S. Tsirkin wrote: > [...] > > > For example, git pull can only merge one branch at a time. > > > > how is this a problem? the way i use git, > > i use a script to "reflow" the changes into > > the dependent branches. over the last few > > months, anyway, it has worked fine... > > Precisely because no one developed on these branches, > so you are re-generating themfrom patches - not a problem, > but as you point out not too useful either. well, no, i _have_ been doing development on the local branches in our internal repo. i also merge in changes that you make to the ofed repo to our internal backport branches. the script i posted is just so that i can more easily compare our internal branches to the ofed backport "branches". > If people start developing on these branches, then > eventually you will need to merge them - and git only merges > them one at a time. yes, i have to merge them one at a time. i still don't see how this is a problem. backport changes can be pulled in and the changes from upstream can be merged in as well. i haven't had a problem with this so far. can you be more specific about what you expect will fail? arthur From mst at dev.mellanox.co.il Tue Jul 24 08:53:48 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 24 Jul 2007 18:53:48 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724154151.GH16727@bauxite.pathscale.com> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> Message-ID: <20070724155348.GP4359@mellanox.co.il> > Quoting Arthur Jones : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > hi michael, ... > > On Tue, Jul 24, 2007 at 06:32:28PM +0300, Michael S. Tsirkin wrote: > > [...] > > > > For example, git pull can only merge one branch at a time. > > > > > > how is this a problem? the way i use git, > > > i use a script to "reflow" the changes into > > > the dependent branches. over the last few > > > months, anyway, it has worked fine... > > > > Precisely because no one developed on these branches, > > so you are re-generating themfrom patches - not a problem, > > but as you point out not too useful either. > > well, no, i _have_ been doing development on the > local branches in our internal repo. i also > merge in changes that you make to the ofed repo > to our internal backport branches. the script > i posted is just so that i can more easily compare > our internal branches to the ofed backport "branches". How do you do the merging? > > If people start developing on these branches, then > > eventually you will need to merge them - and git only merges > > them one at a time. > > yes, i have to merge them one at a time. i > still don't see how this is a problem. backport > changes can be pulled in and the changes from > upstream can be merged in as well. i haven't > had a problem with this so far. can you be more > specific about what you expect will fail? 
Well, as distro maintainers we need to merge a lot, from different people. We'll have to write all kind of scripts to do it instead of a plain git pull. And, I expect almost all git operations will have to be wrapped in a script in some way, to operate on a bush of branches. -- MST From weiny2 at llnl.gov Tue Jul 24 09:05:11 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 24 Jul 2007 09:05:11 -0700 Subject: [ofa-general] Command specification of ca_name and ca_port In-Reply-To: <20070724013306.GH11674@sashak.voltaire.com> References: <46A4C0C7.7020107@systemfabricworks.com> <20070724013306.GH11674@sashak.voltaire.com> Message-ID: <20070724090511.636bbccb.weiny2@llnl.gov> On Tue, 24 Jul 2007 04:33:06 +0300 Sasha Khapyorsky wrote: > Hi David, > > On 09:52 Mon 23 Jul , David McMillen wrote: > > > > There are a standard set of command line options that allow specification of > > the CA to use for sending the requests. I'm adding these to programs that > > don't have them, since they are very useful when diagnosing a node connected > > to multiple subnets. Even if you discount multiple subnets on purpose, > > sometimes this happens when the hardware connecting all of the CA ports to > > the same place gets broken, and that is when you need diagnostics that can > > help figure out what is where. > > > > The standard options are: > > > > -C use the specified ca_name. > > > > -P use the specified ca_port. > > > > -t override the default timeout for the solicited mads. > > > > My problem is that saquery already uses -C and -P, although the -t exists > > for the expected purpose. Also, ibcheckerrs already uses -t for specifying > > the threshold file. > > I think unified command line options over diags are good thing, so I > guess reasonable renaming should be acceptable. I agree, however right now saquery does not support specifying the ca_name or ca_port, so you would have to add that support. 
> > > > Changing the timeout for ibcheckerrs isn't critical, but not being able to > > do it doesn't seem right. However, the saquery command could be really > > handy for figuring out split fabrics, and is useful to those of us that > > connect to multiple subnets. > > > > Does anybody have a useful suggestion? > > '-T' for the threshold file? That sounds good. > > But it is easy part - saquery renames are > less intuitive :(. Probably just lower case? Or special query option > (-q or -Q), so queries could be specified as -qP, -qC? > I disagree with this because ~50% of the options are queries, its primary purpose is to query, and most of the other options change the format of the output of the query. Therefore, I don't think a -q should be required for a query. I think that seems redundant. Perhaps just changing the current options to -c and -p, and adding -C and -P would be best. I know this might break some scripts out there, particularly mine, but I think it is the right thing to do if you really want consistency. Thoughts? Ira From sean.hefty at intel.com Tue Jul 24 09:09:35 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 24 Jul 2007 09:09:35 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070723200640.GA13117@bauxite.pathscale.com> Message-ID: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> >at qlogic we now keep the backports as branches in >our git tree and this, i find, is much easier to >handle. because: > >* viewing and navigating backport source becomes > _much_ easier. >* merges are easier -- patches are much more fragile > than branches. >* comparisons are easier -- checking for differences > between backports and between a backport and the > canonical source is faster and more convenient... >* changesets are readable. trying to decipher diffs > to patches is medically proven to take months, if not > years, off your life. 
Let's add that you don't need patches to patches, and the order patches are applied isn't determined alphabetically. >anyway, what do you think? is there anyway i could >convince you to dump the backport patches and put >all the backports in branches? i'm willing to do the >legwork if you see value... I would love OFED to dump the patch directory concept. - Sean From arthur.jones at qlogic.com Tue Jul 24 09:13:51 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 09:13:51 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724155348.GP4359@mellanox.co.il> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> Message-ID: <20070724161351.GI16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 06:53:48PM +0300, Michael S. Tsirkin wrote: > [...] > > well, no, i _have_ been doing development on the > > local branches in our internal repo. i also > > merge in changes that you make to the ofed repo > > to our internal backport branches. the script > > i posted is just so that i can more easily compare > > our internal branches to the ofed backport "branches". > > How do you do the merging? for just the backport branches, i merge different ways from different sources: * from upstream, it's a pull into master and a git merge master into local backport branches -- i call this a reflow. * from local developers, it's a git pull straight into the backport branch, then reflow the repo. 
* from ofed, i apply the backport patch by hand and fixup the inevitable clashes -- either because part of the patch is already applied, or because context has changed enough for git apply to get confused. when these are fixed up, reflow the repo... > > > If people start developing on these branches, then > > > eventually you will need to merge them - and git only merges > > > them one at a time. > > > > yes, i have to merge them one at a time. i > > still don't see how this is a problem. backport > > changes can be pulled in and the changes from > > upstream can be merged in as well. i haven't > > had a problem with this so far. can you be more > > specific about what you expect will fail? > > Well, as distro maintainers we need to merge a lot, from different > people. We'll have to write all kind of scripts to do it instead of > a plain git pull. i can't imagine what script you would need. can you be more specific? it would seem to me that you could just pull straight in to the backport branch... > And, I expect almost all git operations will have to be wrapped > in a script in some way, to operate on a bush of branches. so far, this hasn't been an issue for me. the only operation that i've scripted is the reflow. for most work, i can just ignore the backport branches and do the work in the (copy of) master, then reflow the changes into the backports... arthur From mst at dev.mellanox.co.il Tue Jul 24 09:16:46 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 24 Jul 2007 19:16:46 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> Message-ID: <20070724161646.GA24797@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > >at qlogic we now keep the backports as branches in > >our git tree and this, i find, is much easier to > >handle. because: > > > >* viewing and navigating backport source becomes > > _much_ easier. > >* merges are easier -- patches are much more fragile > > than branches. > >* comparisons are easier -- checking for differences > > between backports and between a backport and the > > canonical source is faster and more convenient... > >* changesets are readable. trying to decipher diffs > > to patches is medically proven to take months, if not > > years, off your life. > > Let's add that you don't need patches to patches, and the order patches are > applied isn't determined alphabetically. > > >anyway, what do you think? is there anyway i could > >convince you to dump the backport patches and put > >all the backports in branches? i'm willing to do the > >legwork if you see value... > > I would love OFED to dump the patch directory concept. I'd love to have a common source for all kernels, and the kernel_addons mechanism does this for us whenever possible. But, for these cases where the code actually needs to be modified, applying a patch seems like the least evil way to do it. Alternatives seem to be much worse. -- MST From mst at dev.mellanox.co.il Tue Jul 24 09:23:06 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 24 Jul 2007 19:23:06 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724161351.GI16727@bauxite.pathscale.com> References: <20070612084108.GK6470@mellanox.co.il> <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> <20070724161351.GI16727@bauxite.pathscale.com> Message-ID: <20070724162305.GB24797@mellanox.co.il> > Quoting Arthur Jones : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > hi michael, ... > > On Tue, Jul 24, 2007 at 06:53:48PM +0300, Michael S. Tsirkin wrote: > > [...] > > > well, no, i _have_ been doing development on the > > > local branches in our internal repo. i also > > > merge in changes that you make to the ofed repo > > > to our internal backport branches. the script > > > i posted is just so that i can more easily compare > > > our internal branches to the ofed backport "branches". > > > > How do you do the merging? > > for just the backport branches, i merge different ways > from different sources: > * from upstream, it's a pull into master and a git merge master > into local backport branches -- i call this a reflow. > * from local developers, it's a git pull straight into > the backport branch, then reflow the repo. > * from ofed, i apply the backport patch by hand and > fixup the inevitable clashes -- either because part > of the patch is already applied, or because context > has changed enough for git apply to get confused. when > these are fixed up, reflow the repo... Hmm. Consider that you did all of the above, and then mailed me that there's an update. Now I need to merge updates to multiple branches directly and git pull does not do this. It's a problem. 
> > > > If people start developing on these branches, then > > > > eventually you will need to merge them - and git only merges > > > > them one at a time. > > > > > > yes, i have to merge them one at a time. i > > > still don't see how this is a problem. backport > > > changes can be pulled in and the changes from > > > upstream can be merged in as well. i haven't > > > had a problem with this so far. can you be more > > > specific about what you expect will fail? > > > > Well, as distro maintainers we need to merge a lot, from different > > people. We'll have to write all kind of scripts to do it instead of > > a plain git pull. > > i can't imagine what script you would need. can > you be more specific? it would seem to me that you > could just pull straight in to the backport branch... You'll have to check out branches one by one, and do a pull. What if there's a conflict? I currently just do git reset --hard ORIG_HEAD and mail the maintainer to fix it up - but this won't work with the "bush of branches" approach. > > And, I expect almost all git operations will have to be wrapped > > in a script in some way, to operate on a bush of branches. > > so far, this hasn't been an issue for me. the only > operation that i've scripted is the reflow. for > most work, i can just ignore the backport branches and > do the work in the (copy of) master, then reflow the > changes into the backports... Because you only have your driver to maintain. 
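The conflict scenario raised above (pull into several branches, back out if any one of them conflicts) can be sketched as a small shell function. This is only an illustration under stated assumptions: `pull_bush` is a hypothetical name, not a git or OFED command, and it relies on a modern git that provides `git merge --abort` and `git pull --no-rebase`. It pulls each named branch from one remote and, if any pull fails, resets every branch it already updated to its previous tip, so the whole request can be bounced back to the submitter.

```shell
#!/bin/sh
# pull_bush REMOTE BRANCH...  (hypothetical helper, not part of git)
# Pull each BRANCH from REMOTE into the local branch of the same name.
# All-or-nothing: if any pull fails, every branch touched so far is
# reset to the tip it had before the call.
pull_bush() {
    remote=$1; shift
    start=$(git symbolic-ref --short HEAD) || return 1
    updated=""      # space-separated "branch:old-tip" entries
    status=0
    for b in "$@"; do
        git checkout -q "$b" || { status=1; break; }
        old=$(git rev-parse HEAD)
        if git pull -q --no-rebase --no-edit "$remote" "$b" >/dev/null 2>&1; then
            updated="$updated $b:$old"
        else
            # Leave the conflicted merge and restore this branch.
            git merge --abort 2>/dev/null
            git reset -q --hard "$old"
            status=1
            break
        fi
    done
    if [ "$status" -ne 0 ]; then
        # Roll back the branches that had already been updated.
        for entry in $updated; do
            git checkout -q "${entry%%:*}" && git reset -q --hard "${entry#*:}"
        done
    fi
    git checkout -q "$start"
    return "$status"
}
```

Usage would look like `pull_bush git://example.invalid/repo.git backport-2.6.9_U4 backport-2.6.16_sles10` (the URL is a placeholder). A non-zero exit status means nothing was merged.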
-- MST From mshefty at ichips.intel.com Tue Jul 24 09:29:12 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jul 2007 09:29:12 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A5C8E6.5020906@voltaire.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> Message-ID: <46A628D8.4050109@ichips.intel.com> > Linux has a quite sophisticated mechanism to maintain / cache / probe / > invalidate / update the network stack L2 neighbour info. Path records are not just L2 info. They contain L4, L3, and L2 info together. > For example, in the Voltaire gen1 stack we had an ib arp module which > was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). This > module managed some sort of path cache, were IPoIB was always asking for > non-cached path and other ULPs were willing to get cached path. IMO, using a cached AH is no different than using a cached path. You're simply mapping the PR data into another structure. We're ignoring the problem here, and that is that a centralized SA doesn't scale. MPI stacks have largely ignored this problem by simply not doing path record queries. Path information is often hard-coded, with QPN data exchanged out of band over sockets (often over Ethernet). We've seen problems running large MPI jobs without PR caching. I know that Silverstorm/QLogic did as well. And apparently Voltaire hit the same type of problem, since you added a caching module. (Did Mellanox and Topspin/Cisco create PR caches as well?) At least three companies working on IB came up with the same solution. What is the objection to the current patch set? 
- Sean From arthur.jones at qlogic.com Tue Jul 24 09:46:59 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 09:46:59 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724162305.GB24797@mellanox.co.il> References: <20070723200640.GA13117@bauxite.pathscale.com> <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> <20070724161351.GI16727@bauxite.pathscale.com> <20070724162305.GB24797@mellanox.co.il> Message-ID: <20070724164659.GJ16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 07:23:06PM +0300, Michael S. Tsirkin wrote: > [...] > > for just the backport branches, i merge different ways > > from different sources: > > * from upstream, it's a pull into master and a git merge master > > into local backport branches -- i call this a reflow. > > * from local developers, it's a git pull straight into > > the backport branch, then reflow the repo. > > * from ofed, i apply the backport patch by hand and > > fixup the inevitable clashes -- either because part > > of the patch is already applied, or because context > > has changed enough for git apply to get confused. when > > these are fixed up, reflow the repo... > > Hmm. Concider that yuou did all of the above, and then mail me > that there's an update. Now I need to merge updates to multiple branches directly > and git pull does not do this. It's a problem. for changes made to the canonical source, it's just git pull into ofed_kernel and a reflow. for changes made to the backports, you would need to git checkout and git pull into each of the backport branches _in which i made a change_. 
the case that i make changes to _all_ or even a
significant number of backport patches is sufficiently
rare that i doubt it is worth scripting.  but, if the
script is necessary, it's pretty straightforward:

    set -e
    # "origin" is assumed to be the remote that carries the backport branches
    for b in branches-which-have-changed; do
        git checkout "$b"
        # pull the branch of the same name from the remote
        git pull origin "$b"
    done

> [...]
> > i can't imagine what script you would need.  can
> > you be more specific?  it would seem to me that you
> > could just pull straight in to the backport branch...
>
> You'll have to check out branches one by one, and do a pull.
> What if there's a conflict? I currently just do git reset --hard ORIG_HEAD
> and mail the maintainer to fix it up - but this won't work
> with the "bush of branches" approach.

it works for me.  what do you expect will break?

> > > And, I expect almost all git operations will have to be wrapped
> > > in a script in some way, to operate on a bush of branches.
> >
> > so far, this hasn't been an issue for me.  the only
> > operation that i've scripted is the reflow.  for
> > most work, i can just ignore the backport branches and
> > do the work in the (copy of) master, then reflow the
> > changes into the backports...
>
> Because you only have your driver to maintain.

no, i have to maintain quite a few of the
ofed backport branches as well for our release.
if i started getting pull requests from people
with changes to 15 backport branches in one go,
i'd probably want to script it...

i have found that drawing a DAG with graphviz has
been a big help in making sure that i organize the
branches correctly...
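
As a rough sketch of how the conflict case could be scripted, the loop can record where each branch points before pulling and reset all of them if any pull fails. This is only an illustration: the function name `pull_all_or_rollback`, the remote name, and the branch names are assumptions, and it presumes a clean working tree, since `git reset --hard` discards local changes.

```shell
#!/bin/sh
# Sketch only: pull several branches from one remote; if any pull fails,
# reset every listed branch back to the commit it pointed at beforehand.
# Hypothetical usage: pull_all_or_rollback origin branch-a branch-b
pull_all_or_rollback() {
    remote=$1
    shift
    saved=""
    # Record where each branch points before touching anything.
    for b in "$@"; do
        sha=$(git rev-parse --verify "refs/heads/$b") || return 1
        saved="$saved $b:$sha"
    done
    for b in "$@"; do
        # --no-rebase: always merge, which is what the thread assumes.
        if git checkout -q "$b" && git pull -q --no-rebase "$remote" "$b"; then
            continue
        fi
        echo "pull of $b failed, rolling back all branches" >&2
        git reset -q --hard    # drop any half-finished merge
        for pair in $saved; do
            # Each pair is "branch:sha"; branch names cannot contain ':'.
            git checkout -q "${pair%%:*}" && git reset -q --hard "${pair##*:}"
        done
        return 1
    done
    return 0
}
```

After a rollback the working tree is left on the last branch in the list, so the caller may want to check out its original branch again.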
arthur From arthur.jones at qlogic.com Tue Jul 24 09:50:32 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 09:50:32 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724161646.GA24797@mellanox.co.il> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> Message-ID: <20070724165032.GK16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 07:16:46PM +0300, Michael S. Tsirkin wrote: > [...] > But, for these cases where the code actually needs to be modified, > applying a patch seems like the least evil way to do it. > Alternatives seem to be much worse. what is it about patches that are less evil than changesets? can you list some of the advantages? arthur From mst at dev.mellanox.co.il Tue Jul 24 09:52:03 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 19:52:03 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724164659.GJ16727@bauxite.pathscale.com> References: <20070724030318.GA7589@mellanox.co.il> <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> <20070724161351.GI16727@bauxite.pathscale.com> <20070724162305.GB24797@mellanox.co.il> <20070724164659.GJ16727@bauxite.pathscale.com> Message-ID: <20070724165203.GC24797@mellanox.co.il> > > Because you only have your driver to maintain. > > no, i have to maintain quite a few of the > ofed backport branches as well for our release. > if i started getting pull requests from people > with changes to 15 backport branches in one go, > i'd probably want to script it... Yea. 
Happens all the time here: when component maintainer makes a change, it will typically affect all backports or none. > i have found that drawing a DAG with graphviz has > been a big help in making sure that i organize the > branches correctly... Ugh .. *that* sounds complicated. Looks like it's much simpler with current setup. -- MST From sashak at voltaire.com Tue Jul 24 10:00:11 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 20:00:11 +0300 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> Message-ID: <20070724170011.GY27878@sashak.voltaire.com> Hi, On 11:03 Tue 24 Jul , Hal Rosenstock wrote: > On 7/24/07, Eitan Zahavi wrote: > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > *Sent:* Tuesday, July 24, 2007 5:53 PM > > *To:* Eitan Zahavi > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > Hi Eitan, > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > *Hi Hal,* > > > ** > > > *What is this "loopback" connector used for?* > > > *Does not seem to me like a very useful thing to do.* > > > > > ** > > Perhaps not but no reason OpenSM can't handle this more gracefully. I don't have "loopback" plug, but used loopback connections for some checks with simulator. There is nothing illegal, so I think it would be better to support it. > > *Anyway, if it is not a production environment we could add a "debug > > > mode" (-d flag option) to ignore this check.* > > > > > ** > > Why would a separate flag be needed ? 
> > *[EZ] Since I do not see any other solution for the SM to know it is > > really a loop back plug rather then two devices with same GUID connected > > back to back ...* Also we saw the cases when port moving triggers duplicated GUIDs detector (originally was reported on real fabric and it is trivially reproducible in simulated environment). So probably we need to find some better way to handle duplication GUID detector (in general, not just for loopback). For example node_info content could be compared. More ideas? Sasha From mst at dev.mellanox.co.il Tue Jul 24 09:55:50 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 19:55:50 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724165032.GK16727@bauxite.pathscale.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> Message-ID: <20070724165550.GD24797@mellanox.co.il> > Quoting Arthur Jones : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > hi michael, ... > > On Tue, Jul 24, 2007 at 07:16:46PM +0300, Michael S. Tsirkin wrote: > > [...] > > But, for these cases where the code actually needs to be modified, > > applying a patch seems like the least evil way to do it. > > Alternatives seem to be much worse. > > what is it about patches that are less evil > than changesets? can you list some of the > advantages? changesets *do not exist* in git - git tracks content. I compare "multiple directories with patches" with the "bush of branches". With bush of branches: git pull broken, git archive broken, git tag broken, git reset broken. It looks like the list can be continued. Yes, we can start building our own tools on top of git to do this, but I'd rather not. 
-- MST From sashak at voltaire.com Tue Jul 24 10:04:32 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 20:04:32 +0300 Subject: [ofa-general] Re: opensm: a bug in heavy sweep? - no LFT re-configuration In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> Message-ID: <20070724170432.GZ27878@sashak.voltaire.com> On 07:56 Tue 24 Jul , Eitan Zahavi wrote: > > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > > Hi Sasha, Hal, > > > > > > I think I have an idea: > > > > > > Since this is a specific switch that reported ChangeBit or Trap why > > > can't we just qualify that there was no change in the switch setup? > > > > The ChangeBit seems to be good start point - then OpenSM will > > query all switch ports PortInfo anyway and if for all ports > > PortState is <= INIT (and at least for one port it is = > > INIT), it means that this switch was rebooted/reinitialized. > > > > And for single port PortState drop to = INIT should indicate > > reinitialization. > > > > Seems correct? > Yes. > > > > > We could send PortInfo, SwitchInfo, > > > > SwitchInfo is queried at each light sweep, PortInfo's if > > ChangeBit is set. Guess we are ok with it even now. > I will double check that... > Well - even setting one port state to INIT did not cause the switch to > be reconfigured. > Seems the code does not enforce this condition yet. > > > > > LFT, MFT, SL2VL, VLArb, PKey queries > > > and make sure no change from previous state. Or we could simply > > > enforce last state by sending it over again ... 
> > > > I think we could want to re-read PKey tables in order to > > preserve existing PKey indices and just to flush (overwrite > > with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable? > Correct. Ok, I will prepare patches. I think about separate patches for switches and ports. Also likely MFT should be handled separately, since we don't do incremental update there yet. Sasha From sashak at voltaire.com Tue Jul 24 10:07:25 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 24 Jul 2007 20:07:25 +0300 Subject: [ofa-general] Command specification of ca_name and ca_port In-Reply-To: <20070724090511.636bbccb.weiny2@llnl.gov> References: <46A4C0C7.7020107@systemfabricworks.com> <20070724013306.GH11674@sashak.voltaire.com> <20070724090511.636bbccb.weiny2@llnl.gov> Message-ID: <20070724170725.GA27878@sashak.voltaire.com> On 09:05 Tue 24 Jul , Ira Weiny wrote: > > > > But it is easy part - saquery renames are > > less intuitive :(. Probably just lower case? Or special query option > > (-q or -Q), so queries could be specified as -qP, -qC? > > > > I disagree with this because ~50% of the options are query's, it's primary > purpose is to query, and most of the other options change the format of the > output of the query. Therefore, I don't think a -q should be required for a > query. I think that seems redundant. > > Perhaps just changing the current option to -c,-p, and adding -C and -P would > be best. I know this might break some scripts out there, particularly mine, > but I think it is the right thing to do if you really want consistency. > > Thoughts? -c,-p are fine for me too. 
Sasha From arthur.jones at qlogic.com Tue Jul 24 10:07:26 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 10:07:26 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724165550.GD24797@mellanox.co.il> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> Message-ID: <20070724170726.GL16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 07:55:50PM +0300, Michael S. Tsirkin wrote: > [...] > > what is it about patches that are less evil > > than changesets? can you list some of the > > advantages? > > changesets *do not exist* in git - git tracks content. > > I compare "multiple directories with patches" with the "bush of branches". > With bush of branches: > git pull broken, git archive broken, git tag broken, git reset broken. > It looks like the list can be continued. none of these things are broken, they are just used differently. despite your apprehension, i'd like to see a list of the _advantages_ of multiple directories with patches -- perhaps with this list in hand we can see how they stack up... > Yes, we can start building our own tools on top of git to do this, > but I'd rather not. i'd hardly call a 4 line script a "tool". compare it to the ./ofed_scripts/configure script which is no longer necessary with backport branches. i think the complexity argument doesn't take you too far... i realize that you're attached to your current method, but i've _used_ a different method, and i can say from experience that it works _much_ better... at sonoma, i heard quite a few people asking for easier access to the OFED source. from the user's point of view, pulling a single branch from a repo is _much_ simpler than our current setup, don't you think? 
arthur From arthur.jones at qlogic.com Tue Jul 24 10:11:21 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 10:11:21 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724165203.GC24797@mellanox.co.il> References: <20070724145335.GF16727@bauxite.pathscale.com> <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> <20070724161351.GI16727@bauxite.pathscale.com> <20070724162305.GB24797@mellanox.co.il> <20070724164659.GJ16727@bauxite.pathscale.com> <20070724165203.GC24797@mellanox.co.il> Message-ID: <20070724171121.GM16727@bauxite.pathscale.com> hi michael, ... On Tue, Jul 24, 2007 at 07:52:03PM +0300, Michael S. Tsirkin wrote: > [...] > > i have found that drawing a DAG with graphviz has > > been a big help in making sure that i organize the > > branches correctly... > > Ugh .. *that* sounds complicated. > Looks like it's much simpler with current setup. compared to the rather sophisticated linux-kernel changesets that i see from you on this list -- it's child's play... compared to figuring out the list of options for ofed_scripts/configure just so we can _see_ the source we're running on our box -- it's a walk in the park... one of the goals of OFED 1.3 is to make access to the source easier. to do that, we will prob need to rid ourselves of patches... 
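
To give a sense of scale, a graph description of the kind mentioned could be generated with a few lines of shell. This is a minimal sketch, assuming a hypothetical list of "parent child" branch pairs on standard input; the function name and the branch names are made up for illustration and are not the actual OFED branch layout.

```shell
#!/bin/sh
# Hypothetical helper: turn "parent child" branch pairs (one pair per
# line on stdin) into a graphviz digraph, so the reflow order is visible.
branch_pairs_to_dot() {
    echo 'digraph branches {'
    while read -r parent child; do
        # Skip blank or incomplete lines.
        [ -n "$parent" ] && [ -n "$child" ] || continue
        printf '    "%s" -> "%s";\n' "$parent" "$child"
    done
    echo '}'
}

# Example input; the branch names are invented for illustration.
branch_pairs_to_dot <<'EOF'
master backport-2.6.18
master backport-2.6.9
EOF
```

Piping the output through `dot -Tpng` renders the picture.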
arthur

From mshefty at ichips.intel.com  Tue Jul 24 10:11:52 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 10:11:52 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724162305.GB24797@mellanox.co.il>
References: <20070612084108.GK6470@mellanox.co.il>
    <20070723200640.GA13117@bauxite.pathscale.com>
    <20070724030318.GA7589@mellanox.co.il>
    <20070724145335.GF16727@bauxite.pathscale.com>
    <20070724150909.GL4359@mellanox.co.il>
    <20070724152305.GG16727@bauxite.pathscale.com>
    <20070724152833.GN4359@mellanox.co.il>
    <20070724154151.GH16727@bauxite.pathscale.com>
    <20070724155348.GP4359@mellanox.co.il>
    <20070724161351.GI16727@bauxite.pathscale.com>
    <20070724162305.GB24797@mellanox.co.il>
Message-ID: <46A632D8.3030801@ichips.intel.com>

> Hmm. Consider that you did all of the above, and then mail me
> that there's an update. Now I need to merge updates to multiple branches directly
> and git pull does not do this. It's a problem.

A simple script can do this.

> You'll have to check out branches one by one, and do a pull.
> What if there's a conflict? I currently just do git reset --hard ORIG_HEAD
> and mail the maintainer to fix it up - but this won't work
> with the "bush of branches" approach.

If there's a conflict, then you need a different patch. A single patch
may work for all backports, or a fix may require different patches
depending on the kernel version. As it stands now, there are patches
that we apply that do not work and expect a subsequent patch to fix it up.

- Sean

From mst at dev.mellanox.co.il  Tue Jul 24 10:14:45 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 24 Jul 2007 20:14:45 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <46A632D8.3030801@ichips.intel.com>
References: <20070724030318.GA7589@mellanox.co.il>
    <20070724145335.GF16727@bauxite.pathscale.com>
    <20070724150909.GL4359@mellanox.co.il>
    <20070724152305.GG16727@bauxite.pathscale.com>
    <20070724152833.GN4359@mellanox.co.il>
    <20070724154151.GH16727@bauxite.pathscale.com>
    <20070724155348.GP4359@mellanox.co.il>
    <20070724161351.GI16727@bauxite.pathscale.com>
    <20070724162305.GB24797@mellanox.co.il>
    <46A632D8.3030801@ichips.intel.com>
Message-ID: <20070724171445.GE24797@mellanox.co.il>

> Quoting Sean Hefty :
> Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
>
> >Hmm. Consider that you did all of the above, and then mail me
> >that there's an update. Now I need to merge updates to multiple branches
> >directly
> >and git pull does not do this. It's a problem.
>
> A simple script can do this.

Basically we'll have to script around all of git.
Examples: What if there's a conflict? I currently do git reset, we'll
need a script for this too? The tagging issue will have to be resolved
somehow - by a naming convention for tags? Another script ...

-- 
MST

From mst at dev.mellanox.co.il  Tue Jul 24 10:16:49 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 24 Jul 2007 20:16:49 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724171121.GM16727@bauxite.pathscale.com> References: <20070724150909.GL4359@mellanox.co.il> <20070724152305.GG16727@bauxite.pathscale.com> <20070724152833.GN4359@mellanox.co.il> <20070724154151.GH16727@bauxite.pathscale.com> <20070724155348.GP4359@mellanox.co.il> <20070724161351.GI16727@bauxite.pathscale.com> <20070724162305.GB24797@mellanox.co.il> <20070724164659.GJ16727@bauxite.pathscale.com> <20070724165203.GC24797@mellanox.co.il> <20070724171121.GM16727@bauxite.pathscale.com> Message-ID: <20070724171649.GF24797@mellanox.co.il> > one of the goals of OFED 1.3 is to make access > to the source easier. to do that, we will prob > need to rid ourselves of patches... I'm working on a rather simpler solution to this problem. Stay tuned. -- MST From mst at dev.mellanox.co.il Tue Jul 24 10:19:24 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 20:19:24 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724170726.GL16727@bauxite.pathscale.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> Message-ID: <20070724171924.GG24797@mellanox.co.il> > i realize that you're attached to your current method, > but i've _used_ a different method, and i can say from > experience that it works _much_ better... I'd like to see a clean method, that doesn't replace one set of problems that I understand with another that I have to learn. > at sonoma, i heard quite a few people asking for easier > access to the OFED source. 
from the user's point of view, > pulling a single branch from a repo is _much_ simpler > than our current setup, don't you think? I think users really want tarballs. If we had tarballs prepatched for all kernels, I think the problem would be solved for most people. -- MST From sean.hefty at intel.com Tue Jul 24 10:19:49 2007 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 24 Jul 2007 10:19:49 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724171445.GE24797@mellanox.co.il> Message-ID: <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com> >Examples: What if there's a conflict? I currently do git reset, we'll If there's a conflict applying a patch, you reject it. I fail to see any issue here. - Sean From tom at opengridcomputing.com Tue Jul 24 10:20:45 2007 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 24 Jul 2007 12:20:45 -0500 Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A Message-ID: <1185297645.14681.22.camel@trinity.ogc.int> For those interested in NFS-RDMA, OGC has created an install package based on the OFA 1.2 GA release. The package supports both SLES 10 and RHEL 5. You can download this package from http://www.opengridcomputing.com/nfs-rdma.html. Please let me know if you find any problems. Thanks, Tom From arthur.jones at qlogic.com Tue Jul 24 10:28:26 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Tue, 24 Jul 2007 10:28:26 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724171924.GG24797@mellanox.co.il> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> Message-ID: <20070724172826.GN16727@bauxite.pathscale.com> hi michael, ... 
On Tue, Jul 24, 2007 at 08:19:24PM +0300, Michael S. Tsirkin wrote: > > i realize that you're attached to your current method, > > but i've _used_ a different method, and i can say from > > experience that it works _much_ better... > > I'd like to see a clean method, that doesn't replace one set of > problems that I understand with another that I have to learn. i think we'll be further along by just doing a better job rather than waiting endlessly for perfection to come along. i'd _really_ like to see a list of the advantages of patches over branches. it's hard for me to know if i'm just missing something if the case is not laid out... arthur From mst at dev.mellanox.co.il Tue Jul 24 10:42:55 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 20:42:55 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com> References: <20070724171445.GE24797@mellanox.co.il> <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com> Message-ID: <20070724174255.GH24797@mellanox.co.il> > Quoting Sean Hefty : > Subject: RE: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > >Examples: What if there's a conflict? I currently do git reset, we'll > > If there's a conflict applying a patch, you reject it. I fail to see any issue > here. But the proposal here was to have a bush of branches, all of which need to be merged at the same time. It's possible that some would merge and some would fail, leaving me in an inconsistent state, and no easy way to get back to where I started. -- MST From mst at dev.mellanox.co.il Tue Jul 24 10:52:20 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Tue, 24 Jul 2007 20:52:20 +0300
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724172826.GN16727@bauxite.pathscale.com>
References: <20070723200640.GA13117@bauxite.pathscale.com>
    <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
    <20070724161646.GA24797@mellanox.co.il>
    <20070724165032.GK16727@bauxite.pathscale.com>
    <20070724165550.GD24797@mellanox.co.il>
    <20070724170726.GL16727@bauxite.pathscale.com>
    <20070724171924.GG24797@mellanox.co.il>
    <20070724172826.GN16727@bauxite.pathscale.com>
Message-ID: <20070724175220.GI24797@mellanox.co.il>

> i'd _really_ like to see a list of the advantages of
> patches over branches.  it's hard for me to know if
> i'm just missing something if the case is not laid out...

Here's a short list off the top of my head

- A single git pull merges any number of backport changes
- A single git reset ORIG_HEAD recovers from a conflicting merge
- A single tag tags all code for all kernels
- On update from upstream, if there is a conflict
  between upstream code and a patch
  it's easy to temporarily remove the patch, complete the merge,
  and go bugger the patch author
- For recent kernels there are almost no patches.
  So an update from upstream for these kernels is free,
  with branches I will still need to update all branches.
- Adding a fix which only affects common code
  is currently straight-forward: make a change, commit.
  With multiple branches every fix must be pulled into
  all branches.
-- MST From mshefty at ichips.intel.com Tue Jul 24 10:57:54 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jul 2007 10:57:54 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724174255.GH24797@mellanox.co.il> References: <20070724171445.GE24797@mellanox.co.il> <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com> <20070724174255.GH24797@mellanox.co.il> Message-ID: <46A63DA2.5000602@ichips.intel.com> > But the proposal here was to have a bush of branches, all of which > need to be merged at the same time. It's possible that some > would merge and some would fail, leaving me in an inconsistent state, > and no easy way to get back to where I started. A fix could be applied to some kernels, but not others. In fact, if a patch works for kernel X & Y, but has a conflict with kernel Z, then different patches are needed anyway. I don't see the requirement to merge everything or even apply a fix to all kernels at the same time. - Sean From mshefty at ichips.intel.com Tue Jul 24 11:13:08 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 24 Jul 2007 11:13:08 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724175220.GI24797@mellanox.co.il> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> Message-ID: <46A64134.5080502@ichips.intel.com> > Here's a short list off the top of my head > > - A single git pull merges any number of backport changes > - A single git reset ORIG_HEAD recovers from a conflicting merge > - A single tag tags all code for all kernels > - On update from upstream, if there is a 
conflict > between upstream code and and a patch > it's easy to temporarily remote the patch, complete the merge, > and go bugger the patch author > - For recent kernels there are almost no patches. > So an update from upstream for these kernels is free, > with branches I will still need to update all branches. > - Adding a fix which only affects common code > is currently straight-forward: make a change, commit. > With multiple branches every fix must be pulled into > all branches. You seem to be overlooking the fact that you already require a script to check that things work for all kernels. Until you apply a series of patches to form a particular kernel, you don't know if a change that you pulled in caused a conflict. You still have the requirement to verify the fix on all kernels, and it still requires running a script that pushes/pops patches to create each tree. - Sean From eitan at mellanox.co.il Tue Jul 24 11:12:10 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 24 Jul 2007 21:12:10 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> Hi Hal, The code to find "duplicated" GUIDs stem from real user cases where flawed burning procedure caused actual GUID duplications. There is nothing "impossible". So it is really critical the the SM will be able to recognize this case and abort. It might be that for testing someone wants to use a loopback plug that cause the same port GUID appear on both sides of link - but it is better to require the user doing the test to set some flag than to miss such a situation in real life cluster. This requirement was written after many people wasted many hours trying to figure out what was going on. 
PLEASE DO NOT TAKE IT AWAY Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 6:04 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 5:53 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback Hi Eitan, On 7/24/07, Eitan Zahavi wrote: Hi Hal, What is this "loopback" connector used for? Does not seem to me like a very useful thing to do. Perhaps not but no reason OpenSM can't handle this more gracefully. Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check. Why would a separate flag be needed ? [EZ] Since I do not see any other solution for the SM to know it is really a loop back plug rather then two devices with same GUID connected back to back ... "Technically", this should only occur when looped back and not two devices with same GUID as GUID == globally unique and a duplication indicates a "manufacturing" issue. Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag ? -- Hal -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL

________________________________

From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
Sent: Tuesday, July 24, 2007 5:31 PM
To: OpenFabrics General
Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
Subject: OpenSM detection of duplicated GUIDs on loopback

Hi,

This starts off as a "minor" issue, and I know it has been discussed
somewhat in the past:

Putting a loopback connector on a (switch) link causes OpenSM to
indicate duplicated GUID error 0D18 as follows:

__osm_ni_rcv_set_links
{
...
    /*
      When there are only two nodes with exact same guids (connected back
      to back) - the previous check for duplicated guid will not catch
      them. But the link will be from the port to itself...
      Enhanced Port 0 is an exception to this
    */
    if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
        (port_num == p_ni_context->port_num) &&
        (port_num != 0))
    {
      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
               "__osm_ni_rcv_set_links: ERR 0D18: "
               "Duplicate GUID found by link from a port to itself:"
               "node 0x%" PRIx64 ", port number 0x%X\n",
               cl_ntoh64( osm_node_get_node_guid( p_node ) ),
               port_num );
...

So this occurs over and over and over and fills the log with the same
spew. This should be improved IMO.

Is this really a fatal condition ? Doesn't seem like it should be to me.

Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
safe for this condition ?

Seems like something like an extra loopback bit should be added to some
port structure which should cause these links to be ignored. This bit
would then be reset when the peer is no longer itself.

Also, is there a relationship of this with the 12x/duplicated GUID code ?

Thanks.

-- Hal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sashak at voltaire.com  Tue Jul 24 11:36:42 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 24 Jul 2007 21:36:42 +0300
Subject: [ofa-general] [PATCH] opensm: detect fast switch reset and force LFT update
In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
    <20070722102209.GR16597@sashak.voltaire.com>
    <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
    <20070722174048.GO27878@sashak.voltaire.com>
    <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
    <20070724005153.GD11674@sashak.voltaire.com>
    <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
    <20070724170432.GZ27878@sashak.voltaire.com>
Message-ID: <20070724183642.GC27878@sashak.voltaire.com>

Here we try to detect a "fast" switch reset (one quick enough that
OpenSM does not see the down state within the sweep period) by checking
that the PortState of all ports is <= INIT. If such a reset is detected,
the p_sw->need_update flag remains set. In that case the switch's
forwarding tables are updated unconditionally.

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/opensm/osm_switch.h |    5 +++++
 opensm/opensm/osm_port_info_rcv.c  |    3 +++
 opensm/opensm/osm_state_mgr.c      |    1 +
 opensm/opensm/osm_switch.c         |    1 +
 opensm/opensm/osm_ucast_mgr.c      |    3 ++-
 5 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index 5b2b19e..9364d2c 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -112,6 +112,7 @@ typedef struct _osm_switch
   osm_fwd_tbl_t fwd_tbl;
   osm_mcast_tbl_t mcast_tbl;
   uint32_t discovery_count;
+  unsigned need_update;
   void *priv;
 } osm_switch_t;
 /*
@@ -152,6 +153,10 @@ typedef struct _osm_switch
 *    during the current fabric sweep. This number is reset
 *    to zero at the start of a sweep.
 *
+*  need_update
+*    When set indicates that switch was probably reset, so
+*    fwd tables and rest cached data should be flushed
+*
 * SEE ALSO
 *  Switch object
 *********/
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index adece65..6fe2d1d 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -337,6 +337,9 @@ __osm_pi_rcv_process_switch_port(
     }
   }
 
+  if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw)
+    p_node->sw->need_update = 0;
+
   /*
     Update the PortInfo attribute.
   */
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 0181c0f..7efbe2a 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -565,6 +565,7 @@ __osm_state_mgr_reset_switch_count(
   }
 
   p_sw->discovery_count = 0;
+  p_sw->need_update = 1;
 }
 
 /**********************************************************************
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index a5a6fb7..2e170fc 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -104,6 +104,7 @@ osm_switch_init(
   p_sw->p_node = p_node;
   p_sw->switch_info = *p_si;
   p_sw->num_ports = num_ports;
+  p_sw->need_update = 1;
 
   status = osm_fwd_tbl_init( &p_sw->fwd_tbl, p_si );
   if( status != IB_SUCCESS )
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index b44a3ba..a8fc649 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -811,7 +811,8 @@ osm_ucast_mgr_set_fwd_table(
        osm_switch_get_fwd_tbl_block( p_sw, block_id_ho, block ) ;
        block_id_ho++ )
   {
-    if (!memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
+    if (!p_sw->need_update &&
+        !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64))
       continue;
 
     if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
-- 
1.5.3.rc2.29.gc4640f

From hal.rosenstock at gmail.com  Tue Jul 24 11:38:24 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 24 Jul 2007 14:38:24 -0400
Subject:
[ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> Message-ID: On 7/24/07, Eitan Zahavi wrote: > > *Hi Hal,* > ** > *The code to find "duplicated" GUIDs stem from real user cases where > flawed * > *burning procedure caused actual GUID duplications. There is nothing > "impossible". * > No one said impossible; just a violation of what globally unique (GU from GUID) really means. It's largely because vendors allowed users to program non volatile RAM for GUIDs rather than a real manufacturing process for this which guarantees uniqueness that we are even discussing this aspect of it. *So it is really critical the the SM will be able to recognize this case > and abort.* > I agree with the detect part but not the abort part. Why can't it report these errors and continue on ? That seems better to me than aborting. -- Hal > *It might be that for testing someone wants to use a loopback plug that > cause the same * > *port GUID appear on both sides of link - but it is better to require the > user doing the test * > *to set some flag than to miss such a situation in real life cluster.* > ** > *This requirement was written after many people wasted many hours trying > to figure out what was going on.* > *PLEASE DO NOT TAKE IT AWAY* > ** > > *Eitan Zahavi*** > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > ------------------------------ > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Tuesday, July 24, 2007 6:04 PM > *To:* Eitan Zahavi > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > *Sent:* Tuesday, July 24, 2007 5:53 PM > > *To:* Eitan Zahavi > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > Hi Eitan, > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > *Hi Hal,* > > > ** > > > *What is this "loopback" connector used for?* > > > *Does not seem to me like a very useful thing to do.* > > > > > ** > > Perhaps not but no reason OpenSM can't handle this more gracefully. > > > > *Anyway, if it is not a production environment we could add a "debug > > > mode" (-d flag option) to ignore this check.* > > > > > ** > > Why would a separate flag be needed ? > > *[EZ] Since I do not see any other solution for the SM to know it is > > really a loop back plug rather then two devices with same GUID connected > > back to back ... * > > > > > "Technically", this should only occur when looped back and not two devices > with same GUID as GUID == globally unique and a duplication indicates a > "manufacturing" issue. > > Anyhow, can't these be treated the same (and handled more gracefully) > without an additional option/flag ? > > -- Hal > > > > -- Hal > > > > ** > > > > > > *Eitan Zahavi*** > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > ------------------------------ > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > *Sent: *Tuesday, July 24, 2007 5:31 PM > > > *To:* OpenFabrics General > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > Hi, > > > > > > This is what starts off as a "minor" issue and I know it has been > > > discussed it somewhat in the past: > > > > > > Putting a loopback connector on a (switch) link causes OpenSM to > > > indicate duplicated GUID error 0D18 as follows: > > > > > > __osm_ni_rcv_set_links > > > { > > > ... > > > /* > > > When there are only two nodes with exact same guids > > > (connected back > > > to back) - the previous check for duplicated guid will > > > not catch > > > them. But the link will be from the port to itself... > > > Enhanced Port 0 is an exception to this > > > */ > > > if ((osm_node_get_node_guid( p_node ) == > > > p_ni_context->node_guid) && > > > (port_num == p_ni_context->port_num) && > > > (port_num != 0)) > > > { > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > "__osm_ni_rcv_set_links: ERR 0D18: " > > > "Duplicate GUID found by link from a port to > > > itself:" > > > "node 0x%" PRIx64 ", port number 0x%X\n", > > > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > > > port_num ); > > > ... > > > > > > So this occurs over and over and over and fills the log with the same > > > spew. This should be improved IMO. > > > > > > Is this really a fatal condition ? Doesn't seem like it should be to > > > me. > > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that > > > safe for this condition ? > > > > > > Seems like something like an extra loopback bit should be added to > > > some port structure which should cause these links to be ignored. This bit > > > would then be reset when the peer is now longer itself. 
> > > > > > Also, is there a relationship of this with the 12x/duplicated GUID > > > code ? > > > > > > Thanks. > > > > > > -- Hal > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Tue Jul 24 11:39:29 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 24 Jul 2007 21:39:29 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> Hi Hal, For many users such a critical failure (one the SM can not really do anything with) is better aborted then forgotten in some log file. Anyway's the -y flag lets you ignore it if you like. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 9:38 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: Hi Hal, The code to find "duplicated" GUIDs stem from real user cases where flawed burning procedure caused actual GUID duplications. There is nothing "impossible". No one said impossible; just a violation of what globally unique (GU from GUID) really means. It's largely because vendors allowed users to program non volatile RAM for GUIDs rather than a real manufacturing process for this which guarantees uniqueness that we are even discussing this aspect of it. So it is really critical the the SM will be able to recognize this case and abort. I agree with the detect part but not the abort part. 
Why can't it report these errors and continue on ? That seems better to me than aborting. -- Hal It might be that for testing someone wants to use a loopback plug that cause the same port GUID appear on both sides of link - but it is better to require the user doing the test to set some flag than to miss such a situation in real life cluster. This requirement was written after many people wasted many hours trying to figure out what was going on. PLEASE DO NOT TAKE IT AWAY Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 6:04 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 5:53 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback Hi Eitan, On 7/24/07, Eitan Zahavi wrote: Hi Hal, What is this "loopback" connector used for? Does not seem to me like a very useful thing to do. Perhaps not but no reason OpenSM can't handle this more gracefully. Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check. Why would a separate flag be needed ? [EZ] Since I do not see any other solution for the SM to know it is really a loop back plug rather then two devices with same GUID connected back to back ... "Technically", this should only occur when looped back and not two devices with same GUID as GUID == globally unique and a duplication indicates a "manufacturing" issue. Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag ? 
-- Hal -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:31 PM To: OpenFabrics General Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik Subject: OpenSM detection of duplicated GUIDs on loopback Hi, This is what starts off as a "minor" issue and I know it has been discussed it somewhat in the past: Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows: __osm_ni_rcv_set_links { ... /* When there are only two nodes with exact same guids (connected back to back) - the previous check for duplicated guid will not catch them. But the link will be from the port to itself... Enhanced Port 0 is an exception to this */ if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) && (port_num == p_ni_context->port_num) && (port_num != 0)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D18: " "Duplicate GUID found by link from a port to itself:" "node 0x%" PRIx64 ", port number 0x%X\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); ... So this occurs over and over and over and fills the log with the same spew. This should be improved IMO. Is this really a fatal condition ? Doesn't seem like it should be to me. Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ? Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is now longer itself. Also, is there a relationship of this with the 12x/duplicated GUID code ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Tue Jul 24 11:55:35 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 24 Jul 2007 14:55:35 -0400 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> Message-ID: On 7/24/07, Eitan Zahavi wrote: > *Hi Hal,* > ** > *For many users such a critical failure (one the SM can not really do > anything with) is better aborted then forgotten in some log file.* > *Anyway's the -y flag lets you ignore it if you like.* > So everything else continues to work fine with -y ? In which case, I'm not sure which is the better default. Users certainly won't like their logs filling up with continuous duplicated GUID messages. The log spew should be cleaned up IMO. -- Hal > *Eitan Zahavi*** > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > ------------------------------ > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Tuesday, July 24, 2007 9:38 PM > *To:* Eitan Zahavi > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > *Hi Hal,* > > ** > > *The code to find "duplicated" GUIDs stem from real user cases where > > flawed * > > *burning procedure caused actual GUID duplications. There is nothing > > "impossible". * > > > > No one said impossible; just a violation of what globally unique (GU from > GUID) really means. 
It's largely because vendors allowed users to program > non volatile RAM for GUIDs rather than a real manufacturing process for this > which guarantees uniqueness that we are even discussing this aspect of it. > > *So it is really critical the the SM will be able to recognize this case > > and abort.* > > > > I agree with the detect part but not the abort part. Why can't it report > these errors and continue on ? That seems better to me than aborting. > > -- Hal > > > > *It might be that for testing someone wants to use a loopback plug that > > cause the same * > > *port GUID appear on both sides of link - but it is better to require > > the user doing the test * > > *to set some flag than to miss such a situation in real life cluster.* > > ** > > *This requirement was written after many people wasted many hours trying > > to figure out what was going on.* > > *PLEASE DO NOT TAKE IT AWAY* > > ** > > > > *Eitan Zahavi*** > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > ------------------------------ > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > *Sent:* Tuesday, July 24, 2007 6:04 PM > > *To:* Eitan Zahavi > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > *Sent:* Tuesday, July 24, 2007 5:53 PM > > > *To:* Eitan Zahavi > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > Hi Eitan, > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > *Hi Hal,* > > > > ** > > > > *What is this "loopback" connector used for?* > > > > *Does not seem to me like a very useful thing to do.* > > > > > > > ** > > > Perhaps not but no reason OpenSM can't handle this more gracefully. > > > > > > *Anyway, if it is not a production environment we could add a "debug > > > > mode" (-d flag option) to ignore this check.* > > > > > > > ** > > > Why would a separate flag be needed ? > > > *[EZ] Since I do not see any other solution for the SM to know it is > > > really a loop back plug rather then two devices with same GUID connected > > > back to back ... * > > > > > > > > "Technically", this should only occur when looped back and not two > > devices with same GUID as GUID == globally unique and a duplication > > indicates a "manufacturing" issue. > > > > Anyhow, can't these be treated the same (and handled more gracefully) > > without an additional option/flag ? > > > > -- Hal > > > > > > > -- Hal > > > > > > ** > > > > > > > > *Eitan Zahavi*** > > > > Senior Engineering Director, Software Architect > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > ------------------------------ > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > > *Sent: *Tuesday, July 24, 2007 5:31 PM > > > > *To:* OpenFabrics General > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > Hi, > > > > > > > > This is what starts off as a "minor" issue and I know it has been > > > > discussed it somewhat in the past: > > > > > > > > Putting a loopback connector on a (switch) link causes OpenSM to > > > > indicate duplicated GUID error 0D18 as follows: > > > > > > > > __osm_ni_rcv_set_links > > > > { > > > > ... > > > > /* > > > > When there are only two nodes with exact same guids > > > > (connected back > > > > to back) - the previous check for duplicated guid will > > > > not catch > > > > them. But the link will be from the port to itself... > > > > Enhanced Port 0 is an exception to this > > > > */ > > > > if ((osm_node_get_node_guid( p_node ) == > > > > p_ni_context->node_guid) && > > > > (port_num == p_ni_context->port_num) && > > > > (port_num != 0)) > > > > { > > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > > "__osm_ni_rcv_set_links: ERR 0D18: " > > > > "Duplicate GUID found by link from a port to > > > > itself:" > > > > "node 0x%" PRIx64 ", port number 0x%X\n", > > > > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > > > > port_num ); > > > > ... > > > > > > > > So this occurs over and over and over and fills the log with the > > > > same spew. This should be improved IMO. > > > > > > > > Is this really a fatal condition ? Doesn't seem like it should be to > > > > me. > > > > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that > > > > safe for this condition ? > > > > > > > > Seems like something like an extra loopback bit should be added to > > > > some port structure which should cause these links to be ignored. 
This bit > > > > would then be reset when the peer is now longer itself. > > > > > > > > Also, is there a relationship of this with the 12x/duplicated GUID > > > > code ? > > > > > > > > Thanks. > > > > > > > > -- Hal > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Tue Jul 24 12:05:02 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 22:05:02 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <46A64134.5080502@ichips.intel.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <46A64134.5080502@ichips.intel.com> Message-ID: <20070724190502.GA29012@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > >Here's a short list off the top of my head > > > >- A single git pull merges any number of backport changes > >- A single git reset ORIG_HEAD recovers from a conflicting merge > >- A single tag tags all code for all kernels > >- On update from upstream, if there is a conflict > > between upstream code and and a patch > > it's easy to temporarily remote the patch, complete the merge, > > and go bugger the patch author > >- For recent kernels there are almost no patches. > > So an update from upstream for these kernels is free, > > with branches I will still need to update all branches. > >- Adding a fix which only affects common code > > is currently straight-forward: make a change, commit. > > With multiple branches every fix must be pulled into > > all branches. 
> > You seem to be overlooking the fact that you already require a script to > check that things work for all kernels. Until you apply a series of > patches to form a particular kernel, you don't know if a change that you > pulled in caused a conflict. You still have the requirement to verify > the fix on all kernels, and it still requires running a script that > pushes/pops patches to create each tree. Yes. But I find it preferable to manage history with full power of native git tools, where a single hash identifies a revision, and limit the scope of the scripts to the build process. This, as opposed to an elaborate methodology that is based on naming conventions, and requires use of scripts to do basic tasks such as tagging, history rewriting, etc. -- MST From mst at dev.mellanox.co.il Tue Jul 24 12:06:56 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 24 Jul 2007 22:06:56 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <46A63DA2.5000602@ichips.intel.com> References: <20070724171445.GE24797@mellanox.co.il> <000101c7ce16$d73bc790$ff0da8c0@amr.corp.intel.com> <20070724174255.GH24797@mellanox.co.il> <46A63DA2.5000602@ichips.intel.com> Message-ID: <20070724190656.GB29012@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits > > >But the proposal here was to have a bush of branches, all of which > >need to be merged at the same time. It's possible that some > >would merge and some would fail, leaving me in an inconsistent state, > >and no easy way to get back to where I started. > > A fix could be applied to some kernels, but not others. In fact, if a > patch works for kernel X & Y, but has a conflict with kernel Z, then > different patches are needed anyway. I don't see the requirement to > merge everything or even apply a fix to all kernels at the same time. This is typically component maintainer's job, not integrator's. 
As an integrator, I want to pull but if the merge fails, reset everything back to the original state, and let the maintainer know. -- MST From hadi at cyberus.ca Tue Jul 24 12:28:20 2007 From: hadi at cyberus.ca (jamal) Date: Tue, 24 Jul 2007 15:28:20 -0400 Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API In-Reply-To: References: Message-ID: <1185305300.26013.152.camel@localhost> KK, On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote: > > J Hadi Salim wrote on 07/23/2007 06:02:01 PM: > Actually you have not sent netperf results with prep and without prep. My results were based on pktgen (which i explained as testing the driver). I think depending on netperf without further analysis is simplistic. It was like me doing forwarding tests on these patches. > > So _which_ non-LLTX driver doesnt do that? ;-> > > I have no idea since I haven't looked at all drivers. Can you tell which > all non-LLTX drivers does that ? I stated this as the sole criterea. The few i have peeked at all do it. I also think the e1000 should be converted to be non-LLTX. The rest of netdev is screaming to kill LLTX. > > tun driver doesnt use it either - but i doubt that makes it "bloat" > > Adding extra code that is currently not usable (esp from a submission > point) is bloat. So far i have converted 3 drivers, 1 of them doesnt use it. Two more driver conversions are on the way, they will both use it. How is this bloat again? A few emails back you said if only IPOIB can use batching then thats good enough justification. > > You waltz in, have the luxury of looking at my code, presentations, many > > discussions with me etc ... > > "luxury" ? > I had implemented the entire thing even before knowing that you > are working on something similar! and I had sent the first proposal to > netdev, I saw your patch at the end of may (or at least 2 weeks after you said it existed). That patch has very little resemblance to what you just posted conceptwise or codewise. 
I could post it if you would give me permission. > *after* which you told that you have your own code and presentations (which > I had never seen earlier - I joined netdev a few months back, earlier I was > working on RDMA, Infiniband as you know). I am gonna assume you didnt know of my work - which i have been making public for about 3 years. Infact i talked about this topic when i visited your office in 2006 on a day you were not present, so it is plausible you didnt hear of it. > And it didn't give me any great > ideas either, remember I had posted results for E1000 at the time of > sending the proposals. In mid-June you sent me a series of patches which included anything from changing variable names to combining qdisc_restart and about everything i referred to as being "cosmetic differences" in your posted patches. I took two of those and incorporated them in. One was an "XXX" in my code already to allocate the dev->blist (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). The other one was a mechanical removal of the blist being passed (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). Some of the others i asked you to defer. For example, the reason i gave you for not merging any qdisc_restart_combine changes is because i was waiting for Dave to swallow the qdisc_restart changes i made; otherwise maintainance becomes extremely painful for me. Sridhar actually provided a lot more valuable comments and fixes but has not planted a flag on behalf of the queen of spain like you did. > However I do give credit in my proposal to you for what > ideas that your provided (without actual code), and the same I did for other > people who did the same, like Dave, Sridhar. BTW, you too had discussions with me, > and I sent some patches to improve your code too, I incorporated two of your patches and asked for deferal of others. These patches have now shown up in what you claim as "the difference". 
I just call them "cosmetic difference" not to downplay the importance of having an ethtool interface but because they do not make batching perform any better. The real differences are those two items. I am suprised you havent cannibalized those changes as well. I thought you renamed them to something else; according to your posting: "This patch will work with drivers updated by Jamal, Matt & Michael Chan with minor modifications - rename xmit_win to xmit_slots & rename batch handler". Or maybe thats a "future plan" you have in mind? > so it looks like a two > way street to me (and that is how open source works and should). Open source is a lot more transparent than that. You posted a question, which was part of your research. I responded and told you i have patches; you asked me for them and i promptly ported them from pre-2.6.18 to the latest kernel at the time. The nature of this batching work is one of performance. So numbers are important. If you had some strong disagreements on something in the architecture, then it would be of great value to explain it in a technical detail - and more importantly to provide some numbers to say why it is a bad idea. You get numbers by running some tests. You did none of the above. Your effort has been to produce "your patch" for whatever reasons. This would not have been problematic to me if it actually was based within reasons of optimization because the end goal would have been achieved. I have deleted the rest of the email because it goes back and forth on the same points. I am gonna continue work on the current tree i have. I will put more time when i get back next week (and hopefully no travel right after). I will upgrade to Daves tree later when i get the two new drivers in. I am probably gonna hold on until the new NAPI stuff settles in first. You are welcome to submit the ipoib changes in. You are also welcome to co-author with me but you will have to work for it this time. 
cheers,
jamal

From tom at opengridcomputing.com Tue Jul 24 12:31:52 2007
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 24 Jul 2007 14:31:52 -0500
Subject: [ofa-general] [PATCH] amso1100: QP init bug in amso driver
Message-ID: <1185305512.20489.6.camel@trinity.ogc.int>

Roland:

The guys at UNH found this and fixed it. I'm surprised no one has hit this before. I guess it only breaks when the refcount on the QP is non-zero.

Initialize the wait_queue_head_t in the c2_qp structure.

Signed-off-by: Ethan Burns
Acked-by: Tom Tucker
---
 drivers/infiniband/hw/amso1100/c2_qp.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
   qp->send_sgl_depth = qp_attrs->cap.max_send_sge;
   qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge;
   qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge;
+  init_waitqueue_head(&qp->wait);
 
   /* Initialize the SQ MQ */
   q_size = be32_to_cpu(reply->sq_depth);

From eitan at mellanox.co.il Tue Jul 24 13:20:37 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 24 Jul 2007 23:20:37 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>

Maybe avoid the log if -y is provided?

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 9:56 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: Hi Hal, For many users such a critical failure (one the SM can not really do anything with) is better aborted then forgotten in some log file. Anyway's the -y flag lets you ignore it if you like. So everything else continues to work fine with -y ? In which case, I'm not sure which is the better default. Users certainly won't like their logs filling up with continuous duplicated GUID messages. The log spew should be cleaned up IMO. -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 9:38 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: Hi Hal, The code to find "duplicated" GUIDs stem from real user cases where flawed burning procedure caused actual GUID duplications. There is nothing "impossible". No one said impossible; just a violation of what globally unique (GU from GUID) really means. It's largely because vendors allowed users to program non volatile RAM for GUIDs rather than a real manufacturing process for this which guarantees uniqueness that we are even discussing this aspect of it. So it is really critical the the SM will be able to recognize this case and abort. I agree with the detect part but not the abort part. Why can't it report these errors and continue on ? That seems better to me than aborting. 
-- Hal It might be that for testing someone wants to use a loopback plug that cause the same port GUID appear on both sides of link - but it is better to require the user doing the test to set some flag than to miss such a situation in real life cluster. This requirement was written after many people wasted many hours trying to figure out what was going on. PLEASE DO NOT TAKE IT AWAY Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 6:04 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 5:53 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback Hi Eitan, On 7/24/07, Eitan Zahavi wrote: Hi Hal, What is this "loopback" connector used for? Does not seem to me like a very useful thing to do. Perhaps not but no reason OpenSM can't handle this more gracefully. Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check. Why would a separate flag be needed ? [EZ] Since I do not see any other solution for the SM to know it is really a loop back plug rather then two devices with same GUID connected back to back ... "Technically", this should only occur when looped back and not two devices with same GUID as GUID == globally unique and a duplication indicates a "manufacturing" issue. Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag ? 
-- Hal -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:31 PM To: OpenFabrics General Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik Subject: OpenSM detection of duplicated GUIDs on loopback Hi, This is what starts off as a "minor" issue and I know it has been discussed it somewhat in the past: Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows: __osm_ni_rcv_set_links { ... /* When there are only two nodes with exact same guids (connected back to back) - the previous check for duplicated guid will not catch them. But the link will be from the port to itself... Enhanced Port 0 is an exception to this */ if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) && (port_num == p_ni_context->port_num) && (port_num != 0)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D18: " "Duplicate GUID found by link from a port to itself:" "node 0x%" PRIx64 ", port number 0x%X\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); ... So this occurs over and over and over and fills the log with the same spew. This should be improved IMO. Is this really a fatal condition ? Doesn't seem like it should be to me. Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ? Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is now longer itself. Also, is there a relationship of this with the 12x/duplicated GUID code ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
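
The "loopback bit" idea from the original post can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct port`, `looped_back` and `port_update_peer()` are hypothetical names, not OpenSM's `osm_physp_t` or any existing OpenSM function. The port-to-itself test mirrors the condition in the quoted `__osm_ni_rcv_set_links` snippet, including the port 0 exception.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified port record; NOT OpenSM's osm_physp_t. */
struct port {
    uint64_t node_guid;
    uint8_t  port_num;
    bool     looped_back;   /* the proposed "loopback bit" */
};

/*
 * Record what a NodeInfo response says about the peer of port p.
 * Returns true if the link should be added to the topology, false if
 * it is a port-to-itself link that should be ignored.
 */
bool port_update_peer(struct port *p, uint64_t peer_guid, uint8_t peer_port)
{
    /* Same test as the quoted check: same node GUID, same port, not port 0. */
    bool self = peer_guid == p->node_guid &&
                peer_port == p->port_num &&
                p->port_num != 0;

    if (self) {
        if (!p->looped_back) {
            /* Report only on the transition into the looped-back state. */
            fprintf(stderr,
                    "link from a port to itself: node 0x%016" PRIx64
                    ", port %u\n", p->node_guid, (unsigned)p->port_num);
            p->looped_back = true;
        }
        return false;
    }

    /* The peer is no longer the port itself: reset the bit. */
    p->looped_back = false;
    return true;
}
```

As noted in the thread, a check like this cannot tell a loopback plug apart from two devices with the same GUID connected back to back on the same port number, so it would address the log spew but not that ambiguity.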
URL: From hal.rosenstock at gmail.com Tue Jul 24 13:25:46 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 24 Jul 2007 16:25:46 -0400 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> Message-ID: On 7/24/07, Eitan Zahavi wrote: > > *Maybe avoid the log if -y is provided?* > That avoids the spew but the duplicated GUID is important to know so IMO something in the "middle" is needed where duplicated GUIDs are logged but not continually the same ones. *Eitan Zahavi*** > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > ------------------------------ > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > *Sent:* Tuesday, July 24, 2007 9:56 PM > *To:* Eitan Zahavi > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > On 7/24/07, Eitan Zahavi wrote: > > > *Hi Hal,* > > ** > > *For many users such a critical failure (one the SM can not really do > > anything with) is better aborted then forgotten in some log file.* > > *Anyway's the -y flag lets you ignore it if you like.* > > > > So everything else continues to work fine with -y ? In which case, I'm not > sure which is the better default. > > Users certainly won't like their logs filling up with continuous > duplicated GUID messages. The log spew should be cleaned up IMO. 
> > -- Hal > > > > > > > *Eitan Zahavi*** > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > ------------------------------ > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > *Sent:* Tuesday, July 24, 2007 9:38 PM > > *To:* Eitan Zahavi > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > *Hi Hal,* > > > ** > > > *The code to find "duplicated" GUIDs stem from real user cases where > > > flawed * > > > *burning procedure caused actual GUID duplications. There is nothing > > > "impossible". * > > > > > > > No one said impossible; just a violation of what globally unique (GU > > from GUID) really means. It's largely because vendors allowed users to > > program non volatile RAM for GUIDs rather than a real manufacturing process > > for this which guarantees uniqueness that we are even discussing this aspect > > of it. > > > > *So it is really critical the the SM will be able to recognize this > > > case and abort.* > > > > > > > I agree with the detect part but not the abort part. Why can't it report > > these errors and continue on ? That seems better to me than aborting. 
> > > > -- Hal > > > > > > > *It might be that for testing someone wants to use a loopback plug > > > that cause the same * > > > *port GUID appear on both sides of link - but it is better to require > > > the user doing the test * > > > *to set some flag than to miss such a situation in real life cluster.* > > > ** > > > *This requirement was written after many people wasted many hours > > > trying to figure out what was going on.* > > > *PLEASE DO NOT TAKE IT AWAY* > > > ** > > > > > > *Eitan Zahavi*** > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > ------------------------------ > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > *Sent:* Tuesday, July 24, 2007 6:04 PM > > > *To:* Eitan Zahavi > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > > *Sent:* Tuesday, July 24, 2007 5:53 PM > > > > *To:* Eitan Zahavi > > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > > Hi Eitan, > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > > > *Hi Hal,* > > > > > ** > > > > > *What is this "loopback" connector used for?* > > > > > *Does not seem to me like a very useful thing to do.* > > > > > > > > > ** > > > > Perhaps not but no reason OpenSM can't handle this more gracefully. > > > > > > > > *Anyway, if it is not a production environment we could add a > > > > > "debug mode" (-d flag option) to ignore this check.* > > > > > > > > > ** > > > > Why would a separate flag be needed ? 
> > > > *[EZ] Since I do not see any other solution for the SM to know it > > > > is really a loop back plug rather then two devices with same GUID connected > > > > back to back ... * > > > > > > > > > > > "Technically", this should only occur when looped back and not two > > > devices with same GUID as GUID == globally unique and a duplication > > > indicates a "manufacturing" issue. > > > > > > Anyhow, can't these be treated the same (and handled more gracefully) > > > without an additional option/flag ? > > > > > > -- Hal > > > > > > > > > > -- Hal > > > > > > > > ** > > > > > > > > > > *Eitan Zahavi*** > > > > > Senior Engineering Director, Software Architect > > > > > Mellanox Technologies LTD > > > > > Tel:+972-4-9097208 > > > > > Fax:+972-4-9593245 > > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > ------------------------------ > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > > > *Sent: *Tuesday, July 24, 2007 5:31 PM > > > > > *To:* OpenFabrics General > > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > Hi, > > > > > > > > > > This is what starts off as a "minor" issue and I know it has been > > > > > discussed it somewhat in the past: > > > > > > > > > > Putting a loopback connector on a (switch) link causes OpenSM to > > > > > indicate duplicated GUID error 0D18 as follows: > > > > > > > > > > __osm_ni_rcv_set_links > > > > > { > > > > > ... > > > > > /* > > > > > When there are only two nodes with exact same guids > > > > > (connected back > > > > > to back) - the previous check for duplicated guid > > > > > will not catch > > > > > them. But the link will be from the port to itself... 
> > > > > Enhanced Port 0 is an exception to this > > > > > */ > > > > > if ((osm_node_get_node_guid( p_node ) == > > > > > p_ni_context->node_guid) && > > > > > (port_num == p_ni_context->port_num) && > > > > > (port_num != 0)) > > > > > { > > > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > > > "__osm_ni_rcv_set_links: ERR 0D18: " > > > > > "Duplicate GUID found by link from a port to > > > > > itself:" > > > > > "node 0x%" PRIx64 ", port number 0x%X\n", > > > > > cl_ntoh64( osm_node_get_node_guid( p_node ) > > > > > ), > > > > > port_num ); > > > > > ... > > > > > > > > > > So this occurs over and over and over and fills the log with the > > > > > same spew. This should be improved IMO. > > > > > > > > > > Is this really a fatal condition ? Doesn't seem like it should be > > > > > to me. > > > > > > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is > > > > > that safe for this condition ? > > > > > > > > > > Seems like something like an extra loopback bit should be added to > > > > > some port structure which should cause these links to be ignored. This bit > > > > > would then be reset when the peer is now longer itself. > > > > > > > > > > Also, is there a relationship of this with the 12x/duplicated GUID > > > > > code ? > > > > > > > > > > Thanks. > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eitan at mellanox.co.il Tue Jul 24 13:25:32 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 24 Jul 2007 23:25:32 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901ED65F2@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> On 7/24/07, Eitan Zahavi wrote: Maybe avoid the log if -y is provided? That avoids the spew but the duplicated GUID is important to know so IMO something in the "middle" is needed where duplicated GUIDs are logged but not continually the same ones. [EZ] OK so in -y mode only we track which ones were reported and do not repeat the log? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 9:56 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: Hi Hal, For many users such a critical failure (one the SM can not really do anything with) is better aborted then forgotten in some log file. Anyway's the -y flag lets you ignore it if you like. So everything else continues to work fine with -y ? In which case, I'm not sure which is the better default. Users certainly won't like their logs filling up with continuous duplicated GUID messages. The log spew should be cleaned up IMO. -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 9:38 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: Hi Hal, The code to find "duplicated" GUIDs stem from real user cases where flawed burning procedure caused actual GUID duplications. There is nothing "impossible". No one said impossible; just a violation of what globally unique (GU from GUID) really means. It's largely because vendors allowed users to program non volatile RAM for GUIDs rather than a real manufacturing process for this which guarantees uniqueness that we are even discussing this aspect of it. So it is really critical the the SM will be able to recognize this case and abort. I agree with the detect part but not the abort part. Why can't it report these errors and continue on ? That seems better to me than aborting. -- Hal It might be that for testing someone wants to use a loopback plug that cause the same port GUID appear on both sides of link - but it is better to require the user doing the test to set some flag than to miss such a situation in real life cluster. This requirement was written after many people wasted many hours trying to figure out what was going on. PLEASE DO NOT TAKE IT AWAY Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 6:04 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback On 7/24/07, Eitan Zahavi wrote: From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] Sent: Tuesday, July 24, 2007 5:53 PM To: Eitan Zahavi Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik Subject: Re: OpenSM detection of duplicated GUIDs on loopback Hi Eitan, On 7/24/07, Eitan Zahavi wrote: Hi Hal, What is this "loopback" connector used for? Does not seem to me like a very useful thing to do. Perhaps not but no reason OpenSM can't handle this more gracefully. Anyway, if it is not a production environment we could add a "debug mode" (-d flag option) to ignore this check. Why would a separate flag be needed ? [EZ] Since I do not see any other solution for the SM to know it is really a loop back plug rather then two devices with same GUID connected back to back ... "Technically", this should only occur when looped back and not two devices with same GUID as GUID == globally unique and a duplication indicates a "manufacturing" issue. Anyhow, can't these be treated the same (and handled more gracefully) without an additional option/flag ? -- Hal -- Hal Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Tuesday, July 24, 2007 5:31 PM To: OpenFabrics General Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik Subject: OpenSM detection of duplicated GUIDs on loopback Hi, This is what starts off as a "minor" issue and I know it has been discussed it somewhat in the past: Putting a loopback connector on a (switch) link causes OpenSM to indicate duplicated GUID error 0D18 as follows: __osm_ni_rcv_set_links { ... /* When there are only two nodes with exact same guids (connected back to back) - the previous check for duplicated guid will not catch them. But the link will be from the port to itself... Enhanced Port 0 is an exception to this */ if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) && (port_num == p_ni_context->port_num) && (port_num != 0)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D18: " "Duplicate GUID found by link from a port to itself:" "node 0x%" PRIx64 ", port number 0x%X\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); ... So this occurs over and over and over and fills the log with the same spew. This should be improved IMO. Is this really a fatal condition ? Doesn't seem like it should be to me. Also, OpenSM can "ride" this out with -y (stay on fatal) but is that safe for this condition ? Seems like something like an extra loopback bit should be added to some port structure which should cause these links to be ignored. This bit would then be reset when the peer is now longer itself. Also, is there a relationship of this with the 12x/duplicated GUID code ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
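
The "track which ones were reported and do not repeat the log" suggestion could look roughly like the sketch below. It assumes a small fixed-size table keyed by (GUID, port number); `struct dup_log`, `dup_log_should_report()` and `dup_log_clear()` are hypothetical names for illustration and do not exist in OpenSM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REPORTED 64   /* arbitrary bound chosen for this sketch */

struct dup_key {
    uint64_t guid;
    uint8_t  port_num;
};

/* Set of (GUID, port) pairs that have already been logged. */
struct dup_log {
    struct dup_key seen[MAX_REPORTED];
    size_t count;
};

/*
 * Returns true the first time a (guid, port_num) pair is seen, meaning
 * the caller should write the ERR 0D18 style message; returns false on
 * repeats. If the table is full, new pairs are always reported.
 */
bool dup_log_should_report(struct dup_log *l, uint64_t guid, uint8_t port_num)
{
    for (size_t i = 0; i < l->count; i++)
        if (l->seen[i].guid == guid && l->seen[i].port_num == port_num)
            return false;

    if (l->count < MAX_REPORTED) {
        l->seen[l->count].guid = guid;
        l->seen[l->count].port_num = port_num;
        l->count++;
    }
    return true;
}

/*
 * Forget a pair once the condition has cleared (for example the cable
 * was moved), so that a later recurrence is logged again.
 */
void dup_log_clear(struct dup_log *l, uint64_t guid, uint8_t port_num)
{
    for (size_t i = 0; i < l->count; i++) {
        if (l->seen[i].guid == guid && l->seen[i].port_num == port_num) {
            l->seen[i] = l->seen[--l->count];   /* swap-remove */
            return;
        }
    }
}
```

The error path would call `dup_log_should_report()` before logging and `dup_log_clear()` when a sweep no longer sees the duplicate. Whether this applies only in -y mode or always is exactly the open question in the thread.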

From sashak at voltaire.com  Tue Jul 24 14:54:41 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 00:54:41 +0300
Subject: [ofa-general] [PATCH] opensm: detect port external reset and flush cached tables
In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com>
 <20070722102209.GR16597@sashak.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com>
 <20070722174048.GO27878@sashak.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com>
 <20070724005153.GD11674@sashak.voltaire.com>
 <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com>
 <20070724170432.GZ27878@sashak.voltaire.com>
Message-ID: <20070724215441.GA25264@sashak.voltaire.com>

This detects port external reset by validating PortState == INIT, and
when detected flushes cached port related tables - re-reads pkey table
and drops (overwrites) SL2VL and VLArb tables.

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/opensm/osm_port.h  |    5 +++++
 opensm/opensm/osm_port.c          |    1 +
 opensm/opensm/osm_port_info_rcv.c |    9 ++++++++-
 opensm/opensm/osm_qos.c           |    9 +++++----
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_port.h b/opensm/include/opensm/osm_port.h
index f6c40c7..44323ab 100644
--- a/opensm/include/opensm/osm_port.h
+++ b/opensm/include/opensm/osm_port.h
@@ -118,6 +118,7 @@ typedef struct _osm_physp
        struct _osm_physp *p_remote_physp;
        boolean_t healthy;
        uint8_t vl_high_limit;
+       unsigned need_update;
        osm_dr_path_t dr_path;
        osm_pkey_tbl_t pkeys;
        ib_vl_arb_table_t vl_arb[4];
@@ -157,6 +158,10 @@ typedef struct _osm_physp
 *              PortInfo:VLHighLimit value which installed by QoS manager
 *              and should be uploaded to port's PortInfo
 *
+*      need_update
+*              When set indicates that port was probably reset and port
+*              related tables (PKey, SL2VL, VLArb) require refreshing.
+*
 *      dr_path
 *              The directed route path to this port.
 *
diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index e03e316..11cc5ca 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -118,6 +118,7 @@ osm_physp_init(
        p_physp->port_guid = port_guid;
        p_physp->port_num = port_num;
        p_physp->healthy = TRUE;
+       p_physp->need_update = 2;
        p_physp->p_node = (struct _osm_node*)p_node;

        osm_dr_path_init(
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 6fe2d1d..0528e38 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -801,6 +801,12 @@ osm_pi_rcv_process(
                        p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
                }

+               /* if port just inited or reached INIT state (external reset)
+                  request update for port related tables */
+               p_physp->need_update =
+                       (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
+                        p_physp->need_update > 1 ) ? 1 : 0;
+
                switch( osm_node_get_type( p_node ) )
                {
                case IB_NODE_TYPE_CA:
@@ -824,7 +830,8 @@ osm_pi_rcv_process(
                /*
                  Get the tables on the physp.
                */
-               __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
+               if (p_physp->need_update)
+                       __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp );
        }
diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index 17b7e3a..596b6d4 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -87,8 +87,9 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req,
        for (i = 0; i < block_length; i++)
                block.vl_entry[i].vl &= vl_mask;

-       if (!memcmp(&p->vl_arb[block_num], &block,
-                   block_length * sizeof(block.vl_entry[0])))
+       if (!p->need_update &&
+           !memcmp(&p->vl_arb[block_num], &block,
+                   block_length * sizeof(block.vl_entry[0])))
                return IB_SUCCESS;

        context.vla_context.node_guid =
@@ -170,8 +171,8 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req,
                tbl.raw_vl_by_sl[i] = (vl1 << 4 ) | vl2 ;
        }

-       p_tbl = osm_physp_get_slvl_tbl(p, in_port);
-       if (p_tbl && !memcmp(p_tbl, &tbl, sizeof(tbl)))
+       if (!p->need_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) &&
+           !memcmp(p_tbl, &tbl, sizeof(tbl)))
                return IB_SUCCESS;

        context.slvl_context.node_guid = osm_node_get_node_guid(p_node);
--
1.5.3.rc2.29.gc4640f

From arthur.jones at qlogic.com  Tue Jul 24 15:19:50 2007
From: arthur.jones at qlogic.com (Arthur Jones)
Date: Tue, 24 Jul 2007 15:19:50 -0700
Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits
In-Reply-To: <20070724175220.GI24797@mellanox.co.il>
References: <20070723200640.GA13117@bauxite.pathscale.com>
 <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com>
 <20070724161646.GA24797@mellanox.co.il>
 <20070724165032.GK16727@bauxite.pathscale.com>
 <20070724165550.GD24797@mellanox.co.il>
 <20070724170726.GL16727@bauxite.pathscale.com>
 <20070724171924.GG24797@mellanox.co.il>
 <20070724172826.GN16727@bauxite.pathscale.com>
 <20070724175220.GI24797@mellanox.co.il>
Message-ID: <20070724221950.GP16727@bauxite.pathscale.com>

hi michael, ...

On Tue, Jul 24, 2007 at 08:52:20PM +0300, Michael S.
Tsirkin wrote:
> > i'd _really_ like to see a list of the advantages of
> > patches over branches.  it's hard for me to know if
> > i'm just missing something if the case is not laid out...

thanks for the list...

> Here's a short list off the top of my head
>
> - A single git pull merges any number of backport changes

ok, you can run one command instead of a 4-line
script.  hmm, i guess you could say this is a very
slight advantage to using patches...

> - A single git reset ORIG_HEAD recovers from a conflicting merge

handling conflicts is a big part of a maintainer's job!
the _vast_ majority of the time i bet you already know
how to do the merge.  if you don't, then only the backport
branches which haven't merged yet are stuck and you can
pick up where you left off (which is how i do it now).
but if you're stuck in some strange intermediate state
with some patches pushed and some yet to push in the
configure script, i could see how you'd want to punt.
but, someone is doing this work, and that someone almost
certainly has a difficult time reproducing and developing
a stack of patches..

if, though, you must have a pristine environment, this
is easily solved by using an intermediate repo:

git clone -s

i bet this is very similar time-wise to running the
merge, then the ofed_scripts/configure over all
supported branches.  merges in git are _fast_...

> - A single tag tags all code for all kernels

store commit ids in a file and tag that?

> - On update from upstream, if there is a conflict
>   between upstream code and a patch
>   it's easy to temporarily remove the patch, complete the merge,
>   and go bugger the patch author

i think this is easier with the backport branches,
see git clone -s above.  or, just fixup the error.
the reason you have to bugger the author may be that
you don't have the tools necessary to actually fix
up the patch -- but you can prob bet the author
doesn't like to fixup patches in quilt any more than
you do...

> - For recent kernels there are almost no patches.
>   So an update from upstream for these kernels is free,
>   with branches I will still need to update all branches.

i can say from a couple months experience that upstream
merges are "free" using backport branches.  running the
script to reflow the branches is _far_ less complex than
the configure script, has fewer dependencies and is much
simpler to maintain and understand.  also, if the upstream
changes touch code that conflicts with a backport patch,
you get to fix the problem as it happens in a much more
comfortable environment (i.e. you don't need quilt)...

> - Adding a fix which only affects common code
>   is currently straight-forward: make a change, commit.
>   With multiple branches every fix must be pulled into
>   all branches.

this use case is actually a good reason to use
backport branches.  with the patches, you still need to
fan out the changes to all the backport branches.  but,
in general, you don't.  so you end up making a change and
_not realizing_ that it broke some random backport patch.
by reflowing after every change, you get to see it break
right there in front of you and you're way more likely
to know how to fix it.  you could do this with the build
script too, but that would require a 4 line script --
and you'd need to switch over to using quilt or some other
patch queue based system (yuck!)...

all your points above you made from the POV of the
maintainer.  but, what about the _users_ of the repo.
as long as changes are kept as patches, trying to figure
out what has changed with your latest round of backports
comes down to recreating a tree and pulling from that.
it's extremely fragile and error prone.  there is only
one maintainer, but many developers.  if we can make their
lives significantly easier then it should be a net gain...

the backport branches make merging upstream changes
easier.  they make merging developer changes easier.
they make finding and fixing backport conflicts easier.
they make viewing and navigating changes easier.  but,
you need to use very short scripts (which i'm happy to
create and maintain) to tag and pull -- doesn't seem
like much of a price to pay to me...

arthur

From mshefty at ichips.intel.com  Tue Jul 24 16:58:29 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 24 Jul 2007 16:58:29 -0700
Subject: [ofa-general] QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A54659.8010608@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
 <46A54659.8010608@ichips.intel.com>
Message-ID: <46A69225.9090502@ichips.intel.com>

Steve,

Do you have any input with respect to how the RDMA CM selects and maps
QoS (priority, traffic class, VLAN, flow label, etc.)?  (See below)

Hide the QoS selection under the current interface?
Use the IPv6 flowinfo field?
Rely on destination port?
Input QoS through existing or new call?
Handle IPv4 and IPv6 addresses differently?
???

- Sean

>> 2.6. ULPs and programs using CMA to establish RC connection should
>> provide the CMA the target IP and Service-ID. Some of the ULPs might
>> also provide QoS-Class (E.g. for SDP sockets that are provided the
>> TOS socket option). The CMA should then use the provided Service-ID
>> and optional QoS-Class and pass them in the PR/MPR request. The
>> resulting PR/MPR should be used for configuring the connection QP.
>
> The interface to the CMA needs to remain as transport independent as
> possible, and I am unsure of the transport independence of tying QoS to
> the destination port number.  (I'm not disagreeing; I'm just not sure at
> the moment it's the right approach.)
>
>> 5. CMA features
>> ----------------
>>
>> The CMA interface supports Service-ID through the notion of port
>> space as a prefix to the port_num which is part of the sockaddr
>> provided to rdma_resolve_addr(). What is missing is the explicit
>> request for a QoS-Class that should allow the ULP (like SDP) to
>> propagate a specific request for a class of service. A mechanism for
>> providing the QoS-Class is available in the IPv6 address, so we could
>> use that address field. Another option is to implement a special
>> connection options API for CMA.
>>
>> Missing functionality by CMA is the usage of the provided QoS-Class
>> and Service-ID in the sent PR/MPR. When a response is obtained it is
>> an existing requirement for the CMA to use the PR/MPR from the
>> response in setting up the QP address vector.
>
> The most natural function to specify additional QoS parameters would be
> rdma_resolve_route.

From sashak at voltaire.com  Tue Jul 24 17:18:48 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 03:18:48 +0300
Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com>
 <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com>
Message-ID: <20070725001847.GG25264@sashak.voltaire.com>

On 23:25 Tue 24 Jul     , Eitan Zahavi wrote:
>
> On 7/24/07, Eitan Zahavi wrote:
>
>               Maybe avoid the log if -y is provided?
>
>
>       That avoids the spew but the duplicated GUID is important to
> know so IMO something in the "middle" is needed where duplicated GUIDs
> are logged but not continually the same ones.
>       [EZ]
>       OK so in -y mode only we track which ones were reported and do
> not repeat the log?

And how should the port moving problem be solved? We cannot ask a user
to run OpenSM with '-y' if he or she plans to reconnect some ports in
the future, and then just decrease logging.
Sasha

From krkumar2 at in.ibm.com  Tue Jul 24 19:41:06 2007
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Wed, 25 Jul 2007 08:11:06 +0530
Subject: [ofa-general] Re: [PATCH 00/10] Implement batching skb API
In-Reply-To: <1185305300.26013.152.camel@localhost>
Message-ID:

Jamal,

This is silly. I am not responding to this type of presumptuous and
insulting mail.

Regards,

- KK

J Hadi Salim wrote on 07/25/2007 12:58:20 AM:

> KK,
>
> On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:
> >
> > J Hadi Salim wrote on 07/23/2007 06:02:01 PM:
> >
> > Actually you have not sent netperf results with prep and without prep.
>
> My results were based on pktgen (which i explained as testing the
> driver). I think depending on netperf without further analysis is
> simplistic. It was like me doing forwarding tests on these patches.
>
> > > So _which_ non-LLTX driver doesnt do that? ;->
> >
> > I have no idea since I haven't looked at all drivers. Can you tell which
> > all non-LLTX drivers does that ? I stated this as the sole criterion.
>
> The few i have peeked at all do it. I also think the e1000 should be
> converted to be non-LLTX. The rest of netdev is screaming to kill LLTX.
>
> > > tun driver doesnt use it either - but i doubt that makes it "bloat"
> >
> > Adding extra code that is currently not usable (esp from a submission
> > point) is bloat.
>
> So far i have converted 3 drivers, 1 of them doesnt use it. Two more
> driver conversions are on the way, they will both use it. How is this
> bloat again?
> A few emails back you said if only IPOIB can use batching then thats
> good enough justification.
>
> > > You waltz in, have the luxury of looking at my code, presentations, many
> > > discussions with me etc ...
> >
> > "luxury" ?
> > I had implemented the entire thing even before knowing that you
> > are working on something similar! and I had sent the first proposal to
> > netdev,
>
> I saw your patch at the end of may (or at least 2 weeks after you said
> it existed). That patch has very little resemblance to what you just
> posted conceptwise or codewise. I could post it if you would give me
> permission.
>
> > *after* which you told that you have your own code and presentations (which
> > I had never seen earlier - I joined netdev a few months back, earlier I was
> > working on RDMA, Infiniband as you know).
>
> I am gonna assume you didnt know of my work - which i have been making
> public for about 3 years. In fact i talked about this topic when i
> visited your office in 2006 on a day you were not present, so it is
> plausible you didnt hear of it.
>
> > And it didn't give me any great
> > ideas either, remember I had posted results for E1000 at the time of
> > sending the proposals.
>
> In mid-June you sent me a series of patches which included anything from
> changing variable names to combining qdisc_restart and about everything
> i referred to as being "cosmetic differences" in your posted patches. I
> took two of those and incorporated them in. One was an "XXX" in my code
> already to allocate the dev->blist
> (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63).
> The other one was a mechanical removal of the blist being passed
> (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757).
> Some of the others i asked you to defer. For example, the reason i gave
> you for not merging any qdisc_restart_combine changes is because i was
> waiting for Dave to swallow the qdisc_restart changes i made; otherwise
> maintenance becomes extremely painful for me.
> Sridhar actually provided a lot more valuable comments and fixes but has
> not planted a flag on behalf of the queen of spain like you did.
>
> > However I do give credit in my proposal to you for what
> > ideas that you provided (without actual code), and the same I did for other
> > people who did the same, like Dave, Sridhar. BTW, you too had discussions with me,
> > and I sent some patches to improve your code too,
>
> I incorporated two of your patches and asked for deferral of others.
> These patches have now shown up in what you claim as "the difference". I
> just call them "cosmetic difference" not to downplay the importance of
> having an ethtool interface but because they do not make batching
> perform any better. The real differences are those two items. I am
> surprised you havent cannibalized those changes as well. I thought you
> renamed them to something else; according to your posting:
> "This patch will work with drivers updated by Jamal, Matt & Michael Chan
> with minor modifications - rename xmit_win to xmit_slots & rename batch
> handler". Or maybe thats a "future plan" you have in mind?
>
> > so it looks like a two
> > way street to me (and that is how open source works and should).
>
> Open source is a lot more transparent than that.
>
> You posted a question, which was part of your research. I responded and
> told you i have patches; you asked me for them and i promptly ported
> them from pre-2.6.18 to the latest kernel at the time.
>
> The nature of this batching work is one of performance. So numbers are
> important. If you had some strong disagreements on something in the
> architecture, then it would be of great value to explain it in a
> technical detail - and more importantly to provide some numbers to say
> why it is a bad idea. You get numbers by running some tests.
> You did none of the above. Your effort has been to produce "your patch"
> for whatever reasons. This would not have been problematic to me if it
> actually was based within reasons of optimization because the end goal
> would have been achieved.
>
> I have deleted the rest of the email because it goes back and forth on
> the same points.
>
> I am gonna continue work on the current tree i have. I will put more
> time when i get back next week (and hopefully no travel right after).
> I will upgrade to Daves tree later when i get the two new drivers in. I > am probably gonna hold on until the new NAPI stuff settles in first. You > are welcome to submit the ipoib changes in. You are also welcome to > co-author with me but you will have to work for it this time. > > cheers, > jamal > From kliteyn at mellanox.co.il Tue Jul 24 21:03:20 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 25 Jul 2007 07:03:20 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-25:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=520 Pass=520 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From krkumar2 at in.ibm.com Tue Jul 24 21:40:01 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 25 Jul 2007 10:10:01 +0530 Subject: [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h In-Reply-To: Message-ID: Hi Patrick, Krishna Kumar2/India/IBM wrote on 07/23/2007 08:27:53 AM: > Hi Patrick, > > Patrick 
McHardy wrote on 07/22/2007 10:36:51 PM: > > > Krishna Kumar wrote: > > > @@ -472,6 +474,9 @@ struct net_device > > > void *priv; /* pointer to private data */ > > > int (*hard_start_xmit) (struct sk_buff *skb, > > > struct net_device *dev); > > > + int (*hard_start_xmit_batch) (struct net_device > > > + *dev); > > > + > > > > > > Os this function really needed? Can't you just call hard_start_xmit with > > a NULL skb and have the driver use dev->blist? > Probably not. I will see how to do it this way and get back to you. I think this is a good idea and makes code everywhere simpler. I will try this change and test to make sure it doesn't have any negative impact. Will mostly send out rev3 tomorrow. Thanks, - KK From ogerlitz at voltaire.com Tue Jul 24 23:44:19 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jul 2007 09:44:19 +0300 Subject: [ofa-general] PATCH] IB/core: ignore membership bit when looking for a P_Key in the table In-Reply-To: <46A453BE.3030408@gmail.com> References: <46A36E77.5020307@gmail.com> <46A453BE.3030408@gmail.com> Message-ID: <46A6F143.2040805@voltaire.com> Moni Shoua wrote: > IPoIB turns on the P_Key membership bit of limited membership P_Keys > when creating a child interface. After that IPoIB looks for the full > membership P_key in the table to make the interface "RUNNING". This > patch fixes the pkey lookup in order to match full and partial membership > keys that belong of the same partition. Roland, Can you please comment on the patch? the bug exist in 2.6.22 and 2.6.23-rc1 (also at OFED 1.2). Once you accept this we want to push it also to -stable etc. Or. From ogerlitz at voltaire.com Tue Jul 24 23:45:16 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jul 2007 09:45:16 +0300 Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation] Message-ID: <46A6F17C.8060404@voltaire.com> Hi Roland, It seems that you have missed this patch, can you have a look? Or. 
-------------- next part -------------- An embedded message was scrubbed... From: Or Gerlitz Subject: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation Date: Wed, 11 Jul 2007 09:22:43 +0300 (IDT) Size: 4119 URL: From ogerlitz at voltaire.com Wed Jul 25 00:00:28 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 25 Jul 2007 10:00:28 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A628D8.4050109@ichips.intel.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> Message-ID: <46A6F50C.5000906@voltaire.com> Sean Hefty wrote: >> Linux has a quite sophisticated mechanism to maintain / cache / probe >> / invalidate / update the network stack L2 neighbour info. > Path records are not just L2 info. They contain L4, L3, and L2 info > together. Maybe I was not clear enough: the neighbours cache keeps the stack Link (=L2) level info. The "IPoIB L2 info" (the neighbour HW address) contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info. So bottom line, the stack considers the creature as L2 info wheres in IB terms it contains L4/L3/L2 info. >> For example, in the Voltaire gen1 stack we had an ib arp module which >> was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc). >> This module managed some sort of path cache, were IPoIB was always >> asking for non-cached path and other ULPs were willing to get cached >> path. > IMO, using a cached AH is no different than using a cached path. You're > simply mapping the PR data into another structure. From the one hand the stack can't allow itself to do L3 --> L2 (ARP) resolving for each packet xmit but on the other hand the stack has this mechanism to probe / invalidate / etc its L2 cache. 
So my basic claim is that if the stack decided to renew its L2 info, it would be incorrect design to use cached IB L2 info. > We're ignoring the problem here, and that is that a centralized SA > doesn't scale. MPI stacks have largely ignored this problem by simply > not doing path record queries. Path information is often hard-coded, > with QPN data exchanged out of band over sockets (often over Ethernet). I don't think that trying to separate IPoIB flow from MPI flow is ignoring the problem. Its different settings, IPoIB is a network device working under the net stack which has some design philosophy. Native MPI implementations over IB are not tied to the stack, its different. > We've seen problems running large MPI jobs without PR caching. I know > that Silverstorm/QLogic did as well. And apparently Voltaire hit the > same type of problem, since you added a caching module. (Did Mellanox > and Topspin/Cisco create PR caches as well?) At least three companies > working on IB came up with the same solution. What is the objection to > the current patch set? Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not-- using cached IB L2 info wheres MPI,Lustre etc did. I am willing to go with the local sa coming to serve large MPI jobs, so you load as a prerequisite to spawning large all-to-all job. But, I think the default for IPoIB needs to be usage of non cached PR. If you want to support the non-common case of huge-mpi-job-over-ipoib, I am fine with adding a param to IPoIB telling it to request cached PR from the ib_sa module. Or. From mst at dev.mellanox.co.il Wed Jul 25 00:27:23 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 25 Jul 2007 10:27:23 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724221950.GP16727@bauxite.pathscale.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> Message-ID: <20070725072723.GA32499@mellanox.co.il> > > - A single git reset ORIG_HEAD recovers from a conflicting merge > > handling conflicts is a big part of a maintainer's job! Because you are a driver maintainer. That's what's different here from regular merge. Please understand: we have upstream code and we have changes against it. Upstream code is golden. If some patch conflicts with it, it is always this patch that needs to be fixed. And I want to ability to bounce that job to patch author - I simply do not know enough about e.g. ehca. > also, if the upstream > changes touch code that conflicts with a backport > patch, you get to fix the problem as it happens That's exactly the thing that I do not want to do. -- MST From mst at dev.mellanox.co.il Wed Jul 25 00:34:35 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 25 Jul 2007 10:34:35 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070724221950.GP16727@bauxite.pathscale.com> References: <20070723200640.GA13117@bauxite.pathscale.com> <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> Message-ID: <20070725073435.GB32499@mellanox.co.il> > > - A single git reset ORIG_HEAD recovers from a conflicting merge > > if, though, you must have a pristine environment, > this is easily solved by using an intermediate repo: > > git clone -s > > Ah, you now see how git reset is broken. What about git rebase? Broken too I'm afraid. Anything that rewrites history is. > i bet this is very similar time-wise to running the > merge, then the ofed_scripts/configure over all supported > branches. merges in git are _fast_... Full tree checkout is slow though. > > - A single tag tags all code for all kernels > > store commit ids in a file and tag that? This trick breaks some more git utilities. E.g. git describe, git web displaying tags ... -- MST From mst at dev.mellanox.co.il Wed Jul 25 00:46:38 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 25 Jul 2007 10:46:38 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality In-Reply-To: <200705101628.43095.ossrosch@linux.vnet.ibm.com> References: <200705101628.43095.ossrosch@linux.vnet.ibm.com> Message-ID: <20070725074638.GA1581@mellanox.co.il> > Quoting Stefan Roscher : > Subject: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality > > > > Signed-off-by: Stefan Roscher > --- > backport_ehca_2_rhel45_umap.patch | 850 ++++++++++++++++++++++++++++++++++++++ > 1 files changed, 850 insertions(+) Guys, I have updated the ofed_kernel (destined for OFED 1.3) kernel tree to 2.6.23-rc1, and this patch no longer applies. The conflicts aren't trivial (e.g. there's been ABI change). I moved it to kernel_patches/attic for now. Could you please take a look and update the patch for that tree? The updated code is here: git://git.openfabrics.org/~mst/ofed_kernel.git ofed_kernel I expect Vlad'll pull it soon, too. -- MST From gdror at dev.mellanox.co.il Wed Jul 25 01:22:58 2007 From: gdror at dev.mellanox.co.il (Dror Goldenberg) Date: Wed, 25 Jul 2007 11:22:58 +0300 Subject: [ofa-general] 20% latency increase between UD to RC latency In-Reply-To: References: Message-ID: <46A70862.2030808@dev.mellanox.co.il> Or Gerlitz wrote: > OK,its always good to start with facts on the ground... before > commiting this test, my original thinking was that for messages > whose size=X is less then the IB Link level MTU it holds that: > > latency(X,UD) <= latency(X,UC) <= latency(X,RC) > > Running the latency test provided with the perftest package on my systems (*) > I get the below results. Does anyone has insight why the --minimal-- and typical > UD latency is 1us ( = 20%) worse then the --minimal-- and typical RC latency??? > > Or. 
> > In all devices that support the memfree architecture (InfiniHost III-Ex running at memfree mode, InfiniHost III Lx and ConnectX) you will find the performance of UD comparable to the RC/UC. With the introduction of the memfree architecture, we are focused on development and performance optimization for this architecture. That is the main reason for memfree to achieve better performance. -Dror From vlad at lists.openfabrics.org Wed Jul 25 02:13:27 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 25 Jul 2007 02:13:27 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070725-0100 daily build status Message-ID: <20070725091327.E3DA8E603CA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with 
linux-2.6.16 Passed on ia64 with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed:
From vlad at lists.openfabrics.org Wed Jul 25 03:06:28 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Wed, 25 Jul 2007 03:06:28 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070725-0200 daily build status Message-ID: <20070725100628.A5CD4E603A1@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with
linux-2.6.22 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From krkumar2 at in.ibm.com Wed Jul 25 03:33:23 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 25 Jul 2007 16:03:23 +0530 Subject: [ofa-general] Question on IPoIB start xmit Message-ID: Hi all, For batching, I modified ipoib_start_xmit() to send out multiple skbs, and currently what I do is to always send skbs in the same order it was sent from the ULP's. Eg : if following are the order of skbs sent from above: Good skb1, Pathlookup skb2, Good skb3, Good skb4, Mcast send skb5, Good skb6, Good skb7, Unicast arp send8, Good skb9 I make sure that xmits are done in the same order. Is there any issue in sending out in this order: Pathlookup skb2, Mcast send skb5, Unicast arp send8, Good skb1, Good skb3, Good skb4, Good skb6, Good skb7, Good skb9 Or is there any requirement or logic that will break unless skbs are sent in the same order that it was received from ULP ? 
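The reordering asked about above amounts to a stable partition of the batch: skbs that need extra work (path lookup, multicast, unicast ARP) go first, and the relative order inside each group is kept. A minimal Python model, using the nine skbs from the example; the function name `slow_path_first` and the string labels are illustrative only and are not IPoIB code:

```python
def slow_path_first(skbs, is_slow):
    """Stable partition: slow-path skbs first, original order kept within each group."""
    slow = [s for s in skbs if is_slow(s)]
    fast = [s for s in skbs if not is_slow(s)]
    return slow + fast


queue = [
    "Good skb1", "Pathlookup skb2", "Good skb3", "Good skb4",
    "Mcast send skb5", "Good skb6", "Good skb7",
    "Unicast arp send8", "Good skb9",
]
# Anything not labelled "Good" is treated as slow path in this model.
reordered = slow_path_first(queue, lambda s: not s.startswith("Good"))
```

The model only shows what the second ordering looks like. It says nothing about whether skbs that belong to the same flow can end up in different groups, which is the case where reordering would become visible to the upper layers.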
Thanks, - KK From krkumar2 at in.ibm.com Wed Jul 25 03:35:54 2007 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Wed, 25 Jul 2007 16:05:54 +0530 Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation] In-Reply-To: <46A6F17C.8060404@voltaire.com> Message-ID: > + * > + * if IBV_SEND_INLINE flag is set, the data buffers can be reused immediately > + * after the call returns - low level libraries must confirm to this rule. > */ Maybe change "confirm to" to "conform by" ? Thanks, - KK general-bounces at lists.openfabrics.org wrote on 07/25/2007 12:15:16 PM: > Hi Roland, > > It seems that you have missed this patch, can you have a look? > > Or. > > ----- Message from Or Gerlitz on Wed, 11 Jul 2007 09: > 22:43 +0300 (IDT) ----- > > To: > > Roland Dreier > > cc: > > Alex Rosenbaum , general at lists.openfabrics.org > > Subject: > > [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation > > if the IBV_SEND_INLINE flag is set in the WR provided to ibv_post_send, > the data buffers can be reused immediately after the call returns, document this. > > Signed-off-by: Or Gerlitz > > Index: libibverbs/include/infiniband/verbs.h > =================================================================== > --- libibverbs.orig/include/infiniband/verbs.h > +++ libibverbs/include/infiniband/verbs.h > @@ -989,6 +989,9 @@ int ibv_destroy_qp(struct ibv_qp *qp); > > /** > * ibv_post_send - Post a list of work requests to a send queue. > + * > + * if IBV_SEND_INLINE flag is set, the data buffers can be reused immediately > + * after the call returns - low level libraries must confirm to this rule. 
> */ > static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, > struct ibv_send_wr **bad_wr) > Index: libibverbs/man/ibv_post_send.3 > =================================================================== > --- libibverbs.orig/man/ibv_post_send.3 > +++ libibverbs/man/ibv_post_send.3 > @@ -109,7 +109,9 @@ behavior. > .PP > The buffers used by a WR can only be safely reused after WR the > request is fully executed and a work completion has been retrieved > -from the corresponding completion queue (CQ). > +from the corresponding completion queue (CQ). However, if the > +IBV_SEND_INLINE flag was set, the buffer can be reused immediately > +after the call returns. > .SH "SEE ALSO" > .BR ibv_create_qp (3), > .BR ibv_create_ah (3), > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at dev.mellanox.co.il Wed Jul 25 04:09:08 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 14:09:08 +0300 Subject: [ofa-general] [PATCH trivial] include linux/mutex.h from scsi_transport_iscsi.h Message-ID: <20070725110907.GF3826@mellanox.co.il> scsi/scsi_transport_iscsi.h uses struct mutex, so while linux/mutex.h seems to be pulled in indirectly by one of the headers it includes, the right thing is to include linux/mutex.h directly. Signed-off-by: Michael S. 
Tsirkin
---

diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h
index 706c0cd..7530e98 100644
--- a/include/scsi/scsi_transport_iscsi.h
+++ b/include/scsi/scsi_transport_iscsi.h
@@ -24,6 +24,7 @@
 #define SCSI_TRANSPORT_ISCSI_H
 #include
+#include <linux/mutex.h>
 #include
 struct scsi_transport_template;

-- 
MST

From mst at dev.mellanox.co.il Wed Jul 25 04:32:10 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 14:32:10 +0300
Subject: [ofa-general] add_open_iscsi_h.patch
Message-ID: <20070725113210.GG3826@mellanox.co.il>

Erez, add_open_iscsi_h currently does:

-#include
+#include "iscsi_if.h"

why is this bit needed?

-- 
MST

From erezz at voltaire.com Wed Jul 25 05:56:55 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 25 Jul 2007 15:56:55 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <20070725113210.GG3826@mellanox.co.il>
References: <20070725113210.GG3826@mellanox.co.il>
Message-ID: <46A74897.6070903@voltaire.com>

Michael S. Tsirkin wrote:
> Erez, add_open_iscsi_h currently does:
>
> -#include
> +#include "iscsi_if.h"
>
> why is this bit needed?
>

Strange. I remember that I couldn't build OFED 1.2 without it in the past. I tried to rebuild it without this now, and it compiles successfully, so let's remove that code.

Erez

From mst at dev.mellanox.co.il Wed Jul 25 06:09:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 25 Jul 2007 16:09:47 +0300
Subject: [ofa-general] Re: add_open_iscsi_h.patch
In-Reply-To: <46A74897.6070903@voltaire.com>
References: <20070725113210.GG3826@mellanox.co.il> <46A74897.6070903@voltaire.com>
Message-ID: <20070725130947.GA19872@mellanox.co.il>

> Quoting Erez Zilber :
> Subject: Re: add_open_iscsi_h.patch
>
> Michael S. Tsirkin wrote:
>
> > Erez, add_open_iscsi_h currently does:
> >
> > -#include
> > +#include "iscsi_if.h"
> >
> > why is this bit needed?
> >
>
> Strange. I remember that I couldn't build OFED 1.2 without it in the
> past.
I tried to rebuild it without this now, and it compiles > successfully, so let's remove that code. On a related note: #include #include -#include -#include #include #include should not be needed too. And how come this is helpful? @@ -277,7 +277,6 @@ enum iscsi_param { * These flags describes reason of stop_conn() call */ #define STOP_CONN_TERM 0x1 -#define STOP_CONN_SUSPEND 0x2 #define STOP_CONN_RECOVER 0x3 #define ISCSI_STATS_CUSTOM_MAX 32 In other words, is there a chance we can kill this patch completely? -- MST From mst at dev.mellanox.co.il Wed Jul 25 06:29:05 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 16:29:05 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <46A74897.6070903@voltaire.com> References: <20070725113210.GG3826@mellanox.co.il> <46A74897.6070903@voltaire.com> Message-ID: <20070725132905.GD19872@mellanox.co.il> > Quoting Erez Zilber : > Subject: Re: add_open_iscsi_h.patch > > Michael S. Tsirkin wrote: > > > Erez, add_open_iscsi_h currently does: > > > > -#include > > +#include "iscsi_if.h" > > > > why is ths bit needed? > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the > past. I tried to rebuild it without this now, and it compiles > successfully, so let's remove that code. OK, I killed these patches completely and things still build fine. Vlad, please pull my tree into ofed_kernel. -- MST From erezz at voltaire.com Wed Jul 25 06:37:31 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 25 Jul 2007 16:37:31 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <20070725132905.GD19872@mellanox.co.il> References: <20070725113210.GG3826@mellanox.co.il><46A74897.6070903@voltaire.com> <20070725132905.GD19872@mellanox.co.il> Message-ID: <46A7521B.7010402@voltaire.com> Michael S. Tsirkin wrote: > > Quoting Erez Zilber : > > Subject: Re: add_open_iscsi_h.patch > > > > Michael S. 
Tsirkin wrote: > > > > > Erez, add_open_iscsi_h currently does: > > > > > > -#include > > > +#include "iscsi_if.h" > > > > > > why is ths bit needed? > > > > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the > > past. I tried to rebuild it without this now, and it compiles > > successfully, so let's remove that code. > > OK, I killed these patches completely and things still build fine. > Vlad, please pull my tree into ofed_kernel. > Yes, it also works for me. I guess that these are all leftovers. Erez From mst at dev.mellanox.co.il Wed Jul 25 06:46:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 16:46:00 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <46A7521B.7010402@voltaire.com> References: <20070725132905.GD19872@mellanox.co.il> <46A7521B.7010402@voltaire.com> Message-ID: <20070725134600.GE19872@mellanox.co.il> > Quoting Erez Zilber : > Subject: Re: add_open_iscsi_h.patch > > Michael S. Tsirkin wrote: > > > > Quoting Erez Zilber : > > > Subject: Re: add_open_iscsi_h.patch > > > > > > Michael S. Tsirkin wrote: > > > > > > > Erez, add_open_iscsi_h currently does: > > > > > > > > -#include > > > > +#include "iscsi_if.h" > > > > > > > > why is ths bit needed? > > > > > > > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the > > > past. I tried to rebuild it without this now, and it compiles > > > successfully, so let's remove that code. > > > > OK, I killed these patches completely and things still build fine. > > Vlad, please pull my tree into ofed_kernel. > > > Yes, it also works for me. I guess that these are all leftovers. Deleted. Hmm. Do we want to kill them in 1.2.c too? 
-- MST From erezz at voltaire.com Wed Jul 25 06:55:05 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 25 Jul 2007 16:55:05 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <20070725134600.GE19872@mellanox.co.il> References: <20070725132905.GD19872@mellanox.co.il><46A7521B.7010402@voltaire.com> <20070725134600.GE19872@mellanox.co.il> Message-ID: <46A75639.7050308@voltaire.com> Michael S. Tsirkin wrote: > > Quoting Erez Zilber : > > Subject: Re: add_open_iscsi_h.patch > > > > Michael S. Tsirkin wrote: > > > > > > Quoting Erez Zilber : > > > > Subject: Re: add_open_iscsi_h.patch > > > > > > > > Michael S. Tsirkin wrote: > > > > > > > > > Erez, add_open_iscsi_h currently does: > > > > > > > > > > -#include > > > > > +#include "iscsi_if.h" > > > > > > > > > > why is ths bit needed? > > > > > > > > > > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the > > > > past. I tried to rebuild it without this now, and it compiles > > > > successfully, so let's remove that code. > > > > > > OK, I killed these patches completely and things still build fine. > > > Vlad, please pull my tree into ofed_kernel. > > > > > Yes, it also works for me. I guess that these are all leftovers. > > Deleted. Hmm. Do we want to kill them in 1.2.c too? > Yes (why not?) Erez From mst at dev.mellanox.co.il Wed Jul 25 07:11:41 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 17:11:41 +0300 Subject: [ofa-general] ANNOUNCE: ofed kernel build updates Message-ID: <20070725141141.GG19872@mellanox.co.il> Hi! I'd like to announce a couple of updates that were recently made to the build scripts on the ofed_kernel branch. This is an attempt to answer repeated requests, aired at Sonoma, to simplify access to kernel sources. The idea is that a user of a supported kernel will just be able to download an appropriate tarball and run with it without need for patching. 
These changes are available from the ofed_kernel git tree maintained by Vlad:
git://git.openfabrics.org/~vlad/ofed_kernel.git ofed_kernel

The code is mine, but the ideas mostly come from criticism and code sent by Ira Weiny. Thanks, Ira!

Note that the changes were made in a backwards-compatible way, so that existing scripts using configure/make will continue working.

What's new:

1. New script ofed_scripts/ofed_patch.sh
This will apply fixes and backport patches for a specific kernel to the current tree.
Usage: ./ofed_scripts/ofed_patch.sh --with-backport=VERSION
This makes it possible for distro vendors to generate a tarball pre-patched for a specific kernel.

2. New script ofed_scripts/ofed_makedist.sh
This script repeatedly clones the current repository, runs ofed_scripts/ofed_patch.sh, and then builds tarballs of ofed kernel source pre-patched for supported kernel versions. I plan to work with Vlad to run this script as part of nightly builds, so that prepatched tarballs will become available for download.

3. configure script made re-entrant
The configure script does not apply patches anymore: all it does is create the configure.mk.kernel and autoconf.h files. This finally makes it possible to change configuration parameters just by re-running configure. For backwards compatibility, if configure detects that ofed_scripts/ofed_patch.sh has not been run yet, it prints a warning and runs it automatically.

Feedback welcome.

-- MST
From mst at dev.mellanox.co.il Wed Jul 25 07:12:35 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 17:12:35 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <46A75639.7050308@voltaire.com> References: <20070725134600.GE19872@mellanox.co.il> <46A75639.7050308@voltaire.com> Message-ID: <20070725141235.GH19872@mellanox.co.il> > Quoting Erez Zilber : > Subject: Re: add_open_iscsi_h.patch > > Michael S.
Tsirkin wrote: > > > > Quoting Erez Zilber : > > > Subject: Re: add_open_iscsi_h.patch > > > > > > Michael S. Tsirkin wrote: > > > > > > > > Quoting Erez Zilber : > > > > > Subject: Re: add_open_iscsi_h.patch > > > > > > > > > > Michael S. Tsirkin wrote: > > > > > > > > > > > Erez, add_open_iscsi_h currently does: > > > > > > > > > > > > -#include > > > > > > +#include "iscsi_if.h" > > > > > > > > > > > > why is ths bit needed? > > > > > > > > > > > > > > > > Strange. I remember that I couldn't build OFED 1.2 without it in the > > > > > past. I tried to rebuild it without this now, and it compiles > > > > > successfully, so let's remove that code. > > > > > > > > OK, I killed these patches completely and things still build fine. > > > > Vlad, please pull my tree into ofed_kernel. > > > > > > > Yes, it also works for me. I guess that these are all leftovers. > > > > Deleted. Hmm. Do we want to kill them in 1.2.c too? > > > Yes (why not?) Donnu. It's in bugfix-only mode after all. You decide. -- MST From erezz at voltaire.com Wed Jul 25 07:33:27 2007 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 25 Jul 2007 17:33:27 +0300 Subject: [ofa-general] Re: add_open_iscsi_h.patch In-Reply-To: <20070725141235.GH19872@mellanox.co.il> References: <20070725134600.GE19872@mellanox.co.il><46A75639.7050308@voltaire.com> <20070725141235.GH19872@mellanox.co.il> Message-ID: <46A75F37.1030700@voltaire.com> Michael S. Tsirkin wrote: > > Quoting Erez Zilber : > > Subject: Re: add_open_iscsi_h.patch > > > > Michael S. Tsirkin wrote: > > > > > > Quoting Erez Zilber : > > > > Subject: Re: add_open_iscsi_h.patch > > > > > > > > Michael S. Tsirkin wrote: > > > > > > > > > > Quoting Erez Zilber : > > > > > > Subject: Re: add_open_iscsi_h.patch > > > > > > > > > > > > Michael S. Tsirkin wrote: > > > > > > > > > > > > > Erez, add_open_iscsi_h currently does: > > > > > > > > > > > > > > -#include > > > > > > > +#include "iscsi_if.h" > > > > > > > > > > > > > > why is ths bit needed? 
> > > > > > > > > > > > > > > > > > > Strange. I remember that I couldn't build OFED 1.2 without > it in the > > > > > > past. I tried to rebuild it without this now, and it compiles > > > > > > successfully, so let's remove that code. > > > > > > > > > > OK, I killed these patches completely and things still build fine. > > > > > Vlad, please pull my tree into ofed_kernel. > > > > > > > > > Yes, it also works for me. I guess that these are all leftovers. > > > > > > Deleted. Hmm. Do we want to kill them in 1.2.c too? > > > > > Yes (why not?) > > Donnu. It's in bugfix-only mode after all. You decide. > OK. Let's do it for OFED 1.3 only. This is not really a bug fix. Erez From arthur.jones at qlogic.com Wed Jul 25 07:43:58 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Wed, 25 Jul 2007 07:43:58 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070725073435.GB32499@mellanox.co.il> References: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> <20070725073435.GB32499@mellanox.co.il> Message-ID: <20070725144358.GQ16727@bauxite.pathscale.com> hi michael, ... On Wed, Jul 25, 2007 at 10:34:35AM +0300, Michael S. Tsirkin wrote: > > > - A single git reset ORIG_HEAD recovers from a conflicting merge > > > > if, though, you must have a pristine environment, > > this is easily solved by using an intermediate repo: > > > > git clone -s > > > > > > Ah, you now see how git reset is broken. > What about git rebase? Broken too I'm afraid. > Anything that rewrites history is. 
no, git reset is not broken, nor is git rebase, what i described is a way to import multiple branches and allow you to back out easily if _any_ of the pulls failed. this is certainly _not_ the only way. you could also write a script to capture the HEADS and then git reset them if there were any issues. this would be faster, but more complicated. there is _no_ loss in functionality. i would be willing to write and maintain this script for you if you wanted to give it a try... i think, if you tried the branches, you would find that you wouldn't need to require all branches to pull cleanly. you would prob be able to easily fix up the problem and continue the merge... > > i bet this is very similar time-wise to running the > > merge, then the ofed_scripts/configure over all supported > > branches. merges in git are _fast_... > > Full tree checkout is slow though. yes. it would prob be worth the effort to capture the HEADS and replay them if you really found that it was a requirement to pull all branches cleanly or none at all... > > > - A single tag tags all code for all kernels > > > > store commit ids in a file and tag that? > > This trick breaks some more git utilities. > E.g. git describe, git web displaying tags ... yes, i agree it's ugly. tags are not nice for multiple branches in a repo, do you know if there is any movement in the git project to work on this? arthur From mst at dev.mellanox.co.il Wed Jul 25 07:47:51 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 25 Jul 2007 17:47:51 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070725144358.GQ16727@bauxite.pathscale.com> References: <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> <20070725073435.GB32499@mellanox.co.il> <20070725144358.GQ16727@bauxite.pathscale.com> Message-ID: <20070725144751.GG29081@mellanox.co.il> > > > > - A single tag tags all code for all kernels > > > > > > store commit ids in a file and tag that? > > > > This trick breaks some more git utilities. > > E.g. git describe, git web displaying tags ... > > yes, i agree it's ugly. tags are not nice > for multiple branches in a repo, do you know > if there is any movement in the git project > to work on this? I don't think so. As I said, when I posed the problem we have with fixes/backport, people on list just told me "keep patches under git". -- MST From arthur.jones at qlogic.com Wed Jul 25 07:52:23 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Wed, 25 Jul 2007 07:52:23 -0700 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070725072723.GA32499@mellanox.co.il> References: <000001c7ce0d$075d9a20$ff0da8c0@amr.corp.intel.com> <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> <20070725072723.GA32499@mellanox.co.il> Message-ID: <20070725145223.GR16727@bauxite.pathscale.com> hi michael, ... 
On Wed, Jul 25, 2007 at 10:27:23AM +0300, Michael S. Tsirkin wrote: > > > - A single git reset ORIG_HEAD recovers from a conflicting merge > > > > handling conflicts is a big part of a maintainer's job! > > Because you are a driver maintainer. > That's what's different here from regular merge. > Please understand: we have upstream code and we have changes against it. i am a driver maintainer, but i'm also maintaining the ipath release which is OFED + qlogic specific stuff. i know the process that you go through to make a release. i've lived it now for 2 releases of ipath software. > Upstream code is golden. If some patch conflicts with it, > it is always this patch that needs to be fixed. > And I want the ability to bounce that job to patch author - > I simply do not know enough about e.g. ehca. i agree, non-trivial merges should be bounced to the patch author -- nothing about using backport branches prevents or even makes this more difficult, in fact, i have found it to be easier in git than in dealing w/ patches because the environment where the changes need to be made is much more comfortable (git rather than quilt or some random patch stack)... > > also, if the upstream > > changes touch code that conflicts with a backport > > patch, you get to fix the problem as it happens > > That's exactly the thing that I do not want to do. you don't want to know about a problem with a patch until days or weeks later when the auto build keeps failing and you don't know why? it is easy to catch many problems _before_ the build check fails... arthur From mst at dev.mellanox.co.il Wed Jul 25 08:01:55 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 25 Jul 2007 18:01:55 +0300 Subject: [ofa-general] ANNOUNCE ofed backports for 2.6.22 kernel bits In-Reply-To: <20070725145223.GR16727@bauxite.pathscale.com> References: <20070724161646.GA24797@mellanox.co.il> <20070724165032.GK16727@bauxite.pathscale.com> <20070724165550.GD24797@mellanox.co.il> <20070724170726.GL16727@bauxite.pathscale.com> <20070724171924.GG24797@mellanox.co.il> <20070724172826.GN16727@bauxite.pathscale.com> <20070724175220.GI24797@mellanox.co.il> <20070724221950.GP16727@bauxite.pathscale.com> <20070725072723.GA32499@mellanox.co.il> <20070725145223.GR16727@bauxite.pathscale.com> Message-ID: <20070725150155.GA30690@mellanox.co.il> > > > also, if the upstream > > > changes touch code that conflicts with a backport > > > patch, you get to fix the problem as it happens > > > > That's exactly the thing that I do not want to do. > > you don't want to know about a problem with a patch > until days or weeks later when the auto build > keeps failing and you don't know why? it is > easy to catch many problems _before_ the build > check fails... I don't work this way. I just apply all patches before pushing out. And I see *immediately* the patch that conflicts - unlike a merge conflict, where I will know which file conflicts but not which change created the conflict. And if a patch conflicts with upstream code, an option to move the patch aside and defer the merge decision to the patch author is very important to me: this just happened with the ehca backport and the update to 2.6.23-rc1. I do not want to delay the update to 2.6.23-rc1 until IBM can be bothered to update their backport. Yes, this means that the specific module won't build on a specific kernel until the conflict is resolved. But there are multiple conflicts and each needs to be resolved by another person.
-- MST
From philippe.gregoire at cea.fr Wed Jul 25 08:28:47 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Wed, 25 Jul 2007 17:28:47 +0200 Subject: [ofa-general] SRP and opensm Message-ID: <46A76C2F.5070309@cea.fr>

Hi

We are testing DDN InfiniBand storage. We are using OFED 1.2 and SRP. The nodes are directly connected to the DDN controller InfiniBand ports. To get this configuration working, we have to run multiple instances of OpenSM with the -G option, one for each port connected to a TCA port.

Is there any other way to proceed with such a configuration (directly attached InfiniBand storage)? Is there any plan to add a multi-port feature to OpenSM?

Philippe Gregoire

From hal.rosenstock at gmail.com Wed Jul 25 08:38:05 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 11:38:05 -0400 Subject: [ofa-general] SRP and opensm In-Reply-To: <46A76C2F.5070309@cea.fr> References: <46A76C2F.5070309@cea.fr> Message-ID:

Hi Philippe,

On 7/25/07, Philippe Gregoire wrote:
> Hi
> We are testing DDN InfiniBand storage. We are using OFED 1.2 and SRP.
> The nodes are directly connected to the DDN controller InfiniBand ports.
> To get this configuration working, we have to run multiple instances of
> OpenSM with the -G option, one for each port connected to a TCA port.

-g ?

> Is there any other way to proceed with such a configuration (directly
> attached InfiniBand storage)?
> Is there any plan to add a multi-port feature to OpenSM?

Not that I'm aware of. There is no plan to change the OpenSM architecture to make a single instance support multiple subnets. In this configuration, each port is a separate IB subnet (and there is an SM for each subnet). If you want to run a single subnet, you need at least one switch.
-- Hal > Philippe Gregoire > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnguyen at linux.vnet.ibm.com Wed Jul 25 09:27:56 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 25 Jul 2007 18:27:56 +0200 Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality Message-ID: <200707251827.57095.hnguyen@linux.vnet.ibm.com> Hi Michael, Below is the version without conflicts. And it should compile. As soon as the build scripts are ready, I'll test the whole backport. Thanks Nam From 6fa28219914394064a49c34030a09e23d160231c Mon Sep 17 00:00:00 2001 From: hnguyen at de.ibm.com Date: Wed, 25 Jul 2007 17:16:53 +0200 Subject: [PATCH ofed-1.3-alpha] ehca: backport_ehca_2_rhel45_umap.patch --- drivers/infiniband/hw/ehca/ehca_classes.h | 29 ++- drivers/infiniband/hw/ehca/ehca_cq.c | 66 ++++- drivers/infiniband/hw/ehca/ehca_iverbs.h | 8 + drivers/infiniband/hw/ehca/ehca_qp.c | 92 +++++-- drivers/infiniband/hw/ehca/ehca_uverbs.c | 423 +++++++++++++++++------------ 5 files changed, 395 insertions(+), 223 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 3725aa8..49d6155 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -160,14 +160,13 @@ struct ehca_qp { struct ipz_qp_handle ipz_qp_handle; struct ehca_pfqp pf; struct ib_qp_init_attr init_attr; + u64 uspace_squeue; + u64 uspace_rqueue; + u64 uspace_fwh; struct ehca_cq *send_cq; struct ehca_cq *recv_cq; unsigned int sqerr_purgeflag; struct hlist_node list_entries; - /* mmap counter for resources mapped into user space */ - u32 mm_count_squeue; - u32 
mm_count_rqueue; - u32 mm_count_galpa; }; #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ) @@ -188,6 +187,8 @@ struct ehca_cq { struct ipz_cq_handle ipz_cq_handle; struct ehca_pfcq pf; spinlock_t cb_lock; + u64 uspace_queue; + u64 uspace_fwh; struct hlist_head qp_hashtab[QP_HASHTAB_LEN]; struct list_head entry; u32 nr_callbacks; /* #events assigned to cpu by scaling code */ @@ -195,9 +196,6 @@ struct ehca_cq { wait_queue_head_t wait_completion; spinlock_t task_lock; u32 ownpid; - /* mmap counter for resources mapped into user space */ - u32 mm_count_queue; - u32 mm_count_galpa; }; enum ehca_mr_flag { @@ -300,6 +298,20 @@ struct ehca_ucontext { struct ib_ucontext ib_ucontext; }; +struct ehca_module *ehca_module_new(void); + +int ehca_module_delete(struct ehca_module *me); + +int ehca_eq_ctor(struct ehca_eq *eq); + +int ehca_eq_dtor(struct ehca_eq *eq); + +struct ehca_shca *ehca_shca_new(void); + +int ehca_shca_delete(struct ehca_shca *me); + +struct ehca_sport *ehca_sport_new(struct ehca_shca *anchor); + int ehca_init_pd_cache(void); void ehca_cleanup_pd_cache(void); int ehca_init_cq_cache(void); @@ -324,6 +336,7 @@ extern int ehca_use_hp_mr; extern int ehca_scaling_code; struct ipzu_queue_resp { + u64 queue; /* points to first queue entry */ u32 qe_size; /* queue entry size */ u32 act_nr_of_sg; u32 queue_length; /* queue length allocated in bytes */ @@ -336,6 +349,7 @@ struct ehca_create_cq_resp { u32 cq_number; u32 token; struct ipzu_queue_resp ipz_queue; + struct h_galpas galpas; }; struct ehca_create_qp_resp { @@ -349,6 +363,7 @@ struct ehca_create_qp_resp { u32 dummy; /* padding for 8 byte alignment */ struct ipzu_queue_resp ipz_squeue; struct ipzu_queue_resp ipz_rqueue; + struct h_galpas galpas; }; struct ehca_alloc_cq_parms { diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 9c7172b..ac0bb10 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -268,6 +268,7 @@ struct ib_cq 
*ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (context) { struct ipz_queue *ipz_queue = &my_cq->ipz_queue; struct ehca_create_cq_resp resp; + struct vm_area_struct *vma; memset(&resp, 0, sizeof(resp)); resp.cq_number = my_cq->cq_number; resp.token = my_cq->token; @@ -276,14 +277,40 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, resp.ipz_queue.queue_length = ipz_queue->queue_length; resp.ipz_queue.pagesize = ipz_queue->pagesize; resp.ipz_queue.toggle_state = ipz_queue->toggle_state; + ret = ehca_mmap_nopage(((u64)(my_cq->token) << 32) | 0x12000000, + ipz_queue->queue_length, + (void**)&resp.ipz_queue.queue, + &vma); + if (ret) { + ehca_err(device, "Could not mmap queue pages"); + cq = ERR_PTR(ret); + goto create_cq_exit4; + } + my_cq->uspace_queue = resp.ipz_queue.queue; + resp.galpas = my_cq->galpas; + ret = ehca_mmap_register(my_cq->galpas.user.fw_handle, + (void**)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + ehca_err(device, "Could not mmap fw_handle"); + cq = ERR_PTR(ret); + goto create_cq_exit5; + } + my_cq->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; if (ib_copy_to_udata(udata, &resp, sizeof(resp))) { ehca_err(device, "Copy to udata failed."); - goto create_cq_exit4; + goto create_cq_exit6; } } return cq; +create_cq_exit6: + ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + +create_cq_exit5: + ehca_munmap(my_cq->uspace_queue, my_cq->ipz_queue.queue_length); + create_cq_exit4: ipz_queue_dtor(NULL, &my_cq->ipz_queue); @@ -307,6 +334,7 @@ create_cq_exit1: int ehca_destroy_cq(struct ib_cq *cq) { u64 h_ret; + int ret; struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); int cq_num = my_cq->cq_number; struct ib_device *device = cq->device; @@ -316,20 +344,6 @@ int ehca_destroy_cq(struct ib_cq *cq) u32 cur_pid = current->tgid; unsigned long flags; - if (cq->uobject) { - if (my_cq->mm_count_galpa || my_cq->mm_count_queue) { - ehca_err(device, "Resources still referenced in " - "user 
space cq_num=%x", my_cq->cq_number); - return -EINVAL; - } - if (my_cq->ownpid != cur_pid) { - ehca_err(device, "Invalid caller pid=%x ownpid=%x " - "cq_num=%x", - cur_pid, my_cq->ownpid, my_cq->cq_number); - return -EINVAL; - } - } - /* * remove the CQ from the idr first to make sure * no more interrupt tasklets will touch this CQ @@ -342,6 +356,26 @@ int ehca_destroy_cq(struct ib_cq *cq) wait_event(my_cq->wait_completion, !atomic_read(&my_cq->nr_events)); /* nobody's using our CQ any longer -- we can destroy it */ + + if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { + ehca_err(device, "Invalid caller pid=%x ownpid=%x", + cur_pid, my_cq->ownpid); + return -EINVAL; + } + + /* un-mmap if vma alloc */ + if (my_cq->uspace_queue ) { + ret = ehca_munmap(my_cq->uspace_queue, + my_cq->ipz_queue.queue_length); + if (ret) + ehca_err(device, "Could not munmap queue ehca_cq=%p " + "cq_num=%x", my_cq, cq_num); + ret = ehca_munmap(my_cq->uspace_fwh, EHCA_PAGESIZE); + if (ret) + ehca_err(device, "Could not munmap fwh ehca_cq=%p " + "cq_num=%x", my_cq, cq_num); + } + h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ @@ -370,7 +404,7 @@ int ehca_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); u32 cur_pid = current->tgid; - if (cq->uobject && my_cq->ownpid != cur_pid) { + if (my_cq->uspace_queue && my_cq->ownpid != cur_pid) { ehca_err(cq->device, "Invalid caller pid=%x ownpid=%x", cur_pid, my_cq->ownpid); return -EINVAL; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index dce503b..7b052f4 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -189,6 +189,14 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); void ehca_poll_eqs(unsigned long data); +int ehca_mmap_nopage(u64 foffset,u64 length,void 
**mapped, + struct vm_area_struct **vma); + +int ehca_mmap_register(u64 physical,void **mapped, + struct vm_area_struct **vma); + +int ehca_munmap(unsigned long addr, size_t len); + #ifdef CONFIG_PPC_64K_PAGES void *ehca_alloc_fw_ctrlblock(gfp_t flags); void ehca_free_fw_ctrlblock(void *ptr); diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index bd0e64b..1dccaaa 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -265,14 +265,18 @@ static inline int ibqptype2servicetype(enum ib_qp_type ibqptype) /* * init userspace queue info from ipz_queue data */ -static inline void queue2resp(struct ipzu_queue_resp *resp, - struct ipz_queue *queue) +static inline int queue2resp(struct ipzu_queue_resp *resp, + struct ipz_queue *queue, + u64 fofs) { + struct vm_area_struct *vma; resp->qe_size = queue->qe_size; resp->act_nr_of_sg = queue->act_nr_of_sg; resp->queue_length = queue->queue_length; resp->pagesize = queue->pagesize; resp->toggle_state = queue->toggle_state; + return ehca_mmap_nopage(fofs, queue->queue_length, + (void**)&resp->queue, &vma); } /* @@ -731,6 +735,7 @@ static struct ehca_qp *internal_create_qp( /* copy queues, galpa data to user space */ if (context && udata) { struct ehca_create_qp_resp resp; + struct vm_area_struct *vma; memset(&resp, 0, sizeof(resp)); resp.qp_num = my_qp->real_qp_num; @@ -741,20 +746,55 @@ static struct ehca_qp *internal_create_qp( resp.real_qp_num = my_qp->real_qp_num; resp.ipz_rqueue.offset = my_qp->ipz_rqueue.offset; resp.ipz_squeue.offset = my_qp->ipz_squeue.offset; - if (HAS_SQ(my_qp)) - queue2resp(&resp.ipz_squeue, &my_qp->ipz_squeue); - if (HAS_RQ(my_qp)) - queue2resp(&resp.ipz_rqueue, &my_qp->ipz_rqueue); + if (HAS_SQ(my_qp)) { + ret = queue2resp( + &resp.ipz_squeue, &my_qp->ipz_squeue, + ((u64)(my_qp->token) << 32) | 0x23000000); + if (ret) { + ehca_err(pd->device, + "Could not mmap squeue pages"); + goto create_qp_exit4; + } + } + if
(HAS_RQ(my_qp)) { + ret = queue2resp( + &resp.ipz_rqueue, &my_qp->ipz_rqueue, + ((u64)(my_qp->token) << 32) | 0x22000000); + if (ret) { + ehca_err(pd->device, + "Could not mmap rqueue pages"); + goto create_qp_exit5; + } + } + /* fw_handle */ + resp.galpas = my_qp->galpas; + ret = ehca_mmap_register(my_qp->galpas.user.fw_handle, + (void **)&resp.galpas.kernel.fw_handle, + &vma); + if (ret) { + ehca_err(pd->device, "Could not mmap fw_handle"); + goto create_qp_exit6; + } + my_qp->uspace_fwh = (u64)resp.galpas.kernel.fw_handle; if (ib_copy_to_udata(udata, &resp, sizeof resp)) { ehca_err(pd->device, "Copy to udata failed"); ret = -EINVAL; - goto create_qp_exit4; + goto create_qp_exit7; } } return my_qp; +create_qp_exit7: + ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + +create_qp_exit6: + ehca_munmap(my_qp->uspace_rqueue, my_qp->ipz_rqueue.queue_length); + +create_qp_exit5: + ehca_munmap(my_qp->uspace_squeue, my_qp->ipz_squeue.queue_length); + create_qp_exit4: if (HAS_RQ(my_qp)) ipz_queue_dtor(my_pd, &my_qp->ipz_rqueue); @@ -1106,7 +1146,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, my_qp->qp_type == IB_QPT_SMI) && statetrans == IB_QPST_SQE2RTS) { /* mark next free wqe if kernel */ - if (!ibqp->uobject) { + if (my_qp->uspace_squeue == 0) { struct ehca_wqe *wqe; /* lock send queue */ spin_lock_irqsave(&my_qp->spinlock_s, flags); @@ -1717,19 +1757,11 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, enum ib_qp_type qp_type; unsigned long flags; - if (uobject) { - if (my_qp->mm_count_galpa || - my_qp->mm_count_rqueue || my_qp->mm_count_squeue) { - ehca_err(dev, "Resources still referenced in " - "user space qp_num=%x", qp_num); - return -EINVAL; - } - if (my_pd->ownpid != cur_pid) { - ehca_err(dev, "Invalid caller pid=%x ownpid=%x", - cur_pid, my_pd->ownpid); - return -EINVAL; - } - } + if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && + my_pd->ownpid != cur_pid) { + ehca_err(ibqp->device, "Invalid caller pid=%x 
ownpid=%x", + cur_pid, my_pd->ownpid); + return -EINVAL; if (my_qp->send_cq) { ret = ehca_cq_unassign_qp(my_qp->send_cq, qp_num); @@ -1745,6 +1777,24 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, idr_remove(&ehca_qp_idr, my_qp->token); write_unlock_irqrestore(&ehca_qp_idr_lock, flags); + /* un-mmap if vma alloc */ + if (my_qp->uspace_rqueue) { + ret = ehca_munmap(my_qp->uspace_rqueue, + my_qp->ipz_rqueue.queue_length); + if (ret) + ehca_err(ibqp->device, "Could not munmap rqueue " + "qp_num=%x", qp_num); + ret = ehca_munmap(my_qp->uspace_squeue, + my_qp->ipz_squeue.queue_length); + if (ret) + ehca_err(ibqp->device, "Could not munmap squeue " + "qp_num=%x", qp_num); + ret = ehca_munmap(my_qp->uspace_fwh, EHCA_PAGESIZE); + if (ret) + ehca_err(ibqp->device, "Could not munmap fwh qp_num=%x", + qp_num); + } + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { ehca_err(dev, "hipz_h_destroy_qp() failed rc=%lx " diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index 4bc687f..5df5b96 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -68,184 +68,104 @@ int ehca_dealloc_ucontext(struct ib_ucontext *context) return 0; } -static void ehca_mm_open(struct vm_area_struct *vma) +struct page *ehca_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) { - u32 *count = (u32 *)vma->vm_private_data; - if (!count) { - ehca_gen_err("Invalid vma struct vm_start=%lx vm_end=%lx", - vma->vm_start, vma->vm_end); - return; - } - (*count)++; - if (!(*count)) - ehca_gen_err("Use count overflow vm_start=%lx vm_end=%lx", - vma->vm_start, vma->vm_end); - ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", - vma->vm_start, vma->vm_end, *count); -} - -static void ehca_mm_close(struct vm_area_struct *vma) -{ - u32 *count = (u32 *)vma->vm_private_data; - if (!count) { - ehca_gen_err("Invalid vma struct vm_start=%lx 
vm_end=%lx", - vma->vm_start, vma->vm_end); - return; - } - (*count)--; - ehca_gen_dbg("vm_start=%lx vm_end=%lx count=%x", - vma->vm_start, vma->vm_end, *count); -} + struct page *mypage = NULL; + u64 fileoffset = vma->vm_pgoff << PAGE_SHIFT; + u32 idr_handle = fileoffset >> 32; + u32 q_type = (fileoffset >> 28) & 0xF; /* CQ, QP,... */ + u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ + u32 cur_pid = current->tgid; + unsigned long flags; + struct ehca_cq *cq; + struct ehca_qp *qp; + struct ehca_pd *pd; + u64 offset; + void *vaddr; -static struct vm_operations_struct vm_ops = { - .open = ehca_mm_open, - .close = ehca_mm_close, -}; + switch (q_type) { + case 1: /* CQ */ + spin_lock_irqsave(&ehca_cq_idr_lock, flags); + cq = idr_find(&ehca_cq_idr, idr_handle); + spin_unlock_irqrestore(&ehca_cq_idr_lock, flags); -static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, - u32 *mm_count) -{ - int ret; - u64 vsize, physical; + /* make sure this mmap really belongs to the authorized user */ + if (!cq) { + ehca_gen_err("cq is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } - vsize = vma->vm_end - vma->vm_start; - if (vsize != EHCA_PAGESIZE) { - ehca_gen_err("invalid vsize=%lx", vma->vm_end - vma->vm_start); - return -EINVAL; - } + if (cq->ownpid != cur_pid) { + ehca_err(cq->ib_cq.device, + "Invalid caller pid=%x ownpid=%x", + cur_pid, cq->ownpid); + return NOPAGE_SIGBUS; + } + + if (rsrc_type == 2) { + ehca_dbg(cq->ib_cq.device, "cq=%p cq queuearea", cq); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&cq->ipz_queue, offset); + ehca_dbg(cq->ib_cq.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); + } + break; - physical = galpas->user.fw_handle; - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); - /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ - ret = remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, - 
vsize, vma->vm_page_prot); - if (unlikely(ret)) { - ehca_gen_err("remap_pfn_range() failed ret=%x", ret); - return -ENOMEM; - } + case 2: /* QP */ + spin_lock_irqsave(&ehca_qp_idr_lock, flags); + qp = idr_find(&ehca_qp_idr, idr_handle); + spin_unlock_irqrestore(&ehca_qp_idr_lock, flags); - vma->vm_private_data = mm_count; - (*mm_count)++; - vma->vm_ops = &vm_ops; + /* make sure this mmap really belongs to the authorized user */ + if (!qp) { + ehca_gen_err("qp is NULL ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; + } - return 0; -} + pd = container_of(qp->ib_qp.pd, struct ehca_pd, ib_pd); + if (pd->ownpid != cur_pid) { + ehca_err(qp->ib_qp.device, + "Invalid caller pid=%x ownpid=%x", + cur_pid, pd->ownpid); + return NOPAGE_SIGBUS; + } + + if (rsrc_type == 2) { /* rqueue */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_rqueue, offset); + ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); + } else if (rsrc_type == 3) { /* squeue */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp squeuearea", qp); + offset = address - vma->vm_start; + vaddr = ipz_qeit_calc(&qp->ipz_squeue, offset); + ehca_dbg(qp->ib_qp.device, "offset=%lx vaddr=%p", + offset, vaddr); + mypage = virt_to_page(vaddr); + } + break; + + default: + ehca_gen_err("bad queue type %x", q_type); + return NOPAGE_SIGBUS; + } -static int ehca_mmap_queue(struct vm_area_struct *vma, struct ipz_queue *queue, - u32 *mm_count) -{ - int ret; - u64 start, ofs; - struct page *page; - - vma->vm_flags |= VM_RESERVED; - start = vma->vm_start; - for (ofs = 0; ofs < queue->queue_length; ofs += PAGE_SIZE) { - u64 virt_addr = (u64)ipz_qeit_calc(queue, ofs); - page = virt_to_page(virt_addr); - ret = vm_insert_page(vma, start, page); - if (unlikely(ret)) { - ehca_gen_err("vm_insert_page() failed rc=%x", ret); - return ret; - } - start += PAGE_SIZE; + if (!mypage) { + ehca_gen_err("Invalid page adr==NULL 
ret=NOPAGE_SIGBUS"); + return NOPAGE_SIGBUS; } - vma->vm_private_data = mm_count; - (*mm_count)++; - vma->vm_ops = &vm_ops; - - return 0; -} + get_page(mypage); -static int ehca_mmap_cq(struct vm_area_struct *vma, struct ehca_cq *cq, - u32 rsrc_type) -{ - int ret; - - switch (rsrc_type) { - case 1: /* galpa fw handle */ - ehca_dbg(cq->ib_cq.device, "cq_num=%x fw", cq->cq_number); - ret = ehca_mmap_fw(vma, &cq->galpas, &cq->mm_count_galpa); - if (unlikely(ret)) { - ehca_err(cq->ib_cq.device, - "ehca_mmap_fw() failed rc=%x cq_num=%x", - ret, cq->cq_number); - return ret; - } - break; + return mypage; + } - case 2: /* cq queue_addr */ - ehca_dbg(cq->ib_cq.device, "cq_num=%x queue", cq->cq_number); - ret = ehca_mmap_queue(vma, &cq->ipz_queue, &cq->mm_count_queue); - if (unlikely(ret)) { - ehca_err(cq->ib_cq.device, - "ehca_mmap_queue() failed rc=%x cq_num=%x", - ret, cq->cq_number); - return ret; - } - break; - - default: - ehca_err(cq->ib_cq.device, "bad resource type=%x cq_num=%x", - rsrc_type, cq->cq_number); - return -EINVAL; - } - - return 0; -} - -static int ehca_mmap_qp(struct vm_area_struct *vma, struct ehca_qp *qp, - u32 rsrc_type) -{ - int ret; - - switch (rsrc_type) { - case 1: /* galpa fw handle */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x fw", qp->ib_qp.qp_num); - ret = ehca_mmap_fw(vma, &qp->galpas, &qp->mm_count_galpa); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "remap_pfn_range() failed ret=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return -ENOMEM; - } - break; - - case 2: /* qp rqueue_addr */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x rqueue", - qp->ib_qp.qp_num); - ret = ehca_mmap_queue(vma, &qp->ipz_rqueue, - &qp->mm_count_rqueue); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(rq) failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; - } - break; - - case 3: /* qp squeue_addr */ - ehca_dbg(qp->ib_qp.device, "qp_num=%x squeue", - qp->ib_qp.qp_num); - ret = ehca_mmap_queue(vma, &qp->ipz_squeue, - 
&qp->mm_count_squeue); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "ehca_mmap_queue(sq) failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; - } - break; - - default: - ehca_err(qp->ib_qp.device, "bad resource type=%x qp=num=%x", - rsrc_type, qp->ib_qp.qp_num); - return -EINVAL; - } - - return 0; -} +static struct vm_operations_struct ehcau_vm_ops = { + .nopage = ehca_nopage, +}; int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) { @@ -255,6 +175,7 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) u32 rsrc_type = (fileoffset >> 24) & 0xF; /* sq,rq,cmnd_window */ u32 cur_pid = current->tgid; u32 ret; + u64 vsize, physical; struct ehca_cq *cq; struct ehca_qp *qp; struct ehca_pd *pd; @@ -280,12 +201,44 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) if (!cq->ib_cq.uobject || cq->ib_cq.uobject->context != context) return -EINVAL; - ret = ehca_mmap_cq(vma, cq, rsrc_type); - if (unlikely(ret)) { - ehca_err(cq->ib_cq.device, - "ehca_mmap_cq() failed rc=%x cq_num=%x", - ret, cq->cq_number); - return ret; + switch (rsrc_type) { + case 1: /* galpa fw handle */ + ehca_dbg(cq->ib_cq.device, "cq=%p cq triggerarea", cq); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_err(cq->ib_cq.device, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + return -EINVAL; + } + + physical = cq->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + ehca_dbg(cq->ib_cq.device, + "vsize=%lx physical=%lx", vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + ehca_err(cq->ib_cq.device, + "remap_pfn_range() failed ret=%x", + ret); + return -ENOMEM; + } + break; + + case 2: /* cq queue_addr */ + ehca_dbg(cq->ib_cq.device, "cq=%p cq q_addr", cq); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = 
&ehcau_vm_ops; + break; + + default: + ehca_err(cq->ib_cq.device, "bad resource type %x", + rsrc_type); + return -EINVAL; } break; @@ -310,12 +263,50 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) if (!uobject || uobject->context != context) return -EINVAL; - ret = ehca_mmap_qp(vma, qp, rsrc_type); - if (unlikely(ret)) { - ehca_err(qp->ib_qp.device, - "ehca_mmap_qp() failed rc=%x qp_num=%x", - ret, qp->ib_qp.qp_num); - return ret; + switch (rsrc_type) { + case 1: /* galpa fw handle */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp triggerarea", qp); + vma->vm_flags |= VM_RESERVED; + vsize = vma->vm_end - vma->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_err(qp->ib_qp.device, "invalid vsize=%lx", + vma->vm_end - vma->vm_start); + return -EINVAL; + } + + physical = qp->galpas.user.fw_handle; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_IO | VM_RESERVED; + + ehca_dbg(qp->ib_qp.device, "vsize=%lx physical=%lx", + vsize, physical); + ret = remap_pfn_range(vma, vma->vm_start, + physical >> PAGE_SHIFT, vsize, + vma->vm_page_prot); + if (ret) { + ehca_err(qp->ib_qp.device, + "remap_pfn_range() failed ret=%x", + ret); + return -ENOMEM; + } + break; + + case 2: /* qp rqueue_addr */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp rqueue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + break; + + case 3: /* qp squeue_addr */ + ehca_dbg(qp->ib_qp.device, "qp=%p qp squeue_addr", qp); + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &ehcau_vm_ops; + break; + + default: + ehca_err(qp->ib_qp.device, "bad resource type %x", + rsrc_type); + return -EINVAL; } break; @@ -326,3 +317,77 @@ int ehca_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) return 0; } + +int ehca_mmap_nopage(u64 foffset, u64 length, void **mapped, + struct vm_area_struct **vma) +{ + down_write(&current->mm->mmap_sem); + *mapped = (void*)do_mmap(NULL,0, length, PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + foffset); + 
up_write(&current->mm->mmap_sem); + if (!(*mapped)) { + ehca_gen_err("couldn't mmap foffset=%lx length=%lx", + foffset, length); + return -EINVAL; + } + + *vma = find_vma(current->mm, (u64)*mapped); + if (!(*vma)) { + down_write(&current->mm->mmap_sem); + do_munmap(current->mm, 0, length); + up_write(&current->mm->mmap_sem); + ehca_gen_err("couldn't find vma queue=%p", *mapped); + return -EINVAL; + } + (*vma)->vm_flags |= VM_RESERVED; + (*vma)->vm_ops = &ehcau_vm_ops; + + return 0; +} + +int ehca_mmap_register(u64 physical, void **mapped, + struct vm_area_struct **vma) +{ + int ret; + unsigned long vsize; + /* ehca hw supports only 4k page */ + ret = ehca_mmap_nopage(0, EHCA_PAGESIZE, mapped, vma); + if (ret) { + ehca_gen_err("could'nt mmap physical=%lx", physical); + return ret; + } + + (*vma)->vm_flags |= VM_RESERVED; + vsize = (*vma)->vm_end - (*vma)->vm_start; + if (vsize != EHCA_PAGESIZE) { + ehca_gen_err("invalid vsize=%lx", + (*vma)->vm_end - (*vma)->vm_start); + return -EINVAL; + } + + (*vma)->vm_page_prot = pgprot_noncached((*vma)->vm_page_prot); + (*vma)->vm_flags |= VM_IO | VM_RESERVED; + + ret = remap_pfn_range((*vma), (*vma)->vm_start, + physical >> PAGE_SHIFT, vsize, + (*vma)->vm_page_prot); + if (ret) { + ehca_gen_err("remap_pfn_range() failed ret=%x", ret); + return -ENOMEM; + } + + return 0; + +} + +int ehca_munmap(unsigned long addr, size_t len) { + int ret = 0; + struct mm_struct *mm = current->mm; + if (mm) { + down_write(&mm->mmap_sem); + ret = do_munmap(mm, addr, len); + up_write(&mm->mmap_sem); + } + return ret; +} -- 1.5.2 From suri at baymicrosystems.com Wed Jul 25 09:54:04 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Wed, 25 Jul 2007 12:54:04 -0400 Subject: [ofa-general] installing 1.2-GA on Redhat EL5 In-Reply-To: <46A54659.8010608@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A54659.8010608@ichips.intel.com> Message-ID: <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice> Doug: I had ofed-1.2-rc1 installed on a server 
running redhat el5. I am trying to upgrade the release to ofed-1.2-GA. The build finished and the install gives me this error: -------------- Installing OFED software into /usr/local/ofed_1.2 Running /bin/rpm -ihv --force --nodeps /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm | ERROR: Failed executing "/bin/rpm -ihv --force --nodeps /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm " ------------------- Any ideas... Thanks, Suri > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > Of Sean Hefty > Sent: Monday, July 23, 2007 8:23 PM > To: Yevgeny Kliteynik > Cc: OpenIB > Subject: Re: [ofa-general] QoS RFC > > > 2.5. ULPs that use CM interface (like SRP) should have their own > > pre-assigned Service-ID and use it while obtaining PR/MPR for > > establishing connections. The SA receiving the PR/MPR should match it > > against the policy and return the appropriate PR/MPR including SL, > > MTU and RATE. > > We need to ensure that this can work without pre-assigned service IDs, > or at least service IDs that are assigned within a fairly wide range, > such as locally assigned IDs. > > > 2.6. ULPs and programs using CMA to establish RC connection should > > provide the CMA the target IP and Service-ID. Some of the ULPs might > > also provide QoS-Class (E.g. for SDP sockets that are provided the > > TOS socket option). The CMA should then use the provided Service-ID > > and optional QoS-Class and pass them in the PR/MPR request. The > > resulting PR/MPR should be used for configuring the connection QP. 
> > The interface to the CMA needs to remain as transport independent as > possible, and I am unsure of the transport independence of tying QoS to > the destination port number. (I'm not disagreeing; I'm just not sure at > the moment it's the right approach.) > > > PathRecord and MultiPathRecord enhancement for QoS: As mentioned > > above the PathRecord and MultiPathRecord attributes should be > > enhanced to carry the Service-ID which is a 64bit value, which has > > been standardized by the IBTA. A new field QoS-Class is also > > provided. A new capability bit should describe the SM QoS support in > > the SA class port info. This approach provides an easy migration path > > for existing access layer and ULPs by not introducing new set of > > PR/MPR attribute. > > Has any thought been given to how to make this scale? > > > 5. CMA features ---------------- > > > > The CMA interface supports Service-ID through the notion of port > > space as a prefixes to the port_num which is part of the sockaddr > > provided to rdma_resolve_add(). What is missing is the explicit > > request for a QoS-Class that should allow the ULP (like SDP) to > > propagate a specific request for a class of service. A mechanism for > > providing the QoS-Class is available in the IPv6 address, so we could > > use that address field. Another option is to implement a special > > connection options API for CMA. > > > > Missing functionality by CMA is the usage of the provided QoS-Class > > and Service-ID in the sent PR/MPR. When a response is obtained it is > > an existing requirement for the CMA to use the PR/MPR from the > > response in setting up the QP address vector. > > The most natural function to specify additional QoS parameters would be > rdma_resolve_route. 
> > - Sean > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From transter at gmail.com Wed Jul 25 09:57:50 2007 From: transter at gmail.com (lbt) Date: Wed, 25 Jul 2007 09:57:50 -0700 Subject: [ofa-general] Lost in-service traps during Open SM migration Message-ID: Hello, I have been seeing a problem where a subscriber for in-service traps is not getting informed when the port of master openSM is restored (i.e. causing an SM migration). I have an IB subnet with 2 nodes running OpenSM , different priorities of course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC trap events. I've been doing cable pull tests on the IB ports, to check if the in-service handler I have subscribed gets invoked when I restore the cable. I've noticed that everything works as expected ( i.e. my in-service handler is invoked) whenever I restore the cable on the lower priority SM IB port without ever touching the master SM port. But if I cause an SM migration, by restoring the port of the higher priority SM, the in-service trap does not get generated as expected on a cable restore. Steps to Reproduce: 1) Start with port to higher priority SM disconnected. 
2) Restore the port cable on the higher priority SM
   --> This causes an SM migration as expected, and the migration itself completes fine
   --> I expected the restoration of the higher priority SM port to also trigger an in-service trap and notify subscribers, but that does not happen

I have collected debug logs from both OpenSMs, and the reason appears to be:

1) In-service traps are generated from the ports that were added to the master SM's new_ports_list, but only after LID assignment
2) When the higher priority SM port is restored, the restored port gets added to the lower priority SM's new_ports_list (since it is still the master SM at that point in time)
3) The handover of master SM from the lower priority to the higher priority SM occurs before LID assignment, and therefore before any traps can be generated for the ports on new_ports_list
4) The higher priority SM is now the master SM, but its new_ports_list is empty, so it generates no trap either

Does this look like a legitimate OpenSM bug? Any feedback would be much appreciated, and if I can help further in any way, please let me know.

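
To make the suspected sequence concrete, here is a toy model of it in C. This is only a sketch and not OpenSM code: struct toy_sm and every toy_* function are made-up names for illustration, and it assumes that the pending list is simply not carried across a handover. The only value taken from the logs is the port GUID.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PORTS 8

/* Hypothetical, heavily simplified stand-in for one SM instance. */
struct toy_sm {
    bool is_master;
    unsigned long long new_ports[TOY_MAX_PORTS]; /* models new_ports_list */
    size_t n_new_ports;
};

/* During a sweep, only the current master queues a newly seen port. */
static void toy_discover_port(struct toy_sm *sm, unsigned long long guid)
{
    if (sm->is_master && sm->n_new_ports < TOY_MAX_PORTS)
        sm->new_ports[sm->n_new_ports++] = guid;
}

/* Handover: mastership moves, but the pending list stays behind
 * on the old master (this is the assumption being illustrated). */
static void toy_handover(struct toy_sm *from, struct toy_sm *to)
{
    from->is_master = false;
    to->is_master = true;
}

/* After LID assignment, the master reports its pending ports.
 * Returns the number of in-service traps that would be sent. */
static size_t toy_report_new_ports(struct toy_sm *sm)
{
    size_t sent = 0;

    if (sm->is_master) {
        sent = sm->n_new_ports;
        sm->n_new_ports = 0;
    }
    return sent;
}

/* Steps 1-4 above: port seen by the old master, then handover,
 * then reporting. Returns the total number of traps sent. */
static size_t toy_replay_migration(void)
{
    struct toy_sm low = { .is_master = true };
    struct toy_sm high = { .is_master = false };

    toy_discover_port(&low, 0x00504501483e0000ULL);
    toy_handover(&low, &high);
    return toy_report_new_ports(&high) + toy_report_new_ports(&low);
}

/* Same discovery without a handover, for comparison. */
static size_t toy_replay_no_migration(void)
{
    struct toy_sm low = { .is_master = true };

    toy_discover_port(&low, 0x00504501483e0000ULL);
    return toy_report_new_ports(&low);
}
```

In this model toy_replay_no_migration() reports one trap while toy_replay_migration() reports none, which matches what I observe: the port is queued on the SM that is about to step down, and the SM that takes over starts with an empty list.
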
Subset of logs from lower priority SM during the cable restore of higher priority SM port: ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A TID:0x00000016000012e1 ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE ### 14:31:56 ******************** INITIATING HEAVY SWEEP ********************** ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SELF Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port GUID:0x00504501483e0000 to new_ports_list Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET 14:31:56 ********************* HEAVY SWEEP COMPLETE *********************** Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER### 14:31:56 ******************** ENTERING SM STANDBY STATE ******************* Subset of logs from higher priority SM during the cable restore of higher priority SM port: Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [ Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state IB_SMINFO_STATE_DISCOVERING Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: ******************** ENTERING SM MASTER STATE ******************** Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg: **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** Jul 18 14:32:03 024047 [41E02960] -> 
__osm_state_mgr_lid_assign_msg ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [ ----> no in-service traps are generated and notices forwarded because there are no ports on this list Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ] Thanks! Lan -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Wed Jul 25 09:58:46 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Jul 2007 09:58:46 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A6F50C.5000906@voltaire.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> Message-ID: <46A78146.1090304@ichips.intel.com> > I am willing to go with the local sa coming to serve large MPI jobs, so > you load as a prerequisite to spawning large all-to-all job. > > But, I think the default for IPoIB needs to be usage of non cached PR. I think this ties together two things that aren't directly related. We have two network stacks running on top of each other here. Their policies should be separate. As an example, let's reverse this. Imagine instead that you implement IB over IP. Should an IB path refresh policy dictate that IP update its ARP tables? Or, looking at it differently, do you prevent IP from updating the ARP table unless the IB stack asks for it? The policy for local PR caching should be set by an administrator. Now, we could provide a policy setting that ties it to the ARP cache, which sounds like a good idea. This will be less efficient in some use models, more efficient in others. But not all PRs belong to IPoIB, so we need a way to handle this. 
However, I don't believe that we have to always enforce such a policy, especially since the current stack doesn't have this behavior today. - Sean From dledford at redhat.com Wed Jul 25 09:59:50 2007 From: dledford at redhat.com (Doug Ledford) Date: Wed, 25 Jul 2007 16:59:50 +0000 Subject: [ofa-general] Re: installing 1.2-GA on Redhat EL5 In-Reply-To: <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice> References: <46A283B6.1070105@dev.mellanox.co.il> <46A54659.8010608@ichips.intel.com> <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice> Message-ID: <1185382791.5165.665.camel@firewall.xsintricity.com> On Wed, 2007-07-25 at 12:54 -0400, Suresh Shelvapille wrote: > Doug: > > I had ofed-1.2-rc1 installed on a server running redhat el5. I am trying to upgrade the > release to ofed-1.2-GA. The build finished and the install gives me this error: > > > -------------- > > Installing OFED software into /usr/local/ofed_1.2 > > Running /bin/rpm -ihv --force --nodeps > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm > | > ERROR: Failed executing "/bin/rpm -ihv --force --nodeps > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2-2.6.18_8.el5.x86_64.rpm > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2-2.6.18_8.el5.x86_64.rpm > " > > ------------------- > > Any ideas... Not from this output. I would need the actual rpm error messages to know what's wrong. Try running the above rpm command by hand and copy-n-pasting the errors. 
> > Thanks, > Suri > > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf > > Of Sean Hefty > > Sent: Monday, July 23, 2007 8:23 PM > > To: Yevgeny Kliteynik > > Cc: OpenIB > > Subject: Re: [ofa-general] QoS RFC > > > > > 2.5. ULPs that use CM interface (like SRP) should have their own > > > pre-assigned Service-ID and use it while obtaining PR/MPR for > > > establishing connections. The SA receiving the PR/MPR should match it > > > against the policy and return the appropriate PR/MPR including SL, > > > MTU and RATE. > > > > We need to ensure that this can work without pre-assigned service IDs, > > or at least service IDs that are assigned within a fairly wide range, > > such as locally assigned IDs. > > > > > 2.6. ULPs and programs using CMA to establish RC connection should > > > provide the CMA the target IP and Service-ID. Some of the ULPs might > > > also provide QoS-Class (E.g. for SDP sockets that are provided the > > > TOS socket option). The CMA should then use the provided Service-ID > > > and optional QoS-Class and pass them in the PR/MPR request. The > > > resulting PR/MPR should be used for configuring the connection QP. > > > > The interface to the CMA needs to remain as transport independent as > > possible, and I am unsure of the transport independence of tying QoS to > > the destination port number. (I'm not disagreeing; I'm just not sure at > > the moment it's the right approach.) > > > > > PathRecord and MultiPathRecord enhancement for QoS: As mentioned > > > above the PathRecord and MultiPathRecord attributes should be > > > enhanced to carry the Service-ID which is a 64bit value, which has > > > been standardized by the IBTA. A new field QoS-Class is also > > > provided. A new capability bit should describe the SM QoS support in > > > the SA class port info. 
This approach provides an easy migration path > > > for existing access layer and ULPs by not introducing new set of > > > PR/MPR attribute. > > > > Has any thought been given to how to make this scale? > > > > > 5. CMA features ---------------- > > > > > > The CMA interface supports Service-ID through the notion of port > > > space as a prefixes to the port_num which is part of the sockaddr > > > provided to rdma_resolve_add(). What is missing is the explicit > > > request for a QoS-Class that should allow the ULP (like SDP) to > > > propagate a specific request for a class of service. A mechanism for > > > providing the QoS-Class is available in the IPv6 address, so we could > > > use that address field. Another option is to implement a special > > > connection options API for CMA. > > > > > > Missing functionality by CMA is the usage of the provided QoS-Class > > > and Service-ID in the sent PR/MPR. When a response is obtained it is > > > an existing requirement for the CMA to use the PR/MPR from the > > > response in setting up the QP address vector. > > > > The most natural function to specify additional QoS parameters would be > > rdma_resolve_route. > > > > - Sean > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From eitan at mellanox.co.il Wed Jul 25 10:44:39 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 25 Jul 2007 20:44:39 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <20070725001847.GG25264@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> Hi Sasha I am not following you. Why do a user need to run -y if a simple legal cable connector is plugged? The issue is only if a "loop back" plug connecting a port to itself is plugged. Do users use these plugs? For what sake? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, July 25, 2007 3:19 AM > To: Eitan Zahavi > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > On 23:25 Tue 24 Jul , Eitan Zahavi wrote: > > > > On 7/24/07, Eitan Zahavi wrote: > > > > Maybe avoid the log if -y is provided? > > > > > > That avoids the spew but the duplicated GUID is > important to know so > > IMO something in the "middle" is needed where duplicated GUIDs are > > logged but not continually the same ones. > > [EZ] > > OK so in -y mode only we track which ones were reported > and do not > > repeat the log? > > And how port moving problem should be solved? 
> > We cannot ask an user to run OpenSM with '-y' if in her/his > plans to reconnect some ports in a future and just decrease logging. > > Sasha > From hal.rosenstock at gmail.com Wed Jul 25 10:46:31 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 13:46:31 -0400 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED65FA@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> Message-ID: On 7/24/07, Eitan Zahavi wrote: > > ** > > > On 7/24/07, Eitan Zahavi wrote: > > > > *Maybe avoid the log if -y is provided?* > > > ** > That avoids the spew but the duplicated GUID is important to know so IMO > something in the "middle" is needed where duplicated GUIDs are logged but > not continually the same ones. > *[EZ] OK so in -y mode only we track which ones were reported and do not > repeat the log? > * > > Any good ideas on how to accomplish this ? -- Hal *Eitan Zahavi*** > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > ** > > > > ------------------------------ > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > *Sent:* Tuesday, July 24, 2007 9:56 PM > > *To:* Eitan Zahavi > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > *Hi Hal,* > > > ** > > > *For many users such a critical failure (one the SM can not really do > > > anything with) is better aborted then forgotten in some log file.* > > > *Anyway's the -y flag lets you ignore it if you like.* > > > > > > > So everything else continues to work fine with -y ? In which case, I'm > > not sure which is the better default. > > > > Users certainly won't like their logs filling up with continuous > > duplicated GUID messages. The log spew should be cleaned up IMO. > > > > -- Hal > > > > > > > > > > > > > *Eitan Zahavi*** > > > Senior Engineering Director, Software Architect > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > ------------------------------ > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > *Sent:* Tuesday, July 24, 2007 9:38 PM > > > *To:* Eitan Zahavi > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > *Hi Hal,* > > > > ** > > > > *The code to find "duplicated" GUIDs stem from real user cases where > > > > flawed * > > > > *burning procedure caused actual GUID duplications. There is nothing > > > > "impossible". * > > > > > > > > > > No one said impossible; just a violation of what globally unique (GU > > > from GUID) really means. 
It's largely because vendors allowed users to > > > program non volatile RAM for GUIDs rather than a real manufacturing process > > > for this which guarantees uniqueness that we are even discussing this aspect > > > of it. > > > > > > *So it is really critical the the SM will be able to recognize this > > > > case and abort.* > > > > > > > > > > I agree with the detect part but not the abort part. Why can't it > > > report these errors and continue on ? That seems better to me than aborting. > > > > > > -- Hal > > > > > > > > > > *It might be that for testing someone wants to use a loopback plug > > > > that cause the same * > > > > *port GUID appear on both sides of link - but it is better to > > > > require the user doing the test * > > > > *to set some flag than to miss such a situation in real life > > > > cluster.* > > > > ** > > > > *This requirement was written after many people wasted many hours > > > > trying to figure out what was going on.* > > > > *PLEASE DO NOT TAKE IT AWAY* > > > > ** > > > > > > > > *Eitan Zahavi*** > > > > Senior Engineering Director, Software Architect > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > ------------------------------ > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > > *Sent:* Tuesday, July 24, 2007 6:04 PM > > > > *To:* Eitan Zahavi > > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ] > > > > > *Sent:* Tuesday, July 24, 2007 5:53 PM > > > > > *To:* Eitan Zahavi > > > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik > > > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > > > > > > Hi Eitan, > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > > > > > *Hi Hal,* > > > > > > ** > > > > > > *What is this "loopback" connector used for?* > > > > > > *Does not seem to me like a very useful thing to do.* > > > > > > > > > > > ** > > > > > Perhaps not but no reason OpenSM can't handle this more > > > > > gracefully. > > > > > > > > > > *Anyway, if it is not a production environment we could add a > > > > > > "debug mode" (-d flag option) to ignore this check.* > > > > > > > > > > > ** > > > > > Why would a separate flag be needed ? > > > > > *[EZ] Since I do not see any other solution for the SM to know it > > > > > is really a loop back plug rather then two devices with same GUID connected > > > > > back to back ... * > > > > > > > > > > > > > > "Technically", this should only occur when looped back and not two > > > > devices with same GUID as GUID == globally unique and a duplication > > > > indicates a "manufacturing" issue. > > > > > > > > Anyhow, can't these be treated the same (and handled more > > > > gracefully) without an additional option/flag ? 
> > > > > > > > -- Hal > > > > > > > > > > > > > -- Hal > > > > > > > > > > ** > > > > > > > > > > > > *Eitan Zahavi*** > > > > > > Senior Engineering Director, Software Architect > > > > > > Mellanox Technologies LTD > > > > > > Tel:+972-4-9097208 > > > > > > Fax:+972-4-9593245 > > > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > > > > ------------------------------ > > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > > > > *Sent: *Tuesday, July 24, 2007 5:31 PM > > > > > > *To:* OpenFabrics General > > > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik > > > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > This is what starts off as a "minor" issue and I know it has > > > > > > been discussed it somewhat in the past: > > > > > > > > > > > > Putting a loopback connector on a (switch) link causes OpenSM to > > > > > > indicate duplicated GUID error 0D18 as follows: > > > > > > > > > > > > __osm_ni_rcv_set_links > > > > > > { > > > > > > ... > > > > > > /* > > > > > > When there are only two nodes with exact same guids > > > > > > (connected back > > > > > > to back) - the previous check for duplicated guid > > > > > > will not catch > > > > > > them. But the link will be from the port to > > > > > > itself... > > > > > > Enhanced Port 0 is an exception to this > > > > > > */ > > > > > > if ((osm_node_get_node_guid( p_node ) == > > > > > > p_ni_context->node_guid) && > > > > > > (port_num == p_ni_context->port_num) && > > > > > > (port_num != 0)) > > > > > > { > > > > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > > > > "__osm_ni_rcv_set_links: ERR 0D18: " > > > > > > "Duplicate GUID found by link from a port > > > > > > to itself:" > > > > > > "node 0x%" PRIx64 ", port number 0x%X\n", > > > > > > cl_ntoh64( osm_node_get_node_guid( p_node ) > > > > > > ), > > > > > > port_num ); > > > > > > ... 
> > > > > > > > > > > > So this occurs over and over and over and fills the log with the > > > > > > same spew. This should be improved IMO. > > > > > > > > > > > > Is this really a fatal condition ? Doesn't seem like it should > > > > > > be to me. > > > > > > > > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is > > > > > > that safe for this condition ? > > > > > > > > > > > > Seems like something like an extra loopback bit should be added > > > > > > to some port structure which should cause these links to be ignored. This > > > > > > bit would then be reset when the peer is now longer itself. > > > > > > > > > > > > Also, is there a relationship of this with the 12x/duplicated > > > > > > GUID code ? > > > > > > > > > > > > Thanks. > > > > > > > > > > > > -- Hal > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Wed Jul 25 10:53:42 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 25 Jul 2007 10:53:42 -0700 Subject: [ofa-general] openSM: Different IB MTUs In-Reply-To: Message-ID: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Jul 25 10:53:58 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 13:53:58 -0400 Subject: [ofa-general] osm_physp_calc_link_ops question Message-ID: Hi, Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and osm_link_mgr.c:__osm_link_mgr_set_physp_pi call osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote end is invalid, the local VLCap is used as the OperationalVLs. When the VLCaps at the two ends of the link do not match, this is not a good thing. 
It causes trap storms on the flow control watchdog timer expiring. Wouldn't it be better to leave this field as is in this case or would that cause some other problem ? Same thing might also be true for link MTU but not as critical. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Wed Jul 25 10:57:47 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 13:57:47 -0400 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: References: Message-ID: Shirley, On 7/25/07, Shirley Ma wrote: > > Hello Hal, > > How does openSM handle CAs with different MTUs in the same subnet? For > example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM > pick up the smallest MTU in the subnet? > Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks > Shirley Ma > -------------- next part -------------- An HTML attachment was scrubbed... URL: From suri at baymicrosystems.com Wed Jul 25 11:04:14 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Wed, 25 Jul 2007 14:04:14 -0400 Subject: [ofa-general] RE: installing 1.2-GA on Redhat EL5 In-Reply-To: <1185382791.5165.665.camel@firewall.xsintricity.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A54659.8010608@ichips.intel.com> <00dc01c7cedc$6c0f2f90$1914a8c0@surioffice> <1185382791.5165.665.camel@firewall.xsintricity.com> Message-ID: <010001c7cee6$396a3490$1914a8c0@surioffice> It was a space limitation issue, fixed it...thanks. -Suri > -----Original Message----- > From: Doug Ledford [mailto:dledford at redhat.com] > Sent: Wednesday, July 25, 2007 1:00 PM > To: Suresh Shelvapille > Cc: 'OpenIB' > Subject: Re: installing 1.2-GA on Redhat EL5 > > On Wed, 2007-07-25 at 12:54 -0400, Suresh Shelvapille wrote: > > Doug: > > > > I had ofed-1.2-rc1 installed on a server running redhat el5. 
I am trying to upgrade the > > release to ofed-1.2-GA. The build finished and the install gives me this error: > > > > > > -------------- > > > > Installing OFED software into /usr/local/ofed_1.2 > > > > Running /bin/rpm -ihv --force --nodeps > > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2- > 2.6.18_8.el5.x86_64.rpm > > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2- > 2.6.18_8.el5.x86_64.rpm > > | > > ERROR: Failed executing "/bin/rpm -ihv --force --nodeps > > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-1.2- > 2.6.18_8.el5.x86_64.rpm > > /root/ofed_1.2-GA/OFED-1.2-20070626-0917/RPMS/redhat-release-5Server-5.0.0.9/kernel-ib-devel-1.2- > 2.6.18_8.el5.x86_64.rpm > > " > > > > ------------------- > > > > Any ideas... > > Not from this output. I would need the actual rpm error messages to > know what's wrong. Try running the above rpm command by hand and > copy-n-pasting the errors. > > > > > Thanks, > > Suri > > > > > > > -----Original Message----- > > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On > Behalf > > > Of Sean Hefty > > > Sent: Monday, July 23, 2007 8:23 PM > > > To: Yevgeny Kliteynik > > > Cc: OpenIB > > > Subject: Re: [ofa-general] QoS RFC > > > > > > > 2.5. ULPs that use CM interface (like SRP) should have their own > > > > pre-assigned Service-ID and use it while obtaining PR/MPR for > > > > establishing connections. The SA receiving the PR/MPR should match it > > > > against the policy and return the appropriate PR/MPR including SL, > > > > MTU and RATE. > > > > > > We need to ensure that this can work without pre-assigned service IDs, > > > or at least service IDs that are assigned within a fairly wide range, > > > such as locally assigned IDs. > > > > > > > 2.6. 
ULPs and programs using CMA to establish RC connection should > > > > provide the CMA the target IP and Service-ID. Some of the ULPs might > > > > also provide QoS-Class (E.g. for SDP sockets that are provided the > > > > TOS socket option). The CMA should then use the provided Service-ID > > > > and optional QoS-Class and pass them in the PR/MPR request. The > > > > resulting PR/MPR should be used for configuring the connection QP. > > > > > > The interface to the CMA needs to remain as transport independent as > > > possible, and I am unsure of the transport independence of tying QoS to > > > the destination port number. (I'm not disagreeing; I'm just not sure at > > > the moment it's the right approach.) > > > > > > > PathRecord and MultiPathRecord enhancement for QoS: As mentioned > > > > above the PathRecord and MultiPathRecord attributes should be > > > > enhanced to carry the Service-ID which is a 64bit value, which has > > > > been standardized by the IBTA. A new field QoS-Class is also > > > > provided. A new capability bit should describe the SM QoS support in > > > > the SA class port info. This approach provides an easy migration path > > > > for existing access layer and ULPs by not introducing new set of > > > > PR/MPR attribute. > > > > > > Has any thought been given to how to make this scale? > > > > > > > 5. CMA features ---------------- > > > > > > > > The CMA interface supports Service-ID through the notion of port > > > > space as a prefixes to the port_num which is part of the sockaddr > > > > provided to rdma_resolve_add(). What is missing is the explicit > > > > request for a QoS-Class that should allow the ULP (like SDP) to > > > > propagate a specific request for a class of service. A mechanism for > > > > providing the QoS-Class is available in the IPv6 address, so we could > > > > use that address field. Another option is to implement a special > > > > connection options API for CMA. 
> > > > > > > > Missing functionality by CMA is the usage of the provided QoS-Class > > > > and Service-ID in the sent PR/MPR. When a response is obtained it is > > > > an existing requirement for the CMA to use the PR/MPR from the > > > > response in setting up the QP address vector. > > > > > > The most natural function to specify additional QoS parameters would be > > > rdma_resolve_route. > > > > > > - Sean > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- > Doug Ledford > GPG KeyID: CFBFF194 > http://people.redhat.com/dledford > > Infiniband specific RPMs available at > http://people.redhat.com/dledford/Infiniband From ardavis at ichips.intel.com Wed Jul 25 11:39:44 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 25 Jul 2007 11:39:44 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46968448.2000401@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> Message-ID: <46A798F0.5070902@ichips.intel.com> > I would like to propose adding project directories under > http://www.openfabrics.org/downloads/ where appropriate and give > maintainers access. 
For example: > Jeff, please add the following directories with maintainer access as follow (or grant access at a maintainer group level): http://www.openfabrics.org/downloads/verbs (rdreier) http://www.openfabrics.org/downloads/rdmacm (shefty) http://www.openfabrics.org/downloads/dapl (ardavis) http://www.openfabrics.org/downloads/sdp (eitan) http://www.openfabrics.org/downloads/utils (eitan) http://www.openfabrics.org/downloads/management (sashak) http://www.openfabrics.org/downloads/OFED (vlad) http://www.openfabrics.org/downloads/archives (vlad) http://www.openfabrics.org/downloads/WinOF (ssmith) (Stan Smith will need an account) http://www.openfabrics.org/downloads/hw/mthca (rdreir) http://www.openfabrics.org/downloads/hw/mlx4 (rdreir) http://www.openfabrics.org/downloads/hw/ehca (raisch) http://www.openfabrics.org/downloads/hw/ipath (ralphc) http://www.openfabrics.org/downloads/hw/cxgb3 (ralphc) http://www.openfabrics.org/downloads/mpi/mvapich (pasha) http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland) http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres) Let us know when these directories are created and the maintainers, who want to expose their packages via the webpage, will create a README that details the contents of the directory along with WEB_README that provides a short description for the webpage. Will this format allow you to auto configure the download webpage sufficiently? The idea is to only add links/descriptions to those project sub-directories with WEB_README files present. Please advise if something on the list is wrong or we missed a project. Thanks, -arlin
From michaelc at cs.wisc.edu Wed Jul 25 11:52:48 2007 From: michaelc at cs.wisc.edu (Mike Christie) Date: Wed, 25 Jul 2007 13:52:48 -0500 Subject: [ofa-general] Re: [PATCH trivial] include linux/mutex.h from scsi_transport_iscsi.h In-Reply-To: <20070725110907.GF3826@mellanox.co.il> References: <20070725110907.GF3826@mellanox.co.il> Message-ID: <46A79C00.200@cs.wisc.edu> Michael S. Tsirkin wrote: > scsi/scsi_transport_iscsi.h uses struct mutex, so while > linux/mutex.h seems to be pulled in indirectly > by one of the headers it includes, the right thing > is to include linux/mutex.h directly.
> Is that part about always including the header directly right? If so then were you going to include list.h too, and were you going to fix up some of the other iscsi code? From xma at us.ibm.com Wed Jul 25 11:55:06 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 25 Jul 2007 11:55:06 -0700 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: Message-ID: Hal, Thanks for your prompt reply. I am asking how openSM handles different link MTUs when it sets the SA MCMemberRecord MTU. For example, suppose some links have an MTU of 2K and others an MTU of 1K. When IPoIB is enabled, how does the SM decide the MCMemberRecord MTU of the IPoIB broadcast group? When an IB multicast group is first created from a 2K MTU node, which PMTU value is attached to that group's MCMemberRecord MTU? Thanks Shirley Ma "Hal Rosenstock" 07/25/07 10:57 AM To: Shirley Ma/Beaverton/IBM at IBMUS cc: general at lists.openfabrics.org Subject: Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pic03781.gif Type: image/gif Size: 1255 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: From ckr at bigplanet.com Wed Jul 25 11:55:29 2007 From: ckr at bigplanet.com (Rankin V.
Bartholomew) Date: Wed, 25 Jul 2007 13:55:29 -0500 Subject: [ofa-general] Notification Message-ID: <46A79CA1.6000606@bigplanet.com> -------------- next part -------------- A non-text attachment was scrubbed... Name: Notification.pdf Type: application/pdf Size: 11872 bytes Desc: not available URL: From hal.rosenstock at gmail.com Wed Jul 25 12:01:09 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 15:01:09 -0400 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: References: Message-ID: Shirley, On 7/25/07, Shirley Ma wrote: > > Hal, > > Thanks for your prompt reply. I am asking for how openSM handle different > link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU > as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide > IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast > group from a 2K MTU node first, which PMTU value is attaching to this IB > multicast group MCMemberRecord MTU? > The MCMemberRecord MTU is the group MTU, fixed when the group is created. It is set either by the first joiner that supplies sufficient components or by preconfiguration (the MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. -- Hal Thanks > Shirley Ma > > "Hal Rosenstock" > 07/25/07 10:57 AM > To: Shirley Ma/Beaverton/IBM at IBMUS > cc: general at lists.openfabrics.org > Subject: Re: openSM: Different IB MTUs > Shirley, > > On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote: > > Hello Hal, > > How does openSM handle CAs with different MTUs in the same subnet? > For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM > pick up the smallest MTU in the subnet? > > > > Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA > MCMemberRecord MTU, or all of these ?
> > -- Hal > > Thanks > Shirley Ma > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: From mst at dev.mellanox.co.il Wed Jul 25 12:12:57 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 25 Jul 2007 22:12:57 +0300 Subject: [ofa-general] Re: [PATCH trivial] include linux/mutex.h from scsi_transport_iscsi.h In-Reply-To: <46A79C00.200@cs.wisc.edu> References: <20070725110907.GF3826@mellanox.co.il> <46A79C00.200@cs.wisc.edu> Message-ID: <20070725191257.GA2311@mellanox.co.il> > Quoting Mike Christie : > Subject: Re: [PATCH trivial] include linux/mutex.h from scsi_transport_iscsi.h > > Michael S. Tsirkin wrote: > >scsi/scsi_transport_iscsi.h uses struct mutex, so while > >linux/mutex.h seems to be pulled in indirectly > >by one of the headers it includes, the right thing > >is to include linux/mutex.h directly. > > > > Is that part about always including the header directly right? I think so. Analogous patches of mine have been accepted in various subsystems. See e.g. f8916c11a4dc4cb2367e9bee1788f4e0f1b4eabc. > If so > then were you going to include list.h too, Makes sense. I'll repost. > and were you going to fix up > some of the other iscsi code? Not at the moment. I only noticed this because I am working on another project. I'll post patches for other files if/when I notice any issues. -- MST From mst at dev.mellanox.co.il Wed Jul 25 12:16:00 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 25 Jul 2007 22:16:00 +0300 Subject: [ofa-general] Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality In-Reply-To: <200707251827.57095.hnguyen@linux.vnet.ibm.com> References: <200707251827.57095.hnguyen@linux.vnet.ibm.com> Message-ID: <20070725191600.GA29664@mellanox.co.il> > Quoting Hoang-Nam Nguyen : > Subject: Re: [PATCH ofed-1.2-rc3 2/4] ehca: backport for rhel-4.5 - mmap functonality > > Hi Michael, > Below is the version without conflicts. And it should compile. Seems to apply fine. I pushed it out. Vlad, can you take it pls? > As soon as the build scripts are ready, I'll test the whole backport. What kind of scripts are you waiting for? -- MST From eaburns at iol.unh.edu Wed Jul 25 12:22:30 2007 From: eaburns at iol.unh.edu (Ethan Burns) Date: Wed, 25 Jul 2007 15:22:30 -0400 Subject: [ofa-general] iSER header In-Reply-To: <46933130.6040100@voltaire.com> References: <20070709144702.GB24125@postal.iol.unh.edu> <46933130.6040100@voltaire.com> Message-ID: <20070725192230.GA13579@postal.iol.unh.edu> On Tue, Jul 10, 2007 at 10:11:44AM +0300, Erez Zilber wrote: [...] > The iSER header issue was discussed in the open-iscsi list: > > http://groups.google.com/group/open-iscsi/browse_thread/thread/23ee18054e8412e6/fd4182f0b141c2da?lnk=gst&q=iSER%2FiWARP+Support+in+version+2.6.20&rnum=1#fd4182f0b141c2da > > For some reason, another answer given by Mike Ko does not appear in this > thread. Here it is: > > For Infiniband, if both the initiator and the target support Zero-Based > Virtual Address, then the iSER header as defined in the IETF draft will > be used. (Zero-based Virtual Address is used in iWARP but optional to > implement in Infiniband.) However, if either the initiator or the target > in an Infiniband environment does not support Zero-Based Virtual > Address, then the expanded iSER header as defined in the Infiniband > annex is used. This expanded iSER header is only used in Infiniband. 
> There is no intention to provide a link in the IETF draft since this is > purely an Infiniband issue. Ok, so this isn't something that I will need to worry a lot about if I am planning on using iWARP? > I hope this helps. It does, thank you. > BTW - do you plan to use the current iSER initiator > code for iWARP? Yes, we are working on an iSER-assisted initiator and target using this code and the UNH iSCSI implementation. Thanks again, Ethan Burns From mshefty at ichips.intel.com Wed Jul 25 12:23:39 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Jul 2007 12:23:39 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46A798F0.5070902@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> Message-ID: <46A7A33B.4080201@ichips.intel.com> > http://www.openfabrics.org/downloads/mpi/mvapich (pasha) > http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland) > http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres) Are all of these MPI versions distributed by OFA? If they have other official sites, should we instead direct users to that site? Or will this be automated enough that people can provide their own links? - Sean From eitan at mellanox.co.il Wed Jul 25 12:25:56 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 25 Jul 2007 22:25:56 +0300 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> Hi Shirley, I think I understand where your question comes from... Many people have issues with heterogeneous fabrics, where not all nodes have the same MTU or speed, especially since IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. If a node joins an existing MC group it has to have a rate (speed * width) > MCG.rate and support MTU > MCG.MTU; otherwise it is denied.
If the join is actually a "create", the node has to provide the rate and MTU, which define the MCG values. To let the administrator control the MTU and rate of the IPoIB MCGs, OpenSM provides the means to set these values per partition; see doc/partition-config.doc. Still, the administrator should know the lowest MTU and rate among the nodes expected to join the IPoIB subnet. The tradeoff is in the hands of the administrator, who can either set a value that prevents slow nodes from joining the group or assign a low value that fits all nodes but slows down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied.
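A minimal sketch in C may make the create/join rule discussed in this thread concrete. It is an illustration under stated assumptions, not OpenSM code: the names mc_group, mc_member, mc_group_create and mc_join_allowed are hypothetical, rate is simplified to a plain Gbps number instead of the encoded IB rate value, and the comparison is read as "at least the group's MTU and rate" so that a member with exactly the group's values is accepted.

```c
#include <stdbool.h>

/* IB MTU codes as carried in PortInfo and MCMemberRecord. */
enum ib_mtu_code {
    IB_MTU_256  = 1,
    IB_MTU_512  = 2,
    IB_MTU_1024 = 3,
    IB_MTU_2048 = 4,
    IB_MTU_4096 = 5
};

/* Hypothetical, simplified group and member records (not OpenSM types). */
struct mc_group {
    enum ib_mtu_code mtu;
    unsigned rate_gbps;   /* assumption: plain Gbps, not the IB rate encoding */
};

struct mc_member {
    enum ib_mtu_code mtu;
    unsigned rate_gbps;
};

/* A "create" join: the creator's values define the group. */
struct mc_group mc_group_create(const struct mc_member *creator)
{
    struct mc_group g = { creator->mtu, creator->rate_gbps };
    return g;
}

/* A join to an existing group: denied if MTU or rate is insufficient. */
bool mc_join_allowed(const struct mc_group *g, const struct mc_member *m)
{
    return m->mtu >= g->mtu && m->rate_gbps >= g->rate_gbps;
}
```

Under these assumptions, a group created first by a 2K MTU node carries MTU 2048 and a later 1K MTU joiner is refused; if the 1K node creates the group, or the administrator preconfigures 1K for the partition, both kinds of node can join.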
-- Hal Thanks Shirley Ma "Hal Rosenstock" <hal.rosenstock at gmail.com> 07/25/07 10:57 AM To: Shirley Ma/Beaverton/IBM at IBMUS cc: general at lists.openfabrics.org Subject: Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: ecblank.gif URL: From sashak at voltaire.com Wed Jul 25 12:48:56 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Jul 2007 22:48:56 +0300 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> Message-ID: <20070725194856.GB31582@sashak.voltaire.com> Hi Eitan, Hal, On 20:44 Wed 25 Jul , Eitan Zahavi wrote: > > I am not following you. > Why do a user need to run -y if a simple legal cable connector is > plugged?
Because the duplicated GUIDs detector can abort OpenSM when a regular port is reconnected to another location during a hard sweep. This issue is not related to the loopback plug at all. > The issue is only if a "loop back" plug connecting a port to itself is > plugged. No, not only. There are now two completely separate known issues with the duplicated GUIDs detector: 1. Port moving 2. Loopback plug I think that _both_ should be solved. While just using '-y' could be suitable for (2), because it is an esoteric (although perfectly legal) use, it is not an acceptable solution for (1). I think we need to improve the GUID duplication detector instead. For example, we could add a NodeInfo comparison there and raise the GUID duplication error only if the NodeInfo differs. Also, I think this should not be a fatal error and should not abort OpenSM; just logging (probably via syslog too) should be sufficient, since a non-working port is a good reason to look at the logs. Any other ideas? Sasha > Do users use these plugs? For what sake? > > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Wednesday, July 25, 2007 3:19 AM > > To: Eitan Zahavi > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > > > On 23:25 Tue 24 Jul , Eitan Zahavi wrote: > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > Maybe avoid the log if -y is provided? > > > > > > > > > That avoids the spew but the duplicated GUID is > > important to know so > > > IMO something in the "middle" is needed where duplicated GUIDs are > > > logged but not continually the same ones. > > > [EZ] > > > OK so in -y mode only we track which ones were reported > > and do not > > > repeat the log? > > > > And how port moving problem should be solved?
> > > > We cannot ask an user to run OpenSM with '-y' if in her/his > > plans to reconnect some ports in a future and just decrease logging. > > > > Sasha > > From xma at us.ibm.com Wed Jul 25 12:45:17 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 25 Jul 2007 12:45:17 -0700 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> Message-ID: Hello Eitan, Hal, Thanks. It's good that openSM has a configuration option to set these attributes for MC groups. Would it be a good idea to add the following to openSM: when no MTU is defined in the configuration file, the SM picks the smallest link MTU in the fabric by default? MTU is unlike rate: a slower rate might indicate a cabling problem. So using the smallest link MTU in the fabric might not be a bad default for MC. The reason for my request is that when an IP multicast group is created, MTU is not an attribute of the group. When IP multicast is mapped to IB multicast, the IB multicast join might fail because of different IB link MTU sizes in the group, while the IP multicast group is created successfully without knowing about the failure. If the admin sets the MTU in the configuration file, the admin would know about this failure. Otherwise, admins and users could spend too much time debugging their broken multicast applications. Thanks Shirley Ma "Eitan Zahavi" 07/25/07 12:25 PM To: "Hal Rosenstock", Shirley Ma/Beaverton/IBM at IBMUS cc: Subject: RE: [ofa-general] Re: openSM: Different IB MTUs Hi Shirley, I think I understand where your question comes from... Many have issue with heterogonous fabrics where not all nodes have same MTU or Speed. Especially when IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. If a node joins an existing MC group it has to have a rate (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied.
If the join is actually a "create" the node has to provide the rate and MTU which define the MCG values. To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM provides the means to control these values per partition. See the doc/partition-config.doc Still the administrator should know what would be the lowest MTU and rate the nodes expected to join the IPoIB subnet have. The tradeoff is in the hands of the administrator who can set a value that will prevent slow nodes from joining the group, or assign a low value that will fit all nodes but slow down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. 
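The join rules described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not OpenSM code: the names `mcg_params`, `mcg_join_allowed` and `min_link_mtu` are hypothetical, the rate is kept as a plain Gb/s integer rather than the IBA rate encoding, and a port whose MTU or rate exactly equals the group's value is assumed to be admitted. The MTU codes are the standard IBA encodings (1 = 256 bytes up to 5 = 4096 bytes). `min_link_mtu` shows the "smallest link MTU in the fabric" default that Shirley proposes.

```c
#include <stdbool.h>
#include <stddef.h>

/* IBA MTU encodings as carried in PortInfo and MCMemberRecord. */
enum ib_mtu_code {
    MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5
};

/* Hypothetical, simplified view of the parameters fixed when a group is created. */
struct mcg_params {
    int mtu_code;   /* encoded MTU, see enum ib_mtu_code */
    int rate_gbps;  /* rate as a plain number (assumption: not the IBA rate encoding) */
};

/* Convert an encoded MTU to bytes: code 1 -> 256, ..., code 5 -> 4096. */
int mtu_code_to_bytes(int code)
{
    return 128 << code;
}

/* A joiner is admitted only if it can sustain the group's MTU and rate. */
bool mcg_join_allowed(const struct mcg_params *g, int port_mtu_code, int port_rate_gbps)
{
    return port_mtu_code >= g->mtu_code && port_rate_gbps >= g->rate_gbps;
}

/* Proposed default: smallest link MTU code seen in the fabric (0 if the list is empty). */
int min_link_mtu(const int *codes, size_t n)
{
    int min = 0;
    for (size_t i = 0; i < n; i++)
        if (min == 0 || codes[i] < min)
            min = codes[i];
    return min;
}
```

With a group created at 2K MTU, a 1K port is refused by `mcg_join_allowed`; seeding the group from `min_link_mtu` instead would admit every port at the cost of a smaller MTU for all members.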
-- Hal

Thanks
Shirley Ma

"Hal Rosenstock" <hal.rosenstock at gmail.com>
07/25/07 10:57 AM
To: Shirley Ma/Beaverton/IBM at IBMUS
cc: general at lists.openfabrics.org
Subject: Re: openSM: Different IB MTUs

Shirley,

On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:

Hello Hal,

How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet?

Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these?

-- Hal

Thanks
Shirley Ma

From jsquyres at cisco.com Wed Jul 25 13:11:45 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 25 Jul 2007 16:11:45 -0400
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <46A7A33B.4080201@ichips.intel.com>
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <46A7A33B.4080201@ichips.intel.com>
Message-ID: <8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>

Heh -- I didn't notice these links until Sean moved them up to the
top of the text.

Yes, we should definitely link to the MPI project home sites; we have
lots of our own information there, separate downloads, etc.

On Jul 25, 2007, at 3:23 PM, Sean Hefty wrote:

>> http://www.openfabrics.org/downloads/mpi/mvapich (pasha)
>> http://www.openfabrics.org/downloads/mpi/mvapich2 (rowland)
>> http://www.openfabrics.org/downloads/mpi/openmpi (jsquyres)
>
> Are all of these MPI versions distributed by OFA? If they have
> other official sites, should we instead direct users to that site?
> Or will this be automated enough that people can provide their own
> links?
>
> - Sean
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

--
Jeff Squyres
Cisco Systems

From sashak at voltaire.com Wed Jul 25 13:22:09 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 25 Jul 2007 23:22:09 +0300
Subject: [ofa-general] Re: opensm: a bug in heavy sweep?
- no LFT re-configuration In-Reply-To: <20070724170432.GZ27878@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> Message-ID: <20070725202209.GC31582@sashak.voltaire.com> On 20:04 Tue 24 Jul , Sasha Khapyorsky wrote: > On 07:56 Tue 24 Jul , Eitan Zahavi wrote: > > > On 20:59 Mon 23 Jul , Eitan Zahavi wrote: > > > > Hi Sasha, Hal, > > > > > > > > I think I have an idea: > > > > > > > > Since this is a specific switch that reported ChangeBit or Trap why > > > > can't we just qualify that there was no change in the switch setup? > > > > > > The ChangeBit seems to be good start point - then OpenSM will > > > query all switch ports PortInfo anyway and if for all ports > > > PortState is <= INIT (and at least for one port it is = > > > INIT), it means that this switch was rebooted/reinitialized. > > > > > > And for single port PortState drop to = INIT should indicate > > > reinitialization. > > > > > > Seems correct? > > Yes. > > > > > > > We could send PortInfo, SwitchInfo, > > > > > > SwitchInfo is queried at each light sweep, PortInfo's if > > > ChangeBit is set. Guess we are ok with it even now. > > I will double check that... > > Well - even setting one port state to INIT did not cause the switch to > > be reconfigured. > > Seems the code does not enforce this condition yet. > > > > > > > LFT, MFT, SL2VL, VLArb, PKey queries > > > > and make sure no change from previous state. Or we could simply > > > > enforce last state by sending it over again ... 
> > > > > > I think we could want to re-read PKey tables in order to > > > preserve existing PKey indices and just to flush (overwrite > > > with new settings) LFT, MFT, SL2VL, VLArb tables. Reasonable? > > Correct. > > Ok, I will prepare patches. I think about separate patches for switches > and ports. Also likely MFT should be handled separately, since we don't > do incremental update there yet. There is another case where data could be modified externally - when master SM loses mastership, and after some time gets it back. Then data should be flushed too. The patch shortly. Sasha From sashak at voltaire.com Wed Jul 25 13:24:18 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Jul 2007 23:24:18 +0300 Subject: [ofa-general] pkey.sim.tcl (was: [PATCH] opensm: detect port external reset and flush cached tables) In-Reply-To: <20070724215441.GA25264@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> Message-ID: <20070725202418.GD31582@sashak.voltaire.com> Hi Eitan, Yevgeny, On 00:54 Wed 25 Jul , Sasha Khapyorsky wrote: > > This detects port external reset by validating PortState == INIT, and > when detected flushes cached port related tables - re-reads pkey table > and drops (overwrites) SL2VL and VLArb tables. > > Signed-off-by: Sasha Khapyorsky [snip...] 
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c > index 6fe2d1d..0528e38 100644 > --- a/opensm/opensm/osm_port_info_rcv.c > +++ b/opensm/opensm/osm_port_info_rcv.c > @@ -801,6 +801,12 @@ osm_pi_rcv_process( > p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid; > } > > + /* if port just inited or reached INIT state (external reset) > + request update for port related tables */ > + p_physp->need_update = > + (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT || > + p_physp->need_update > 1 ) ? 1 : 0; > + > switch( osm_node_get_type( p_node ) ) > { > case IB_NODE_TYPE_CA: > @@ -824,7 +830,8 @@ osm_pi_rcv_process( > /* > Get the tables on the physp. > */ > - __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp ); > + if (p_physp->need_update) > + __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp ); When testing this patch, I tried it with ibmgtsim and test failed: RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl The failure is resulted by port pkey tables modifications which is performed in pkey.sim.tcl. Why should we do this? Is this legal scenario when pkey tables are modified externally without Partition Manager? 
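The idea behind the need_update flag in the hunk above can be illustrated in isolation. The following is a simplified sketch, not the OpenSM implementation: `port_cache`, `port_needs_update` and `must_send_table` are hypothetical names, and the cached table is reduced to an 8-byte array. The link state values are the IBA PortState encodings (2 = Initialize, 4 = Active). A port seen in INIT is treated as externally reset, so its cached tables are no longer trusted and the "skip if unchanged" comparison is bypassed.

```c
#include <string.h>

/* IBA PortState encodings. */
enum link_state { LINK_DOWN = 1, LINK_INIT = 2, LINK_ARMED = 3, LINK_ACTIVE = 4 };

/* Hypothetical per-port cache: a flag plus one cached table. */
struct port_cache {
    unsigned need_update;    /* 1 when cached tables must not be trusted */
    unsigned char sl2vl[8];  /* last table known to be programmed on the port */
};

/* Mirrors the expression in the patch: INIT state (or a previous value > 1)
 * requests a refresh of the port related tables. */
unsigned port_needs_update(enum link_state state, unsigned prev_need_update)
{
    return (state == LINK_INIT || prev_need_update > 1) ? 1 : 0;
}

/* Returns 1 when the table has to be sent to the port, 0 when the send can be
 * skipped because the cache is trusted and already matches. */
int must_send_table(const struct port_cache *p, const unsigned char wanted[8])
{
    if (!p->need_update && memcmp(p->sl2vl, wanted, sizeof(p->sl2vl)) == 0)
        return 0;
    return 1;
}
```

The consequence for a test such as pkey.sim.tcl is visible here: if a table is changed behind the SM's back while the port stays out of INIT, the cache still matches what the SM wants and nothing is resent.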
Sasha From sashak at voltaire.com Wed Jul 25 13:37:31 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 25 Jul 2007 23:37:31 +0300 Subject: [ofa-general] [PATCH] opensm: handle port and switch tables update over handover In-Reply-To: <20070725202209.GC31582@sashak.voltaire.com> References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070725202209.GC31582@sashak.voltaire.com> Message-ID: <20070725203731.GE31582@sashak.voltaire.com> This cares to not use cached port and switch related tables (PKey, SL2VL, VLArb, LFT) data after SM handover. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 5 +++++ opensm/opensm/osm_port_info_rcv.c | 2 +- opensm/opensm/osm_qos.c | 30 +++++++++++++++++++++++------- opensm/opensm/osm_state_mgr.c | 5 +++++ 4 files changed, 34 insertions(+), 8 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index fce6b52..60dc2ff 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -579,6 +579,7 @@ typedef struct _osm_subn boolean_t moved_to_master_state; boolean_t first_time_master_sweep; boolean_t coming_out_of_standby; + unsigned need_update; } osm_subn_t; /* * FIELDS @@ -717,6 +718,10 @@ typedef struct _osm_subn * The flag is set true if the SM state was standby and now changed to MASTER * it is reset at the end of the sweep. 
* +* need_update +* This flag should be on during first non-master heavy (including +* pre-master dicovery stage) +* * SEE ALSO * Subnet object *********/ diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 0528e38..3965b88 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -830,7 +830,7 @@ osm_pi_rcv_process( /* Get the tables on the physp. */ - if (p_physp->need_update) + if (p_physp->need_update || p_rcv->p_subn->need_update) __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, p_physp ); } diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index 596b6d4..c9ca9d8 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -70,6 +70,7 @@ static void qos_build_config(struct qos_config * cfg, static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req, osm_physp_t * p, uint8_t port_num, + unsigned force_update, const ib_vl_arb_table_t *table_block, unsigned block_length, unsigned block_num) @@ -87,7 +88,7 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req, for (i = 0; i < block_length; i++) block.vl_entry[i].vl &= vl_mask; - if (!p->need_update && + if (!force_update && !memcmp(&p->vl_arb[block_num], &block, block_length * sizeof(block.vl_entry[0]))) return IB_SUCCESS; @@ -106,6 +107,7 @@ static ib_api_status_t vlarb_update_table_block(osm_req_t * p_req, static ib_api_status_t vlarb_update(osm_req_t * p_req, osm_physp_t * p, uint8_t port_num, + unsigned force_update, const struct qos_config *qcfg) { ib_api_status_t status = IB_SUCCESS; @@ -116,6 +118,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, len = p_pi->vl_arb_low_cap < IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK ? 
p_pi->vl_arb_low_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, + force_update, &qcfg->vlarb_low[0], len, 0)) != IB_SUCCESS) return status; @@ -123,6 +126,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, if (p_pi->vl_arb_low_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_low_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, + force_update, &qcfg->vlarb_low[1], len, 1)) != IB_SUCCESS) return status; @@ -131,6 +135,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, len = p_pi->vl_arb_high_cap < IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK ? p_pi->vl_arb_high_cap : IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, + force_update, &qcfg->vlarb_high[0], len, 2)) != IB_SUCCESS) return status; @@ -138,6 +143,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, if (p_pi->vl_arb_high_cap > IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK) { len = p_pi->vl_arb_high_cap % IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; if ((status = vlarb_update_table_block(p_req, p, port_num, + force_update, &qcfg->vlarb_high[1], len, 3)) != IB_SUCCESS) return status; @@ -149,6 +155,7 @@ static ib_api_status_t vlarb_update(osm_req_t * p_req, static ib_api_status_t sl2vl_update_table(osm_req_t * p_req, osm_physp_t * p, uint8_t in_port, uint8_t out_port, + unsigned force_update, const ib_slvl_table_t * sl2vl_table) { osm_madw_context_t context; @@ -171,7 +178,7 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req, tbl.raw_vl_by_sl[i] = (vl1 << 4 ) | vl2 ; } - if (!p->need_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) && + if (!force_update && (p_tbl = osm_physp_get_slvl_tbl(p, in_port)) && !memcmp(p_tbl, &tbl, sizeof(tbl))) return IB_SUCCESS; @@ -187,6 +194,7 @@ static ib_api_status_t sl2vl_update_table(osm_req_t * p_req, static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, osm_physp_t * p, uint8_t port_num, + unsigned 
force_update, const struct qos_config *qcfg) { ib_api_status_t status; @@ -209,7 +217,8 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, for (i = 0; i < num_ports; i++) { status = - sl2vl_update_table(p_req, p, i, port_num, &qcfg->sl2vl); + sl2vl_update_table(p_req, p, i, port_num, + force_update, &qcfg->sl2vl); if (status != IB_SUCCESS) return status; } @@ -220,6 +229,7 @@ static ib_api_status_t sl2vl_update(osm_req_t * p_req, osm_port_t * p_port, static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req, osm_port_t * p_port, osm_physp_t * p, uint8_t port_num, + unsigned force_update, const struct qos_config *qcfg) { ib_api_status_t status; @@ -230,7 +240,7 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req, p->vl_high_limit = qcfg->vl_high_limit; /* setup VLArbitration */ - status = vlarb_update(p_req, p, port_num, qcfg); + status = vlarb_update(p_req, p, port_num, force_update, qcfg); if (status != IB_SUCCESS) { osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: ERR 6202 : " @@ -241,7 +251,7 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_req_t * p_req, } /* setup SL2VL tables */ - status = sl2vl_update(p_req, p_port, p, port_num, qcfg); + status = sl2vl_update(p_req, p_port, p, port_num, force_update, qcfg); if (status != IB_SUCCESS) { osm_log(p_log, OSM_LOG_ERROR, "qos_physp_setup: ERR 6203 : " @@ -265,6 +275,7 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) osm_physp_t *p_physp; osm_node_t *p_node; ib_api_status_t status; + unsigned force_update; uint8_t i; if (p_osm->subn.opt.no_qos) @@ -296,9 +307,12 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) p_physp = osm_node_get_physp_ptr(p_node, i); if (!osm_physp_is_valid(p_physp)) continue; + force_update = p_physp->need_update || + p_osm->subn.need_update; status = qos_physp_setup(&p_osm->log, &p_osm->sm.req, - p_port, p_physp, i, &swe_config); + p_port, p_physp, i, + force_update, &swe_config); } /* skip base port 
0 */ if (!ib_switch_info_is_enhanced_port0(&p_node->sw->switch_info)) @@ -314,8 +328,10 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) if (!osm_physp_is_valid(p_physp)) continue; + force_update = p_physp->need_update || p_osm->subn.need_update; status = qos_physp_setup(&p_osm->log, &p_osm->sm.req, - p_port, p_physp, 0, cfg); + p_port, p_physp, 0, + force_update, cfg); } cl_plock_release(&p_osm->lock); diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 7efbe2a..a15f3b4 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1938,6 +1938,10 @@ osm_state_mgr_process( "osm_state_mgr_process: ERR 331A: " "osm_subn_rescan_conf_file failed\n" ); } + + if (p_mgr->p_subn->sm_state != IB_SMINFO_STATE_MASTER) + p_mgr->p_subn->need_update = 1; + status = __osm_state_mgr_sweep_hop_0( p_mgr ); if( status == IB_SUCCESS ) { @@ -2742,6 +2746,7 @@ Idle: { p_mgr->p_subn->first_time_master_sweep = FALSE; } + p_mgr->p_subn->need_update = 0; __osm_topology_file_create( p_mgr ); __osm_state_mgr_report( p_mgr ); -- 1.5.3.rc2.29.gc4640f From harms at alcf.anl.gov Wed Jul 25 13:37:29 2007 From: harms at alcf.anl.gov (Kevin Harms) Date: Wed, 25 Jul 2007 15:37:29 -0500 Subject: [ofa-general] srp_daemon Message-ID: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov> in the /var/log/srp_daemon.log i see errors of the following ilk: 5/06/07 15:35:47 : No response to inform info registration 25/06/07 15:35:47 : Fail to register to traps, maybe there is no opensm running on fabric 25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running opemsm is running. Anyone know what to look at to debug this? 
thanks,
kevin harms

From ardavis at ichips.intel.com Wed Jul 25 13:53:30 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 25 Jul 2007 13:53:30 -0700
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> <46A7A33B.4080201@ichips.intel.com> <8ED37593-471C-4F17-B43D-0AB173687A81@cisco.com>
Message-ID: <46A7B84A.6030305@ichips.intel.com>

Jeff Squyres wrote:

> Heh -- I didn't notice these links until Sean moved them up to the
> top of the text.
>
> Yes, we should definitely link to the MPI project home sites; we have
> lots of our own information there, separate downloads, etc.

Are these the links we want?

MVAPICH - http://mvapich.cse.ohio-state.edu/
OpenMPI - http://www.open-mpi.org/

From panda at cse.ohio-state.edu Wed Jul 25 13:57:42 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed, 25 Jul 2007 16:57:42 -0400 (EDT)
Subject: [ewg] Re: [ofa-general] RE: OFA website edits
In-Reply-To: <46A7B84A.6030305@ichips.intel.com> from "Arlin Davis" at Jul 25, 2007 01:53:30 PM
Message-ID: <200707252057.l6PKvhHd017075@xi.cse.ohio-state.edu>

> Jeff Squyres wrote:
>
> > Heh -- I didn't notice these links until Sean moved them up to the
> > top of the text.
> >
> > Yes, we should definitely link to the MPI project home sites; we have
> > lots of our own information there, separate downloads, etc.
>
> Are these the links we want?
>
> MVAPICH - http://mvapich.cse.ohio-state.edu/

Yes, this link is correct. Please add this link for both MVAPICH and MVAPICH2.
Thanks, DK > OpenMPI - http://www.open-mpi.org/ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Jul 25 14:04:33 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Jul 2007 00:04:33 +0300 Subject: [ofa-general] srp_daemon In-Reply-To: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov> References: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov> Message-ID: <20070725210433.GG31582@sashak.voltaire.com> On 15:37 Wed 25 Jul , Kevin Harms wrote: > > in the /var/log/srp_daemon.log i see errors of the following ilk: > > 5/06/07 15:35:47 : No response to inform info registration > 25/06/07 15:35:47 : Fail to register to traps, maybe there is no opensm > running on fabric > 25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running > > opemsm is running. Anyone know what to look at to debug this? Check by running sminfo on SRP daemon node. If everything is fine, look at OpenSM log for errors. Sasha From sashak at voltaire.com Wed Jul 25 14:10:59 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Jul 2007 00:10:59 +0300 Subject: [ofa-general] Re: osm_physp_calc_link_ops question In-Reply-To: References: Message-ID: <20070725211059.GH31582@sashak.voltaire.com> Hi Hal, On 13:53 Wed 25 Jul , Hal Rosenstock wrote: > > Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and > osm_link_mgr.c:__osm_link_mgr_set_physp_pi call > osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote end is > invalid, the local VLCap is used as the OperationalVLs. When the VLCaps at > the two ends of the link do not match, this is not a good thing. It causes > trap storms on the flow control watchdog timer expiring. Wouldn't it be > better to leave this field as is in this case or would that cause some other > problem ? 
> > Same thing might also be true for link MTU but not as critical. Looks like good idea for me. Would you care about patch? Sasha From sashak at voltaire.com Wed Jul 25 15:02:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Jul 2007 01:02:04 +0300 Subject: [ofa-general] Lost in-service traps during Open SM migration In-Reply-To: References: Message-ID: <20070725220204.GI31582@sashak.voltaire.com> Hi Lan, On 09:57 Wed 25 Jul , lbt wrote: > Hello, > > I have been seeing a problem where a subscriber for in-service traps is not > getting informed when the port of master openSM is restored (i.e. causing an > SM migration). > > I have an IB subnet with 2 nodes running OpenSM , different priorities of > course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet > that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC > trap events. I've been doing cable pull tests on the IB ports, to check if > the in-service handler I have subscribed gets invoked when I restore the > cable. I've noticed that everything works as expected ( i.e. my in-service > handler is invoked) whenever I restore the cable on the lower priority SM IB > port without ever touching the master SM port. But if I cause an SM > migration, by restoring the port of the higher priority SM, the in-service > trap does not get generated as expected on a cable restore. > > Steps to Reproduce: > 1) Start with port to higher priority SM disconnected. 
> 2) restore port cable on the higher priority SM > --> This causes an SM Migration as expected, SM's migration happens okay > --> I expected the restoration of the higher priority SM to tit to also > trigger an in-service trap as well and notify subscribers, but it doesn't > occur > > I have collected debug messages log for both open SM's, and it appears that > the reason is because: > 1) in-service traps are generated based on what ports are added on the > Master SM's new_ports_list, but these traps are generated only after LID > assignment > 2) when the higher priority SM port is restored, the restored port gets > added to the lower priority SM's new_ports_list (since it's still the Master > SM at that point in time) > 3) the handover of Master SM from lower priority to higher priority SM > occurs (before LID assignment and thus a chance for traps get generated for > those ports on new_ports_list) > 4) the higher priority SM is now Master SM, but it has an empty > new_ports_list, so no trap generated either > > Does this look like a legitimate Open SM bug? Any feedback would be much > appreciated, and if I can help further in any way please let me know . As far as I know when OpenSM (even old like 2.0.5) becomes master it requests client to reregister SA related stuff (by setting this bit in PortInfo). Probably your port doesn't not support this (you could verify by seeing PortInfo:CapabilityMask - use 'smpquery portinfo ') or maybe your host stack doesn't do reregistration? Anyway you could track this in the OpenSM code in osm_lid_mgr.c __osm_lid_mgr_set_physp_pi() whenever client reregistration bit is set (with ib_port_info_set_client_rereg()) or not. Then we will know more about this problem. 
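The capability check suggested above can be done by hand once the PortInfo CapabilityMask value is known. A minimal sketch, assuming the mask has already been converted to host byte order and that client reregistration support is reported in bit 25 (0x02000000) of the mask; the helper names are hypothetical and are not part of OpenSM or the diagnostic tools.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: IsClientReregistrationSupported is bit 25 of CapabilityMask. */
#define CAP_CLIENT_REREG_BIT 25u

/* Takes the capability mask in host byte order; returns 1 if the bit is set. */
int cap_mask_has_client_rereg(uint32_t cap_mask_host)
{
    return (int)((cap_mask_host >> CAP_CLIENT_REREG_BIT) & 1u);
}

/* Convenience wrapper for a hex string copied from tool output,
 * with or without a leading "0x". */
int cap_mask_str_has_client_rereg(const char *hex)
{
    return cap_mask_has_client_rereg((uint32_t)strtoul(hex, NULL, 16));
}
```

If the bit is clear on the subscriber's port, the new master has no way to ask that port to resubscribe, which would be consistent with the missing notices described in this thread.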
Sasha > > > Subset of logs from lower priority SM during the cable restore of higher > priority SM port: > ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request: > Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A > TID:0x00000016000012e1 > ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received > signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE > ### 14:31:56 ******************** INITIATING HEAVY SWEEP > ********************** > ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received > signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > OSM_SM_STATE_SWEEP_HEAVY_SELF > Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port > GUID:0x00504501483e0000 to new_ports_list > Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal > OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET > Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal > OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET > 14:31:56 ********************* HEAVY SWEEP COMPLETE *********************** > Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received > signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER### > 14:31:56 ******************** ENTERING SM STANDBY STATE ******************* > > Subset of logs from higher priority SM during the cable restore of higher > priority SM port: > > Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [ > Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received > signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state > IB_SMINFO_STATE_DISCOVERING > Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state > Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: > ******************** ENTERING SM MASTER STATE ******************** > Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg: > **** SM LID ASSIGNMENT 
COMPLETE - STARTING SUBNET LID CONFIG ***** > Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg > ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** > Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [ > ----> no in-service traps are generated and notices forwarded because there > are no ports on this list > Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ] > > > Thanks! > Lan > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From becker at nas.nasa.gov Wed Jul 25 15:02:25 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Wed, 25 Jul 2007 15:02:25 -0700 Subject: [ofa-general] Re: http://git.openfabrics.org/ In-Reply-To: References: Message-ID: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com> Hi all. I think I fixed this. If you go to http://git.openfabrics.org, it redirects to http://www.openfabrics.org/git. I also fixed the link on the developer resources page. This is my first experience with apache redirects so if you see anything wrong, or have any suggestions, don't hesitate to send me mail. Thanks. -jeff On 7/11/07, Jeff Squyres wrote: > Just a ping again to make sure that this request doesn't get lost... > > On Jun 15, 2007, at 11:11 AM, Jeff Squyres wrote: > > > I notice that http://git.openfabrics.org/ shows the main OFA web > > site, but http://git.openfabrics.org/git/ shows all the git > > repositories. > > > > Can a redirect be installed such that http://git.openfabrics.org/ > > is automatically sent to http://git.openfabrics.org/git/? > > > > I think that would be a little more intuitive. > > > > Thanks! 
> > > > -- > > Jeff Squyres > > Cisco Systems > > > > > > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Wed Jul 25 15:26:47 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 25 Jul 2007 18:26:47 -0400 Subject: [ofa-general] srp_daemon In-Reply-To: <20070725210433.GG31582@sashak.voltaire.com> References: <8BE811FF-483F-4D23-9A1A-91B8C60B301B@alcf.anl.gov> <20070725210433.GG31582@sashak.voltaire.com> Message-ID: On 7/25/07, Sasha Khapyorsky wrote: > > On 15:37 Wed 25 Jul , Kevin Harms wrote: > > > > in the /var/log/srp_daemon.log i see errors of the following ilk: > > > > 5/06/07 15:35:47 : No response to inform info registration > > 25/06/07 15:35:47 : Fail to register to traps, maybe there is no opensm > > running on fabric > > 25/06/07 15:35:47 : SM LID is 0, maybe no opensm is running > > > > opemsm is running. Anyone know what to look at to debug this? > > Check by running sminfo on SRP daemon node. If everything is fine, look > at OpenSM log for errors. Also, is local LID non 0 ? ibstatus or ibstat. -- Hal Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sashak at voltaire.com Wed Jul 25 15:34:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 26 Jul 2007 01:34:02 +0300 Subject: [ofa-general] Re: http://git.openfabrics.org/ In-Reply-To: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com> References: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com> Message-ID: <20070725223402.GK31582@sashak.voltaire.com> On 15:02 Wed 25 Jul , Jeff Becker wrote: > Hi all. I think I fixed this. If you go to http://git.openfabrics.org, > it redirects to http://www.openfabrics.org/git. I also fixed the link > on the developer resources page. This is my first experience with > apache redirects so if you see anything wrong, or have any > suggestions, don't hesitate to send me mail. Thanks. Seems URL http://git.openfabrics.org/git is redirected to http://www.openfabrics.org/git//git now. Probably something like RewriteCond ${REQUEST_URI} !^/git can help. Also double slash ('//') doesn't look good, but it is minor. Sasha From becker at nas.nasa.gov Wed Jul 25 15:40:59 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Wed, 25 Jul 2007 15:40:59 -0700 Subject: [ofa-general] Re: http://git.openfabrics.org/ In-Reply-To: <20070725223402.GK31582@sashak.voltaire.com> References: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com> <20070725223402.GK31582@sashak.voltaire.com> Message-ID: <795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com> Thanks. Fixed (including the slash). -jeff On 7/25/07, Sasha Khapyorsky wrote: > On 15:02 Wed 25 Jul , Jeff Becker wrote: > > Hi all. I think I fixed this. If you go to http://git.openfabrics.org, > > it redirects to http://www.openfabrics.org/git. I also fixed the link > > on the developer resources page. This is my first experience with > > apache redirects so if you see anything wrong, or have any > > suggestions, don't hesitate to send me mail. Thanks. 
>
> Seems URL http://git.openfabrics.org/git is redirected to
> http://www.openfabrics.org/git//git now. Probably something like
>
>   RewriteCond %{REQUEST_URI} !^/git
>
> can help.
>
> Also double slash ('//') doesn't look good, but it is minor.
>
> Sasha
>

From jeff.c.becker at gmail.com Wed Jul 25 15:46:40 2007
From: jeff.c.becker at gmail.com (Jeff Becker)
Date: Wed, 25 Jul 2007 15:46:40 -0700
Subject: [ofa-general] Re: http://git.openfabrics.org/
In-Reply-To: <795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com>
References: <795c49870707251502m79ca491cy568fa5a008120239@mail.gmail.com>
 <20070725223402.GK31582@sashak.voltaire.com>
 <795c49870707251540w458f06dco739561768d7e2f57@mail.gmail.com>
Message-ID: <795c49870707251546k7df63d0bqf056cc3ce35a959b@mail.gmail.com>

Whoops! Now git.openfabrics.org doesn't work right. I'll test it more
thoroughly after I think I've fixed it. Sorry.

-jeff

On 7/25/07, Jeff Becker wrote:
> Thanks. Fixed (including the slash).
>
> -jeff
>
> On 7/25/07, Sasha Khapyorsky wrote:
> > On 15:02 Wed 25 Jul , Jeff Becker wrote:
> > > Hi all. I think I fixed this. If you go to http://git.openfabrics.org,
> > > it redirects to http://www.openfabrics.org/git. I also fixed the link
> > > on the developer resources page. This is my first experience with
> > > apache redirects so if you see anything wrong, or have any
> > > suggestions, don't hesitate to send me mail. Thanks.
> >
> > Seems URL http://git.openfabrics.org/git is redirected to
> > http://www.openfabrics.org/git//git now. Probably something like
> >
> >   RewriteCond %{REQUEST_URI} !^/git
> >
> > can help.
> >
> > Also double slash ('//') doesn't look good, but it is minor.
> >
> > Sasha
> >
>

From hal.rosenstock at gmail.com Wed Jul 25 15:52:52 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 25 Jul 2007 18:52:52 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com>
Message-ID:

Shirley,

On 7/25/07, Shirley Ma wrote:
>
> Hello Eitan, Hal,
>
> Thanks. It's good that openSM has the configuration option to set up these
> attributes in MC. Is it a good idea to add the following to openSM: when
> there is no MTU defined in the configuration file, the SM picks the
> smallest link MTU in the fabric by default?
>

The issue is that today's lowest MTU might not be tomorrow's, but this could
be an additional policy added to OpenSM. IMO it should not be the default
policy. I think that's as far as we got on this last time.

-- Hal

> MTU is unlike rate: a slower rate might indicate a cabling problem. So
> using the smallest link MTU in the fabric might not be a bad choice for MC
> by default. The reason for my request is that when creating an IP multicast
> group, MTU is not an attribute of the group. When mapping IP multicast to
> IB multicast, the IB multicast join might fail because of different IB link
> MTU sizes in the group, but the IP multicast group will be created
> successfully without knowing about the failure. If the admin sets the MTU
> in the configuration file, the admin would know about this failure.
> Otherwise, admins/users could spend too much time debugging their broken
> multicasting applications.
>
> Thanks
> Shirley Ma
>
> "Eitan Zahavi"
> 07/25/07 12:25 PM
>
> To: "Hal Rosenstock", Shirley Ma/Beaverton/IBM at IBMUS
> cc:
> Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
>
> Hi Shirley,
>
> I think I understand where your question comes from...
> Many have issues with heterogeneous fabrics where not all nodes have the
> same MTU or speed, especially when IPoIB relies on all nodes joining the
> broadcast group.
>
> The term "join" for multicast groups is a little overloaded.
> If a node joins an existing MC group it has to have a rate (speed * width)
> > MCG.rate and support MTU > MCG.MTU, otherwise it is denied.
> If the join is actually a "create", the node has to provide the rate and
> MTU which define the MCG values.
>
> To allow the administrator to control the IPoIB MCGs' MTU and rate, OpenSM
> provides the means to control these values per partition. See the
> doc/partition-config.doc
> Still, the administrator should know what the lowest MTU and rate of the
> nodes expected to join the IPoIB subnet would be.
> The tradeoff is in the hands of the administrator, who can set a value
> that will prevent slow nodes from joining the group, or assign a low value
> that will fit all nodes but slow down communication ...
>
> EZ
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> ------------------------------
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock
> Sent: Wednesday, July 25, 2007 10:01 PM
> To: Shirley Ma
> Cc: general at lists.openfabrics.org
> Subject: [ofa-general] Re: openSM: Different IB MTUs
>
> Shirley,
>
> On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
>
> Hal,
>
> Thanks for your prompt reply. I am asking how openSM handles
> different link MTUs in the SA MCMemberRecord MTU. For example, if we have
> some links with MTU 2K and some links with MTU 1K, then when enabling
> IPoIB, how does the SM decide the IPoIB broadcast group MCMemberRecord MTU
> size? When creating an IB multicast group from a 2K MTU node first, which
> PMTU value is attached to this IB multicast group's MCMemberRecord MTU?
>
> MCMemberRecord MTU gets the group MTU (when created). This is either the
> first joiner with sufficient components or preconfigured (and MTU can be
> set in the config). If a joiner has insufficient MTU for the group, it is
> denied.
>
> -- Hal
>
> Thanks
> Shirley Ma
>
> "Hal Rosenstock" <hal.rosenstock at gmail.com>
> 07/25/07 10:57 AM
>
> To: Shirley Ma/Beaverton/IBM at IBMUS
> cc: general at lists.openfabrics.org
> Subject: Re: openSM: Different IB MTUs
>
> Shirley,
>
> On 7/25/07, Shirley Ma <xma at us.ibm.com> wrote:
> Hello Hal,
>
> How does openSM handle CAs with different MTUs in the
> same subnet? For example, IPoIB broadcast group MTU, IB multicast group
> PMTU? Does openSM pick up the smallest MTU in the subnet?
>
> Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA
> MCMemberRecord MTU, or all of these ?
>
> -- Hal
>
> Thanks
> Shirley Ma
>

From swise at opengridcomputing.com Wed Jul 25 16:11:22 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 25 Jul 2007 18:11:22 -0500
Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A69225.9090502@ichips.intel.com>
References: <46A283B6.1070105@dev.mellanox.co.il>
 <46A54659.8010608@ichips.intel.com> <46A69225.9090502@ichips.intel.com>
Message-ID: <46A7D89A.8080501@opengridcomputing.com>

Sorry guys, I haven't had time to catch up on this thread yet...

I'll try and answer you by EOB tomorrow.

Steve.

Sean Hefty wrote:
> Steve,
>
> Do you have any input with respect to how the RDMA CM selects and maps
> QoS (priority, traffic class, VLAN, flow label, etc.)? (See below)
>
> Hide the QoS selection under the current interface? Use the IPv6
> flowinfo field? Rely on destination port? Input QoS through existing
> or new call? Handle IPv4 and IPv6 addresses differently? ???
>
> - Sean
>
>>> 2.6. ULPs and programs using CMA to establish RC connection should
>>> provide the CMA the target IP and Service-ID. Some of the ULPs might
>>> also provide QoS-Class (E.g. for SDP sockets that are provided the
>>> TOS socket option). The CMA should then use the provided Service-ID
>>> and optional QoS-Class and pass them in the PR/MPR request. The
>>> resulting PR/MPR should be used for configuring the connection QP.
>>
>> The interface to the CMA needs to remain as transport independent as
>> possible, and I am unsure of the transport independence of tying QoS
>> to the destination port number. (I'm not disagreeing; I'm just not
>> sure at the moment it's the right approach.)
>>
>>> 5. CMA features
>>> ----------------
>>>
>>> The CMA interface supports Service-ID through the notion of port
>>> space as a prefix to the port_num which is part of the sockaddr
>>> provided to rdma_resolve_addr(). What is missing is the explicit
>>> request for a QoS-Class that should allow the ULP (like SDP) to
>>> propagate a specific request for a class of service. A mechanism for
>>> providing the QoS-Class is available in the IPv6 address, so we could
>>> use that address field. Another option is to implement a special
>>> connection options API for CMA.
>>>
>>> Missing functionality by CMA is the usage of the provided QoS-Class
>>> and Service-ID in the sent PR/MPR. When a response is obtained it is
>>> an existing requirement for the CMA to use the PR/MPR from the
>>> response in setting up the QP address vector.
>>
>> The most natural function to specify additional QoS parameters would
>> be rdma_resolve_route.

From mshefty at ichips.intel.com Wed Jul 25 16:38:32 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 25 Jul 2007 16:38:32 -0700
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID: <46A7DEF8.7040608@ichips.intel.com>

> QoS Policy file syntax
>
>  * Empty lines are ignored
>  * Leading and trailing blanks, as well as empty lines, are ignored, so the
>    indentation in the example is just for better readability
>  * Comments are started with the pound sign (#) and terminated by EOL
>  * Comments may appear only in a separate line
>  * Keywords that denote section/subsection start have matching closing
>    keywords
>  * Any keyword should be the first non-blank in the line
>
> QoS Policy file example
>
>     # Port Groups define sets of ports to be used later in the settings
>     port-groups
>         # using port GUIDs
>         port-group
>             name: Storage
>             # "use" is just a description that is used for logging.
>             # Other than that, it is just a commentary
>             use: our SRP storage targets
>             port-guid: 0x1000000000000001
>             port-guid: 0x1000000000000002
>         end-port-group
>
>         port-group
>             name: Virtual Servers
>             use: node desc and IB port num
>             # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
>             # "hostname" and "CA-num" are compared to the first 2 words of
>             # NodeDescription, and "Pnum" is a port number on that node.
>             port-name: vs1/HCA-1/P1
>             port-name: vs3/HCA-1/P1
>             port-name: vs3/HCA-2/P2
>         end-port-group
>
>         # using partitions defined in the partition policy
>         port-group
>             name: Group for Partition 1
>             use: default settings
>             partition: Part1
>         end-port-group
>
>         # using node types CA|ROUTER|SWITCH
>         port-group
>             name: Routers
>             use: all routers
>             node-type: ROUTER
>         end-port-group
>
>     end-port-groups
>
>     qos-setup
>
>         # define all types of VLArb tables. The length of the tables should
>         # match the physically supported tables by their target ports
>         vlarb-tables
>             # scope defines the exact ports the VLArb tables apply to
>             vlarb-scope
>                 # defining VLArb tables on all the ports that belong to
>                 # port group 'Storage', and on all the ports connected
>                 # to ports of port group 'Storage'
>                 group: Storage
>                 # "across" means all the ports that are connected to ports
>                 # that belong to the specified port group
>                 across: Storage
>                 # VLArb table holds VL and weight pairs
>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                 vl-high-limit: 10
>             end-vlarb-scope
>             # There can be several scopes
>         end-vlarb-tables
>
>         sl2vl-tables
>             # Scope defines the exact devices and in/out ports tables apply to.
>             # Note: if the same port is matching several rules the *FIRST* one applies.
>             sl2vl-scope
>                 # SL2VL tables are organized as SL2VL(in-port,out-port)
>                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                 #
>                 # The following example specifies that all the SL2VL tables
>                 # entries should be defined for all the ports of group Part1:
>                 group: Part1
>                 from: *
>                 to: *
>                 # SL2VL table has to have 16 values at max - one for each SL.
>                 # If the user specifies less than 16 values, all the missing
>                 # VL values will be implicitly set to 0
>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>             end-sl2vl-scope
>
>             sl2vl-scope
>                 # "across-to" is a combination of "across" keyword (definition
>                 # can be found in VLArb tables section) and "to" keyword.
>                 # "across: PortGroupName" refers to all the ports that are
>                 # connected to ports that belong to PortGroupName.
>                 #
>                 # Example of "across-to" usage:
>                 # A user has a set of 'special' nodes (e.g. storage nodes), and
>                 # all the traffic to these nodes has to get specific VL.
>                 # The solution is to define port group (e.g. "Storage") that
>                 # will include all the ports of these nodes, and then to
>                 # configure SL2VL tables on all the switch ports that are
>                 # connected to the Storage port group by specifying
>                 # "across-to: Storage".
>                 #
>                 across-to: Storage2
>                 # Similar to "across-to", "across-from" is a combination of
>                 # "across" and "from" keywords
>                 across-from: Storage1
>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>             end-sl2vl-scope
>         end-sl2vl-tables
>
>     end-qos-setup
>
>
>     qos-levels
>
>         # the first one is just setting SL
>         qos-level
>             use: for the lowest priority communication
>             sl: 15
>             packet-life: 16
>         end-qos-level
>         # the second sets SL and QoS Class
>         qos-level
>             use: low latency best bandwidth
>             sl: 0
>         end-qos-level
>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
>         qos-level
>             use: just an example
>             sl: 0
>             mtu-limit: 1
>             rate-limit: 1
>             packet-life: 12
>             # Path Bits can be used e.g. to provide different routes through
>             # the subnet to a particular port
>             path-bits: 2,4,8-32
>         end-qos-level
>
>     end-qos-levels
>
>
>     # Match rules are scanned in a first-fit manner (like firewall rules table)
>     qos-match-rules
>
>         # matching by single criteria: class (list of values and ranges)
>         qos-match-rule
>             # just a description
>             use: low latency by class 7-9 or 11
>             qos-class: 7-9,11
>             # number of qos-level to apply to the matching PR/MPR
>             qos-level-sn: 1
>         end-qos-match-rule
>         # show matching by destination group AND service-ids
>         qos-match-rule
>             use: Storage targets connection
>             destination: Storage
>             service-id: 22,4719-5000
>             qos-level-sn: 2
>         end-qos-match-rule
>         # show matching by source group only
>         qos-match-rule
>             use: bla bla
>             source: Storage
>             qos-level-sn: 3
>         end-qos-match-rule
>
>     end-qos-match-rules

What creates this file? If we expect an administrator to create this
manually, then I think we need something much, much simpler.
- Sean From kliteyn at dev.mellanox.co.il Wed Jul 25 17:06:50 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 26 Jul 2007 03:06:50 +0300 Subject: [ofa-general] QoS RFC In-Reply-To: <46A7DEF8.7040608@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A7DEF8.7040608@ichips.intel.com> Message-ID: <46A7E59A.5070801@dev.mellanox.co.il> Sean Hefty wrote: >> QoS Policy file syntax >> >> * Empty lines are ignored >> * Leading and trailing blanks, as well as empty lines, are ignored, so >> the >> indentation in the example is just for better readability >> * Comments are started with the pound sign (#) and terminated by EOL >> * Comments may appear only in a separate line >> * Keywords that denote section/subsection start have matching closing >> keywords >> * Any keyword should be the first non-blank in the line >> >> QoS Policy file example >> >> # Port Groups define sets of ports to be used later in the settings >> port-groups >> # using port GUIDs >> port-group >> name: Storage >> # "use" is just a description that is used for logging. >> # Other than that, it is just a commentary >> use: our SRP storage targets >> port-guid: 0x1000000000000001 >> port-guid: 0x1000000000000002 >> end-port-group >> >> port-group >> name: Virtual Servers >> use: node desc and IB port num >> # The syntax of the port name is as follows: >> "hostname/CA-num/Pnum". >> # "hostname" and "CA-num" are compared to the first 2 >> words of >> # NodeDescription, and "Pnum" is a port number on that node. 
>> port-name: vs1/HCA-1/P1 >> port-name: vs3/HCA-1/P1 >> port-name: vs3/HCA-2/P2 >> end-port-group >> >> # using partitions defined in the partition policy >> port-group >> name: Group for Partition 1 >> use: default settings >> partition: Part1 >> end-port-group >> >> # using node types CA|ROUTER|SWITCH >> port-group >> name: Routers >> use: all routers >> node-type: ROUTER >> end-port-group >> >> end-port-groups >> >> qos-setup >> >> # define all types of VLArb tables. The length of the tables >> should >> # match the physically supported tables by their target ports >> vlarb-tables >> # scope defines the exact ports the VLArb tables apply to >> vlarb-scope >> # defining VLArb tables on all the ports that belong to >> # port group 'Storage', and on all the ports connected >> # to ports of port group 'Storage' >> group: Storage >> # "across" means all the ports that are connected to >> ports >> # that belong to the specified port group >> across: Storage >> # VLArb table holds VL and weight pairs >> vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1 >> vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3 >> vl-high-limit: 10 >> end-vlarb-scope >> # There can be several scopes >> end-vlarb-tables >> >> sl2vl-tables >> # Scope defines the exact devices and in/out ports tables >> apply to. >> # Note: if the same port is matching several rules the >> *FIRST* one applies. >> sl2vl-scope >> # SL2VL tables are orgnized as SL2VL(in-port,out-port) >> # "from: n,m" means we define the SL2VL(n,*) and >> SL2VL(m,*) >> # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m) >> # >> # The following example specifies that all the SL2VL >> tables >> # entries should be defined for all the ports of group >> Part1: >> group: Part1 >> from: * >> to: * >> # SL2VL table has to have 16 values at max - one for >> each SL. 
>> # If the user specifies less than 16 values, all the >> missing >> # VL values will be implicitly set to 0 >> sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 >> end-sl2vl-scope >> >> sl2vl-scope >> # "across-to" is a combination of "across" keyword >> (definition can be found >> # in VLArb tables section) and "to" keyword. >> # "across: PortGroupName" refers to all the ports that >> are connected >> # to ports that belong to PortGroupName. >> # >> # Example of "across-to" usage: >> # A user has a set of 'special' nodes (e.g. storage >> nodes), and all >> # the traffic to these nodes has to get specific VL. >> # The solution is to define port group (i.g. >> "Storage") that will >> # include all the ports of these nodes, and then to >> configure SL2VL >> # tables on all the switch ports that are connected >> to the Storage >> # port group by specifying "across-to: Storage". >> # >> across-to: Storage2 >> # Similar to "across-to", "across-from" is a >> combination of "across" >> # and "to" keywords >> across-from: Storage1 >> sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 >> end-sl2vl-scope >> end-sl2vl-tables >> >> end-qos-setup >> >> >> qos-levels >> >> # the first one is just setting SL >> qos-level >> use: for the lowest priority communication >> sl: 15 >> packet-life: 16 >> end-qos-level >> # the second sets SL and QoS Class >> qos-level >> use: low latency best bandwidth >> sl: 0 >> end-qos-level >> # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, >> Path Bits >> qos-level >> use: just an example >> sl: 0 >> mtu-limit: 1 >> rate-limit: 1 >> packet-life: 12 >> # Path Bits can be used e.g. 
to provide different routes through the
>>             # subnet to a particular port
>>             path-bits: 2,4,8-32
>>         end-qos-level
>>
>>     end-qos-levels
>>
>>
>>     # Match rules are scanned in a first-fit manner (like firewall rules table)
>>     qos-match-rules
>>
>>         # matching by single criteria: class (list of values and ranges)
>>         qos-match-rule
>>             # just a description
>>             use: low latency by class 7-9 or 11
>>             qos-class: 7-9,11
>>             # number of qos-level to apply to the matching PR/MPR
>>             qos-level-sn: 1
>>         end-qos-match-rule
>>         # show matching by destination group AND service-ids
>>         qos-match-rule
>>             use: Storage targets connection
>>             destination: Storage
>>             service-id: 22,4719-5000
>>             qos-level-sn: 2
>>         end-qos-match-rule
>>         # show matching by source group only
>>         qos-match-rule
>>             use: bla bla
>>             source: Storage
>>             qos-level-sn: 3
>>         end-qos-match-rule
>>
>>     end-qos-match-rules
>
> What creates this file? If we expect an administrator to create this
> manually, then I think we need something much, much simpler.

This file has *all* the possible keywords.
The administrator really doesn't have to use them all.
For instance, there are three different ways to define port groups:
 - by guid list
 - by node type
 - by port names
You could stick with the guids only - this gives you all the functionality
you need, but by doing so you lose some flexibility.

-- Yevgeny

> - Sean
>

From jgunthorpe at obsidianresearch.com Wed Jul 25 17:16:16 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 25 Jul 2007 18:16:16 -0600
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A7E59A.5070801@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
 <46A7DEF8.7040608@ichips.intel.com> <46A7E59A.5070801@dev.mellanox.co.il>
Message-ID: <20070726001616.GN19768@obsidianresearch.com>

On Thu, Jul 26, 2007 at 03:06:50AM +0300, Yevgeny Kliteynik wrote:

> This file has *all* the possible keywords.
> The administrator really doesn't have to use them all.
> For instance, there are three different ways to define port groups:
>  - by guid list
>  - by node type
>  - by port names
> You could stick with the guids only - this gives you all the functionality
> you need, but by doing so you lose some flexibility.

As a general quibble, this configuration language is unlike any I have
ever seen. Is it really necessary to make something new for this?
Can't one of the common UNIX styles (ISC bind/dhcp, Windows INI, XML)
work? XML with a DTD is becoming very popular for this kind of rich
data.

Jason

From akepner at sgi.com Wed Jul 25 18:49:31 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 25 Jul 2007 18:49:31 -0700
Subject: [ofa-general] [RFC/PATCH] mthca: ensure alignment of doorbell writes
Message-ID: <20070726014931.GL10235@sgi.com>


On ia64 we sometimes get "kernel unaligned access" exceptions
when doing doorbell writes. How about something like the
following to fix things up?

Tested on ia64 with a Mellanox MT23108 HCA.

 mthca_cq.c       |   33 +++++++++++++++------------------
 mthca_doorbell.h |   15 ++++++++++-----
 mthca_eq.c       |   28 ++++++++++++----------------
 mthca_qp.c       |   41 ++++++++++++++++++-----------------------
 mthca_srq.c      |   16 +++++++---------
 5 files changed, 62 insertions(+), 71 deletions(-)

Signed-off-by: Arthur Kepner
--

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cq.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cq.c	2007-07-25 17:25:16.697025633 -0700
@@ -203,17 +203,16 @@ static void dump_cqe(struct mthca_dev *d
 static inline void update_cons_index(struct mthca_dev *dev, struct mthca_cq *cq,
 				     int incr)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
 	if (mthca_is_memfree(dev)) {
 		*cq->set_ci_db = cpu_to_be32(cq->cons_index);
 		wmb();
 	} else {
-		doorbell[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
-		doorbell[1] = cpu_to_be32(incr - 1);
+		db.val32[0] = cpu_to_be32(MTHCA_TAVOR_CQ_DB_INC_CI | cq->cqn);
+		db.val32[1] = cpu_to_be32(incr - 1);
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_CQ_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_CQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
 		 * Make sure doorbells don't leak out of CQ spinlock
@@ -728,16 +727,15 @@ repoll:
 
 int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
+	db.val32[0] = cpu_to_be32((notify == IB_CQ_SOLICITED ?
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_TAVOR_CQ_DB_REQ_NOT) |
 				  to_mcq(cq)->cqn);
-	doorbell[1] = (__force __be32) 0xffffffff;
+	db.val32[1] = (__force __be32) 0xffffffff;
 
-	mthca_write64(doorbell,
-		      to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
+	mthca_ring_db(db, to_mdev(cq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(cq->device)->doorbell_lock));
 
 	return 0;
@@ -746,18 +744,18 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	u32 sn;
 	__be32 ci;
 
 	sn = cq->arm_sn & 3;
 	ci = cpu_to_be32(cq->cons_index);
 
-	doorbell[0] = ci;
-	doorbell[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
+	db.val32[0] = ci;
+	db.val32[1] = cpu_to_be32((cq->cqn << 8) | (2 << 5) | (sn << 3) |
 				  (notify == IB_CQ_SOLICITED ? 1 : 2));
 
-	mthca_write_db_rec(doorbell, cq->arm_db);
+	mthca_write_db_rec(db.val32, cq->arm_db);
 
 	/*
 	 * Make sure that the doorbell record in host memory is
@@ -765,15 +763,14 @@ int mthca_arbel_arm_cq(struct ib_cq *ibc
 	 */
 	wmb();
 
-	doorbell[0] = cpu_to_be32((sn << 28) |
+	db.val32[0] = cpu_to_be32((sn << 28) |
 				  (notify == IB_CQ_SOLICITED ?
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT_SOL :
 				   MTHCA_ARBEL_CQ_DB_REQ_NOT) |
 				  cq->cqn);
-	doorbell[1] = ci;
+	db.val32[1] = ci;
 
-	mthca_write64(doorbell,
-		      to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
+	mthca_ring_db(db, to_mdev(ibcq->device)->kar + MTHCA_CQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->doorbell_lock));
 
 	return 0;
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_doorbell.h
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_doorbell.h	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_doorbell.h	2007-07-25 18:03:02.088946003 -0700
@@ -42,6 +42,11 @@
 #define MTHCA_CQ_DOORBELL      0x20
 #define MTHCA_EQ_DOORBELL      0x28
 
+union mthca_doorbell {
+	__be64 val64;
+	__be32 val32[2];
+} __attribute__ ((aligned (sizeof(__be64))));
+
 #if BITS_PER_LONG == 64
 /*
  * Assume that we can just write a 64-bit doorbell atomically.  s390
@@ -58,10 +63,10 @@ static inline void mthca_write64_raw(__b
 	__raw_writeq((__force u64) val, dest);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
-	__raw_writeq(*(u64 *) val, dest);
+	__raw_writeq((u64)db.val64, dest);
 }
 
 static inline void mthca_write_db_rec(__be32 val[2], __be32 *db)
@@ -87,14 +92,14 @@ static inline void mthca_write64_raw(__b
 	__raw_writel(((__force u32 *) &val)[1], dest + 4);
 }
 
-static inline void mthca_write64(__be32 val[2], void __iomem *dest,
+static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(doorbell_lock, flags);
-	__raw_writel((__force u32) val[0], dest);
-	__raw_writel((__force u32) val[1], dest + 4);
+	__raw_writel((__force u32) db.val32[0], dest);
+	__raw_writel((__force u32) db.val32[1], dest + 4);
 	spin_unlock_irqrestore(doorbell_lock, flags);
 }
 
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_eq.c	2007-07-25 17:25:34.397279816 -0700
@@ -173,10 +173,10 @@ static inline u64 async_mask(struct mthc
 
 static inline void tavor_set_eq_ci(struct mthca_dev *dev, struct mthca_eq *eq, u32 ci)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
-	doorbell[1] = cpu_to_be32(ci & (eq->nent - 1));
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eq->eqn);
+	db.val32[1] = cpu_to_be32(ci & (eq->nent - 1));
 
 	/*
 	 * This barrier makes sure that all updates to ownership bits
@@ -187,8 +187,7 @@ static inline void tavor_set_eq_ci(struc
 	 * having set_eqe_hw() overwrite the owner field.
 	 */
 	wmb();
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -212,13 +211,11 @@ static inline void set_eq_ci(struct mthc
 
 static inline void tavor_eq_req_not(struct mthca_dev *dev, int eqn)
 {
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 
-	doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
-	doorbell[1] = 0;
-
-	mthca_write64(doorbell,
-		      dev->kar + MTHCA_EQ_DOORBELL,
+	db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn);
+	db.val32[1] = 0;
+	mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 		      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 }
 
@@ -230,13 +227,12 @@ static inline void arbel_eq_req_not(stru
 static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn)
 {
 	if (!mthca_is_memfree(dev)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
-		doorbell[1] = cpu_to_be32(cqn);
+		db.val32[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn);
+		db.val32[1] = cpu_to_be32(cqn);
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_EQ_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_EQ_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 }
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-20 14:42:52.858494231 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_qp.c	2007-07-25 17:25:58.057619693 -0700
@@ -1730,16 +1730,15 @@ int mthca_tavor_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
+		union mthca_doorbell db;
 
-		doorbell[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
+		db.val32[0] = cpu_to_be32(((qp->sq.next_ind << qp->sq.wqe_shift) +
 					   qp->send_wqe_offset) | f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		/*
 		 * Make sure doorbells don't leak out of SQ spinlock
@@ -1760,7 +1759,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1836,13 +1835,12 @@ int mthca_tavor_post_receive(struct ib_q
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-			doorbell[1] = cpu_to_be32(qp->qpn << 8);
+			db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			db.val32[1] = cpu_to_be32(qp->qpn << 8);
 
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
@@ -1852,13 +1850,12 @@ int mthca_tavor_post_receive(struct ib_q
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
+		db.val32[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | nreq);
 
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
@@ -1880,7 +1877,7 @@ int mthca_arbel_post_send(struct ib_qp *
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	void *wqe;
 	void *prev_wqe;
 	unsigned long flags;
@@ -1903,10 +1900,10 @@ int mthca_arbel_post_send(struct ib_qp *
 		if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
+			db.val32[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) |
 						  ((qp->sq.head & 0xffff) << 8) |
 						  f0 | op0);
-			doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+			db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 			qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB;
 			size0 = 0;
@@ -1923,8 +1920,7 @@ int mthca_arbel_post_send(struct ib_qp *
 			 * write MMIO send doorbell.
 			 */
 			wmb();
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_SEND_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 		}
 
@@ -2108,10 +2104,10 @@ int mthca_arbel_post_send(struct ib_qp *
 
 out:
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32((nreq << 24) |
+		db.val32[0] = cpu_to_be32((nreq << 24) |
 					  ((qp->sq.head & 0xffff) << 8) |
 					  f0 | op0);
-		doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0);
+		db.val32[1] = cpu_to_be32((qp->qpn << 8) | size0);
 
 		qp->sq.head += nreq;
 
@@ -2127,8 +2123,7 @@ out:
 		 * write MMIO send doorbell.
 		 */
 		wmb();
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_SEND_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_SEND_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-20 14:42:52.862494291 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_srq.c	2007-07-25 17:26:07.925761483 -0700
@@ -485,7 +485,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
-	__be32 doorbell[2];
+	union mthca_doorbell db;
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -565,8 +565,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
 			nreq = 0;
 
-			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-			doorbell[1] = cpu_to_be32(srq->srqn << 8);
+			db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			db.val32[1] = cpu_to_be32(srq->srqn << 8);
 
 			/*
 			 * Make sure that descriptors are written
@@ -574,8 +574,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 			 */
 			wmb();
 
-			mthca_write64(doorbell,
-				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+			mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 
 			first_ind = srq->first_free;
@@ -583,8 +582,8 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}
 
 	if (likely(nreq)) {
-		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
-		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
+		db.val32[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+		db.val32[1] = cpu_to_be32((srq->srqn << 8) | nreq);
 
 		/*
 		 * Make sure that descriptors are written before
@@ -592,8 +591,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 		 */
 		wmb();
 
-		mthca_write64(doorbell,
-			      dev->kar + MTHCA_RECEIVE_DOORBELL,
+		mthca_ring_db(db, dev->kar + MTHCA_RECEIVE_DOORBELL,
 			      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
 	}
 
--
Arthur

From
mshefty at ichips.intel.com Wed Jul 25 20:00:07 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 25 Jul 2007 20:00:07 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A7E59A.5070801@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> <46A7DEF8.7040608@ichips.intel.com> <46A7E59A.5070801@dev.mellanox.co.il> Message-ID: <46A80E37.5080304@ichips.intel.com> > This file has *all* the possible keywords. > The administrator really doesn't have to use them all. > For instance, there are three different ways to define port groups: > - by guid list > - by node type > - by port names > You could stick with the guids only - this gives you all the functionality > you need, but by doing so you lose some flexibility. Beyond referring to port GUIDs, I'm also referring to items like: sl: 0 mtu-limit: 1 rate-limit: 1 packet-life: 12 path-bits: 2,4,8-32 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3 vl-high-limit: 10 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 sl: 0 mtu-limit: 1 rate-limit: 1 packet-life: 12 path-bits: 2,4,8-32 This is really low level data, akin to the administrator manually programming the switch tables. My take is that we should drop tons of this flexibility in favor of something much simpler for the administrator. - Sean From mst at dev.mellanox.co.il Wed Jul 25 20:39:46 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 26 Jul 2007 06:39:46 +0300 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: <20070726014931.GL10235@sgi.com> References: <20070726014931.GL10235@sgi.com> Message-ID: <20070726033946.GA31524@mellanox.co.il> > @@ -58,10 +63,10 @@ static inline void mthca_write64_raw(__b > __raw_writeq((__force u64) val, dest); > } > > -static inline void mthca_write64(__be32 val[2], void __iomem *dest, > +static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest, > spinlock_t *doorbell_lock) > { > - __raw_writeq(*(u64 *) val, dest); > + __raw_writeq((u64)db.val64, dest); > } > > static inline void mthca_write_db_rec(__be32 val[2], __be32 *db) > @@ -87,14 +92,14 @@ static inline void mthca_write64_raw(__b > __raw_writel(((__force u32 *) &val)[1], dest + 4); > } > > -static inline void mthca_write64(__be32 val[2], void __iomem *dest, > +static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest, > spinlock_t *doorbell_lock) > { > unsigned long flags; > > spin_lock_irqsave(doorbell_lock, flags); > - __raw_writel((__force u32) val[0], dest); > - __raw_writel((__force u32) val[1], dest + 4); > + __raw_writel((__force u32) db.val32[0], dest); > + __raw_writel((__force u32) db.val32[1], dest + 4); > spin_unlock_irqrestore(doorbell_lock, flags); > } These should be getting 'union mthca_doorbell *db' I think. 
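A minimal userspace sketch of the by-pointer variant suggested above, assuming the union has the `val32[2]` and `val64` members used in the patch. The names `union doorbell`, `ring_db`, `ring_example` and the `fake_reg` target are hypothetical stand-ins: the driver itself would write to an `__iomem` address with `__raw_writeq` (or two `__raw_writel` calls under the doorbell lock), not do a plain store.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the shape of the union in the patch: two 32-bit halves overlaid
 * on one 64-bit value, so the whole object is aligned for a 64-bit access. */
union doorbell {
    uint32_t val32[2];
    uint64_t val64;
};

/* Hypothetical stand-in for mthca_ring_db(): the doorbell is passed by
 * pointer, so the union is not copied at the call site. The single 64-bit
 * store models the one-shot write of the 64-bit build. */
static void ring_db(const union doorbell *db, volatile uint64_t *dest)
{
    /* Holds for the union; a bare uint32_t[2] gives no such guarantee. */
    assert(((uintptr_t)db % _Alignof(uint64_t)) == 0);
    *dest = db->val64;
}

/* Fill a doorbell from two words, ring it, and return what was written. */
static uint64_t ring_example(uint32_t w0, uint32_t w1)
{
    volatile uint64_t fake_reg = 0; /* hypothetical register, not real MMIO */
    union doorbell db;

    db.val32[0] = w0;
    db.val32[1] = w1;
    ring_db(&db, &fake_reg);
    return fake_reg;
}
```

Callers would then change from `mthca_ring_db(db, ...)` to `mthca_ring_db(&db, ...)`; the field assignments stay as they are in the patch.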
_______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From weiny2 at llnl.gov Wed Jul 25 21:04:51 2007 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 25 Jul 2007 21:04:51 -0700 Subject: [ofa-general] Re: ANNOUNCE: ofed kernel build updates In-Reply-To: <20070725141141.GG19872@mellanox.co.il> References: <20070725141141.GG19872@mellanox.co.il> Message-ID: <20070725210451.7014d3fc.weiny2@llnl.gov> Michael, I only got a chance to try the ofed_makedist.sh and compile (not actually run). However the build worked very well! So initial feedback is this works much better. Thanks, Ira On Wed, 25 Jul 2007 17:11:41 +0300 "Michael S. Tsirkin" wrote: > Hi! > I'd like to announce a couple of updates that were recently made > to the build scripts on the ofed_kernel branch. > This is an attempt to answer repeated requests, aired at Sonoma, > to simplify access to kernel sources. > > The idea is that a user of a supported kernel will just be able > to download an appropriate tarball and run with it without need for patching. > > These changes are available from ofed_kernel git tree maintained by Vlad: > git://git.openfabrics.org/~vlad/ofed_kernel.git ofed_kernel > > The code is mine, but the ideas mostly come from criticism > and code sent by Ira Weiny. Thanks, Ira! > > Note that the changes were made in a backwards-compatible way, > so that existing scripts using configure/make will continue working. > > What's new: > > 1. New script ofed_scripts/ofed_patch.sh > This will apply fixes and backport patches for a specific > kernel to the current tree. > Usage: > ./ofed_scripts/ofed_patch.sh --with-backport=VERSION > > This makes it possible for distro vendors to generate > a tarball pre-patched for a specific kernel. > > 2. 
New script ofed_scripts/ofed_makedist.sh > This script repeatedly clones the current repository, > runs ofed_scripts/ofed_patch.sh, > and then builds tarballs of ofed kernel source pre-patched > for supported kernel versions. > > I plan to work with Vlad to run this script as part of > nightly builds, so that prepatched tarballs will become > available for download. > > 3. configure script made re-entrant > configure script does not apply patches anymore: > all it does is create configure.mk.kernel and autoconf.h files. > > This finally makes it possible to change > configuration parameters just by re-running configure. > > For backwards-compatibility, if configure detects > that ofed_scripts/ofed_patch.sh was not run yet, > it prints a warning and runs it automatically. > > Feedback welcome. > > -- > MST From kliteyn at mellanox.co.il Wed Jul 25 21:07:51 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 26 Jul 2007 07:07:51 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-26:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply]
OpenSM rev = Thu_Jul_12_11:56:08_2007 [de69204d60071532833b0cdd3baa5e2386dc2c73]
ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de]
Total=520 Pass=520 Fail=0
Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo
13 FatTree merge-roots-4-ary-2-tree.topo
13 FatTree merge-root-4-ary-3-tree.topo
13 FatTree gnu-stallion-64.topo
13 FatTree blend-4-ary-2-tree.topo
13 FatTree RhinoDDR.topo
13 FatTree FullGnu.topo
13 FatTree 4-ary-2-tree.topo
13 FatTree 2-ary-4-tree.topo
13 FatTree 12-node-spaced.topo
13 FTreeFail 4-ary-2-tree-missing-sw-link.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
Failures:
From erezz at voltaire.com Wed Jul 25 21:58:18 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 26 Jul 2007 07:58:18 +0300 Subject: [ofa-general] iSER header References: <20070709144702.GB24125@postal.iol.unh.edu> <46933130.6040100@voltaire.com> <20070725192230.GA13579@postal.iol.unh.edu> Message-ID: <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com> > Ok, so this isn't something that I will need to worry a lot about if I am > planning on using iWARP? You will need to use the iSER header as defined in the iSER spec. >> I hope this helps. > > It does, thank you. > >> BTW - do you plan to use the current iSER initiator >> code for iWARP? > > Yes, we are working on an iSER-assisted initiator and target using this code > and the UNH iSCSI implementation. I guess that you meant that you're using only the UNH iSCSI target (because the iSER initiator should be used with open-iscsi). Will you send patches for iSER soon? I'd like to test it, and make sure that iSER over IB is not damaged. Erez From eitan at mellanox.co.il Wed Jul 25 22:45:56 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 26 Jul 2007 08:45:56 +0300 Subject: [ofa-general] RE: osm_physp_calc_link_ops question In-Reply-To: <20070725211059.GH31582@sashak.voltaire.com> References: <20070725211059.GH31582@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7563C@mtlexch01.mtl.com> Hi Hal, Good idea !
> -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, July 26, 2007 12:11 AM > To: Hal Rosenstock > Cc: OpenFabrics General; Eitan Zahavi; Yevgeny Kliteynik > Subject: Re: osm_physp_calc_link_ops question > > Hi Hal, > > On 13:53 Wed 25 Jul , Hal Rosenstock wrote: > > > > Both osm_lid_mgr.c:__osm_lid_mgr_set_physp_pi and > > osm_link_mgr.c:__osm_link_mgr_set_physp_pi call > > osm_port.c:osm_physp_calc_link_op_vls. In the case where the remote > > end is invalid, the local VLCap is used as the > OperationalVLs. When > > the VLCaps at the two ends of the link do not match, this is not a > > good thing. It causes trap storms on the flow control > watchdog timer > > expiring. Wouldn't it be better to leave this field as is in this > > case or would that cause some other problem ? > > > > Same thing might also be true for link MTU but not as critical. > > Looks like good idea for me. Would you care about patch? > > Sasha > From eitan at mellanox.co.il Wed Jul 25 23:00:50 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 26 Jul 2007 09:00:50 +0300 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com> I propose that when there is no MTU in the partition policy file OpenSM use a configurable default from: /etc/cache/opensm/opensm.opt. Something like: # The default MTU to be used for IPoIB and other MCGs when the partition-policy # does not provide exact value. The default is the lowest possible MTU mcg_default_mtu 1 Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL ________________________________ From: Shirley Ma [mailto:xma at us.ibm.com] Sent: Wednesday, July 25, 2007 10:45 PM To: Eitan Zahavi Cc: general at lists.openfabrics.org; Hal Rosenstock Subject: RE: [ofa-general] Re: openSM: Different IB MTUs Hello Eitan, Hal, Thanks. It's good that openSM has the configuration option to set up these attributes in MC. Would it be a good idea to add the following to openSM: when there is no MTU defined in the configuration file, SM picks the smallest link MTU in the fabric by default? MTU is unlike rate: a slower rate might indicate a cabling problem. So using the smallest link MTU in the fabric might not be a bad choice for MC by default. The reason I ask is that when creating an IP multicast group, MTU is not an attribute of the group. When mapping IP multicast to IB multicast, the IB multicast join might fail because of different IB link MTU sizes in the group, but the IP multicast group will be created successfully without knowing about the failure. If the admin sets the MTU in the configuration file, the admin would know about this failure. Otherwise, admins/users could spend too much time debugging their broken multicast applications. Thanks Shirley Ma "Eitan Zahavi" 07/25/07 12:25 PM To "Hal Rosenstock" , Shirley Ma/Beaverton/IBM at IBMUS cc Subject RE: [ofa-general] Re: openSM: Different IB MTUs Hi Shirley, I think I understand where your question comes from... Many have issues with heterogeneous fabrics where not all nodes have the same MTU or speed, especially since IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. If a node joins an existing MC group it has to have a rate (speed * width) >= MCG.rate and support MTU >= MCG.MTU, otherwise it is denied. If the join is actually a "create", the node has to provide the rate and MTU which define the MCG values.
To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM provides the means to control these values per partition. See the doc/partition-config.doc Still the administrator should know what would be the lowest MTU and rate the nodes expected to join the IPoIB subnet have. The tradeoff is in the hands of the administrator who can set a value that will prevent slow nodes from joining the group, or assign a low value that will fit all nodes but slow down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ________________________________ From: general-bounces at lists.openfabrics.org [ mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma > wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. 
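The join rule discussed in this thread can be sketched as follows. This is an illustration only, not OpenSM code: `struct mcg`, `mcg_join` and the `preset_mtu` parameter are hypothetical names, the MTU codes follow the IB encoding (1 = 256 bytes up to 5 = 4096 bytes), and the comparison is assumed to be "at least the group MTU". Rate would be checked the same way and is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* IB MTU codes: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
enum ib_mtu_code { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* Hypothetical multicast group record; 'exists' is false until created. */
struct mcg {
    bool exists;
    int  mtu;   /* group MTU code, fixed when the group is created */
};

/* Returns true if the request is accepted.
 * - Group does not exist yet: the request acts as a "create". The group MTU
 *   is 'preset_mtu' when nonzero (modelling a configured value), otherwise
 *   the MTU supplied by the first joiner.
 * - Group exists: the joiner must support at least the group MTU. */
static bool mcg_join(struct mcg *g, int joiner_mtu, int preset_mtu)
{
    if (!g->exists) {
        int mtu = preset_mtu ? preset_mtu : joiner_mtu;

        if (joiner_mtu < mtu)
            return false;   /* creator cannot meet the configured MTU */
        g->exists = true;
        g->mtu = mtu;
        return true;
    }
    return joiner_mtu >= g->mtu;
}
```

With a 2K node creating the group first, a later 1K joiner is refused. Presetting the group to the lowest code admits both, at the cost of smaller packets for everyone, which is the tradeoff described above.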
-- Hal Thanks Shirley Ma "Hal Rosenstock" < hal.rosenstock at gmail.com > 07/25/07 10:57 AM To Shirley Ma/Beaverton/IBM at IBMUS cc general at lists.openfabrics.org Subject Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma < xma at us.ibm.com > wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks Shirley Ma From xma at us.ibm.com Wed Jul 25 23:10:03 2007 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 25 Jul 2007 23:10:03 -0700 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com> Message-ID: Eitan, That's a good approach to address the issue. thanks Shirley Ma "Eitan Zahavi" 07/25/07 11:00 PM To Shirley Ma/Beaverton/IBM at IBMUS cc , "Hal Rosenstock" Subject RE: [ofa-general] Re: openSM: Different IB MTUs I propose that when there is no MTU in the partition policy file OpenSM use a configurable default from: /etc/cache/opensm/opensm.opt.
Something like: # The default MTU to be used for IPoIB and other MCGs when the partition-policy # does not provide exact value. The default is the lowest possible MTU mcg_default_mtu 1 Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL From: Shirley Ma [mailto:xma at us.ibm.com] Sent: Wednesday, July 25, 2007 10:45 PM To: Eitan Zahavi Cc: general at lists.openfabrics.org; Hal Rosenstock Subject: RE: [ofa-general] Re: openSM: Different IB MTUs Hello Eitan, Hal, Thanks. It's good openSM has the configuration option to set up these attributes in MC. Is this a good idea to add below to openSM: When there is no MTU defined in the configuration file, SM can pick up the smallest link MTU in the fabrics by default? MTU is unlikely rate, slower rate might indicate the cablling problem. So using the smallest link MTU in the fabrics might not be a bad choice for MC by default. The reason I request here is to create IP multicast group, MTU is not an attribute of the group. When mapping IP multicast to IB multicast, IB muliticast might fail because of different IB link MTU size in the group, but IP multicast group will be successful without knowing the failure. If admin sets MTU in configuration file, admin would know this failure. Otherwise, admin/users could spend too much time on debugging their broken multicasting applications. Thanks Shirley Ma Inactive hide details for "Eitan Zahavi" "Eitan Zahavi" "Eitan Zahavi" To "Hal Rosenstock" 07/25/07 , 12:25 PM Shirley Ma/Beaverton/IBM at IBMUS cc Subject RE: [ofa-general] Re: openSM: Different IB MTUs Hi Shirley, I think I understand where your question comes from... Many have issue with heterogonous fabrics where not all nodes have same MTU or Speed. Especially when IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. 
If a node joins an existing MC group it has to have a rate (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied. If the join is actually a "create" the node has to provide the rate and MTU which define the MCG values. To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM provides the means to control these values per partition. See the doc/partition-config.doc Still the administrator should know what would be the lowest MTU and rate the nodes expected to join the IPoIB subnet have. The tradeoff is in the hands of the administrator who can set a value that will prevent slow nodes from joining the group, or assign a low value that will fit all nodes but slow down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL From: general-bounces at lists.openfabrics.org [ mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. 
-- Hal Thanks Shirley Ma Inactive hide details for "Hal Rosenstock" "Hal Rosenstock" < hal.rosenstock at gmail.com> "H al Ro se ns To to ck Shirley " Ma/Beaverton/IB < M at IBMUS ha l. cc ro se general at lists.o ns penfabrics.org to ck Subject @g ma Re: openSM: il Different IB .c MTUs om > 07 /2 5/ 07 10 :5 7 AM Shirley, On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pic09180.gif Type: image/gif Size: 1255 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2B953147.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 2B631464.gif
Type: image/gif
Size: 45 bytes
Desc: not available
URL:

From eitan at mellanox.co.il Wed Jul 25 23:25:13 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 09:25:13 +0300
Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
In-Reply-To: <20070725194856.GB31582@sashak.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C901F750FD@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com>

> Hi Eitan, Hal,
>
> On 20:44 Wed 25 Jul , Eitan Zahavi wrote:
> >
> > I am not following you.
> > Why does a user need to run -y if a simple legal cable connector is
> > plugged?
>
> Because duplicated GUIDs detector can abort OpenSM when a
> regular port is reconnected to another location during hard
> sweep. This issue is not related to loopback plug at all.

I think we should handle the case of a "migrated port" in a more global sense: if a port "moved" during the sweep we have to do a new sweep anyway. Maybe we could delay the 'abort' to the second sweep. So practically I propose:
1. Add a state flag "was duplicated" on the port, saying it was reported as a duplicate GUID.
2. Set the variable controlling a forced second sweep (similar to the one used if we got a Set error).
3. Repeat the sweep - if we find a port that is a duplicate and has the "was duplicated" flag set - abort.
A refinement for the user who is doing many changes continuously might be to keep a counter, and have the abort happen after the Nth iteration.

>
> > The issue is only if a "loop back" plug connecting a port to itself is
> > plugged.
>
> No, not only. 
Now there are two completely separate known > issues with duplicated GUIDs detector: > > 1. Port moving > 2. Loopback plug > > And I think that _both_ should be solved. And if just using > '-y' could be suitable for (2) because it is esoteric > (although perfectly legal) use, it is not acceptable solution for (1). > > I think we need to improve GUIDs duplication detector > instead. For example we could add NodeInfo comparison there, > and only in case if it is different drop GUIDs duplication > error. Also I think this should not be fatal error and should > not abort OpenSM, just logging (probably via syslog too) > should be sufficient - non-working port is good reason to > look at logs. Another ideas? The problem is that the SM will sort of figure out the network but will create a completely bogus routing etc. > > Sasha > > > Do users use these plugs? For what sake? > > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect Mellanox > Technologies > > LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Wednesday, July 25, 2007 3:19 AM > > > To: Eitan Zahavi > > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > On 23:25 Tue 24 Jul , Eitan Zahavi wrote: > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > Maybe avoid the log if -y is provided? > > > > > > > > > > > > That avoids the spew but the duplicated GUID is > > > important to know so > > > > IMO something in the "middle" is needed where > duplicated GUIDs are > > > > logged but not continually the same ones. > > > > [EZ] > > > > OK so in -y mode only we track which ones were reported > > > and do not > > > > repeat the log? > > > > > > And how port moving problem should be solved? 
> > >
> > > We cannot ask a user to run OpenSM with '-y' if she/he plans to
> > > reconnect some ports in the future and just decrease logging.
> > >
> > > Sasha
> >
> >

From eitan at mellanox.co.il Wed Jul 25 23:26:27 2007
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 26 Jul 2007 09:26:27 +0300
Subject: [ofa-general] RE: pkey.sim.tcl (was: [PATCH] opensm: detect port external reset andflush cached tables)
In-Reply-To: <20070725202418.GD31582@sashak.voltaire.com>
References: <863azhrlm1.fsf@sw053.lab.mtl.com> <20070722102209.GR16597@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com>
Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>

Hi Sasha,

I am happy you actually use the simulator. Please provide more info regarding the failure. You should tar and compress the /tmp/ibmgtsim.XXXX directory of your run.

The following flow is performed by this test:
1. Three partitions are created with random PKeys. The first 2 will have full members. The 3rd has only partial members.
2. Ports are assigned either group 1, group 2 or a combination of groups (1 and 3) or (2 and 3).
3. PKey tables for each port are filled with random indexes for the port's "real" pkeys and other random pkeys. Length is also random.
4. opensm is invoked with a matching partition-policy file; wait for SUBNET UP.
5. osmtest full inventory (including path records) is run from 5 random ports. The code validates that each port's inventory only reports ports it shares a PKey with.
6. The default PKey is removed from ALL the port pkey tables.
7. 
All PKey tables are validated against initial setup to see that the indexes of the assigned "real" pkeys was not altered by the SM. 8. A single switch is selected and its Change Bit is raised. 9. Wait for SUBNET UP 10. Validate all ports got their default pkey back. I suspect from our thread about not setting LFT that stage 10 failed for you. Eitan Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Wednesday, July 25, 2007 11:24 PM > To: Eitan Zahavi; Yevgeny Kliteynik > Cc: Hal Rosenstock; general at lists.openfabrics.org > Subject: pkey.sim.tcl (was: [PATCH] opensm: detect port > external reset andflush cached tables) > > Hi Eitan, Yevgeny, > > > On 00:54 Wed 25 Jul , Sasha Khapyorsky wrote: > > > > This detects port external reset by validating PortState == > INIT, and > > when detected flushes cached port related tables - re-reads > pkey table > > and drops (overwrites) SL2VL and VLArb tables. > > > > Signed-off-by: Sasha Khapyorsky > > [snip...] > > diff --git a/opensm/opensm/osm_port_info_rcv.c > > b/opensm/opensm/osm_port_info_rcv.c > > index 6fe2d1d..0528e38 100644 > > --- a/opensm/opensm/osm_port_info_rcv.c > > +++ b/opensm/opensm/osm_port_info_rcv.c > > @@ -801,6 +801,12 @@ osm_pi_rcv_process( > > p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid; > > } > > > > + /* if port just inited or reached INIT state (external reset) > > + request update for port related tables */ > > + p_physp->need_update = > > + (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT || > > + p_physp->need_update > 1 ) ? 1 : 0; > > + > > switch( osm_node_get_type( p_node ) ) > > { > > case IB_NODE_TYPE_CA: > > @@ -824,7 +830,8 @@ osm_pi_rcv_process( > > /* > > Get the tables on the physp. 
> > */ > > - __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node, > p_physp ); > > + if (p_physp->need_update) > > + __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, > p_node, p_physp > > + ); > > When testing this patch, I tried it with ibmgtsim and test failed: > > RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo > -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl > > The failure is resulted by port pkey tables modifications > which is performed in pkey.sim.tcl. Why should we do this? Is > this legal scenario when pkey tables are modified externally > without Partition Manager? > > Sasha > From ogerlitz at voltaire.com Thu Jul 26 00:02:20 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 26 Jul 2007 10:02:20 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A78146.1090304@ichips.intel.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> Message-ID: <46A846FC.5040704@voltaire.com> Sean Hefty wrote: >> I am willing to go with the local sa coming to serve large MPI jobs, >> so you load as a prerequisite to spawning large all-to-all job. >> But, I think the default for IPoIB needs to be usage of non cached PR. > I think this ties together two things that aren't directly related. We > have two network stacks running on top of each other here. Their > policies should be separate. The rational beyond my argument is that with IPoIB being an L2 packet services for the network stack, when the network stack decides to renew its L2 info for a neighbour (eg as it does not reply to direct probes) if IPoIB uses cached IB info its doing something against what it was asked to do. > As an example, let's reverse this. Imagine instead that you implement > IB over IP. 
Should an IB path refresh policy dictate that IP update its > ARP tables? in this settings (IB above IP), yes. > Or, looking at it differently, do you prevent IP from > updating the ARP table unless the IB stack asks for it? no. If the lower stack wants to update its L2 info, its perfectly fine. For example... the current IPoIB implementation flushes all its IB L2 info (address handles and PRs) when its gets IB event on the port (up/down/lid-change/sm-lid-change/client-re-register/etc), this is very much correct design. > The policy for local PR caching should be set by an administrator. Now, > we could provide a policy setting that ties it to the ARP cache, which > sounds like a good idea. This will be less efficient in some use > models, more efficient in others. But not all PRs belong to IPoIB, so > we need a way to handle this. However, I don't believe that we have to > always enforce such a policy, especially since the current stack doesn't > have this behavior today. I thinking that we are making progress, starting to converge. My suggestion is that if you put the PR caching code within the ib_sa module, add a parameter for the ib_sa_path_rec_get() where the caller specifies if it is willing to get cached PR or not. Also I suggest that rdma_resolve_route() should be also enhanced to have a similar param such that even native IB based ULPs can ask for not cached info if they want to. For example, I think it would be correct for IB block and file I/O ULPs (iSER, SRP, Lustre, rNFS, etc) to request non cached PR, as their connecting model is not all-to-all but rather n-to-m (n clients to m servers with m << n), the connections are long-lived (hours, days, weeks, more) and a connection failure as of PR caching does not seem acceptable. Or. From mst at dev.mellanox.co.il Thu Jul 26 00:22:45 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin)
Date: Thu, 26 Jul 2007 10:22:45 +0300
Subject: [ofa-general] Re: Re: openSM: Different IB MTUs
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID: <20070726072245.GC13258@mellanox.co.il>

What does "1" mean? Surely not 1 byte MTU :)
IMO a good format would be the MTU value in bytes. E.g. 512, 1024, 2048, 4096.

Quoting Shirley Ma :
Subject: RE: Re: openSM: Different IB MTUs

Eitan,

That's a good approach to address the issue.

thanks
Shirley Ma

"Eitan Zahavi"
07/25/07 11:00 PM
To: Shirley Ma/Beaverton/IBM at IBMUS
cc: , "Hal Rosenstock"
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

I propose that when there is no MTU in the partition policy file OpenSM use a configurable default from: /etc/cache/opensm/opensm.opt. Something like:

# The default MTU to be used for IPoIB and other MCGs when the partition-policy
# does not provide exact value. The default is the lowest possible MTU
mcg_default_mtu 1

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
From: Shirley Ma [mailto:xma at us.ibm.com]
Sent: Wednesday, July 25, 2007 10:45 PM
To: Eitan Zahavi
Cc: general at lists.openfabrics.org; Hal Rosenstock
Subject: RE: [ofa-general] Re: openSM: Different IB MTUs

Hello Eitan, Hal,

Thanks. It's good that openSM has the configuration option to set up these attributes in MC. Would it be a good idea to add the following to openSM: when there is no MTU defined in the configuration file, SM picks the smallest link MTU in the fabric by default? MTU is unlike rate: a slower rate might indicate a cabling problem. So using the smallest link MTU in the fabric might not be a bad choice for MC by default.
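To make the "smallest link MTU" default concrete, here is a hedged sketch. It assumes that MTU values are expressed with the InfiniBand MTU enumeration (1 = 256 bytes up to 5 = 4096 bytes), which would also explain the `1` in the proposed `mcg_default_mtu` line. The function names are hypothetical and this is not OpenSM code.

```python
# InfiniBand MTU enumeration: code -> payload size in bytes.
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def mtu_code_to_bytes(code: int) -> int:
    """Translate an IB MTU code into bytes; reject unknown codes."""
    try:
        return IB_MTU_BYTES[code]
    except KeyError:
        raise ValueError(f"unknown IB MTU code: {code}") from None


def default_group_mtu(link_mtu_codes):
    """Pick the smallest link MTU code seen in the fabric (hypothetical policy)."""
    codes = list(link_mtu_codes)
    if not codes:
        raise ValueError("no links to inspect")
    # Codes grow with the MTU size, so the smallest code is the smallest MTU.
    return min(codes)


links = [4, 4, 3, 4]                    # mostly 2K links, one 1K link
code = default_group_mtu(links)
print(code, mtu_code_to_bytes(code))    # 3 1024
```

With such a default, a single 1K link lowers the group MTU for everyone, which is the tradeoff between admitting all nodes and slowing down communication.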
The reason I request here is to create IP multicast group, MTU is not an attribute of the group. When mapping IP multicast to IB multicast, IB muliticast might fail because of different IB link MTU size in the group, but IP multicast group will be successful without knowing the failure. If admin sets MTU in configuration file, admin would know this failure. Otherwise, admin/users could spend too much time on debugging their broken multicasting applications. Thanks Shirley Ma Inactive hide details for "Eitan Zahavi" "Eitan Zahavi" "Eitan Zahavi" [cid] * To "Hal Rosenstock" , Shirley 07/25/07 12:25 PM Ma/Beaverton/IBM at IBMUS [cid] * cc [cid] * Subject RE: [ofa-general] Re: openSM: Different IB MTUs * * Hi Shirley, I think I understand where your question comes from... Many have issue with heterogonous fabrics where not all nodes have same MTU or Speed. Especially when IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. If a node joins an existing MC group it has to have a rate (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied. If the join is actually a "create" the node has to provide the rate and MTU which define the MCG values. To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM provides the means to control these values per partition. See the doc/partition-config.doc Still the administrator should know what would be the lowest MTU and rate the nodes expected to join the IPoIB subnet have. The tradeoff is in the hands of the administrator who can set a value that will prevent slow nodes from joining the group, or assign a low value that will fit all nodes but slow down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ From: general-bounces at lists.openfabrics.org [ mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. -- Hal Thanks Shirley Ma Inactive hide details for "Hal Rosenstock" "Hal Rosenstock" < hal.rosenstock at gmail.com> "Hal Rosenstock" < [cid] * hal.rosenstock at gmail.com> To Shirley Ma/Beaverton/ IBM at IBMUS 07/25/07 10:57 AM [cid] * cc general at lists.openfabrics.org [cid] * Subject Re: openSM: Different IB MTUs * * Shirley, On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? 
-- Hal Thanks Shirley Ma _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From amar.mudrankit at gmail.com Thu Jul 26 00:36:32 2007 From: amar.mudrankit at gmail.com (Amar Mudrankit) Date: Thu, 26 Jul 2007 13:06:32 +0530 Subject: [ofa-general] ARP in IPoIB Message-ID: Hello all, Being new to this group, following questions may sound a bit basic level but I would be really very happy if somebody could help me out in those. 1] Does current implementation of IPoIB supports IP hosts spanning different IB subnets? If so, how does it resolves the IP addresses(QPN+GID etc) because how would the broadcast IP ( multicast IB) request would reach different IB subnet as the router should have multicast routing capabilities? Does it mean that the current IPoIB implementation is supported for hosts within single IB subnet and not beyond that? 2] What is all-router multicast group(RFC-4391 section 10)? How does it help in routing packet to a IP host present on different IB-subnet? Thanks, Regards, Amar -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at dev.mellanox.co.il Thu Jul 26 00:56:01 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Jul 2007 10:56:01 +0300 Subject: [ofa-general] Re: ARP in IPoIB In-Reply-To: References: Message-ID: <20070726075601.GG13258@mellanox.co.il> > Quoting Amar Mudrankit : > Subject: ARP in IPoIB > > Hello all, > > Being new to this group, following questions may sound a bit basic level > but I would be really very happy if somebody could help me out in those. > > 1] Does current implementation of IPoIB supports IP hosts spanning different IB > subnets? 
If so, how does it resolves the IP addresses(QPN+GID etc) because how
> would the broadcast IP ( multicast IB) request would reach different IB subnet
> as the router should have multicast routing capabilities? Does it mean that the
> current IPoIB implementation is supported for hosts within single IB subnet and
> not beyond that?
>
> 2] What is all-router multicast group(RFC-4391 section 10)? How does it help in
> routing packet to a IP host present on different IB-subnet?

Please refer to this document: RFC 4392, IP over InfiniBand (IPoIB) Architecture, Chapter 3.3, "IP Subnets Across IB Subnets", which has some answers to your questions.

Here's a link:
http://tools.ietf.org/html/rfc4392

--
MST

From jackm at dev.mellanox.co.il Thu Jul 26 01:16:58 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 26 Jul 2007 11:16:58 +0300
Subject: [ofa-general] [PATCH] mlx4: fix double-kfree in mlx4_mr_alloc error flow
Message-ID: <200707261116.58679.jackm@dev.mellanox.co.il>

Eliminate double-kfree in mlx4_mr_alloc error flow.
Signed-off-by: Jack Morgenstein

Index: connectx/drivers/net/mlx4/mr.c
===================================================================
--- connectx.orig/drivers/net/mlx4/mr.c 2007-07-26 10:04:57.000000000 +0300
+++ connectx/drivers/net/mlx4/mr.c 2007-07-26 10:08:14.070595000 +0300
@@ -255,10 +255,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev,
 	int err;
 
 	index = mlx4_bitmap_alloc(&priv->mr_table.mpt_bitmap);
-	if (index == -1) {
-		err = -ENOMEM;
-		goto err;
-	}
+	if (index == -1)
+		return -ENOMEM;
 
 	mr->iova = iova;
 	mr->size = size;
@@ -275,9 +273,6 @@ int mlx4_mr_alloc(struct mlx4_dev *dev,
 
 err_index:
 	mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
-
-err:
-	kfree(mr);
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_alloc);

From umaxx at oleco.net Thu Jul 26 01:25:53 2007
From: umaxx at oleco.net (Joerg Zinke)
Date: Thu, 26 Jul 2007 10:25:53 +0200
Subject: [ofa-general] ibv_modify_qp() return value 22
Message-ID: <20070726102553.5b02caea@marvin.local>

Hi,

ibv_modify_qp() fails with return value 22 when I try to open a new CM connection under load (already ~3000 RDMA connections opened). I tried to figure out what return value 22 means but could not find it in the mthca kernel driver.

Any hints? What does return value 22 mean?

I use OFED-1.1 under debian with vanilla kernel 2.6.18.
Regards, Joerg From vlad at lists.openfabrics.org Thu Jul 26 01:39:41 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 26 Jul 2007 01:39:41 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070726-0100 daily build status Message-ID: <20070726083941.5CEE4E60897@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.15 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on powerpc with 
linux-2.6.12
Passed on powerpc with linux-2.6.14
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5

Failed:

From mst at dev.mellanox.co.il Thu Jul 26 01:42:11 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 26 Jul 2007 11:42:11 +0300
Subject: [ofa-general] [PATCH trivial v2] add includes to scsi_transport_iscsi.h
In-Reply-To: <20070725110907.GF3826@mellanox.co.il>
References: <20070725110907.GF3826@mellanox.co.il>
Message-ID: <20070726084211.GB22557@mellanox.co.il>

scsi/scsi_transport_iscsi.h uses struct mutex and struct list_head, so while linux/mutex.h and linux/list.h seem to be pulled in indirectly by one of the headers it includes, the right thing is to include linux/mutex.h and linux/list.h directly.

Signed-off-by: Michael S. Tsirkin

---

Changelog: Mike Christie pointed out that linux/list.h is missing too.

diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h
index 706c0cd..7ff6199 100644
--- a/include/scsi/scsi_transport_iscsi.h
+++ b/include/scsi/scsi_transport_iscsi.h
@@ -24,6 +24,8 @@
 #define SCSI_TRANSPORT_ISCSI_H
 
 #include
+#include
+#include
 #include
 
 struct scsi_transport_template;

--
MST

From amar.mudrankit at gmail.com Thu Jul 26 02:36:02 2007
From: amar.mudrankit at gmail.com (Amar Mudrankit)
Date: Thu, 26 Jul 2007 15:06:02 +0530
Subject: [ofa-general] Re: ARP in IPoIB
In-Reply-To: <20070726075601.GG13258@mellanox.co.il>
References: <20070726075601.GG13258@mellanox.co.il>
Message-ID:

Michael, thanks for your reply. But this gives rise to a couple of questions.
1] If such a multicast routing protocol for IB routers is not yet specified by IBTA or IETF, then the current implementation has the IP subnet restricted to a single IB subnet. According to RFC 4391, section 9.1.1, the link-layer address is formed from the combination of GID + QPN. If we are not spanning IB subnets, what is the use of the GID, given that we need to get the LID from the GID? Probably, in that case an ARP reply with LID, Q_Key and other path information would be helpful, as it resolves the path in 1 step rather than the 2 steps needed with a GID (first to resolve the GID and then to get the LID).

2] When we look at the code, dev->dev_addr is still made up of GID+QPN. What could be the purpose of implementing it this way if we could have dev->dev_addr made up of LID, Q_Key etc., with reference to point 1 above?

On 7/26/07, Michael S. Tsirkin wrote:
>
> > Quoting Amar Mudrankit :
> > Subject: ARP in IPoIB
> >
> > Hello all,
> >
> > Being new to this group, following questions may sound a bit basic level
> > but I would be really very happy if somebody could help me out in those.
> >
> > 1] Does current implementation of IPoIB supports IP hosts spanning different IB
> > subnets? If so, how does it resolves the IP addresses(QPN+GID etc) because how
> > would the broadcast IP ( multicast IB) request would reach different IB subnet
> > as the router should have multicast routing capabilities? Does it mean that the
> > current IPoIB implementation is supported for hosts within single IB subnet and
> > not beyond that?
> >
> > 2] What is all-router multicast group(RFC-4391 section 10)? How does it help in
> > routing packet to a IP host present on different IB-subnet?
>
> Pls refer to this document: RFC 4392, IP over InfiniBand (IPoIB) Architecture
> Chapter 3.3. IP Subnets Across IB Subnets which has some answers to your
> questions.
>
> Here's a link:
> http://tools.ietf.org/html/rfc4392
>
> --
> MST
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vlad at lists.openfabrics.org Thu Jul 26 02:49:47 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Thu, 26 Jul 2007 02:49:47 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070726-0200 daily build status Message-ID: <20070726094947.25B50E60874@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.16 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-22.ELsmp
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.9-34.ELsmp

Failed:

From thorben.dadisman at bistum-essen.de Thu Jul 26 02:50:23 2007
From: thorben.dadisman at bistum-essen.de (Jerri Milton)
Date: Thu, 26 Jul 2007 09:50:23 +0000
Subject: [ofa-general] Sunrise in your life!
Message-ID: <01c7cf6a$62df1aa0$a1b80551@thorben.dadisman>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: manka.gif
Type: image/gif
Size: 8937 bytes
Desc: not available
URL:

From dotanb at dev.mellanox.co.il Thu Jul 26 03:00:01 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 26 Jul 2007 13:00:01 +0300
Subject: [ofa-general] ibv_modify_qp() return value 22
In-Reply-To: <20070726102553.5b02caea@marvin.local>
References: <20070726102553.5b02caea@marvin.local>
Message-ID: <46A870A1.5090401@dev.mellanox.co.il>

Hi.

Joerg Zinke wrote:
> Hi,
>
> ibv_modify_qp() fails with return value 22 when I try to open a new CM
> connection under load (already ~3000 RDMA connections opened). I tried
> to figure out what return value 22 means but could not find it in the
> mthca kernel driver.
>
> Any hints? What does return value 22 mean?
>

A return value of 22 from ibv_modify_qp means that an invalid parameter was passed when calling this verb (22 is EINVAL).
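On Linux, 22 is the value of EINVAL, which can be confirmed from the standard errno tables. The snippet below is only a lookup aid built on Python's standard library; it does not call libibverbs.

```python
import errno
import os

code = 22
print(errno.errorcode[code])   # symbolic name for the number
print(os.strerror(code))       # human-readable message from the C library
```

The same lookup works for any other numeric return value that follows the errno convention.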
If you try to call to ibv_modify_qp without any load (only several QPs) do you still get this error? thanks Dotan From kliteyn at dev.mellanox.co.il Thu Jul 26 03:36:40 2007 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 26 Jul 2007 13:36:40 +0300 Subject: [ofa-general] QoS RFC In-Reply-To: <46A80E37.5080304@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A7DEF8.7040608@ichips.intel.com> <46A7E59A.5070801@dev.mellanox.co.il> <46A80E37.5080304@ichips.intel.com> Message-ID: <46A87938.6040305@dev.mellanox.co.il> Sean Hefty wrote: >> This file has *all* the possible keywords. >> The administrator really doesn't have to use them all. >> For instance, there are three different ways to define port groups: >> - by guid list >> - by node type >> - by port names >> You could stick with the guids only - this gives you all the >> functionality >> you need, but by doing so you loose some flexibility. > > Beyond referring to port GUIDs, I'm also referring to items like: > > sl: 0 > mtu-limit: 1 > rate-limit: 1 > packet-life: 12 > path-bits: 2,4,8-32 > > vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1 > vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3 > vl-high-limit: 10 > > sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 > > sl: 0 > mtu-limit: 1 > rate-limit: 1 > packet-life: 12 > path-bits: 2,4,8-32 > > This is really low level data, akin to the administrator manually > programming the switch tables. My take is that we should drop tons of > this flexibility in favor of something much simpler for the administrator. But again, the administrator doesn't *have* to use all these. He can simply define sl2vl-tables, and then match service-id (in qos-match-rules) to a certain sl (in qos-levels). That's it. No MTU, rate, packet lifetime or any other low level data. Does the following file look better? 

port-groups
    port-group
        name: Part1
        port-guid: 0x1000000000000001
        port-guid: 0x1000000000000002
    end-port-group
    port-group
        name: Part2
        port-guid: 0x1000000000000005
        port-guid: 0x1000000000000006
    end-port-group
end-port-groups

qos-setup
    sl2vl-tables
        sl2vl-scope
            group: Part1
            from: *
            to: *
            sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
        end-sl2vl-scope
        sl2vl-scope
            group: Part2
            from: *
            to: *
            sl2vl-table: 0,1,2,3,4,5,6,7,8,0,1,2,3,4,0
        end-sl2vl-scope
    end-sl2vl-tables
end-qos-setup

qos-levels
    qos-level
        sl: 2
    end-qos-level
    qos-level
        sl: 5
    end-qos-level
end-qos-levels

qos-match-rules
    qos-match-rule
        service-id: 4001-5000
        qos-level-sn: 1
    end-qos-match-rule
    qos-match-rule
        service-id: 5001-6000
        qos-level-sn: 2
    end-qos-match-rule
end-qos-match-rules

-- Yevgeny

> - Sean
>
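
To make the simplified policy file in Yevgeny's message concrete, here is a
rough Python sketch of how its qos-match-rules section could be read and
applied first-match-wins, as the RFC describes for match rules. This is not
OpenSM code: the function names (parse_ranges, parse_match_rules,
first_match) are hypothetical, service IDs are assumed to be plain integers,
and only the service-id and qos-level-sn keywords are handled.

```python
def parse_ranges(spec):
    """Parse '22,4719-5000' into a list of inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        # A single value such as '22' becomes the range (22, 22).
        ranges.append((int(low, 0), int(high or low, 0)))
    return ranges


def parse_match_rules(text):
    """Collect (service-id ranges, qos-level-sn) from qos-match-rule blocks."""
    rules, current = [], None
    for raw in text.splitlines():
        line = raw.strip()  # indentation is only for readability
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        if line == "qos-match-rule":
            current = {}
        elif line == "end-qos-match-rule":
            rules.append((parse_ranges(current["service-id"]),
                          int(current["qos-level-sn"])))
            current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return rules


def first_match(rules, service_id):
    """Return the qos-level-sn of the first rule covering service_id, else None."""
    for ranges, level_sn in rules:
        if any(low <= service_id <= high for low, high in ranges):
            return level_sn
    return None


POLICY = """
qos-match-rules
    qos-match-rule
        service-id: 4001-5000
        qos-level-sn: 1
    end-qos-match-rule
    qos-match-rule
        service-id: 5001-6000
        qos-level-sn: 2
    end-qos-match-rule
end-qos-match-rules
"""

rules = parse_match_rules(POLICY)
print(first_match(rules, 4500))
print(first_match(rules, 7000))
```

The sketch returns the qos-level sequential number rather than an SL, because
how that number indexes the qos-levels section is defined by the policy
itself and is left out here on purpose.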

From eaburns at iol.unh.edu Thu Jul 26 04:11:06 2007
From: eaburns at iol.unh.edu (Ethan Burns)
Date: Thu, 26 Jul 2007 07:11:06 -0400
Subject: [ofa-general] iSER header
In-Reply-To: <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>
References: <20070709144702.GB24125@postal.iol.unh.edu> <46933130.6040100@voltaire.com> <20070725192230.GA13579@postal.iol.unh.edu> <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com>
Message-ID: <20070726111106.GA14180@postal.iol.unh.edu>

On Thu, Jul 26, 2007 at 07:58:18AM +0300, Erez Zilber wrote:
[...]
> I guess that you meant that you're using only the UNH iSCSI target
> (because the iSER initiator should be used with open-iscsi).

Well, we are actually using both initiator and target. We grabbed an older
version of the datamover implementation from the open-iser-target project
and are fitting it to work with both the UNH-iSCSI target and initiator.
After we get this working, we would like to try to interoperate with the
open-iscsi implementation. This is why I was concerned about the header.

> Will you send patches for iSER soon? I'd like to test it, and make sure
> that iSER over IB is not damaged.

Our patches may not interest you since we are using an older version of
the iSER code. However, we will also be exploring the use of IB with our
implementations. Will this require us to use the same non-standard iSER
header in some cases?

Thanks for your help,
Ethan

From kliteyn at dev.mellanox.co.il Thu Jul 26 05:39:36 2007
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 26 Jul 2007 15:39:36 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To: <20070723002010.GU27878@sashak.voltaire.com>
References: <46A283B6.1070105@dev.mellanox.co.il> <20070723002010.GU27878@sashak.voltaire.com>
Message-ID: <46A89608.9010709@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
>
> Some initial comments.
> > On 01:07 Sun 22 Jul , Yevgeny Kliteynik wrote: >> Hi All >> >> Please find the attached RFC describing how QoS policy support could be >> implemented in the OpenFabrics stack. >> Your comments are welcome. >> >> -- Yevgeny >> >> RFC: OpenFabrics Enhancements for QoS Support >> =============================================== >> >> Authors: . Eitan Zahavi >> Authors: . Yevgeny Kliteynik >> Date: .... Jul 2007. >> Revision: 0.2 >> >> Table of contents: >> 1. Overview >> 2. Architecture >> 3. Supported Policy >> 4. CMA functionality >> 5. IPoIB functionality >> 6. SDP functionality >> 7. SRP functionality >> 8. iSER functionality >> 9. OpenSM functionality >> >> 1. Overview >> ------------ >> Quality of Service requirements stem from the realization of I/O >> consolidation >> over IB network: As multiple applications and ULPs share the same fabric, >> means >> to control their use of the network resources are becoming a must. The basic >> need is to differentiate the service levels provided to different traffic >> flows, >> such that a policy could be enforced and control each flow utilization of >> the >> fabric resources. >> >> IBTA specification defined several hardware features and management >> interfaces >> to support QoS: >> * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner >> * Arbitration between traffic of different VLs is performed by a 2 priority >> levels weighted round robin arbiter. 
The arbiter is programmable with >> a sequence of (VL, weight) pairs and maximal number of high priority >> credits >> to be processed before low priority is served >> * Packets carry class of service marking in the range 0 to 15 in their >> header SL field >> * Each switch can map the incoming packet by its SL to a particular output >> VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) >> * The Subnet Administrator controls each communication flow parameters >> by providing them as a response to Path Record (PR) or MultiPathRecord >> (MPR) >> queries >> >> The IB QoS features provide the means to implement a DiffServ like >> architecture. >> DiffServ architecture (IETF RFC2474 2475) is widely used today in highly >> dynamic >> fabrics. >> >> This proposal provides the detailed functional definition for the various >> software elements that are required to enable a DiffServ like architecture >> over >> the OpenFabrics software stack. >> >> >> >> 2. Architecture >> ---------------- >> This proposal split the QoS functionality between the SM/SA, CMA and the >> various >> ULPS. We take the "chronology approach" to describe how the overall system >> works: >> >> 2.1. The network manager (human) provides a set of rules (policy) that >> defines >> how the network is being configured and how its resources are split to >> different >> QoS-Levels. The policy also define how to decide which QoS-Level each >> application or ULP or service use. >> >> 2.2. The SM analyzes the provided policy to see if it is realizable and >> performs >> the necessary fabric setup. The SM may continuously monitor the policy and >> adapt >> to changes in it. Part of this policy defines the default QoS-Level of each >> partition. The SA is being enhanced to match the requested Source, >> Destination, >> QoS-Class, Service-ID (and optionally SL and priority) against the policy. >> So >> clients (ULPs, programs) can obtain a policy enforced QoS. 
The SM is also >> enhanced to support setting up partitions with appropriate IPoIB broadcast >> group. This broadcast group carries its QoS attributes: SL, MTU and >> RATE. >> >> 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the >> multicast group which forms the broadcast group of this partition. >> >> 2.4. MPI which provides non IB based connection management should be >> configured >> to run using hard coded SLs. It uses these SLs for every QP being opened. >> >> 2.5. ULPs that use CM interface (like SRP) should have their own >> pre-assigned >> Service-ID and use it while obtaining PR/MPR for establishing connections. >> The SA receiving the PR/MPR should match it against the policy and return >> the appropriate PR/MPR including SL, MTU and RATE. >> >> 2.6. ULPs and programs using CMA to establish RC connection should provide >> the >> CMA the target IP and Service-ID. Some of the ULPs might also provide >> QoS-Class >> (E.g. for SDP sockets that are provided the TOS socket option). The CMA >> should >> then use the provided Service-ID and optional QoS-Class and pass them in the >> PR/MPR request. The resulting PR/MPR should be used for configuring the >> connection QP. >> >> PathRecord and MultiPathRecord enhancement for QoS: >> As mentioned above the PathRecord and MultiPathRecord attributes should be >> enhanced to carry the Service-ID which is a 64bit value, which has been >> standardized by the IBTA. A new field QoS-Class is also provided. >> A new capability bit should describe the SM QoS support in the SA class port >> info. This approach provides an easy migration path for existing access >> layer >> and ULPs by not introducing new set of PR/MPR attribute. >> >> >> 3. Supported Policy >> -------------------- >> >> The QoS policy supported by this proposal is divided into 4 sub sections: >> >> I) Port Group: a set of CAs, Routers or Switches that share the same >> settings. 
>> A port group might be a partition defined by the partition manager policy in >> terms of GUIDs. Future implementations might provide support for >> NodeDescription >> based definition of port groups. > > Isn't it better to have port group definitions in separate file? So > groups could be shared with other OpenSM components (as discussed). Even > if such group sharing is not high priority functionality this should > save us from redoing things later. > >> II) Fabric Setup: >> Defines how the SL2VL and VLArb tables should be setup. This policy >> definition >> assumes the computation of overall end to end network behavior should be >> performed >> outside of OpenSM. >> >> III) QoS-Levels Definition: >> This section defines the possible sets of parameters for QoS that a client >> might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, >> Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS). >> >> IV) Matching Rules: >> A list of rules that match an incoming PR/MPR request to a QoS-Level. The >> rules are processed in order such as the first match is applied. Each rule >> is >> built out of a set of match expressions which should all match for the rule >> to >> apply. The matching expressions are defined for the following fields >> ** SRC and DST to lists of port groups >> ** Service-ID to a list of Service-ID or Service-ID ranges >> ** QoS-Class to a list of QoS-Class values or ranges >> >> QoS Policy file syntax >> >> * Empty lines are ignored >> * Leading and trailing blanks, as well as empty lines, are ignored, so the >> indentation in the example is just for better readability >> * Comments are started with the pound sign (#) and terminated by EOL >> * Comments may appear only in a separate line > > Why? What is wrong with: > > port-name: vs1/HCA-1/P1 # my best port I can use this too, but then the pound sign, wherever it will appear, would mean commentary start. 
No \# or something like this to include it in some other place - I don't want to complicate the syntax. Sounds OK? >> * Keywords that denote section/subsection start have matching closing >> keywords >> * Any keyword should be the first non-blank in the line >> >> QoS Policy file example >> >> # Port Groups define sets of ports to be used later in the settings >> port-groups >> # using port GUIDs >> port-group >> name: Storage >> # "use" is just a description that is used for logging. >> # Other than that, it is just a commentary >> use: our SRP storage targets >> port-guid: 0x1000000000000001 >> port-guid: 0x1000000000000002 >> end-port-group >> >> port-group >> name: Virtual Servers >> use: node desc and IB port num >> # The syntax of the port name is as follows: >> "hostname/CA-num/Pnum". >> # "hostname" and "CA-num" are compared to the first 2 words of >> # NodeDescription, and "Pnum" is a port number on that node. >> port-name: vs1/HCA-1/P1 >> port-name: vs3/HCA-1/P1 >> port-name: vs3/HCA-2/P2 > > What about wild carding here, like vs1/*/* or just vs1? Good idea. >> end-port-group >> >> # using partitions defined in the partition policy >> port-group >> name: Group for Partition 1 >> use: default settings >> partition: Part1 >> end-port-group >> >> # using node types CA|ROUTER|SWITCH > > Probably also ALL (for all ports), SELF (for SM port)? Agree. >> port-group >> name: Routers >> use: all routers >> node-type: ROUTER >> end-port-group >> >> end-port-groups > > I agree that proposed syntax has better for human readability than pure > XML, but isn't stuff like this will be more user-friendly? > > Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ; > > , or > > Storage "Free Text description" { 0x10001, 0x10002, 0x10003 }; > > , or > > Storage "Free Text description": ROUTERS, CAS ; GUID list is a good idea. Not sure about the other stuff. A certain port group can be defined both by guids and by node-types. 
How about this:

    port-group
        name: routers_and_mgt_nodes
        use: all routers and management nodes
        node-type: ROUTER
        port-guid: 0x10001, 0x10002, 0x10003
    end-port-group

>> qos-setup
>>
>>     # define all types of VLArb tables. The length of the tables should
>>     # match the physically supported tables by their target ports
>>     vlarb-tables
>>         # scope defines the exact ports the VLArb tables apply to
>>         vlarb-scope
>>             # defining VLArb tables on all the ports that belong to
>>             # port group 'Storage', and on all the ports connected
>>             # to ports of port group 'Storage'
>>             group: Storage
>
> So "group" is only for ports that belong to 'Storage'?

Yes, and "across" is for ports that are connected to ports of group 'Storage'.

>>             # "across" means all the ports that are connected to ports
>>             # that belong to the specified port group
>>             across: Storage
>>             # VLArb table holds VL and weight pairs
>>             vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>>             vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>>             vl-high-limit: 10
>>         end-vlarb-scope
>>         # There can be several scopes
>>     end-vlarb-tables
>>
>>     sl2vl-tables
>>         # Scope defines the exact devices and in/out ports tables apply to.
>>         # Note: if the same port is matching several rules the *FIRST* one applies.
>>         sl2vl-scope
>>             # SL2VL tables are organized as SL2VL(in-port,out-port)
>>             # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>>             # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>>             #
>>             # The following example specifies that all the SL2VL tables
>>             # entries should be defined for all the ports of group Part1:
>>             group: Part1
>>             from: *
>>             to: *
>>             # SL2VL table has to have 16 values at max - one for each SL.
>> # If the user specifies less than 16 values, all the missing >> # VL values will be implicitly set to 0 >> sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 >> end-sl2vl-scope >> >> sl2vl-scope >> # "across-to" is a combination of "across" keyword >> (definition can be found >> # in VLArb tables section) and "to" keyword. >> # "across: PortGroupName" refers to all the ports that are >> connected >> # to ports that belong to PortGroupName. >> # >> # Example of "across-to" usage: >> # A user has a set of 'special' nodes (e.g. storage >> nodes), and all >> # the traffic to these nodes has to get specific VL. >> # The solution is to define port group (i.g. "Storage") >> that will >> # include all the ports of these nodes, and then to >> configure SL2VL >> # tables on all the switch ports that are connected to the >> Storage >> # port group by specifying "across-to: Storage". >> # >> across-to: Storage2 >> # Similar to "across-to", "across-from" is a combination of >> "across" >> # and "to" keywords >> across-from: Storage1 >> sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 >> end-sl2vl-scope >> end-sl2vl-tables >> >> end-qos-setup >> >> >> qos-levels >> >> # the first one is just setting SL >> qos-level >> use: for the lowest priority communication >> sl: 15 >> packet-life: 16 >> end-qos-level >> # the second sets SL and QoS Class >> qos-level >> use: low latency best bandwidth >> sl: 0 >> end-qos-level >> # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path >> Bits >> qos-level >> use: just an example >> sl: 0 >> mtu-limit: 1 >> rate-limit: 1 >> packet-life: 12 >> # Path Bits can be used e.g. 
to provide a different routes >> through the >> # subnet to a particular port >> path-bits: 2,4,8-32 >> end-qos-level >> >> end-qos-levels >> >> >> # Match rules are scanned in a first-fit manner (like firewall rules >> table) >> qos-match-rules >> >> # matching by single criteria: class (list of values and ranges) >> qos-match-rule >> # just a description >> use: low latency by class 7-9 or 11 >> qos-class: 7-9,11 >> # number of qos-level to apply to the matching PR/MPR >> qos-level-sn: 1 > > Isn't it better and less error prone to match qos_level by name and not > by sequential number? qos-level can have name, and then qos-match-rule will refer to this name. But matching qos-level by sequential number makes it really easy to locate the referred qos-level, which is important, as every PR/MPR request would go through this process, so saving some runtime in this area is important IMHO. >> end-qos-match-rule >> # show matching by destination group AND service-ids >> qos-match-rule >> use: Storage targets connection >> destination: Storage >> service-id: 22,4719-5000 >> qos-level-sn: 2 >> end-qos-match-rule >> # show matching by source group only >> qos-match-rule >> use: bla bla >> source: Storage >> qos-level-sn: 3 >> end-qos-match-rule >> >> end-qos-match-rules >> >> >> 4. IPoIB >> --------- >> >> IPoIB already query the SA for its broadcast group information. The >> additional >> functionality required is for IPoIB to provide the broadcast group SL, MTU, >> and RATE in every following PathRecord query performed when a new UDAV is >> needed by IPoIB. >> We could assign a special Service-ID for IPoIB use but since all >> communication >> on the same IPoIB interface shares the same QoS-Level without the ability to >> differentiate it by target service we can ignore it for simplicity. >> >> 5. 
CMA features >> ---------------- >> >> The CMA interface supports Service-ID through the notion of port space as a >> prefixes to the port_num which is part of the sockaddr provided to >> rdma_resolve_add(). What is missing is the explicit request for a QoS-Class >> that >> should allow the ULP (like SDP) to propagate a specific request for a class >> of >> service. A mechanism for providing the QoS-Class is available in the IPv6 >> address, >> so we could use that address field. Another option is to implement a special >> connection options API for CMA. >> >> Missing functionality by CMA is the usage of the provided QoS-Class and >> Service-ID >> in the sent PR/MPR. When a response is obtained it is an existing >> requirement for >> the CMA to use the PR/MPR from the response in setting up the QP address >> vector. >> >> >> 6. SDP >> ------- >> >> SDP uses CMA for building its connections. >> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits >> holding the remote TCP/IP Port Number to connect to. >> SDP might be provided with SO_PRIORITY socket option. In that case the value >> provided should be sent to the CMA as the TClass option of that connection. >> >> 7. SRP >> ------- >> >> Current SRP implementation uses its own CM callbacks (not CMA). So SRP >> should >> fill in the Service-ID in the PR/MPR by itself and use that information in >> setting up the QP. The T10 SRP standard defines the SRP Service-ID to be >> defined >> by the SRP target I/O Controller (but they should also comply with IBTA >> Service- >> ID rules). Anyway, the Service-ID is reported by the I/O Controller in the >> ServiceEntries DMA attribute and should be used in the PR/MPR if the SA >> reports its ability to handle QoS PR/MPRs. >> >> 8. iSER >> -------- >> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER >> should be TBD. >> >> >> 9. 
OpenSM features >> ------------------- >> The QoS related functionality to be provided by OpenSM can be split into two >> main parts: >> >> 3.1. Fabric Setup >> During fabric initialization the SM should parse the policy and apply its >> settings to the discovered fabric elements. The following actions should be >> performed: >> * Parsing of policy >> * Node Group identification. Warning should be provided for each node not >> specified but found. >> * SL2VL settings validation should be checked: >> + A warning will be provided if there are no matching targets for the >> SL2VL >> setting statement. >> + An error message will be printed to the log file if an invalid setting >> is >> found. A setting is invalid if it refers to: >> - Non existing port numbers of the target devices >> - Unsupported VLs for the target device. In the later case the map to >> non >> existing VLs should be replaced to VL15 i.e. packets will be dropped. > > I'm not sure it is optimal. We could have well documented or even > configurable mapping rule instead, then this will not limit devices with > higher capabilities. I'm open for suggestions. >> * SL2VL setting is to be performed >> * VL Arbitration table settings should be validated according to the >> following >> rules: >> + A warning will be provided if there are no matching targets for the >> setting >> statement >> + An error will be provided if the port number exceeds the target ports >> + An error will be generated if the table length exceeds device >> capabilities > > Ditto. > >> + A warning will be generated if the table quote a VL that is not supported >> by the target device > > What is "table quote" here? >> * VL Arbitration tables will be set on the appropriate targets >> >> 3.2. PR/MPR query handling: >> OpenSM should be able to enforce the provided policy on client request. 
>> The overall flow for such requests is: first the request is matched against >> the >> defined match rules such that the target QoS-Level definition is found. >> Given >> the QoS-Level a path(s) search is performed with the given restrictions >> imposed >> by that level. The following two sections describe these steps. >> >> How Service-ID is carried in the PathRecord and MultiPathRecord attributes >> is >> now standardized by the IBTA. >> >> >> 3.2.1. Matching rule search: >> A rule is "matching" a PR/MPR request using the following criteria: >> * Matching rules provide values in a list of either single value, or range >> of >> values. A PR/MPR field is "matching" the rule field if it is explicitly >> noted in the list of values or is one of the values covered by a range >> included in the field values list. >> * Only PR/MPR fields that have their component mask bit set should be >> compared. >> * For a rule to be "matching" a PR/MPR request all the rule fields should be >> "matching" their PR/MPR fields. Such that a PR/MPR request that does >> not have a component mask field set for one of the rule defined fields >> can >> not match that rule. >> * A PR/MPR request that have a component mask bit set for one of the fields >> that is not defined by the rule can match the rule. > > Aren't last two too restrictive? SA can just to filter-out paths in > response to match rest of the rule. No? Not sure I'm following. The last bullet is not restrictive at all - it says that if you have a match rule with some reduced set of fields (e.g. only service id), any PR/MPR with a matching service id will be matched, even if it also has MTU, rate, etc. >> The algorithm to be used for searching for a rule match might be as simple >> as a >> sequential search through all rules or enhanced for better performance. 
>> The semantics of every rule field and its matching PR/MPR field are
>> described below:
>> * Source: the SGID or SLID should be part of this group
>> * Destination: the DGID or DLID should be part of this group
>> * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>>   SM-Key field) is matching any of this rule Service-IDs
>> * TClass: check if the PR/MPR TClass field is matching
>>
>> 3.2.2 PR/MPR response generation:
>> The QoS-Level pointed by the first rule that matches the PR/MPR request
>> should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
>> Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
>> matching the query.
>
> Where this default should be defined?

OK, I missed that part. Here it is:
 - the qos-level sequential number is counted from 0
 - qos-level number 0 is mandatory and is treated as the Default Level:
   it is applied to any PR/MPR request that didn't match any match rule
 - the default qos-level can also be referred to explicitly in any match
   rule by specifying "qos-level-sn: 0"

-- Yevgeny

> Sasha
>
>
>> The efficient algorithm for finding paths that meet the QoS-Level criteria
>> is beyond the scope of this RFC and left for the implementer to provide.
>> However the criteria by which the paths match the QoS-Level are described
>> below:
>>
>> * SL: The paths found should all use the given SL. For that sake PR/MPR
>>   algorithm should traverse the path from source to destination only through
>>   ports that carry a valid VL (not VL15) by the SL2VL map (should consider
>>   input and output ports and SL).
>> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
>> * Rate-Limit: The resulting paths RATE should not exceed the given
>>   RATE-Limit (rate limit is given in units of link BW = Width*Speed
>>   according to IBTA Specification Vol-1 table-205 p-901 l-24).
>> * Path-Bits: define the target LID lowest bits (number of bits defined by
>>   the target port PortInfo.LMC field).
The path should traverse the LFT using >> the >> target port LID with the path-bits set. >> * QoS-Class: should be returned in the result PR/MPR. When routing is going >> to >> be supported by OpenSM we might use this field in selecting the target >> router too in a TBD way. >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > From erezz at voltaire.com Thu Jul 26 05:53:39 2007 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 26 Jul 2007 15:53:39 +0300 Subject: [ofa-general] iSER header In-Reply-To: <20070726111106.GA14180@postal.iol.unh.edu> References: <20070709144702.GB24125@postal.iol.unh.edu> <46933130.6040100@voltaire.com> <20070725192230.GA13579@postal.iol.unh.edu> <39C75744D164D948A170E9792AF8E7CA110A31@exil.voltaire.com> <20070726111106.GA14180@postal.iol.unh.edu> Message-ID: <46A89953.2010505@voltaire.com> >> Will you send patches for iSER soon? I'd like to test it, and make sure >> that iSER over IB is not damaged. >> > > Our patches may not interest you since we are using an older version of > the iSER code. However, we will also be exploring the use of IB with our > implementations. Will this require us to use the same non-standard iSER > header in some cases? > Yes, you will need to use the iSER header according to the current implementation. 

Erez

From hal.rosenstock at gmail.com Thu Jul 26 05:58:48 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 08:58:48 -0400
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID:

On 7/26/07, Eitan Zahavi wrote:
>
> I propose that when there is no MTU in the partition policy file
> OpenSM use a configurable default from: /etc/cache/opensm/opensm.opt.
>

That would make this the default rather than 2K. IMO it should be when some
"special" unused mtu is set in the partition config.

-- Hal

> Something like:
> # The default MTU to be used for IPoIB and other MCGs when the
> # partition-policy does not provide exact value. The default is the
> # lowest possible MTU
> mcg_default_mtu 1
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
> ------------------------------
> From: Shirley Ma [mailto:xma at us.ibm.com]
> Sent: Wednesday, July 25, 2007 10:45 PM
> To: Eitan Zahavi
> Cc: general at lists.openfabrics.org; Hal Rosenstock
> Subject: RE: [ofa-general] Re: openSM: Different IB MTUs
>
>
> Hello Eitan, Hal,
>
> Thanks. It's good openSM has the configuration option to set up these
> attributes in MC. Is this a good idea to add below to openSM: When there is
> no MTU defined in the configuration file, SM can pick up the smallest link
> MTU in the fabrics by default? MTU is unlike rate: a slower rate might
> indicate a cabling problem. So using the smallest link MTU in the fabrics
> might not be a bad choice for MC by default. The reason I request here is to
> create IP multicast group, MTU is not an attribute of the group.
When > mapping IP multicast to IB multicast, IB muliticast might fail because of > different IB link MTU size in the group, but IP multicast group will be > successful without knowing the failure. If admin sets MTU in configuration > file, admin would know this failure. Otherwise, admin/users could spend too > much time on debugging their broken multicasting applications. > > Thanks > Shirley Ma > > [image: Inactive hide details for "Eitan Zahavi" ]"Eitan > Zahavi" > > > > *"Eitan Zahavi" * > > 07/25/07 12:25 PM > > > To > > "Hal Rosenstock" , Shirley > Ma/Beaverton/IBM at IBMUS > cc > > > Subject > > RE: [ofa-general] Re: openSM: Different IB MTUs > *Hi Shirley,* > > *I think I understand where your question comes from...* > *Many have issue with heterogonous fabrics where not all nodes have same > MTU or Speed.* > *Especially when IPoIB relies on all nodes joining the broadcast group.* > > *The term "join" for multicast groups is a little overloaded.* > *If a node joins an existing MC group it has to have a rate (speed * > width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied.* > *If the join is actually a "create" the node has to provide the rate and > MTU which define the MCG values.* > > *To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM > provides the means to control these* > *values per partition. See the doc/partition-config.doc* > *Still the administrator should know what would be the lowest MTU and rate > the nodes expected to join the IPoIB subnet have.* > *The tradeoff is in the hands of the administrator who can set a value > that will prevent slow nodes from joining the group, * > *or assign a low value that will fit all nodes but slow down communication > ...* > > *EZ* > > *Eitan Zahavi* > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > ------------------------------ > *From:* general-bounces at lists.openfabrics.org [ > mailto:general-bounces at lists.openfabrics.org] > *On Behalf Of *Hal Rosenstock* > Sent:* Wednesday, July 25, 2007 10:01 PM* > To:* Shirley Ma* > Cc:* general at lists.openfabrics.org* > Subject:* [ofa-general] Re: openSM: Different IB MTUs > > Shirley, > > On 7/25/07, *Shirley Ma* <*xma at us.ibm.com* > wrote: > > Hal, > > Thanks for your prompt reply. I am asking for how openSM handle > different link MTUs in SA MCMemberRecord MTU. For example, if we have some > links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM > decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB > multicast group from a 2K MTU node first, which PMTU value is attaching to > this IB multicast group MCMemberRecord MTU? > > > > MCMemberRecord MTU gets the group MTU (when created). This is either this > first joiner with sufficient components or preconfigured (and MTU can be set > in the config). If a joiner has insufficient MTU for the group, it is > denied. > > -- Hal > > > Thanks > Shirley Ma > > [image: Inactive hide details for "Hal Rosenstock" > ]"Hal Rosenstock" < * > hal.rosenstock at gmail.com* > > > *"Hal Rosenstock" <**hal.rosenstock at gmail.com* > *>* > > 07/25/07 10:57 AM > To > > Shirley Ma/Beaverton/IBM at IBMUS cc > * > **general at lists.openfabrics.org* > Subject > > Re: openSM: Different IB MTUs > Shirley, > > On 7/25/07, *Shirley Ma* <* **xma at us.ibm.com* > > wrote: > Hello Hal, > > How does openSM handle CAs with different MTUs in the > same subnet? For example, IPoIB broadcast group MTU, IB multicast group > PMTU? Does openSM pick up the smallest MTU in the subnet? > > > Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA > MCMemberRecord MTU, or all of these ? > > -- Hal > Thanks > Shirley Ma > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 0E407396.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 0E830176.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: From todd.rimmer at qlogic.com Thu Jul 26 06:23:52 2007 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 26 Jul 2007 08:23:52 -0500 Subject: [ofa-general] Re: ARP in IPoIB In-Reply-To: Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061192B637F@EPEXCH2.qlogic.org> ________________________________ From: Amar Mudrankit Michael, thanks for your reply. But, this gives rise to couple of questions.. 1] If such multicast routing protocol for IB routers is not yet specifid by IBTA or IETF, then current implementation have IP subnet restricted within a IB subnet. According to RFC 4391, section 9.1.1, the link layer address is formed through combination of GID + QPN. If we are not spanning across IB subnets what is the use of GID as we need to get LID from GID? Probably, in that case ARP reply with LID,Q_Key and other path information would be helpful which resolves path in 1 loop than 2 loops in case of GID(first to resolve GID and then to get LID). [Todd Rimmer] Basing IPoIB on the GID keeps open the opportunity for IPoIB to span IB subnets in the future. Also this permits the SM to manage the paths and PathRecord parameters appropriately even in non-routed IB networks. For example, if multi-pathing is used (LMC!=0, hence giving multiple LIDs per port), the SM may respond to PathRecord requests for a given Destination GID with a different LID depending on Source GID. 
Such a mechanism can be used to manage routes in the fabric, etc. That is just a simple example, since the PathRecord includes lots of other information as well (QOS, routing info, MTU, etc). The SM can provide different PathRecord values for each Source GID talking to a given Destination GID. The IETF needed a "MAC Address" for IPoIB. GID+QPN gave them a unique endpoint with the potential to work through routers and still support the full intentions of IB's QOS and routing options. LID+QPN would severely limit those capabilities. In general it's a bad idea for end nodes to simply exchange LIDs as it bypasses many of the intentions of the IB spec. Such applications will break in fabrics which use the more advanced IB QOS, routing, etc options. Todd Rimmer Chief Architect QLogic System Interconnect Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From umaxx at oleco.net Thu Jul 26 06:44:31 2007 From: umaxx at oleco.net (Joerg Zinke) Date: Thu, 26 Jul 2007 15:44:31 +0200 Subject: [ofa-general] ibv_modify_qp() return value 22 In-Reply-To: <46A870A1.5090401@dev.mellanox.co.il> References: <20070726102553.5b02caea@marvin.local> <46A870A1.5090401@dev.mellanox.co.il> Message-ID: <20070726154431.155d967b@marvin.local> Hi, On Thu, 26 Jul 2007 13:00:01 +0300 Dotan Barak wrote: > Joerg Zinke wrote: > > ibv_modify_qp() fails with return value 22 when I try to open a new > > CM connection under load (already ~3000 RDMA connections opened). I > > tried to figure out what return value 22 means but could not find > > it in the mthca kernel driver. > > > > Any hints? What does return value 22 mean? > > > The value 22 is the ibv_modify_qp means that there was an invalid > parameter when calling to this verb. > If you try to call to ibv_modify_qp without any load (only several > QPs) do you still get this error? 
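On Linux, errno 22 is EINVAL ("Invalid argument"), and the ibv_modify_qp() man page describes the return value as 0 on success or an errno value on failure, so the code can be decoded with strerror(). A minimal sketch; the helper name verbs_rc_str is made up for illustration and is not part of libibverbs:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not a libibverbs function): turn the return value
 * of a verb such as ibv_modify_qp() into text, assuming it is 0 on
 * success and an errno value on failure. */
static const char *verbs_rc_str(int rc)
{
    return rc == 0 ? "success" : strerror(rc);
}
```

A failing call could then be logged with something like fprintf(stderr, "modify_qp: %s\n", verbs_rc_str(rc)). This only names the error; it does not say which attribute or mask bit was rejected.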
> In short: no I do not get this error without load, because I start with the no load situation and then more and more clients connecting until ~3000. I have a simple CM server which accepts RDMA connections from thousands clients. The code is based on the example/ stuff, with the same simple handler functions to do the REQ/REP/RTU. Everything is working fine, they all connect in the same manner (same handler functions) until the point where ~3000 clients are connected and the request handler fails on the server to modify the QP for the Reply with return value 22. With load i meant a lot of connections not the data transfer, because not all clients sending the whole time data through the RDMA connections most of the time only one or two of the clients sending... little integer pieces - so not much load on the lines. Cheers Joerg From dledford at redhat.com Thu Jul 26 06:51:44 2007 From: dledford at redhat.com (Doug Ledford) Date: Thu, 26 Jul 2007 13:51:44 +0000 Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A In-Reply-To: <1185297645.14681.22.camel@trinity.ogc.int> References: <1185297645.14681.22.camel@trinity.ogc.int> Message-ID: <1185457905.5165.695.camel@firewall.xsintricity.com> On Tue, 2007-07-24 at 12:20 -0500, Tom Tucker wrote: > For those interested in NFS-RDMA, OGC has created an install package > based on the OFA 1.2 GA release. The package supports both SLES 10 and > RHEL 5. You can download this package from > http://www.opengridcomputing.com/nfs-rdma.html. > > Please let me know if you find any problems. Hi Tom, can you tell me anything about the plans for getting this upstream? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From hal.rosenstock at gmail.com Thu Jul 26 07:16:58 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 26 Jul 2007 14:16:58 +0000 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Fix comment Message-ID: include/iba/ib_types.h: Fix comment Signed-off-by: Hal Rosenstock diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 5820ee6..54c2250 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -4931,7 +4931,7 @@ ib_port_info_get_mtu_cap( * [in] Pointer to a PortInfo attribute. * * RETURN VALUES -* Returns the LMC value assigned to this port. +* Returns the encoded value for the maximum MTU supported by this port. * * NOTES * From swise at opengridcomputing.com Thu Jul 26 07:18:22 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 26 Jul 2007 09:18:22 -0500 Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC) In-Reply-To: <46A69225.9090502@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A54659.8010608@ichips.intel.com> <46A69225.9090502@ichips.intel.com> Message-ID: <46A8AD2E.9000908@opengridcomputing.com> Sean Hefty wrote: > Steve, > > Do you have any input with respect to how the RDMA CM selects and maps > QoS (priority, traffic class, VLAN, flow label, etc.)? (See below) > > Hide the QoS selection under the current interface? Use the IPv6 > flowinfo field? Rely on destination port? Input QoS through existing > or new call? Handle IPv4 and IPv6 addresses differently? ??? > > - Sean > >>> 2.6. ULPs and programs using CMA to establish RC connection should >>> provide the CMA the target IP and Service-ID. Some of the ULPs might >>> also provide QoS-Class (E.g. for SDP sockets that are provided the >>> TOS socket option). 
The CMA should then use the provided Service-ID >>> and optional QoS-Class and pass them in the PR/MPR request. The >>> resulting PR/MPR should be used for configuring the connection QP. >> >> The interface to the CMA needs to remain as transport independent as >> possible, and I am unsure of the transport independence of tying QoS >> to the destination port number. (I'm not disagreeing; I'm just not >> sure at the moment it's the right approach.) >> In the socket API, socket options describe what protocol they are intended for. You can have options that are intended for IP or TCP and other protocol layers. We could do some rdma_setopt() interface, and define both transport independent options and transport-specific options. Then if there are features of qos that are only in IB, you can make them transport-specific options. So an option struct may have a transport_type field... Although I _think_ it will be a good thing to try and map transport-specific qos attributes to a univeral transport independent attribute. But I'm not an expert on qos stuff... >>> 5. CMA features ---------------- >>> >>> The CMA interface supports Service-ID through the notion of port >>> space as a prefixes to the port_num which is part of the sockaddr >>> provided to rdma_resolve_add(). What is missing is the explicit >>> request for a QoS-Class that should allow the ULP (like SDP) to >>> propagate a specific request for a class of service. A mechanism for >>> providing the QoS-Class is available in the IPv6 address, so we could >>> use that address field. Another option is to implement a special >>> connection options API for CMA. >>> >>> Missing functionality by CMA is the usage of the provided QoS-Class >>> and Service-ID in the sent PR/MPR. When a response is obtained it is >>> an existing requirement for the CMA to use the PR/MPR from the >>> response in setting up the QP address vector. 
>> >> The most natural function to specify additional QoS parameters would >> be rdma_resolve_route. Or a more generic rdma_setopt() that can be extensible for future options/attributes and not break the API... My 2 cents. Stevo. From jlentini at netapp.com Thu Jul 26 07:16:51 2007 From: jlentini at netapp.com (James Lentini) Date: Thu, 26 Jul 2007 10:16:51 -0400 (EDT) Subject: [ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A In-Reply-To: <1185457905.5165.695.camel@firewall.xsintricity.com> References: <1185297645.14681.22.camel@trinity.ogc.int> <1185457905.5165.695.camel@firewall.xsintricity.com> Message-ID: On Thu, 26 Jul 2007, Doug Ledford wrote: > On Tue, 2007-07-24 at 12:20 -0500, Tom Tucker wrote: > > For those interested in NFS-RDMA, OGC has created an install package > > based on the OFA 1.2 GA release. The package supports both SLES 10 and > > RHEL 5. You can download this package from > > http://www.opengridcomputing.com/nfs-rdma.html. > > > > Please let me know if you find any problems. > > Hi Tom, can you tell me anything about the plans for getting this > upstream? The goal is to make this code acceptable for 2.6.24. The client and server code have been posted for review on the linux nfs mailing list, nfs at lists.sourceforge.net. See the posts by Tom Talpey on July 11 for the client code http://sourceforge.net/mailarchive/forum.php?forum_name=nfs&max_rows=25&style=ultimate&viewmonth=200707&viewday=11 and the post by Tom Tucker on July 10 for the server code http://sourceforge.net/mailarchive/forum.php?forum_name=nfs&max_rows=25&style=ultimate&viewmonth=200707&viewday=10 Now that the 2.6.23 merge window is close and people have time to review new code, we are hoping for more comments. 
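Regarding the rdma_setopt() idea in Steve Wise's QoS reply above: a rough sketch of an option descriptor with a transport_type field, separating transport-independent options from transport-specific ones. Every name here (sketch_rdma_opt, SKETCH_TRANSPORT_*, sketch_opt_applies) is hypothetical and only illustrates the idea; none of it is an existing RDMA CM interface.

```c
#include <stddef.h>

/* Hypothetical transport tags; ANY marks a transport-independent option. */
enum sketch_transport {
    SKETCH_TRANSPORT_ANY,
    SKETCH_TRANSPORT_IB,
    SKETCH_TRANSPORT_IWARP
};

/* Hypothetical option descriptor, loosely modelled on setsockopt(). */
struct sketch_rdma_opt {
    enum sketch_transport transport_type; /* transport the option targets */
    int name;                             /* option identifier */
    const void *value;                    /* option payload */
    size_t len;                           /* payload length in bytes */
};

/* Return 1 if the option may be applied to an id bound to id_transport:
 * transport-independent options always apply, specific ones must match. */
static int sketch_opt_applies(const struct sketch_rdma_opt *opt,
                              enum sketch_transport id_transport)
{
    return opt->transport_type == SKETCH_TRANSPORT_ANY ||
           opt->transport_type == id_transport;
}
```

With a layout like this, an IB-only QoS attribute would be tagged SKETCH_TRANSPORT_IB and rejected or ignored on an iWARP id, while a universal QoS class would be tagged SKETCH_TRANSPORT_ANY.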
From hal.rosenstock at gmail.com Thu Jul 26 07:25:30 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 10:25:30 -0400
Subject: [ofa-general] [PATCH] OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid
Message-ID:

OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid

Signed-off-by: Hal Rosenstock

diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c
index e03e316..b9c52f4 100644
--- a/opensm/opensm/osm_port.c
+++ b/opensm/opensm/osm_port.c
@@ -387,12 +388,12 @@ osm_physp_calc_link_mtu(
 
   OSM_LOG_ENTER( p_log, osm_physp_calc_link_mtu );
 
-  /* use the available MTU */
-  mtu = ib_port_info_get_mtu_cap( &p_physp->port_info );
-
   p_remote_physp = osm_physp_get_remote( p_physp );
   if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
   {
+    /* use the available MTU */
+    mtu = ib_port_info_get_mtu_cap( &p_physp->port_info );
+
     remote_mtu = ib_port_info_get_mtu_cap( &p_remote_physp->port_info );
 
     if( osm_log_is_active( p_log, OSM_LOG_DEBUG ) )
@@ -427,6 +428,8 @@ osm_physp_calc_link_mtu(
       }
     }
   }
+  else
+    mtu = ib_port_info_get_neighbor_mtu( &p_physp->port_info );
 
   if( mtu == 0 )
   {
@@ -454,12 +457,12 @@ osm_physp_calc_link_op_vls(
 
   OSM_LOG_ENTER( p_log, osm_physp_calc_link_op_vls );
 
-  /* use the available VLCap */
-  op_vls = ib_port_info_get_vl_cap( &p_physp->port_info );
-
   p_remote_physp = osm_physp_get_remote( p_physp );
   if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) )
   {
+    /* use the available VLCap */
+    op_vls = ib_port_info_get_vl_cap( &p_physp->port_info );
+
     remote_op_vls = ib_port_info_get_vl_cap( &p_remote_physp->port_info );
 
     if( osm_log_is_active( p_log, OSM_LOG_DEBUG ) )
@@ -496,6 +499,8 @@ osm_physp_calc_link_op_vls(
       }
     }
   }
+  else
+    op_vls = ib_port_info_get_op_vls( &p_physp->port_info );
 
   /* support user limitation of max_op_vls */
   if (op_vls > p_subn->opt.max_op_vls)

From mst at dev.mellanox.co.il Thu Jul 26 07:31:39 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 26 Jul 2007 17:31:39 +0300 Subject: [ofa-general] Re: [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A In-Reply-To: References: <1185297645.14681.22.camel@trinity.ogc.int> <1185457905.5165.695.camel@firewall.xsintricity.com> Message-ID: <20070726143139.GL22557@mellanox.co.il> > Now that the 2.6.23 merge window is close and people have time to > review new code, we are hoping for more comments. You might want to send copy of patches to openfabrics general and lkml if you do. -- MST From xma at us.ibm.com Thu Jul 26 07:58:21 2007 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 26 Jul 2007 07:58:21 -0700 Subject: [ofa-general] Re: Re: openSM: Different IB MTUs In-Reply-To: <20070726072245.GC13258@mellanox.co.il> Message-ID: Set default as 4 (2K) is more proper than 1(512?). All HCAs support 2K at least now. Thanks Shirley Ma "Michael S. Tsirkin" Shirley Ma/Beaverton/IBM at IBMUS cc 07/26/07 12:22 AM Eitan Zahavi , general at lists.openfabrics.org Please respond to Subject "Michael S. Re: Re: openSM: Different IB MTUs Tsirkin" What does "1" mean? Surely not 1 byte MTU :) IMO a good format would be the MTU value in bytes. E.g. 512, 1024, 2048, 4096. Quoting Shirley Ma : Subject: RE: Re: openSM: Different IB MTUs Eitan, That's a good approach to address the issue.
thanks Shirley Ma Inactive hide details for "Eitan Zahavi" "Eitan Zahavi" "Eitan Zahavi" [cid] * To Shirley Ma/Beaverton/IBM at IBMUS [cid] * 07/25/07 11:00 PM cc , "Hal Rosenstock" [cid] * Subject RE: [ofa-general] Re: openSM: Different IB MTUs * * I propose that when there is no MTU in the partition policy file OpenSM use a configurable default from: /etc/cache/opensm/opensm.opt. Something like: # The default MTU to be used for IPoIB and other MCGs when the partition-policy # does not provide exact value. The default is the lowest possible MTU mcg_default_mtu 1 Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ From: Shirley Ma [mailto:xma at us.ibm.com] Sent: Wednesday, July 25, 2007 10:45 PM To: Eitan Zahavi Cc: general at lists.openfabrics.org; Hal Rosenstock Subject: RE: [ofa-general] Re: openSM: Different IB MTUs Hello Eitan, Hal, Thanks. It's good openSM has the configuration option to set up these attributes in MC. Is this a good idea to add below to openSM: When there is no MTU defined in the configuration file, SM can pick up the smallest link MTU in the fabrics by default? MTU is unlikely rate, slower rate might indicate the cablling problem. So using the smallest link MTU in the fabrics might not be a bad choice for MC by default. The reason I request here is to create IP multicast group, MTU is not an attribute of the group. When mapping IP multicast to IB multicast, IB muliticast might fail because of different IB link MTU size in the group, but IP multicast group will be successful without knowing the failure. If admin sets MTU in configuration file, admin would know this failure. Otherwise, admin/users could spend too much time on debugging their broken multicasting applications. 
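For reference on the mcg_default_mtu value discussed in this thread: IBA MTU fields carry encoded values rather than byte counts, with 1 = 256, 2 = 512, 3 = 1024, 4 = 2048 and 5 = 4096 bytes. The sketch below decodes that encoding, picks the smallest MTU over a set of links, and applies the join rule Hal describes (a joiner with insufficient MTU is denied). The function names are illustrative only and are not OpenSM APIs.

```c
#include <stddef.h>

/* Decode an IBA-encoded MTU (1..5) to bytes; returns 0 for invalid codes. */
static unsigned int mtu_code_to_bytes(unsigned int code)
{
    switch (code) {
    case 1: return 256;
    case 2: return 512;
    case 3: return 1024;
    case 4: return 2048;
    case 5: return 4096;
    default: return 0;
    }
}

/* Smallest encoded MTU among n link MTU codes (0 if n == 0). The codes
 * are ordered like the byte sizes, so they can be compared directly. */
static unsigned int min_mtu_code(const unsigned int *codes, size_t n)
{
    unsigned int min = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (min == 0 || codes[i] < min)
            min = codes[i];
    return min;
}

/* Simplified join check: a port may join only if its MTU is at least
 * the group MTU (both given as encoded values). */
static int mtu_sufficient_for_group(unsigned int port_code,
                                    unsigned int group_code)
{
    return port_code >= group_code;
}
```

For a fabric with 2K and 1K links this yields code 3 (1024 bytes), which is the "smallest link MTU" default suggested above; a group created at code 4 would instead deny the 1K ports.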
Thanks Shirley Ma Inactive hide details for "Eitan Zahavi" "Eitan Zahavi" "Eitan Zahavi" [cid] * To "Hal Rosenstock" , Shirley 07/25/07 12:25 PM Ma/Beaverton/IBM at IBMUS [cid] * cc [cid] * Subject RE: [ofa-general] Re: openSM: Different IB MTUs * * Hi Shirley, I think I understand where your question comes from... Many have issue with heterogonous fabrics where not all nodes have same MTU or Speed. Especially when IPoIB relies on all nodes joining the broadcast group. The term "join" for multicast groups is a little overloaded. If a node joins an existing MC group it has to have a rate (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied. If the join is actually a "create" the node has to provide the rate and MTU which define the MCG values. To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM provides the means to control these values per partition. See the doc/partition-config.doc Still the administrator should know what would be the lowest MTU and rate the nodes expected to join the IPoIB subnet have. The tradeoff is in the hands of the administrator who can set a value that will prevent slow nodes from joining the group, or assign a low value that will fit all nodes but slow down communication ... EZ Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ From: general-bounces at lists.openfabrics.org [ mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal Rosenstock Sent: Wednesday, July 25, 2007 10:01 PM To: Shirley Ma Cc: general at lists.openfabrics.org Subject: [ofa-general] Re: openSM: Different IB MTUs Shirley, On 7/25/07, Shirley Ma wrote: Hal, Thanks for your prompt reply. I am asking for how openSM handle different link MTUs in SA MCMemberRecord MTU. For example, if we have some links MTU as 2K, some links MTU as 1K. 
Then when enabling IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB multicast group from a 2K MTU node first, which PMTU value is attaching to this IB multicast group MCMemberRecord MTU? MCMemberRecord MTU gets the group MTU (when created). This is either this first joiner with sufficient components or preconfigured (and MTU can be set in the config). If a joiner has insufficient MTU for the group, it is denied. -- Hal Thanks Shirley Ma Inactive hide details for "Hal Rosenstock" "Hal Rosenstock" < hal.rosenstock at gmail.com> "Hal Rosenstock" < [cid] * hal.rosenstock at gmail.com> To Shirley Ma/Beaverton/ IBM at IBMUS 07/25/07 10:57 AM [cid] * cc general at lists.openfabrics.org [cid] * Subject Re: openSM: Different IB MTUs * * Shirley, On 7/25/07, Shirley Ma < xma at us.ibm.com> wrote: Hello Hal, How does openSM handle CAs with different MTUs in the same subnet? For example, IPoIB broadcast group MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in the subnet? Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? -- Hal Thanks Shirley Ma _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: pic14492.gif Type: image/gif Size: 1255 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: From akepner at sgi.com Thu Jul 26 08:33:54 2007 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 26 Jul 2007 08:33:54 -0700 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: <20070726033946.GA31524@mellanox.co.il> References: <20070726014931.GL10235@sgi.com> <20070726033946.GA31524@mellanox.co.il> Message-ID: <20070726153354.GN10235@sgi.com> On Thu, Jul 26, 2007 at 06:39:46AM +0300, Michael S. Tsirkin wrote: > .... > These should be getting 'union mthca_doorbell *db' I think. > Hi Michael; Want to make sure I understand your point. Are you saying, e.g., that the function: static inline void mthca_ring_db(union mthca_doorbell db, void __iomem *dest, spinlock_t *doorbell_lock) should instead have the prototype: static inline void mthca_ring_db(union mthca_doorbell* db, void __iomem *dest, spinlock_t *doorbell_lock) ? If so, I'm not sure I agree. The union mthca_doorbell is 64 bits so can be passed in a register, but passing a pointer requires a few extra operations to calculate the address, and dereference the pointer. But maybe I misunderstand you... Now that I look at this again, the __attribute__ ((aligned...)) thing on union mthca_doorbell is pretty silly - of course the alignment is going to be sizeof(__be64).... +union mthca_doorbell { + __be64 val64; + __be32 val32[2]; +} __attribute__ ((aligned (sizeof(__be64)))); + -- Arthur From jlentini at netapp.com Thu Jul 26 08:45:04 2007 From: jlentini at netapp.com (James Lentini) Date: Thu, 26 Jul 2007 11:45:04 -0400 (EDT) Subject: [ofa-general] Re: [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A In-Reply-To: <20070726143139.GL22557@mellanox.co.il> References: <1185297645.14681.22.camel@trinity.ogc.int> <1185457905.5165.695.camel@firewall.xsintricity.com> <20070726143139.GL22557@mellanox.co.il> Message-ID: On Thu, 26 Jul 2007, Michael S. 
Tsirkin wrote: > > > Now that the 2.6.23 merge window is close and people have time to > > review new code, we are hoping for more comments. > > You might want to send copy of patches to openfabrics general and > lkml if you do. Good idea. We were already planning to copy openfabrics for the next round of reviews. From chas at cmf.nrl.navy.mil Thu Jul 26 09:04:56 2007 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Thu, 26 Jul 2007 12:04:56 -0400 Subject: [ofa-general] Re: Re: openSM: Different IB MTUs In-Reply-To: Message-ID: <200707261604.l6QG4uJ5011958@cmf.nrl.navy.mil> In message ,Shirley Ma writes: >Set default as 4 (2K) is more proper than 1(512?). All HCAs support 2K = >at >least now. dont some devices perform better with 1k mtu's? in particular, any device that suffers from the 'tavor quirk'. From mst at dev.mellanox.co.il Thu Jul 26 09:48:21 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Jul 2007 19:48:21 +0300 Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes In-Reply-To: <20070726153354.GN10235@sgi.com> References: <20070726014931.GL10235@sgi.com> <20070726033946.GA31524@mellanox.co.il> <20070726153354.GN10235@sgi.com> Message-ID: <20070726164821.GA3930@mellanox.co.il> > Quoting akepner at sgi.com : > Subject: Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes > > On Thu, Jul 26, 2007 at 06:39:46AM +0300, Michael S. Tsirkin wrote: > > > .... > > These should be getting 'union mthca_doorbell *db' I think. > > > > Hi Michael; > > Want to make sure I understand your point. Are you saying, e.g., > that the function: > > static inline void mthca_ring_db(union mthca_doorbell db, > void __iomem *dest, > spinlock_t *doorbell_lock) > > should instead have the prototype: > > static inline void mthca_ring_db(union mthca_doorbell* db, > void __iomem *dest, > spinlock_t *doorbell_lock) > > ? Yes. > If so, I'm not sure I agree. 
The union mthca_doorbell is > 64 bits so can be passed in a register, but passing a pointer > requires a few extra operations to calculate the address, > and dereference the pointer. But maybe I misunderstand you... This is really coding style thing. It's usually not a good idea to pass unions/structures by value. If union size is later changed to be large, gcc might pass it in a global data section, which fails to be reentrant. Try compiling both variants and looking at the code - I expect there won't be difference. > Now that I look at this again, the __attribute__ ((aligned...)) > thing on union mthca_doorbell is pretty silly - of course the > alignment is going to be sizeof(__be64).... > > +union mthca_doorbell { > + __be64 val64; > + __be32 val32[2]; > +} __attribute__ ((aligned (sizeof(__be64)))); > + Right. -- MST From mshefty at ichips.intel.com Thu Jul 26 10:21:16 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jul 2007 10:21:16 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A846FC.5040704@voltaire.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> Message-ID: <46A8D80C.1090305@ichips.intel.com> > I thinking that we are making progress, starting to converge. > > My suggestion is that if you put the PR caching code within the ib_sa > module, add a parameter for the ib_sa_path_rec_get() where the caller > specifies if it is willing to get cached PR or not. Also I suggest that > rdma_resolve_route() should be also enhanced to have a similar param > such that even native IB based ULPs can ask for not cached info if they > want to. I still believe that these should be separate policies. 
Consider that the cache could have updated immediately before a PR lookup from IPoIB - perhaps in response to an SA event. Administrators can enable or disable the cache. I don't believe that individual applications should be able to override the administrator, nor do I think we gain anything by having per application settings. This is similar to exposing to applications whether they want to use cached ARP information every time they connect. > For example, I think it would be correct for IB block and file I/O ULPs > (iSER, SRP, Lustre, rNFS, etc) to request non cached PR, as their > connecting model is not all-to-all but rather n-to-m (n clients to m > servers with m << n), the connections are long-lived (hours, days, > weeks, more) and a connection failure as of PR caching does not seem > acceptable. I believe a better solution is for everyone to use cached records, if they exist, with a feedback mechanism from the CM that removes paths on a connection failure or path migration event. With all to all connections over the rdma cm, the first thing that needs to be done is resolve the remote addresses to GIDs. This causes an ARP storm, followed by an SA storm caused by IPoIB, followed by a second SA storm caused by the rdma cm. For scalability, we need to remove both of these SA storms, not just the second. We don't see the first SA storm today because IPoIB caches PRs. Let's not add it. Restricting caching to the rdma cm, but removing it from IPoIB leaves us with the same issues that we have today. - Sean From mst at dev.mellanox.co.il Thu Jul 26 10:26:19 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 26 Jul 2007 20:26:19 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A8D80C.1090305@ichips.intel.com> References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> Message-ID: <20070726172619.GA5208@mellanox.co.il> > I believe a better solution is for everyone to use cached records, if > they exist, with a feedback mechanism from the CM that removes paths on > a connection failure or path migration event. Ack timeout on an RC QP is also a good indication we should redo the lookup. -- MST From mshefty at ichips.intel.com Thu Jul 26 10:37:49 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jul 2007 10:37:49 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070726172619.GA5208@mellanox.co.il> References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726172619.GA5208@mellanox.co.il> Message-ID: <46A8DBED.40808@ichips.intel.com> Michael S. Tsirkin wrote: >> I believe a better solution is for everyone to use cached records, if >> they exist, with a feedback mechanism from the CM that removes paths on >> a connection failure or path migration event. > > Ack timeout on an RC QP is also a good indication we should redo the lookup. Do you know if we get a specific event for this? (I don't remember.) Both the ib_cm and rdma_cm have interfaces that allow a user to report events on a connection. 
They are used for path migration today, but we could easily extend them. To minimize issues, I think we'll want some sort of feedback mechanism in place before enabling caching by default. - Sean From mst at dev.mellanox.co.il Thu Jul 26 10:47:00 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Thu, 26 Jul 2007 20:47:00 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A8DBED.40808@ichips.intel.com> References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726172619.GA5208@mellanox.co.il> <46A8DBED.40808@ichips.intel.com> Message-ID: <20070726174700.GB5208@mellanox.co.il> > Quoting Sean Hefty : > Subject: Re: [ofa-general] Re: IPoIB path caching > > Michael S. Tsirkin wrote: > >>I believe a better solution is for everyone to use cached records, if > >>they exist, with a feedback mechanism from the CM that removes paths on > >>a connection failure or path migration event. > > > >Ack timeout on an RC QP is also a good indication we should redo the > >lookup. > > Do you know if we get a specific event for this? (I don't remember.) CQE with error IIRC. > Both the ib_cm and rdma_cm have interfaces that allow a user to report > events on a connection. They are used for path migration today, but we > could easily extend them. Makes sense. > To minimize issues, I think we'll want some sort of feedback mechanism > in place before enabling caching by default. Right. 
-- MST From mshefty at ichips.intel.com Thu Jul 26 10:53:01 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jul 2007 10:53:01 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A87938.6040305@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> <46A7DEF8.7040608@ichips.intel.com> <46A7E59A.5070801@dev.mellanox.co.il> <46A80E37.5080304@ichips.intel.com> <46A87938.6040305@dev.mellanox.co.il> Message-ID: <46A8DF7D.8050706@ichips.intel.com> > But again, the administrator doesn't *have* to use all these. > He can simply define sl2vl-tables, and then match service-id > (in qos-match-rules) to a certain sl (in qos-levels). > That's it. > No MTU, rate, packet lifetime or any other low level data. > Does the following file look better? My take is that it's still too low level (GUIDs, SL to VL mappings, service ID ranges) for a user interface. The format may be fine as the output of some graphical tool or an application that parses a simpler interface file. But for a human, I think we should strive for something simpler (QoS for Dummies), even if we lose some flexibility with the easier interface. Unfortunately, I don't have any specific ideas at the moment beyond 'easy'. 
- Sean

From hal.rosenstock at gmail.com Thu Jul 26 10:53:07 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 26 Jul 2007 13:53:07 -0400
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM: More changes from osm.log to opensm.log
Message-ID:

OpenSM: More changes from osm.log to opensm.log

Signed-off-by: Hal Rosenstock

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index 8038dd3..f3429ff 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -253,7 +253,7 @@ show_usage(void)
   printf( "-f\n"
           "--log_file\n"
           " This option defines the log to be the given file.\n"
-          " By default, the log goes to /var/log/osm.log.\n"
+          " By default, the log goes to /var/log/opensm.log.\n"
           " For the log to go to standard output use -f stdout.\n\n");
   printf( "-L \n"
           "--log_limit \n"
diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 082a00f..766779d 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -136,7 +136,7 @@ static void help_status(FILE *out, int detail)

 static void help_logflush(FILE *out, int detail)
 {
-  fprintf(out, "logflush -- flush the osm.log file\n");
+  fprintf(out, "logflush -- flush the opensm.log file\n");
 }

 static void help_querylid(FILE *out, int detail)
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 1641999..d1b8204 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -225,7 +225,7 @@ __osm_ni_rcv_set_links(
   osm_log( p_rcv->p_log, OSM_LOG_SYS,
            "Errors on subnet. Duplicate GUID found "
            "by link from a port to itself. "
-           "See osm log for more details\n");
+           "See opensm.log for more details\n");
   if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE )
     exit( 1 );

From jgunthorpe at obsidianresearch.com Thu Jul 26 11:11:32 2007
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 26 Jul 2007 12:11:32 -0600
Subject: [ofa-general] Re: IPoIB path caching
In-Reply-To: <46A8D80C.1090305@ichips.intel.com>
References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com>
Message-ID: <20070726181132.GO19768@obsidianresearch.com>

On Thu, Jul 26, 2007 at 10:21:16AM -0700, Sean Hefty wrote:
> >My suggestion is that if you put the PR caching code within the ib_sa
> >module, add a parameter for the ib_sa_path_rec_get() where the caller
> >specifies if it is willing to get cached PR or not. Also I suggest that
> > rdma_resolve_route() should be also enhanced to have a similar param
> >such that even native IB based ULPs can ask for not cached info if they
> >want to.
>
> I still believe that these should be separate policies. Consider that
> the cache could have updated immediately before a PR lookup from IPoIB -
> perhaps in response to an SA event.

FWIW, I agree with Sean. The kernel cache must be authoritative and
must not be overridden by a ULP. View this as the first step to creating
a distributed SA, not as the first step to generalized PR caching.

Linking things like ARP failures and QP failures to cache
'invalidates' is, IMHO, ultimately pointless. My view is that the SA
will have to grow a means to refresh data in the distributed SA when
it reconfigures the network. We have parts of this today via the
various SA traps, but no per-GID invalidation.
A client is probably going to detect a problem in the network before
the SM can fix it, so doing a PR will just get the same old bad
data. Further, in many cases the SM can likely re-route the broken path
so that the old PR is still valid. The number of times you actually
need to change a PR once issued should be very small. If your network
cares about fast-failover then it should have a high LMC and rely on
IB's explicit multipath feature, and the kernel cache design should
support this.

This same argument is why IPoIB ARP decisions really have no bearing
on IB PRs. IPoIB ARP logic and refreshes are designed to support the
distributed ND lookup model - IB PRs have completely different
lifetime rules that are totally unrelated to ARP's lifetime rules. The
existing trap monitoring in Sean's module covers about 90% of the
cases in IB when you need to invalidate a PR, the last 10% will need
something new :(

Sean, it seems to me that a lot of what is being talked about here
really boils down to policy decisions about the caching. Maybe you'd
see less resistance if the kernel module didn't have any policy and
that was left to userspace. Even your choice today of putting the big
GetTable query in the kernel strikes me as something I'd prefer to see
in userspace.

Jason

From mshefty at ichips.intel.com Thu Jul 26 11:58:04 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 11:58:04 -0700
Subject: [ofa-general] Re: QoS in RDMA CM: (was QoS RFC)
In-Reply-To: <46A8AD2E.9000908@opengridcomputing.com>
References: <46A283B6.1070105@dev.mellanox.co.il> <46A54659.8010608@ichips.intel.com> <46A69225.9090502@ichips.intel.com> <46A8AD2E.9000908@opengridcomputing.com>
Message-ID: <46A8EEBC.4090101@ichips.intel.com>

> In the socket API, socket options describe what protocol they are
> intended for. You can have options that are intended for IP or TCP and
> other protocol layers.
>
> We could do some rdma_setopt() interface, and define both transport
> independent options and transport-specific options. Then if there are
> features of qos that are only in IB, you can make them
> transport-specific options. So an option struct may have a
> transport_type field...
>
> Although I _think_ it will be a good thing to try and map
> transport-specific qos attributes to a universal transport independent
> attribute. But I'm not an expert on qos stuff...

Based on the information I found, socket options are used to specify
QoS / TOS / DSCP / whatever they want to call it for IPv4, but not for
IPv6. For IPv6, the TC and FL fields are included with the socket
address.

So... I think we're okay with IPv6, but will need an rdma_setopt() call
to set the QoS info for IPv4 addresses. I think we can keep the QoS
attributes transport independent.

Note that for IB, we could avoid the rdma_setopt() call by mapping the
resulting IB service ID to a QoS level, but I'd rather find a transport
independent solution if possible.

> Or a more generic rdma_setopt() that can be extensible for future
> options/attributes and not break the API...

I agree - my preference is not to break the user space API.

- Sean

From transter at gmail.com Thu Jul 26 12:37:10 2007
From: transter at gmail.com (lbt)
Date: Thu, 26 Jul 2007 12:37:10 -0700
Subject: [ofa-general] Lost in-service traps during Open SM migration
In-Reply-To: <20070725220204.GI31582@sashak.voltaire.com>
References: <20070725220204.GI31582@sashak.voltaire.com>
Message-ID:

Thanks for the suggestion Sasha!

Our host stack does receive a reregistration notice and does
resubscribe all handlers at that point in time.
At the time of the SM migration, our stack prints out some
informational messages to confirm this:

Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8

I also confirmed in the SM logs that after the migration, the higher
priority SM is getting a subscription request for the in-service trap:

Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: Subscribe Request with QPN: 0x000001
Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
gid.....................0x0000000000000000 : 0x0000000000000000
lid_range_begin.........0xFFFF
lid_range_end...........0x0
is_generic..............0x1
subscribe...............0x0
trap_type...............0x3
trap_num................64
qpn.....................0x000001
resp_time_val...........0x0
node_type...............0x000004
Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]

It may be a problem if the resubscription of the in-service handler
occurs after the in-service notice was forwarded, but I think the
problem is that there is never a notice that is forwarded for the
higher priority SM port that is restored. Perhaps neither SM (the
lower priority one or the higher priority one) generates an in-service
trap because of the timing gap between when the restored port is
detected and "marked" (i.e. added to new_ports_list) and when
in-service traps are generated for new ports. During SM migration, the
lower priority SM detects the new port, but the higher priority SM
does the trap generation (but it doesn't realize that its own port is
a new port and thus doesn't generate a trap for it).

Our host stack executes some functions when a port is restored (in our
in-service subscription handler).
Am I not supposed to receive an in-service trap for a restored port that happens to be the Master SM, and instead execute these actions with a client reregistration event? Thanks again for your help! Lan On 7/25/07, Sasha Khapyorsky wrote: > > Hi Lan, > > On 09:57 Wed 25 Jul , lbt wrote: > > Hello, > > > > I have been seeing a problem where a subscriber for in-service traps is > not > > getting informed when the port of master openSM is restored (i.e. > causing an > > SM migration). > > > > I have an IB subnet with 2 nodes running OpenSM , different priorities > of > > course (OpenSM Rev:openib-2.0.5). I also have another node on the > subnet > > that has subscribed for the forwarding of any > IB_SA_GENERIC_TRAP_NUM_IN_SVC > > trap events. I've been doing cable pull tests on the IB ports, to check > if > > the in-service handler I have subscribed gets invoked when I restore > the > > cable. I've noticed that everything works as expected ( i.e. my > in-service > > handler is invoked) whenever I restore the cable on the lower priority > SM IB > > port without ever touching the master SM port. But if I cause an SM > > migration, by restoring the port of the higher priority SM, the > in-service > > trap does not get generated as expected on a cable restore. > > > > Steps to Reproduce: > > 1) Start with port to higher priority SM disconnected. 
> > 2) restore port cable on the higher priority SM > > --> This causes an SM Migration as expected, SM's migration happens > okay > > --> I expected the restoration of the higher priority SM to tit to also > > trigger an in-service trap as well and notify subscribers, but it > doesn't > > occur > > > > I have collected debug messages log for both open SM's, and it appears > that > > the reason is because: > > 1) in-service traps are generated based on what ports are added on the > > Master SM's new_ports_list, but these traps are generated only after > LID > > assignment > > 2) when the higher priority SM port is restored, the restored port gets > > added to the lower priority SM's new_ports_list (since it's still the > Master > > SM at that point in time) > > 3) the handover of Master SM from lower priority to higher priority > SM > > occurs (before LID assignment and thus a chance for traps get generated > for > > those ports on new_ports_list) > > 4) the higher priority SM is now Master SM, but it has an empty > > new_ports_list, so no trap generated either > > > > Does this look like a legitimate Open SM bug? Any feedback would be > much > > appreciated, and if I can help further in any way please let me know . > > As far as I know when OpenSM (even old like 2.0.5) becomes master it > requests client to reregister SA related stuff (by setting this bit in > PortInfo). > > Probably your port doesn't not support this (you could verify by seeing > PortInfo:CapabilityMask - use 'smpquery portinfo ') or > maybe your host stack doesn't do reregistration? > > Anyway you could track this in the OpenSM code in osm_lid_mgr.c > __osm_lid_mgr_set_physp_pi() whenever client reregistration bit is set > (with ib_port_info_set_client_rereg()) or not. Then we will know more > about this problem. 
> > Sasha > > > > > > > Subset of logs from lower priority SM during the cable restore of > higher > > priority SM port: > > ### Jul 18 14:31:56 614522 [41401960] -> > __osm_trap_rcv_process_request: > > Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A > > TID:0x00000016000012e1 > > ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: > Received > > signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE > > ### 14:31:56 ******************** INITIATING HEAVY SWEEP > > ********************** > > ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: > Received > > signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > > OSM_SM_STATE_SWEEP_HEAVY_SELF > > Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding > port > > GUID:0x00504501483e0000 to new_ports_list > > Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received > signal > > OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received > signal > > OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > 14:31:56 ********************* HEAVY SWEEP COMPLETE > *********************** > > Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received > > signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER### > > 14:31:56 ******************** ENTERING SM STANDBY STATE > ******************* > > > > Subset of logs from higher priority SM during the cable restore of > higher > > priority SM port: > > > > Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [ > > Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received > > signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state > > IB_SMINFO_STATE_DISCOVERING > > Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state > > Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: > > ******************** ENTERING SM MASTER STATE ******************** > > 
Jul 18 14:32:03 009014 [41401960] ->
> __osm_state_mgr_set_sm_lid_done_msg:
> > **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> > Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
> > ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> > Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports:
> [
> > ----> no in-service traps are generated and notices forwarded because
> there
> > are no ports on this list
> > Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports:
> ]
> >
> >
> > Thanks!
> > Lan
>
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From mshefty at ichips.intel.com Thu Jul 26 13:16:54 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 26 Jul 2007 13:16:54 -0700
Subject: [ofa-general] Userspace support for SA event registration (was: IPoIB path caching)
In-Reply-To: <20070726181132.GO19768@obsidianresearch.com>
References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726181132.GO19768@obsidianresearch.com>
Message-ID: <46A90136.2090305@ichips.intel.com>

> This same argument is why IPoIB ARP decisions really have no bearing
> on IB PRs. IPoIB ARP logic and refreshes are designed to support the
> distributed ND lookup model - IB PRs have completely different
> lifetime rules that are totally unrelated to ARP's lifetime rules.
> The existing trap monitoring in Sean's module covers about 90% of the
> cases in IB when you need to invalidate a PR, the last 10% will need
> something new :(
>
> Sean, it seems to me that a lot of what is being talked about here
> really boils down to policy decisions about the caching. Maybe you'd
> see less resistance if the kernel module didn't have any policy and
> that was left to userspace. Even your choice today of putting the big
> GetTable query in the kernel strikes me as something I'd prefer to see
> in userspace.

In order to migrate the local SA to user space, we need a way to export
SA event registration. And I don't think we've ever reached agreement
on the best approach to doing this. I've posted patches for one
approach:

http://lists.openfabrics.org/pipermail/general/2007-February/032487.html

This exposes a user space SA interface for event registration and raw
IB multicast support. The approach is generic enough that it could be
extended to other SA queries, but the user MAD interface covers this
area as well.

I'd like to get agreement on an approach for this, even outside of
local SA support.

- Sean

From panda at cse.ohio-state.edu Thu Jul 26 15:02:25 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu, 26 Jul 2007 18:02:25 -0400 (EDT)
Subject: [ofa-general] Announcing the release of MVAPICH2 1.0-beta
Message-ID: <200707262202.l6QM2PLK007824@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of
MVAPICH2-1.0-beta with the following NEW features:

- Message coalescing support to enable reduction of per Queue-pair
  send queues for reduction in memory requirement on large scale
  clusters. This design also increases the small message messaging
  rate significantly. Available for OpenFabrics Gen2-IB.

- Hot-Spot Avoidance Mechanism (HSAM) for alleviating network
  congestion in large scale clusters. Available for OpenFabrics
  Gen2-IB.

- RDMA CM based on-demand connection management for large scale
  clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- uDAPL on-demand connection management for large scale clusters.
  Available for uDAPL interface (including Solaris IB implementation).

- RDMA Read support for increased overlap of computation and
  communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- Application-initiated system-level (synchronous) checkpointing in
  addition to the user-transparent checkpointing. User application can
  now request a whole program checkpoint synchronously with BLCR by
  calling special functions within the application. Available for
  OpenFabrics Gen2-IB.

- Network-Level fault tolerance with Automatic Path Migration (APM)
  for tolerating intermittent network failures over InfiniBand.
  Available for OpenFabrics Gen2-IB.

- Integrated multi-rail communication support for OpenFabrics
  Gen2-iWARP.

- Blocking mode of communication progress. Available for OpenFabrics
  Gen2-IB.

- Based on MPICH2 1.0.5p4.

For downloading MVAPICH2 1.0-beta source code, associated user guide
and accessing the anonymous SVN, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All feedback, including bug reports and hints for performance tuning,
is welcome. Please post it to the mvapich-discuss mailing list.
Thanks,

MVAPICH Team

From sashak at voltaire.com Thu Jul 26 15:41:33 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 01:41:33 +0300
Subject: [ofa-general] Re: pkey.sim.tcl
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com>
Message-ID: <20070726224133.GC2472@sashak.voltaire.com>

Hi Eitan,

On 09:26 Thu 26 Jul , Eitan Zahavi wrote:
>
> I am happy you actually use the simulator.
> Please provide more info regarding the failure. You should tar compress
> the /tmp/ibmgtsim.XXXX of your run.

I can send this to you if you want, but the failure is trivial.

> 6. The default PKey is removed from ALL the port pkey tables
> 7. All PKey tables are validated against initial setup to see that the
> indexes of the assigned "real" pkeys was not altered by the SM.
> 8. A single switch is selected and its Change Bit is raised.
> 9. Wait for SUBNET UP
> 10. Validate all ports got their default pkey back.
>
> I suspect from our thread about not setting LFT that stage 10 failed for
> you.

Yes, and it is due to (6), where the default Pkey is removed
"externally". I'm not sure that OpenSM should handle the case when the
pkey table is modified externally by something which is not the SM.

Sasha

>
> Eitan
>
>
>
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O.
Box 586 Yokneam 20692 ISRAEL
>
>
>
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > Sent: Wednesday, July 25, 2007 11:24 PM
> > To: Eitan Zahavi; Yevgeny Kliteynik
> > Cc: Hal Rosenstock; general at lists.openfabrics.org
> > Subject: pkey.sim.tcl (was: [PATCH] opensm: detect port
> > external reset and flush cached tables)
> >
> > Hi Eitan, Yevgeny,
> >
> >
> > On 00:54 Wed 25 Jul , Sasha Khapyorsky wrote:
> > >
> > > This detects port external reset by validating PortState ==
> > INIT, and
> > > when detected flushes cached port related tables - re-reads
> > pkey table
> > > and drops (overwrites) SL2VL and VLArb tables.
> > >
> > > Signed-off-by: Sasha Khapyorsky
> >
> > [snip...]
> > > diff --git a/opensm/opensm/osm_port_info_rcv.c
> > > b/opensm/opensm/osm_port_info_rcv.c
> > > index 6fe2d1d..0528e38 100644
> > > --- a/opensm/opensm/osm_port_info_rcv.c
> > > +++ b/opensm/opensm/osm_port_info_rcv.c
> > > @@ -801,6 +801,12 @@ osm_pi_rcv_process(
> > > p_rcv->p_subn->master_sm_base_lid = p_pi->master_sm_base_lid;
> > > }
> > >
> > > + /* if port just inited or reached INIT state (external reset)
> > > + request update for port related tables */
> > > + p_physp->need_update =
> > > + (ib_port_info_get_port_state(p_pi) == IB_LINK_INIT ||
> > > + p_physp->need_update > 1 ) ? 1 : 0;
> > > +
> > > switch( osm_node_get_type( p_node ) )
> > > {
> > > case IB_NODE_TYPE_CA:
> > > @@ -824,7 +830,8 @@ osm_pi_rcv_process(
> > > /*
> > > Get the tables on the physp.
> > > */
> > > - __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv, p_node,
> > p_physp );
> > > + if (p_physp->need_update)
> > > + __osm_pi_rcv_get_pkey_slvl_vla_tables( p_rcv,
> > p_node, p_physp
> > > + );
> >
> > When testing this patch, I tried it with ibmgtsim and test failed:
> >
> > RunSimTest -o ${ROOT}/sbin/opensm -t ${TESTS}/IS1-16.topo
> > -f ${TESTS}/pkey.sim.tcl -c ${TESTS}/pkey.check.tcl
> >
> > The failure is resulted by port pkey tables modifications
> > which is performed in pkey.sim.tcl. Why should we do this? Is
> > this legal scenario when pkey tables are modified externally
> > without Partition Manager?
> >
> > Sasha
> >

From sashak at voltaire.com Thu Jul 26 15:47:10 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 27 Jul 2007 01:47:10 +0300
Subject: [ofa-general] Re: openSM: Different IB MTUs
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com>
Message-ID: <20070726224710.GD2472@sashak.voltaire.com>

On 09:00 Thu 26 Jul , Eitan Zahavi wrote:
> I propose that when there is no MTU in the partition policy file OpenSM
> use a
> configurable default from: /etc/cache/opensm/opensm.opt.
> Something like:
> # The default MTU to be used for IPoIB and other MCGs when the
> partition-policy
> # does not provide exact value. The default is the lowest possible MTU
> mcg_default_mtu 1

Looks like a good solution to me. Would somebody care to send a patch?

Sasha

>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O.
Box 586 Yokneam 20692 ISRAEL > > > > ________________________________ > > From: Shirley Ma [mailto:xma at us.ibm.com] > Sent: Wednesday, July 25, 2007 10:45 PM > To: Eitan Zahavi > Cc: general at lists.openfabrics.org; Hal Rosenstock > Subject: RE: [ofa-general] Re: openSM: Different IB MTUs > > > > Hello Eitan, Hal, > > Thanks. It's good openSM has the configuration option to set up > these attributes in MC. Is this a good idea to add below to openSM: When > there is no MTU defined in the configuration file, SM can pick up the > smallest link MTU in the fabrics by default? MTU is unlikely rate, > slower rate might indicate the cablling problem. So using the smallest > link MTU in the fabrics might not be a bad choice for MC by default. The > reason I request here is to create IP multicast group, MTU is not an > attribute of the group. When mapping IP multicast to IB multicast, IB > muliticast might fail because of different IB link MTU size in the > group, but IP multicast group will be successful without knowing the > failure. If admin sets MTU in configuration file, admin would know this > failure. Otherwise, admin/users could spend too much time on debugging > their broken multicasting applications. > > Thanks > Shirley Ma > > "Eitan Zahavi" > > > > > "Eitan Zahavi" > > 07/25/07 12:25 PM > > > > To > > "Hal Rosenstock" , Shirley > Ma/Beaverton/IBM at IBMUS > > > cc > > > > > Subject > > RE: [ofa-general] Re: openSM: Different IB MTUs > > > Hi Shirley, > > I think I understand where your question comes from... > Many have issue with heterogonous fabrics where not all nodes > have same MTU or Speed. > Especially when IPoIB relies on all nodes joining the broadcast > group. > > The term "join" for multicast groups is a little overloaded. > If a node joins an existing MC group it has to have a rate > (speed * width) > MCG.rate and support MTU > MCG.MTU otherwise it is > denied. 
> If the join is actually a "create" the node has to provide the > rate and MTU which define the MCG values. > > To allow for administrator to control the IPoIB MCGs MTU and > rate OpenSM provides the means to control these > values per partition. See the doc/partition-config.doc > Still the administrator should know what would be the lowest MTU > and rate the nodes expected to join the IPoIB subnet have. > The tradeoff is in the hands of the administrator who can set a > value that will prevent slow nodes from joining the group, > or assign a low value that will fit all nodes but slow down > communication ... > > EZ > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > ________________________________ > > From: general-bounces at lists.openfabrics.org [ > mailto:general-bounces at lists.openfabrics.org] On Behalf Of Hal > Rosenstock > Sent: Wednesday, July 25, 2007 10:01 PM > To: Shirley Ma > Cc: general at lists.openfabrics.org > Subject: [ofa-general] Re: openSM: Different IB MTUs > > Shirley, > > On 7/25/07, Shirley Ma > > wrote: > > Hal, > > Thanks for your prompt reply. I am asking for how openSM > handle different link MTUs in SA MCMemberRecord MTU. For example, if we > have some links MTU as 2K, some links MTU as 1K. Then when enabling > IPoIB, how does SM decide IPoIB broadcast group MCMemberRecord MTU size? > When creating an IB multicast group from a 2K MTU node first, which PMTU > value is attaching to this IB multicast group MCMemberRecord MTU? > > > > MCMemberRecord MTU gets the group MTU (when created). This is > either this first joiner with sufficient components or preconfigured > (and MTU can be set in the config). If a joiner has insufficient MTU for > the group, it is denied. 
> > -- Hal > > > > Thanks > Shirley Ma > > "Hal Rosenstock" < hal.rosenstock at gmail.com > > > > > > > "Hal Rosenstock" < > hal.rosenstock at gmail.com > > > 07/25/07 10:57 AM > > > > To > > Shirley Ma/Beaverton/IBM at IBMUS > > cc > > general at lists.openfabrics.org > > Subject > > Re: openSM: Different IB MTUs > > > Shirley, > > On 7/25/07, Shirley Ma < xma at us.ibm.com > > wrote: > > Hello Hal, > > How does openSM handle CAs with > different MTUs in the same subnet? For example, IPoIB broadcast group > MTU, IB multicast group PMTU? Does openSM pick up the smallest MTU in > the subnet? > > > > Are you asking about link MTU, SA > PathRecord/MultiPathRecord MTU, SA MCMemberRecord MTU, or all of these ? > > -- Hal > > Thanks > Shirley Ma > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jul 26 15:50:05 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 01:50:05 +0300 Subject: [ofa-general] Re: openSM: Different IB MTUs In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901F755DF@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7564E@mtlexch01.mtl.com> Message-ID: <20070726225005.GE2472@sashak.voltaire.com> On 08:58 Thu 26 Jul , Hal Rosenstock wrote: > On 7/26/07, Eitan Zahavi wrote: > > > > *I propose that when there is no MTU in the partition policy file > > OpenSM use a * > > *configurable default from: **/etc/cache/opensm/opensm.opt.* > > > > That would make this the default rather than 2K. IMO it should be when some > "special" unused mtu is set in the partition config. "No value" should suitable too. No? Sasha > > -- Hal > > *Something like:* > > *# The default MTU to be used for IPoIB and other MCGs when the > > partition-policy * > > *# does not provide exact value. 
The default is the lowest possible MTU* > > *mcg_default_mtu 1* > > ** > > *Eitan Zahavi*** > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > ------------------------------ > > *From:* Shirley Ma [mailto:xma at us.ibm.com] > > *Sent:* Wednesday, July 25, 2007 10:45 PM > > *To:* Eitan Zahavi > > *Cc:* general at lists.openfabrics.org; Hal Rosenstock > > *Subject:* RE: [ofa-general] Re: openSM: Different IB MTUs > > > > > > > > Hello Eitan, Hal, > > > > Thanks. It's good openSM has the configuration option to set up these > > attributes in MC. Is this a good idea to add below to openSM: When there is > > no MTU defined in the configuration file, SM can pick up the smallest link > > MTU in the fabrics by default? MTU is unlikely rate, slower rate might > > indicate the cablling problem. So using the smallest link MTU in the fabrics > > might not be a bad choice for MC by default. The reason I request here is to > > create IP multicast group, MTU is not an attribute of the group. When > > mapping IP multicast to IB multicast, IB muliticast might fail because of > > different IB link MTU size in the group, but IP multicast group will be > > successful without knowing the failure. If admin sets MTU in configuration > > file, admin would know this failure. Otherwise, admin/users could spend too > > much time on debugging their broken multicasting applications. 
> > > > Thanks > > Shirley Ma > > > > [image: Inactive hide details for "Eitan Zahavi" ]"Eitan > > Zahavi" > > > > > > > > *"Eitan Zahavi" * > > > > 07/25/07 12:25 PM > > > > > > To > > > > "Hal Rosenstock" , Shirley > > Ma/Beaverton/IBM at IBMUS > > cc > > > > > > Subject > > > > RE: [ofa-general] Re: openSM: Different IB MTUs > > *Hi Shirley,* > > > > *I think I understand where your question comes from...* > > *Many have issue with heterogonous fabrics where not all nodes have same > > MTU or Speed.* > > *Especially when IPoIB relies on all nodes joining the broadcast group.* > > > > *The term "join" for multicast groups is a little overloaded.* > > *If a node joins an existing MC group it has to have a rate (speed * > > width) > MCG.rate and support MTU > MCG.MTU otherwise it is denied.* > > *If the join is actually a "create" the node has to provide the rate and > > MTU which define the MCG values.* > > > > *To allow for administrator to control the IPoIB MCGs MTU and rate OpenSM > > provides the means to control these* > > *values per partition. See the doc/partition-config.doc* > > *Still the administrator should know what would be the lowest MTU and rate > > the nodes expected to join the IPoIB subnet have.* > > *The tradeoff is in the hands of the administrator who can set a value > > that will prevent slow nodes from joining the group, * > > *or assign a low value that will fit all nodes but slow down communication > > ...* > > > > *EZ* > > > > *Eitan Zahavi* > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > ------------------------------ > > *From:* general-bounces at lists.openfabrics.org [ > > mailto:general-bounces at lists.openfabrics.org] > > *On Behalf Of *Hal Rosenstock* > > Sent:* Wednesday, July 25, 2007 10:01 PM* > > To:* Shirley Ma* > > Cc:* general at lists.openfabrics.org* > > Subject:* [ofa-general] Re: openSM: Different IB MTUs > > > > Shirley, > > > > On 7/25/07, *Shirley Ma* <*xma at us.ibm.com* > wrote: > > > > Hal, > > > > Thanks for your prompt reply. I am asking for how openSM handle > > different link MTUs in SA MCMemberRecord MTU. For example, if we have some > > links MTU as 2K, some links MTU as 1K. Then when enabling IPoIB, how does SM > > decide IPoIB broadcast group MCMemberRecord MTU size? When creating an IB > > multicast group from a 2K MTU node first, which PMTU value is attaching to > > this IB multicast group MCMemberRecord MTU? > > > > > > > > MCMemberRecord MTU gets the group MTU (when created). This is either this > > first joiner with sufficient components or preconfigured (and MTU can be set > > in the config). If a joiner has insufficient MTU for the group, it is > > denied. > > > > -- Hal > > > > > > Thanks > > Shirley Ma > > > > [image: Inactive hide details for "Hal Rosenstock" > > ]"Hal Rosenstock" < * > > hal.rosenstock at gmail.com* > > > > > *"Hal Rosenstock" <**hal.rosenstock at gmail.com* > > *>* > > > > 07/25/07 10:57 AM > > To > > > > Shirley Ma/Beaverton/IBM at IBMUS cc > > * > > **general at lists.openfabrics.org* > > Subject > > > > Re: openSM: Different IB MTUs > > Shirley, > > > > On 7/25/07, *Shirley Ma* <* **xma at us.ibm.com* > > > wrote: > > Hello Hal, > > > > How does openSM handle CAs with different MTUs in the > > same subnet? For example, IPoIB broadcast group MTU, IB multicast group > > PMTU? Does openSM pick up the smallest MTU in the subnet? 
> > > > > > Are you asking about link MTU, SA PathRecord/MultiPathRecord MTU, SA > > MCMemberRecord MTU, or all of these ? > > > > -- Hal > > Thanks > > Shirley Ma > > > > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu Jul 26 16:15:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:15:40 +0300 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Fix comment In-Reply-To: References: Message-ID: <20070726231540.GF2472@sashak.voltaire.com> On 14:16 Thu 26 Jul , Hal Rosenstock wrote: > include/iba/ib_types.h: Fix comment > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Thu Jul 26 16:16:04 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:16:04 +0300 Subject: [ofa-general] Re: [PATCH] OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid In-Reply-To: References: Message-ID: <20070726231604.GG2472@sashak.voltaire.com> On 10:25 Thu 26 Jul , Hal Rosenstock wrote: > OpenSM/osm_port.c: Fix opvls and neighbormtu when remote port invalid > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Thu Jul 26 16:22:43 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:22:43 +0300 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: More changes from osm.log to opensm.log In-Reply-To: References: Message-ID: <20070726232243.GH2472@sashak.voltaire.com> On 13:53 Thu 26 Jul , Hal Rosenstock wrote: > OpenSM: More changes from osm.log to opensm.log > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Thu Jul 26 16:30:05 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:30:05 +0300 Subject: [ofa-general] [PATCH] opensm: remove reassign_lfts configuration parameter Message-ID: <20070726233005.GJ2472@sashak.voltaire.com> This removes actually useless subn.opt.reassign_lfts parameter. Its value is used only for initial setup of ignore_existing_lfts flag. But later this flag becomes unconditionally TRUE if at least one switch is found in the fabric. If not (and fabric is just back to back connected CAs) there is no routing and this flag is not used. In any case initial value does not matter. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 10 +--------- opensm/opensm/osm_subnet.c | 14 +------------- 2 files changed, 2 insertions(+), 22 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 92f2bc0..84ed6d4 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -239,7 +239,6 @@ typedef struct _osm_subn_opt uint8_t max_op_vls; uint8_t force_link_speed; boolean_t reassign_lids; - boolean_t reassign_lfts; boolean_t ignore_other_sm; boolean_t single_thread; boolean_t no_multicast_option; @@ -345,12 +344,6 @@ typedef struct _osm_subn_opt * Otherwise (the default), * OpenSM always tries to preserve as LIDs as much as possible. * -* reassign_lfts -* If TRUE ignore existing LFT entries on first sweep (default). -* Otherwise only non minimal hop cases are modified. -* NOTE: A standby SM clears its first sweep flag - since the -* master SM already sweeps... -* * ignore_other_sm_option * This flag is TRUE if other SMs on the subnet should be ignored. * @@ -656,9 +649,8 @@ typedef struct _osm_subn * This flag is a dynamic flag to instruct the LFT assignment to * ignore existing legal LFT settings. 
* The value will be set according to : -* - During SM init set to the reassign_lfts flag value -* - Coming out of STANDBY it will be cleared (other SM worked) * - Any change to the list of switches will set it to high +* - Coming out of STANDBY it will be cleared (other SM worked) * - Set to FALSE upon end of all lft assignments. * * subnet_initalization_error diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 7e17945..7e7a4d5 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -221,8 +221,7 @@ osm_subn_init( /* note that insert and remove are part of the port_profile thing */ cl_map_init(&(p_subn->opt.port_prof_ignore_guids), 10); - /* ignore_existing_lfts follows reassign_lfts on first sweep */ - p_subn->ignore_existing_lfts = p_subn->opt.reassign_lfts; + p_subn->ignore_existing_lfts = TRUE; /* we assume master by default - so we only need to set it true if STANDBY */ p_subn->coming_out_of_standby = FALSE; @@ -451,7 +450,6 @@ osm_subn_set_default_opt( p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS; p_opt->force_link_speed = 15; p_opt->reassign_lids = FALSE; - p_opt->reassign_lfts = TRUE; p_opt->ignore_other_sm = FALSE; p_opt->single_thread = FALSE; p_opt->no_multicast_option = FALSE; @@ -1221,10 +1219,6 @@ osm_subn_parse_conf_file( p_key, p_val, &p_opts->reassign_lids); __osm_subn_opts_unpack_boolean( - "reassign_lfts", - p_key, p_val, &p_opts->reassign_lfts); - - __osm_subn_opts_unpack_boolean( "ignore_other_sm", p_key, p_val, &p_opts->ignore_other_sm); @@ -1544,11 +1538,6 @@ osm_subn_write_conf_file( "sweep_interval %u\n\n" "# If TRUE cause all lids to be reassigned\n" "reassign_lids %s\n\n" - "# If TRUE ignore existing LFT entries on first sweep (default).\n" - "# Otherwise only non minimal hop cases are modified.\n" - "# NOTE: A standby SM clears its first sweep flag - since the\n" - "# master SM already sweeps...\n" - "reassign_lfts %s\n\n" "# If TRUE forces every sweep to be a heavy sweep\n" "force_heavy_sweep 
%s\n\n" "# If TRUE every trap will cause a heavy sweep.\n" @@ -1556,7 +1545,6 @@ osm_subn_write_conf_file( "sweep_on_trap %s\n\n", p_opts->sweep_interval, p_opts->reassign_lids ? "TRUE" : "FALSE", - p_opts->reassign_lfts ? "TRUE" : "FALSE", p_opts->force_heavy_sweep ? "TRUE" : "FALSE", p_opts->sweep_on_trap ? "TRUE" : "FALSE" ); -- 1.5.3.rc2.38.g11308 From sashak at voltaire.com Thu Jul 26 16:31:35 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:31:35 +0300 Subject: [ofa-general] [PATCH] opensm: don't fetch LFTs initially Message-ID: <20070726233135.GK2472@sashak.voltaire.com> Do not fetch initial switch LFTs for discovered switches. OpenSM doesn't use it anyway, but it creates additional subnet traffic. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sw_info_rcv.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index 726cc06..8eb8cd5 100644 --- a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -134,6 +134,7 @@ __osm_si_rcv_get_port_info( OSM_LOG_EXIT( p_rcv->p_log ); } +#if 0 /********************************************************************** The plock must be held before calling this function. **********************************************************************/ @@ -198,7 +199,6 @@ __osm_si_rcv_get_fwd_tbl( OSM_LOG_EXIT( p_rcv->p_log ); } -#if 0 /********************************************************************** The plock must be held before calling this function. **********************************************************************/ @@ -399,10 +399,9 @@ __osm_si_rcv_process_new( Get the PortInfo attribute for every port. */ __osm_si_rcv_get_port_info( p_rcv, p_sw, p_madw ); - __osm_si_rcv_get_fwd_tbl( p_rcv, p_sw ); /* - Don't bother retrieving the current multicast tables + Don't bother retrieving the current unicast and multicast tables from the switches. 
The current version of SM does not support silent take-over of an existing multicast configuration. @@ -413,6 +412,7 @@ __osm_si_rcv_process_new( The code to retrieve the tables was fully debugged. */ #if 0 + __osm_si_rcv_get_fwd_tbl( p_rcv, p_sw ); if( !p_rcv->p_subn->opt.disable_multicast ) __osm_si_rcv_get_mcast_fwd_tbl( p_rcv, p_sw ); #endif -- 1.5.3.rc2.38.g11308 From sashak at voltaire.com Thu Jul 26 16:32:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:32:47 +0300 Subject: [ofa-general] [PATCH] opensm: remove static __some_hop_count_set var Message-ID: <20070726233247.GL2472@sashak.voltaire.com> This removes static variable __some_hop_count_set from osm_ucast_mgr and instead uses flag stored as structure memmber. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_mgr.h | 6 ++++++ opensm/opensm/osm_ucast_mgr.c | 16 ++++------------ 2 files changed, 10 insertions(+), 12 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 3381616..4824971 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -105,6 +105,7 @@ typedef struct _osm_ucast_mgr osm_log_t *p_log; cl_plock_t *p_lock; boolean_t any_change; + boolean_t some_hop_count_set; uint8_t *lft_buf; } osm_ucast_mgr_t; /* @@ -126,6 +127,11 @@ typedef struct _osm_ucast_mgr * set to TRUE by osm_ucast_mgr_set_fwd_table() if any mad * was sent. * +* some_hop_count_set +* Initialized to FALSE at the beginning of each the min hop +* tables calculation iteration cycle, set to TRUE to indicate +* that some hop count changes were done. +* * lft_buf * LFT buffer - used during LFT calculation/setup. 
* diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index a8fc649..f049e74 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -66,14 +66,6 @@ /********************************************************************** **********************************************************************/ -/* - * This flag is used for stopping the relaxation algorithm if no - * change detected during the fabric scan - */ -static boolean_t __some_hop_count_set; - -/********************************************************************** - **********************************************************************/ void osm_ucast_mgr_construct( IN osm_ucast_mgr_t* const p_mgr ) @@ -531,7 +523,7 @@ __osm_ucast_mgr_process_neighbor( "cannot set hops for lid %u at switch 0x%" PRIx64 "\n", lid_ho, cl_ntoh64(osm_node_get_node_guid(p_this_sw->p_node))); - __some_hop_count_set = TRUE; + p_mgr->some_hop_count_set = TRUE; } } @@ -1020,10 +1012,10 @@ osm_ucast_mgr_build_lid_matrices( if non of the switches was set will exit the while loop */ - __some_hop_count_set = TRUE; - for( i = 0; (i < iteration_max) && __some_hop_count_set; i++ ) + p_mgr->some_hop_count_set = TRUE; + for( i = 0; (i < iteration_max) && p_mgr->some_hop_count_set; i++ ) { - __some_hop_count_set = FALSE; + p_mgr->some_hop_count_set = FALSE; cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_neighbors, p_mgr ); } -- 1.5.3.rc2.38.g11308 From sashak at voltaire.com Thu Jul 26 16:34:02 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 02:34:02 +0300 Subject: [ofa-general] [PATCH] opensm: dumpers improvements Message-ID: <20070726233402.GM2472@sashak.voltaire.com> As was discussed previously on the list this moves ucast and mcast dumper functions to separate file (osm_dump.c). Dump generators will be invoked after heavy sweep. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_opensm.h | 4 + opensm/opensm/Makefile.am | 2 +- opensm/opensm/osm_dump.c | 434 ++++++++++++++++++++++++++++++++++++ opensm/opensm/osm_mcast_mgr.c | 138 +----------- opensm/opensm/osm_state_mgr.c | 1 + opensm/opensm/osm_ucast_mgr.c | 336 +--------------------------- 6 files changed, 443 insertions(+), 472 deletions(-) create mode 100644 opensm/opensm/osm_dump.c diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index 2b09129..2d668e9 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -444,6 +444,10 @@ osm_opensm_wait_for_subnet_up( * SEE ALSO *********/ +/* dump helpers */ +void osm_dump_mcast_routes(osm_opensm_t *osm); +void osm_dump_all(osm_opensm_t *osm); + /****v* OpenSM/osm_exit_flag */ extern volatile unsigned int osm_exit_flag; diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index c94897c..46770b4 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -56,7 +56,7 @@ opensm_SOURCES = main.c osm_console.c osm_db_files.c \ osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ st.c osm_perfmgr.c osm_perfmgr_db.c \ - osm_event_plugin.c + osm_event_plugin.c osm_dump.c if OSMV_OPENIB opensm_CFLAGS = -Wall $(OSMV_CFLAGS) -fno-strict-aliasing -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 opensm_CXXFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 diff --git a/opensm/opensm/osm_dump.c b/opensm/opensm/osm_dump.c new file mode 100644 index 0000000..367d941 --- /dev/null +++ b/opensm/opensm/osm_dump.c @@ -0,0 +1,434 @@ +/* + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +/* + * Abstract: + * Various OpenSM dumpers + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct dump_context { + osm_opensm_t *p_osm; + FILE *file; +}; + +static void dump_ucast_path_distribution(cl_map_item_t * p_map_item, void *cxt) +{ + osm_node_t *p_node; + osm_node_t *p_remote_node; + uint8_t i; + uint8_t num_ports; + uint32_t num_paths; + ib_net64_t remote_guid_ho; + osm_switch_t *p_sw = (osm_switch_t *) p_map_item; + osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm; + + p_node = p_sw->p_node; + num_ports = p_sw->num_ports; + + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + "dump_ucast_path_distribution: " + "Switch 0x%" PRIx64 "\n" + "Port : Path Count Through Port", + cl_ntoh64(osm_node_get_node_guid(p_node))); + + for (i = 0; i < num_ports; i++) { + num_paths = osm_switch_path_count_get(p_sw, i); + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, "\n %03u : %u", i, + num_paths); + if (i == 0) { + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + " (switch management port)"); + continue; + } + + p_remote_node = osm_node_get_remote_node(p_node, i, NULL); + if (p_remote_node == NULL) + continue; + + remote_guid_ho = + cl_ntoh64(osm_node_get_node_guid(p_remote_node)); + + switch (osm_node_get_remote_type(p_node, i)) { + case IB_NODE_TYPE_SWITCH: + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + " (link to switch"); + break; + case IB_NODE_TYPE_ROUTER: + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + " (link to router"); + break; + case IB_NODE_TYPE_CA: + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + " (link to CA"); + break; + default: + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, + " (link to unknown node type"); + break; + } + + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")", + remote_guid_ho); + } + + osm_log_printf(&p_osm->log, OSM_LOG_DEBUG, "\n"); +} + +static void 
dump_ucast_routes(cl_map_item_t * p_map_item, void *cxt) +{ + const osm_node_t *p_node; + osm_port_t *p_port; + uint8_t port_num; + uint8_t num_hops; + uint8_t best_hops; + uint8_t best_port; + uint16_t max_lid_ho; + uint16_t lid_ho, base_lid; + boolean_t direct_route_exists = FALSE; + osm_switch_t *p_sw = (osm_switch_t *) p_map_item; + osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm; + FILE *file = ((struct dump_context *)cxt)->file; + + p_node = p_sw->p_node; + + max_lid_ho = p_sw->max_lid_ho; + + fprintf(file, "__osm_ucast_mgr_dump_ucast_routes: " + "Switch 0x%016" PRIx64 "\n" + "LID : Port : Hops : Optimal\n", + cl_ntoh64(osm_node_get_node_guid(p_node))); + for (lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++) { + fprintf(file, "0x%04X : ", lid_ho); + + p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid_ho); + if (!p_port) { + fprintf(file, "UNREACHABLE\n"); + continue; + } + + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); + if (port_num == OSM_NO_PATH) { + /* + This may occur if there are 'holes' in the existing + LID assignments. Running SM with --reassign_lids + will reassign and compress the LID range. The + subnet should work fine either way. + */ + fprintf(file, "UNREACHABLE\n"); + continue; + } + /* + Switches can lie about which port routes a given + lid due to a recent reconfiguration of the subnet. + Therefore, ensure that the hop count is better than + OSM_NO_PATH. + */ + if (p_port->p_node->sw) { + /* Target LID is switch. + Get its base lid and check hop count for this base LID only. */ + base_lid = osm_node_get_base_lid(p_port->p_node, 0); + base_lid = cl_ntoh16(base_lid); + num_hops = + osm_switch_get_hop_count(p_sw, base_lid, port_num); + } else { + /* Target LID is not switch (CA or router). + Check if we have route to this target from current switch. 
*/ + num_hops = + osm_switch_get_hop_count(p_sw, lid_ho, port_num); + if (num_hops != OSM_NO_PATH) { + direct_route_exists = TRUE; + base_lid = lid_ho; + } else { + osm_physp_t *p_physp = p_port->p_physp; + + if (!p_physp || !p_physp->p_remote_physp || + !p_physp->p_remote_physp->p_node->sw) + num_hops = OSM_NO_PATH; + else { + base_lid = + osm_node_get_base_lid(p_physp-> + p_remote_physp-> + p_node, 0); + base_lid = cl_ntoh16(base_lid); + num_hops = + p_physp->p_remote_physp->p_node->sw == p_sw ? 0 : + osm_switch_get_hop_count(p_sw, + base_lid, + port_num); + } + } + } + + if (num_hops == OSM_NO_PATH) { + fprintf(file, "UNREACHABLE\n"); + continue; + } + + best_hops = osm_switch_get_least_hops(p_sw, base_lid); + if (!p_port->p_node->sw && !direct_route_exists) { + best_hops++; + num_hops++; + } + + fprintf(file, "%03u : %02u : ", port_num, num_hops); + + if (best_hops == num_hops) + fprintf(file, "yes"); + else { + best_port = osm_switch_recommend_path(p_sw, p_port, lid_ho, TRUE, NULL, NULL, NULL, NULL); /* No LMC Optimization */ + fprintf(file, "No %u hop path possible via port %u!", + best_hops, best_port); + } + + fprintf(file, "\n"); + } +} + +static void dump_mcast_routes(cl_map_item_t * p_map_item, void *cxt) +{ + osm_switch_t *p_sw = (osm_switch_t *) p_map_item; + FILE *file = ((struct dump_context *)cxt)->file; + osm_mcast_tbl_t *p_tbl; + int16_t mlid_ho = 0; + int16_t mlid_start_ho; + uint8_t position = 0; + int16_t block_num = 0; + boolean_t first_mlid; + boolean_t first_port; + const osm_node_t *p_node; + uint16_t i, j; + uint16_t mask_entry; + char sw_hdr[256]; + char mlid_hdr[32]; + + p_node = p_sw->p_node; + + p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw); + + sprintf(sw_hdr, "\nSwitch 0x%016" PRIx64 "\n" + "LID : Out Port(s)\n", + cl_ntoh64(osm_node_get_node_guid(p_node))); + first_mlid = TRUE; + while (block_num <= p_tbl->max_block_in_use) { + mlid_start_ho = (uint16_t) (block_num * IB_MCAST_BLOCK_SIZE); + for (i = 0; i < IB_MCAST_BLOCK_SIZE; i++) { + 
mlid_ho = mlid_start_ho + i; + position = 0; + first_port = TRUE; + sprintf(mlid_hdr, "0x%04X :", + mlid_ho + IB_LID_MCAST_START_HO); + while (position <= p_tbl->max_position) { + mask_entry = + cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]); + if (mask_entry == 0) { + position++; + continue; + } + for (j = 0; j < 16; j++) { + if ((1 << j) & mask_entry) { + if (first_mlid) { + fprintf(file, "%s", sw_hdr); + first_mlid = FALSE; + } + if (first_port) { + fprintf(file, "%s", mlid_hdr); + first_port = FALSE; + } + fprintf(file, " 0x%03X ", + j + (position * 16)); + } + } + position++; + } + if (first_port == FALSE) + fprintf(file, "\n"); + } + block_num++; + } +} + +static void dump_lid_matrix(cl_map_item_t * p_map_item, void *cxt) +{ + osm_switch_t *p_sw = (osm_switch_t *) p_map_item; + osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm; + FILE *file = ((struct dump_context *)cxt)->file; + osm_node_t *p_node = p_sw->p_node; + unsigned max_lid = p_sw->max_lid_ho; + unsigned max_port = p_sw->num_ports; + uint16_t lid; + uint8_t port; + + fprintf(file, "Switch: guid 0x%016" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_node))); + for (lid = 1; lid <= max_lid; lid++) { + osm_port_t *p_port; + if (osm_switch_get_least_hops(p_sw, lid) == OSM_NO_PATH) + continue; + fprintf(file, "0x%04x:", lid); + for (port = 0; port < max_port; port++) + fprintf(file, " %02x", + osm_switch_get_hop_count(p_sw, lid, port)); + p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid); + if (p_port) + fprintf(file, " # portguid 0x%" PRIx64, + cl_ntoh64(osm_port_get_guid(p_port))); + fprintf(file, "\n"); + } +} + +static void dump_ucast_lfts(cl_map_item_t * p_map_item, void *cxt) +{ + osm_switch_t *p_sw = (osm_switch_t *) p_map_item; + osm_opensm_t *p_osm = ((struct dump_context *)cxt)->p_osm; + FILE *file = ((struct dump_context *)cxt)->file; + osm_node_t *p_node = p_sw->p_node; + unsigned max_lid = p_sw->max_lid_ho; + unsigned max_port = p_sw->num_ports; + uint16_t lid; + 
uint8_t port; + + fprintf(file, "Unicast lids [0x0-0x%x] of switch Lid %u guid 0x%016" + PRIx64 " (\'%s\'):\n", + max_lid, cl_ntoh16(osm_node_get_base_lid(p_node, 0)), + cl_ntoh64(osm_node_get_node_guid(p_node)), p_node->print_desc); + for (lid = 0; lid <= max_lid; lid++) { + osm_port_t *p_port; + port = osm_switch_get_port_by_lid(p_sw, lid); + + if (port >= max_port) + continue; + + fprintf(file, "0x%04x %03u # ", lid, port); + + p_port = cl_ptr_vector_get(&p_osm->subn.port_lid_tbl, lid); + if (p_port) { + p_node = p_port->p_node; + fprintf(file, "%s portguid 0x%016" PRIx64 ": \'%s\'", + ib_get_node_type_str(osm_node_get_type(p_node)), + cl_ntoh64(osm_port_get_guid(p_port)), + p_node->print_desc); + } else + fprintf(file, "unknown node and type"); + fprintf(file, "\n"); + } + fprintf(file, "%u lids dumped\n", max_lid); +} + +/********************************************************************** + **********************************************************************/ +static void dump_qmap(osm_opensm_t * p_osm, FILE * file, + cl_qmap_t * map, void (*func) (cl_map_item_t *, void *)) +{ + struct dump_context dump_context; + + dump_context.p_osm = p_osm; + dump_context.file = file; + + cl_qmap_apply_func(map, func, &dump_context); +} + +static void dump_qmap_to_file(osm_opensm_t * p_osm, const char *file_name, + cl_qmap_t * map, + void (*func) (cl_map_item_t *, void *)) +{ + char path[1024]; + FILE *file; + + snprintf(path, sizeof(path), "%s/%s", + p_osm->subn.opt.dump_files_dir, file_name); + + file = fopen(path, "w"); + if (!file) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "dump_qmap_to_file: " + "cannot create file \'%s\': %s\n", + path, strerror(errno)); + return; + } + + dump_qmap(p_osm, file, map, func); + + fclose(file); +} + +/********************************************************************** + **********************************************************************/ + +void osm_dump_mcast_routes(osm_opensm_t * osm) +{ + if (osm_log_is_active(&osm->log, 
OSM_LOG_ROUTING)) { + /* multicast routes */ + dump_qmap_to_file(osm, "opensm.mcfdbs", + &osm->subn.sw_guid_tbl, dump_mcast_routes); + } +} + +void osm_dump_all(osm_opensm_t * osm) +{ + if (osm_log_is_active(&osm->log, OSM_LOG_ROUTING)) { + /* unicast routes */ + dump_qmap_to_file(osm, "opensm-lid-matrix.dump", + &osm->subn.sw_guid_tbl, dump_lid_matrix); + dump_qmap_to_file(osm, "opensm-lfts.dump", + &osm->subn.sw_guid_tbl, dump_ucast_lfts); + if (osm_log_is_active(&osm->log, OSM_LOG_DEBUG)) + dump_qmap(osm, NULL, &osm->subn.sw_guid_tbl, + dump_ucast_path_distribution); + dump_qmap_to_file(osm, "opensm.fdbs", + &osm->subn.sw_guid_tbl, dump_ucast_routes); + /* multicast routes */ + dump_qmap_to_file(osm, "opensm.mcfdbs", + &osm->subn.sw_guid_tbl, dump_mcast_routes); + } +} diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index f4b64a6..5f64b19 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -48,12 +48,11 @@ # include #endif /* HAVE_CONFIG_H */ -#include #include #include -#include #include #include +#include #include #include #include @@ -61,8 +60,6 @@ #include #include -#define LINE_LENGTH 256 - /********************************************************************** **********************************************************************/ typedef struct _osm_mcast_work_obj @@ -1336,135 +1333,6 @@ osm_mcast_mgr_process_tree( } /********************************************************************** - **********************************************************************/ -static void -mcast_mgr_dump_sw_routes( - IN const osm_mcast_mgr_t* const p_mgr, - IN const osm_switch_t* const p_sw, - IN FILE *file ) -{ - osm_mcast_tbl_t* p_tbl; - int16_t mlid_ho = 0; - int16_t mlid_start_ho; - uint8_t position = 0; - int16_t block_num = 0; - boolean_t first_mlid; - boolean_t first_port; - const osm_node_t* p_node; - uint16_t i, j; - uint16_t mask_entry; - char sw_hdr[256]; - char mlid_hdr[32]; - - OSM_LOG_ENTER( 
p_mgr->p_log, mcast_mgr_dump_sw_routes ); - - if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) - goto Exit; - - p_node = p_sw->p_node; - - p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw ); - - sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n" - "LID : Out Port(s)\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - first_mlid = TRUE; - while ( block_num <= p_tbl->max_block_in_use ) - { - mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE); - for (i = 0 ; i < IB_MCAST_BLOCK_SIZE ; i++) - { - mlid_ho = mlid_start_ho + i; - position = 0; - first_port = TRUE; - sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); - while ( position <= p_tbl->max_position ) - { - mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]); - if (mask_entry == 0) - { - position++; - continue; - } - for (j = 0 ; j < 16 ; j++) - { - if ( (1 << j) & mask_entry ) - { - if (first_mlid) - { - fprintf( file,"%s", sw_hdr ); - first_mlid = FALSE; - } - if (first_port) - { - fprintf( file,"%s", mlid_hdr ); - first_port = FALSE; - } - fprintf( file, " 0x%03X ", j+(position*16) ); - } - } - position++; - } - if (first_port == FALSE) - { - fprintf( file, "\n" ); - } - } - block_num++; - } - - Exit: - OSM_LOG_EXIT( p_mgr->p_log ); -} - -/********************************************************************** - **********************************************************************/ -struct mcast_mgr_dump_context { - osm_mcast_mgr_t *p_mgr; - FILE *file; -}; - -static void -mcast_mgr_dump_table(cl_map_item_t *p_map_item, void *context) -{ - osm_switch_t *p_sw = (osm_switch_t *)p_map_item; - struct mcast_mgr_dump_context *cxt = context; - - mcast_mgr_dump_sw_routes(cxt->p_mgr, p_sw, cxt->file); -} - -static void -mcast_mgr_dump_mcast_routes(osm_mcast_mgr_t *p_mgr) -{ - char file_name[1024]; - struct mcast_mgr_dump_context dump_context; - FILE *file; - - if (!osm_log_is_active(p_mgr->p_log, OSM_LOG_ROUTING)) - return; - - snprintf(file_name, sizeof(file_name), "%s/%s", - 
p_mgr->p_subn->opt.dump_files_dir, "opensm.mcfdbs"); - - file = fopen(file_name, "w"); - if (!file) { - osm_log(p_mgr->p_log, OSM_LOG_ERROR, - "mcast_dump_mcast_routes: ERR 0A18: " - "cannot create mcfdb file \'%s\': %s\n", - file_name, strerror(errno)); - return; - } - - dump_context.p_mgr = p_mgr; - dump_context.file = file; - - cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, - mcast_mgr_dump_table, &dump_context); - - fclose(file); -} - -/********************************************************************** Process the entire group. NOTE : The lock should be held externally! @@ -1510,7 +1378,7 @@ osm_mcast_mgr_process_mgrp( p_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item ); } - mcast_mgr_dump_mcast_routes( p_mgr ); + osm_dump_mcast_routes( p_mgr->p_subn->p_osm ); Exit: OSM_LOG_EXIT( p_mgr->p_log ); @@ -1580,8 +1448,6 @@ osm_mcast_mgr_process( p_sw = (osm_switch_t*)cl_qmap_next( &p_sw->map_item ); } - mcast_mgr_dump_mcast_routes( p_mgr ); - CL_PLOCK_RELEASE( p_mgr->p_lock ); OSM_LOG_EXIT( p_mgr->p_log ); diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index a15f3b4..a6d0e24 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -2749,6 +2749,7 @@ Idle: p_mgr->p_subn->need_update = 0; __osm_topology_file_create( p_mgr ); + osm_dump_all(p_mgr->p_subn->p_osm); __osm_state_mgr_report( p_mgr ); __osm_state_mgr_up_msg( p_mgr ); diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index f049e74..cfe1a58 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -48,7 +48,7 @@ # include #endif /* HAVE_CONFIG_H */ -#include +#include #include #include #include @@ -62,8 +62,6 @@ #include #include -#define LINE_LENGTH 256 - /********************************************************************** **********************************************************************/ void @@ -123,329 +121,6 @@ osm_ucast_mgr_init( } 
/********************************************************************** - **********************************************************************/ -struct ucast_mgr_dump_context { - osm_ucast_mgr_t *p_mgr; - FILE *file; -}; - -static void -ucast_mgr_dump(osm_ucast_mgr_t *p_mgr, FILE *file, - void (*func)(cl_map_item_t *, void *)) -{ - struct ucast_mgr_dump_context dump_context; - - dump_context.p_mgr = p_mgr; - dump_context.file = file; - - cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, func, &dump_context); -} - -void -ucast_mgr_dump_to_file(osm_ucast_mgr_t *p_mgr, const char *file_name, - void (*func)(cl_map_item_t *, void *)) -{ - char path[1024]; - FILE *file; - - snprintf(path, sizeof(path), "%s/%s", - p_mgr->p_subn->opt.dump_files_dir, file_name); - - file = fopen(path, "w"); - if (!file) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "ucast_mgr_dump_to_file: ERR 3A12: " - "Failed to open fdb file (%s)\n", path ); - return; - } - - ucast_mgr_dump(p_mgr, file, func); - - fclose(file); -} - -/********************************************************************** - **********************************************************************/ -static void -__osm_ucast_mgr_dump_path_distribution( - IN cl_map_item_t *p_map_item, - IN void *cxt) -{ - osm_node_t *p_node; - osm_node_t *p_remote_node; - uint8_t i; - uint8_t num_ports; - uint32_t num_paths; - ib_net64_t remote_guid_ho; - osm_switch_t* p_sw = (osm_switch_t *)p_map_item; - osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - - OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_path_distribution ); - - p_node = p_sw->p_node; - num_ports = p_sw->num_ports; - - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, - "__osm_ucast_mgr_dump_path_distribution: " - "Switch 0x%" PRIx64 "\n" - "Port : Path Count Through Port", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - - for( i = 0; i < num_ports; i++ ) - { - num_paths = osm_switch_path_count_get( p_sw , i ); - osm_log_printf( p_mgr->p_log, 
OSM_LOG_DEBUG,"\n %03u : %u", i, num_paths ); - if( i == 0 ) - { - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (switch management port)" ); - continue; - } - - p_remote_node = osm_node_get_remote_node( p_node, i, NULL ); - if( p_remote_node == NULL ) - continue; - - remote_guid_ho = cl_ntoh64( osm_node_get_node_guid( p_remote_node ) ); - - switch( osm_node_get_remote_type( p_node, i ) ) - { - case IB_NODE_TYPE_SWITCH: - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to switch" ); - break; - case IB_NODE_TYPE_ROUTER: - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to router" ); - break; - case IB_NODE_TYPE_CA: - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to CA" ); - break; - default: - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to unknown node type" ); - break; - } - - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")", - remote_guid_ho ); - } - - osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, "\n" ); - - OSM_LOG_EXIT( p_mgr->p_log ); -} - -/********************************************************************** - **********************************************************************/ -static void -__osm_ucast_mgr_dump_ucast_routes( - IN cl_map_item_t *p_map_item, - IN void *cxt ) -{ - const osm_node_t* p_node; - osm_port_t * p_port; - uint8_t port_num; - uint8_t num_hops; - uint8_t best_hops; - uint8_t best_port; - uint16_t max_lid_ho; - uint16_t lid_ho, base_lid; - boolean_t direct_route_exists = FALSE; - osm_switch_t* p_sw = (osm_switch_t *)p_map_item; - osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; - - OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_ucast_routes ); - - p_node = p_sw->p_node; - - max_lid_ho = p_sw->max_lid_ho; - - fprintf( file, "__osm_ucast_mgr_dump_ucast_routes: " - "Switch 0x%016" PRIx64 "\n" - "LID : Port : Hops : Optimal\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - for( lid_ho = 1; lid_ho <= 
max_lid_ho; lid_ho++ ) - { - fprintf(file, "0x%04X : ", lid_ho); - - p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid_ho); - if (!p_port) - { - fprintf( file, "UNREACHABLE\n" ); - continue; - } - - port_num = osm_switch_get_port_by_lid( p_sw, lid_ho ); - if( port_num == OSM_NO_PATH ) - { - /* - This may occur if there are 'holes' in the existing - LID assignments. Running SM with --reassign_lids - will reassign and compress the LID range. The - subnet should work fine either way. - */ - fprintf( file, "UNREACHABLE\n" ); - continue; - } - /* - Switches can lie about which port routes a given - lid due to a recent reconfiguration of the subnet. - Therefore, ensure that the hop count is better than - OSM_NO_PATH. - */ - if( p_port->p_node->sw ) - { - /* Target LID is switch. - Get its base lid and check hop count for this base LID only. */ - base_lid = osm_node_get_base_lid(p_port->p_node, 0); - base_lid = cl_ntoh16(base_lid); - num_hops = osm_switch_get_hop_count( p_sw, base_lid, port_num ); - } - else - { - /* Target LID is not switch (CA or router). - Check if we have route to this target from current switch. */ - num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); - if (num_hops != OSM_NO_PATH) - { - direct_route_exists = TRUE; - base_lid = lid_ho; - } - else - { - osm_physp_t *p_physp = p_port->p_physp; - - if( !p_physp || !p_physp->p_remote_physp || - !p_physp->p_remote_physp->p_node->sw ) - num_hops = OSM_NO_PATH; - else - { - base_lid = osm_node_get_base_lid(p_physp->p_remote_physp->p_node, 0); - base_lid = cl_ntoh16(base_lid); - num_hops = p_physp->p_remote_physp->p_node->sw == p_sw ? 
- 0 : osm_switch_get_hop_count( p_sw, base_lid, port_num ); - } - } - } - - if( num_hops == OSM_NO_PATH ) - { - fprintf( file, "UNREACHABLE\n" ); - continue; - } - - best_hops = osm_switch_get_least_hops( p_sw, base_lid ); - if (!p_port->p_node->sw && !direct_route_exists) - { - best_hops++; - num_hops++; - } - - fprintf( file, "%03u : %02u : ", port_num, num_hops ); - - if( best_hops == num_hops ) - fprintf( file, "yes" ); - else - { - best_port = osm_switch_recommend_path( - p_sw, p_port, lid_ho, TRUE, - NULL, NULL, NULL, NULL ); /* No LMC Optimization */ - fprintf( file, "No %u hop path possible via port %u!", - best_hops, best_port ); - } - - fprintf( file, "\n" ); - } - - OSM_LOG_EXIT( p_mgr->p_log ); -} - -/********************************************************************** - **********************************************************************/ -static void -ucast_mgr_dump_lid_matrix(cl_map_item_t *p_map_item, void *cxt) -{ - osm_switch_t* p_sw = (osm_switch_t *)p_map_item; - osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; - osm_node_t *p_node = p_sw->p_node; - unsigned max_lid = p_sw->max_lid_ho; - unsigned max_port = p_sw->num_ports; - uint16_t lid; - uint8_t port; - - fprintf(file, "Switch: guid 0x%016" PRIx64 "\n", - cl_ntoh64(osm_node_get_node_guid(p_node))); - for (lid = 1; lid <= max_lid; lid++) { - osm_port_t *p_port; - if (osm_switch_get_least_hops(p_sw, lid) == OSM_NO_PATH) - continue; - fprintf(file, "0x%04x:", lid); - for (port = 0 ; port < max_port ; port++) - fprintf(file, " %02x", - osm_switch_get_hop_count(p_sw, lid, port)); - p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid); - if (p_port) - fprintf(file, " # portguid 0x%" PRIx64, - cl_ntoh64(osm_port_get_guid(p_port))); - fprintf(file, "\n"); - } -} - -/********************************************************************** - 
**********************************************************************/ -void -ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt) -{ - osm_switch_t* p_sw = (osm_switch_t *)p_map_item; - osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; - osm_node_t *p_node = p_sw->p_node; - unsigned max_lid = p_sw->max_lid_ho; - unsigned max_port = p_sw->num_ports; - uint16_t lid; - uint8_t port; - - fprintf(file, "Unicast lids [0x0-0x%x] of switch Lid %u guid 0x%016" - PRIx64 " (\'%s\'):\n", - max_lid, osm_node_get_base_lid(p_node, 0), - cl_ntoh64(osm_node_get_node_guid(p_node)), - p_node->print_desc); - for (lid = 0; lid <= max_lid; lid++) { - osm_port_t *p_port; - port = osm_switch_get_port_by_lid(p_sw, lid); - - if (port >= max_port) - continue; - - fprintf(file, "0x%04x %03u # ", lid, port); - - p_port = cl_ptr_vector_get(&p_mgr->p_subn->port_lid_tbl, lid); - if (p_port) { - p_node = p_port->p_node; - fprintf(file, "%s portguid 0x016%" PRIx64 ": \'%s\'", - ib_get_node_type_str(osm_node_get_type(p_node)), - cl_ntoh64(osm_port_get_guid(p_port)), - p_node->print_desc); - } - else - fprintf(file, "unknown node and type"); - fprintf(file, "\n"); - } - fprintf(file, "%u lids dumped\n", max_lid); -} - -/********************************************************************** - **********************************************************************/ -static void __osm_ucast_mgr_dump_tables(osm_ucast_mgr_t *p_mgr) -{ - ucast_mgr_dump_to_file(p_mgr, "opensm-lid-matrix.dump", - ucast_mgr_dump_lid_matrix); - ucast_mgr_dump_to_file(p_mgr, "opensm-lfts.dump", ucast_mgr_dump_lfts); - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) - ucast_mgr_dump(p_mgr, NULL, __osm_ucast_mgr_dump_path_distribution); - ucast_mgr_dump_to_file(p_mgr, "opensm.fdbs", __osm_ucast_mgr_dump_ucast_routes); -} - -/********************************************************************** Add each switch's own and neighbor LIDs 
to its LID matrix **********************************************************************/ static void @@ -1172,15 +847,6 @@ osm_ucast_mgr_process( else cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr ); - /* dump fdb into file: */ - if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) - { - if ( !default_routing && p_routing_eng->ucast_dump_tables != 0 ) - p_routing_eng->ucast_dump_tables(p_routing_eng->context); - else - __osm_ucast_mgr_dump_tables( p_mgr ); - } - if (p_mgr->any_change) { signal = OSM_SIGNAL_DONE_PENDING; -- 1.5.3.rc2.38.g11308 From qokwn at ida.net Thu Jul 26 17:54:19 2007 From: qokwn at ida.net (Amelia) Date: Thu, 26 Jul 2007 19:54:19 -0500 Subject: [ofa-general] Request Message-ID: <46A9423B.2010203@ida.net> -------------- next part -------------- A non-text attachment was scrubbed... Name: Request.pdf Type: application/pdf Size: 10624 bytes Desc: not available URL: From sashak at voltaire.com Thu Jul 26 18:07:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 04:07:07 +0300 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> Message-ID: <20070727010707.GR2472@sashak.voltaire.com> On 09:25 Thu 26 Jul , Eitan Zahavi wrote: > > Hi Eitan, Hal, > > > > On 20:44 Wed 25 Jul , Eitan Zahavi wrote: > > > > > > I am not following you. > > > Why do a user need to run -y if a simple legal cable connector is > > > plugged? 
> >
> > Because duplicated GUIDs detector can abort OpenSM when
> > regular port is reconnected to another location during hard
> > sweep. This issue is not related to loopback plug at all.
> I think we should handle the case of "migrated port" in a more global
> sense:
> If a port "moved" during the sweep we have to do a new sweep anyway.

Another option is just to use the recently discovered port location. In
case of a CA it could work; switch migration can be more complicated.

> Maybe we could delay the 'abort' to the second sweep.
>
> So practically I propose:
> 1. Add state flag "was duplicated" on the port saying it was reported as
> duplicate GUID.
> 2. Set the variable controlling a forced second sweep (similar to the
> one used if we got Set error)

We can even catch this before drop_manager and just rediscover.

> 3. Repeat the sweep - if we find a port where it is a duplicate and the
> "was duplicated" flag is set - abort.
>
> A refinement for the user who is doing many changes continuously might
> be to keep a counter.
> And have the abort happen after the Nth iteration.

It is a better approach than what we have today.

> >
> > > The issue is only if a "loop back" plug connecting a port to itself is
> > > plugged.
> >
> > No, not only. Now there are two completely separate known
> > issues with duplicated GUIDs detector:
> >
> > 1. Port moving
> > 2. Loopback plug
> >
> > And I think that _both_ should be solved. And if just using
> > '-y' could be suitable for (2) because it is esoteric
> > (although perfectly legal) use, it is not an acceptable solution for (1).
> >
> > I think we need to improve GUIDs duplication detector
> > instead. For example we could add NodeInfo comparison there,
> > and only in case if it is different drop GUIDs duplication
> > error. Also I think this should not be fatal error and should
> > not abort OpenSM, just logging (probably via syslog too)
> > should be sufficient - non-working port is good reason to
> > look at logs.
Another ideas? > The problem is that the SM will sort of figure out the network but will > create a completely bogus routing etc. Right. But it is not so with back-to-back (when loopback plug could be interpreted as back-to-back duplicated GUID). So no need to abort in this (back-to-back/loopback) case. Agreed? Sasha > > > > > Sasha > > > > > Do users use these plugs? For what sake? > > > > > > > > > Eitan Zahavi > > > Senior Engineering Director, Software Architect Mellanox > > Technologies > > > LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > > Sent: Wednesday, July 25, 2007 3:19 AM > > > > To: Eitan Zahavi > > > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > > > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > On 23:25 Tue 24 Jul , Eitan Zahavi wrote: > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > > > Maybe avoid the log if -y is provided? > > > > > > > > > > > > > > > That avoids the spew but the duplicated GUID is > > > > important to know so > > > > > IMO something in the "middle" is needed where > > duplicated GUIDs are > > > > > logged but not continually the same ones. > > > > > [EZ] > > > > > OK so in -y mode only we track which ones were reported > > > > and do not > > > > > repeat the log? > > > > > > > > And how port moving problem should be solved? > > > > > > > > We cannot ask an user to run OpenSM with '-y' if in > > her/his plans to > > > > reconnect some ports in a future and just decrease logging. 
> > > > > > > > Sasha > > > > > > From mshefty at ichips.intel.com Thu Jul 26 18:11:51 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 26 Jul 2007 18:11:51 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> Message-ID: <46A94657.1020101@ichips.intel.com> > 2. Architecture ---------------- This is a higher level approach to the problem, but I came up with the following QoS relationship hierarchy, where '->' means 'maps to'. Application Service -> Service ID (or range) Service ID -> desired QoS QoS, SGID, DGID, PKey -> SGID, DGID, TClass, FlowLabel, PKey SGID, DGID, TC, FL, PKey -> SLID, DLID, SL (set if crossing subnets) SLID, DLID, SL -> MTU, Rate, VL, PacketLifeTime I use these relationships below: > 4. IPoIB --------- > > IPoIB already query the SA for its broadcast group information. The > additional functionality required is for IPoIB to provide the > broadcast group SL, MTU, and RATE in every following PathRecord query > performed when a new UDAV is needed by IPoIB. We could assign a > special Service-ID for IPoIB use but since all communication on the > same IPoIB interface shares the same QoS-Level without the ability to > differentiate it by target service we can ignore it for simplicity. Rather than IPoIB specifying SL, MTU, and rate with PR queries, it should specify TClass and FlowLabel. This is necessary for IPoIB to span IB subnets. > 5. CMA features ---------------- > > The CMA interface supports Service-ID through the notion of port > space as a prefixes to the port_num which is part of the sockaddr > provided to rdma_resolve_add(). What is missing is the explicit > request for a QoS-Class that should allow the ULP (like SDP) to > propagate a specific request for a class of service. A mechanism for > providing the QoS-Class is available in the IPv6 address, so we could > use that address field. 
Another option is to implement a special > connection options API for CMA. > > Missing functionality by CMA is the usage of the provided QoS-Class > and Service-ID in the sent PR/MPR. When a response is obtained it is > an existing requirement for the CMA to use the PR/MPR from the > response in setting up the QP address vector. I think the RDMA CM needs two solutions, depending on which address family is used. For IPv6, the existing interface is sufficient, and works for both IB and iWarp. The RDMA CM only needs to include the TC and FL as part of its PR query. For IPv4, to remain transport neutral, I think we should add an rdma_set_option() routine to specify the QoS field. The RDMA CM would include the QoS field for PR query under this condition. For IB, this requires changes to the ib_sa to support the new PR extensions. I don't think we gain anything having the RDMA CM include service IDs as part of the query. > 6. SDP ------- > > SDP uses CMA for building its connections. The Service-ID for SDP is > 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote > TCP/IP Port Number to connect to. SDP might be provided with > SO_PRIORITY socket option. In that case the value provided should be > sent to the CMA as the TClass option of that connection. SDP would use specify the QoS through the IPv6 address or rdma_set_option() routine. > 7. SRP ------- > > Current SRP implementation uses its own CM callbacks (not CMA). So > SRP should fill in the Service-ID in the PR/MPR by itself and use > that information in setting up the QP. The T10 SRP standard defines > the SRP Service-ID to be defined by the SRP target I/O Controller > (but they should also comply with IBTA Service- ID rules). Anyway, > the Service-ID is reported by the I/O Controller in the > ServiceEntries DMA attribute and should be used in the PR/MPR if the > SA reports its ability to handle QoS PR/MPRs. I agree. > 8. iSER -------- iSER uses CMA and thus should be very close to SDP. 
> The Service-ID for iSER should be TBD. See RDMA CM and SDP. > 3.2. PR/MPR query handling: OpenSM should be able to enforce the > provided policy on client request. The overall flow for such requests > is: first the request is matched against the defined match rules such > that the target QoS-Level definition is found. Given the QoS-Level a > path(s) search is performed with the given restrictions imposed by > that level. The following two sections describe these steps. If we use the QoS hierarchy outlined above, I think we can construct some fairly simple tables to guide our PR selection. The SA may need to construct the tables starting at the bottom and working up, but I *think* it could be done. And by distributing the tables, we can support a more distributed (a la local SA) operation. From an administration point, I would be happier seeing something where the administrator defines a QoS level in terms of latency or bandwidth requirements and relative priority. Then, if desired, the administrator could provide more details, such as indicating which nodes would use which services, minimum required MTUs, etc. It would then be up to the SA to map these requirements to specific TC, FL, SL, VL values. In general, though, I'm personally far less concerned with the QoS specification interface to the SA, versus the operation that takes place on the hosts. Comments on using this approach on the host side? - Sean From sashak at voltaire.com Thu Jul 26 19:59:52 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 27 Jul 2007 05:59:52 +0300 Subject: [ofa-general] Lost in-service traps during Open SM migration In-Reply-To: References: <20070725220204.GI31582@sashak.voltaire.com> Message-ID: <20070727025952.GE6691@sashak.voltaire.com> On 12:37 Thu 26 Jul , lbt wrote: > Thanks for the suggestion Sasha! > > Our host stack does receive a rereregistration notice and does resubscribe > all handlers at > that point in time. 
At the time of the SM migration, our stack prints out
> some informational messages to
> confirm this:
> Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
> Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8
>
> And also confirmed in the SM logs that after the migration, the higher
> priority SM is getting a subscription request for in-service trap:
> Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: Subscribe Request with QPN: 0x000001
> Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
> Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
> Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
> gid.....................0x0000000000000000 : 0x0000000000000000
> lid_range_begin.........0xFFFF
> lid_range_end...........0x0
> is_generic..............0x1
> subscribe...............0x0
> trap_type...............0x3
> trap_num................64
> qpn.....................0x000001
> resp_time_val...........0x0
> node_type...............0x000004
> Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]
>
> It may be a problem if the resubscription of the in-service handler occurs
> after the in-service notice was forwarded, but I think the problem is that
> there is never a notice that is forwarded for the higher priority SM port
> that is restored.

And after OpenSM migration, did you receive in-service notices for other
ports? Does the problem happen only at migration time?

> Perhaps, neither SM (the lower priority and higher
> priority one), generates an in-service trap because of the timing gap
> between when the restored port is detected and "marked" (i.e. added to
> new_ports_list) and when in-service traps are generated for new ports.
> During SM migration, the lower priority SM detects the new port, but the
> higher priority SM does the trap generation (but it doesn't realize that
> its own port is a new port and thus doesn't generate a trap for it).
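
The ordering hypothesized in the quoted paragraph above can be sketched as a toy model. This is not OpenSM code: `toy_sm`, `toy_discover`, `toy_handover`, `toy_report_new_ports` and `toy_scenario` are hypothetical names, and the model assumes only what the report states, namely that the current master records new ports, that traps are generated from that list after LID assignment, and that the list does not travel with the handover.

```c
#include <stdbool.h>
#include <stdio.h>

#define TOY_MAX_PORTS 8

/* Toy stand-in for one subnet manager instance; not an OpenSM type. */
struct toy_sm {
    const char *name;
    bool is_master;
    unsigned long long new_ports[TOY_MAX_PORTS]; /* models new_ports_list */
    int n_new;
};

/* Discovery: only the current master records newly seen ports. */
void toy_discover(struct toy_sm *sm, unsigned long long guid)
{
    if (sm->is_master && sm->n_new < TOY_MAX_PORTS)
        sm->new_ports[sm->n_new++] = guid;
}

/* Handover: mastership moves, the pending list stays where it was. */
void toy_handover(struct toy_sm *from, struct toy_sm *to)
{
    from->is_master = false;
    to->is_master = true;
}

/* Runs after LID assignment, on the master only. Returns traps generated. */
int toy_report_new_ports(struct toy_sm *sm)
{
    int sent = 0;

    if (!sm->is_master)
        return 0;
    for (int i = 0; i < sm->n_new; i++) {
        printf("%s: in-service trap for 0x%016llx\n",
               sm->name, sm->new_ports[i]);
        sent++;
    }
    sm->n_new = 0;
    return sent;
}

/* Returns how many in-service traps the restored port produced overall. */
int toy_scenario(bool handover_before_report)
{
    struct toy_sm low = { "low-prio SM", true, {0}, 0 };
    struct toy_sm high = { "high-prio SM", false, {0}, 0 };
    unsigned long long restored = 0x00504501483e0000ULL; /* illustrative GUID */

    toy_discover(&low, restored);  /* current master records the port */
    toy_discover(&high, restored); /* ignored: not master yet */

    if (handover_before_report)
        toy_handover(&low, &high);

    return toy_report_new_ports(&low) + toy_report_new_ports(&high);
}
```

Calling `toy_scenario(true)` models the reported sequence (handover before LID assignment) and `toy_scenario(false)` models a sweep that completes on the original master. Whether the real code paths behave this way is exactly the open question in this thread.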
> > Our host stack executes some functions when a port is restored (in our > in-service subscription handler). > Am I not supposed to receive an in-service trap for a restored port that > happens to be the Master SM, Yes, I guess you are. > and instead execute these actions with a > client reregistration event? Client reregistration request is not suitable here - SM can ask for client reregistration at any time (in practice OpenSM now does it only when enters MASTER state, but it is also optional). Sasha > > Thanks again for your help! > Lan > > > > On 7/25/07, Sasha Khapyorsky wrote: > > > > Hi Lan, > > > > On 09:57 Wed 25 Jul , lbt wrote: > > > Hello, > > > > > > I have been seeing a problem where a subscriber for in-service traps is > > not > > > getting informed when the port of master openSM is restored (i.e. > > causing an > > > SM migration). > > > > > > I have an IB subnet with 2 nodes running OpenSM , different priorities > > of > > > course (OpenSM Rev:openib-2.0.5). I also have another node on the > > subnet > > > that has subscribed for the forwarding of any > > IB_SA_GENERIC_TRAP_NUM_IN_SVC > > > trap events. I've been doing cable pull tests on the IB ports, to check > > if > > > the in-service handler I have subscribed gets invoked when I restore > > the > > > cable. I've noticed that everything works as expected ( i.e. my > > in-service > > > handler is invoked) whenever I restore the cable on the lower priority > > SM IB > > > port without ever touching the master SM port. But if I cause an SM > > > migration, by restoring the port of the higher priority SM, the > > in-service > > > trap does not get generated as expected on a cable restore. > > > > > > Steps to Reproduce: > > > 1) Start with port to higher priority SM disconnected. 
> > > 2) restore port cable on the higher priority SM > > > --> This causes an SM Migration as expected, SM's migration happens > > okay > > > --> I expected the restoration of the higher priority SM to tit to also > > > trigger an in-service trap as well and notify subscribers, but it > > doesn't > > > occur > > > > > > I have collected debug messages log for both open SM's, and it appears > > that > > > the reason is because: > > > 1) in-service traps are generated based on what ports are added on the > > > Master SM's new_ports_list, but these traps are generated only after > > LID > > > assignment > > > 2) when the higher priority SM port is restored, the restored port gets > > > added to the lower priority SM's new_ports_list (since it's still the > > Master > > > SM at that point in time) > > > 3) the handover of Master SM from lower priority to higher priority > > SM > > > occurs (before LID assignment and thus a chance for traps get generated > > for > > > those ports on new_ports_list) > > > 4) the higher priority SM is now Master SM, but it has an empty > > > new_ports_list, so no trap generated either > > > > > > Does this look like a legitimate Open SM bug? Any feedback would be > > much > > > appreciated, and if I can help further in any way please let me know . > > > > As far as I know when OpenSM (even old like 2.0.5) becomes master it > > requests client to reregister SA related stuff (by setting this bit in > > PortInfo). > > > > Probably your port doesn't not support this (you could verify by seeing > > PortInfo:CapabilityMask - use 'smpquery portinfo ') or > > maybe your host stack doesn't do reregistration? > > > > Anyway you could track this in the OpenSM code in osm_lid_mgr.c > > __osm_lid_mgr_set_physp_pi() whenever client reregistration bit is set > > (with ib_port_info_set_client_rereg()) or not. Then we will know more > > about this problem. 
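
The CapabilityMask check suggested above can be sketched as follows. It assumes the mask is available as a host-byte-order 32-bit value and that IsClientReregistrationSupported is bit 25 of PortInfo:CapabilityMask, which is my reading of the IBA PortInfo definition and should be verified against the specification. `CAP_CLIENT_REREG_SUP`, `port_supports_client_rereg` and `cap_mask_from_hex` are hypothetical names, not OpenSM or libibumad API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed bit position: IsClientReregistrationSupported, host byte order. */
#define CAP_CLIENT_REREG_SUP (UINT32_C(1) << 25)

/* True if the port advertises client reregistration support. */
bool port_supports_client_rereg(uint32_t cap_mask)
{
    return (cap_mask & CAP_CLIENT_REREG_SUP) != 0;
}

/* Convert a hex string such as "0x02510868" into a mask value. */
uint32_t cap_mask_from_hex(const char *s)
{
    return (uint32_t)strtoul(s, NULL, 16);
}
```

If the bit is clear on the subscriber's port, the missing resubscription would be explained without any SM-side bug. If it is set, the trace through `__osm_lid_mgr_set_physp_pi()` is the next thing to look at.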
> > > > Sasha > > > > > > > > > > > Subset of logs from lower priority SM during the cable restore of > > higher > > > priority SM port: > > > ### Jul 18 14:31:56 614522 [41401960] -> > > __osm_trap_rcv_process_request: > > > Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A > > > TID:0x00000016000012e1 > > > ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: > > Received > > > signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE > > > ### 14:31:56 ******************** INITIATING HEAVY SWEEP > > > ********************** > > > ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: > > Received > > > signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > > > OSM_SM_STATE_SWEEP_HEAVY_SELF > > > Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding > > port > > > GUID:0x00504501483e0000 to new_ports_list > > > Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received > > signal > > > OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > > Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received > > signal > > > OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > > OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > > 14:31:56 ********************* HEAVY SWEEP COMPLETE > > *********************** > > > Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received > > > signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER### > > > 14:31:56 ******************** ENTERING SM STANDBY STATE > > ******************* > > > > > > Subset of logs from higher priority SM during the cable restore of > > higher > > > priority SM port: > > > > > > Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [ > > > Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received > > > signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state > > > IB_SMINFO_STATE_DISCOVERING > > > Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state > > > Jul 18 14:32:02 995888 [41401960] -> 
__osm_sm_state_mgr_master_msg: > > > ******************** ENTERING SM MASTER STATE ******************** > > > Jul 18 14:32:03 009014 [41401960] -> > > __osm_state_mgr_set_sm_lid_done_msg: > > > **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** > > > Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg > > > ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** > > > Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: > > [ > > > ----> no in-service traps are generated and notices forwarded because > > there > > > are no ports on this list > > > Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: > > ] > > > > > > > > > Thanks! > > > Lan > > > > > _______________________________________________ > > > general mailing list > > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > From kliteyn at mellanox.co.il Thu Jul 26 21:42:38 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 27 Jul 2007 07:42:38 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-27:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Tue_Jul_24_09:41:51_2007 [feadbcd8281f007f092bbd95a2e078cac5a8a0aa] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=560 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree 
gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: From tamura at osrg.net Thu Jul 26 21:45:04 2007 From: tamura at osrg.net (Yoshiaki Tamura) Date: Fri, 27 Jul 2007 13:45:04 +0900 Subject: [ofa-general] OFED-1.2 on x86 debian Message-ID: <46A97850.2030607@osrg.net> Hi. I'm trying to install OFED-1.2 on x86 (32bit) debian machine. Although build_env.sh seems to work on debian, it fails compiling both kernel modules and user land tools by rpmbuild. Is OFED-1.2 tested on debian or totally unsupported? Thanks, Yoshi Tamura From eitan at mellanox.co.il Thu Jul 26 21:56:54 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 27 Jul 2007 07:56:54 +0300 Subject: [ofa-general] RE: pkey.sim.tcl In-Reply-To: <20070726224133.GC2472@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901ED5C69@mtlexch01.mtl.com> <20070722174048.GO27878@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com> <20070726224133.GC2472@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com> > > On 09:26 Thu 26 Jul , Eitan Zahavi wrote: > > > > I am happy you actually use the simulator. > > Please provide more info regarding the failure. You should tar > > compress the /tmp/ibmgtsim.XXXX of your run. > > I can send this for you if you want, but the failure is trivial. 
No need if you already know where the bug is... > > Yes, and it is due to (6), where default Pkey is removed > "externally". I'm not sure that OpenSM should handle the case > when pkey table is modified externally by something which is not SM. > For a few years it just worked fine. So I wonder why this functionality was removed? It is a real BAD case where Pkeys are altered, but I think it would be wise to "refresh" these tables on heavy sweep. In general it seems OpenSM has lost its "heavy sweep" concept. Now it does not refresh the fabric setup even on heavy sweep. This assumes "perfect" HW and software, and I really think we should have preserved that capability. Note that a "heavy sweep" does not happen unless something changed or trapped. Eitan From philippe.gregoire at cea.fr Fri Jul 27 00:32:44 2007 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Fri, 27 Jul 2007 09:32:44 +0200 Subject: [ofa-general] QoS RFC In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> Message-ID: <46A99F9C.5040303@cea.fr> Hi Yevgeny Yevgeny Kliteynik wrote: > Hi All > > Please find the attached RFC describing how QoS policy support could > be implemented in the OpenFabrics stack. > Your comments are welcome. > > -- Yevgeny > > RFC: OpenFabrics Enhancements for QoS Support > =============================================== > > Authors: . Eitan Zahavi > Authors: . Yevgeny Kliteynik > Date: .... Jul 2007. > Revision: 0.2 > > Table of contents: > 1. Overview > 2. Architecture > 3. Supported Policy > 4. CMA functionality > 5. IPoIB functionality > 6. SDP functionality > 7. SRP functionality > 8. iSER functionality > 9. OpenSM functionality > > 1. Overview > ------------ > Quality of Service requirements stem from the realization of I/O > consolidation > over IB network: As multiple applications and ULPs share the same > fabric, means > to control their use of the network resources are becoming a must.
The > basic > need is to differentiate the service levels provided to different > traffic flows, > such that a policy could be enforced and control each flow utilization > of the > fabric resources. > > IBTA specification defined several hardware features and management > interfaces > to support QoS: > * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner > * Arbitration between traffic of different VLs is performed by a 2 > priority > levels weighted round robin arbiter. The arbiter is programmable with > a sequence of (VL, weight) pairs and maximal number of high priority > credits > to be processed before low priority is served > * Packets carry class of service marking in the range 0 to 15 in their > header SL field > * Each switch can map the incoming packet by its SL to a particular > output > VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL) > * The Subnet Administrator controls each communication flow parameters > by providing them as a response to Path Record (PR) or > MultiPathRecord (MPR) > queries > > The IB QoS features provide the means to implement a DiffServ like > architecture. > DiffServ architecture (IETF RFC2474 2475) is widely used today in > highly dynamic > fabrics. > > This proposal provides the detailed functional definition for the various > software elements that are required to enable a DiffServ like > architecture over > the OpenFabrics software stack. > > > > 2. Architecture > ---------------- > This proposal split the QoS functionality between the SM/SA, CMA and > the various > ULPS. We take the "chronology approach" to describe how the overall > system > works: > > 2.1. The network manager (human) provides a set of rules (policy) that > defines > how the network is being configured and how its resources are split to > different > QoS-Levels. The policy also define how to decide which QoS-Level each > application or ULP or service use. > > 2.2. 
The SM analyzes the provided policy to see if it is realizable > and performs > the necessary fabric setup. The SM may continuously monitor the policy > and adapt > to changes in it. Part of this policy defines the default QoS-Level of > each > partition. The SA is being enhanced to match the requested Source, > Destination, > QoS-Class, Service-ID (and optionally SL and priority) against the > policy. So > clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also > enhanced to support setting up partitions with appropriate IPoIB > broadcast > group. This broadcast group carries its QoS attributes: SL, MTU and > RATE. > > 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available > on the > multicast group which forms the broadcast group of this partition. > > 2.4. MPI which provides non IB based connection management should be > configured > to run using hard coded SLs. It uses these SLs for every QP being opened. > > 2.5. ULPs that use CM interface (like SRP) should have their own > pre-assigned > Service-ID and use it while obtaining PR/MPR for establishing > connections. > The SA receiving the PR/MPR should match it against the policy and return > the appropriate PR/MPR including SL, MTU and RATE. > > 2.6. ULPs and programs using CMA to establish RC connection should > provide the > CMA the target IP and Service-ID. Some of the ULPs might also provide > QoS-Class > (E.g. for SDP sockets that are provided the TOS socket option). The > CMA should > then use the provided Service-ID and optional QoS-Class and pass them > in the > PR/MPR request. The resulting PR/MPR should be used for configuring the > connection QP. > > PathRecord and MultiPathRecord enhancement for QoS: > As mentioned above the PathRecord and MultiPathRecord attributes > should be > enhanced to carry the Service-ID which is a 64bit value, which has been > standardized by the IBTA. A new field QoS-Class is also provided. 
> A new capability bit should describe the SM QoS support in the SA > class port > info. This approach provides an easy migration path for existing > access layer > and ULPs by not introducing new set of PR/MPR attribute. > > > 3. Supported Policy > -------------------- > > The QoS policy supported by this proposal is divided into 4 sub sections: > > I) Port Group: a set of CAs, Routers or Switches that share the same > settings. > A port group might be a partition defined by the partition manager > policy in > terms of GUIDs. Future implementations might provide support for > NodeDescription > based definition of port groups. > > II) Fabric Setup: > Defines how the SL2VL and VLArb tables should be setup. This policy > definition > assumes the computation of overall end to end network behavior should > be performed > outside of OpenSM. > > III) QoS-Levels Definition: > This section defines the possible sets of parameters for QoS that a > client > might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, > Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS). > > IV) Matching Rules: > A list of rules that match an incoming PR/MPR request to a QoS-Level. The > rules are processed in order such as the first match is applied. Each > rule is > built out of a set of match expressions which should all match for the > rule to > apply. 
The matching expressions are defined for the following fields > ** SRC and DST to lists of port groups > ** Service-ID to a list of Service-ID or Service-ID ranges > ** QoS-Class to a list of QoS-Class values or ranges > > QoS Policy file syntax > > * Empty lines are ignored > * Leading and trailing blanks, as well as empty lines, are ignored, so > the > indentation in the example is just for better readability > * Comments are started with the pound sign (#) and terminated by EOL > * Comments may appear only in a separate line > * Keywords that denote section/subsection start have matching closing > keywords > * Any keyword should be the first non-blank in the line > > QoS Policy file example > > # Port Groups define sets of ports to be used later in the settings > port-groups > # using port GUIDs > port-group > name: Storage > # "use" is just a description that is used for logging. > # Other than that, it is just a commentary > use: our SRP storage targets > port-guid: 0x1000000000000001 > port-guid: 0x1000000000000002 > end-port-group > > port-group > name: Virtual Servers > use: node desc and IB port num > # The syntax of the port name is as follows: > "hostname/CA-num/Pnum". > # "hostname" and "CA-num" are compared to the first 2 > words of > # NodeDescription, and "Pnum" is a port number on that node. > port-name: vs1/HCA-1/P1 > port-name: vs3/HCA-1/P1 > port-name: vs3/HCA-2/P2 > end-port-group > For clusters, I like to have a syntax a la slurm or rms which understand node ranges : port-name: vs[1-20,30-50]/HCA-1/P1 > # using partitions defined in the partition policy > port-group > name: Group for Partition 1 > use: default settings > partition: Part1 > end-port-group > > # using node types CA|ROUTER|SWITCH > port-group > name: Routers > use: all routers > node-type: ROUTER > end-port-group > > end-port-groups > > qos-setup > > # define all types of VLArb tables. 
The length of the tables > should > # match the physically supported tables by their target ports > vlarb-tables > # scope defines the exact ports the VLArb tables apply to > vlarb-scope > # defining VLArb tables on all the ports that belong to > # port group 'Storage', and on all the ports connected > # to ports of port group 'Storage' > group: Storage > # "across" means all the ports that are connected to > ports > # that belong to the specified port group > across: Storage > # VLArb table holds VL and weight pairs > vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1 > vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3 > vl-high-limit: 10 > end-vlarb-scope > # There can be several scopes > end-vlarb-tables > > sl2vl-tables > # Scope defines the exact devices and in/out ports tables > apply to. > # Note: if the same port is matching several rules the > *FIRST* one applies. > sl2vl-scope > # SL2VL tables are orgnized as SL2VL(in-port,out-port) > # "from: n,m" means we define the SL2VL(n,*) and > SL2VL(m,*) > # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m) > # > # The following example specifies that all the SL2VL > tables > # entries should be defined for all the ports of group > Part1: > group: Part1 > from: * > to: * > # SL2VL table has to have 16 values at max - one for > each SL. > # If the user specifies less than 16 values, all the > missing > # VL values will be implicitly set to 0 > sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > end-sl2vl-scope > > sl2vl-scope > # "across-to" is a combination of "across" keyword > (definition can be found > # in VLArb tables section) and "to" keyword. > # "across: PortGroupName" refers to all the ports that > are connected > # to ports that belong to PortGroupName. > # > # Example of "across-to" usage: > # A user has a set of 'special' nodes (e.g. storage > nodes), and all > # the traffic to these nodes has to get specific VL. > # The solution is to define port group (i.g. 
> "Storage") that will > # include all the ports of these nodes, and then to > configure SL2VL > # tables on all the switch ports that are connected > to the Storage > # port group by specifying "across-to: Storage". > # > across-to: Storage2 > # Similar to "across-to", "across-from" is a > combination of "across" > # and "to" keywords > across-from: Storage1 > sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 > end-sl2vl-scope > end-sl2vl-tables > > end-qos-setup > > > qos-levels > > # the first one is just setting SL > qos-level > use: for the lowest priority communication > sl: 15 > packet-life: 16 > end-qos-level > # the second sets SL and QoS Class > qos-level > use: low latency best bandwidth > sl: 0 > end-qos-level > # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, > Path Bits > qos-level > use: just an example > sl: 0 > mtu-limit: 1 > rate-limit: 1 > packet-life: 12 > # Path Bits can be used e.g. to provide a different routes > through the > # subnet to a particular port > path-bits: 2,4,8-32 > end-qos-level > > end-qos-levels > > > # Match rules are scanned in a first-fit manner (like firewall > rules table) > qos-match-rules > > # matching by single criteria: class (list of values and ranges) > qos-match-rule > # just a description > use: low latency by class 7-9 or 11 > qos-class: 7-9,11 > # number of qos-level to apply to the matching PR/MPR > qos-level-sn: 1 > end-qos-match-rule > # show matching by destination group AND service-ids > qos-match-rule > use: Storage targets connection > destination: Storage > service-id: 22,4719-5000 > qos-level-sn: 2 > end-qos-match-rule > # show matching by source group only > qos-match-rule > use: bla bla > source: Storage > qos-level-sn: 3 > end-qos-match-rule > > end-qos-match-rules > > > 4. IPoIB > --------- > > IPoIB already query the SA for its broadcast group information. 
The > additional > functionality required is for IPoIB to provide the broadcast group SL, > MTU, > and RATE in every following PathRecord query performed when a new UDAV is > needed by IPoIB. > We could assign a special Service-ID for IPoIB use but since all > communication > on the same IPoIB interface shares the same QoS-Level without the > ability to > differentiate it by target service we can ignore it for simplicity. > > 5. CMA features > ---------------- > > The CMA interface supports Service-ID through the notion of port space > as a > prefix to the port_num which is part of the sockaddr provided to > rdma_resolve_addr(). What is missing is the explicit request for a > QoS-Class that > should allow the ULP (like SDP) to propagate a specific request for a > class of > service. A mechanism for providing the QoS-Class is available in the > IPv6 address, > so we could use that address field. Another option is to implement a > special > connection options API for CMA. > > Missing functionality by CMA is the usage of the provided QoS-Class > and Service-ID > in the sent PR/MPR. When a response is obtained it is an existing > requirement for > the CMA to use the PR/MPR from the response in setting up the QP > address vector. > > > 6. SDP > ------- > > SDP uses CMA for building its connections. > The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits > holding the remote TCP/IP Port Number to connect to. > SDP might be provided with SO_PRIORITY socket option. In that case the > value > provided should be sent to the CMA as the TClass option of that > connection. > This requires modifications to applications and does not allow a global definition of QoS for all SDP applications in the fabric. This is inconsistent with libsdp, which is provided to migrate TCP/IP applications to SDP transparently.
If the matching rules allowed some kind of bitmask pattern matching, we could define something like: qos-match-rule use: all SDP applications service-id: 0x000000000001???? qos-level-sn: 2 end-qos-match-rule > 7. SRP > ------- > > Current SRP implementation uses its own CM callbacks (not CMA). So SRP > should > fill in the Service-ID in the PR/MPR by itself and use that > information in > setting up the QP. The T10 SRP standard defines the SRP Service-ID to > be defined > by the SRP target I/O Controller (but they should also comply with > IBTA Service- > ID rules). Anyway, the Service-ID is reported by the I/O Controller in > the > ServiceEntries DMA attribute and should be used in the PR/MPR if the SA > reports its ability to handle QoS PR/MPRs. > > 8. iSER > -------- > iSER uses CMA and thus should be very close to SDP. The Service-ID for > iSER > should be TBD. > > > 9. OpenSM features > ------------------- > The QoS related functionality to be provided by OpenSM can be split > into two > main parts: > > 3.1. Fabric Setup > During fabric initialization the SM should parse the policy and apply its > settings to the discovered fabric elements. The following actions > should be > performed: > * Parsing of policy > * Node Group identification. Warning should be provided for each node not > specified but found. > * SL2VL settings validation should be checked: > + A warning will be provided if there are no matching targets for > the SL2VL > setting statement. > + An error message will be printed to the log file if an invalid > setting is > found. A setting is invalid if it refers to: > - Non existing port numbers of the target devices > - Unsupported VLs for the target device. In the latter case the map > to non > existing VLs should be replaced with VL15 i.e. packets will be > dropped.
> * SL2VL setting is to be performed > * VL Arbitration table settings should be validated according to the > following > rules: > + A warning will be provided if there are no matching targets for > the setting > statement > + An error will be provided if the port number exceeds the target ports > + An error will be generated if the table length exceeds device > capabilities > + A warning will be generated if the table quote a VL that is not > supported > by the target device > * VL Arbitration tables will be set on the appropriate targets > > 3.2. PR/MPR query handling: > OpenSM should be able to enforce the provided policy on client request. > The overall flow for such requests is: first the request is matched > against the > defined match rules such that the target QoS-Level definition is > found. Given > the QoS-Level a path(s) search is performed with the given > restrictions imposed > by that level. The following two sections describe these steps. > > How Service-ID is carried in the PathRecord and MultiPathRecord > attributes is > now standardized by the IBTA. > > > 3.2.1. Matching rule search: > A rule is "matching" a PR/MPR request using the following criteria: > * Matching rules provide values in a list of either single value, or > range of > values. A PR/MPR field is "matching" the rule field if it is explicitly > noted in the list of values or is one of the values covered by a range > included in the field values list. > * Only PR/MPR fields that have their component mask bit set should be > compared. > * For a rule to be "matching" a PR/MPR request all the rule fields > should be > "matching" their PR/MPR fields. Such that a PR/MPR request that does > not have a component mask field set for one of the rule defined > fields can > not match that rule. > * A PR/MPR request that have a component mask bit set for one of the > fields > that is not defined by the rule can match the rule. 
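The matching criteria listed above can be sketched in C as a first-fit scan over the rules. This is only an illustration of the semantics, not OpenSM code: the structure names, the MASK_* bits and qos_find_level() are all hypothetical, only the Service-ID and QoS-Class fields are modelled (source and destination port groups are left out), and each rule field is assumed to be a list of inclusive ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical component-mask bits; NOT the real SA component mask values. */
#define MASK_SERVICE_ID (1u << 0)
#define MASK_QOS_CLASS  (1u << 1)

/* Inclusive range; a single value is a range with lo == hi. */
struct range { uint64_t lo, hi; };

/* One match rule. A field with a zero count is "not defined by the rule". */
struct qos_rule {
    const struct range *service_ids; size_t n_service_ids;
    const struct range *qos_classes; size_t n_qos_classes;
    int qos_level;                   /* QoS level selected by this rule */
};

/* The subset of a PR/MPR request that this sketch looks at. */
struct pr_request {
    uint32_t comp_mask;
    uint64_t service_id;
    uint64_t qos_class;
};

/* Returns 1 if v is covered by any of the n ranges. */
static int in_ranges(const struct range *r, size_t n, uint64_t v)
{
    for (size_t i = 0; i < n; i++)
        if (v >= r[i].lo && v <= r[i].hi)
            return 1;
    return 0;
}

/* A field defined by the rule matches only if the request has the
 * component mask bit set AND its value is in the rule's list.
 * A field the rule does not define never blocks a match. */
static int field_matches(const struct range *r, size_t n,
                         uint32_t comp_mask, uint32_t bit, uint64_t v)
{
    if (n == 0)
        return 1;
    return (comp_mask & bit) != 0 && in_ranges(r, n, v);
}

/* First-fit search: the QoS level of the first matching rule,
 * or default_level when no rule matches. */
int qos_find_level(const struct qos_rule *rules, size_t n_rules,
                   const struct pr_request *req, int default_level)
{
    assert(rules != NULL || n_rules == 0);
    assert(req != NULL);

    for (size_t i = 0; i < n_rules; i++) {
        const struct qos_rule *r = &rules[i];
        if (field_matches(r->service_ids, r->n_service_ids,
                          req->comp_mask, MASK_SERVICE_ID, req->service_id) &&
            field_matches(r->qos_classes, r->n_qos_classes,
                          req->comp_mask, MASK_QOS_CLASS, req->qos_class))
            return r->qos_level;
    }
    return default_level;
}
```

With the two example rules from the policy file (class 7-9,11 and service-id 22,4719-5000), a request that sets only the QoS-Class bit with value 8 would pick the first rule, while a request with no component mask bits set would fall through to the default level.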
> > The algorithm to be used for searching for a rule match might be as > simple as a > sequential search through all rules or enhanced for better > performance. The > semantics of every rule field and its matching PR/MPR field are described > below: > * Source: the SGID or SLID should be part of this group > * Destination: the DGID or DLID should be part of this group > * Service-ID: check if the requested Service-ID (available in the > PR/MPR old > SM-Key field) is matching any of this rule Service-IDs > * TClass: check if the PR/MPR TClass field is matching > > 3.2.2 PR/MPR response generation: > The QoS-Level pointed by the first rule that matches the PR/MPR request > should be used for obtaining the response SL, MTU-Limit, RATE-Limit, > Path-Bits > and QoS-Class. A default QoS-Level should be used if no rule is > matching the query. > > The efficient algorithm for finding paths that meet the QoS-Level > criteria is > beyond the scope of this RFC and left for the implementer to provide. > However > the criteria by which the paths match the QoS-Level are described below: > > * SL: The paths found should all use the given SL. For that sake PR/MPR > algorithm should traverse the path from source to destination only > through > ports that carry a valid VL (not VL15) by the SL2VL map (should > consider input > and output ports and SL). > * MTU-Limit: The resulting paths MTU should not exceed the given > MTU-Limit > * Rate-Limit: The resulting paths RATE should not exceed the given > RATE-Limit > (rate limit is given in units of link BW = Width*Speed according to > IBTA > Specification Vol-1 table-205 p-901 l-24). > * Path-Bits: define the target LID lowest bits (number of bits defined > by the > target port PortInfo.LMC field). The path should traverse the LFT > using the > target port LID with the path-bits set. > * QoS-Class: should be returned in the result PR/MPR. 
When routing is > going to > be supported by OpenSM we might use this field in selecting the target > router too in a TBD way. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > Philippe Gregoire CEA/DAM From mst at dev.mellanox.co.il Fri Jul 27 01:34:38 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Fri, 27 Jul 2007 11:34:38 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: <46A97850.2030607@osrg.net> References: <46A97850.2030607@osrg.net> Message-ID: <20070727083438.GA9912@mellanox.co.il> > Quoting Yoshiaki Tamura : > Subject: OFED-1.2 on x86 debian > > Hi. > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. > Although build_env.sh seems to work on debian, > it fails compiling both kernel modules and user land tools by rpmbuild. > > Is OFED-1.2 tested on debian or totally unsupported? It's not on a list of supported platforms, but I think we do builds on ubuntu so debian should work too. Vlad? 
-- MST From vlad at lists.openfabrics.org Fri Jul 27 01:39:47 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 27 Jul 2007 01:39:47 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070727-0100 daily build status Message-ID: <20070727083947.77152E60858@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed 
on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From vlad at lists.openfabrics.org Fri Jul 27 02:49:14 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Fri, 27 Jul 2007 02:49:14 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070727-0200 daily build status Message-ID: <20070727094915.02179E6085F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.19 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed: From Sumit.Gaur at Sun.COM Fri Jul 27 03:17:23 2007 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Fri, 27 Jul 2007 15:47:23 +0530 Subject: [ofa-general] TransactionID(IB_MAD_TRID_F) description Message-ID: <46A9C633.7040302@Sun.COM> Hi, I have just observed that TransactionID that I am providing with sndbuf to *umad_send* is not the one that I 
received back from *umad_recv* function. Looking in more detail, I have seen that only the low 32 bits of the TID in the received MAD match those of the sent MAD. Is this behavior of the TID expected, or is there a suitable way to get all 64 bits of the TID instead of only the low 32 bits? Thanks and Regards sumit From hnguyen at linux.vnet.ibm.com Fri Jul 27 03:52:49 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 27 Jul 2007 12:52:49 +0200 Subject: [ofa-general] [PATCH 0/2] ehca: remove WARNING: externs should be avoided in .c files Message-ID: <200707271252.50193.hnguyen@linux.vnet.ibm.com> Hello Roland! This small patch set fixes some coding-style related issues for ehca: [1/2] remove checkpatch.pl's warnings "externs should be avoided in .c files" [2/2] correction include order according kernel coding style Thanks Nam From hnguyen at linux.vnet.ibm.com Fri Jul 27 03:54:50 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 27 Jul 2007 12:54:50 +0200 Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings "externs should be avoided in .c files" Message-ID: <200707271254.51055.hnguyen@linux.vnet.ibm.com> From b5d0336089b5ebe5b18acb94b2c94c2026cb95ee Mon Sep 17 00:00:00 2001 From: Hoang-Nam Nguyen Date: Fri, 27 Jul 2007 10:24:49 +0200 Subject: [PATCH] remove checkpatch.pl's warnings "externs should be avoided in .c files" Signed-off-by: Hoang-Nam Nguyen --- drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 -- drivers/infiniband/hw/ehca/ehca_pd.c | 1 - drivers/infiniband/hw/ehca/hcp_if.c | 1 - drivers/infiniband/hw/ehca/ipz_pt_fn.h | 2 ++ 5 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 3725aa8..b5e9603 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -322,6 +322,7 @@ extern int ehca_static_rate; extern int
ehca_port_act_time; extern int ehca_use_hp_mr; extern int ehca_scaling_code; +extern int ehca_mr_largepage; struct ipzu_queue_resp { u32 qe_size; /* queue entry size */ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index c1b868b..773ac3f 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -64,8 +64,6 @@ enum ehca_mr_pgsize { EHCA_MR_PGSIZE16M = 0x1000000L }; -extern int ehca_mr_largepage; - static u32 ehca_encode_hwpage_size(u32 pgsize) { u32 idx = 0; diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c index 3dafd7f..43bcf08 100644 --- a/drivers/infiniband/hw/ehca/ehca_pd.c +++ b/drivers/infiniband/hw/ehca/ehca_pd.c @@ -88,7 +88,6 @@ int ehca_dealloc_pd(struct ib_pd *pd) u32 cur_pid = current->tgid; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); int i, leftovers = 0; - extern struct kmem_cache *small_qp_cache; struct ipz_small_queue_page *page, *tmp; if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index fdbfebe..24f4541 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -758,7 +758,6 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, const u64 logical_address_of_page, const u64 count) { - extern int ehca_debug_level; u64 ret; if (unlikely(ehca_debug_level >= 2)) { diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index c6937a0..a801274 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -54,6 +54,8 @@ struct ehca_pd; struct ipz_small_queue_page; +extern struct kmem_cache *small_qp_cache; + /* struct generic ehca page */ struct ipz_page { u8 entries[EHCA_PAGESIZE]; -- 1.5.2 From hnguyen at linux.vnet.ibm.com Fri Jul 27 03:55:19 2007 From: hnguyen at 
linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Fri, 27 Jul 2007 12:55:19 +0200 Subject: [ofa-general] [PATCH 2/2] ehca: correction include order according kernel coding style Message-ID: <200707271255.19456.hnguyen@linux.vnet.ibm.com> From a2794450cbee597cefd7b6e159257583c459d358 Mon Sep 17 00:00:00 2001 From: Hoang-Nam Nguyen Date: Fri, 27 Jul 2007 10:26:40 +0200 Subject: [PATCH] correction include order according kernel coding style Signed-off-by: Hoang-Nam Nguyen --- drivers/infiniband/hw/ehca/ehca_mrmw.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 773ac3f..1180b65 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -40,9 +40,8 @@ * POSSIBILITY OF SUCH DAMAGE. */ -#include - #include +#include #include "ehca_iverbs.h" #include "ehca_mrmw.h" -- 1.5.2
From sam at ravnborg.org Fri Jul 27 04:01:18 2007 From: sam at ravnborg.org (Sam Ravnborg) Date: Fri, 27 Jul 2007 13:01:18 +0200 Subject: [ofa-general] Re: [PATCH 1/2] ehca: remove checkpatch.pl's warnings "externs should be avoided in .c files" In-Reply-To: <200707271254.51055.hnguyen@linux.vnet.ibm.com> References: <200707271254.51055.hnguyen@linux.vnet.ibm.com> Message-ID: <20070727110118.GB12647@uranus.ravnborg.org> On Fri, Jul 27, 2007 at 12:54:50PM +0200, Hoang-Nam Nguyen wrote: > >From b5d0336089b5ebe5b18acb94b2c94c2026cb95ee Mon Sep 17 00:00:00 2001 > From: Hoang-Nam Nguyen > Date: Fri, 27 Jul 2007 10:24:49 +0200 > Subject: [PATCH] remove checkpatch.pl's warnings "externs should be avoided in .c files" > > Signed-off-by: Hoang-Nam Nguyen And you checked that said .h file was indeed included by the .c file that has the original definition? Otherwise the definition and the declaration can get out of sync without notice.
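A minimal sketch of the concern raised in the review question above, collapsed into one translation unit so it compiles on its own. Only the variable name is taken from the patch; the file boundaries marked in the comments, the initial value, and the helper largepage_enabled() are assumptions made for illustration and do not describe the driver's actual layout.

```c
/*
 * Illustration only: three "files" collapsed into one translation unit.
 * The file boundaries, the initial value and largepage_enabled() are
 * hypothetical; only the variable name comes from the patch.
 */

/* ---- shared header: the single declaration every user sees ---- */
extern int ehca_mr_largepage;

/*
 * ---- defining .c file ----
 * Because it includes the header above, the compiler sees the declaration
 * and the definition together and rejects a mismatch. Changing the type
 * here to "long" without touching the header would fail to compile.
 */
int ehca_mr_largepage = 0;

/* ---- some other .c file that only includes the header ---- */
int largepage_enabled(void)
{
	return ehca_mr_largepage != 0;
}
```

If the defining file did not include the header, a later type change in either place would still compile and link, which is the silent drift the question is about.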
Sam From eitan at mellanox.co.il Fri Jul 27 04:27:53 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 27 Jul 2007 14:27:53 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <20070727010707.GR2472@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901F7510A@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> <20070727010707.GR2472@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com> The problem I have with back-to-back plug is that it is a fatal case if found in a case where there was no use of this plug. So we will need some sort of user input if it is OK or not. The case of moving a port in the middle of a sweep can be easily detected if instead of reporting an error a second check of the original DR where the same GUID was found is performed... Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Friday, July 27, 2007 4:07 AM > To: Eitan Zahavi > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > On 09:25 Thu 26 Jul , Eitan Zahavi wrote: > > > Hi Eitan, Hal, > > > > > > On 20:44 Wed 25 Jul , Eitan Zahavi wrote: > > > > > > > > I am not following you. > > > > Why do a user need to run -y if a simple legal cable > connector is > > > > plugged? > > > > > > Because duplicated GUIDs detector can aborts OpenSM when regular > > > port is reconnected to another location during hard sweep. 
This > > > issue is not related to loopback plug at all. > > I think we should handle the case of "migrated port" in a > more global > > sense: > > If a port "moved" during the sweep we have to do a new sweep anyway. > > Another option is just to use recently discovered port > location. In case of CA it could work, switch migration can > be more complicated. > > > Maybe we could delay the 'abort' to the second sweep. > > > > So practically I propose: > > 1. Add state flag "was duplicated" on the port saying it > was reported > > as duplicate GUID. > > 2. Set the variable controlling a forced secodn sweep > (similar to the > > one used if we got Set error) > > We even can catch this yet before drop_manager and just rediscover. > > > 3. Repeat the sweep - if we find a port where it is a duplicate and > > the "was duplicated" flag is set - abort. > > > > A refinement for the user who is doing many changes > continuously might > > be to keep a counter. > > And have the abort happen after the Nth iteration. > > It is better approach than what we have today. > > > > > > > > The issue is only if a "loop back" plug connecting a port > > > to itself is > > > > plugged. > > > > > > No, not only. Now there are two completely separate known issues > > > with duplicated GUIDs detector: > > > > > > 1. Port moving > > > 2. Loopback plug > > > > > > And I think that _both_ should be solved. And if just using '-y' > > > could be suitable for (2) because it is esoteric > (although perfectly > > > legal) use, it is not acceptable solution for (1). > > > > > > I think we need to improve GUIDs duplication detector > instead. For > > > example we could add NodeInfo comparison there, and only > in case if > > > it is different drop GUIDs duplication error. Also I think this > > > should not be fatal error and should not abort OpenSM, > just logging > > > (probably via syslog too) should be sufficient - > non-working port is > > > good reason to look at logs. Another ideas? 
> > The problem is that the SM will sort of figure out the network but > > will create a completely bogus routing etc. > > Right. But it is not so with back-to-back (when loopback plug > could be interpreted as back-to-back duplicated GUID). So no > need to abort in this (back-to-back/loopback) case. Agreed? > > Sasha > > > > > > > > > Sasha > > > > > > > Do users use these plugs? For what sake? > > > > > > > > > > > > Eitan Zahavi > > > > Senior Engineering Director, Software Architect Mellanox > > > Technologies > > > > LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > > > Sent: Wednesday, July 25, 2007 3:19 AM > > > > > To: Eitan Zahavi > > > > > Cc: Hal Rosenstock; OpenFabrics General; Yevgeny Kliteynik > > > > > Subject: Re: OpenSM detection of duplicated GUIDs on loopback > > > > > > > > > > On 23:25 Tue 24 Jul , Eitan Zahavi wrote: > > > > > > > > > > > > On 7/24/07, Eitan Zahavi wrote: > > > > > > > > > > > > Maybe avoid the log if -y is provided? > > > > > > > > > > > > > > > > > > That avoids the spew but the duplicated GUID is > > > > > important to know so > > > > > > IMO something in the "middle" is needed where > > > duplicated GUIDs are > > > > > > logged but not continually the same ones. > > > > > > [EZ] > > > > > > OK so in -y mode only we track which ones were reported > > > > > and do not > > > > > > repeat the log? > > > > > > > > > > And how port moving problem should be solved? > > > > > > > > > > We cannot ask an user to run OpenSM with '-y' if in > > > her/his plans to > > > > > reconnect some ports in a future and just decrease logging. 
> > > > > > > > > > Sasha > > > > > > > > > From suri at baymicrosystems.com Fri Jul 27 05:26:04 2007 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 27 Jul 2007 08:26:04 -0400 Subject: [ofa-general] opensm off by default In-Reply-To: <46A99F9C.5040303@cea.fr> References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr> Message-ID: <01e501c7d049$50483d10$1914a8c0@surioffice> Since opensm is off by default in ofed1.2 (which I found out the hard way), can we please add a note either to the documentation or the ./install.sh menu on how to enable/install Opensm please. Thanks, Suri From hal.rosenstock at gmail.com Fri Jul 27 05:47:59 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Jul 2007 08:47:59 -0400 Subject: [ofa-general] opensm off by default In-Reply-To: <01e501c7d049$50483d10$1914a8c0@surioffice> References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr> <01e501c7d049$50483d10$1914a8c0@surioffice> Message-ID: On 7/27/07, Suresh Shelvapille wrote: > Since opensm is off by default in ofed1.2 (which I found out the hard way), can we please > add a note either to the documentation or the ./install.sh menu on how to enable/install > Opensm please. I think this may be an EWG request rather than OpenIB/Fabrics. -- Hal > > Thanks, > Suri > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From landman at scalableinformatics.com Fri Jul 27 06:42:02 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 27 Jul 2007 09:42:02 -0400 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: <20070727083438.GA9912@mellanox.co.il> References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> Message-ID: <46A9F62A.7080500@scalableinformatics.com> Michael S. 
Tsirkin wrote: >> Quoting Yoshiaki Tamura : >> Subject: OFED-1.2 on x86 debian >> >> Hi. >> >> I'm trying to install OFED-1.2 on x86 (32bit) debian machine. >> Although build_env.sh seems to work on debian, >> it fails compiling both kernel modules and user land tools by rpmbuild. >> >> Is OFED-1.2 tested on debian or totally unsupported? > > It's not on a list of supported platforms, but I think we do builds > on ubuntu so debian should work too. Vlad? I have been trying to make it work here on Ubuntu (Debian rebuild) 7.04. Had to hack build_env.sh a little to get it to ignore some of the dependency checking (done by package name, which is not portable across distros). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From transter at gmail.com Fri Jul 27 07:47:25 2007 From: transter at gmail.com (lbt) Date: Fri, 27 Jul 2007 07:47:25 -0700 Subject: [ofa-general] Lost in-service traps during Open SM migration In-Reply-To: <20070727025952.GE6691@sashak.voltaire.com> References: <20070725220204.GI31582@sashak.voltaire.com> <20070727025952.GE6691@sashak.voltaire.com> Message-ID: Hi Sasha, Yes, the problem seems to appear only when there is an SM migration. I receive in-service notices for other ports, as long as there is no SM migration occurring. Thanks, Lan On 7/26/07, Sasha Khapyorsky wrote: > > On 12:37 Thu 26 Jul , lbt wrote: > > Thanks for the suggestion Sasha! > > > > Our host stack does receive a rereregistration notice and does > resubscribe > > all handlers at > > that point in time. 
At the time of the SM migration, our stack prints > out > > some informational messages to > > confirm this: > > Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER > occurred > > on port 1 > > Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM > LID=8 > > > > And also confirmed in the SM logs that after the migration, the higher > > priority SM is getting a subscription request for in-service trap: > > Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: > > Subscribe Request with QPN: 0x000001 > > Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [ > > Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [ > > Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump: > > > gid.....................0x0000000000000000 : > > 0x0000000000000000 > > lid_range_begin.........0xFFFF > > lid_range_end...........0x0 > > is_generic..............0x1 > > subscribe...............0x0 > > trap_type...............0x3 > > trap_num................64 > > qpn.....................0x000001 > > resp_time_val...........0x0 > > node_type...............0x000004 > > Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ] > > > > It maybe a problem if the resubscription of the in-service handler > occurs > > after the in-service notice was forwarded, but I think the problem is > that > > there is never a notice that is forwared for the higher priority SM > port > > that is restored. > > And after OpenSM migration, did you receive in-service notices for > another ports? Does the problem happen only in migration time? > > > Perhaps, neither SM (the lower priority and higher > > priority one), generates an in-service trap because of the timing gap > > between when the restored port is detected and "marked" (i.e. added to > > new_ports_list) and when in-service traps are generated for new ports. 
> > During SM migration, the lower priority SM detects the new port, but > the > > higher priority SM does the trap generation (but it doesn't realize > that > > it's own port is a new port and thus doesn't generate a trap for it). > > > > Our host stack executes some functions when a port is restored (in our > > in-service subscription handler). > > Am I not supposed to receive an in-service trap for a restored port > that > > happens to be the Master SM, > > Yes, I guess you are. > > > and instead execute these actions with a > > client reregistration event? > > Client reregistration request is not suitable here - SM can ask for > client reregistration at any time (in practice OpenSM now does it only > when enters MASTER state, but it is also optional). > > Sasha > > > > > Thanks again for your help! > > Lan > > > > > > > > On 7/25/07, Sasha Khapyorsky wrote: > > > > > > Hi Lan, > > > > > > On 09:57 Wed 25 Jul , lbt wrote: > > > > Hello, > > > > > > > > I have been seeing a problem where a subscriber for in-service > traps is > > > not > > > > getting informed when the port of master openSM is restored (i.e. > > > causing an > > > > SM migration). > > > > > > > > I have an IB subnet with 2 nodes running OpenSM , different > priorities > > > of > > > > course (OpenSM Rev:openib-2.0.5). I also have another node on the > > > subnet > > > > that has subscribed for the forwarding of any > > > IB_SA_GENERIC_TRAP_NUM_IN_SVC > > > > trap events. I've been doing cable pull tests on the IB ports, to > check > > > if > > > > the in-service handler I have subscribed gets invoked when I > restore > > > the > > > > cable. I've noticed that everything works as expected ( i.e. my > > > in-service > > > > handler is invoked) whenever I restore the cable on the lower > priority > > > SM IB > > > > port without ever touching the master SM port. 
But if I cause an SM > > > > migration, by restoring the port of the higher priority SM, the > > > in-service > > > > trap does not get generated as expected on a cable restore. > > > > > > > > Steps to Reproduce: > > > > 1) Start with port to higher priority SM disconnected. > > > > 2) restore port cable on the higher priority SM > > > > --> This causes an SM Migration as expected, SM's migration happens > > > okay > > > > --> I expected the restoration of the higher priority SM to tit to > also > > > > trigger an in-service trap as well and notify subscribers, but it > > > doesn't > > > > occur > > > > > > > > I have collected debug messages log for both open SM's, and it > appears > > > that > > > > the reason is because: > > > > 1) in-service traps are generated based on what ports are added on > the > > > > Master SM's new_ports_list, but these traps are generated only > after > > > LID > > > > assignment > > > > 2) when the higher priority SM port is restored, the restored port > gets > > > > added to the lower priority SM's new_ports_list (since it's still > the > > > Master > > > > SM at that point in time) > > > > 3) the handover of Master SM from lower priority to higher > priority > > > SM > > > > occurs (before LID assignment and thus a chance for traps get > generated > > > for > > > > those ports on new_ports_list) > > > > 4) the higher priority SM is now Master SM, but it has an empty > > > > new_ports_list, so no trap generated either > > > > > > > > Does this look like a legitimate Open SM bug? Any feedback would be > > > much > > > > appreciated, and if I can help further in any way please let me > know . > > > > > > As far as I know when OpenSM (even old like 2.0.5) becomes master it > > > requests client to reregister SA related stuff (by setting this bit in > > > PortInfo). 
> > > > > > Probably your port doesn't not support this (you could verify by > seeing > > > PortInfo:CapabilityMask - use 'smpquery portinfo ') > or > > > maybe your host stack doesn't do reregistration? > > > > > > Anyway you could track this in the OpenSM code in osm_lid_mgr.c > > > __osm_lid_mgr_set_physp_pi() whenever client reregistration bit is set > > > (with ib_port_info_set_client_rereg()) or not. Then we will know more > > > about this problem. > > > > > > Sasha > > > > > > > > > > > > > > > Subset of logs from lower priority SM during the cable restore of > > > higher > > > > priority SM port: > > > > ### Jul 18 14:31:56 614522 [41401960] -> > > > __osm_trap_rcv_process_request: > > > > Received Generic Notice type:0x03 num:128 Producer:2 from > LID:0x000A > > > > TID:0x00000016000012e1 > > > > ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: > > > Received > > > > signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE > > > > ### 14:31:56 ******************** INITIATING HEAVY SWEEP > > > > ********************** > > > > ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: > > > Received > > > > signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > > > > OSM_SM_STATE_SWEEP_HEAVY_SELF > > > > Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: > Adding > > > port > > > > GUID:0x00504501483e0000 to new_ports_list > > > > Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: > Received > > > signal > > > > OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > > > Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: > Received > > > signal > > > > OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state > > > OSM_SM_STATE_SWEEP_HEAVY_SUBNET > > > > 14:31:56 ********************* HEAVY SWEEP COMPLETE > > > *********************** > > > > Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: > Received > > > > signal OSM_SM_SIGNAL_HANDOVER_SENT in state > IB_SMINFO_STATE_MASTER### > > > > 14:31:56 
******************** ENTERING SM STANDBY STATE > > > ******************* > > > > > > > > Subset of logs from higher priority SM during the cable restore of > > > higher > > > > priority SM port: > > > > > > > > Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [ > > > > Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: > Received > > > > signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state > > > > IB_SMINFO_STATE_DISCOVERING > > > > Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state > > > > Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: > > > > ******************** ENTERING SM MASTER STATE ******************** > > > > Jul 18 14:32:03 009014 [41401960] -> > > > __osm_state_mgr_set_sm_lid_done_msg: > > > > **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** > > > > Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg > > > > ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** > > > > Jul 18 14:32:03 024052 [41E02960] -> > __osm_state_mgr_report_new_ports: > > > [ > > > > ----> no in-service traps are generated and notices forwarded > because > > > there > > > > are no ports on this list > > > > Jul 18 14:32:03 024057 [41E02960] -> > __osm_state_mgr_report_new_ports: > > > ] > > > > > > > > > > > > Thanks! > > > > Lan > > > > > > > _______________________________________________ > > > > general mailing list > > > > general at lists.openfabrics.org > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Fri Jul 27 09:15:20 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jul 2007 09:15:20 -0700 Subject: [ofa-general] TransactionID(IB_MAD_TRID_F) description In-Reply-To: <46A9C633.7040302@Sun.COM> References: <46A9C633.7040302@Sun.COM> Message-ID: <46AA1A18.8020501@ichips.intel.com> > I have just observed that the TransactionID I provide in sndbuf > to *umad_send* is not the one I receive back from the *umad_recv* > function. Looking in more detail, only the low 32 bits of the TID > match between the sent MAD and the received MAD. Is this behavior > expected, or is there a suitable way to get all 64 bits of the TID > back instead of only the low 32 bits? This is the correct behavior. The user only has control over the lower 32 bits of the TID. The upper 32 bits are reserved by the MAD layer for multiplexing purposes. - Sean From mshefty at ichips.intel.com Fri Jul 27 09:24:35 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jul 2007 09:24:35 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A94657.1020101@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A94657.1020101@ichips.intel.com> Message-ID: <46AA1C43.20808@ichips.intel.com> > I think the RDMA CM needs two solutions, depending on which address > family is used. For IPv6, the existing interface is sufficient, and > works for both IB and iWarp. The RDMA CM only needs to include the TC > and FL as part of its PR query. For IPv4, to remain transport neutral, > I think we should add an rdma_set_option() routine to specify the QoS > field. The RDMA CM would include the QoS field for PR query under this > condition. > > For IB, this requires changes to the ib_sa to support the new PR > extensions. I don't think we gain anything having the RDMA CM include > service IDs as part of the query. I overlooked multicast in my reply. Unfortunately, the QoS field was not added to MCMemberRecord.
For multicast, IPv6 addresses would still use the TC and FL provided by the user. For IPv4, the RDMA CM will either need to match the TC and FL of the IPoIB broadcast group or leave these fields unspecified. - Sean From hal.rosenstock at gmail.com Fri Jul 27 09:33:25 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Jul 2007 12:33:25 -0400 Subject: [ofa-general] QoS RFC In-Reply-To: <46AA1C43.20808@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A94657.1020101@ichips.intel.com> <46AA1C43.20808@ichips.intel.com> Message-ID: On 7/27/07, Sean Hefty wrote: > > I think the RDMA CM needs two solutions, depending on which address > > family is used. For IPv6, the existing interface is sufficient, and > > works for both IB and iWarp. The RDMA CM only needs to include the TC > > and FL as part of its PR query. For IPv4, to remain transport neutral, > > I think we should add an rdma_set_option() routine to specify the QoS > > field. The RDMA CM would include the QoS field for PR query under this > > condition. > > > > For IB, this requires changes to the ib_sa to support the new PR > > extensions. I don't think we gain anything having the RDMA CM include > > service IDs as part of the query. > > I overlooked multicast in my reply. Unfortunately, the QoS field was > not added to MCMemberRecord. Good point. You can make PR requests with MGID as DGID though. -- Hal > For multicast, IPv6 addresses would still > use the TC and FL provided by the user. For IPv4, the RDMA CM will > either need to match the TC and FL of the IPoIB broadcast group or leave > these fields unspecified. 
> > - Sean >
From mshefty at ichips.intel.com Fri Jul 27 09:44:35 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jul 2007 09:44:35 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A99F9C.5040303@cea.fr> References: <46A283B6.1070105@dev.mellanox.co.il> <46A99F9C.5040303@cea.fr> Message-ID: <46AA20F3.8010901@ichips.intel.com> >> SDP uses CMA for building its connections. >> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits >> holding the remote TCP/IP Port Number to connect to. >> SDP might be provided with SO_PRIORITY socket option. In that case the >> value provided should be sent to the CMA as the TClass option of that >> connection. >> > This requires modifications to applications and does not allow a global > definition of QoS for all SDP applications in the fabric. > This is inconsistent with Libsdp, which is provided to transparently migrate > TCP/IP applications to SDP. > If the matching rules allow some kind of bitmask pattern matching, we > can define something like: > qos-match-rule > use: all SDP applications > service-id: 0x000000000001???? > qos-level-sn: 2 > end-qos-match-rule Please see my response from yesterday. I believe we can eliminate the use of the service ID for SDP, and instead rely on the IPv6 address or socket options. My suggestions for the host stack restrict the use of the service ID to SRP. If SRP were to provide a QoS parameter instead, we could avoid any use of service ID in our implementation. However, I don't know the scope required to support that change.
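For reference, the SDP service-ID layout quoted above (0x000000000001PPPP) and the wildcard rule proposed for it (0x000000000001????) amount to a fixed upper 48 bits plus a 16-bit port. A rough sketch in C, assuming host byte order; the macro and function names are made up for illustration and are not part of any OFED or SDP API.

```c
#include <stdint.h>

/* Hypothetical names for illustration; none of these come from the SDP code. */
#define SDP_SID_PREFIX    0x0000000000010000ULL  /* 0x000000000001PPPP with PPPP = 0 */
#define SDP_SID_PORT_MASK 0x000000000000FFFFULL  /* the PPPP digits */

/* Build the service ID for a given remote TCP/IP port (host byte order). */
uint64_t sdp_service_id(uint16_t port)
{
	return SDP_SID_PREFIX | (uint64_t)port;
}

/* Same effect as matching against the pattern 0x000000000001????. */
int is_sdp_service_id(uint64_t sid)
{
	return (sid & ~SDP_SID_PORT_MASK) == SDP_SID_PREFIX;
}

/* Recover the port from a service ID that matched. */
uint16_t sdp_service_id_port(uint64_t sid)
{
	return (uint16_t)(sid & SDP_SID_PORT_MASK);
}
```

This only illustrates the numbering scheme. Whether QoS matching should key on the service ID at all is the open question in this thread.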
- Sean From mshefty at ichips.intel.com Fri Jul 27 09:45:56 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jul 2007 09:45:56 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: References: <46A283B6.1070105@dev.mellanox.co.il> <46A94657.1020101@ichips.intel.com> <46AA1C43.20808@ichips.intel.com> Message-ID: <46AA2144.70102@ichips.intel.com> > You can make PR requests with MGID as DGID though. I thought about that, but the QoS field isn't defined for PR responses, only requests (Get, GetTable). - Sean From mshefty at ichips.intel.com Fri Jul 27 09:59:54 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 27 Jul 2007 09:59:54 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: References: <46A283B6.1070105@dev.mellanox.co.il> <46A94657.1020101@ichips.intel.com> <46AA1C43.20808@ichips.intel.com> Message-ID: <46AA248A.3010808@ichips.intel.com> >> I overlooked multicast in my reply. Unfortunately, the QoS field was >> not added to MCMemberRecord. > > Good point. > > You can make PR requests with MGID as DGID though. My bad here. We need to specify the SL, FL, and TC when creating the multicast group, so the required QoS -> TC, FL mapping needs to be done by the user. So, the RDMA CM will need to use the IPoIB broadcast group information. - Sean From rick.jones2 at hp.com Fri Jul 27 10:07:47 2007 From: rick.jones2 at hp.com (Rick Jones) Date: Fri, 27 Jul 2007 10:07:47 -0700 Subject: [ofa-general] OFED-1.2 on x86 debian In-Reply-To: <46A97850.2030607@osrg.net> References: <46A97850.2030607@osrg.net> Message-ID: <46AA2663.4060709@hp.com> Yoshiaki Tamura wrote: > Hi. > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. > Although build_env.sh seems to work on debian, > it fails compiling both kernel modules and user land tools by rpmbuild. > > Is OFED-1.2 tested on debian or totally unsupported? When I tried to do that with ia64 Debian I was directed towards some tar files of the mods rather than the install.sh stuff. 
I don't have the pointers at my fingertips, but would assume they remain in the list archives. rick jones From hal.rosenstock at gmail.com Fri Jul 27 12:03:47 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Jul 2007 15:03:47 -0400 Subject: [ofa-general] ibutils building Message-ID: Hi Eitan, When building ibutils (master), I get the following error: gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c -fPIC -DPIC -o .libs/ibis.o ibis.c:41:25: git_version.h: No such file or directory Any idea ? What's git-version.h ? Thanks. -- Hal From hal.rosenstock at gmail.com Fri Jul 27 12:32:50 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 27 Jul 2007 15:32:50 -0400 Subject: [ofa-general] ibutils/ibdm building Message-ID: Hi again Eitan, When building ibutils/ibdm (master), I get the following error: if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../datamodel -g -O2 -MT osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o osm_check.o osm_check.cpp; \ then mv -f ".deps/osm_check.Tpo" ".deps/osm_check.Po"; else rm -f ".deps/osm_check.Tpo"; exit 1; fi osm_check.cpp: In function `int main(int, char**)': osm_check.cpp:428: `R_OK' undeclared (first use this function) osm_check.cpp:428: (Each undeclared identifier is reported only once for each function it appears in.) osm_check.cpp:428: `access' undeclared (first use this function) Thanks. -- Hal
From kliteyn at mellanox.co.il Fri Jul 27 21:05:10 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 28 Jul 2007 07:05:10 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-28:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Wed_Jul_25_02:46:48_2007 [d06c318cb50ddddf55b20a5e896d2d22d7b90948] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=520 Pass=467 Fail=53 Pass: 39 Stability IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 12 FatTree merge-roots-4-ary-2-tree.topo Failures: 39 Pkey IS1-16.topo 13 Pkey IS3-128.topo 1 FatTree merge-roots-4-ary-2-tree.topo From vlad at lists.openfabrics.org Sat Jul 28 01:39:07 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 28 Jul 2007 01:39:07 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070728-0100 daily build status Message-ID: <20070728083907.95087E60848@openfabrics.org> This email was generated automatically, please do not reply git_url:
git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.14 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ppc64 with 
linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From vlad at lists.openfabrics.org Sat Jul 28 02:51:04 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sat, 28 Jul 2007 02:51:04 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070728-0200 daily build status Message-ID: <20070728095104.BEE66E60848@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ia64 with
linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.22 Passed on ia64 with linux-2.6.22 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.9-34.ELsmp Failed:
From eitan at mellanox.co.il Sat Jul 28 13:01:51 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 28 Jul 2007 23:01:51 +0300 Subject: [ofa-general] RE: ibutils building In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com> Git_version.h is a new automatically generated file I added to the Makefile.am and ibis.i Fabric.cpp and sim.i. Is it possible you did not rerun autogen.sh ? Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Friday, July 27, 2007 10:04 PM > To: Eitan Zahavi > Cc: OpenFabrics General > Subject: ibutils building > > Hi Eitan, > > When building ibutils (master), I get the following error: > > gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include > -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG > -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB > -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 > -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g > -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c -fPIC > -DPIC -o .libs/ibis.o > ibis.c:41:25: git_version.h: No such file or directory > > Any idea ? What's git-version.h ? Thanks. > > -- Hal > From hal.rosenstock at gmail.com Sat Jul 28 13:28:49 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sat, 28 Jul 2007 16:28:49 -0400 Subject: [ofa-general] Re: ibutils building In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com> Message-ID: On 7/28/07, Eitan Zahavi wrote: > Git_version.h is a new automatically generated file I added to the > Makefile.am and ibis.i Fabric.cpp and sim.i. > Is it possible you did not rerun autogen.sh ?
Nope; I ran that prior to configuring. -- Hal > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > Sent: Friday, July 27, 2007 10:04 PM > > To: Eitan Zahavi > > Cc: OpenFabrics General > > Subject: ibutils building > > > > Hi Eitan, > > > > When building ibutils (master), I get the following error: > > > > gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include > > -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG > > -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB > > -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 > > -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g > > -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c -fPIC > > -DPIC -o .libs/ibis.o > > ibis.c:41:25: git_version.h: No such file or directory > > > > Any idea ? What's git-version.h ? Thanks. > > > > -- Hal > > > From eitan at mellanox.co.il Sat Jul 28 13:49:18 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 28 Jul 2007 23:49:18 +0300 Subject: [ofa-general] RE: ibutils/ibdm building In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com> Hi Hal This does not reproduce on my fresh clone. So I am not sure what is going on. Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Friday, July 27, 2007 10:33 PM > To: Eitan Zahavi > Cc: OpenFabrics General > Subject: ibutils/ibdm building > > Hi again Eitan, > > When building ibutils/ibdm (master), I get the following error: > > if g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-I../datamodel -g -O2 -MT > osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o > osm_check.o osm_check.cpp; \ then mv -f ".deps/osm_check.Tpo" > ".deps/osm_check.Po"; else rm -f ".deps/osm_check.Tpo"; exit 1; fi > osm_check.cpp: In function `int main(int, char**)': > osm_check.cpp:428: `R_OK' undeclared (first use this function) > osm_check.cpp:428: (Each undeclared identifier is reported > only once for each > function it appears in.) > osm_check.cpp:428: `access' undeclared (first use this function) > > Thanks. > > -- Hal > From sashak at voltaire.com Sat Jul 28 13:53:07 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 28 Jul 2007 23:53:07 +0300 Subject: [ofa-general] Re: ibutils building In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com> Message-ID: <20070728205307.GC12351@sashak.voltaire.com> On 16:28 Sat 28 Jul , Hal Rosenstock wrote: > On 7/28/07, Eitan Zahavi wrote: > > Git_version.h is a new automatically generated file I added to the > > Makefile.am and ibis.i Fabric.cpp and sim.i. > > Is it possible you did not rerun autogen.sh ? > > Nope; I ran that prior to configuring. The same problem is here (after ./autogen.sh). Sasha > > -- Hal > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > Sent: Friday, July 27, 2007 10:04 PM > > > To: Eitan Zahavi > > > Cc: OpenFabrics General > > > Subject: ibutils building > > > > > > Hi Eitan, > > > > > > When building ibutils (master), I get the following error: > > > > > > gcc -DHAVE_CONFIG_H -I. -I. -I.. 
-I/usr/include > > > -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG > > > -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB > > > -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 > > > -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g > > > -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c -fPIC > > > -DPIC -o .libs/ibis.o > > > ibis.c:41:25: git_version.h: No such file or directory > > > > > > Any idea ? What's git-version.h ? Thanks. > > > > > > -- Hal > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Sat Jul 28 14:05:50 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 29 Jul 2007 00:05:50 +0300 Subject: [ofa-general] RE: ibutils building In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901F75B4B@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75B60@mtlexch01.mtl.com> Just reproduced it on a fresh checkout. A fix was just pushed in Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Saturday, July 28, 2007 11:29 PM > To: Eitan Zahavi > Cc: OpenFabrics General > Subject: Re: ibutils building > > On 7/28/07, Eitan Zahavi wrote: > > Git_version.h is a new automatically generated file I added to the > > Makefile.am and ibis.i Fabric.cpp and sim.i. > > Is it possible you did not rerun autogen.sh ? > > Nope; I ran that prior to configuring. > > -- Hal > > > > > Eitan Zahavi > > Senior Engineering Director, Software Architect Mellanox > Technologies > > LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > > Sent: Friday, July 27, 2007 10:04 PM > > > To: Eitan Zahavi > > > Cc: OpenFabrics General > > > Subject: ibutils building > > > > > > Hi Eitan, > > > > > > When building ibutils (master), I get the following error: > > > > > > gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/usr/include > > > -I/usr/local/include/infiniband -I/usr/local/include -DDEBUG > > > -D_DEBUG -D_DEBUG_ -DDBG -DOSM_VENDOR_INTF_OPENIB > -DOSM_BUILD_OPENIB > > > -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall > -fno-strict-aliasing > > > -fPIC -DIBIS_VERSION=\"1.2\" -g > > > -O2 -MT ibis.lo -MD -MP -MF .deps/ibis.Tpo -c ibis.c > -fPIC -DPIC -o > > > .libs/ibis.o > > > ibis.c:41:25: git_version.h: No such file or directory > > > > > > Any idea ? What's git-version.h ? Thanks. > > > > > > -- Hal > > > > > > From sashak at voltaire.com Sat Jul 28 14:55:27 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 29 Jul 2007 00:55:27 +0300 Subject: [ofa-general] Re: pkey.sim.tcl In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com> <20070726224133.GC2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com> Message-ID: <20070728215527.GH12351@sashak.voltaire.com> Hi Eitan, On 07:56 Fri 27 Jul , Eitan Zahavi wrote: > > > > On 09:26 Thu 26 Jul , Eitan Zahavi wrote: > > > > > > I am happy you actually use the simulator. > > > Please provide more info regarding the failure. You should tar > > > compress the /tmp/ibmgtsim.XXXX of your run. 
> > > > I can send this for you if you want, but the failure is trivial. > No need if you already know where the bug is... > > > > Yes, and it is due to (6), where default Pkey is removed > > "externally". I'm not sure that OpenSM should handle the case > > when pkey table is modified externally by something which is not SM. > > > > For a few years it just worked fine. So I wonder why this functionality > was removed ? > It is a real BAD case where Pkeys are altered but I think would be wise > to "refresh" these tables on heavy sweep. We discussed how and when port tables refresh should be done just a few days ago in this thread. My impression was that we are "in sync" about this. > In general it seems OpenSM has lost its "heavy sweep" concept. Now it > does not refresh the fabric setup even on heavy sweep. Not on each heavy sweep, but it does when it is needed or when data could change. I don't think the concept was changed, just optimized. Let's just look at the numbers: $ time ./opensm/opensm -e -f ./osm.log -o ... SUBNET UP Exiting SM real 0m7.995s user 0m4.488s sys 0m6.072s $ time ./opensm/opensm -e -f ./osm.log -o --qos ... SUBNET UP Exiting SM real 0m22.521s user 0m10.921s sys 0m17.173s These are simulated runs (with ibsim), the fabric is ~1300 nodes. The difference is the '--qos' flag, so OpenSM skips the SL2VL and VLArb update in the first run and does it in the second - sweep times are 8 against 22 seconds. > This is assuming a "perfect" HW and software and I would really think we > should have preserved that capability. What about an option? Now with the subn->need_update flag (which always enforces updates) it is trivial to implement. > Note that a "heavy sweep" does not happen unless something changed or > trapped. Yes, for example some port was connected/disconnected, some node rebooted, etc.. OpenSM starts a huge heavy sweep, it takes a while, the SA is not responsive most of the time, TCP connections over IPoIB time out, applications fail. This is production experience...
:( Sasha From sashak at voltaire.com Sat Jul 28 15:15:40 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 29 Jul 2007 01:15:40 +0300 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> <20070727010707.GR2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com> Message-ID: <20070728221540.GI12351@sashak.voltaire.com> On 14:27 Fri 27 Jul , Eitan Zahavi wrote: > The problem I have with back-to-back plug is that it is a fatal case if > found in a case where there was no use of this plug. > So we will need some sort of user input if it is OK or not. Ok, and let's add cl_qmap_count() check there. > The case of moving a port in the middle of a sweep can be easily > detected if instead of reporting an error a second > check of the original DR where the same GUID was found is performed... Do you mean to resend NodeInfo request to the original location? Assuming so, I guess it should be instead of second heavy sweep, and it is a good idea. The only small downside of this I can see is potential timeouts (and discovery slowdown). But anyway it is much better then fatal error. Thanks! Sasha
From rdreier at cisco.com Sat Jul 28 20:09:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 28 Jul 2007 20:09:14 -0700 Subject: [ofa-general] Re: [PATCH] amso1100: QP init bug in amso driver References: <1185305512.20489.6.camel@trinity.ogc.int> Message-ID: thanks, applied. From rdreier at cisco.com Sat Jul 28 20:34:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 28 Jul 2007 20:34:14 -0700 Subject: [ofa-general] Re: [PATCH] mlx4: fix double-kfree in mlx4_mr_alloc error flow References: <200707261116.58679.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From rdreier at cisco.com Sat Jul 28 20:39:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 28 Jul 2007 20:39:14 -0700 Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings "externs should be avoided in .c files" References: <200707271254.51055.hnguyen@linux.vnet.ibm.com> Message-ID: the patch looks fine except your mailer seems to have mangled it... can you resend so I can apply it? thanks...
From rdreier at cisco.com Sat Jul 28 20:39:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 28 Jul 2007 20:39:14 -0700 Subject: [ofa-general] Re: [PATCH 2/2] ehca: correction include order according kernel coding style References: <200707271255.19456.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, I applied this by hand since it was so trivial. From kliteyn at mellanox.co.il Sat Jul 28 21:38:00 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 29 Jul 2007 07:38:00 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-29:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Wed_Jul_25_02:46:48_2007 [d06c318cb50ddddf55b20a5e896d2d22d7b90948] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=560 Pass=504 Fail=56 Pass: 42 Stability IS1-16.topo 42 OsmTest IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo 14 FatTree merge-roots-4-ary-2-tree.topo 14 FatTree merge-root-4-ary-3-tree.topo 14 FatTree gnu-stallion-64.topo 14 FatTree blend-4-ary-2-tree.topo 14 FatTree RhinoDDR.topo 14 FatTree FullGnu.topo 14 FatTree 4-ary-2-tree.topo 14 FatTree 2-ary-4-tree.topo 14 FatTree 12-node-spaced.topo 14 FTreeFail 4-ary-2-tree-missing-sw-link.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo Failures: 42 Pkey IS1-16.topo 14 Pkey IS3-128.topo From rdreier at cisco.com Sat Jul 28 21:39:14 2007 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 28 Jul 2007 21:39:14 -0700 Subject: [ofa-general] Bug in inline sends with sge_num > 0 in libmlx4 References: <20070724121440.GA2775@minantech.com> Message-ID: thanks, 
good catch. applied. From dotanb at dev.mellanox.co.il Sat Jul 28 23:53:36 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 29 Jul 2007 09:53:36 +0300 Subject: [ofa-general] [PATCH] perftest: Fix deleting the utils files in "make clean" Message-ID: <200707290953.36390.dotanb@dev.mellanox.co.il> Fix deleting the utils files in "make clean". Signed-off-by: Dotan Barak --- Index: connectx_user/src/userspace/perftest/Makefile =================================================================== --- connectx_user.orig/src/userspace/perftest/Makefile 2007-07-26 08:02:02.000000000 +0300 +++ connectx_user/src/userspace/perftest/Makefile 2007-07-29 09:38:40.000000000 +0300 @@ -16,6 +16,6 @@ ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@ clean: $(foreach fname,${TESTS}, rm -f ib_${fname}) - rm -f ${UTILS} + $(foreach fname,${UTILS}, rm -f ib_${fname}) .DELETE_ON_ERROR: .PHONY: all clean From ogerlitz at voltaire.com Sun Jul 29 01:25:09 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 29 Jul 2007 11:25:09 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070726172619.GA5208@mellanox.co.il> References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726172619.GA5208@mellanox.co.il> Message-ID: <46AC4EE5.3080806@voltaire.com> Michael S. Tsirkin wrote: >> I believe a better solution is for everyone to use cached records, if >> they exist, with a feedback mechanism from the CM that removes paths on >> a connection failure or path migration event. > Ack timeout on an RC QP is also a good indication we should redo the lookup. I am not following you. 
The only way I know for IB SW to become aware of an ACK timeout is if the RC QP retries are set to zero (actually, even with this scheme the QP can move to the error state for some other reason). Indeed IPoIB-CM uses RC and uses this zero-retries setting, but we have already agreed that moving forward, the IPoIB-CM default IB transport would change to UC, didn't we? Or. From ogerlitz at voltaire.com Sun Jul 29 01:27:08 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 29 Jul 2007 11:27:08 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070726174700.GB5208@mellanox.co.il> References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726172619.GA5208@mellanox.co.il> <46A8DBED.40808@ichips.intel.com> <20070726174700.GB5208@mellanox.co.il> Message-ID: <46AC4F5C.6030605@voltaire.com> Michael S. Tsirkin wrote: >> Do you know if we get a specific event for this? (I don't remember.) > CQE with error IIRC. OK, got you, CQE with "retries exceeded", sorry. Or.
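As a rough illustration of the feedback idea discussed in this thread (use cached path records, but drop one when the CM reports a connection failure or path migration, or when an RC send completes with a "retries exceeded" error), here is a minimal Python sketch. It is not the local SA or IPoIB code: the class, the event names, the GID string and the fake SA query are all hypothetical and only model the caching logic.

```python
# Hypothetical sketch: a path record (PR) cache with failure feedback.
# None of these names come from the kernel sources; they only model the idea.
from dataclasses import dataclass

# Events assumed to make a cached path record untrustworthy.
INVALIDATING_EVENTS = {"cm_connect_failure", "path_migration", "retry_exceeded"}


@dataclass(frozen=True)
class PathRecord:
    dgid: str   # destination GID, used as the cache key
    dlid: int   # destination LID returned by the (fake) SA
    sl: int     # service level


class PathCache:
    def __init__(self, query_sa):
        self._query_sa = query_sa   # callable: dgid -> PathRecord (the slow SA query)
        self._records = {}
        self.sa_queries = 0         # number of times the SA had to be asked

    def lookup(self, dgid):
        """Return the cached record, querying the SA only on a miss."""
        record = self._records.get(dgid)
        if record is None:
            record = self._query_sa(dgid)
            self.sa_queries += 1
            self._records[dgid] = record
        return record

    def report_event(self, dgid, event):
        """Feedback hook: drop the record if the event suggests it is stale."""
        if event in INVALIDATING_EVENTS:
            self._records.pop(dgid, None)


def fake_sa(dgid):
    # Stand-in for a real SA PathRecord query; the values are made up.
    return PathRecord(dgid=dgid, dlid=0x10, sl=0)


cache = PathCache(fake_sa)
cache.lookup("fe80::1")                          # miss: goes to the SA
cache.lookup("fe80::1")                          # hit: served from the cache
cache.report_event("fe80::1", "retry_exceeded")  # e.g. an RC send completed in error
cache.lookup("fe80::1")                          # miss again after invalidation
print(cache.sa_queries)                          # prints 2
```

The point of the sketch is only that the SA is queried again after negative feedback, not on every connection. As noted in the thread, it is unclear what the equivalent feedback source would be for UD traffic.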
From ogerlitz at voltaire.com Sun Jul 29 01:32:27 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 29 Jul 2007 11:32:27 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070726181132.GO19768@obsidianresearch.com> References: <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726181132.GO19768@obsidianresearch.com> Message-ID: <46AC509B.6020206@voltaire.com> Jason Gunthorpe wrote: > The existing trap monitoring in Sean's module covers about 90% of the > cases in IB when you need to invalidate a PR, the last 10% will need > something new :( Let it be. Do you think the last 10% should not prevent the local sa merge to the upstream code? Or. From ogerlitz at voltaire.com Sun Jul 29 01:27:32 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 29 Jul 2007 11:27:32 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46A8D80C.1090305@ichips.intel.com> References: <46A2F696.4060007@voltaire.com> <46A46637.3080104@voltaire.com> <20070723083020.GD20614@mellanox.co.il> <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> Message-ID: <46AC4F74.2000904@voltaire.com> Sean Hefty wrote: > Administrators can enable or disable the cache. I don't believe that > individual applications should be able to override the administrator, > nor do I think we gain anything by having per application settings. This > is similar to exposing to applications whether they want to use cached > ARP information every time they connect. 
Applications --can-- delete the network stack neighbour entry before performing a given action. >> For example, I think it would be correct for IB block and file I/O >> ULPs (iSER, SRP, Lustre, rNFS, etc) to request non cached PR, as their >> connecting model is not all-to-all but rather n-to-m (n clients to m >> servers with m << n), the connections are long-lived (hours, days, >> weeks, more) and a connection failure as of PR caching does not seem >> acceptable. > I believe a better solution is for everyone to use cached records, if > they exist, with a feedback mechanism from the CM that removes paths on > a connection failure or path migration event. That's an interesting point. What's the conceptual difference between a CM connection failure caused by a "wrong" PR and a failure of a --unicast-- ARP probe initiated by the network stack? CM feedback to the local sa seems a correct approach to me; however, I don't see the equivalent for UD communication. > With all to all connections over the rdma cm, the first thing that needs > to be done is resolve the remote addresses to GIDs. This causes an ARP > storm, followed by an SA storm caused by IPoIB, followed by a second SA > storm caused by the rdma cm. For scalability, we need to remove both of > these SA storms, not just the second. We don't see the first SA storm > today because IPoIB caches PRs. Let's not add it. Restricting caching > to the rdma cm, but removing it from IPoIB leaves us with the same > issues that we have today. Again, a typical I/O client-server scheme is n-to-m where m is small (1, 2, say up to a few tens). The PRs are needed by IPoIB only at the --passive-- side, which sends the unicast ARP reply. So when n=1024 and m=4 the SA would need to serve 4096 PRs, which is about one fourth of the queries/second rate you have reported in earlier threads on the matter. Or. 
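The arithmetic behind the n=1024, m=4 figure can be checked with a small sketch. The function names are hypothetical, and the all-to-all count (one PR per ordered pair of distinct nodes) is an assumption added here for comparison, not a number taken from the thread.

```c
/* PR queries when each of n clients connects to each of m servers. */
unsigned long pr_queries_client_server(unsigned long n, unsigned long m)
{
	return n * m;
}

/*
 * PR queries when each of n nodes resolves a path to every other node
 * (assumed model: one query per ordered pair of distinct nodes).
 */
unsigned long pr_queries_all_to_all(unsigned long n)
{
	return n * (n - 1);
}
```

With n=1024 and m=4 the first function gives 4096, matching the figure above; the assumed all-to-all model for the same n gives 1047552, which is the kind of load the caching discussion is trying to avoid.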
From vlad at lists.openfabrics.org Sun Jul 29 01:43:24 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 29 Jul 2007 01:43:24 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070729-0100 daily build status Message-ID: <20070729084325.2EDD2E6085B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ia64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.12 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.12 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with 
linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: From eitan at mellanox.co.il Sun Jul 29 02:00:33 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 29 Jul 2007 12:00:33 +0300 Subject: [ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <20070728221540.GI12351@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901F7512E@mtlexch01.mtl.com> <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> <20070727010707.GR2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com> <20070728221540.GI12351@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com> > On 14:27 Fri 27 Jul , Eitan Zahavi wrote: > > The problem I have with back-to-back plug is that it is a > fatal case > > if found in a case where there was no use of this plug. > > So we will need some sort of user input if it is OK or not. > > Ok, and let's add cl_qmap_count() check there. Not following you. > > > The case of moving a port in the middle of a sweep can be easily > > detected if instead of reporting an error a second check of the > > original DR where the same GUID was found is performed... > > Do you mean to resend NodeInfo request to the original location? > Assuming so, I guess it should be instead of second heavy > sweep, and it is a good idea. 
The only small downside of this > I can see is potential timeouts (and discovery slowdown). But > anyway it is much better then fatal error. Thanks! So we are in line on this one. Instead of changing the order of things, we could generate a list of DRs that are to be re-scanned during drop-mgr and then abort only if they really are duplicates. > > Sasha > From eitan at mellanox.co.il Sun Jul 29 02:11:05 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 29 Jul 2007 12:11:05 +0300 Subject: [ofa-general] RE: pkey.sim.tcl In-Reply-To: <20070728215527.GH12351@sashak.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901ED61BC@mtlexch01.mtl.com> <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com> <20070726224133.GC2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com> <20070728215527.GH12351@sashak.voltaire.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com> Regarding the test: Once I know the exact condition causing a full re-sweep I will use it in the test. In OFED 1.2 it was enough to set one switch ChangeBit to force a full reconfiguration. Regarding incremental flow in general: 1. Yes - it is good. 2. But we must make sure it is robust enough that we do not lose some nodes or functionality under extreme cases of reboot or HW errors. 3. We should have a way to force a full sweep without killing the SM: As the size of the clusters grows there is a growing chance that "soft errors" will hit the devices. Most of the device memory is guarded, and errors there would be detected automatically. However, I think it is wise to allow the user to force a full reconfiguration without making the SM "go away". 
Regarding OpenSM does not respond to SA queries during sweep: It is due to the fact there is no "double buffer" for the internal DB. So whenever the SM starts a sweep the SA will see an "empty" DB. The solution for that problem may be having a "previous" DB during sweeps. I suspect using that approach will also enable a fine grain incremental capability too. Eitan Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Sunday, July 29, 2007 12:55 AM > To: Eitan Zahavi > Cc: Yevgeny Kliteynik; Hal Rosenstock; general at lists.openfabrics.org > Subject: Re: pkey.sim.tcl > > Hi Eitan, > > On 07:56 Fri 27 Jul , Eitan Zahavi wrote: > > > > > > On 09:26 Thu 26 Jul , Eitan Zahavi wrote: > > > > > > > > I am happy you actually use the simulator. > > > > Please provide more info regarding the failure. You should tar > > > > compress the /tmp/ibmgtsim.XXXX of your run. > > > > > > I can send this for you if you want, but the failure is trivial. > > No need if you already know where the bug is... > > > > > > Yes, and it is due (6), where default Pkey is removed > "externally". > > > I'm not sure that OpenSM should handle the case when pkey > table is > > > modified externally by something which is not SM. > > > > > > > For a few years it just worked fine. So I wonder why this > > fucntionality was removed ? > > It is a real BAD case where Pkeys are altered but I think would be > > wise to "refresh" these tables on heavy seep. > > We discussed how and when port tables refresh should be done > just few days ago in this thread. My impression was that we > are "in sync" about this. > > > In general it seems OpenSM has lost its "heavy sweep" > concept. Now it > > does not refresh the fabric setup even on heavy sweep. 
> > Not on each heavy sweep, but it does when it needed or when > data could change. I don't think the concept was changed, > just optimized. Let just look at the numbers: > > $ time ./opensm/opensm -e -f ./osm.log -o ... > SUBNET UP > Exiting SM > > real 0m7.995s > user 0m4.488s > sys 0m6.072s > > $ time ./opensm/opensm -e -f ./osm.log -o --qos ... > SUBNET UP > Exiting SM > > real 0m22.521s > user 0m10.921s > sys 0m17.173s > > > This is simulated runs (with ibsim), the fabric is ~1300 nodes. > > The difference there is '--qos' flag, so OpenSM skips SL2VL > and VLArb update in first run and does it in the second - > sweep times are 8 against 22 seconds. > > > This is assuming a "perfect" HW and software and I would > really this > > we should have preserved that capability. > > What about an option? Now with subn->need_update flag (which > always enforces updates) it is trivial to implement. > > > Note that a "heavy sweep" does not happen unless somethng > changed or > > trapped. > > Yes, for example some port was connected/disconnected, some > node rebooted, etc.. OpenSM starts huge heavy sweep, it takes > a while, SA is not responsive most the time, TCP connection > over IPoIB timeouted, applications failed. This is production > experiences... 
:( > > Sasha > From vlad at lists.openfabrics.org Sun Jul 29 02:49:55 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Sun, 29 Jul 2007 02:49:55 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070729-0200 daily build status Message-ID: <20070729094955.64799E608FE@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on ia64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.17 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.12 Passed on x86_64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.15 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ppc64 with 
linux-2.6.16 Passed on x86_64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.19 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From dotanb at dev.mellanox.co.il Sun Jul 29 03:26:36 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 29 Jul 2007 13:26:36 +0300 Subject: [ofa-general] Re: I think that there is a resource leak in the core file mad_rmpp.c In-Reply-To: <46A4F5A2.2020508@ichips.intel.com> References: <46A45A8C.2090800@dev.mellanox.co.il> <46A4F5A2.2020508@ichips.intel.com> Message-ID: <46AC6B5C.6020702@dev.mellanox.co.il> Sean Hefty wrote: >> I reviewed the file mad_rmpp.c and it seems that there is a leak of >> the Address Handle. >> The AH that is being created in the function "alloc_response_msg" is >> never being destroyed. > > The AH is destroyed in ib_rmpp_send_handler(). I checked this issue again and I added the following prints: the AH handler which is being created in alloc_response_msg() the AH handler which is being destroyed in ib_rmpp_send_handler() It seems that the AHs which are being created in alloc_response_msg() (which is being called from ack_ds_ack()) are not being destroyed because the rmpp_type of this packet is IB_MGMT_RMPP_TYPE_ACK, so the destroy AH is not being executed. 
We saw this issue in our daily regression during the osmtest, here are the reproduction instructions: Start the openSM (during the following commands, the openSM needs to be online) execute: # osmtest -f c execute: # osmtest -f a during the test, the AHs which were mentioned above will be created. thanks Dotan From dotanb at dev.mellanox.co.il Sun Jul 29 03:32:54 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 29 Jul 2007 13:32:54 +0300 Subject: [ofa-general] [PATCH-v2] perftest: Fix deleting the utils files in "make clean" Message-ID: <200707291332.54356.dotanb@dev.mellanox.co.il> Fix deleting the utils files in "make clean". Signed-off-by: Dotan Barak --- diff --git a/Makefile b/Makefile index 812de14..8042531 100644 --- a/Makefile +++ b/Makefile @@ -15,7 +15,6 @@ ${TESTS}: LOADLIBES += -libverbs -lrdmacm ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@ clean: - $(foreach fname,${TESTS}, rm -f ib_${fname}) - rm -f ${UTILS} + $(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname}) .DELETE_ON_ERROR: .PHONY: all clean From hal.rosenstock at gmail.com Sun Jul 29 04:09:43 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 29 Jul 2007 07:09:43 -0400 Subject: [ofa-general] Re: ibutils/ibdm building In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F75B5A@mtlexch01.mtl.com> Message-ID: On 7/28/07, Eitan Zahavi wrote: > Hi Hal > > This does not reproduce on my fresh clone. > So I am not sure what is going on. 
Your latest change fixed this: commit a39dcb1db6f0559ca95df2948675fda0222f1532 Author: Eitan Zahavi Date: Sat Jul 28 23:44:28 2007 +0300 The target i sno ibis.o but ibis.lo diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index 27f0652..2018c66 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -89,7 +89,7 @@ SWIG_IFC_FILES= $(srcdir)/ibbbm.i \ $(srcdir)/ibsm.i \ $(srcdir)/ibvs.i -ibis.o: $(srcdir)/git_version.h +ibis.lo: $(srcdir)/git_version.h # track latest GIT version for this tree: $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > > Sent: Friday, July 27, 2007 10:33 PM > > To: Eitan Zahavi > > Cc: OpenFabrics General > > Subject: ibutils/ibdm building > > > > Hi again Eitan, > > > > When building ibutils/ibdm (master), I get the following error: > > > > if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../datamodel -g -O2 -MT > > osm_check.o -MD -MP -MF ".deps/osm_check.Tpo" -c -o > > osm_check.o osm_check.cpp; \ then mv -f ".deps/osm_check.Tpo" > > ".deps/osm_check.Po"; else rm -f ".deps/osm_check.Tpo"; exit 1; fi > > osm_check.cpp: In function `int main(int, char**)': > > osm_check.cpp:428: `R_OK' undeclared (first use this function) > > osm_check.cpp:428: (Each undeclared identifier is reported > > only once for each > > function it appears in.) > > osm_check.cpp:428: `access' undeclared (first use this function) > > > > Thanks. > > > > -- Hal > > > From mst at dev.mellanox.co.il Sun Jul 29 04:17:34 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 29 Jul 2007 14:17:34 +0300 Subject: [ofa-general] Re: [PATCH-v2] perftest: Fix deleting the utils files in "make clean" In-Reply-To: <200707291332.54356.dotanb@dev.mellanox.co.il> References: <200707291332.54356.dotanb@dev.mellanox.co.il> Message-ID: <20070729111734.GA16915@mellanox.co.il> thanks, applied Quoting Dotan Barak : Subject: [PATCH-v2] perftest: Fix deleting the utils files in "make clean" Fix deleting the utils files in "make clean". Signed-off-by: Dotan Barak --- diff --git a/Makefile b/Makefile index 812de14..8042531 100644 --- a/Makefile +++ b/Makefile @@ -15,7 +15,6 @@ ${TESTS}: LOADLIBES += -libverbs -lrdmacm ${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@ clean: - $(foreach fname,${TESTS}, rm -f ib_${fname}) - rm -f ${UTILS} + $(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname}) .DELETE_ON_ERROR: .PHONY: all clean -- MST From hal.rosenstock at gmail.com Sun Jul 29 04:27:31 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 29 Jul 2007 07:27:31 -0400 Subject: [ofa-general] [PATCH] mad.c: Fix memory leak in switch handling and improve error handling Message-ID: mad.c: Fix memory leak in switch handling and improve error handling Signed-off-by: Suresh Shelvapille Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index bc547f1..6310dc3 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1847,11 +1847,6 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_mad_agent_private *mad_agent; int port_num; - response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); - if (!response) - printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " - "for response buffer\n"); - mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1879,6 
+1874,13 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + goto out; + } + if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) port_num = wc->port_num; else @@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, response->header.recv_wc.recv_buf.mad = &response->mad.mad; response->header.recv_wc.recv_buf.grh = &response->grh; - if (!agent_send_response(&response->mad.mad, - &response->grh, wc, - port_priv->device, - smi_get_fwd_port(&recv->mad.smp), - qp_info->qp->qp_num)) - response = NULL; + agent_send_response(&response->mad.mad, + &response->grh, wc, + port_priv->device, + smi_get_fwd_port(&recv->mad.smp), + qp_info->qp->qp_num); goto out; } @@ -1930,15 +1931,6 @@ local: if (port_priv->device->process_mad) { int ret; - if (!response) { - printk(KERN_ERR PFX "No memory for response MAD\n"); - /* - * Is it better to assume that - * it wouldn't be processed ? - */ - goto out; - } - ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc, &recv->grh, From naim.hammond at gmail.com Sun Jul 29 04:29:56 2007 From: naim.hammond at gmail.com (Naim Hammond) Date: Sun, 29 Jul 2007 14:29:56 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: <20070727083438.GA9912@mellanox.co.il> References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> Message-ID: Where is the list of supported distributions? Where can I see it? Thanks On 7/27/07, Michael S. Tsirkin wrote: > > > Quoting Yoshiaki Tamura : > > Subject: OFED-1.2 on x86 debian > > > > Hi. > > > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. 
> > Although build_env.sh seems to work on debian, > > it fails compiling both kernel modules and user land tools by rpmbuild. > > > > Is OFED-1.2 tested on debian or totally unsupported? > > It's not on a list of supported platforms, but I think we do builds > on ubuntu so debian should work too. Vlad? > > -- > MST > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at mellanox.co.il Sun Jul 29 05:48:09 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 29 Jul 2007 15:48:09 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> Message-ID: <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> Hi, See OFED-1.2/docs/OFED_release_notes.txt: 1.2 Supported Platforms and Operating Systems --------------------------------------------- o CPU architectures: - x86_64 - x86 - ppc64 - ia64 o Linux Operating Systems: - RedHat EL4 up3: 2.6.9-34.ELsmp - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL5: 2.6.18-8.el5 - SLES10: 2.6.16.21-0.8-smp - kernel.org: 2.6.20.x - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested) OFED-1.2 use RPM environment for installation. You can't use OFED installation script as is on Debian. Regards, Vladimir From: Naim Hammond [mailto:naim.hammond at gmail.com] Sent: Sunday, July 29, 2007 2:30 PM To: Michael S. Tsirkin Cc: Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian Where is the list of supported distributions? Where can I see it? Thanks On 7/27/07, Michael S. 
Tsirkin < mst at dev.mellanox.co.il > wrote: > Quoting Yoshiaki Tamura < tamura at osrg.net >: > Subject: OFED-1.2 on x86 debian > > Hi. > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. > Although build_env.sh seems to work on debian, > it fails compiling both kernel modules and user land tools by rpmbuild. > > Is OFED-1.2 tested on debian or totally unsupported? It's not on a list of supported platforms, but I think we do builds on ubuntu so debian should work too. Vlad? -- MST From mst at dev.mellanox.co.il Sun Jul 29 07:04:31 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 29 Jul 2007 17:04:31 +0300 Subject: [ofa-general] RFC: SRC API Message-ID: <20070729140431.GG16915@mellanox.co.il> Hello! Here is an API proposal for support of the SRC (scalable reliable connected) protocol extension in libibverbs. This adds APIs to: - manage SRC domains - share SRC domains between processes, by means of creating a 1:1 association between an SRC domain and a file. Notes: - The file is specified by means of a file descriptor, this makes it possible for the user to manage file creation/deletion in the most flexible manner (e.g. tmpfile can be used). 
- I envision implementing this sharing mechanism in kernel by means of a per-device tree, with inode as a key and domain object as a value. Please comment. Signed-off-by: Michael S. Tsirkin --- diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index acc1b82..503f201 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -370,6 +370,11 @@ struct ibv_ah_attr { uint8_t port_num; }; +struct ibv_src_domain { + struct ibv_context *context; + uint32_t handle; +}; + enum ibv_srq_attr_mask { IBV_SRQ_MAX_WR = 1 << 0, IBV_SRQ_LIMIT = 1 << 1 @@ -389,7 +394,8 @@ struct ibv_srq_init_attr { enum ibv_qp_type { IBV_QPT_RC = 2, IBV_QPT_UC, - IBV_QPT_UD + IBV_QPT_UD, + IBV_QPT_SRC }; struct ibv_qp_cap { @@ -408,6 +414,7 @@ struct ibv_qp_init_attr { struct ibv_qp_cap cap; enum ibv_qp_type qp_type; int sq_sig_all; + struct ibv_src_domain *src_domain; }; enum ibv_qp_attr_mask { @@ -526,6 +533,7 @@ struct ibv_send_wr { uint32_t remote_qkey; } ud; } wr; + uint32_t src_remote_srq_num; }; struct ibv_recv_wr { @@ -553,6 +561,10 @@ struct ibv_srq { pthread_mutex_t mutex; pthread_cond_t cond; uint32_t events_completed; + + uint32_t src_srq_num; + struct ibv_src_domain *src_domain; + struct ibv_cq *src_cq; }; struct ibv_qp { @@ -570,6 +582,8 @@ struct ibv_qp { pthread_mutex_t mutex; pthread_cond_t cond; uint32_t events_completed; + + struct ibv_src_domain *src_domain; }; struct ibv_comp_channel { @@ -912,6 +926,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr); /** + * ibv_create_src_srq - Creates a SRQ associated with the specified protection + * domain and src domain. + * @pd: The protection domain associated with the SRQ. + * @src_domain: The SRC domain associated with the SRQ. + * @src_cq: CQ to report completions for SRC packets on. + * + * @srq_init_attr: A list of initial attributes required to create the SRQ. 
+ * + * srq_attr->max_wr and srq_attr->max_sge are read to determine the + * requested size of the SRQ, and set to the actual values allocated + * on return. If ibv_create_src_srq() succeeds, then max_wr and max_sge + * will always be at least as large as the requested values. + */ +struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd, + struct ibv_src_domain *src_domain, + struct ibv_cq *src_cq, + struct ibv_srq_init_attr *srq_init_attr); + +/** * ibv_modify_srq - Modifies the attributes for the specified SRQ. * @srq: The SRQ to modify. * @srq_attr: On input, specifies the SRQ attributes to modify. On output, @@ -1074,6 +1107,44 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); */ int ibv_fork_init(void); +/** + * ibv_get_new_src_domain - Allocate an SRC domain + * Returns a reference to an SRC domain. + * Use ibv_put_src_domain to free the reference. + * @context: Device context + */ +struct ibv_src_domain *ibv_get_new_src_domain(struct ibv_context *context); + +/** + * ibv_share_src_domain - associate the src domain with a file. + * Establishes a connection between an SRC domain object and a file descriptor. + * + * @d: SRC domain to share + * @fd: descriptor for a file to associate with the domain + */ +int ibv_share_src_domain(struct ibv_src_domain *d, int fd); + +/** + * ibv_unshare_src_domain - disassociate the src domain from a file. + * Subsequent calls to ibv_get_shared_src_domain will fail. + * @d: SRC domain to unshare + */ +int ibv_unshare_src_domain(struct ibv_src_domain *d); + +/** + * ibv_get_shared_src_domain - get a reference to shared SRC domain + * @context: Device context + * @fd: descriptor for a file associated with the domain + */ +struct ibv_src_domain *ibv_get_shared_src_domain(struct ibv_context *context, int fd); + +/** + * ibv_put_src_domain - destroy a reference to an SRC domain + * If this is the last reference, destroys the domain. 
+ * @d: reference to SRC domain to put + */ +int ibv_put_src_domain(struct ibv_src_domain *d); + END_C_DECLS # undef __attribute_const -- MST From naim.hammond at gmail.com Sun Jul 29 08:01:00 2007 From: naim.hammond at gmail.com (Naim Hammond) Date: Sun, 29 Jul 2007 18:01:00 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> Message-ID: So OFED is not supported on any free distribution. You did mention it is tested on Ubuntu, but you weren't sure. is it? N On 7/29/07, Vladimir Sokolovsky wrote: > > Hi, > > See OFED-1.2/docs/OFED_release_notes.txt: > > > > 1.2 Supported Platforms and Operating Systems > > --------------------------------------------- > > o CPU architectures: > > - x86_64 > > - x86 > > - ppc64 > > - ia64 > > > > o Linux Operating Systems: > > - RedHat EL4 up3: 2.6.9-34.ELsmp > > - RedHat EL4 up4: 2.6.9-42.ELsmp > > - RedHat EL4 up5: 2.6.9-55.ELsmp > > - RedHat EL5: 2.6.18-8.el5 > > - SLES10: 2.6.16.21-0.8-smp > > - kernel.org: 2.6.20.x > > - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested) > > * * > > OFED-1.2 use RPM environment for installation. You can't use OFED > installation script as is on Debian. > > * * > > *Regards,* > > *Vladimir* > > > > *From:* Naim Hammond [mailto:naim.hammond at gmail.com] > *Sent:* Sunday, July 29, 2007 2:30 PM > *To:* Michael S. Tsirkin > *Cc:* Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org > *Subject:* Re: [ofa-general] Re: OFED-1.2 on x86 debian > > > > Where is the list of supported distributions? > Where can I see it? > > Thanks > > On 7/27/07, *Michael S. Tsirkin* < mst at dev.mellanox.co.il> wrote: > > > Quoting Yoshiaki Tamura < tamura at osrg.net>: > > Subject: OFED-1.2 on x86 debian > > > > Hi. > > > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. 
> > Although build_env.sh seems to work on debian, > > it fails compiling both kernel modules and user land tools by rpmbuild. > > > > Is OFED-1.2 tested on debian or totally unsupported? > > It's not on a list of supported platforms, but I think we do builds > on ubuntu so debian should work too. Vlad? > > -- > MST > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Sun Jul 29 09:47:47 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 29 Jul 2007 19:47:47 +0300 Subject: [ofa-general] Re: pkey.sim.tcl In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com> References: <20070724005153.GD11674@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901ED620B@mtlexch01.mtl.com> <20070724170432.GZ27878@sashak.voltaire.com> <20070724215441.GA25264@sashak.voltaire.com> <20070725202418.GD31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75674@mtlexch01.mtl.com> <20070726224133.GC2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75A8F@mtlexch01.mtl.com> <20070728215527.GH12351@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75D17@mtlexch01.mtl.com> Message-ID: <20070729164747.GB29844@sashak.voltaire.com> Hi Eitan, On 12:11 Sun 29 Jul , Eitan Zahavi wrote: > Regarding the test : > Once I will know the exact condition causing a full re-sweep I would use > it in the test. > In OFED 1.2 it was enough to set one switch ChangeBit to force a full > reconfiguration. You can set PortState where pkey table was modified to INIT and this will trigger update. > Regarding incremental flow in general: > 1. Yes - it is good. Ok. > 2. 
But we must make sure it is robust enough that we do not loose some > nodes or functionality > under extreme cases of reboot or HW errors. Testing reports are welcomed (as usual). I'm testing too. > 3. We should have a way to force a full sweep without killing the SM: > As the size of the clusters grow there is a growing chance that "soft > errors" will hit the devices. > Most of the device memory is guarded and would be auto detected if > affected. > However I think it is wise to allow for the user to force full > reconfiguration without making the SM "go away". We can add config option to force update unconditionally. Would it be sufficient? > Regarding OpenSM does not respond to SA queries during sweep: > It is due to the fact there is no "double buffer" for the internal DB. > So whenever the SM starts a sweep the SA will see an "empty" DB. Specific problem was due to fact that OpenSM DB is in "locked" state most of the time during sweep and SA is waiting to get access. > The solution for that problem may be having a "previous" DB during > sweeps. > I suspect using that approach will also enable a fine grain incremental > capability too. I agree, this could be good direction too. As well as some others like more granular locking etc.. Sasha From mst at dev.mellanox.co.il Sun Jul 29 09:46:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Sun, 29 Jul 2007 19:46:32 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> Message-ID: <20070729164632.GA28212@mellanox.co.il> > Quoting Naim Hammond : > Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian > > So OFED is not supported on any free distribution. > > You did mention it is tested on Ubuntu, but you weren't sure. is it? Note that support in this context means whether OFED was tested on the distro, not whether it builds/works. 
-- MST _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/ openib-general -- MST From sashak at voltaire.com Sun Jul 29 09:52:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 29 Jul 2007 19:52:20 +0300 Subject: [ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901F75131@mtlexch01.mtl.com> <20070725001847.GG25264@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F7558A@mtlexch01.mtl.com> <20070725194856.GB31582@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75673@mtlexch01.mtl.com> <20070727010707.GR2472@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75ABE@mtlexch01.mtl.com> <20070728221540.GI12351@sashak.voltaire.com> <6C2C79E72C305246B504CBA17B5500C901F75CF6@mtlexch01.mtl.com> Message-ID: <20070729165220.GC29844@sashak.voltaire.com> On 12:00 Sun 29 Jul , Eitan Zahavi wrote: > > On 14:27 Fri 27 Jul , Eitan Zahavi wrote: > > > The problem I have with back-to-back plug is that it is a > > fatal case > > > if found in a case where there was no use of this plug. > > > So we will need some sort of user input if it is OK or not. > > > > Ok, and let's add cl_qmap_count() check there. > Not following you. With back-to-back network cl_qmap_count(&sw_guid_tbl) should be 0. > > > The case of moving a port in the middle of a sweep can be easily > > > detected if instead of reporting an error a second check of the > > > original DR where the same GUID was found is performed... > > > > Do you mean to resend NodeInfo request to the original location? > > Assuming so, I guess it should be instead of second heavy > > sweep, and it is a good idea. 
The only small downside of this > > I can see is potential timeouts (and discovery slowdown). But > > anyway it is much better then fatal error. Thanks! > > So we are inline with this one . > Instead of changing the order of things we could generate list of DR's > that are to be re-scanned > during drop-mgr and then abort if really dulicates. I will need to look at code... Sasha From jgunthorpe at obsidianresearch.com Sun Jul 29 10:32:32 2007 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Sun, 29 Jul 2007 11:32:32 -0600 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46AC509B.6020206@voltaire.com> References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726181132.GO19768@obsidianresearch.com> <46AC509B.6020206@voltaire.com> Message-ID: <20070729173232.GA14867@obsidianresearch.com> On Sun, Jul 29, 2007 at 11:32:27AM +0300, Or Gerlitz wrote: > Jason Gunthorpe wrote: > >The existing trap monitoring in Sean's module covers about 90% of the > >cases in IB when you need to invalidate a PR, the last 10% will need > >something new :( > > Let it be. Do you think the last 10% should not prevent the local sa > merge to the upstream code? Only that the design philosophy should accommodate an eventual solution to this remaining problem. Mainly, as I've said, I'd like to see more stuff in userspace and a simple well defined kernel component. What about you? Your arguments about linking arp lifetime to PR cache lifetime are trying to address this very same 10%. Jason From swise at opengridcomputing.com Sun Jul 29 13:12:26 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 29 Jul 2007 15:12:26 -0500 Subject: [ofa-general] [PATCH 2.6.23 1/2] Make the iw_cxgb3 module parameters writable. 
Message-ID: <20070729201226.31659.85900.stgit@dell3.ogc.int>

Make the iw_cxgb3 module parameters writable.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 9574088..fa95dce 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -63,37 +63,37 @@ static char *states[] = {
 };
 
 static int ep_timeout_secs = 10;
-module_param(ep_timeout_secs, int, 0444);
+module_param(ep_timeout_secs, int, 0644);
 MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
 				   "in seconds (default=10)");
 
 static int mpa_rev = 1;
-module_param(mpa_rev, int, 0444);
+module_param(mpa_rev, int, 0644);
 MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
 		 "1 is spec compliant. (default=1)");
 
 static int markers_enabled = 0;
-module_param(markers_enabled, int, 0444);
+module_param(markers_enabled, int, 0644);
 MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
 
 static int crc_enabled = 1;
-module_param(crc_enabled, int, 0444);
+module_param(crc_enabled, int, 0644);
 MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
 
 static int rcv_win = 256 * 1024;
-module_param(rcv_win, int, 0444);
+module_param(rcv_win, int, 0644);
 MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=256)");
 
 static int snd_win = 32 * 1024;
-module_param(snd_win, int, 0444);
+module_param(snd_win, int, 0644);
 MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=32KB)");
 
 static unsigned int nocong = 0;
-module_param(nocong, uint, 0444);
+module_param(nocong, uint, 0644);
 MODULE_PARM_DESC(nocong, "Turn off congestion control (default=0)");
 
 static unsigned int cong_flavor = 1;
-module_param(cong_flavor, uint, 0444);
+module_param(cong_flavor, uint, 0644);
 MODULE_PARM_DESC(cong_flavor, "TCP Congestion control flavor (default=1)");
 static void process_work(struct work_struct *work);

From swise at opengridcomputing.com Sun Jul 29 13:12:29 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 29 Jul 2007 15:12:29 -0500
Subject: [ofa-general] [PATCH 2.6.23 2/2] iw_cxgb3: Always call low level send function via cxgb3_ofld_send().
In-Reply-To: <20070729201226.31659.85900.stgit@dell3.ogc.int>
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
Message-ID: <20070729201228.31659.26300.stgit@dell3.ogc.int>

iw_cxgb3: Always call low level send function via cxgb3_ofld_send().

Avoids deadlocks.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index fa95dce..20ba372 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -139,7 +139,7 @@ static void release_tid(struct t3cdev *t
 	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
 	skb->priority = CPL_PRIORITY_SETUP;
-	tdev->send(tdev, skb);
+	cxgb3_ofld_send(tdev, skb);
 	return;
 }
 
@@ -161,7 +161,7 @@ int iwch_quiesce_tid(struct iwch_ep *ep)
 	req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE);
 
 	skb->priority = CPL_PRIORITY_DATA;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -183,7 +183,7 @@ int iwch_resume_tid(struct iwch_ep *ep)
 	req->val = 0;
 
 	skb->priority = CPL_PRIORITY_DATA;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -784,7 +784,7 @@ static int update_rx_credits(struct iwch
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
 	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
 	skb->priority = CPL_PRIORITY_ACK;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return credits;
 }
 
@@ -1152,7 +1152,7 @@ static int listen_start(struct iwch_list
 	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
 
 	skb->priority = 1;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -1186,7 +1186,7 @@ static int listen_stop(struct iwch_liste
 	req->cpu_idx = 0;
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
 	skb->priority = 1;
-	ep->com.tdev->send(ep->com.tdev, skb);
+	cxgb3_ofld_send(ep->com.tdev, skb);
 	return 0;
 }
 
@@ -1264,7 +1264,7 @@ static void reject_cr(struct t3cdev *tde
 		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
 		rpl->opt2 = 0;
 		rpl->rsvd = rpl->opt2;
-		tdev->send(tdev, skb);
+		cxgb3_ofld_send(tdev, skb);
 	}
 }
 
@@ -1557,7 +1557,7 @@ static int peer_abort(struct t3cdev *tde
 	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
 	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
 	rpl->cmd = CPL_ABORT_NO_RST;
-	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	cxgb3_ofld_send(ep->com.tdev, rpl_skb);
 	if (state != ABORTING) {
 		state_set(&ep->com, DEAD);
 		release_ep_resources(ep);

From swise at opengridcomputing.com Sun Jul 29 13:34:37 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 29 Jul 2007 15:34:37 -0500
Subject: [ofa-general] [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Message-ID: <46ACF9DD.1010509@opengridcomputing.com>

RDMA experts,

I'd like input on the patch below. iWARP devices that support both
native stack TCP and iwarp connections on the same interface need the
fix below or some similar enhancement to the rdma cm. This is a bug in
the ofed-1.2 RDMA-CM code as it stands. I propose we fix this for
ofed-1.2.1 or ofed-1.3.

Here is the issue:

Consider an mpi cluster running mvapich2. And the cluster runs
MPI/Sockets jobs concurrently with MPI/RDMA jobs. It is possible,
without the patch below, for MPI/Sockets processes to mistakenly get
incoming RDMA connections and vice versa.
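
The collision described above can be illustrated with a toy model. This
is only a minimal sketch, assuming each port space is a plain table of
in-use flags; `port_space`, `ps_bind`, `rdma_bind_separate` and
`rdma_bind_unified` are hypothetical names invented for illustration and
do not exist in the rdma cm or the kernel. It shows why two independent
tables can both hand out the same port number, and how reserving the
port in the host TCP table first would prevent that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPORTS 65536

/* Toy model of a port space: one in-use flag per port number. */
struct port_space {
	bool used[NPORTS];
};

/* Reserve a port; returns 0 on success, -1 if out of range or in use. */
int ps_bind(struct port_space *ps, int port)
{
	assert(ps != NULL);
	if (port <= 0 || port >= NPORTS || ps->used[port])
		return -1;
	ps->used[port] = true;
	return 0;
}

/* Duplicate port spaces: the RDMA side never consults the host TCP table. */
int rdma_bind_separate(struct port_space *rdma, int port)
{
	return ps_bind(rdma, port);
}

/*
 * Unified port space (hypothetical): reserve the port in the host TCP
 * table first, and only then record it in the RDMA table.
 */
int rdma_bind_unified(struct port_space *tcp, struct port_space *rdma,
		      int port)
{
	if (ps_bind(tcp, port))
		return -1;
	if (ps_bind(rdma, port)) {
		tcp->used[port] = false;	/* undo the host reservation */
		return -1;
	}
	return 0;
}
```

With separate tables, a sockets job calling `ps_bind(&tcp, 5000)` and an
RDMA job calling `rdma_bind_separate(&rdma, 5000)` both succeed, which is
exactly the ambiguous situation. With `rdma_bind_unified` the second bind
is refused and the RDMA job has to retry with another port.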

The way mvapich2 works is that the ranks all bind and listen to a
random port (retrying new random ports if the bind fails with "in
use"). Once they get a free port and bind/listen, they advertise that
port to the peers to do connection setup. Currently, without the patch
below, the mpi/rdma processes can end up binding/listening to the
_same_ port number as the mpi/sockets processes running over the native
tcp stack. This is due to duplicate port spaces for native stack TCP
and the rdma cm's RDMA_PS_TCP port space. If this happens, then the
connections can get screwed up.

The correct solution in my mind is to use the host stack's TCP port
space for _all_ RDMA_PS_TCP port allocations. The patch below is a
minimal delta to unify the port spaces by using the kernel stack to
bind ports. This is done by allocating a kernel socket and binding to
the appropriate local addr/port. It also allows the kernel stack to
pick ephemeral ports by virtue of just passing in port 0 on the kernel
bind operation.

I'd like to discuss this with the RDMA folks first and iron out an
agreement on how this should be implemented, then widen the audience to
lkml/netdev, with a goal of inclusion in 2.6.23 and ofed-1.2.1 or 1.3.

Thanks,

Steve.

-------- Original Message --------
Subject: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Date: Sun, 29 Jul 2007 15:17:04 -0500
From: Steve Wise
To: swise at opengridcomputing.com

RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

This is needed for iwarp providers that support native and rdma
connections over the same interface.
Signed-off-by: Steve Wise
---

 drivers/infiniband/core/cma.c |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
 	struct rdma_cm_id	id;
 
 	struct rdma_bind_list	*bind_list;
+	struct socket		*sock;
 	struct hlist_node	node;
 	struct list_head	list;
 	struct list_head	listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
+	if (id_priv->sock)
+		sock_release(id_priv->sock);
 }
 
 void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
 	return 0;
 }
 
+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+	int ret;
+	struct socket *sock;
+
+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ret)
+		return ret;
+	ret = sock->ops->bind(sock,
+			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+			  ip_addr_size(&id_priv->id.route.addr.src_addr));
+	if (ret) {
+		sock_release(sock);
+		return ret;
+	}
+	id_priv->sock = sock;
+	return 0;
+}
+
 static int cma_get_port(struct rdma_id_private *id_priv)
 {
 	struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
 		break;
 	case RDMA_PS_TCP:
 		ps = &tcp_ps;
+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+		if (ret)
+			goto out;
 		break;
 	case RDMA_PS_UDP:
 		ps = &udp_ps;
@@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
 	else
 		ret = cma_use_port(ps, id_priv);
 	mutex_unlock(&lock);
-
+out:
 	return ret;
 }
 

From lypfw at pavarini.com Sun Jul 29 14:16:32 2007
From: lypfw at pavarini.com (Cook Robin)
Date: Sun, 29 Jul 2007 21:16:32 +0000
Subject: [ofa-general] e-mail
Message-ID: <46AD03B0.7030001@pavarini.com>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: e-mail.pdf Type: application/pdf Size: 25842 bytes Desc: not available URL: From tamura at osrg.net Sun Jul 29 17:30:32 2007 From: tamura at osrg.net (Yoshiaki Tamura) Date: Mon, 30 Jul 2007 09:30:32 +0900 Subject: [ofa-general] OFED-1.2 on x86 debian In-Reply-To: <46AA2663.4060709@hp.com> References: <46A97850.2030607@osrg.net> <46AA2663.4060709@hp.com> Message-ID: <46AD3128.1020009@osrg.net> > Michael S. Tsirkin wrote: >>> Quoting Yoshiaki Tamura : >>> Subject: OFED-1.2 on x86 debian >>> >>> Hi. >>> >>> I'm trying to install OFED-1.2 on x86 (32bit) debian machine. >>> Although build_env.sh seems to work on debian, >>> it fails compiling both kernel modules and user land tools by rpmbuild. >>> >>> Is OFED-1.2 tested on debian or totally unsupported? >> >> It's not on a list of supported platforms, but I think we do builds >> on ubuntu so debian should work too. Vlad? For some components it seems to work, but not all of them. > I have been trying to make it work here on Ubuntu (Debian rebuild) 7.04. > > Had to hack build_env.sh a little to get it to ignore some of the > dependency checking (done by package name, which is not portable across > distros). I removed gcc and zlib dependency checking to build on debian etch. I could compile user land basic packages, but it failed building dapl. rpmbuild couldn't find dat.conf. > When I tried to do that with ia64 Debian I was directed towards some tar files > of the mods rather than the install.sh stuff. I don't have the pointers at my > fingertips, but would assume they remain in the list archives. > > rick jones Maybe this page? http://www.openfabrics.org/builds/ Thanks for your comments. 
Yoshi From kliteyn at mellanox.co.il Sun Jul 29 21:05:11 2007 From: kliteyn at mellanox.co.il (kliteyn at mellanox.co.il) Date: 30 Jul 2007 07:05:11 +0300 Subject: [ofa-general] nightly osm_sim report 2007-07-30:normal completion Message-ID: OSM Simulation Regression Summary [Generated mail - please do NOT reply] OpenSM rev = Fri_Jul_27_07:38:19_2007 [7284d020ea232b253331faf52c950626cf330aab] ibutils rev = Tue_Mar_13_14:36:32_2007 [80aaff94f0eb65117db39b9db7d609ffdcc055de] Total=520 Pass=467 Fail=53 Pass: 39 Stability IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 OsmTest IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo 13 FatTree merge-roots-4-ary-2-tree.topo 13 FatTree merge-root-4-ary-3-tree.topo 13 FatTree gnu-stallion-64.topo 13 FatTree blend-4-ary-2-tree.topo 13 FatTree RhinoDDR.topo 13 FatTree FullGnu.topo 13 FatTree 4-ary-2-tree.topo 13 FatTree 2-ary-4-tree.topo 13 FatTree 12-node-spaced.topo 13 FTreeFail 4-ary-2-tree-missing-sw-link.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo 13 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo 13 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo 12 OsmTest IS3-loop.topo Failures: 39 Pkey IS1-16.topo 13 Pkey IS3-128.topo 1 OsmTest IS3-loop.topo From naim.hammond at gmail.com Sun Jul 29 23:13:51 2007 From: naim.hammond at gmail.com (Naim Hammond) Date: Mon, 30 Jul 2007 09:13:51 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: <20070729164632.GA28212@mellanox.co.il> References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> <20070729164632.GA28212@mellanox.co.il> Message-ID: On 7/29/07, Michael S. Tsirkin wrote: > > > Quoting Naim Hammond : > > You did mention it is tested on Ubuntu, but you weren't sure. is it? 
> > Note that support in this context means whether OFED was tested on the > distro, > not whether it builds/works. I'm sorry that I don't understand your answer. What exactly do you mean "OFED was tested" if it does not build, nor works, on this or that distribution? Please explain. N -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Mon Jul 30 01:40:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 30 Jul 2007 01:40:35 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070730-0100 daily build status Message-ID: <20070730084035.6DFB6E60863@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.15 Passed on x86_64 with linux-2.6.21.1 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on x86_64 with linux-2.6.19 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with 
linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on x86_64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Failed: From glebn at voltaire.com Mon Jul 30 01:50:16 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 11:50:16 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070729140431.GG16915@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> Message-ID: <20070730085016.GG4434@minantech.com> On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote: > Hello! > Here is an API proposal for support of the SRC > (scalable reliable connected) protocol extension in libibverbs. > > This adds APIs to: > - manage SRC domains > > - share SRC domains between processes, > by means of creating a 1:1 association > between an SRC domain and a file. > > Notes: > - The file is specified by means of a file descriptor, > this makes it possible for the user to manage file > creation/deletion in the most flexible manner > (e.g. tmpfile can be used). > > - I envision implementing this sharing mechanism in kernel by means > of a per-device tree, with inode as a key and domain object > as a value. > > Please comment. Can you provide a pseudo code of an application using this API? Especially QP sharing part. 
> > Signed-off-by: Michael S. Tsirkin > > --- > > diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h > index acc1b82..503f201 100644 > --- a/include/infiniband/verbs.h > +++ b/include/infiniband/verbs.h > @@ -370,6 +370,11 @@ struct ibv_ah_attr { > uint8_t port_num; > }; > > +struct ibv_src_domain { > + struct ibv_context *context; > + uint32_t handle; > +}; > + > enum ibv_srq_attr_mask { > IBV_SRQ_MAX_WR = 1 << 0, > IBV_SRQ_LIMIT = 1 << 1 > @@ -389,7 +394,8 @@ struct ibv_srq_init_attr { > enum ibv_qp_type { > IBV_QPT_RC = 2, > IBV_QPT_UC, > - IBV_QPT_UD > + IBV_QPT_UD, > + IBV_QPT_SRC > }; > > struct ibv_qp_cap { > @@ -408,6 +414,7 @@ struct ibv_qp_init_attr { > struct ibv_qp_cap cap; > enum ibv_qp_type qp_type; > int sq_sig_all; > + struct ibv_src_domain *src_domain; > }; > > enum ibv_qp_attr_mask { > @@ -526,6 +533,7 @@ struct ibv_send_wr { > uint32_t remote_qkey; > } ud; > } wr; > + uint32_t src_remote_srq_num; > }; > > struct ibv_recv_wr { > @@ -553,6 +561,10 @@ struct ibv_srq { > pthread_mutex_t mutex; > pthread_cond_t cond; > uint32_t events_completed; > + > + uint32_t src_srq_num; > + struct ibv_src_domain *src_domain; > + struct ibv_cq *src_cq; > }; > > struct ibv_qp { > @@ -570,6 +582,8 @@ struct ibv_qp { > pthread_mutex_t mutex; > pthread_cond_t cond; > uint32_t events_completed; > + > + struct ibv_src_domain *src_domain; > }; > > struct ibv_comp_channel { > @@ -912,6 +926,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, > struct ibv_srq_init_attr *srq_init_attr); > > /** > + * ibv_create_src_srq - Creates a SRQ associated with the specified protection > + * domain and src domain. > + * @pd: The protection domain associated with the SRQ. > + * @src_domain: The SRC domain associated with the SRQ. > + * @src_cq: CQ to report completions for SRC packets on. > + * > + * @srq_init_attr: A list of initial attributes required to create the SRQ. 
> + * > + * srq_attr->max_wr and srq_attr->max_sge are read the determine the > + * requested size of the SRQ, and set to the actual values allocated > + * on return. If ibv_create_srq() succeeds, then max_wr and max_sge > + * will always be at least as large as the requested values. > + */ > +struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd, > + struct ibv_src_domain *src_domain, > + struct ibv_cq *src_cq, > + struct ibv_srq_init_attr *srq_init_attr); > + > +/** > * ibv_modify_srq - Modifies the attributes for the specified SRQ. > * @srq: The SRQ to modify. > * @srq_attr: On input, specifies the SRQ attributes to modify. On output, > @@ -1074,6 +1107,44 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); > */ > int ibv_fork_init(void); > > +/** > + * ibv_alloc_src_domain - Allocate an SRC domain > + * Returns a reference to an SRC domain. > + * Use ibv_put_src_domain to free the reference. > + * @context: Device context > + */ > +struct ibv_src_domain *ibv_get_new_src_domain(struct ibv_context *context); > + > +/** > + * ibv_share_src_domain - associate the src domain with a file. > + * Establishes a connection between an SRC domain object and a file descriptor. > + * > + * @d: SRC domain to share > + * @fd: descriptor for a file to associate with the domain > + */ > +int ibv_share_src_domain(struct ibv_src_domain *d, int fd); > + > +/** > + * ibv_unshare_src_domain - disassociate the src domain from a file. > + * Subsequent calls to ibv_get_shared_src_domain will fail. 
> + * @d: SRC domain to unshare > + */ > +int ibv_unshare_src_domain(struct ibv_src_domain *d); > + > +/** > + * ibv_get_src_domain - get a reference to shared SRC domain > + * @context: Device context > + * @fd: descriptor for a file associated with the domain > + */ > +struct ibv_src_domain *ibv_get_shared_src_domain(struct ibv_context *context, int fd); > + > +/** > + * ibv_put_src_domain - destroy a reference to an SRC domain > + * If this is the last reference, destroys the domain. > + * @d: reference to SRC domain to put > + */ > +int ibv_put_src_domain(struct ibv_src_domain *d); > + > END_C_DECLS > > # undef __attribute_const > > > > -- > MST > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Gleb. From mst at dev.mellanox.co.il Mon Jul 30 01:52:21 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 11:52:21 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730085016.GG4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> Message-ID: <20070730085221.GF9963@mellanox.co.il> > On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote: > > Hello! > > Here is an API proposal for support of the SRC > > (scalable reliable connected) protocol extension in libibverbs. > > > > This adds APIs to: > > - manage SRC domains > > > > - share SRC domains between processes, > > by means of creating a 1:1 association > > between an SRC domain and a file. > > > > Notes: > > - The file is specified by means of a file descriptor, > > this makes it possible for the user to manage file > > creation/deletion in the most flexible manner > > (e.g. tmpfile can be used). 
> > > > - I envision implementing this sharing mechanism in kernel by means > > of a per-device tree, with inode as a key and domain object > > as a value. > > > > Please comment. > Can you provide a pseudo code of an application using this API? > Especially QP sharing part. There's no QP sharing here. You mean SRC domain sharing? -- MST From glebn at voltaire.com Mon Jul 30 01:54:09 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 11:54:09 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730085221.GF9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730085221.GF9963@mellanox.co.il> Message-ID: <20070730085409.GH4434@minantech.com> On Mon, Jul 30, 2007 at 11:52:21AM +0300, Michael S. Tsirkin wrote: > > On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote: > > > Hello! > > > Here is an API proposal for support of the SRC > > > (scalable reliable connected) protocol extension in libibverbs. > > > > > > This adds APIs to: > > > - manage SRC domains > > > > > > - share SRC domains between processes, > > > by means of creating a 1:1 association > > > between an SRC domain and a file. > > > > > > Notes: > > > - The file is specified by means of a file descriptor, > > > this makes it possible for the user to manage file > > > creation/deletion in the most flexible manner > > > (e.g. tmpfile can be used). > > > > > > - I envision implementing this sharing mechanism in kernel by means > > > of a per-device tree, with inode as a key and domain object > > > as a value. > > > > > > Please comment. > > Can you provide a pseudo code of an application using this API? > > Especially QP sharing part. > > There's no QP sharing here. > You mean SRC domain sharing? > Yes. Sorry. -- Gleb. From mst at dev.mellanox.co.il Mon Jul 30 02:01:40 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 30 Jul 2007 12:01:40 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730085016.GG4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> Message-ID: <20070730090140.GG9963@mellanox.co.il> Some code examples: /* create a domain and share it: */ struct ibv_src_domain * d = ibv_get_new_src_domain(ctx); int fd = open(path, O_CREAT | O_RDWR, mode); ibv_share_src_domain(d, fd); /* get a reference to a shared domain: */ int fd = open(path, O_CREAT | O_RDWR, mode); struct ibv_src_domain * d = ibv_get_shared_src_domain(ctx, fd); /* once done: */ ibv_put_src_domain(d); Note: when all users do put, domain is destroyed. -- MST From vlad at mellanox.co.il Mon Jul 30 02:04:00 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 30 Jul 2007 12:04:00 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian In-Reply-To: References: <46A97850.2030607@osrg.net> <20070727083438.GA9912@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901F75EA5@mtlexch01.mtl.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0CB0@mtlexch01.mtl.com> OFED-1.2 should work on Fedora C6, CentOS 4.4, 4.5, 5.0. Regards, Vladimir From: Naim Hammond [mailto:naim.hammond at gmail.com] Sent: Sunday, July 29, 2007 6:01 PM To: Vladimir Sokolovsky Cc: Michael S. Tsirkin; Yoshiaki Tamura; openib-general at openib.org Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian So OFED is not supported on any free distribution. You did mention it is tested on Ubuntu, but you weren't sure. is it? 
N On 7/29/07, Vladimir Sokolovsky wrote: Hi, See OFED-1.2/docs/OFED_release_notes.txt: 1.2 Supported Platforms and Operating Systems --------------------------------------------- o CPU architectures: - x86_64 - x86 - ppc64 - ia64 o Linux Operating Systems: - RedHat EL4 up3: 2.6.9-34.ELsmp - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL5: 2.6.18-8.el5 - SLES10: 2.6.16.21-0.8-smp - kernel.org: 2.6.20.x - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested) OFED-1.2 uses the RPM environment for installation. You can't use the OFED installation script as is on Debian. Regards, Vladimir From: Naim Hammond [mailto:naim.hammond at gmail.com] Sent: Sunday, July 29, 2007 2:30 PM To: Michael S. Tsirkin Cc: Yoshiaki Tamura; Vladimir Sokolovsky; openib-general at openib.org Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian Where is the list of supported distributions? Where can I see it? Thanks On 7/27/07, Michael S. Tsirkin < mst at dev.mellanox.co.il > wrote: > Quoting Yoshiaki Tamura < tamura at osrg.net >: > Subject: OFED-1.2 on x86 debian > > Hi. > > I'm trying to install OFED-1.2 on x86 (32bit) debian machine. > Although build_env.sh seems to work on debian, > it fails compiling both kernel modules and user land tools by rpmbuild. > > Is OFED-1.2 tested on debian or totally unsupported? It's not on a list of supported platforms, but I think we do builds on ubuntu so debian should work too. Vlad? -- MST
From glebn at voltaire.com Mon Jul 30 02:06:00 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 12:06:00 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730090140.GG9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730090140.GG9963@mellanox.co.il> Message-ID: <20070730090600.GI4434@minantech.com> On Mon, Jul 30, 2007 at 12:01:40PM +0300, Michael S. Tsirkin wrote: > Some code examples: > /* create a domain and share it: */ > > struct ibv_src_domain * d = ibv_get_new_src_domain(ctx); > int fd = open(path, O_CREAT | O_RDWR, mode); > ibv_share_src_domain(d, fd); > > /* get a reference to a shared domain: */ > > int fd = open(path, O_CREAT | O_RDWR, mode); > struct ibv_src_domain * d = ibv_get_shared_src_domain(ctx, fd); > > /* once done: */ > ibv_put_src_domain(d); > > Note: when all users do put, domain is destroyed. > OK. I am more interested in how SRC is connected to a QP in different processes. How will a process be able to send to different processes through one QP, etc.? -- Gleb.
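The sharing scheme discussed in this thread names an SRC domain by the file behind a descriptor, with the inode as the lookup key. Below is a minimal sketch in C of why that can work, using only POSIX calls: two independent opens of one file report the same (st_dev, st_ino) pair, while a different file does not. The names struct file_key, file_key_from_fd, file_key_equal and demo_shared_key are hypothetical, and pairing st_dev with st_ino is an assumption of this sketch. Nothing here is part of libibverbs or of the proposed API.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical lookup key: identifies the file behind a descriptor. */
struct file_key {
	dev_t dev;
	ino_t ino;
};

/* Fill *key from an open descriptor; returns 0 on success, -1 on error. */
static int file_key_from_fd(int fd, struct file_key *key)
{
	struct stat st;

	assert(key != NULL);
	if (fstat(fd, &st) < 0)
		return -1;
	key->dev = st.st_dev;
	key->ino = st.st_ino;
	return 0;
}

/* Returns nonzero when both keys name the same file. */
static int file_key_equal(const struct file_key *a, const struct file_key *b)
{
	return a->dev == b->dev && a->ino == b->ino;
}

/*
 * Returns 1 if two independent opens of one file agree on the key and an
 * unrelated file gets a different key, 0 if that does not hold, and -1 on
 * error. Assumes /tmp exists and is writable.
 */
static int demo_shared_key(void)
{
	char path_a[] = "/tmp/src_key_a_XXXXXX";
	char path_b[] = "/tmp/src_key_b_XXXXXX";
	struct file_key first, second, other;
	int fd_a1, fd_a2, fd_b, ret = -1;

	fd_a1 = mkstemp(path_a);        /* "process 1" creates the file */
	if (fd_a1 < 0)
		return -1;
	fd_a2 = open(path_a, O_RDWR);   /* "process 2" opens it by path */
	fd_b = mkstemp(path_b);         /* an unrelated file */

	if (fd_a2 >= 0 && fd_b >= 0 &&
	    file_key_from_fd(fd_a1, &first) == 0 &&
	    file_key_from_fd(fd_a2, &second) == 0 &&
	    file_key_from_fd(fd_b, &other) == 0)
		ret = file_key_equal(&first, &second) &&
		      !file_key_equal(&first, &other);

	/* Clean up descriptors and temporary files. */
	if (fd_b >= 0) {
		close(fd_b);
		unlink(path_b);
	}
	if (fd_a2 >= 0)
		close(fd_a2);
	close(fd_a1);
	unlink(path_a);
	return ret;
}
```

Calling demo_shared_key() from a main() is enough to try it. The point of keying on the file rather than on the descriptor number is that descriptor values are private to each process, while the file identity is visible to all of them.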
From ogerlitz at voltaire.com Mon Jul 30 02:08:21 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 30 Jul 2007 12:08:21 +0300 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <20070729173232.GA14867@obsidianresearch.com> References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726181132.GO19768@obsidianresearch.com> <46AC509B.6020206@voltaire.com> <20070729173232.GA14867@obsidianresearch.com> Message-ID: <46ADAA85.8070106@voltaire.com> Jason Gunthorpe wrote: > On Sun, Jul 29, 2007 at 11:32:27AM +0300, Or Gerlitz wrote: >> Jason Gunthorpe wrote: >>> The existing trap monitoring in Sean's module covers about 90% of the >>> cases in IB when you need to invalidate a PR, the last 10% will need >>> something new :( >> Let it be. Do you think the last 10% should not prevent the local sa >> merge to the upstream code? > Only that the design philosophy should accommodate an eventual solution > to this remaining problem. Mainly, as I've said, I'd like to see more > stuff in userspace and a simple well defined kernel component. > What about you? Your arguments about linking arp lifetime to PR cache > lifetime are trying to address this very same 10%. Indeed. The argument I was trying to make is that arp cache invalidation requires IPoIB PR cache invalidation. This handles 100% of the cases, including the 10% not covered by doing cache invalidation based only on IB events such as port up / sm lid change / sm reregister / etc. So far, my approach has not been accepted as is by Sean and you (Roland, Michael - would be nice to get your say here), so I have to see what other design is possible here. Or. From mst at dev.mellanox.co.il Mon Jul 30 02:10:20 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 30 Jul 2007 12:10:20 +0300 Subject: [ofa-general] Re: OFED-1.2 on x86 debian Message-ID: <20070730091020.GH9963@mellanox.co.il> Quoting Michael S. Tsirkin : Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian > Quoting Naim Hammond : > Subject: Re: [ofa-general] Re: OFED-1.2 on x86 debian > > On 7/29/07, Michael S. Tsirkin wrote: > > > Quoting Naim Hammond : > > You did mention it is tested on Ubuntu, but you weren't sure. is it? > > Note that support in this context means whether OFED was tested on the > distro, not whether it builds/works. > > > I'm sorry that I don't understand your answer. > What exactly do you mean "OFED was tested" if it does not build, nor works, on > this or that distribution? > > Please explain. > > N The reason some things in OFED might not work on a given distro is that no one volunteered to test on this distro. However: - Maintainers for some packages use distros outside the list of supported platforms. These will build and work, but no one compiled a specific list - one will have to ask around, and OFED packaging might not work, so one might need to install packages individually. - We do care about portability. If someone is interested enough to test things out on a given distro, and report issues, and work with maintainers on fixing things, you will find that people will be happy to help. Hope this helps, -- MST From mst at dev.mellanox.co.il Mon Jul 30 02:16:39 2007 From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 30 Jul 2007 12:16:39 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730085016.GG4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> Message-ID: <20070730091639.GI9963@mellanox.co.il> More code examples: Create an SRC QP, part of SRC domain: attr.qp_type = IBV_QPT_SRC; attr.src_domain = d; qp = ibv_create_qp(pd, &attr); Given remote SRQ number, send data to this SRQ over an SRC QP: wr.src_remote_srq_num = src_remote_srq_num; ib_post_send(qp, &wr); Note: SRQ number needs to be exchanged as part of CM private data or some other protocol. -- MST From vlad at lists.openfabrics.org Mon Jul 30 02:49:35 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Mon, 30 Jul 2007 02:49:35 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070730-0200 daily build status Message-ID: <20070730094936.111EAE60921@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ia64 with linux-2.6.12 Passed on powerpc with linux-2.6.16 Passed on ppc64 with linux-2.6.12 Passed on ia64 with 
linux-2.6.16 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.17 Passed on powerpc with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ia64 with linux-2.6.22 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From glebn at voltaire.com Mon Jul 30 03:41:17 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 13:41:17 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730091639.GI9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> Message-ID: <20070730104117.GJ4434@minantech.com> On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. 
Tsirkin wrote: > More code examples: > > Create an SRC QP, part of SRC domain: > > attr.qp_type = IBV_QPT_SRC; > attr.src_domain = d; > qp = ibv_create_qp(pd, &attr); > > Given remote SRQ number, send data to this SRQ over an SRC QP: > > wr.src_remote_srq_num = src_remote_srq_num; > ib_post_send(qp, &wr); > > Note: SRQ number needs to be exchanged as part of CM private data > or some other protocol. > You are too brief. I can come up with one-liners based on the API by myself. I am trying to understand how sharing of SRC between processes will work and your example doesn't show this. Can I connect the same SRC to different QPs? If yes, can I send a packet to any SRQ connected to the SRC through any QP connected to the same SRC? If yes, how is this different from having regular QPs? -- Gleb. From ishai at mellanox.co.il Mon Jul 30 04:03:40 2007 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 30 Jul 2007 14:03:40 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730104117.GJ4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> Gleb, I'm attaching a presentation that explains how we can use SRC in MPI. (You need PowerPoint to view it). Comments are welcome. Enjoy Ishai -----Original Message----- From: Gleb Natapov [mailto:glebn at voltaire.com] Sent: Monday, July 30, 2007 13:41 PM To: Michael S. Tsirkin Cc: general at lists.openfabrics.org; Roland Dreier; Pavel Shamis; Ishai Rabinovitz; ewg at lists.openfabrics.org Subject: Re: [ofa-general] RFC: SRC API On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S.
Tsirkin wrote: > More code examples: > > Create an SRC QP, part of SRC domain: > > attr.qp_type = IBV_QPT_SRC; > attr.src_domain = d; > qp = ibv_create_qp(pd, &attr); > > Given remote SRQ number, send data to this SRQ over an SRC QP: > > wr.src_remote_srq_num = src_remote_srq_num; > ib_post_send(qp, &wr); > > Note: SRQ number needs to be exchanged as part of CM private data > or some other protocol. > You are too brief. I can come up with one linears based on the API by myself. I am trying to understand how sharing of SRC between processes will work and your example doesn't show this. Can I connected the same SRC to different QPs? If yes, can I send packet to any SRQ connected to the SRC through any QP connected to the same SRC? If yes how is this different from having regular QPs? -- Gleb. -------------- next part -------------- A non-text attachment was scrubbed... Name: SRC-2.ppt Type: application/vnd.ms-powerpoint Size: 66560 bytes Desc: SRC-2.ppt URL: From mst at dev.mellanox.co.il Mon Jul 30 04:21:30 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 14:21:30 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730104117.GJ4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> Message-ID: <20070730112130.GJ9963@mellanox.co.il> > Quoting Gleb Natapov : > Subject: Re: [ofa-general] RFC: SRC API > > On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote: > > More code examples: > > > > Create an SRC QP, part of SRC domain: > > > > attr.qp_type = IBV_QPT_SRC; > > attr.src_domain = d; > > qp = ibv_create_qp(pd, &attr); > > > > Given remote SRQ number, send data to this SRQ over an SRC QP: > > > > wr.src_remote_srq_num = src_remote_srq_num; > > ib_post_send(qp, &wr); > > > > Note: SRQ number needs to be exchanged as part of CM private data > > or some other protocol. 
> > > You are too brief. I can come up with one linears based on the API by > myself. I am trying to understand how sharing of SRC between processes > will work and your example doesn't show this. It seems what you are missing is what SRC is, not how to use the API. I'll have a working example when I get closer to implementation. For now you'll have to look up Dror's preso if you want to understand what SRC is. > Can I connected the same > SRC to different QPs? If yes, can I send packet to any SRQ connected to > the SRC through any QP connected to the same SRC? Yes to both. > If yes how is this > different from having regular QPs? With regular QP you can only send to a single SRQ. But again, look at Dror's preso. -- MST From glebn at voltaire.com Mon Jul 30 04:27:22 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 14:27:22 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> Message-ID: <20070730112722.GK4434@minantech.com> On Mon, Jul 30, 2007 at 02:03:40PM +0300, Ishai Rabinovitz wrote: > Gleb, > > I'm attaching a presentation that explains how we can use SRC in MPI. > (You need power point to watch it). > > Comments are welcomed. So you propose to have separate QP for sending and receiving? And receiving QP should be shared between ranks (this part is not addressed by proposed API BTW). Correct? -- Gleb. 
From ishai at mellanox.co.il Mon Jul 30 04:29:46 2007 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 30 Jul 2007 14:29:46 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730112722.GK4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> Yes, there will be a different QP for send and a different QP for receive. There is no need for a special API to support this. You just open several QPs and treat them the way you want. Ishai -----Original Message----- From: Gleb Natapov [mailto:glebn at voltaire.com] Sent: Monday, July 30, 2007 14:27 PM To: Ishai Rabinovitz Cc: Michael S. Tsirkin; general at lists.openfabrics.org; Roland Dreier; Pavel Shamis; Jeff Squyres; Galen Shipman; Gil Bloch; panda at cse.ohio-state.edu Subject: Re: [ofa-general] RFC: SRC API On Mon, Jul 30, 2007 at 02:03:40PM +0300, Ishai Rabinovitz wrote: > Gleb, > > I'm attaching a presentation that explains how we can use SRC in MPI. > (You need power point to watch it). > > Comments are welcomed. So you propose to have separate QP for sending and receiving? And receiving QP should be shared between ranks (this part is not addressed by proposed API BTW). Correct? -- Gleb. 
From glebn at voltaire.com Mon Jul 30 04:54:25 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 14:54:25 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730112130.GJ9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <20070730112130.GJ9963@mellanox.co.il> Message-ID: <20070730115425.GL4434@minantech.com> On Mon, Jul 30, 2007 at 02:21:30PM +0300, Michael S. Tsirkin wrote: > > Quoting Gleb Natapov : > > Subject: Re: [ofa-general] RFC: SRC API > > > > On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote: > > > More code examples: > > > > > > Create an SRC QP, part of SRC domain: > > > > > > attr.qp_type = IBV_QPT_SRC; > > > attr.src_domain = d; > > > qp = ibv_create_qp(pd, &attr); > > > > > > Given remote SRQ number, send data to this SRQ over an SRC QP: > > > > > > wr.src_remote_srq_num = src_remote_srq_num; > > > ib_post_send(qp, &wr); > > > > > > Note: SRQ number needs to be exchanged as part of CM private data > > > or some other protocol. > > > > > You are too brief. I can come up with one linears based on the API by > > myself. I am trying to understand how sharing of SRC between processes > > will work and your example doesn't show this. > > It seems what you are missing is what SRC is, not how to use the API. So tell us. Because it seems I am not the only one, judging by the presentation I got from Ishai. In this presentation he proposes to create separate receive QPs and send QPs. Is this how it is meant to work if the SRC domain is shared between processes? Because frankly, I don't see how it can be used in any other way. > I'll have a working example when I get closer to implementation. > For now you'll have to look up Dror's preso if you want to > understand what SRC is. I have looked at Dror's presentation more than once.
If we are talking about the same presentation, there are not many details there except an additional field in the header with the destination SRQ number, so that HW will be able to demux a packet into the right SRQ. > > > Can I connected the same > > SRC to different QPs? If yes, can I send packet to any SRQ connected to > > the SRC through any QP connected to the same SRC? > > Yes to both. And can I attach an SRQ to an SRC domain without creating a QP? I suppose yes. > > > If yes how is this > > different from having regular QPs? > > With regular QP you can only send to a single SRQ. > But again, look at Dror's preso. > Yes, but I can use the same QP for sending and receiving (this is a Queue Pair after all). Now I'll have to create a QP for send and a QP for receive. The overall number of QPs may still be smaller though... -- Gleb. From glebn at voltaire.com Mon Jul 30 05:00:18 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 15:00:18 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> Message-ID: <20070730120018.GM4434@minantech.com> On Mon, Jul 30, 2007 at 02:29:46PM +0300, Ishai Rabinovitz wrote: > Yes, there will be a different QP for send and a different QP for > receive. > There is no need for a special API to support this. You just open > several QPs and treat them the way you want. The way it is presented in your slides, the receive QPs are in shared memory, but if it is possible to attach an SRQ to an SRC without access to the QP, the QP may reside in the memory of one of the ranks.
By the way ibv_create_src_srq() gets PD as a parameter and each process will have its own PD, so one QP will be able to put messages to different PD domains is that right? -- Gleb. From mst at dev.mellanox.co.il Mon Jul 30 05:07:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 15:07:06 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730120018.GM4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> <20070730120018.GM4434@minantech.com> Message-ID: <20070730120706.GM9963@mellanox.co.il> > By the way ibv_create_src_srq() gets PD as a parameter and each process will > have its own PD, so one QP will be able to put messages to different PD > domains is that right? Correct. That's part of the SRC extension. -- MST From ishai at mellanox.co.il Mon Jul 30 05:06:01 2007 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Mon, 30 Jul 2007 15:06:01 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730120018.GM4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> <20070730120018.GM4434@minantech.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF0E0F@mtlexch01.mtl.com> -----Original Message----- From: Gleb Natapov [mailto:glebn at voltaire.com] Sent: Monday, July 30, 2007 15:00 PM To: Ishai Rabinovitz Cc: Michael S. 
Tsirkin; general at lists.openfabrics.org; Roland Dreier; Pavel Shamis; Jeff Squyres; Galen Shipman; Gil Bloch; panda at cse.ohio-state.edu Subject: Re: [ofa-general] RFC: SRC API On Mon, Jul 30, 2007 at 02:29:46PM +0300, Ishai Rabinovitz wrote: >The way it is present in your slides the receive QPs are in shared memory, but if it is possible to attach SRQ to SRC without access to QP the QP may reside in a > memory of one of the ranks. Actually, no one accesses the receive QP and it occupies little space. I drew it in SHM, but you can think of it as existing only in the kernel and in the HCA. > By the way > ibv_create_src_srq() gets PD as a parameter and each process will have its own PD, so one QP will be able to put messages to different PD domains is that right? Yes. Ishai From mst at dev.mellanox.co.il Mon Jul 30 05:10:57 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 15:10:57 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730115425.GL4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <20070730112130.GJ9963@mellanox.co.il> <20070730115425.GL4434@minantech.com> Message-ID: <20070730121057.GN9963@mellanox.co.il> > > It seems what you are missing is what SRC is, not how to use the API. > > So tell us. This calls for a separate document. From the feedback at Sonoma I really assumed people had it figured out. Let's open a separate thread, and there I will try writing up what SRC is from the protocol point of view.
-- MST From glebn at voltaire.com Mon Jul 30 05:11:02 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 15:11:02 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730120706.GM9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> <20070730120018.GM4434@minantech.com> <20070730120706.GM9963@mellanox.co.il> Message-ID: <20070730121102.GN4434@minantech.com> On Mon, Jul 30, 2007 at 03:07:06PM +0300, Michael S. Tsirkin wrote: > > By the way ibv_create_src_srq() gets PD as a parameter and each process will > > have its own PD, so one QP will be able to put messages to different PD > > domains is that right? > > Correct. That's part of the SRC extension. > Are rkeys/lkeys unique across different PDs? If yes, is this required by the spec or is it just a consequence of the implementation? -- Gleb. From glebn at voltaire.com Mon Jul 30 05:12:13 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Mon, 30 Jul 2007 15:12:13 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730121057.GN9963@mellanox.co.il> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <20070730112130.GJ9963@mellanox.co.il> <20070730115425.GL4434@minantech.com> <20070730121057.GN9963@mellanox.co.il> Message-ID: <20070730121213.GO4434@minantech.com> On Mon, Jul 30, 2007 at 03:10:57PM +0300, Michael S. Tsirkin wrote: > > > It seems what you are missing is what SRC is, not how to use the API. > > > > So tell us. > > This calls for a separate document. From feedback from Sonoma I really assumed > people have it figured out.
> > Let's open a separate thread, and there I will try writing up > what SRC is from the protocol point of view. > No problem. Start it :) -- Gleb. From tziporet at dev.mellanox.co.il Mon Jul 30 05:24:06 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 30 Jul 2007 15:24:06 +0300 Subject: [ewg] Re: [ofa-general] RE: OFA website edits In-Reply-To: <46A798F0.5070902@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A798F0.5070902@ichips.intel.com> Message-ID: <46ADD866.7080301@mellanox.co.il> Arlin Davis wrote: > >> I would like to propose adding project directories under >> http://www.openfabrics.org/downloads/ where appropriate and give >> maintainers access. For example: >> > Jeff, please add the following directories with maintainer access as > follow (or grant access at a maintainer group level): > > http://www.openfabrics.org/downloads/sdp (eitan) SDP should be under the name of Jim Mott (jimmott), since he is the maintainer of SDP and not Eitan. Tziporet From monis at voltaire.com Mon Jul 30 05:37:29 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:37:29 +0300 Subject: [ofa-general] [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver Message-ID: <46ADDB89.5030601@voltaire.com> This patch series is the third version (see the link to V2 below) of the suggested changes to the bonding driver so that it can support non ARPHRD_ETHER netdevices in its High-Availability (active-backup) mode. The motivation is to enable the bonding driver in its HA mode to work with the IP over Infiniband (IPoIB) driver. With these patches I was able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with fail-over and fail-back working fine. The working environment was the net-2.6 git. Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver which is used by native IB ULPs whose addressing scheme is based on IP (e.g.
iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for these ULPs. This holds because when the ULP is informed by the IB HW of the failure of the current IB connection, it just needs to reconnect, and the bonding device will now issue the IB ARP over the active IPoIB slave. This series also includes patches to the IPoIB driver that fix some neighboring-related issues. There are still 2 open issues here: 1. When bonding enslaves an IPoIB device, the bonding neighbor holds a reference to a cleanup function in the IPoIB driver. This makes it unsafe to unload the IPoIB module if there are bonding neighbors in the air. So, to avoid this race one must unload bonding before unloading IPoIB. 2. Patch No. 7 is a workaround for a problem where gratuitous ARPs were quite often not sent. I didn't find anything better that fixes this, and I would appreciate advice and comments regarding it. However, this doesn't seem to me to be an issue related exclusively to IPoIB. Links to earlier discussion: 1. A discussion in netdev about bonding support for IPoIB. http://lists.openwall.net/netdev/2006/11/30/46 2. A discussion in openfabrics regarding changes in IPoIB that enable using it as a slave for bonding. http://lists.openfabrics.org/pipermail/general/2007-March/034033.html From monis at voltaire.com Mon Jul 30 05:48:20 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:48:20 +0300 Subject: [ofa-general] [PATCH V3 1/7] IB/ipoib: Bound the net device to the ipoib_neigh structue In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDE14.7000708@voltaire.com> IPoIB uses a two layer neighboring scheme, such that for each struct neighbour whose device is an ipoib one, there is a struct ipoib_neigh buddy which is created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour) call.
When using the bonding driver, neighbours are created by the net stack on behalf of the bonding (master) device. On the tx flow the bonding code gets an skb such that skb->dev points to the master device, it changes this skb to point on the slave device and calls the slave hard_start_xmit function. Under this scheme, ipoib_neigh_destructor assumption that for each struct neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev) can be casted to struct ipoib_dev_priv is buggy. To fix it, this patch adds a dev field to struct ipoib_neigh which is used instead of the struct neighbour dev one, when n->dev->flags has the IFF_MASTER bit set. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 17 +++++++++++++++-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 +- 3 files changed, 19 insertions(+), 4 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-25 14:56:13.000000000 +0300 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-25 14:57:48.095724495 +0300 @@ -328,6 +328,7 @@ struct ipoib_neigh { struct sk_buff_head queue; struct neighbour *neighbour; + struct net_device *dev; struct list_head list; }; @@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip INFINIBAND_ALEN, sizeof(void *)); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh, + struct net_device *dev); void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-25 14:56:13.000000000 +0300 +++ 
net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-25 15:03:11.291291271 +0300 @@ -510,7 +510,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = ipoib_neigh_alloc(skb->dst->neighbour); + neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -817,6 +817,16 @@ static void ipoib_neigh_cleanup(struct n unsigned long flags; struct ipoib_ah *ah = NULL; + if (n->dev->flags & IFF_MASTER) { + /* n->dev is not an IPoIB device and we have to take priv from elsewhere */ + neigh = *to_ipoib_neigh(n); + if (neigh){ + priv = netdev_priv(neigh->dev); + ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n", + n->dev->name); + } else + return; + } ipoib_dbg(priv, "neigh_cleanup for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), @@ -838,7 +848,9 @@ static void ipoib_neigh_cleanup(struct n ipoib_put_ah(ah); } -struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour, + struct net_device *dev) + { struct ipoib_neigh *neigh; @@ -847,6 +859,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return NULL; neigh->neighbour = neighbour; + neigh->dev = dev; *to_ipoib_neigh(neighbour) = neigh; skb_queue_head_init(&neigh->queue); ipoib_cm_set(neigh, NULL); Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-25 14:56:13.000000000 +0300 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-25 14:57:48.097724142 +0300 @@ -727,7 +727,7 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (neigh) { kref_get(&mcast->ah->ref); From monis at 
voltaire.com Mon Jul 30 05:49:43 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:49:43 +0300 Subject: [ofa-general] [PATCH V3 2/7] IB/ipoib: Verify address handle validity on send In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDE67.7030502@voltaire.com> When the bonding device senses a carrier loss of its active slave it replaces that slave with a new one. In between the times when the carrier of an IPoIB device goes down and ipoib_neigh is destroyed, it is possible that the bonding driver will send a packet on a new slave that uses an old ipoib_neigh. This patch detects and prevents this from happenning. Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-25 14:57:48.000000000 +0300 +++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-25 15:02:55.525131034 +0300 @@ -685,9 +685,10 @@ static int ipoib_start_xmit(struct sk_bu goto out; } } else if (neigh->ah) { - if (unlikely(memcmp(&neigh->dgid.raw, + if (unlikely((memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, - sizeof(union ib_gid)))) { + sizeof(union ib_gid))) || + (neigh->dev != dev))) { spin_lock(&priv->lock); /* * It's safe to call ipoib_put_ah() inside From mst at dev.mellanox.co.il Mon Jul 30 05:50:54 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 15:50:54 +0300 Subject: [ofa-general] Scalable reliable connection Message-ID: <20070730125054.GO9963@mellanox.co.il> Here's some background on what SRC is. This is basically slide 6 in Dror's talk, for those that missed the talk. 
* * *

SRC is an extension supported by recent Mellanox hardware which is geared toward reducing the number of QPs required for all-to-all communication on systems with a high number of jobs per node.

===================================================================
Motivation:
===================================================================

Given N nodes with J jobs per node, the number of QPs required for all-to-all communication is:

With RC: O((N * J) ^ 2)
    Since each job out of O(N * J) jobs must create a single QP to communicate with each one of O(N * J) other jobs.

With SRC: O(N ^ 2 * J)
    This is achieved by using a single send queue (per job, out of O(N * J) jobs) to send data to all J jobs running on a specific node (out of O(N) nodes). Hardware uses a new "SRQ number" field in the packet header to multiplex receive WRs and WCs to the private memory of each job.

This is a similar idea to IB RD.
Q: Why not use RD then?
A: Because no hardware supports it.

Details:

===================================================================
Verbs extension:
===================================================================

- There is a new transport/QP type "SRC".
- There is a new object type "SRC domain".
- Each SRQ gets new (optional) attributes:
      SRC domain
      SRC SRQ number
      SRC CQ
  SRQ must have either all 3 of these or none of these attributes.
- QPs of type SRC have all the same attributes as regular RC QPs connected to SRQ, except that:
  A. Each SRC QP has a new required attribute "SRC domain"
  B. SRC QPs do *not* have "SRQ" attribute (do not have a specific SRQ associated with them)

===================================================================
Protocol extension:
===================================================================

SRC QP behaviour: Requestor

- Post send WR for this QP type is extended with an SRQ number field. This number is sent as part of the packet header.
- SRC packets follow the rules for RC packets on the wire, exactly. What is different is their handling at the responder side.

SRC QP behaviour: Responder

Each incoming packet passes transport checks with respect to the SRC QP, following RC rules, exactly. After this, the SRQ number in the packet header is used to look up a specific SRQ. The SRC domain of the resulting SRQ must be equal to the SRC domain of the QP, otherwise a NAK is sent, and the QP moves to error state. If the SRC domains match, receive WR and receive WC processing are as follows:

- RC Send
  - Rather than using the SRQ to which the QP is attached, the SRQ is looked up by SRQ number in the packet. Receive WR is taken from this SRQ.
  - Completions are generated on the CQ specified in the SRQ.
- RDMA/Atomic
  - Rather than using the PD to which the QP is attached, the SRQ is looked up by SRQ number in the packet. The PD of this SRQ is used for protection checks.

===================================================================

--
MST

From monis at voltaire.com Mon Jul 30 05:51:22 2007
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 30 Jul 2007 15:51:22 +0300
Subject: [ofa-general] [PATCH V3 3/7] net/bonding: Enable bonding to enslave non ARPHRD_ETHER
In-Reply-To: <46ADDB89.5030601@voltaire.com>
References: <46ADDB89.5030601@voltaire.com>
Message-ID: <46ADDECA.8010605@voltaire.com>

This patch changes some of the bond netdevice attributes and functions to be those of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type.
Basically it overrides those setting done by ether_setup(), which are netdevice **type** dependent and hence might be not appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded over the v1 discussion. IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this IPoIB device is bounded to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (i have omitted here some details which are not important for the bonding RFC). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 38 insertions(+) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:02:10.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-29 16:24:30.913343981 +0300 @@ -1277,6 +1277,26 @@ static int bond_compute_features(struct return 0; } + +static void bond_setup_by_slave(struct net_device *bond_dev, + struct net_device *slave_dev) +{ + bond_dev->hard_header = slave_dev->hard_header; + bond_dev->rebuild_header = slave_dev->rebuild_header; + bond_dev->hard_header_cache = slave_dev->hard_header_cache; + bond_dev->header_cache_update = slave_dev->header_cache_update; + bond_dev->hard_header_parse = slave_dev->hard_header_parse; + + bond_dev->neigh_setup = slave_dev->neigh_setup; + + bond_dev->type = slave_dev->type; + bond_dev->hard_header_len = slave_dev->hard_header_len; + bond_dev->addr_len = slave_dev->addr_len; + + memcpy(bond_dev->broadcast, slave_dev->broadcast, + slave_dev->addr_len); +} + /* enslave device to bond device */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) { @@ -1351,6 +1371,24 @@ int bond_enslave(struct 
net_device *bond goto err_undo_flags; } + /* set bonding device ether type by slave - bonding netdevices are + * created with ether_setup, so when the slave type is not ARPHRD_ETHER + * there is a need to override some of the type dependent attribs/funcs. + * + * bond ether type mutual exclusion - don't allow slaves of dissimilar + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond + */ + if (bond->slave_cnt == 0) { + if (slave_dev->type != ARPHRD_ETHER) + bond_setup_by_slave(bond_dev, slave_dev); + } else if (bond_dev->type != slave_dev->type) { + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from " + "other slaves (%d), can not enslave it.\n", slave_dev->name, + slave_dev->type, bond_dev->type); + res = -EINVAL; + goto err_undo_flags; + } + if (slave_dev->set_mac_address == NULL) { printk(KERN_ERR DRV_NAME ": %s: Error: The slave device you specified does " From mst at dev.mellanox.co.il Mon Jul 30 05:52:37 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Mon, 30 Jul 2007 15:52:37 +0300 Subject: [ofa-general] RFC: SRC API In-Reply-To: <20070730121102.GN4434@minantech.com> References: <20070729140431.GG16915@mellanox.co.il> <20070730085016.GG4434@minantech.com> <20070730091639.GI9963@mellanox.co.il> <20070730104117.GJ4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0D95@mtlexch01.mtl.com> <20070730112722.GK4434@minantech.com> <6C2C79E72C305246B504CBA17B5500C901FF0DBF@mtlexch01.mtl.com> <20070730120018.GM4434@minantech.com> <20070730120706.GM9963@mellanox.co.il> <20070730121102.GN4434@minantech.com> Message-ID: <20070730125237.GP9963@mellanox.co.il> > Quoting Gleb Natapov : > Subject: Re: [ofa-general] RFC: SRC API > > On Mon, Jul 30, 2007 at 03:07:06PM +0300, Michael S. Tsirkin wrote: > > > By the way ibv_create_src_srq() gets PD as a parameter and each process will > > > have its own PD, so one QP will be able to put messages to different PD > > > domains is that right? > > > > Correct. 
That's part of the SRC extension. > > > Is rkey/lkey are unique across different PDs? If yes is this required by > Spec or is this just a consequences of the implementation? Yes, but I think that this is not required by the spec. -- MST From monis at voltaire.com Mon Jul 30 05:52:50 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:52:50 +0300 Subject: [ofa-general] [PATCH V3 4/7] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address() In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDF22.3090604@voltaire.com> This patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified on the mac address (neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 88 +++++++++++++++++++++++++++------------- drivers/net/bonding/bonding.h | 1 2 files changed, 61 insertions(+), 28 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-29 16:24:30.913343981 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-29 16:36:53.234602471 +0300 @@ -1127,6 +1127,14 @@ void bond_change_active_slave(struct bon if (new_active) { bond_set_slave_active_flags(new_active); } + + /* when bonding does not set the slave MAC address, the bond MAC + * address is the one of the active slave. 
+ */ + if (new_active && !bond->do_set_mac_addr) + memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, + new_active->dev->addr_len); + bond_send_gratuitous_arp(bond); } } @@ -1390,13 +1398,22 @@ int bond_enslave(struct net_device *bond } if (slave_dev->set_mac_address == NULL) { - printk(KERN_ERR DRV_NAME - ": %s: Error: The slave device you specified does " - "not support setting the MAC address. " - "Your kernel likely does not support slave " - "devices.\n", bond_dev->name); - res = -EOPNOTSUPP; - goto err_undo_flags; + if (bond->slave_cnt == 0) { + printk(KERN_WARNING DRV_NAME + ": %s: Warning: The first slave device you " + "specified does not support setting the MAC " + "address. This bond MAC address would be that " + "of the active slave.\n", bond_dev->name); + bond->do_set_mac_addr = 0; + } else if (bond->do_set_mac_addr) { + printk(KERN_ERR DRV_NAME + ": %s: Error: The slave device you specified " + "does not support setting the MAC addres,." + "but this bond uses this practice. \n" + , bond_dev->name); + res = -EOPNOTSUPP; + goto err_undo_flags; + } } new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL); @@ -1417,16 +1434,18 @@ int bond_enslave(struct net_device *bond */ memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - /* - * Set slave to master's mac address. The application already - * set the master's mac address to that of the first slave - */ - memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); - addr.sa_family = slave_dev->type; - res = dev_set_mac_address(slave_dev, &addr); - if (res) { - dprintk("Error %d calling set_mac_address\n", res); - goto err_free; + if (bond->do_set_mac_addr) { + /* + * Set slave to master's mac address. 
The application already + * set the master's mac address to that of the first slave + */ + memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); + addr.sa_family = slave_dev->type; + res = dev_set_mac_address(slave_dev, &addr); + if (res) { + dprintk("Error %d calling set_mac_address\n", res); + goto err_free; + } } res = netdev_set_master(slave_dev, bond_dev); @@ -1651,9 +1670,11 @@ err_close: dev_close(slave_dev); err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } err_free: kfree(new_slave); @@ -1831,10 +1852,12 @@ int bond_release(struct net_device *bond /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address */ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address */ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB | IFF_SLAVE_INACTIVE | IFF_BONDING | @@ -1921,10 +1944,12 @@ static int bond_release_all(struct net_d /* close slave before restoring its mac address */ dev_close(slave_dev); - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - dev_set_mac_address(slave_dev, &addr); + if (bond->do_set_mac_addr) { + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + dev_set_mac_address(slave_dev, &addr); + } slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | 
IFF_MASTER_ALB | IFF_SLAVE_INACTIVE); @@ -3961,6 +3986,10 @@ static int bond_set_mac_address(struct n dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None")); + if (!bond->do_set_mac_addr) { + return -EOPNOTSUPP; + } + if (!is_valid_ether_addr(sa->sa_data)) { return -EADDRNOTAVAIL; } @@ -4351,6 +4380,9 @@ static int bond_init(struct net_device * bond_create_proc_entry(bond); #endif + /* set do_set_mac_addr to true on startup */ + bond->do_set_mac_addr = 1; + list_add_tail(&bond->bond_list, &bond_dev_list); return 0; Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-07-29 16:25:22.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-07-29 16:37:13.163056181 +0300 @@ -201,6 +201,7 @@ struct bonding { struct list_head vlan_list; struct vlan_group *vlgrp; struct packet_type arp_mon_pt; + s8 do_set_mac_addr; }; /** From monis at voltaire.com Mon Jul 30 05:54:03 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:54:03 +0300 Subject: [ofa-general] [PATCH V3 5/7] net/bonding: Enable IP multicast for bonding IPoIB devices In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDF6B.1080907@voltaire.com> Allow to enslave devices when the bonding device is not up. Over the discussion held at the previous post this seemed to be the most clean way to go, where it is not expected to cause instabilities. Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() have set the bonding device type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. 
This is b/c ip_eth_mc_map() is called where for multicast joins taking place after the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND) Signed-off-by: Moni Shoua Signed-off-by: Or Gerlitz --- drivers/net/bonding/bond_main.c | 4 ++-- drivers/net/bonding/bond_sysfs.c | 6 ++---- 2 files changed, 4 insertions(+), 6 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:04:50.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-25 15:06:17.175820632 +0300 @@ -1325,8 +1325,8 @@ int bond_enslave(struct net_device *bond /* bond must be initialized by bond_open() before enslaving */ if (!(bond_dev->flags & IFF_UP)) { - dprintk("Error, master_dev is not up\n"); - return -EPERM; + printk(KERN_WARNING DRV_NAME + " %s: master_dev is not up in bond_enslave\n", bond_dev->name); } /* already enslaved */ Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-07-25 14:18:12.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-07-25 15:06:17.176820452 +0300 @@ -266,11 +266,9 @@ static ssize_t bonding_store_slaves(stru /* Quick sanity check -- is the bond interface up? */ if (!(bond->dev->flags & IFF_UP)) { - printk(KERN_ERR DRV_NAME - ": %s: Unable to update slaves because interface is down.\n", + printk(KERN_WARNING DRV_NAME + ": %s: doing slave updates when interface is down.\n", bond->dev->name); - ret = -EPERM; - goto out; } /* Note: We can't hold bond->lock here, as bond_create grabs it. 
*/ From monis at voltaire.com Mon Jul 30 05:54:59 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:54:59 +0300 Subject: [ofa-general] [PATCH V3 6/7] net/bonding: Handlle wrong assumptions that slave is always an Ethernet device In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDFA3.1000100@voltaire.com> bonding sometimes uses Ethernet constants (such as MTU and address length) which are not good when it enslaves non Ethernet devices (such as InfiniBand). Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 2 +- drivers/net/bonding/bond_sysfs.c | 10 ++++++++-- drivers/net/bonding/bonding.h | 1 + 3 files changed, 10 insertions(+), 3 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:06:17.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-25 15:33:25.012883360 +0300 @@ -1255,7 +1255,7 @@ static int bond_compute_features(struct unsigned long features = BOND_INTERSECT_FEATURES; struct slave *slave; struct net_device *bond_dev = bond->dev; - unsigned short max_hard_header_len = ETH_HLEN; + u16 max_hard_header_len = max((u16)ETH_HLEN, bond_dev->hard_header_len); int i; bond_for_each_slave(bond, slave, i) { Index: net-2.6/drivers/net/bonding/bond_sysfs.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_sysfs.c 2007-07-25 15:06:17.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_sysfs.c 2007-07-25 15:20:10.224527636 +0300 @@ -260,6 +260,7 @@ static ssize_t bonding_store_slaves(stru char command[IFNAMSIZ + 1] = { 0, }; char *ifname; int i, res, found, ret = count; + u32 original_mtu; struct slave *slave; struct net_device *dev = NULL; struct bonding *bond = to_bond(d); @@ -325,6 +326,7 @@ static ssize_t bonding_store_slaves(stru } /* Set the slave's MTU to match the 
bond */ + original_mtu = dev->mtu; if (dev->mtu != bond->dev->mtu) { if (dev->change_mtu) { res = dev->change_mtu(dev, @@ -339,6 +341,9 @@ static ssize_t bonding_store_slaves(stru } rtnl_lock(); res = bond_enslave(bond->dev, dev); + bond_for_each_slave(bond, slave, i) + if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) + slave->original_mtu=original_mtu; rtnl_unlock(); if (res) { ret = res; @@ -351,6 +356,7 @@ static ssize_t bonding_store_slaves(stru bond_for_each_slave(bond, slave, i) if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) { dev = slave->dev; + original_mtu = slave->original_mtu; break; } if (dev) { @@ -365,9 +371,9 @@ static ssize_t bonding_store_slaves(stru } /* set the slave MTU to the default */ if (dev->change_mtu) { - dev->change_mtu(dev, 1500); + dev->change_mtu(dev, original_mtu); } else { - dev->mtu = 1500; + dev->mtu = original_mtu; } } else { Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-07-25 15:03:32.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-07-25 15:20:10.223527810 +0300 @@ -156,6 +156,7 @@ struct slave { s8 link; /* one of BOND_LINK_XXXX */ s8 state; /* one of BOND_STATE_XXXX */ u32 original_flags; + u32 original_mtu; u32 link_failure_count; u16 speed; u8 duplex; From hnguyen at linux.vnet.ibm.com Mon Jul 30 06:07:47 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Mon, 30 Jul 2007 15:07:47 +0200 Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings "externs should be avoided in .c files" In-Reply-To: References: <200707271254.51055.hnguyen@linux.vnet.ibm.com> Message-ID: <200707301507.47575.hnguyen@linux.vnet.ibm.com> Hi Roland! > the patch looks fine except your mailer seems to have mangled > it... can you resend so I can apply it? Was going to recreate this patch, but then I saw that you probably have incorporated it (manually) in your latest git. 
Just want to make sure I'm seeing it right. Anyway, appreciate your help! Nam From monis at voltaire.com Mon Jul 30 05:56:06 2007 From: monis at voltaire.com (Moni Shoua) Date: Mon, 30 Jul 2007 15:56:06 +0300 Subject: [ofa-general] [PATCH V3 7/7] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <46ADDB89.5030601@voltaire.com> References: <46ADDB89.5030601@voltaire.com> Message-ID: <46ADDFE6.9000609@voltaire.com> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit in dev->state field is on. This improves the chances for the arp packet to be transmitted. Signed-off-by: Moni Shoua --- drivers/net/bonding/bond_main.c | 25 +++++++++++++++++++++---- drivers/net/bonding/bonding.h | 1 + 2 files changed, 22 insertions(+), 4 deletions(-) Index: net-2.6/drivers/net/bonding/bond_main.c =================================================================== --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:33:25.000000000 +0300 +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-26 18:42:59.296296622 +0300 @@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon if (new_active && !bond->do_set_mac_addr) memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, new_active->dev->addr_len); - - bond_send_gratuitous_arp(bond); + if (bond->curr_active_slave && + test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)){ + dprintk("delaying gratuitous arp on %s\n",bond->curr_active_slave->dev->name); + bond->send_grat_arp=1; + }else{ + bond_send_gratuitous_arp(bond); + } } } @@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device * program could monitor the link itself if needed. 
*/ + if (bond->send_grat_arp) { + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)) + dprintk("Needs to send gratuitous arp but not yet\n",__FUNCTION__); + else { + dprintk("sending delayed gratuitous arp on ond->curr_active_slave->dev->name\n"); + bond_send_gratuitous_arp(bond); + bond->send_grat_arp=0; + } + } read_lock(&bond->curr_slave_lock); oldcurrent = bond->curr_active_slave; read_unlock(&bond->curr_slave_lock); @@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str struct slave *slave = bond->curr_active_slave; struct vlan_entry *vlan; struct net_device *vlan_dev; + int i; dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name, slave ? slave->dev->name : "NULL"); @@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str return; if (bond->master_ip) { - bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, - bond->master_ip, 0); + for (i=0;i<3;i++) + bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, + bond->master_ip, 0); } list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { @@ -4331,6 +4347,7 @@ static int bond_init(struct net_device * bond->current_arp_slave = NULL; bond->primary_slave = NULL; bond->dev = bond_dev; + bond->send_grat_arp=0; INIT_LIST_HEAD(&bond->vlan_list); /* Initialize the device entry points */ Index: net-2.6/drivers/net/bonding/bonding.h =================================================================== --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-07-25 15:20:10.000000000 +0300 +++ net-2.6/drivers/net/bonding/bonding.h 2007-07-26 18:42:43.652087660 +0300 @@ -203,6 +203,7 @@ struct bonding { struct vlan_group *vlgrp; struct packet_type arp_mon_pt; s8 do_set_mac_addr; + int send_grat_arp; }; /** From rdreier at cisco.com Mon Jul 30 06:54:30 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jul 2007 06:54:30 -0700 Subject: [ofa-general] [PATCH 1/2] ehca: remove checkpatch.pl's warnings "externs should be avoided in .c 
files"
In-Reply-To: <200707301507.47575.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Mon, 30 Jul 2007 15:07:47 +0200")
References: <200707271254.51055.hnguyen@linux.vnet.ibm.com> <200707301507.47575.hnguyen@linux.vnet.ibm.com>
Message-ID:

> Was going to recreate this patch, but then I saw that you
> probably have incorporated it (manually) in your latest git.
> Just want to make sure I'm seeing it right.

Yes, I ended up doing it by hand. Thanks.

From tziporet at dev.mellanox.co.il Mon Jul 30 06:58:23 2007
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 30 Jul 2007 16:58:23 +0300
Subject: [ofa-general] reminder: OFED meeting today at 9am PST
Message-ID: <46ADEE7F.2000005@mellanox.co.il>

Hi All,

We will have our bi-weekly OFED meeting today at 9am PST

Agenda:
- Status update
- Bugzilla cleanup

If you have more agenda items please send them

Tziporet

From jim at mellanox.com Mon Jul 30 07:08:26 2007
From: jim at mellanox.com (Jim Mott)
Date: Mon, 30 Jul 2007 07:08:26 -0700
Subject: [ofa-general] [PATCH V1 1/1] sdplib: fix error return information
In-Reply-To: <46ADDFE6.9000609@voltaire.com>
References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com>
Message-ID:

Hi,

I am the new maintainer of SDP and have almost figured out what that means. It is time to start submitting code changes for public review. Please send comments on both the content and the style of these notices.

Fix various improper error indications returned by libsdp.so. Most of the problems were found by unit tests and the rest by inspection looking for similar coding issues.
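
The convention the fix moves toward, as far as it can be read from the diff that follows, is the usual sockets one: a failing call returns -1 and reports the reason through errno, instead of returning an errno value (or 1) directly. A minimal sketch of that pattern, assuming hypothetical helper names (check_sockaddr and wrapped_call are illustrations, not libsdp functions):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical validation helper: 0 on success, -1 with errno set on failure. */
static int check_sockaddr(const struct sockaddr *addr, socklen_t addrlen)
{
    if (addr == NULL) {
        errno = EINVAL;          /* bad argument */
        return -1;
    }
    if (addr->sa_family == AF_INET) {
        if (addrlen < sizeof(struct sockaddr_in)) {
            errno = EINVAL;      /* address too short for IPv4 */
            return -1;
        }
        return 0;
    }
    errno = EAFNOSUPPORT;        /* unknown address family */
    return -1;
}

/* Hypothetical wrapper: propagates the failure without touching errno. */
static int wrapped_call(const struct sockaddr *addr, socklen_t addrlen)
{
    if (check_sockaddr(addr, addrlen) != 0)
        return -1;               /* errno was already set by the helper */
    return 0;
}
```

A caller can then treat the wrapper like any other socket call: test for -1 and inspect errno, rather than comparing the return value against error codes.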
Diff from OFED 1.2: Index: ofa_user/src/userspace/libsdp/src/port.c =================================================================== --- ofa_user.orig/src/userspace/libsdp/src/port.c 2007-07-16 23:51:00.000000000 +0300 +++ ofa_user/src/userspace/libsdp/src/port.c 2007-07-18 23:43:08.000000000 +0300 @@ -418,11 +418,13 @@ __sdp_sockaddr_to_sdp( if ( !addr_in ) { __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "provided NULL input pointer\n" ); + errno = EINVAL; return -1; } if ( !addr_out ) { __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "provided NULL output pointer\n" ); + errno = EINVAL; return -1; } @@ -432,6 +434,7 @@ __sdp_sockaddr_to_sdp( __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "provided address length:%d < IPv4 length %d\n", addrlen, sizeof( struct sockaddr_in ) ); + errno = EINVAL; return -1; } @@ -443,6 +446,7 @@ __sdp_sockaddr_to_sdp( __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "provided address length:%d < IPv6 length %d\n", addrlen, IPV6_ADDR_IN_MIN_LEN ); + errno = EINVAL; return -1; } @@ -450,6 +454,7 @@ __sdp_sockaddr_to_sdp( if ( !is_ipv4_embedded_in_ipv6( sin6 ) ) { __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "Given IPv6 address not an embedded IPv4\n" ); + errno = EINVAL; return -1; } memset( addr_out, 0, sizeof( struct sockaddr_in ) ); @@ -490,7 +495,8 @@ __sdp_sockaddr_to_sdp( } else { __sdp_log( 9, "Error __sdp_sockaddr_to_sdp: " "address family <%d> is unknown\n", sin->sin_family ); - return 1; + errno = EAFNOSUPPORT; + return -1; } return 0; @@ -1270,7 +1276,7 @@ bind( if ( __sdp_sockaddr_to_sdp( my_addr, addrlen, &sdp_addr, &was_ipv6 ) ) { __sdp_log( 9, "Error bind: failed to convert address:%s for SDP\n", buf ); - ret = EADDRNOTAVAIL; + ret = -1; goto done; } #ifndef SDP_SUPPORTS_IPv6 @@ -1305,6 +1311,7 @@ bind( __sdp_log( 9, "BIND: Failed to find common free port\n" ); /* We cannot bind both tcp and sdp on the same port, we will close * the sdp and continue with tcp only */ + goto done; } else { /* copy the port to the tmp 
address */ set_addr_port_num( ( struct sockaddr * )&tmp_my_addr, port ); @@ -1454,7 +1461,7 @@ connect( __sdp_log( 9, "Error connect: " "failed to convert address:%s to SDP\n", buf ); - ret = EADDRNOTAVAIL; + ret = -1; goto done; } #ifndef SDP_SUPPORTS_IPv6 @@ -1485,7 +1492,7 @@ connect( if ( __sdp_sockaddr_to_sdp( serv_addr, addrlen, sdp_sin, &was_ipv6 ) ) { __sdp_log( 9, "Error connect: " "failed to convert to shadow address:%s to SDP\n", buf ); - ret = EADDRNOTAVAIL; + ret = -1; } else { #ifndef SDP_SUPPORTS_IPv6 if ( was_ipv6 ) @@ -1590,7 +1597,8 @@ listen( getsockname( fd, ( struct sockaddr * )&tmp_sin, &tmp_sinlen ) < 0 ) { __sdp_log( 9, "Error listen: getsockname return <%d> for TCP socket\n", errno ); - sret = EADDRNOTAVAIL; + errno = EADDRNOTAVAIL; + sret = -1; goto done; } @@ -1623,7 +1631,7 @@ listen( tmp_sinlen, sdp_sin, &was_ipv6 ) ) { __sdp_log( 9, "Error listen: " "failed to convert to address:%s to SDP\n", buf ); - ret = EOPNOTSUPP; + ret = -1; } else { #ifndef SDP_SUPPORTS_IPv6 if ( was_ipv6 ) From hnguyen at linux.vnet.ibm.com Mon Jul 30 08:02:59 2007 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Mon, 30 Jul 2007 17:02:59 +0200 Subject: [ofa-general] Re: [PATCH 2/5] ehca: Generate event when SRQ limit reached In-Reply-To: References: <200707201602.19142.hnguyen@linux.vnet.ibm.com> Message-ID: <200707301703.00111.hnguyen@linux.vnet.ibm.com> Hi, > BTW, does your SRQ-capable hardware support generating the "last WQE > reached" event? There's not any reliable way to avoid problems when > destroying QPs attached to an SRQ without it, and the IB spec requires > CAs that support SRQs to generate it (o11-5.2.5 in chapter 11 of vol 1). > > I don't see any code in ehca to generate the event, and IPoIB CM at > least will be very unhappy when using SRQs if the event is not > generated. Thanks for this good catch. We're investigating how to implement this. Will keep you updated. 
Regards Nam From arthur.jones at qlogic.com Mon Jul 30 08:06:00 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 30 Jul 2007 08:06:00 -0700 Subject: [ofa-general] [PATCH 1/4] IB/ipath - Remove unsafe fastrcvint code from interrupt handler In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070730150600.19920.61255.stgit@eng-46.internal.keyresearch.com> From: Dave Olson The fastrcvint code's purpose was to avoid reading the interrupt status if kernel packets were in the receive queue (to reduce overhead). Because intstatus was not read, we could miss the error interrupt bit indicating freeze mode, since it only delivers a single interrupt, even if still pending after intclear is written. This patch removes that optimization. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_common.h | 3 +-- drivers/infiniband/hw/ipath/ipath_intr.c | 31 ---------------------------- 2 files changed, 1 insertions(+), 33 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h index b4b786d..6ad822c 100644 --- a/drivers/infiniband/hw/ipath/ipath_common.h +++ b/drivers/infiniband/hw/ipath/ipath_common.h @@ -100,8 +100,7 @@ struct infinipath_stats { __u64 sps_hwerrs; /* number of times IB link changed state unexpectedly */ __u64 sps_iblink; - /* kernel receive interrupts that didn't read intstat */ - __u64 sps_fastrcvint; + __u64 sps_unused; /* was fastrcvint, no longer implemented */ /* number of kernel (port0) packets received */ __u64 sps_port0pkts; /* number of "ethernet" packets sent by driver */ diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 1fd91c5..9b03154 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -1035,36 +1035,6 @@ irqreturn_t ipath_intr(int irq, void *data) 
goto bail; } - /* - * We try to avoid reading the interrupt status register, since - * that's a PIO read, and stalls the processor for up to about - * ~0.25 usec. The idea is that if we processed a port0 packet, - * we blindly clear the port 0 receive interrupt bits, and nothing - * else, then return. If other interrupts are pending, the chip - * will re-interrupt us as soon as we write the intclear register. - * We then won't process any more kernel packets (if not the 2nd - * time, then the 3rd or 4th) and we'll then handle the other - * interrupts. We clear the interrupts first so that we don't - * lose intr for later packets that arrive while we are processing. - */ - oldhead = dd->ipath_port0head; - curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - if (oldhead != curtail) { - if (dd->ipath_flags & IPATH_GPIO_INTR) { - ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, - (u64) (1 << IPATH_GPIO_PORT0_BIT)); - istat = port0rbits | INFINIPATH_I_GPIO; - } - else - istat = port0rbits; - ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); - ipath_kreceive(dd); - if (oldhead != dd->ipath_port0head) { - ipath_stats.sps_fastrcvint++; - goto done; - } - } - istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); if (unlikely(!istat)) { @@ -1225,7 +1195,6 @@ irqreturn_t ipath_intr(int irq, void *data) handle_layer_pioavail(dd); } -done: ret = IRQ_HANDLED; bail: From arthur.jones at qlogic.com Mon Jul 30 08:05:55 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 30 Jul 2007 08:05:55 -0700 Subject: [ofa-general] [PATCH] IB/ipath -- bug fixes in for-roland Message-ID: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> hi roland, welcome back -- to help you feel wanted, here are the latest set of fixes for 2.6.23. 
these changes are avail via git pull from: git://git.qlogic.com/ipath-linux-2.6 for-roland arthur From arthur.jones at qlogic.com Mon Jul 30 08:06:05 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 30 Jul 2007 08:06:05 -0700 Subject: [ofa-general] [PATCH 2/4] IB/ipath - use faster put_tid_2 routine after initialization In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070730150605.19920.66997.stgit@eng-46.internal.keyresearch.com> From: Dave Olson At some point the ipath_minrev field was initialized prior to the ipath_init_iba6120_funcs call, but that is no longer the case, so the slower put_tid routine was always being used. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_iba6120.c | 20 +++++++++++++------- 1 files changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 9868ccd..5b6ac9a 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -321,6 +321,8 @@ static const struct ipath_hwerror_msgs ipath_6120_hwerror_msgs[] = { << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) static int ipath_pe_txe_recover(struct ipath_devdata *); +static void ipath_pe_put_tid_2(struct ipath_devdata *, u64 __iomem *, + u32, unsigned long); /** * ipath_pe_handle_hwerrors - display hardware errors. 
@@ -555,8 +557,11 @@ static int ipath_pe_boardname(struct ipath_devdata *dd, char *name, ipath_dev_err(dd, "Unsupported InfiniPath hardware revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; - } else + } else { ret = 0; + if (dd->ipath_minrev >= 2) + dd->ipath_f_put_tid = ipath_pe_put_tid_2; + } return ret; } @@ -1220,7 +1225,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port) port * dd->ipath_rcvtidcnt * sizeof(*tidbase)); for (i = 0; i < dd->ipath_rcvtidcnt; i++) - ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED, + dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED, tidinv); tidbase = (u64 __iomem *) @@ -1229,7 +1234,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port) port * dd->ipath_rcvegrcnt * sizeof(*tidbase)); for (i = 0; i < dd->ipath_rcvegrcnt; i++) - ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER, + dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER, tidinv); } @@ -1395,10 +1400,11 @@ void ipath_init_iba6120_funcs(struct ipath_devdata *dd) dd->ipath_f_quiet_serdes = ipath_pe_quiet_serdes; dd->ipath_f_bringup_serdes = ipath_pe_bringup_serdes; dd->ipath_f_clear_tids = ipath_pe_clear_tids; - if (dd->ipath_minrev >= 2) - dd->ipath_f_put_tid = ipath_pe_put_tid_2; - else - dd->ipath_f_put_tid = ipath_pe_put_tid; + /* + * this may get changed after we read the chip revision, + * but we start with the safe version for all revs + */ + dd->ipath_f_put_tid = ipath_pe_put_tid; dd->ipath_f_cleanup = ipath_setup_pe_cleanup; dd->ipath_f_setextled = ipath_setup_pe_setextled; dd->ipath_f_get_base_info = ipath_pe_get_base_info; From arthur.jones at qlogic.com Mon Jul 30 08:06:10 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 30 Jul 2007 08:06:10 -0700 Subject: [ofa-general] [PATCH 3/4] IB/ipath - Fix some issues with buffer cancel and sendctrl register update In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> References: 
<20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070730150610.19920.74815.stgit@eng-46.internal.keyresearch.com> From: Dave Olson There was confused use between INFINIPATH_S_PIOBUFAVAILUPD (value) and IPATH_S_PIOBUFAVAILUPD (bit position). Also, some callers of ipath_cancel_sends() need kr_sendctrl restored, and some want to do it later. Signed-off-by: Dave Olson --- drivers/infiniband/hw/ipath/ipath_driver.c | 11 +++++++---- drivers/infiniband/hw/ipath/ipath_init_chip.c | 2 +- drivers/infiniband/hw/ipath/ipath_intr.c | 6 +++--- drivers/infiniband/hw/ipath/ipath_kernel.h | 2 +- 4 files changed, 12 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 09c5fd8..6ccba36 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -740,7 +740,7 @@ void ipath_disarm_piobufs(struct ipath_devdata *dd, unsigned first, * pioavail updates to memory to stop. */ ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - sendorig & ~IPATH_S_PIOBUFAVAILUPD); + sendorig & ~INFINIPATH_S_PIOBUFAVAILUPD); sendorig = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); @@ -1614,7 +1614,7 @@ int ipath_waitfor_mdio_cmdready(struct ipath_devdata *dd) * it's safer to always do it. * PIOAvail bits are updated by the chip as if normal send had happened. 
*/ -void ipath_cancel_sends(struct ipath_devdata *dd) +void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) { ipath_dbg("Cancelling all in-progress send buffers\n"); dd->ipath_lastcancel = jiffies+HZ/2; /* skip armlaunch errs a bit */ @@ -1627,6 +1627,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_disarm_piobufs(dd, 0, (unsigned)(dd->ipath_piobcnt2k + dd->ipath_piobcnt4k)); + if (restore_sendctrl) /* else done by caller later */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); /* and again, be sure all have hit the chip */ ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); @@ -1655,7 +1658,7 @@ static void ipath_set_ib_lstate(struct ipath_devdata *dd, int which) /* flush all queued sends when going to DOWN or INIT, to be sure that * they don't block MAD packets */ if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT) - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 1); ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, dd->ipath_ibcctrl | which); @@ -2000,7 +2003,7 @@ void ipath_shutdown_device(struct ipath_devdata *dd) ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_DISABLE << INFINIPATH_IBCC_LINKINITCMD_SHIFT); - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 49951d5..71e6c9d 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -782,7 +782,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) * Follows early_init because some chips have to initialize * PIO buffers in early_init to avoid false parity errors. 
*/ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); /* early_init sets rcvhdrentsize and rcvhdrsize, so this must be * done after early_init */ diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 9b03154..a5b3e7e 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -303,7 +303,7 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, * Flush all queued sends when link went to DOWN or INIT, * to be sure that they don't block SMA and other MAD packets */ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 1); } else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || lstate == IPATH_IBSTATE_ACTIVE) { @@ -799,13 +799,13 @@ void ipath_clear_freeze(struct ipath_devdata *dd) * therefore would not be sent, and eventually * might cause the process to run out of bufs */ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); ipath_write_kreg(dd, dd->ipath_kregs->kr_control, dd->ipath_control); /* ensure pio avail updates continue */ ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD); + dd->ipath_sendctrl & ~INFINIPATH_S_PIOBUFAVAILUPD); ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index ace63ef..ef77329 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -683,7 +683,7 @@ int ipath_unordered_wc(void); void ipath_disarm_piobufs(struct ipath_devdata *, unsigned first, unsigned cnt); -void ipath_cancel_sends(struct ipath_devdata *); +void ipath_cancel_sends(struct ipath_devdata *, int); int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *); void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *); From arthur.jones at qlogic.com Mon Jul 
30 08:06:15 2007 From: arthur.jones at qlogic.com (Arthur Jones) Date: Mon, 30 Jul 2007 08:06:15 -0700 Subject: [ofa-general] [PATCH 4/4] IB/ipath - Workaround problem of errormask register being overwritten In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> Message-ID: <20070730150615.19920.44705.stgit@eng-46.internal.keyresearch.com> From: Dave Olson On some system hardware, we are seeing moderately common cases of the chip errormask register being overwritten due to a chip bug in iba6120 that is triggered by a vendor specific PCIe broadcast message. This patch merely checks periodically, and corrects it if needed (the overwrite can cause us to not get error and hardware error interrupts). Also, make dd->ipath_errormask the one, true canonical source for kr_errormask, and remove references to ipath_ignorederrs as it is currently unused. Signed-off-by: Dave Olson Signed-off-by: John Gregor --- drivers/infiniband/hw/ipath/ipath_init_chip.c | 5 +- drivers/infiniband/hw/ipath/ipath_intr.c | 25 ++++++----- drivers/infiniband/hw/ipath/ipath_kernel.h | 11 +---- drivers/infiniband/hw/ipath/ipath_stats.c | 55 ++++++++++++++++++++++--- 4 files changed, 67 insertions(+), 29 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 71e6c9d..9dd0bac 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -851,13 +851,14 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrmask, dd->ipath_hwerrmask); - dd->ipath_maskederrs = dd->ipath_ignorederrs; /* clear all */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, -1LL); /* enable errors that are masked, at least this first time. 
*/ ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, ~dd->ipath_maskederrs); - /* clear any interrups up to this point (ints still not enabled) */ + dd->ipath_errormask = ipath_read_kreg64(dd, + dd->ipath_kregs->kr_errormask); + /* clear any interrupts up to this point (ints still not enabled) */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL); /* diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index a5b3e7e..6480465 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -517,10 +517,7 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) supp_msgs = handle_frequent_errors(dd, errs, msg, &noprint); - /* - * don't report errors that are masked (includes those always - * ignored) - */ + /* don't report errors that are masked */ errs &= ~dd->ipath_maskederrs; /* do these first, they are most important */ @@ -566,19 +563,19 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) * ones on this particular interrupt, which also isn't great */ dd->ipath_maskederrs |= dd->ipath_lasterror | errs; + dd->ipath_errormask &= ~dd->ipath_maskederrs; ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); s_iserr = ipath_decode_err(msg, sizeof msg, - (dd->ipath_maskederrs & ~dd-> - ipath_ignorederrs)); + dd->ipath_maskederrs); - if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) & + if (dd->ipath_maskederrs & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS)) ipath_dev_err(dd, "Temporarily disabling " "error(s) %llx reporting; too frequent (%s)\n", - (unsigned long long) (dd->ipath_maskederrs & - ~dd->ipath_ignorederrs), msg); + (unsigned long long)dd->ipath_maskederrs, + msg); else { /* * rcvegrfull and rcvhdrqfull are "normal", @@ -793,6 +790,9 @@ void ipath_clear_freeze(struct ipath_devdata *dd) /* disable error interrupts, to avoid confusion */ 
ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL); + /* also disable interrupts; errormask is sometimes overwriten */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL); + /* * clear all sends, because they have may been * completed by usercode while in freeze mode, and @@ -817,7 +817,7 @@ void ipath_clear_freeze(struct ipath_devdata *dd) for (i = 0; i < dd->ipath_pioavregs; i++) { /* deal with 6110 chip bug */ im = i > 3 ? ((i&1) ? i-1 : i+1) : i; - val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64))); + val = ipath_read_kreg64(dd, (0x1000/sizeof(u64))+im); dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i] = le64_to_cpu(val); } @@ -832,7 +832,8 @@ void ipath_clear_freeze(struct ipath_devdata *dd) ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, E_SPKT_ERRS_IGNORE); ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); + ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, -1LL); ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL); } diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index ef77329..7a7966f 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -261,18 +261,10 @@ struct ipath_devdata { * limiting of hwerror reporting */ ipath_err_t ipath_lasthwerror; - /* - * errors masked because they occur too fast, also includes errors - * that are always ignored (ipath_ignorederrs) - */ + /* errors masked because they occur too fast */ ipath_err_t ipath_maskederrs; /* time in jiffies at which to re-enable maskederrs */ unsigned long ipath_unmasktime; - /* - * errors always ignored (masked), at least for a given - * chip/device, because they are wrong or not useful - */ - ipath_err_t ipath_ignorederrs; /* count of egrfull errors, combined for all ports */ u64 ipath_last_tidfull; /* for ipath_qcheck() */ @@ -436,6 +428,7 @@ struct ipath_devdata { u64 ipath_lastibcstat; /* hwerrmask 
shadow */ ipath_err_t ipath_hwerrmask; + ipath_err_t ipath_errormask; /* errormask shadow */ /* interrupt config reg shadow */ u64 ipath_intconfig; /* kr_sendpiobufbase value */ diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c index 73ed17d..7338312 100644 --- a/drivers/infiniband/hw/ipath/ipath_stats.c +++ b/drivers/infiniband/hw/ipath/ipath_stats.c @@ -196,6 +196,46 @@ static void ipath_qcheck(struct ipath_devdata *dd) } } + +static void ipath_chk_errormask(struct ipath_devdata *dd) +{ + static u32 fixed; + u32 ctrl; + unsigned long errormask; + unsigned long hwerrs; + + if (!dd->ipath_errormask || !(dd->ipath_flags & IPATH_INITTED)) + return; + + errormask = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errormask); + + if (errormask == dd->ipath_errormask) + return; + fixed++; + + hwerrs = ipath_read_kreg64(dd, dd->ipath_kregs->kr_hwerrstatus); + ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control); + + ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, + dd->ipath_errormask); + + if ((hwerrs & dd->ipath_hwerrmask) || + (ctrl & INFINIPATH_C_FREEZEMODE)) { + /* force re-interrupt of pending events, just in case */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL); + ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, 0ULL); + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL); + dev_info(&dd->pcidev->dev, + "errormask fixed(%u) %lx -> %lx, ctrl %x hwerr %lx\n", + fixed, errormask, (unsigned long)dd->ipath_errormask, + ctrl, hwerrs); + } else + ipath_dbg("errormask fixed(%u) %lx -> %lx, no freeze\n", + fixed, errormask, + (unsigned long)dd->ipath_errormask); +} + + /** * ipath_get_faststats - get word counters from chip before they overflow * @opaque - contains a pointer to the infinipath device ipath_devdata @@ -251,14 +291,13 @@ void ipath_get_faststats(unsigned long opaque) dd->ipath_lasterror = 0; if (dd->ipath_lasthwerror) dd->ipath_lasthwerror = 0; - if ((dd->ipath_maskederrs & 
~dd->ipath_ignorederrs) + if (dd->ipath_maskederrs && time_after(jiffies, dd->ipath_unmasktime)) { char ebuf[256]; int iserr; iserr = ipath_decode_err(ebuf, sizeof ebuf, - (dd->ipath_maskederrs & ~dd-> - ipath_ignorederrs)); - if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) & + dd->ipath_maskederrs); + if (dd->ipath_maskederrs & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS )) ipath_dev_err(dd, "Re-enabling masked errors " @@ -278,9 +317,12 @@ void ipath_get_faststats(unsigned long opaque) ipath_cdbg(ERRPKT, "Re-enabling packet" " problem interrupt (%s)\n", ebuf); } - dd->ipath_maskederrs = dd->ipath_ignorederrs; + + /* re-enable masked errors */ + dd->ipath_errormask |= dd->ipath_maskederrs; ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); + dd->ipath_maskederrs = 0; } /* limit qfull messages to ~one per minute per port */ @@ -294,6 +336,7 @@ void ipath_get_faststats(unsigned long opaque) } } + ipath_chk_errormask(dd); done: mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); } From kenjeffries at storagegear.com Mon Jul 30 08:30:18 2007 From: kenjeffries at storagegear.com (Ken Jeffries) Date: Mon, 30 Jul 2007 10:30:18 -0500 Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance with Modified Write Protocol Message-ID: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip> We have been doing a fair amount of performance testing on our SRP target. One thing we found early on was that client writes were considerably slower than client reads. We addressed this by patching the SRP client code so that it could include the client write data in the SRP CMD IU if it would fit. This notion is in iSER but is not in standard SRP. Architecturally, the capability is signaled using an additional data buffer format bit. We find that client write performance is considerably improved by using this capability. 
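The inline-write idea described above can be sketched in C. This is only an illustration under stated assumptions. It is not the StorageGear patch and not taken from the SRP specification: the flag value EX_FMT_IU_DATA, the struct ex_iu_limits, and both function names are hypothetical. The sketch shows the decision the message describes: carry the write data in the command IU when it fits after the header, and signal that with an extra buffer-format bit; otherwise leave the format unchanged and send a standard write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "write data is carried in the IU" format bit.
 * The real bit used by the patch is not given in the message. */
#define EX_FMT_IU_DATA 0x4u

/* Assumed per-connection limits, named here only for illustration. */
struct ex_iu_limits {
	size_t max_iu_len;	/* largest command IU the target accepts */
	size_t hdr_len;		/* bytes taken by the command header */
};

/* Returns true when write_len bytes fit in the IU after the header. */
bool ex_write_fits_in_iu(const struct ex_iu_limits *lim, size_t write_len)
{
	assert(lim != NULL);
	if (lim->hdr_len > lim->max_iu_len)
		return false;
	return write_len <= lim->max_iu_len - lim->hdr_len;
}

/* Chooses the buffer format: adds the inline bit when the data fits,
 * otherwise returns the standard format untouched. */
unsigned ex_choose_write_fmt(const struct ex_iu_limits *lim,
			     size_t write_len, unsigned std_fmt)
{
	if (ex_write_fits_in_iu(lim, write_len))
		return std_fmt | EX_FMT_IU_DATA;
	return std_fmt;
}
```

With an assumed 8192-byte IU limit and a 64-byte header, a 4KB write would be sent inline and a 16KB write would fall back to the standard path.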
We are calling SRP spec compliant writes "standard writes" and our modified writes "iu data writes".

We also implemented a similar capability for client reads but on our system we did not see a performance improvement.

We would like to know if other SRP'rs would be interested in us making the patch available for either inclusion or for discussion. Since we did this without input from anyone else we are not going to claim that the way we did it is necessarily the best way to do it.

Below are some of our performance numbers, preceded by a description of our test setup.

The StorageGear SRP Solid State Disk System is an asymmetrical embedded system based on proprietary firmware and a Supermicro X7DBi+ motherboard with two 2.00GHz Woodcrest processors (four cpus altogether). The system used in this test includes two Mellanox sdr pci-e hcas in 8x slots. Four independent SSDs (SRP0, SRP1, ...) are configured. SRP0 is made visible on the first hca port, SRP1 is made visible on the second hca port and so on. Each hca is statically associated with a cpu at boot time. The native block size of each ssd is 4KB. The native block size can be configured to be from 512B to 64KB. We suspect that 4KB is best for Linux applications.

"testy" is a small client program that uses Linux asynchronous i/o and O_DIRECT to drive read and write requests as quickly as possible. It tries to keep a specified number of reads or writes of specified size outstanding for a specified time. testy was written because available tools were not able to load the StorageGear target sufficiently. All testy io is random. For an SSD, random io performance should be the same as sequential so we don't look at sequential performance at all.

The SRP clients, Tesla and Newton, used in the tests have Asus A8N32-SLI Deluxe motherboards, each with an AMD 1.8GHz Dual Core Opteron 165 processor, 1GB ram, 2 Mellanox sdr pci-e 8x hcas in 16x slots running OFED-1.2 with SRP on SUSE Linux Enterprise Server 10 (x86_64).
Tesla runs kernel 2.6.16.27-0.9-smp and Newton runs kernel 2.6.16.21-0.8-smp.

Two Mellanox MTEK 43132 8-port 4x switches are used to implement two subnets. SMs for each subnet are provided by separate systems.

For these tests, four testys are run, two per client, one per srp target. The paths are arranged thru visibility and allow/deny configuration to use all four client ports and all four srp target ports.

We monitor our target cpu utilization and we know that the maximum number of "small" iops for a particular hca is reached when the cpu associated with the hca reaches 100% utilization.

All numbers are 90 second testy run averages.

4KB Random Standard Reads

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
newton.0  srp0    30636
newton.1  srp1    30682        hca0        61318
tesla.0   srp2    30680
tesla.1   srp3    30710        hca1        61390

4KB Random Standard Writes

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
newton.0  srp0    25201
newton.1  srp1    25291        hca0        50492
tesla.0   srp2    25412
tesla.1   srp3    25441        hca1        50853

4KB Random IU Data Writes

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
newton.0  srp0    31993
newton.1  srp1    32526        hca0        64519
tesla.0   srp2    32172
tesla.1   srp3    32594        hca1        64766

-

64KB Random Standard Reads

testy     target  target mbps  target hca  hca mbps
-----     ------  -----------  ----------  --------
newton.0  srp0    681.2
newton.1  srp1    681.2        hca0        1362.4
tesla.0   srp2    680.1
tesla.1   srp3    680.2        hca1        1360.3

128KB Random Standard Writes

testy     target  target mbps  target hca  hca mbps
-----     ------  -----------  ----------  --------
newton.0  srp0    747.8
newton.1  srp1    739.5        hca0        1487.3
tesla.0   srp2    747.2
tesla.1   srp3    738.7        hca1        1485.9

-

The following tests are one testy to one srp target.

4KB Random Reads

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
tesla     srp3    59289        hca1        59289

4KB Random Standard Writes

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
tesla     srp3    43054        hca1        43054

4KB Random IU Data Writes

testy     target  target iops  target hca  hca iops
-----     ------  -----------  ----------  --------
tesla     srp3    53839        hca1        53839

128 Random Standard Reads

testy     target  target mbps  target hca  hca mbps
-----     ------  -----------  ----------  --------
tesla     srp3    971.9        hca1        971.9

128 Random Standard Writes

testy     target  target mbps  target hca  hca mbps
-----     ------  -----------  ----------  --------
tesla     srp3    881.5        hca1        881.5

We have done some testing with directly connected DDR hcas. The DDR hcas provide an iops boost in the range of 10%.

Ken Jeffries
StorageGear

From hal.rosenstock at gmail.com Mon Jul 30 09:04:40 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 12:04:40 -0400
Subject: [ofa-general] QoS RFC
In-Reply-To: <46A283B6.1070105@dev.mellanox.co.il>
References: <46A283B6.1070105@dev.mellanox.co.il>
Message-ID:

Hi Yevgeny,

On 7/21/07, Yevgeny Kliteynik wrote:
> Hi All
>
> Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
> Your comments are welcome.

A couple of quick questions:

How does this differ from the original RFC posted 5/30/06 ? What I can see is the following:
1. Updated for not yet released IBTA QoS Annex
2. Use of plain text rather than XML based policy file for OpenSM

Anything else ?

Below, IPoIB is discussed in terms of UD. What about IPoIB-CM ? It uses CM and has a service ID.

Also, have my specific comments to the patches originally submitted been addressed ? (Do I need to dig them out again ?) Just wondering...

Thanks.

-- Hal

>
> -- Yevgeny
>
> RFC: OpenFabrics Enhancements for QoS Support
> ===============================================
>
> Authors: .
Eitan Zahavi
> Authors: . Yevgeny Kliteynik
> Date: .... Jul 2007.
> Revision: 0.2
>
> Table of contents:
> 1. Overview
> 2. Architecture
> 3. Supported Policy
> 4. CMA functionality
> 5. IPoIB functionality
> 6. SDP functionality
> 7. SRP functionality
> 8. iSER functionality
> 9. OpenSM functionality
>
> 1. Overview
> ------------
> Quality of Service requirements stem from the realization of I/O consolidation
> over IB network: As multiple applications and ULPs share the same fabric, means
> to control their use of the network resources are becoming a must. The basic
> need is to differentiate the service levels provided to different traffic flows,
> such that a policy could be enforced and control each flow utilization of the
> fabric resources.
>
> IBTA specification defined several hardware features and management interfaces
> to support QoS:
> * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
> * Arbitration between traffic of different VLs is performed by a 2 priority
>   levels weighted round robin arbiter. The arbiter is programmable with
>   a sequence of (VL, weight) pairs and maximal number of high priority credits
>   to be processed before low priority is served
> * Packets carry class of service marking in the range 0 to 15 in their
>   header SL field
> * Each switch can map the incoming packet by its SL to a particular output
>   VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> * The Subnet Administrator controls each communication flow parameters
>   by providing them as a response to Path Record (PR) or MultiPathRecord (MPR)
>   queries
>
> The IB QoS features provide the means to implement a DiffServ like architecture.
> DiffServ architecture (IETF RFC2474 2475) is widely used today in highly dynamic
> fabrics.
>
> This proposal provides the detailed functional definition for the various
> software elements that are required to enable a DiffServ like architecture over
> the OpenFabrics software stack.
>
>
>
> 2.
Architecture
> ----------------
> This proposal split the QoS functionality between the SM/SA, CMA and the various
> ULPS. We take the "chronology approach" to describe how the overall system
> works:
>
> 2.1. The network manager (human) provides a set of rules (policy) that defines
> how the network is being configured and how its resources are split to different
> QoS-Levels. The policy also define how to decide which QoS-Level each
> application or ULP or service use.
>
> 2.2. The SM analyzes the provided policy to see if it is realizable and performs
> the necessary fabric setup. The SM may continuously monitor the policy and adapt
> to changes in it. Part of this policy defines the default QoS-Level of each
> partition. The SA is being enhanced to match the requested Source, Destination,
> QoS-Class, Service-ID (and optionally SL and priority) against the policy. So
> clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also
> enhanced to support setting up partitions with appropriate IPoIB broadcast
> group. This broadcast group carries its QoS attributes: SL, MTU and
> RATE.
>
> 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the
> multicast group which forms the broadcast group of this partition.
>
> 2.4. MPI which provides non IB based connection management should be configured
> to run using hard coded SLs. It uses these SLs for every QP being opened.
>
> 2.5. ULPs that use CM interface (like SRP) should have their own pre-assigned
> Service-ID and use it while obtaining PR/MPR for establishing connections.
> The SA receiving the PR/MPR should match it against the policy and return
> the appropriate PR/MPR including SL, MTU and RATE.
>
> 2.6. ULPs and programs using CMA to establish RC connection should provide the
> CMA the target IP and Service-ID. Some of the ULPs might also provide QoS-Class
> (E.g. for SDP sockets that are provided the TOS socket option).
The CMA should > then use the provided Service-ID and optional QoS-Class and pass them in the > PR/MPR request. The resulting PR/MPR should be used for configuring the > connection QP. > > PathRecord and MultiPathRecord enhancement for QoS: > As mentioned above the PathRecord and MultiPathRecord attributes should be > enhanced to carry the Service-ID which is a 64bit value, which has been > standardized by the IBTA. A new field QoS-Class is also provided. > A new capability bit should describe the SM QoS support in the SA class port > info. This approach provides an easy migration path for existing access layer > and ULPs by not introducing new set of PR/MPR attribute. > > > 3. Supported Policy > -------------------- > > The QoS policy supported by this proposal is divided into 4 sub sections: > > I) Port Group: a set of CAs, Routers or Switches that share the same settings. > A port group might be a partition defined by the partition manager policy in > terms of GUIDs. Future implementations might provide support for NodeDescription > based definition of port groups. > > II) Fabric Setup: > Defines how the SL2VL and VLArb tables should be setup. This policy definition > assumes the computation of overall end to end network behavior should be performed > outside of OpenSM. > > III) QoS-Levels Definition: > This section defines the possible sets of parameters for QoS that a client > might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, > Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS). > > IV) Matching Rules: > A list of rules that match an incoming PR/MPR request to a QoS-Level. The > rules are processed in order such as the first match is applied. Each rule is > built out of a set of match expressions which should all match for the rule to > apply. 
> The matching expressions are defined for the following fields:
> ** SRC and DST to lists of port groups
> ** Service-ID to a list of Service-ID or Service-ID ranges
> ** QoS-Class to a list of QoS-Class values or ranges
>
> QoS Policy file syntax
>
> * Empty lines are ignored
> * Leading and trailing blanks are ignored, so the indentation in the example
>   is just for better readability
> * Comments are started with the pound sign (#) and terminated by EOL
> * Comments may appear only in a separate line
> * Keywords that denote section/subsection start have matching closing keywords
> * Any keyword should be the first non-blank in the line
>
> QoS Policy file example
>
> # Port Groups define sets of ports to be used later in the settings
> port-groups
>     # using port GUIDs
>     port-group
>         name: Storage
>         # "use" is just a description that is used for logging.
>         # Other than that, it is just a commentary
>         use: our SRP storage targets
>         port-guid: 0x1000000000000001
>         port-guid: 0x1000000000000002
>     end-port-group
>
>     port-group
>         name: Virtual Servers
>         use: node desc and IB port num
>         # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
>         # "hostname" and "CA-num" are compared to the first 2 words of
>         # NodeDescription, and "Pnum" is a port number on that node.
>         port-name: vs1/HCA-1/P1
>         port-name: vs3/HCA-1/P1
>         port-name: vs3/HCA-2/P2
>     end-port-group
>
>     # using partitions defined in the partition policy
>     port-group
>         name: Group for Partition 1
>         use: default settings
>         partition: Part1
>     end-port-group
>
>     # using node types CA|ROUTER|SWITCH
>     port-group
>         name: Routers
>         use: all routers
>         node-type: ROUTER
>     end-port-group
>
> end-port-groups
>
> qos-setup
>
>     # define all types of VLArb tables. The length of the tables should
>     # match the physically supported tables by their target ports
>     vlarb-tables
>         # scope defines the exact ports the VLArb tables apply to
>         vlarb-scope
>             # defining VLArb tables on all the ports that belong to
>             # port group 'Storage', and on all the ports connected
>             # to ports of port group 'Storage'
>             group: Storage
>             # "across" means all the ports that are connected to ports
>             # that belong to the specified port group
>             across: Storage
>             # VLArb table holds VL and weight pairs
>             vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>             vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>             vl-high-limit: 10
>         end-vlarb-scope
>         # There can be several scopes
>     end-vlarb-tables
>
>     sl2vl-tables
>         # Scope defines the exact devices and in/out ports tables apply to.
>         # Note: if the same port is matching several rules the *FIRST* one applies.
>         sl2vl-scope
>             # SL2VL tables are organized as SL2VL(in-port,out-port)
>             # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>             # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>             #
>             # The following example specifies that all the SL2VL tables
>             # entries should be defined for all the ports of group Part1:
>             group: Part1
>             from: *
>             to: *
>             # SL2VL table has to have 16 values at max - one for each SL.
>             # If the user specifies less than 16 values, all the missing
>             # VL values will be implicitly set to 0
>             sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>         end-sl2vl-scope
>
>         sl2vl-scope
>             # "across-to" is a combination of "across" keyword (definition can be found
>             # in VLArb tables section) and "to" keyword.
>             # "across: PortGroupName" refers to all the ports that are connected
>             # to ports that belong to PortGroupName.
>             #
>             # Example of "across-to" usage:
>             # A user has a set of 'special' nodes (e.g. storage nodes), and all
>             # the traffic to these nodes has to get a specific VL.
>             # The solution is to define a port group (e.g. "Storage") that will
>             # include all the ports of these nodes, and then to configure SL2VL
>             # tables on all the switch ports that are connected to the Storage
>             # port group by specifying "across-to: Storage".
>             #
>             across-to: Storage2
>             # Similar to "across-to", "across-from" is a combination of "across"
>             # and "from" keywords
>             across-from: Storage1
>             sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>         end-sl2vl-scope
>     end-sl2vl-tables
>
> end-qos-setup
>
>
> qos-levels
>
>     # the first one is just setting SL
>     qos-level
>         use: for the lowest priority communication
>         sl: 15
>         packet-life: 16
>     end-qos-level
>     # the second sets SL and QoS Class
>     qos-level
>         use: low latency best bandwidth
>         sl: 0
>     end-qos-level
>     # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
>     qos-level
>         use: just an example
>         sl: 0
>         mtu-limit: 1
>         rate-limit: 1
>         packet-life: 12
>         # Path Bits can be used e.g. to provide different routes through the
>         # subnet to a particular port
>         path-bits: 2,4,8-32
>     end-qos-level
>
> end-qos-levels
>
>
> # Match rules are scanned in a first-fit manner (like firewall rules table)
> qos-match-rules
>
>     # matching by single criteria: class (list of values and ranges)
>     qos-match-rule
>         # just a description
>         use: low latency by class 7-9 or 11
>         qos-class: 7-9,11
>         # number of qos-level to apply to the matching PR/MPR
>         qos-level-sn: 1
>     end-qos-match-rule
>     # show matching by destination group AND service-ids
>     qos-match-rule
>         use: Storage targets connection
>         destination: Storage
>         service-id: 22,4719-5000
>         qos-level-sn: 2
>     end-qos-match-rule
>     # show matching by source group only
>     qos-match-rule
>         use: bla bla
>         source: Storage
>         qos-level-sn: 3
>     end-qos-match-rule
>
> end-qos-match-rules
>
>
> 4. IPoIB
> ---------
>
> IPoIB already queries the SA for its broadcast group information.
> The additional functionality required is for IPoIB to provide the broadcast
> group SL, MTU, and RATE in every following PathRecord query performed when a
> new UDAV is needed by IPoIB.
> We could assign a special Service-ID for IPoIB use but since all communication
> on the same IPoIB interface shares the same QoS-Level without the ability to
> differentiate it by target service we can ignore it for simplicity.
>
> 5. CMA features
> ----------------
>
> The CMA interface supports Service-ID through the notion of port space as a
> prefix to the port_num which is part of the sockaddr provided to
> rdma_resolve_addr(). What is missing is the explicit request for a QoS-Class that
> should allow the ULP (like SDP) to propagate a specific request for a class of
> service. A mechanism for providing the QoS-Class is available in the IPv6 address,
> so we could use that address field. Another option is to implement a special
> connection options API for CMA.
>
> The functionality missing from the CMA is the usage of the provided QoS-Class
> and Service-ID in the sent PR/MPR. When a response is obtained it is an existing
> requirement for the CMA to use the PR/MPR from the response in setting up the
> QP address vector.
>
>
> 6. SDP
> -------
>
> SDP uses CMA for building its connections.
> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> holding the remote TCP/IP Port Number to connect to.
> SDP might be provided with the SO_PRIORITY socket option. In that case the value
> provided should be sent to the CMA as the TClass option of that connection.
>
> 7. SRP
> -------
>
> The current SRP implementation uses its own CM callbacks (not CMA). So SRP should
> fill in the Service-ID in the PR/MPR by itself and use that information in
> setting up the QP. The T10 SRP standard defines the SRP Service-ID to be defined
> by the SRP target I/O Controller (but it should also comply with IBTA
> Service-ID rules). Anyway, the Service-ID is reported by the I/O Controller in the
> ServiceEntries DMA attribute and should be used in the PR/MPR if the SA
> reports its ability to handle QoS PR/MPRs.
>
> 8. iSER
> --------
> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> is TBD.
>
>
> 9. OpenSM features
> -------------------
> The QoS related functionality to be provided by OpenSM can be split into two
> main parts:
>
> 9.1. Fabric Setup
> During fabric initialization the SM should parse the policy and apply its
> settings to the discovered fabric elements. The following actions should be
> performed:
> * Parsing of policy
> * Node Group identification. A warning should be provided for each node found
>   but not specified.
> * SL2VL settings should be validated:
>   + A warning will be provided if there are no matching targets for the SL2VL
>     setting statement.
>   + An error message will be printed to the log file if an invalid setting is
>     found. A setting is invalid if it refers to:
>     - Non existing port numbers of the target devices
>     - Unsupported VLs for the target device. In the latter case the mapping to
>       non existing VLs should be replaced with VL15, i.e. packets will be dropped.
> * SL2VL setting is to be performed
> * VL Arbitration table settings should be validated according to the following
>   rules:
>   + A warning will be provided if there are no matching targets for the setting
>     statement
>   + An error will be provided if the port number exceeds the target ports
>   + An error will be generated if the table length exceeds device capabilities
>   + A warning will be generated if the table quotes a VL that is not supported
>     by the target device
> * VL Arbitration tables will be set on the appropriate targets
>
> 9.2. PR/MPR query handling:
> OpenSM should be able to enforce the provided policy on client requests.
> The overall flow for such requests is: first the request is matched against the
> defined match rules such that the target QoS-Level definition is found. Given
> the QoS-Level a path(s) search is performed with the given restrictions imposed
> by that level. The following two sections describe these steps.
>
> How Service-ID is carried in the PathRecord and MultiPathRecord attributes is
> now standardized by the IBTA.
>
>
> 9.2.1. Matching rule search:
> A rule is "matching" a PR/MPR request using the following criteria:
> * Matching rules provide values in a list of either single values or ranges of
>   values. A PR/MPR field is "matching" the rule field if it is explicitly
>   noted in the list of values or is one of the values covered by a range
>   included in the field values list.
> * Only PR/MPR fields that have their component mask bit set should be
>   compared.
> * For a rule to be "matching" a PR/MPR request all the rule fields should be
>   "matching" their PR/MPR fields. Thus a PR/MPR request that does
>   not have the component mask bit set for one of the rule defined fields
>   cannot match that rule.
> * A PR/MPR request that has a component mask bit set for one of the fields
>   that is not defined by the rule can match the rule.
>
> The algorithm to be used for searching for a rule match might be as simple as a
> sequential search through all rules or enhanced for better performance. The
> semantics of every rule field and its matching PR/MPR field are described
> below:
> * Source: the SGID or SLID should be part of this group
> * Destination: the DGID or DLID should be part of this group
> * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>   SM-Key field) is matching any of this rule's Service-IDs
> * TClass: check if the PR/MPR TClass field is matching
>
> 9.2.2. PR/MPR response generation:
> The QoS-Level pointed to by the first rule that matches the PR/MPR request
> should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
> and QoS-Class. A default QoS-Level should be used if no rule is matching the query.
>
> The efficient algorithm for finding paths that meet the QoS-Level criteria is
> beyond the scope of this RFC and left for the implementer to provide. However
> the criteria by which the paths match the QoS-Level are described below:
>
> * SL: The paths found should all use the given SL. To that end the PR/MPR
>   algorithm should traverse the path from source to destination only through
>   ports that carry a valid VL (not VL15) by the SL2VL map (should consider input
>   and output ports and SL).
> * MTU-Limit: The resulting paths' MTU should not exceed the given MTU-Limit
> * Rate-Limit: The resulting paths' RATE should not exceed the given RATE-Limit
>   (rate limit is given in units of link BW = Width*Speed according to IBTA
>   Specification Vol-1 table-205 p-901 l-24).
> * Path-Bits: define the target LID lowest bits (number of bits defined by the
>   target port PortInfo.LMC field). The path should traverse the LFT using the
>   target port LID with the path-bits set.
> * QoS-Class: should be returned in the result PR/MPR. When routing is going to
>   be supported by OpenSM we might use this field in selecting the target
>   router too in a TBD way.
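The matching rule search and the default fallback described in the RFC above
can be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not OpenSM code: the names MatchRule, rule_matches and
find_qos_level are hypothetical, a request is modelled as a dict that holds
only the fields whose component mask bit is set, port groups are modelled as
sets of port GUIDs, and the default level number 0 is an arbitrary placeholder.
The rules mirror the three qos-match-rule entries of the policy file example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Range = Tuple[int, int]


def parse_ranges(text: str) -> List[Range]:
    """Turn a policy list such as "22,4719-5000" into [(22, 22), (4719, 5000)]."""
    ranges = []
    for item in text.split(","):
        low, _, high = item.strip().partition("-")
        ranges.append((int(low, 0), int(high or low, 0)))
    return ranges


def in_ranges(value: int, ranges: List[Range]) -> bool:
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


@dataclass
class MatchRule:
    """Hypothetical in-memory form of one qos-match-rule (None = field not defined)."""
    qos_level_sn: int
    source: Optional[Set[int]] = None
    destination: Optional[Set[int]] = None
    service_ids: Optional[List[Range]] = None
    qos_classes: Optional[List[Range]] = None


def rule_matches(rule: MatchRule, request: dict) -> bool:
    """request holds only the PR/MPR fields whose component mask bit is set."""
    checks = (
        ("source", rule.source, lambda value, group: value in group),
        ("destination", rule.destination, lambda value, group: value in group),
        ("service_id", rule.service_ids, in_ranges),
        ("qos_class", rule.qos_classes, in_ranges),
    )
    for name, allowed, test in checks:
        if allowed is None:
            continue          # field not defined by the rule: request may still match
        if name not in request:
            return False      # rule field whose mask bit is clear: cannot match
        if not test(request[name], allowed):
            return False
    return True


def find_qos_level(rules: List[MatchRule], request: dict, default_sn: int = 0) -> int:
    """Sequential first-match search; default_sn is an assumed placeholder."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule.qos_level_sn
    return default_sn


# Port group "Storage" from the policy example, modelled as a set of port GUIDs.
STORAGE = {0x1000000000000001, 0x1000000000000002}
RULES = [
    MatchRule(qos_level_sn=1, qos_classes=parse_ranges("7-9,11")),
    MatchRule(qos_level_sn=2, destination=STORAGE,
              service_ids=parse_ranges("22,4719-5000")),
    MatchRule(qos_level_sn=3, source=STORAGE),
]

print(find_qos_level(RULES, {"qos_class": 8}))
print(find_qos_level(RULES, {"destination": 0x1000000000000001, "service_id": 4800}))
```

A request that sets only the destination falls through to the default level
here, because the second rule also defines Service-IDs and the request carries
none. A real implementation would resolve SGID/SLID and DGID/DLID to group
membership rather than compare GUIDs directly.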
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Mon Jul 30 09:55:03 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 30 Jul 2007 11:55:03 -0500 Subject: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <46ADEE7F.2000005@mellanox.co.il> References: <46ADEE7F.2000005@mellanox.co.il> Message-ID: <46AE17E7.3020305@opengridcomputing.com> Am I missing the call info? I tried an older conf id, and it didn't work. Can you please post the conf call info along with the meeting notification? Thanks, Steve. Tziporet Koren wrote: > Hi All, > > We will have our bi-weekly OFED meeting today at 9am PST > > Agenda: > - Status update > - Bugzilla cleanup > > If you have more agenda items please send them > > Tziporet > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pauln at psc.edu Mon Jul 30 09:56:31 2007 From: pauln at psc.edu (Paul Nowoczynski) Date: Mon, 30 Jul 2007 12:56:31 -0400 Subject: [ofa-general] SDP kernel Oops. Message-ID: <46AE183F.5090907@psc.edu> Hi, I am wondering if someone could shed some light on this problem? I'm trying use SDP on a kernel socket with limited success. Can someone with working knowledge of SDP please give me some advice? I'm running OFED-1.1. I've looked at diff's for 1.2 but didn't notice anything that looked pertinent to this problem. The bug appears when a second socket instance is invoked. My feeling is that the problem is related to teardown.. 
thanks, paul ----------- [cut here ] --------- [please bite here ] --------- Jul 30 12:39:52 Kernel BUG at sdp_cma:372 invalid operand: 0000 [1] SMP CPU 0 Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_sdp rdma_cm ib_addr iptable_filter ip_tables e1000 ib_srp ib_cm ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md forcedeth Pid: 2578, comm: ib_cm/0 Not tainted 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 RIP: 0010:[] {:ib_sdp:sdp_connect_handler+186} RSP: 0018:000001015721bc08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000010150467040 RCX: 0000000000000000 RDX: 00000000ffffffff RSI: 0000010150467028 RDI: ffffffffa00b8b40 RBP: ffffffffa00b86c0 R08: 00000101542faef8 R09: 00000101542faf08 R10: 00000000ffffffff R11: 0000000000000000 R12: 0000010150467740 R13: 000001015721bd08 R14: 0000000000000000 R15: 0000010150496c00 FS: 0000002a9589db00(0000) GS:ffffffff805a30c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000005b3430 CR3: 0000000000101000 CR4: 00000000000006e0 Process ib_cm/0 (pid: 2578, threadinfo 000001015721a000, task 00000101567a1030) Stack: 000001015721bd08 00000101504679a0 0000010150467740 0000010150496c00 000001015721bd08 0000000000000000 0000000000000000 ffffffffa00b2e80 0000000000000000 0000000000000000 Call Trace:{:ib_sdp:sdp_cma_handler+896} {:ib_core:ib_find_cached_gid+239} {:rdma_cm:cma_notify_user+30} {:rdma_cm:cma_req_handler+851} {:ib_cm:cm_process_work+26} {:ib_cm:cm_req_handler+2307} {:ib_cm:cm_work_handler+0} {:ib_cm:cm_work_handler+66} {__wake_up+67} {:ib_cm:cm_work_handler+0} {worker_thread+496} {default_wake_function+0} {__wake_up_common+64} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 0f 0b 15 5e 0b a0 ff ff ff ff 74 01 65 8b 04 25 34 00 00 00 RIP {:ib_sdp:sdp_connect_handler+186} RSP <000001015721bc08> <0>Kernel panic - not syncing: 
Oops From jsquyres at cisco.com Mon Jul 30 10:03:32 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 30 Jul 2007 13:03:32 -0400 Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <46AE17E7.3020305@opengridcomputing.com> References: <46ADEE7F.2000005@mellanox.co.il> <46AE17E7.3020305@opengridcomputing.com> Message-ID: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> Yes, you missed it; the call was over about half an hour ago. I [re-] posted the dial-in info about 3 hours before the call this morning on the ewg list. On Jul 30, 2007, at 12:55 PM, Steve Wise wrote: > Am I missing the call info? I tried an older conf id, and it > didn't work. Can you please post the conf call info along with the > meeting notification? > > Thanks, > > Steve. > > > Tziporet Koren wrote: >> Hi All, >> We will have our bi-weekly OFED meeting today at 9am PST >> Agenda: >> - Status update >> - Bugzilla cleanup >> If you have more agenda items please send them >> Tziporet >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >> openib-general > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jeff Squyres Cisco Systems From hal.rosenstock at gmail.com Mon Jul 30 10:54:25 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 30 Jul 2007 13:54:25 -0400 Subject: [ofa-general] [PATCH][TIRIVIAL] ibdm/src/osm_check.cpp: Add missing include file Message-ID: ibdm/src/osm_check.cpp: Add missing include file Signed-off-by: Hal Rosenstock diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp index 49215c2..f24eec6 100644 --- a/ibdm/src/osm_check.cpp +++ b/ibdm/src/osm_check.cpp @@ -35,6 +35,7 @@ #include "Fabric.h" #include "SubnMgt.h" 
#include "CredLoops.h" +#include #include #include From pw at osc.edu Mon Jul 30 11:23:40 2007 From: pw at osc.edu (Pete Wyckoff) Date: Mon, 30 Jul 2007 14:23:40 -0400 Subject: [ofa-general] Announcing new open source iSER (iSCSI/RDMA) target Message-ID: <20070730182340.GI12789@osc.edu> We are releasing code to add support for iSCSI Extensions for RDMA (iSER) to the existing STGT user space SCSI target. It uses OpenFabrics libraries and kernel drivers to act as a SCSI target over RDMA-capable devices. The code has been tested against the existing Linux iSER initiator over InfiniBand cards, but should be specification compliant and work generally. A bit of documentation is included, and a short technical report is available at http://www.osc.edu/~pw/papers/iser-techreport.pdf . For performance, a single SCSI client using iSCSI over gigabit ethernet does 100 MB/s. iSCSI over IPoIB gets 200 MB/s, and iSER over native IB sees 500 MB/s. More information on STGT is available at http://stgt.berlios.de . The seven iSER patches can be downloaded from: git://git.osc.edu/tgt or browsed at: http://git.osc.edu/?p=tgt.git;a=summary New and modified files are distributed under a GPLv2 license. I'll submit individual patches to stgt-devel for review and eventual inclusion in STGT. -- Pete From pradeeps at linux.vnet.ibm.com Mon Jul 30 12:07:45 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 30 Jul 2007 12:07:45 -0700 Subject: NOSRQ QP implementation issues (wasRe: [ofa-general] Merge window for 2.6.23 closed) In-Reply-To: References: Message-ID: <46AE3701.40603@linux.vnet.ibm.com> Roland Dreier wrote: > > - IPoIB CM without SRQ. Pradeep, I'm sorry this missed the window > but the patch quality really doesn't look up to par to me, and > your being in a rush to get this merged I think has actually slowed > things up. 
I think the basic idea is OK, but I have doubts about > a static array as a data structure, and MST's comments about not > dealing with remote implementations that send packets on passive > connections looks quite serious as well. I would like to close > this for 2.6.24 so (as above) please let's keep working this and > not wait for the 2.6.24 merge window. > For sending (both on the active and passive side) the skbs are associated with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side) and WRs are posted to receive packets. An skb (for send) is not associated with SQ of the rx_qp. Therefore, no packets are expected to be sent through the rx_qp. In an erroneous case if packets do get sent to the wrong RQ, then they will get dropped as no WQEs are posted. As discussed, an RNR will be returned as expected and a new connection will get established. I still see no issues with this either. If in the future, we do want to use the unused SQ and RQs, then we will have to associate them with corresponding QP at the remote end. This will be work for both the SRQ and non-SRQ case. I do not see any issues. Can you please explain what is missing with this implementation? Pradeep From parks at lanl.gov Mon Jul 30 12:18:43 2007 From: parks at lanl.gov (Parks Fields) Date: Mon, 30 Jul 2007 13:18:43 -0600 Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> References: <46ADEE7F.2000005@mellanox.co.il> <46AE17E7.3020305@opengridcomputing.com> <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> Message-ID: <7.0.1.0.2.20070730131818.02838a90@lanl.gov> At 11:03 AM 7/30/2007, Jeff Squyres wrote: >Yes, you missed it; the call was over about half an hour ago. I >[re-] posted the dial-in info about 3 hours before the call this morning on >the ewg list. I am on the EWG list and didn't see it. :-( >On Jul 30, 2007, at 12:55 PM, Steve Wise wrote: > >>Am I missing the call info? 
I tried an older conf id, and it >>didn't work. Can you please post the conf call info along with the >>meeting notification? >> >>Thanks, >> >>Steve. >> >> >>Tziporet Koren wrote: >>>Hi All, >>>We will have our bi-weekly OFED meeting today at 9am PST >>>Agenda: >>>- Status update >>>- Bugzilla cleanup >>>If you have more agenda items please send them >>>Tziporet >>>_______________________________________________ >>>general mailing list >>>general at lists.openfabrics.org >>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>To unsubscribe, please visit http://openib.org/mailman/listinfo/ >>>openib-general >> >>_______________________________________________ >>ewg mailing list >>ewg at lists.openfabrics.org >>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > >-- >Jeff Squyres >Cisco Systems > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ***** Correspondence ***** This email contains no programmatic content that requires independent ADC review From jimmmott at austin.rr.com Mon Jul 30 12:18:32 2007 From: jimmmott at austin.rr.com (Jim Mott) Date: Mon, 30 Jul 2007 14:18:32 -0500 Subject: [ofa-general] SDP kernel Oops. In-Reply-To: <46AE183F.5090907@psc.edu> References: <46AE183F.5090907@psc.edu> Message-ID: <004201c7d2de$6d1dca30$47595e90$@rr.com> Hi, It appears that this is an illegal instruction (illegal operand) trap in a modified Rhat4U4 kernel. I am not sure about the line number, but perhaps sdp_cma_handler() is processing an RDMA_CM_EVENT_ROUTE_RESOLVED event. A few things might help: 1) Get some debug info If the whole system does not crash, could you collect some debug information from ib_sdp. 
 - dmesg -c (to clear) - echo 1 > /sys/module/ib_sdp/debug_level - Run your app - dmesg > xxx This will collect some flow information that can help. 2) OFED 1.2 I am the new guy here and have not worked with SDP from OFED 1.1. I am looking at the code now, but am much more familiar with 1.2. This problem is in the CM/CMA area and my understanding is that there were quite a few fixes there. Sorry to not be more helpful. JIm
From jsquyres at cisco.com Mon Jul 30 12:25:45 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 30 Jul 2007 15:25:45 -0400 Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <7.0.1.0.2.20070730131818.02838a90@lanl.gov> References: <46ADEE7F.2000005@mellanox.co.il> <46AE17E7.3020305@opengridcomputing.com> <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> <7.0.1.0.2.20070730131818.02838a90@lanl.gov> Message-ID: <37A4EC2D-8A1E-481A-A149-8ED10215AFB3@cisco.com> [shrug] I sent it today at 8:21am US Eastern time: http://lists.openfabrics.org/pipermail/ewg/2007-July/004075.html Maybe check your spam folder? On Jul 30, 2007, at 3:18 PM, Parks Fields wrote: > At 11:03 AM 7/30/2007, Jeff Squyres wrote: >> Yes, you missed it; the call was over about half an hour ago. I >> [re-] posted the dial-in info about 3 hours before the call this >> morning on >> the ewg list. > > > > I am on the EWG list and didn't see it. :-( > > > > >> On Jul 30, 2007, at 12:55 PM, Steve Wise wrote: >> >>> Am I missing the call info? I tried an older conf id, and it >>> didn't work. Can you please post the conf call info along with the >>> meeting notification? >>> >>> Thanks, >>> >>> Steve. 
>>> >>> >>> Tziporet Koren wrote: >>>> Hi All, >>>> We will have our bi-weekly OFED meeting today at 9am PST >>>> Agenda: >>>> - Status update >>>> - Bugzilla cleanup >>>> If you have more agenda items please send them >>>> Tziporet >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >>>> openib-general >>> >>> _______________________________________________ >>> ewg mailing list >>> ewg at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >> >> >> -- >> Jeff Squyres >> Cisco Systems >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >> openib-general > > ***** Correspondence ***** > > This email contains no programmatic content that requires > independent ADC review -- Jeff Squyres Cisco Systems From hal.rosenstock at gmail.com Mon Jul 30 12:54:11 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 30 Jul 2007 15:54:11 -0400 Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Some comment fixes Message-ID: include/iba/ib_types.h: Some comment fixes Signed-off-by: Hal Rosenstock diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index f341a37..358cd62 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -4931,7 +4931,7 @@ ib_port_info_get_mtu_cap( * [in] Pointer to a PortInfo attribute. * * RETURN VALUES -* Returns the LMC value assigned to this port. +* Returns the encooded value for the maximum MTU supported by this port. 
* * NOTES * @@ -4943,7 +4943,7 @@ ib_port_info_get_mtu_cap( * ib_port_info_get_neighbor_mtu * * DESCRIPTION -* Returns the encoded value for the maximum MTU supported by this port. +* Returns the encoded value for the neighbor MTU supported by this port. * * SYNOPSIS */ From rdreier at cisco.com Mon Jul 30 13:11:41 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jul 2007 13:11:41 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipath -- bug fixes in for-roland In-Reply-To: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> (Arthur Jones's message of "Mon, 30 Jul 2007 08:05:55 -0700") References: <20070730150555.19920.69378.stgit@eng-46.internal.keyresearch.com> Message-ID: thanks, applied all 4. From rdreier at cisco.com Mon Jul 30 13:18:42 2007 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 30 Jul 2007 13:18:42 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get some small fixes for 2.6.23: Dave Olson (4): IB/ipath: Remove unsafe fastrcvint code from interrupt handler IB/ipath: Use faster put_tid_2 routine after initialization IB/ipath: Fix some issues with buffer cancel and sendctrl register update IB/ipath: Workaround problem of errormask register being overwritten Hoang-Nam Nguyen (2): IB/ehca: Fix include order to better match kernel style IB/ehca: Move extern declarations from .c files to .h files Jack Morgenstein (1): mlx4_core: Remove kfree() in mlx4_mr_alloc() error flow Roland Dreier (1): IB/mlx4: Whitespace fix Tom Tucker (1): RDMA/amso1100: Initialize the wait_queue_head_t in the c2_qp structure drivers/infiniband/hw/amso1100/c2_qp.c | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 1 + drivers/infiniband/hw/ehca/ehca_mrmw.c | 6 +-- 
drivers/infiniband/hw/ehca/ehca_pd.c | 1 - drivers/infiniband/hw/ehca/hcp_if.c | 1 - drivers/infiniband/hw/ehca/ipz_pt_fn.h | 2 + drivers/infiniband/hw/ipath/ipath_common.h | 3 +- drivers/infiniband/hw/ipath/ipath_driver.c | 11 +++-- drivers/infiniband/hw/ipath/ipath_iba6120.c | 20 +++++--- drivers/infiniband/hw/ipath/ipath_init_chip.c | 7 ++- drivers/infiniband/hw/ipath/ipath_intr.c | 63 ++++++------------------ drivers/infiniband/hw/ipath/ipath_kernel.h | 13 +---- drivers/infiniband/hw/ipath/ipath_stats.c | 54 +++++++++++++++++++-- drivers/infiniband/hw/mlx4/qp.c | 1 - drivers/net/mlx4/mr.c | 15 +----- 15 files changed, 101 insertions(+), 98 deletions(-) diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c index 420c138..01d0786 100644 --- a/drivers/infiniband/hw/amso1100/c2_qp.c +++ b/drivers/infiniband/hw/amso1100/c2_qp.c @@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev, qp->send_sgl_depth = qp_attrs->cap.max_send_sge; qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; + init_waitqueue_head(&qp->wait); /* Initialize the SQ MQ */ q_size = be32_to_cpu(reply->sq_depth); diff --git a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h index 3725aa8..b5e9603 100644 --- a/drivers/infiniband/hw/ehca/ehca_classes.h +++ b/drivers/infiniband/hw/ehca/ehca_classes.h @@ -322,6 +322,7 @@ extern int ehca_static_rate; extern int ehca_port_act_time; extern int ehca_use_hp_mr; extern int ehca_scaling_code; +extern int ehca_mr_largepage; struct ipzu_queue_resp { u32 qe_size; /* queue entry size */ diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index c1b868b..d97eda3 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -40,10 +40,10 @@ * POSSIBILITY OF SUCH DAMAGE. 
*/ -#include - #include +#include + #include "ehca_iverbs.h" #include "ehca_mrmw.h" #include "hcp_if.h" @@ -64,8 +64,6 @@ enum ehca_mr_pgsize { EHCA_MR_PGSIZE16M = 0x1000000L }; -extern int ehca_mr_largepage; - static u32 ehca_encode_hwpage_size(u32 pgsize) { u32 idx = 0; diff --git a/drivers/infiniband/hw/ehca/ehca_pd.c b/drivers/infiniband/hw/ehca/ehca_pd.c index 3dafd7f..43bcf08 100644 --- a/drivers/infiniband/hw/ehca/ehca_pd.c +++ b/drivers/infiniband/hw/ehca/ehca_pd.c @@ -88,7 +88,6 @@ int ehca_dealloc_pd(struct ib_pd *pd) u32 cur_pid = current->tgid; struct ehca_pd *my_pd = container_of(pd, struct ehca_pd, ib_pd); int i, leftovers = 0; - extern struct kmem_cache *small_qp_cache; struct ipz_small_queue_page *page, *tmp; if (my_pd->ib_pd.uobject && my_pd->ib_pd.uobject->context && diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index fdbfebe..24f4541 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -758,7 +758,6 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, const u64 logical_address_of_page, const u64 count) { - extern int ehca_debug_level; u64 ret; if (unlikely(ehca_debug_level >= 2)) { diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.h b/drivers/infiniband/hw/ehca/ipz_pt_fn.h index c6937a0..a801274 100644 --- a/drivers/infiniband/hw/ehca/ipz_pt_fn.h +++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.h @@ -54,6 +54,8 @@ struct ehca_pd; struct ipz_small_queue_page; +extern struct kmem_cache *small_qp_cache; + /* struct generic ehca page */ struct ipz_page { u8 entries[EHCA_PAGESIZE]; diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h index b4b786d..6ad822c 100644 --- a/drivers/infiniband/hw/ipath/ipath_common.h +++ b/drivers/infiniband/hw/ipath/ipath_common.h @@ -100,8 +100,7 @@ struct infinipath_stats { __u64 sps_hwerrs; /* number of times IB link changed state unexpectedly */ __u64 sps_iblink; - /* 
kernel receive interrupts that didn't read intstat */ - __u64 sps_fastrcvint; + __u64 sps_unused; /* was fastrcvint, no longer implemented */ /* number of kernel (port0) packets received */ __u64 sps_port0pkts; /* number of "ethernet" packets sent by driver */ diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 09c5fd8..6ccba36 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -740,7 +740,7 @@ void ipath_disarm_piobufs(struct ipath_devdata *dd, unsigned first, * pioavail updates to memory to stop. */ ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - sendorig & ~IPATH_S_PIOBUFAVAILUPD); + sendorig & ~INFINIPATH_S_PIOBUFAVAILUPD); sendorig = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); @@ -1614,7 +1614,7 @@ int ipath_waitfor_mdio_cmdready(struct ipath_devdata *dd) * it's safer to always do it. * PIOAvail bits are updated by the chip as if normal send had happened. 
*/ -void ipath_cancel_sends(struct ipath_devdata *dd) +void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) { ipath_dbg("Cancelling all in-progress send buffers\n"); dd->ipath_lastcancel = jiffies+HZ/2; /* skip armlaunch errs a bit */ @@ -1627,6 +1627,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_disarm_piobufs(dd, 0, (unsigned)(dd->ipath_piobcnt2k + dd->ipath_piobcnt4k)); + if (restore_sendctrl) /* else done by caller later */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); /* and again, be sure all have hit the chip */ ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); @@ -1655,7 +1658,7 @@ static void ipath_set_ib_lstate(struct ipath_devdata *dd, int which) /* flush all queued sends when going to DOWN or INIT, to be sure that * they don't block MAD packets */ if (!linkcmd || linkcmd == INFINIPATH_IBCC_LINKCMD_INIT) - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 1); ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl, dd->ipath_ibcctrl | which); @@ -2000,7 +2003,7 @@ void ipath_shutdown_device(struct ipath_devdata *dd) ipath_set_ib_lstate(dd, INFINIPATH_IBCC_LINKINITCMD_DISABLE << INFINIPATH_IBCC_LINKINITCMD_SHIFT); - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); /* disable IBC */ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 9868ccd..5b6ac9a 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -321,6 +321,8 @@ static const struct ipath_hwerror_msgs ipath_6120_hwerror_msgs[] = { << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) static int ipath_pe_txe_recover(struct ipath_devdata *); +static void ipath_pe_put_tid_2(struct ipath_devdata *, u64 __iomem *, + u32, unsigned long); /** * ipath_pe_handle_hwerrors - display hardware errors. 
@@ -555,8 +557,11 @@ static int ipath_pe_boardname(struct ipath_devdata *dd, char *name, ipath_dev_err(dd, "Unsupported InfiniPath hardware revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; - } else + } else { ret = 0; + if (dd->ipath_minrev >= 2) + dd->ipath_f_put_tid = ipath_pe_put_tid_2; + } return ret; } @@ -1220,7 +1225,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port) port * dd->ipath_rcvtidcnt * sizeof(*tidbase)); for (i = 0; i < dd->ipath_rcvtidcnt; i++) - ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED, + dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EXPECTED, tidinv); tidbase = (u64 __iomem *) @@ -1229,7 +1234,7 @@ static void ipath_pe_clear_tids(struct ipath_devdata *dd, unsigned port) port * dd->ipath_rcvegrcnt * sizeof(*tidbase)); for (i = 0; i < dd->ipath_rcvegrcnt; i++) - ipath_pe_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER, + dd->ipath_f_put_tid(dd, &tidbase[i], RCVHQ_RCV_TYPE_EAGER, tidinv); } @@ -1395,10 +1400,11 @@ void ipath_init_iba6120_funcs(struct ipath_devdata *dd) dd->ipath_f_quiet_serdes = ipath_pe_quiet_serdes; dd->ipath_f_bringup_serdes = ipath_pe_bringup_serdes; dd->ipath_f_clear_tids = ipath_pe_clear_tids; - if (dd->ipath_minrev >= 2) - dd->ipath_f_put_tid = ipath_pe_put_tid_2; - else - dd->ipath_f_put_tid = ipath_pe_put_tid; + /* + * this may get changed after we read the chip revision, + * but we start with the safe version for all revs + */ + dd->ipath_f_put_tid = ipath_pe_put_tid; dd->ipath_f_cleanup = ipath_setup_pe_cleanup; dd->ipath_f_setextled = ipath_setup_pe_setextled; dd->ipath_f_get_base_info = ipath_pe_get_base_info; diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 49951d5..9dd0bac 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -782,7 +782,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) * Follows early_init because 
some chips have to initialize * PIO buffers in early_init to avoid false parity errors. */ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); /* early_init sets rcvhdrentsize and rcvhdrsize, so this must be * done after early_init */ @@ -851,13 +851,14 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrmask, dd->ipath_hwerrmask); - dd->ipath_maskederrs = dd->ipath_ignorederrs; /* clear all */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, -1LL); /* enable errors that are masked, at least this first time. */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, ~dd->ipath_maskederrs); - /* clear any interrups up to this point (ints still not enabled) */ + dd->ipath_errormask = ipath_read_kreg64(dd, + dd->ipath_kregs->kr_errormask); + /* clear any interrupts up to this point (ints still not enabled) */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, -1LL); /* diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 1fd91c5..b29fe7e 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -303,7 +303,7 @@ static void handle_e_ibstatuschanged(struct ipath_devdata *dd, * Flush all queued sends when link went to DOWN or INIT, * to be sure that they don't block SMA and other MAD packets */ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 1); } else if (lstate == IPATH_IBSTATE_INIT || lstate == IPATH_IBSTATE_ARM || lstate == IPATH_IBSTATE_ACTIVE) { @@ -517,10 +517,7 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) supp_msgs = handle_frequent_errors(dd, errs, msg, &noprint); - /* - * don't report errors that are masked (includes those always - * ignored) - */ + /* don't report errors that are masked */ errs &= ~dd->ipath_maskederrs; /* do these first, they are most important */ @@ -566,19 +563,19 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) * ones on this particular 
interrupt, which also isn't great */ dd->ipath_maskederrs |= dd->ipath_lasterror | errs; + dd->ipath_errormask &= ~dd->ipath_maskederrs; ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); s_iserr = ipath_decode_err(msg, sizeof msg, - (dd->ipath_maskederrs & ~dd-> - ipath_ignorederrs)); + dd->ipath_maskederrs); - if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) & + if (dd->ipath_maskederrs & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS)) ipath_dev_err(dd, "Temporarily disabling " "error(s) %llx reporting; too frequent (%s)\n", - (unsigned long long) (dd->ipath_maskederrs & - ~dd->ipath_ignorederrs), msg); + (unsigned long long)dd->ipath_maskederrs, + msg); else { /* * rcvegrfull and rcvhdrqfull are "normal", @@ -793,19 +790,22 @@ void ipath_clear_freeze(struct ipath_devdata *dd) /* disable error interrupts, to avoid confusion */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL); + /* also disable interrupts; errormask is sometimes overwriten */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL); + /* * clear all sends, because they have may been * completed by usercode while in freeze mode, and * therefore would not be sent, and eventually * might cause the process to run out of bufs */ - ipath_cancel_sends(dd); + ipath_cancel_sends(dd, 0); ipath_write_kreg(dd, dd->ipath_kregs->kr_control, dd->ipath_control); /* ensure pio avail updates continue */ ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl & ~IPATH_S_PIOBUFAVAILUPD); + dd->ipath_sendctrl & ~INFINIPATH_S_PIOBUFAVAILUPD); ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); @@ -817,7 +817,7 @@ void ipath_clear_freeze(struct ipath_devdata *dd) for (i = 0; i < dd->ipath_pioavregs; i++) { /* deal with 6110 chip bug */ im = i > 3 ? ((i&1) ? 
i-1 : i+1) : i; - val = ipath_read_kreg64(dd, 0x1000+(im*sizeof(u64))); + val = ipath_read_kreg64(dd, (0x1000/sizeof(u64))+im); dd->ipath_pioavailregs_dma[i] = dd->ipath_pioavailshadow[i] = le64_to_cpu(val); } @@ -832,7 +832,8 @@ void ipath_clear_freeze(struct ipath_devdata *dd) ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, E_SPKT_ERRS_IGNORE); ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); + ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, -1LL); ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL); } @@ -1002,7 +1003,6 @@ irqreturn_t ipath_intr(int irq, void *data) u32 istat, chk0rcv = 0; ipath_err_t estat = 0; irqreturn_t ret; - u32 oldhead, curtail; static unsigned unexpected = 0; static const u32 port0rbits = (1U<ipath_port0head; - curtail = (u32)le64_to_cpu(*dd->ipath_hdrqtailptr); - if (oldhead != curtail) { - if (dd->ipath_flags & IPATH_GPIO_INTR) { - ipath_write_kreg(dd, dd->ipath_kregs->kr_gpio_clear, - (u64) (1 << IPATH_GPIO_PORT0_BIT)); - istat = port0rbits | INFINIPATH_I_GPIO; - } - else - istat = port0rbits; - ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, istat); - ipath_kreceive(dd); - if (oldhead != dd->ipath_port0head) { - ipath_stats.sps_fastrcvint++; - goto done; - } - } - istat = ipath_read_kreg32(dd, dd->ipath_kregs->kr_intstatus); if (unlikely(!istat)) { @@ -1225,7 +1195,6 @@ irqreturn_t ipath_intr(int irq, void *data) handle_layer_pioavail(dd); } -done: ret = IRQ_HANDLED; bail: diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index ace63ef..7a7966f 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -261,18 +261,10 @@ struct ipath_devdata { * limiting of hwerror reporting */ ipath_err_t ipath_lasthwerror; - /* - * errors masked because they occur too fast, also includes errors - * that are always ignored (ipath_ignorederrs) - */ + /* errors masked because they occur 
too fast */ ipath_err_t ipath_maskederrs; /* time in jiffies at which to re-enable maskederrs */ unsigned long ipath_unmasktime; - /* - * errors always ignored (masked), at least for a given - * chip/device, because they are wrong or not useful - */ - ipath_err_t ipath_ignorederrs; /* count of egrfull errors, combined for all ports */ u64 ipath_last_tidfull; /* for ipath_qcheck() */ @@ -436,6 +428,7 @@ struct ipath_devdata { u64 ipath_lastibcstat; /* hwerrmask shadow */ ipath_err_t ipath_hwerrmask; + ipath_err_t ipath_errormask; /* errormask shadow */ /* interrupt config reg shadow */ u64 ipath_intconfig; /* kr_sendpiobufbase value */ @@ -683,7 +676,7 @@ int ipath_unordered_wc(void); void ipath_disarm_piobufs(struct ipath_devdata *, unsigned first, unsigned cnt); -void ipath_cancel_sends(struct ipath_devdata *); +void ipath_cancel_sends(struct ipath_devdata *, int); int ipath_create_rcvhdrq(struct ipath_devdata *, struct ipath_portdata *); void ipath_free_pddata(struct ipath_devdata *, struct ipath_portdata *); diff --git a/drivers/infiniband/hw/ipath/ipath_stats.c b/drivers/infiniband/hw/ipath/ipath_stats.c index 73ed17d..bae4f56 100644 --- a/drivers/infiniband/hw/ipath/ipath_stats.c +++ b/drivers/infiniband/hw/ipath/ipath_stats.c @@ -196,6 +196,45 @@ static void ipath_qcheck(struct ipath_devdata *dd) } } +static void ipath_chk_errormask(struct ipath_devdata *dd) +{ + static u32 fixed; + u32 ctrl; + unsigned long errormask; + unsigned long hwerrs; + + if (!dd->ipath_errormask || !(dd->ipath_flags & IPATH_INITTED)) + return; + + errormask = ipath_read_kreg64(dd, dd->ipath_kregs->kr_errormask); + + if (errormask == dd->ipath_errormask) + return; + fixed++; + + hwerrs = ipath_read_kreg64(dd, dd->ipath_kregs->kr_hwerrstatus); + ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control); + + ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, + dd->ipath_errormask); + + if ((hwerrs & dd->ipath_hwerrmask) || + (ctrl & INFINIPATH_C_FREEZEMODE)) { + /* force re-interrupt 
of pending events, just in case */ + ipath_write_kreg(dd, dd->ipath_kregs->kr_hwerrclear, 0ULL); + ipath_write_kreg(dd, dd->ipath_kregs->kr_errorclear, 0ULL); + ipath_write_kreg(dd, dd->ipath_kregs->kr_intclear, 0ULL); + dev_info(&dd->pcidev->dev, + "errormask fixed(%u) %lx -> %lx, ctrl %x hwerr %lx\n", + fixed, errormask, (unsigned long)dd->ipath_errormask, + ctrl, hwerrs); + } else + ipath_dbg("errormask fixed(%u) %lx -> %lx, no freeze\n", + fixed, errormask, + (unsigned long)dd->ipath_errormask); +} + + /** * ipath_get_faststats - get word counters from chip before they overflow * @opaque - contains a pointer to the infinipath device ipath_devdata @@ -251,14 +290,13 @@ void ipath_get_faststats(unsigned long opaque) dd->ipath_lasterror = 0; if (dd->ipath_lasthwerror) dd->ipath_lasthwerror = 0; - if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) + if (dd->ipath_maskederrs && time_after(jiffies, dd->ipath_unmasktime)) { char ebuf[256]; int iserr; iserr = ipath_decode_err(ebuf, sizeof ebuf, - (dd->ipath_maskederrs & ~dd-> - ipath_ignorederrs)); - if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) & + dd->ipath_maskederrs); + if (dd->ipath_maskederrs & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | INFINIPATH_E_PKTERRS )) ipath_dev_err(dd, "Re-enabling masked errors " @@ -278,9 +316,12 @@ void ipath_get_faststats(unsigned long opaque) ipath_cdbg(ERRPKT, "Re-enabling packet" " problem interrupt (%s)\n", ebuf); } - dd->ipath_maskederrs = dd->ipath_ignorederrs; + + /* re-enable masked errors */ + dd->ipath_errormask |= dd->ipath_maskederrs; ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, - ~dd->ipath_maskederrs); + dd->ipath_errormask); + dd->ipath_maskederrs = 0; } /* limit qfull messages to ~one per minute per port */ @@ -294,6 +335,7 @@ void ipath_get_faststats(unsigned long opaque) } } + ipath_chk_errormask(dd); done: mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); } diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c 
index f6315df..ba0428d 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1209,7 +1209,6 @@ static void set_datagram_seg(struct mlx4_wqe_datagram_seg *dseg,
 	memcpy(dseg->av, &to_mah(wr->wr.ud.ah)->av, sizeof (struct mlx4_av));
 	dseg->dqpn = cpu_to_be32(wr->wr.ud.remote_qpn);
 	dseg->qkey = cpu_to_be32(wr->wr.ud.remote_qkey);
-
 }
 
 static void set_data_seg(struct mlx4_wqe_data_seg *dseg,
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index d0808fa..5b87183 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -255,10 +255,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 	int err;
 
 	index = mlx4_bitmap_alloc(&priv->mr_table.mpt_bitmap);
-	if (index == -1) {
-		err = -ENOMEM;
-		goto err;
-	}
+	if (index == -1)
+		return -ENOMEM;
 
 	mr->iova = iova;
 	mr->size = size;
@@ -269,15 +267,8 @@ int mlx4_mr_alloc(struct mlx4_dev *dev, u32 pd, u64 iova, u64 size, u32 access,
 
 	err = mlx4_mtt_init(dev, npages, page_shift, &mr->mtt);
 	if (err)
-		goto err_index;
-
-	return 0;
-
-err_index:
-	mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
+		mlx4_bitmap_free(&priv->mr_table.mpt_bitmap, index);
 
-err:
-	kfree(mr);
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx4_mr_alloc);

From rdreier at cisco.com Mon Jul 30 13:20:21 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:20:21 -0700
Subject: [ofa-general] Re: [PATCH 2.6.23 1/2] Make the iw_cxgb3 module parameters writable.
In-Reply-To: <20070729201226.31659.85900.stgit@dell3.ogc.int> (Steve Wise's message of "Sun, 29 Jul 2007 15:12:26 -0500")
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
Message-ID:

ugh, missed these before my last merge...

anyway: why do we want to make the parameters writable? a good changelog
tells me what, why and how, and this changelog just covered the "what".

Also, I assume you've checked that it's OK for these variables to
change at any time?
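[Editorial sketch, not part of any posted message.] In the pull request above, the ipath_clear_freeze() hunk changes the argument of ipath_read_kreg64() from 0x1000+(im*sizeof(u64)) to (0x1000/sizeof(u64))+im. That reads as a units fix, on the assumption that the accessor indexes the register space in 64-bit words rather than bytes; the assumption is inferred from the diff and is not stated in it. The helper names below are hypothetical and only illustrate the arithmetic, together with the odd/even swap that the same loop applies for the 6110 chip bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset used in the hunk for the PIO-available registers. */
#define PIOAVAIL_BYTE_OFFSET 0x1000u

/*
 * Old expression from the hunk: a byte offset, which is wrong if the
 * accessor expects an index counted in 64-bit words (assumption).
 */
static size_t byte_offset_of_reg(size_t n)
{
	return PIOAVAIL_BYTE_OFFSET + n * sizeof(uint64_t);
}

/* New expression from the hunk: convert the base to words, then add n. */
static size_t word_index_of_reg(size_t n)
{
	return PIOAVAIL_BYTE_OFFSET / sizeof(uint64_t) + n;
}

/*
 * Index swizzle copied from the loop in the hunk: entries above 3 are
 * read from their odd/even neighbour, entries 0..3 are read in place.
 */
static unsigned pioavail_swizzle(unsigned i)
{
	return i > 3 ? ((i & 1) ? i - 1 : i + 1) : i;
}
```

With these definitions, register 3 maps to word index 515 under the new expression, while the old expression yields 4120, which would land far past the intended register if it were used as a word index.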
From rdreier at cisco.com Mon Jul 30 13:22:01 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:22:01 -0700
Subject: [ofa-general] ipoib question
In-Reply-To: <1185172339.5513.11.camel@mtls03> (Eli Cohen's message of "Mon, 23 Jul 2007 09:32:19 +0300")
References: <1185172339.5513.11.camel@mtls03>
Message-ID:

> Roland,
>
> can you explain why you add 1 to the size of the CQ in
> ipoib_transport_dev_init()?

Not really... I think it's lost in the depths of time, and probably
wrong too.

From rdreier at cisco.com Mon Jul 30 13:23:36 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:23:36 -0700
Subject: [ofa-general] PATCH] IB/core: ignore membership bit when looking for a P_Key in the table
In-Reply-To: <46A453BE.3030408@gmail.com> (Moni Shoua's message of "Mon, 23 Jul 2007 10:07:42 +0300")
References: <46A36E77.5020307@gmail.com> <46A453BE.3030408@gmail.com>
Message-ID:

Looks OK I guess. But it seems that we should fix up the code in
sa_query.c too, right?

From rdreier at cisco.com Mon Jul 30 13:25:12 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:25:12 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To: <46A536EC.4060201@ichips.intel.com> (Arlin Davis's message of "Mon, 23 Jul 2007 16:17:00 -0700")
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A536EC.4060201@ichips.intel.com>
Message-ID:

> Maintainers: please review the following proposal regarding new public
> download locations/website links and respond. This request originated
> from xwg.
>
> http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html

I guess it's OK, but what's the difference between a README and a
WEB_README? Would it make sense to have just one file (maybe in a
format that is easily transformed to HTML, eg reStructuredText) for
all purposes?

 - R.
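[Editorial sketch, not part of any posted message.] For the P_Key thread above: a P_Key is a 16-bit value whose top bit marks full versus limited membership, so "ignoring the membership bit" amounts to comparing only the low 15 bits. The function and macro names below are hypothetical and are not taken from the IB/core patch or from sa_query.c; this is only a minimal illustration of the comparison being discussed.

```c
#include <assert.h>
#include <stdint.h>

#define PKEY_MEMBERSHIP_BIT 0x8000u	/* set for full members */
#define PKEY_BASE_MASK      0x7fffu	/* low 15 bits identify the partition */

/*
 * Return the index of the first table entry whose 15-bit base value
 * matches pkey, regardless of the membership bit, or -1 if none does.
 */
static int find_pkey_index(const uint16_t *table, int len, uint16_t pkey)
{
	int i;

	for (i = 0; i < len; i++)
		if ((table[i] & PKEY_BASE_MASK) == (pkey & PKEY_BASE_MASK))
			return i;
	return -1;
}
```

A lookup for 0x7fff against a table holding 0xffff therefore succeeds, which an exact 16-bit comparison would miss.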
From rdreier at cisco.com Mon Jul 30 13:25:26 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:25:26 -0700
Subject: [Fwd: [ofa-general] [PATCH] libibverbs: document IBV_SEND_INLINE buffer ownership relaxation]
In-Reply-To: <46A6F17C.8060404@voltaire.com> (Or Gerlitz's message of "Wed, 25 Jul 2007 09:45:16 +0300")
References: <46A6F17C.8060404@voltaire.com>
Message-ID:

> It seems that you have missed this patch, can you have a look?

Sorry, I need to get to this...

From rdreier at cisco.com Mon Jul 30 13:26:44 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:26:44 -0700
Subject: [ofa-general] Question on IPoIB start xmit
In-Reply-To: (Krishna Kumar2's message of "Wed, 25 Jul 2007 16:03:23 +0530")
References:
Message-ID:

> Pathlookup skb2, Mcast send skb5, Unicast arp send8, Good skb1, Good skb3,
> Good skb4, Good skb6, Good skb7, Good skb9
>
> Or is there any requirement or logic that will break unless skbs are sent
> in the same order
> that it was received from ULP ?

Just the requirement that the low-level driver not gratuitously
reorder skbs within a flow. Some small amount of reordering probably
acceptable if it helps performance a lot.

From pauln at psc.edu Mon Jul 30 13:26:49 2007
From: pauln at psc.edu (Paul Nowoczynski)
Date: Mon, 30 Jul 2007 16:26:49 -0400
Subject: [ofa-general] SDP kernel Oops.. (OFED-1.2)
Message-ID: <46AE4989.9010508@psc.edu>

Jim,
I just ran with 1.2 and hit the same bug. I've included the debug msgs
leading up to the oops (at the bottom). I think the problem has to do
with handling a connection request after a socket has been destroyed.
The failed instance of sdp_connect_handler() doesn't appear to run
sdp_init_qp() so I assume that it fails somewhere before that. I wonder
if event->param.conn.private_data is bogus?
Thanks for your help.
Paul

int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id,
			struct rdma_cm_event *event)
{
	struct sockaddr_in *dst_addr;
	struct sock *child;
	const struct sdp_hh *h;
	int rc;

	sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id);

	h = event->param.conn.private_data;

	if (!h->max_adverts)
		return -EINVAL;

	child = sk_clone(sk, GFP_KERNEL);
	if (!child)
		return -ENOMEM;

	sdp_add_sock(sdp_sk(child));
	INIT_LIST_HEAD(&sdp_sk(child)->accept_queue);
	INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue);
	INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work, sdp_time_wait_work);
	INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work);

	dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr;
	inet_sk(child)->dport = dst_addr->sin_port;
	inet_sk(child)->daddr = dst_addr->sin_addr.s_addr;

	bh_unlock_sock(child);
	__sock_put(child);

	rc = sdp_init_qp(child, id);
	...

################# Console Msgs ###########################################

oss08p login:
Fedora Core release 3 (Heidelberg)
Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64

oss08p login: Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 id 000001015ed2b600
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): RDMA_CM_EVENT_CONNECT_REQUEST
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 000001015ed30c00 -> 000001015ed2b600
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp done
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connect_handler bufs 64 xmit_size_goal 32768
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 handled
Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): event 4 done.
status 0 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler event 9 id 000001015ed2b600 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): RDMA_CM_EVENT_ESTABLISHED Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connected_handler Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connected_handler child connection established Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler event 9 handled Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): event 9 done. status 0 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_accept: ib_req_notify_cq Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 sk 000001014e790780 newsk 000001014e790040 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_setsockopt Fedora Core release 3 (Heidelberg) Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 oss08p login: Fedora Core release 3 (Heidelberg) Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 Fedora Core release 3 (Heidelberg) Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 oss08p login: Fedora Core release 3 (Heidelberg) Kernel 
2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 oss08p login: Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 id 000001015ee0c800 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): RDMA_CM_EVENT_CONNECT_REQUEST Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 000001015ed30c00 -> 000001015ee0c800 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp done Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connect_handler bufs 64 xmit_size_goal 32768 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 handled Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): event 4 done. status 0 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler event 9 id 000001015ee0c800 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): RDMA_CM_EVENT_ESTABLISHED Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connected_handler Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connected_handler child connection established Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler event 9 handled Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 9 done. 
status 0 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_accept: ib_req_notify_cq Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 sk 000001014e790780 newsk 0000010151c457c0 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 expected 10 *err -22 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 sk 000001014e790780 newsk 0000000000000000 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_setsockopt Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_fin Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: entering time wait refcnt 2 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: last socket put 2 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_unhash Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_handle_wc: destroy in time wait state Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler event 10 id 000001015ee0c800 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): RDMA_CM_EVENT_DISCONNECTED Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destroy_work: refcnt 1 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_disconnected_handler Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler event 10 handled Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_reset_sk Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 10 done. 
status -104 Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk done; releasing sock Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct done Fedora Core release 3 (Heidelberg) Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 oss08p login: Fedora Core release 3 (Heidelberg) Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 oss08p login: Kernel BUG at sdp_cma:372 invalid operand: 0000 [1] SMP CPU 1 Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm ib_local_sa ib_cm iptable_filter i p_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md forcedeth Pid: 2362, comm: ib_cm/1 Not tainted 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 RIP: 0010:[] {:ib_sdp:sdp_connect_handler+207} RSP: 0018:0000010155f21bc8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000 RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640 RBP: ffffffffa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08 R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780 R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4 FS: 0000002a9589db00(0000) GS:ffffffff805a3140(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0 Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task 00000101565897e0) Stack: 000001015ede2400 000001014e7909e0 000001014e790780 000001015ede2400 0000010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581 000001015ede2400 000001015ede2458 Call Trace:{:ib_sdp:sdp_cma_handler+945} {:rdma_cm:cma_acquire_dev+359} {:rdma_cm:cma_req_handler+1000} {:ib_cm:cm_process_work+26} {:ib_cm:cm_req_handler+2463} {:ib_cm:cm_work_handler+0} {:ib_cm:cm_work_handler+66} {__wake_up+67} {:ib_cm:cm_work_handler+0} {worker_thread+496} {default_wake_function+0} {__wake_up_common+64} {default_wake_function+0} 
{keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04 RIP {:ib_sdp:sdp_connect_handler+207} RSP <0000010155f21bc8> <0>Kernel panic Jul 30 16:10:11 -oss08p kernel: sdp_sock(988:0): sdp_cma_handler event 4 id 000001015ede2400 Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): RDMA_CM_EVENT_CONNECT_REQUEST Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): sdp_connect_handler 000001015ed30c00 -> 000001015ede 2400 Jul 30 16:10:11 oss08p kernel: ----------- [cut here ] --------- [please bite here ] --------- Jul 30 16:10:12 oss08p kernel: Kernel BUG at sdp_cma:372 Jul 30 16:10:12 oss08p kernel: invalid operand: 0000 [1] SMP Jul 30 16:10:12 oss08p kernel: CPU 1 Jul 30 16:n10:12 oss08p kernel: Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm ib _local_sa ib_cm iptable_filter ip_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md forcedeth Jul 30 16:10:12 osos08p kernel: Pid: 2362, comm: ib_cm/1 Not tainted 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 Jul 30 16:10:12 oss08p kernel: RIP: 0010:[] {:ib_sdp:sdp_connect_handler+207} Jul 30 16:10:12 oss08p kernel: RSP: 0018:0000010155f21bc8 EtFLAGS: 00010246 Jul 30 16:10:12 oss08p kernel: RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000 Jul 30 16:10:12 oss08p kernel: RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640 Jul 30 16:10:12 oss08p kernel: RBP: fffffff fa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08 Jul 30 16:10:12 oss08p kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780 Jul 30 16:10:12 oss08p kernel: R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4 Jul 30 16:10:12 oss08sp kernel: FS: 0000002a9589db00(0000) GS:ffffffff805a3140(0000) knlGS:0000000000000000 Jul 30 16:10:12 oss08p kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 
000000008005003b Jul 30 16:10:12 oss08p kernel: CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0 Jul y30 16:10:12 oss08p kernel: Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task 00000101565897e0) Jul 30 16:10:12 oss08p kernel: Stack: 000001015ede2400 000001014e7909e0 000001014e790780 000001015ede2400 Jul 30 16:10:12 oss08p kernel: 00n00010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581 Jul 30 16:10:12 oss08p kernel: 000001015ede2400 000001015ede2458 Jul 30 16:10:12 oss08p kernel: Call Trace:{:ib_sdp:sdp_cma_handler+945} {:rdma_cm:cma_acquire_cdev+359} Jul 30 16:10:12 oss08p kernel: {:rdma_cm:cma_req_handler+1000} {:ib_cm:cm_process_work+26} Jul 30 16:10:12 oss08p kernel: {:ib_cm:cm_req_handler+2463} {:ib_cim:cm_work_handler+0} Jul 30 16:10:12 oss08p kernel: {:ib_cm:cm_work_handler+66} {__wake_up+67} Jul 30 16:10:12 oss08p kernel: {:ib_cm:cm_work_handler+0} {worker_thread+496} Jul 30 n16:10:12 oss08p kernel: {default_wake_function+0} {__wake_up_common+64} Jul 30 16:10:12 oss08p kernel: {default_wake_function+0} {keventd_create_kthread+0} Jul 30 16:g10:12 oss08p kernel: {worker_thread+0} {keventd_create_kthread+0} Jul 30 16:10:12 oss08p kernel: {kthread+217} {child_rip+8} Jul 30 16:10:12 oss08p kernel: {:keventd_create_kthread+0} {kthread+0} Jul 30 16:10:12 oss08p kernel: {child_rip+0} Jul 30 16:10:12 oss08p kernel: Jul 30 16:10:12 oss08p kernel: Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04 Jul 30 16:10:12 oss08p kernel: RIP {:ib_sdp:sdp_connect_handler+207} RSP <0000010155f21bc8> Jul 30 16:10:12 oss08p kernel: <0>Kernel panic - not syncing: Oops Oops From fubar at us.ibm.com Mon Jul 30 13:29:44 2007 From: fubar at us.ibm.com (Jay Vosburgh) Date: Mon, 30 Jul 2007 13:29:44 -0700 Subject: [ofa-general] Re: [PATCH V3 7/7] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <46ADDFE6.9000609@voltaire.com> References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> 
Message-ID: <19319.1185827384@death>

Moni Shoua wrote:

>Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit
>in dev->state field is on. This improves the chances for the arp packet to
>be transmitted.

Under what circumstances were you seeing problems that delaying the
gratuitous ARP until linkwatch is done improves things? Is this really an
IB thing, or did you experience problems here over regular ethernet?

>Signed-off-by: Moni Shoua
>---
> drivers/net/bonding/bond_main.c | 25 +++++++++++++++++++++----
> drivers/net/bonding/bonding.h | 1 +
> 2 files changed, 22 insertions(+), 4 deletions(-)
>
>Index: net-2.6/drivers/net/bonding/bond_main.c
>===================================================================
>--- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:33:25.000000000 +0300
>+++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-26 18:42:59.296296622 +0300
>@@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon
>         if (new_active && !bond->do_set_mac_addr)
>             memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
>                 new_active->dev->addr_len);
>-
>-        bond_send_gratuitous_arp(bond);
>+        if (bond->curr_active_slave &&
>+            test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)){
>+            dprintk("delaying gratuitous arp on %s\n",bond->curr_active_slave->dev->name);
>+            bond->send_grat_arp=1;
>+        }else{
>+            bond_send_gratuitous_arp(bond);
>+        }

Style issues throughout the patch series: many lines are too long,
many things are all smashed together, e.g., "}else{" instead of
"} else {", "send_grat_arp=1" instead of "send_grat_arp = 1", and so on.

>     }
> }
>
>@@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device
>      * program could monitor the link itself if needed.
>      */
>
>+    if (bond->send_grat_arp) {
>+        if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state))
>+            dprintk("Needs to send gratuitous arp but not yet\n",__FUNCTION__);
>+        else {
>+            dprintk("sending delayed gratuitous arp on ond->curr_active_slave->dev->name\n");
>+            bond_send_gratuitous_arp(bond);
>+            bond->send_grat_arp=0;
>+        }
>+    }
>     read_lock(&bond->curr_slave_lock);
>     oldcurrent = bond->curr_active_slave;
>     read_unlock(&bond->curr_slave_lock);
>@@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str
>     struct slave *slave = bond->curr_active_slave;
>     struct vlan_entry *vlan;
>     struct net_device *vlan_dev;
>+    int i;
>
>     dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name,
>                 slave ? slave->dev->name : "NULL");
>@@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str
>         return;
>
>     if (bond->master_ip) {
>-        bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>-                bond->master_ip, 0);
>+        for (i=0;i<3;i++)
>+            bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip,
>+                    bond->master_ip, 0);
>     }

If you delay the grat ARP until linkwatch is done, why is it also
necessary to shotgun several ARPs instead of one? Why are the ARPs sent
for VLANs not also shotgunned in a similar fashion?

If shotgunning like this really is useful, would it not make more sense
to space them out a bit, e.g., over successive monitor passes?
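
To make the last suggestion concrete, here is a minimal userspace sketch
(not the bonding driver's code) of spacing the gratuitous ARPs over
successive monitor passes. It assumes the send_grat_arp flag becomes a
countdown; struct bond_sim, bond_sim_request_grat_arp() and
bond_sim_monitor_pass() are hypothetical names, and the boolean stands in
for the __LINK_STATE_LINKWATCH_PENDING test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct bonding fields involved. */
struct bond_sim {
	bool linkwatch_pending;	/* models __LINK_STATE_LINKWATCH_PENDING */
	int grat_arp_left;	/* gratuitous ARPs still owed (assumed counter) */
	int arps_sent;		/* total ARPs sent, kept only for inspection */
};

/* Failover path: request n ARPs instead of setting a one-shot flag. */
static void bond_sim_request_grat_arp(struct bond_sim *b, int n)
{
	assert(n >= 0);
	b->grat_arp_left = n;
}

/*
 * One monitor pass: send at most one ARP, and only once linkwatch has
 * finished. Returns true if an ARP was sent on this pass.
 */
static bool bond_sim_monitor_pass(struct bond_sim *b)
{
	if (b->grat_arp_left <= 0 || b->linkwatch_pending)
		return false;

	/* Real code would call bond_send_gratuitous_arp() here. */
	b->arps_sent++;
	b->grat_arp_left--;
	return true;
}
```

With three ARPs requested while linkwatch is still pending, a pass sends
nothing; once the bit clears, three consecutive passes send one ARP each
and later passes do nothing.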
>     list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
>@@ -4331,6 +4347,7 @@ static int bond_init(struct net_device *
>     bond->current_arp_slave = NULL;
>     bond->primary_slave = NULL;
>     bond->dev = bond_dev;
>+    bond->send_grat_arp=0;
>     INIT_LIST_HEAD(&bond->vlan_list);
>
>     /* Initialize the device entry points */
>Index: net-2.6/drivers/net/bonding/bonding.h
>===================================================================
>--- net-2.6.orig/drivers/net/bonding/bonding.h 2007-07-25 15:20:10.000000000 +0300
>+++ net-2.6/drivers/net/bonding/bonding.h 2007-07-26 18:42:43.652087660 +0300
>@@ -203,6 +203,7 @@ struct bonding {
>     struct vlan_group *vlgrp;
>     struct packet_type arp_mon_pt;
>     s8 do_set_mac_addr;
>+    int send_grat_arp;

This need not be a full int, and (this applies to do_set_mac_addr,
also) could probably be squeezed into gaps already existing within the
struct bonding somewhere.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com

From rdreier at cisco.com Mon Jul 30 13:30:40 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:30:40 -0700
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes
In-Reply-To: <20070726014931.GL10235@sgi.com> (akepner@sgi.com's message of "Wed, 25 Jul 2007 18:49:31 -0700")
References: <20070726014931.GL10235@sgi.com>
Message-ID:

> +union mthca_doorbell {
> +    __be64 val64;
> +    __be32 val32[2];
> +} __attribute__ ((aligned (sizeof(__be64))));

would we get the same effect from just adding the __attribute__((aligned
to the declarations of the doorbell arrays?

I wonder how it would affect the generated code on various platforms
if we just made the doorbell values be computed as __be64 and then
passed that in to the write64 function...

- R.
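
As an illustration of the two alignment options discussed above, the
following userspace sketch compares the aligned union with an aligned
attribute placed directly on a two-word array declaration. It assumes
GCC-style attributes; uint64_t/uint32_t stand in for __be64/__be32, and
struct doorbell_user and is_aligned_8() are hypothetical names, not
mthca code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Option 1: the union from the patch, with plain integer stand-ins. */
union doorbell_union {
	uint64_t val64;
	uint32_t val32[2];
} __attribute__((aligned(sizeof(uint64_t))));

/* Option 2: keep the two-word array and align the declaration itself. */
struct doorbell_user {
	char pad;	/* would otherwise leave the array 4-byte aligned */
	uint32_t doorbell[2] __attribute__((aligned(8)));
};

/* Returns nonzero if p sits on an 8-byte boundary. */
static int is_aligned_8(const void *p)
{
	return ((uintptr_t)p % 8) == 0;
}
```

Either form gives a 64-bit store an 8-byte-aligned target; the second
leaves existing code that indexes the array untouched.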
From akepner at sgi.com Mon Jul 30 13:43:45 2007
From: akepner at sgi.com (akepner at sgi.com)
Date: Mon, 30 Jul 2007 13:43:45 -0700
Subject: [ofa-general] Re: [RFC/PATCH] mthca: ensure alignment of doorbell writes
In-Reply-To:
References: <20070726014931.GL10235@sgi.com>
Message-ID: <20070730204345.GI10032@sgi.com>

On Mon, Jul 30, 2007 at 01:30:40PM -0700, Roland Dreier wrote:
> > +union mthca_doorbell {
> > +    __be64 val64;
> > +    __be32 val32[2];
> > +} __attribute__ ((aligned (sizeof(__be64))));
>
> would we get the same effect from just adding the __attribute__((aligned
> to the declarations of the doorbell arrays?

Yes. (And of course using "((aligned (sizeof(__be64))))" with
a union containing a __be64 member is silly anyway....)

>
> I wonder how it would affect the generated code on various platforms
> if we just made the doorbell values be computed as __be64 and then
> passed that in to the write64 function...
>

That'd work fine for ia64 :-) For other platforms I can't answer...

--
Arthur

From rdreier at cisco.com Mon Jul 30 13:45:00 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 13:45:00 -0700
Subject: [ofa-general] Re: [PATCH] mad.c: Fix memory leak in switch handling and improve error handling
In-Reply-To: (Hal Rosenstock's message of "Sun, 29 Jul 2007 07:27:31 -0400")
References:
Message-ID:

I'm having a hard time seeing what this does exactly. It seems that
the current code

    } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
        /* forward case for switches */
        memcpy(response, recv, sizeof(*response));

will blindly dereference response even if the allocation failed, so
the first chunk that bails out if allocating response seems to be
fixing this. Anyway this seems like an unrelated change to the rest
of the patch.

I guess the leak fix is:

> - if (!agent_send_response(&response->mad.mad,
> + agent_send_response(&response->mad.mad,

but now you're ignoring the return value of that function. Hmm...
seems that the only other caller also ignores the return value too.
Should agent_send_response() just become a void function, since it
doesn't seem as if there's anything useful to do with the return value
anyway?

- R.

From hal.rosenstock at gmail.com Mon Jul 30 13:57:24 2007
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 30 Jul 2007 16:57:24 -0400
Subject: [ofa-general] Re: [PATCH] mad.c: Fix memory leak in switch handling and improve error handling
In-Reply-To:
References:
Message-ID:

On 7/30/07, Roland Dreier wrote:
> I'm having a hard time seeing what this does exactly. It seems that
> the current code
>
>     } else if (port_priv->device->node_type == RDMA_NODE_IB_SWITCH) {
>         /* forward case for switches */
>         memcpy(response, recv, sizeof(*response));
>
> will blindly dereference response even if the allocation failed, so
> the first chunk that bails out if allocating response seems to be
> fixing this.

Yes.

> Anyway this seems like an unrelated change to the rest
> of the patch.

Do these need to be 2 separate patches ?

> I guess the leak fix is:
>
> > - if (!agent_send_response(&response->mad.mad,
>
> > + agent_send_response(&response->mad.mad,

Yes (not going to out is the leak fix).

> but now you're ignoring the return value of that function.
> Hmm... seems that the only other caller also ignores the return
> value too. Should agent_send_response() just become a void function,
> since it doesn't seem as if there's anything useful to do with the
> return value anyway?

It could.

Do you want an updated patch (or patches) ? Let me know.

-- Hal

> - R.
>

From rdreier at cisco.com Mon Jul 30 14:20:22 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 14:20:22 -0700
Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver
In-Reply-To: <46ADDB89.5030601@voltaire.com> (Moni Shoua's message of "Mon, 30 Jul 2007 15:37:29 +0300")
References: <46ADDB89.5030601@voltaire.com>
Message-ID:

> 1.
When bonding enslaves an IPoIB device the bonding neighbor holds a
> reference to a cleanup function in the IPoIB driver. This makes it unsafe to
> unload the IPoIB module if there are bonding neighbors in the air. So, to
> avoid this race one must unload bonding before unloading IPoIB.

I think we really want to resolve this somehow. Getting an oops by
doing "modprobe -r ipoib" isn't that friendly.

Also, what happened to the problem of having an address handle
belonging to the wrong device on bond failover? Did you figure out a
way to fix that one?

- R.

From ardavis at ichips.intel.com Mon Jul 30 14:32:49 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Mon, 30 Jul 2007 14:32:49 -0700
Subject: [ofa-general] RE: OFA website edits
In-Reply-To:
References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A536EC.4060201@ichips.intel.com>
Message-ID: <46AE5901.7010307@ichips.intel.com>

Roland Dreier wrote:

> > Maintainers: please review the following proposal regarding new public
> > download locations/website links and respond. This request originated
> > from xwg.
> >
> > http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html
>
>I guess it's OK, but what's the difference between a README and a
>WEB_README?
>
>Would it make sense to have just one file (maybe in a format that is
>easily transformed to HTML, eg reStructuredText) for all purposes?
>
>

That works for me. I was waiting to hear back from Jeff regarding a
filename and content.

Jeff, can you comment? What format will work best for you?
-arlin

From rdreier at cisco.com Mon Jul 30 14:33:10 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 30 Jul 2007 14:33:10 -0700
Subject: [ofa-general] Re: NOSRQ QP implementation issues
In-Reply-To: <46AE3701.40603@linux.vnet.ibm.com> (Pradeep Satyanarayana's message of "Mon, 30 Jul 2007 12:07:45 -0700")
References: <46AE3701.40603@linux.vnet.ibm.com>
Message-ID:

> For sending (both on the active and passive side) the skbs are associated
> with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side)
> and WRs are posted to receive packets. An skb (for send) is not associated
> with SQ of the rx_qp. Therefore, no packets are expected to be sent through
> the rx_qp.
>
> In an erroneous case if packets do get sent to the wrong RQ, then they will
> get dropped as no WQEs are posted. As discussed, an RNR will be returned as
> expected and a new connection will get established. I still see no issues
> with this either.
>
> If in the future, we do want to use the unused SQ and RQs, then we will have
> to associate them with corresponding QP at the remote end. This will be work
> for both the SRQ and non-SRQ case.
>
> I do not see any issues. Can you please explain what is missing with this
> implementation?

I think what you are missing is that Linux is not necessarily the only
IPoIB CM implementation. The Linux IPoIB driver needs to be able to
talk to any other implementation that follows the RFCs, in particular
RFC 4755 for connected mode.

And according to my reading of the RFC at least, it is OK for a system
to accept an IPoIB CM connection and then use that connection to send
packets back to the system that originated the connection. There is
no requirement that a new connection be opened for traffic in the
other direction. And killing the connection as soon as a packet is
sent in the wrong direction seems pretty wrong to me.
The current SRQ code actually handles it fine, because all the QPs, no
matter which direction they were opened, are attached to the SRQ and
hence have receives available.

One possibility would be to set the maximum receive MTU to 0 for
connections initiated in the no-SRQ case. However I'm not sure
whether that is within the spirit of the RFC, and it might really
confuse other systems that do want to send on that QP.

Another possibility would be to post one receive to all no-SRQ QPs,
and if that receive is consumed then post more.

- R.

From pauln at psc.edu Mon Jul 30 14:36:41 2007
From: pauln at psc.edu (Paul Nowoczynski)
Date: Mon, 30 Jul 2007 17:36:41 -0400
Subject: [ofa-general] Re: SDP kernel Oops.. (OFED-1.2)
In-Reply-To: <46AE4989.9010508@psc.edu>
References: <46AE4989.9010508@psc.edu>
Message-ID: <46AE59E9.7070103@psc.edu>

I was running old firmware. Upgrading to the 4.8.200 seems to have
fixed the problem.
paul

Paul Nowoczynski wrote:
> Jim,
> I just ran with 1.2 and hit the same bug. I've included the debug
> msgs leading up to the oops (at the bottom). I think the problem has
> to do with handling a connection request after a socket has been
> destroyed. The failed instance of sdp_connect_handler() doesn't
> appear to run sdp_init_qp() so I assume that it fails somewhere before
> that.
>
> I wonder if event->param.conn.private_data is bogus?
>
> Thanks for your help.
> Paul > > > int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id, > struct rdma_cm_event *event) > { > struct sockaddr_in *dst_addr; > struct sock *child; > const struct sdp_hh *h; > int rc; > > sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id); > > h = event->param.conn.private_data; > > if (!h->max_adverts) > return -EINVAL; > > child = sk_clone(sk, GFP_KERNEL); > if (!child) > return -ENOMEM; > > sdp_add_sock(sdp_sk(child)); > INIT_LIST_HEAD(&sdp_sk(child)->accept_queue); > INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue); > INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work, > sdp_time_wait_work); > INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work); > > dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr; > inet_sk(child)->dport = dst_addr->sin_port; > inet_sk(child)->daddr = dst_addr->sin_addr.s_addr; > > bh_unlock_sock(child); > __sock_put(child); > > rc = sdp_init_qp(child, id); > ... > > ################# Console Msgs > ########################################### > > oss08p login: > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p login: Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): > sdp_cma_handler event 4 id 000001015ed2b600 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): > RDMA_CM_EVENT_CONNECT_REQUEST > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connect_handler > 000001015ed30c00 -> 000001015ed2b600 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_init_qp done > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connect_handler > bufs 64 xmit_size_goal 32768 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event > 4 handled > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): event 4 done. 
status 0 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler > event 9 id 000001015ed2b600 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): > RDMA_CM_EVENT_ESTABLISHED > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_connected_handler > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_connected_handler > child connection established > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_cma_handler > event 9 handled > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): event 9 done. status 0 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_accept: > ib_req_notify_cq > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 > sk 000001014e790780 newsk 000001014e790040 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:08:22 oss08p kernel: sdp_sock(988:1023): sdp_setsockopt > > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p login: > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p 
login: > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p login: Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): > sdp_cma_handler event 4 id 000001015ee0c800 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): > RDMA_CM_EVENT_CONNECT_REQUEST > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connect_handler > 000001015ed30c00 -> 000001015ee0c800 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_init_qp done > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): > sdp_connect_handler bufs 64 xmit_size_goal 32768 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_cma_handler event > 4 handled > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): event 4 done. status 0 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler > event 9 id 000001015ee0c800 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): > RDMA_CM_EVENT_ESTABLISHED > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_connected_handler > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_connected_handler > child connection established > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler > event 9 handled > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 9 done. 
> status 0 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_accept: > ib_req_notify_cq > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -22 > sk 000001014e790780 newsk 0000010151c457c0 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept state 10 > expected 10 *err -22 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: error -11 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:0): sdp_accept: status -11 > sk 000001014e790780 newsk 0000000000000000 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_setsockopt > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_fin > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: > entering time wait refcnt 2 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close: last > socket put 2 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_unhash > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_handle_wc: > destroy in time wait state > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler > event 10 id 000001015ee0c800 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): > RDMA_CM_EVENT_DISCONNECTED > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destroy_work: > refcnt 1 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): > sdp_disconnected_handler > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_cma_handler > event 10 handled > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_reset_sk > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): event 10 done. 
> status -104 > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_close_sk done; > releasing sock > Jul 30 16:09:40 oss08p kernel: sdp_sock(988:32768): sdp_destruct done > > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p login: > Fedora Core release 3 (Heidelberg) > Kernel 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 on an x86_64 > > oss08p login: > Kernel BUG at sdp_cma:372 > invalid operand: 0000 [1] SMP > CPU 1 > Modules linked in: ksocklnd ptlrpc obdclass lvfs lnet libcfs ib_ipoib > ib_srp ib_sdp rdma_cm ib_addr iw_cm ib_local_sa ib_cm iptable_filter i > p_tables e1000 ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core md > forcedeth > Pid: 2362, comm: ib_cm/1 Not tainted > 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 > RIP: 0010:[] > {:ib_sdp:sdp_connect_handler+207} > RSP: 0018:0000010155f21bc8 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000010150e45800 RCX: 0000000000000000 > RDX: 00000000ffffffff RSI: 0000010150e450a8 RDI: ffffffffa00b3640 > RBP: ffffffffa00b3140 R08: 0000010155ef8ef8 R09: 0000010155ef8f08 > R10: 00000000ffffffff R11: 0000000000000000 R12: 000001014e790780 > R13: 0000010155f21cf8 R14: 0000000000000000 R15: 000001015681bfa4 > FS: 0000002a9589db00(0000) GS:ffffffff805a3140(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000002a986adad8 CR3: 000000007ea38000 CR4: 00000000000006e0 > Process ib_cm/1 (pid: 2362, threadinfo 0000010155f20000, task > 00000101565897e0) > Stack: 000001015ede2400 000001014e7909e0 000001014e790780 > 000001015ede2400 > 0000010155f21cf8 0000000000000000 0000000000000000 ffffffffa00ac581 > 000001015ede2400 000001015ede2458 > Call Trace:{:ib_sdp:sdp_cma_handler+945} > {:rdma_cm:cma_acquire_dev+359} > {:rdma_cm:cma_req_handler+1000} > {:ib_cm:cm_process_work+26} > {:ib_cm:cm_req_handler+2463} > {:ib_cm:cm_work_handler+0} > {:ib_cm:cm_work_handler+66} > {__wake_up+67} > {:ib_cm:cm_work_handler+0} > 
{worker_thread+496} > {default_wake_function+0} > {__wake_up_common+64} > {default_wake_function+0} > {keventd_create_kthread+0} > {worker_thread+0} > {keventd_create_kthread+0} > {kthread+217} {child_rip+8} > {keventd_create_kthread+0} > {kthread+0} > {child_rip+0} > > Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 01 66 66 90 66 90 65 8b 04 > RIP {:ib_sdp:sdp_connect_handler+207} RSP > <0000010155f21bc8> > <0>Kernel panic Jul 30 16:10:11 -oss08p kernel: sdp_sock(988:0): > sdp_cma_handler event 4 id 000001015ede2400 > Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): > RDMA_CM_EVENT_CONNECT_REQUEST > Jul 30 16:10:11 oss08p kernel: sdp_sock(988:0): sdp_connect_handler > 000001015ed30c00 -> 000001015ede 2400 > Jul 30 16:10:11 oss08p kernel: ----------- [cut here ] --------- > [please bite here ] --------- > Jul 30 16:10:12 oss08p kernel: Kernel BUG at sdp_cma:372 > Jul 30 16:10:12 oss08p kernel: invalid operand: 0000 [1] SMP > Jul 30 16:10:12 oss08p kernel: CPU 1 > Jul 30 16:n10:12 oss08p kernel: Modules linked in: ksocklnd ptlrpc > obdclass lvfs lnet libcfs ib_ipoib ib_srp ib_sdp rdma_cm ib_addr iw_cm ib > _local_sa ib_cm iptable_filter ip_tables e1000 ib_sa ib_uverbs ib_umad > ib_mthca ib_mad ib_core md forcedeth > Jul 30 16:10:12 osos08p kernel: Pid: 2362, comm: ib_cm/1 Not tainted > 2.6.9-42.0.8.EL_lustre.1.4.9.1customOBost_0.4 > Jul 30 16:10:12 oss08p kernel: RIP: 0010:[] > {:ib_sdp:sdp_connect_handler+207} > Jul 30 16:10:12 oss08p kernel: RSP: 0018:0000010155f21bc8 EtFLAGS: > 00010246 > Jul 30 16:10:12 oss08p kernel: RAX: 0000000000000000 RBX: > 0000010150e45800 RCX: 0000000000000000 > Jul 30 16:10:12 oss08p kernel: RDX: 00000000ffffffff RSI: > 0000010150e450a8 RDI: ffffffffa00b3640 > Jul 30 16:10:12 oss08p kernel: RBP: fffffff fa00b3140 R08: > 0000010155ef8ef8 R09: 0000010155ef8f08 > Jul 30 16:10:12 oss08p kernel: R10: 00000000ffffffff R11: > 0000000000000000 R12: 000001014e790780 > Jul 30 16:10:12 oss08p kernel: R13: 0000010155f21cf8 R14: > 0000000000000000 R15: 
000001015681bfa4 > Jul 30 16:10:12 oss08sp kernel: FS: 0000002a9589db00(0000) > GS:ffffffff805a3140(0000) knlGS:0000000000000000 > Jul 30 16:10:12 oss08p kernel: CS: 0010 DS: 0018 ES: 0018 CR0: > 000000008005003b > Jul 30 16:10:12 oss08p kernel: CR2: 0000002a986adad8 CR3: > 000000007ea38000 CR4: 00000000000006e0 > Jul y30 16:10:12 oss08p kernel: Process ib_cm/1 (pid: 2362, threadinfo > 0000010155f20000, task 00000101565897e0) > Jul 30 16:10:12 oss08p kernel: Stack: 000001015ede2400 > 000001014e7909e0 000001014e790780 000001015ede2400 > Jul 30 16:10:12 oss08p kernel: 00n00010155f21cf8 > 0000000000000000 0000000000000000 ffffffffa00ac581 > Jul 30 16:10:12 oss08p kernel: 000001015ede2400 000001015ede2458 > Jul 30 16:10:12 oss08p kernel: Call > Trace:{:ib_sdp:sdp_cma_handler+945} > {:rdma_cm:cma_acquire_cdev+359} > Jul 30 16:10:12 oss08p kernel: > {:rdma_cm:cma_req_handler+1000} > {:ib_cm:cm_process_work+26} > Jul 30 16:10:12 oss08p kernel: > {:ib_cm:cm_req_handler+2463} > {:ib_cim:cm_work_handler+0} > Jul 30 16:10:12 oss08p kernel: > {:ib_cm:cm_work_handler+66} > {__wake_up+67} > Jul 30 16:10:12 oss08p kernel: > {:ib_cm:cm_work_handler+0} > {worker_thread+496} > Jul 30 n16:10:12 oss08p kernel: > {default_wake_function+0} > {__wake_up_common+64} > Jul 30 16:10:12 oss08p kernel: > {default_wake_function+0} > {keventd_create_kthread+0} > Jul 30 16:g10:12 oss08p kernel: > {worker_thread+0} > {keventd_create_kthread+0} > Jul 30 16:10:12 oss08p kernel: {kthread+217} > {child_rip+8} > Jul 30 16:10:12 oss08p kernel: > {:keventd_create_kthread+0} > {kthread+0} > Jul 30 16:10:12 oss08p kernel: {child_rip+0} > Jul 30 16:10:12 oss08p kernel: > Jul 30 16:10:12 oss08p kernel: Code: 0f 0b 01 f0 0a a0 ff ff ff ff 74 > 01 66 66 90 66 90 65 8b 04 > Jul 30 16:10:12 oss08p kernel: RIP > {:ib_sdp:sdp_connect_handler+207} RSP > <0000010155f21bc8> > Jul 30 16:10:12 oss08p kernel: <0>Kernel panic - not syncing: Oops > Oops > > From rdreier at cisco.com Mon Jul 30 14:37:38 2007 From: rdreier 
at cisco.com (Roland Dreier) Date: Mon, 30 Jul 2007 14:37:38 -0700 Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance with Modified Write Protocol In-Reply-To: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip> (Ken Jeffries's message of "Mon, 30 Jul 2007 10:30:18 -0500") References: <02ca01c7d2be$89a17eb0$0a97a8c0@blacktip> Message-ID: > We have been doing a fair amount of performance testing on our SRP target. > One thing we found early on was that client writes were considerably slower > than client reads. We addressed this by patching the SRP client code so > that it could include the client write data in the SRP CMD IU if it would > fit. This notion is in iSER but is not in standard SRP. Architecturally, > the capability is signaled using an additional data buffer format bit. > We find that client write performance is considerably improved by using > this capability. We are calling SRP spec compliant writes "standard > writes" and our modified writes "iu data writes". I think this may make sense but you probably want to involve T10 to get it standardized somehow. Also, although I know having a big IOP number is important for various non-technical reasons, are there any realistic storage workloads that do lots of single-block writes? Also I guess you need to use giant IUs to be able to hold at least one block in the IU? - R. From jimmmott at austin.rr.com Mon Jul 30 14:54:15 2007 From: jimmmott at austin.rr.com (Jim Mott) Date: Mon, 30 Jul 2007 16:54:15 -0500 Subject: [ofa-general] Re: SDP kernel Oops.. (OFED-1.2) In-Reply-To: <46AE59E9.7070103@psc.edu> References: <46AE4989.9010508@psc.edu> <46AE59E9.7070103@psc.edu> Message-ID: <004301c7d2f4$2d651180$882f3480$@rr.com> Great! -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Paul Nowoczynski Sent: Monday, July 30, 2007 4:37 PM To: general at lists.openfabrics.org Subject: [ofa-general] Re: SDP kernel Oops.. 
(OFED-1.2) I was running old firmware. Upgrading to the 4.8.200 seems to have fixed the problem. paul Paul Nowoczynski wrote: > Jim, > I just ran with 1.2 and hit the same bug. I've included the debug > msgs leading up to the oops (at the bottom). I think the problem has > to do with handling a connection request after a socket has been > destroyed. The failed instance of sdp_connect_handler() doesn't > appear to run sdp_init_qp() so I assume that it fails somewhere before > that. > > I wonder if event->param.conn.private_data is bogus? > > Thanks for your help. > Paul > > > int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id, > struct rdma_cm_event *event) > { > struct sockaddr_in *dst_addr; > struct sock *child; > const struct sdp_hh *h; > int rc; > > sdp_dbg(sk, "%s %p -> %p\n", __func__, sdp_sk(sk)->id, id); > > h = event->param.conn.private_data; > > if (!h->max_adverts) > return -EINVAL; > > child = sk_clone(sk, GFP_KERNEL); > if (!child) > return -ENOMEM; > > sdp_add_sock(sdp_sk(child)); > INIT_LIST_HEAD(&sdp_sk(child)->accept_queue); > INIT_LIST_HEAD(&sdp_sk(child)->backlog_queue); > INIT_DELAYED_WORK(&sdp_sk(child)->time_wait_work, > sdp_time_wait_work); > INIT_WORK(&sdp_sk(child)->destroy_work, sdp_destroy_work); > > dst_addr = (struct sockaddr_in *)&id->route.addr.dst_addr; > inet_sk(child)->dport = dst_addr->sin_port; > inet_sk(child)->daddr = dst_addr->sin_addr.s_addr; > > bh_unlock_sock(child); > __sock_put(child); > > rc = sdp_init_qp(child, id); > ... _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ycai at Brocade.COM Mon Jul 30 15:25:44 2007 From: ycai at Brocade.COM (Ying Cai) Date: Mon, 30 Jul 2007 15:25:44 -0700 Subject: [ofa-general] Event for active/passive connection Message-ID: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com> Hi, After reading the OFED 1.2 code, I have a question. In cma_iw_handler(): case IW_CM_EVENT_CONNECT_REPLY: ... switch (iw_event->status) { case 0: event.event = RDMA_CM_EVENT_ESTABLISHED; break; ... } break; case IW_CM_EVENT_ESTABLISHED: event.event = RDMA_CM_EVENT_ESTABLISHED; break; It could cause a problem in SDP, since in SDP RDMA_CM_EVENT_ESTABLISHED is handled by sdp_connected_handler(), which can only handle passive connection case (it assumes the socket has parent, which is only true for listening sockets). Is the SDP over iWarp case tested, or did I miss something? Seems the correct event for SDP should be RDMA_CM_EVENT_CONNECT_RESPONSE.
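The distinction raised above can be sketched in isolation. What follows is a minimal, self-contained C model, not the OFED or SDP source: every name prefixed with `model_` is hypothetical, and it assumes only what the message states, namely that an ESTABLISHED event may arrive on a socket that has no parent (the active side) as well as on one that does (the passive side). It shows a handler that checks for a parent instead of assuming one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the rdma_cm or SDP types. */
enum model_event {
    MODEL_EVENT_CONNECT_RESPONSE, /* active side: peer answered our request */
    MODEL_EVENT_ESTABLISHED       /* connection is up */
};

enum model_role {
    MODEL_ROLE_NONE,
    MODEL_ROLE_ACTIVE,
    MODEL_ROLE_PASSIVE
};

struct model_sock {
    struct model_sock *parent; /* listener that spawned us; NULL if we connected */
    bool connected;
    enum model_role role;
};

/* Passive completion: only valid for a socket that has a parent listener. */
static int model_finish_passive(struct model_sock *sk)
{
    if (sk->parent == NULL)
        return -1; /* the situation the message warns about */
    sk->connected = true;
    sk->role = MODEL_ROLE_PASSIVE;
    return 0;
}

/* Active completion: the socket initiated the connection itself. */
static int model_finish_active(struct model_sock *sk)
{
    sk->connected = true;
    sk->role = MODEL_ROLE_ACTIVE;
    return 0;
}

/* Dispatch an event without assuming which side opened the connection. */
int model_handle_event(struct model_sock *sk, enum model_event ev)
{
    switch (ev) {
    case MODEL_EVENT_CONNECT_RESPONSE:
        return model_finish_active(sk);
    case MODEL_EVENT_ESTABLISHED:
        return sk->parent ? model_finish_passive(sk)
                          : model_finish_active(sk);
    }
    return -1; /* unknown event */
}
```

With this shape, an ESTABLISHED event on a parentless socket completes the active path rather than relying on a parent that does not exist. Whether SDP should handle it this way, or whether the CMA should report RDMA_CM_EVENT_CONNECT_RESPONSE instead, is the open question in the message.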
Thanks, -Ying From swise at opengridcomputing.com Mon Jul 30 16:10:22 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 30 Jul 2007 18:10:22 -0500 Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> References: <46ADEE7F.2000005@mellanox.co.il> <46AE17E7.3020305@opengridcomputing.com> <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> Message-ID: <46AE6FDE.6030604@opengridcomputing.com> Jeff Squyres wrote: > Yes, you missed it; the call was over about half an hour ago. I > [re-]posted the dial-in info about 3 hours before the call this morning > on the ewg list. > I see. That's why I missed it. I'm not on the ewg list. Are all attendees expected to be on the ewg list? Steve. > > On Jul 30, 2007, at 12:55 PM, Steve Wise wrote: > >> Am I missing the call info? I tried an older conf id, and it didn't >> work. Can you please post the conf call info along with the meeting >> notification? >> >> Thanks, >> >> Steve. >> >> >> Tziporet Koren wrote: >>> Hi All, >>> We will have our bi-weekly OFED meeting today at 9am PST >>> Agenda: >>> - Status update >>> - Bugzilla cleanup >>> If you have more agenda items please send them >>> Tziporet >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >> >> _______________________________________________ >> ewg mailing list >> ewg at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > > From swise at opengridcomputing.com Mon Jul 30 16:15:48 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 30 Jul 2007 18:15:48 -0500 Subject: [ofa-general] Event for active/passive connection In-Reply-To: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com> References: <4C94DE2070B172459E4F1EE14BD2364E3CA97F@HQ-EXCH-5.corp.brocade.com> Message-ID: <46AE7124.7010903@opengridcomputing.com> I've never tested SDP over iWARP... Ying Cai wrote: > Hi, > > After reading the OFED 1.2 code, I have a question. > > In cma_iw_handler(): > > case IW_CM_EVENT_CONNECT_REPLY: > ... > switch (iw_event->status) { > case 0: > event.event = RDMA_CM_EVENT_ESTABLISHED; > break; > ... > } > break; > case IW_CM_EVENT_ESTABLISHED: > event.event = RDMA_CM_EVENT_ESTABLISHED; > break; > > It could cause a problem in SDP, since in SDP RDMA_CM_EVENT_ESTABLISHED > is handled by sdp_connected_handler(), which can only handle passive > connection case (it assumes the socket has parent, which is only true > for listening sockets). Is the SDP over iWarp case tested, or did I miss > something? > > Seems the correct event for SDP should be RDMA_CM_EVENT_CONNECT_RESPONSE. > > Thanks, > -Ying > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jsquyres at cisco.com Mon Jul 30 16:54:09 2007 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 30 Jul 2007 19:54:09 -0400 Subject: [ewg] Re: [ofa-general] reminder: OFED meeting today at 9am PST In-Reply-To: <46AE6FDE.6030604@opengridcomputing.com> References: <46ADEE7F.2000005@mellanox.co.il> <46AE17E7.3020305@opengridcomputing.com> <0CB03EC2-3F31-4881-80FA-0F46D90825A3@cisco.com> <46AE6FDE.6030604@opengridcomputing.com> Message-ID: <03AEC794-684D-4FC3-BE77-024667268420@cisco.com> On Jul 30, 2007, at 7:10 PM, Steve Wise wrote: >> Yes, you missed it; the call was over about half an hour ago. I >> [re-]posted the dial-in info about 3 hours before the call this >> morning on the ewg list. > > I see. That's why I missed it. I'm not on the ewg list. > > Are all attendees expected to be on the ewg list? It's an OFED-specific call, so I generally post the call info just to the EWG list (there's been some backlash before about posting OFED-specific stuff on the general list and/or not on the ewg list). -- Jeff Squyres Cisco Systems From kenjeffries at storagegear.com Mon Jul 30 17:00:18 2007 From: kenjeffries at storagegear.com (Ken Jeffries) Date: Mon, 30 Jul 2007 19:00:18 -0500 Subject: [ofa-general] OFED SRP Client / StorageGear Target / Performance with Modified Write Protocol In-Reply-To: Message-ID: <02f601c7d305$c8b79480$0a97a8c0@blacktip> Our implicit assumption has been that, since T10 abandoned SRP 2, the T10/SRP community had little interest in SRP enhancements. If there is IB/SRP community interest we would certainly pursue a T10 project of some sort.
If StorageGear is the only interested party, then not so much. Our general target is small clusters that benefit from SSDs. An SSD takes advantage of IB much more fully than an IB RAID box does. Since we support up to 4 HCAs, we envision clusters of up to 8 systems that use our system without needing an IB switch. Enabling low-cost switchless clusters doesn't sell many switches directly, but enabling low-cost IB does help the IB market in general. We think these clusters will want to do random "small" I/Os and that "small" will almost always be larger than 512 bytes. Yes, we use giant IUs to be able to hold at least one block. "Giant" is relative, though. Using srp_sg_tablesize=255 results in an IU of 4148 bytes, which is plenty to hold one 4096-byte block. Since our motherboard supports up to 64GB, the overhead of the large IUs is a non-issue for us. Of course, the client only transmits the used portion of the IU, so non-iu-data-writes remain small on the wire. The client-side code simply uses an additional s/g entry passed to the IB layer, so no client-side copy is done. Ken -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Monday, July 30, 2007 4:38 PM To: Ken Jeffries Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] OFED SRP Client / StorageGear Target / Performance with Modified Write Protocol > We have been doing a fair amount of performance testing on our SRP target. > One thing we found early on was that client writes were considerably slower > than client reads. We addressed this by patching the SRP client code so > that it could include the client write data in the SRP CMD IU if it would > fit. This notion is in iSER but is not in standard SRP. Architecturally, > the capability is signaled using an additional data buffer format bit. > We find that client write performance is considerably improved by using > this capability.
We are calling SRP spec compliant writes "standard > writes" and our modified writes "iu data writes". I think this may make sense but you probably want to involve T10 to get it standardized somehow. Also, although I know having a big IOP number is important for various non-technical reasons, are there any realistic storage workloads that do lots of single-block writes? Also I guess you need to use giant IUs to be able to hold at least one block in the IU? - R. From pradeeps at linux.vnet.ibm.com Mon Jul 30 17:25:47 2007 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 30 Jul 2007 17:25:47 -0700 Subject: [ofa-general] Re: NOSRQ QP implementation issues In-Reply-To: References: <46AE3701.40603@linux.vnet.ibm.com> Message-ID: <46AE818B.4080107@linux.vnet.ibm.com> Roland Dreier wrote: > > For sending (both on the active and passive side) the skbs are associated > > with the tx_qp. The remote qp for the tx_qp is the rx_qp (on the other side) > > and WRs are posted to receive packets. An skb (for send) is not associated > > with SQ of the rx_qp. Therefore, no packets are expected to be sent through > > the rx_qp. > > > > In an erroneous case if packets do get sent to the wrong RQ, then they will > > get dropped as no WQEs are posted. As discussed, an RNR will be returned as > > expected and a new connection will get established. I still see no issues > > with this either. > > > > If in the future, we do want to use the unused SQ and RQs, then we will have > > to associate them with corresponding QP at the remote end. This will be work > > for both the SRQ and non-SRQ case. > > > > I do not see any issues. Can you please explain what is missing with this > > implementation? > > I think what you are missing is that Linux is not necessarily the only > IPoIB CM implementation. The Linux IPoIB driver needs to be able to > talk to any other implementation that follows the RFCs, in particular > RFC 4755 for connected mode. 
And according to my reading of the RFC > at least, it is OK for a system to accept an IPoIB CM connection and > then use that connection to send packets back to the system that > originated the connection. There is no requirement that a new > connection be opened for traffic in the other direction. > > And killing the connection as soon as a packet is sent in the wrong > direction seems pretty wrong to me. The current SRQ code actually > handles it fine, because all the QPs, no matter which direction they > were opened, are attached to the SRQ and hence have receives available. > > One possibility would be to set the maximum receive MTU to 0 for > connections initiated in the no-SRQ case. However I'm not sure > whether that is within the spirit of the RFC, and it might really > confuse other systems that do want to send on that QP. Another > possibility would be to post one receive to all no-SRQ QPs, and if > that receive is consumed then post more. > > - R. > Thanks for pointing that out, Roland. Yes, I was focussed on Linux and did not consider other systems. Michael, thanks for catching this. Until I saw Roland's description I did not consider the other possibilities and did not see what you were alluding to. What do you folks think about the following: in addition to posting 1 WR, suppose I create a separate CQ for the RQ (for tx_qp). There will be a small completion handler that spits out a message that this request was received from a non-Linux system, and then calls ipoib_ib_completion(). So, this way we will not kill the connection, but the performance may be limited. Pradeep From amar.mudrankit at gmail.com Mon Jul 30 23:45:06 2007 From: amar.mudrankit at gmail.com (Amar Mudrankit) Date: Tue, 31 Jul 2007 12:15:06 +0530 Subject: [ofa-general] IPoIB CM Connection establishment Message-ID: Hi all, While establishing a connection with the remote node, the path is resolved and a REQ is sent by the requester.
We get a REP from the peer indicating that it is ready to establish the connection. At the requester's end, the REP is handled by the ipoib_cm_rep_handler function, in which the context of the path is recalled. All the skbs are then first de-queued, their dev pointers are changed to the device present in the context (skb->dev = p->dev), and they are queued again for transmission using dev_queue_xmit. Now, when I traced back the initialization of the context from the requester's point of view, I found it done in the function ipoib_cm_create_tx, in which the dev argument is the network device corresponding to the ipoib interface. Hence, what is the difference between skb->dev and p->dev? Is p->dev a different network device because of the new connection? What is the difference between the ipoib net_device and this new p->dev? Precisely, I would like to understand the semantics of this p->dev. Can anyone tell me whether this trace is right, and point me to the correct trace if it is wrong? Regards, Amar From vlad at lists.openfabrics.org Tue Jul 31 01:40:15 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 31 Jul 2007 01:40:15 -0700 (PDT) Subject: [ofa-general] ofa_1_2_kernel 20070731-0100 daily build status Message-ID: <20070731084015.B8378E6086F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2 Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on ppc64 with linux-2.6.18 Passed on ia64 with linux-2.6.18 Passed on powerpc with linux-2.6.13 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.16 Passed on powerpc with linux-2.6.18 Passed on x86_64 with linux-2.6.12 Passed on ppc64 with linux-2.6.19 Passed on ia64 with linux-2.6.19 Passed on ppc64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ppc64 with linux-2.6.15 Passed on x86_64 with linux-2.6.19 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.17 Passed on powerpc
with linux-2.6.12 Passed on powerpc with linux-2.6.19 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.15 Passed on ia64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on ia64 with linux-2.6.16.21-0.8-default Failed: From jggp at brokershome.com Tue Jul 31 02:31:58 2007 From: jggp at brokershome.com (Ottilia) Date: Tue, 31 Jul 2007 12:31:58 +0300 Subject: [ofa-general] Cashed Message-ID: <46AF018E.4020508@brokershome.com> -------------- next part -------------- A non-text attachment was scrubbed... Name: Cashed.zip Type: application/octet-stream Size: 8528 bytes Desc: not available URL: From sashak at voltaire.com Tue Jul 31 02:41:36 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 31 Jul 2007 12:41:36 +0300 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM/include/iba/ib_types.h: Some comment fixes In-Reply-To: References: Message-ID: <20070731094136.GD13838@sashak.voltaire.com> On 15:54 Mon 30 Jul , Hal Rosenstock wrote: > include/iba/ib_types.h: Some comment fixes > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From vlad at lists.openfabrics.org Tue Jul 31 02:49:57 2007 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky) Date: Tue, 31 Jul 2007 02:49:57 -0700 (PDT) Subject: [ofa-general] ofa_1_2_c_kernel 20070731-0200 daily build status Message-ID: <20070731094957.AE542E60805@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_2/linux-2.6.git git_branch: ofed_1_2_c Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.12 Passed on x86_64 with linux-2.6.20 Passed on powerpc with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.12 Passed on x86_64 with linux-2.6.12 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.17 Passed on powerpc with linux-2.6.18 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on powerpc with linux-2.6.17 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.16 Passed on powerpc with linux-2.6.15 Passed on powerpc with linux-2.6.16 Passed on x86_64 with linux-2.6.18 
Passed on powerpc with linux-2.6.12 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ppc64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-22.ELsmp Passed on x86_64 with linux-2.6.9-34.ELsmp Passed on ia64 with linux-2.6.16.21-0.8-default Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Failed: From hal.rosenstock at gmail.com Tue Jul 31 03:39:34 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 31 Jul 2007 06:39:34 -0400 Subject: [ofa-general] [PATCH] mad.c: Fix memory leak in switch handling and improve error handling in ib_mad_recv_done_handler Message-ID: mad.c: Fix memory leak in switch handling and improve error handling in ib_mad_recv_done_handler. Also, eliminate no longer needed return value in agent.c:agent_send_response. 
Signed-off-by: Suresh Shelvapille Signed-off-by: Hal Rosenstock diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index db2633e..4c1a1ca 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -78,15 +78,14 @@ ib_get_agent_port(struct ib_device *device, int port_num) return entry; } -int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, - struct ib_wc *wc, struct ib_device *device, - int port_num, int qpn) +void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn) { struct ib_agent_port_private *port_priv; struct ib_mad_agent *agent; struct ib_mad_send_buf *send_buf; struct ib_ah *ah; - int ret; struct ib_mad_send_wr_private *mad_send_wr; if (device->node_type == RDMA_NODE_IB_SWITCH) @@ -96,23 +95,21 @@ int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, if (!port_priv) { printk(KERN_ERR SPFX "Unable to find port agent\n"); - return -ENODEV; + return; } agent = port_priv->agent[qpn]; ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num); if (IS_ERR(ah)) { - ret = PTR_ERR(ah); - printk(KERN_ERR SPFX "ib_create_ah_from_wc error:%d\n", ret); - return ret; + printk(KERN_ERR SPFX "ib_create_ah_from_wc error\n"); + return; } send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_KERNEL); if (IS_ERR(send_buf)) { - ret = PTR_ERR(send_buf); - printk(KERN_ERR SPFX "ib_create_send_mad error:%d\n", ret); + printk(KERN_ERR SPFX "ib_create_send_mad error\n"); goto err1; } @@ -126,16 +123,16 @@ int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, mad_send_wr->send_wr.wr.ud.port_num = port_num; } - if ((ret = ib_post_send_mad(send_buf, NULL))) { - printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret); + if (ib_post_send_mad(send_buf, NULL)) { + printk(KERN_ERR SPFX "ib_post_send_mad error\n"); goto err2; } - return 0; + return; err2: 
ib_free_send_mad(send_buf); err1: ib_destroy_ah(ah); - return ret; + return; } static void agent_send_handler(struct ib_mad_agent *mad_agent, diff --git a/drivers/infiniband/core/agent.h b/drivers/infiniband/core/agent.h index 86d72fa..fb9ed14 100644 --- a/drivers/infiniband/core/agent.h +++ b/drivers/infiniband/core/agent.h @@ -46,8 +46,8 @@ extern int ib_agent_port_open(struct ib_device *device, int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, - struct ib_wc *wc, struct ib_device *device, - int port_num, int qpn); +extern void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn); #endif /* __AGENT_H_ */ diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index bc547f1..f82900d 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1842,16 +1842,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, { struct ib_mad_qp_info *qp_info; struct ib_mad_private_header *mad_priv_hdr; - struct ib_mad_private *recv, *response; + struct ib_mad_private *recv, *response = NULL; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; int port_num; - response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); - if (!response) - printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " - "for response buffer\n"); - mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; qp_info = mad_list->mad_queue->qp_info; dequeue_mad(mad_list); @@ -1879,6 +1874,13 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) { + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + goto out; + } + if 
(port_priv->device->node_type == RDMA_NODE_IB_SWITCH) port_num = wc->port_num; else @@ -1914,12 +1916,11 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, response->header.recv_wc.recv_buf.mad = &response->mad.mad; response->header.recv_wc.recv_buf.grh = &response->grh; - if (!agent_send_response(&response->mad.mad, - &response->grh, wc, - port_priv->device, - smi_get_fwd_port(&recv->mad.smp), - qp_info->qp->qp_num)) - response = NULL; + agent_send_response(&response->mad.mad, + &response->grh, wc, + port_priv->device, + smi_get_fwd_port(&recv->mad.smp), + qp_info->qp->qp_num); goto out; } @@ -1930,15 +1931,6 @@ local: if (port_priv->device->process_mad) { int ret; - if (!response) { - printk(KERN_ERR PFX "No memory for response MAD\n"); - /* - * Is it better to assume that - * it wouldn't be processed ? - */ - goto out; - } - ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc, &recv->grh, From glebn at voltaire.com Tue Jul 31 04:56:05 2007 From: glebn at voltaire.com (Gleb Natapov) Date: Tue, 31 Jul 2007 14:56:05 +0300 Subject: [ofa-general] Re: Scalable reliable connection In-Reply-To: <20070730125054.GO9963@mellanox.co.il> References: <20070730125054.GO9963@mellanox.co.il> Message-ID: <20070731115605.GU4434@minantech.com> On Mon, Jul 30, 2007 at 03:50:54PM +0300, Michael S. Tsirkin wrote: > With SRC: > O(N ^ 2 * J) > > This is achived by using a single send queue (per job, out of O(N * J) jobs) > to send data to all J jobs running on a specific node (out of O(N) nodes). > Hardware uses new "SRQ number" field in packet header to > multiplex receive WRs and WCs to private memory of each job. > But since the send queue cannot be used for receiving packets additional receive QPs have to be created one per job so with SRC it is actually O(N ^ 2 * J + N * J) unless I am missing something. > This is similiar idea to IB RD. 
Except that with RD there is no need to jump through hoops and create separate QPs for sending and receiving packets in order to achieve scalability. > Q: Why not use RD then? > A: Because no hardware supports it. Wrong answer :) There was no HW for SRC either, but Mellanox decided to implement SRC instead of RD. The reasons Dror provided for this: a) RD is hard to do. Not a very sound reason IMO. Not doing RD just pushes the complexity from HW to SW. And there are HW implementations of RD, though not for IB. b) RD, as defined by the IB spec, will not achieve good performance. This reason is serious, but can the spec be changed to allow for a high-performance implementation? Spec compliance is not something that has stopped Mellanox from doing things before :) Thanks for the protocol explanation. -- Gleb. From mst at dev.mellanox.co.il Tue Jul 31 05:07:06 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jul 2007 15:07:06 +0300 Subject: [ofa-general] Re: Scalable reliable connection In-Reply-To: <20070731115605.GU4434@minantech.com> References: <20070730125054.GO9963@mellanox.co.il> <20070731115605.GU4434@minantech.com> Message-ID: <20070731120706.GC9087@mellanox.co.il> > Quoting Gleb Natapov : > Subject: Re: Scalable reliable connection > > On Mon, Jul 30, 2007 at 03:50:54PM +0300, Michael S. Tsirkin wrote: > > With SRC: > > O(N ^ 2 * J) > > > > This is achived by using a single send queue (per job, out of O(N * J) jobs) > > to send data to all J jobs running on a specific node (out of O(N) nodes). > > Hardware uses new "SRQ number" field in packet header to > > multiplex receive WRs and WCs to private memory of each job. > > > But since the send queue cannot be used for receiving packets additional > receive QPs have to be created one per job so with SRC it is actually > O(N ^ 2 * J + N * J) > unless I am missing something. Yes, but since N >= 1, N ^ 2 >= N and so O(N ^ 2 * J + N * J) == O(N ^ 2 * J). 
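Spelled out as a quick sketch (the factoring is added here purely for illustration; N and J are the node and per-node job counts used above):

```latex
\begin{align*}
N^2 J \;\le\; N^2 J + N J \;=\; N J\,(N + 1) \;\le\; 2\,N^2 J
\qquad \text{for } N \ge 1 .
\end{align*}
```

So $N^2 J + N J = \Theta(N^2 J)$: the extra per-job receive QPs change the constant factor only, not the asymptotic order.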
-- MST From jim at mellanox.com Tue Jul 31 05:07:03 2007 From: jim at mellanox.com (Jim Mott) Date: Tue, 31 Jul 2007 05:07:03 -0700 Subject: [ofa-general] [PATCH V1 2/2] sdplib: add KEEPALIVE support References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> Message-ID: Hi, This is the user space part of an OFED 1.3 patch to add keepalive support to SDP. Diff from OFED 1.2 Index: ofa_user/src/userspace/libsdp/src/port.c =================================================================== --- ofa_user.orig/src/userspace/libsdp/src/port.c 2007-06-27 15:56:21.000000000 +0300 +++ ofa_user/src/userspace/libsdp/src/port.c 2007-07-03 20:16:47.000000000 +0300 @@ -793,8 +793,21 @@ setsockopt( __sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> level <%d> name <%d>\n", program_invocation_short_name, fd, shadow_fd, level, optname ); + if (level == SOL_SOCKET && optname == SO_KEEPALIVE && get_is_sdp_socket(fd)) { + level = AF_INET_SDP; + __sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> substitute level %d\n", + program_invocation_short_name, fd, shadow_fd, level ); + } + ret = _socket_funcs.setsockopt( fd, level, optname, optval, optlen ); if ( ( ret >= 0 ) && ( shadow_fd != -1 ) ) { + if (level == SOL_SOCKET && optname == SO_KEEPALIVE && + get_is_sdp_socket(shadow_fd)) { + level = AF_INET_SDP; + __sdp_log( 2, "SETSOCKOPT: <%s:%d:%d> substitute level %d\n", + program_invocation_short_name, fd, shadow_fd, level ); + } + sret = _socket_funcs.setsockopt( shadow_fd, level, optname, optval, optlen ); if ( sret < 0 ) { From jim at mellanox.com Tue Jul 31 05:07:00 2007 From: jim at mellanox.com (Jim Mott) Date: Tue, 31 Jul 2007 05:07:00 -0700 Subject: [ofa-general] [PATCH V1 1/2] sdp: add KEEPALIVE support References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> Message-ID: Hi, This is the kernel part of an OFED 1.3 patch to add keepalive support to SDP. There are a couple of things to highlight. 
1) No specific 'active' bit Instead of setting or clearing some bit on every send or receive, this code just remembers the TX and RX heads every time the keepalive timer pops. If they are the same this pop as last pop, then the probe is sent. 2) Counter of all keepalives sent The keepalive probe itself is a zero byte RDMA (as per-spec). It does not generate a CQ entry unless there is a problem. Since unlike TCP there is nothing that 'tcpdump' or a sniffer could see on the wire, it is hard to test that keepalives are being sent in the absence of problems. In order to create an automated test, there is a /sys counter that gets incremented every time a keepalive is sent. An argument could be made to add a counter to each socket, and add some options to get (and reset) it. I am open to doing it that way if people think it is better. Diff from OFED 1.2 Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp.h =================================================================== --- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp.h 2007-07-16 19:42:32.000000000 +0300 +++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp.h 2007-07-21 03:05:29.000000000 +0300 @@ -42,6 +42,7 @@ extern int sdp_data_debug_level; #define SDP_RESOLVE_TIMEOUT 1000 #define SDP_ROUTE_TIMEOUT 1000 #define SDP_RETRY_COUNT 5 +#define SDP_KEEPALIVE_TIME (120 * 60) #define SDP_TX_SIZE 0x40 #define SDP_RX_SIZE 0x40 @@ -51,6 +52,7 @@ extern int sdp_data_debug_level; #define SDP_NUM_WC 4 #define SDP_OP_RECV 0x800000000LL +#define SDP_OP_SEND 0x400000000LL enum sdp_mid { SDP_MID_HELLO = 0x0, @@ -115,6 +117,12 @@ struct sdp_sock { int time_wait; + unsigned keepalive_time; + + /* tx_head/rx_head when keepalive timer started */ + unsigned keepalive_tx_head; + unsigned keepalive_rx_head; + /* Data below will be reset on error */ /* rdma specific */ struct rdma_cm_id *id; @@ -221,5 +229,7 @@ void sdp_urg(struct sdp_sock *ssk, struc void sdp_add_sock(struct sdp_sock *ssk); void sdp_remove_sock(struct sdp_sock *ssk); void 
sdp_remove_large_sock(void); +void sdp_post_keepalive(struct sdp_sock *ssk); +void sdp_start_keepalive_timer(struct sock *sk); #endif Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-07-16 19:42:32.000000000 +0300 +++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_bcopy.c 2007-07-16 23:00:04.000000000 +0300 @@ -60,6 +60,12 @@ static int max_large_sockets = 1000; module_param_named(max_large_sockets, max_large_sockets, int, 0644); MODULE_PARM_DESC(max_large_sockets, "Max number of large sockets (32k buffers)."); +#define sdp_cnt(var) do { (var)++; } while (0) +static unsigned sdp_keepalive_probes_sent = 0; + +module_param_named(sdp_keepalive_probes_sent, sdp_keepalive_probes_sent, uint, 0644); +MODULE_PARM_DESC(sdp_keepalive_probes_sent, "Total number of keepalive probes sent."); + static int curr_large_sockets = 0; atomic_t sdp_current_mem_usage; spinlock_t sdp_large_sockets_lock; @@ -107,6 +113,31 @@ static void sdp_fin(struct sock *sk) } } +void sdp_post_keepalive(struct sdp_sock *ssk) +{ + int rc; + struct ib_send_wr wr, *bad_wr; + + sdp_dbg(&ssk->isk.sk, "%s\n", __func__); + + memset(&wr, 0, sizeof(wr)); + + wr.next = NULL; + wr.wr_id = 0; + wr.sg_list = NULL; + wr.num_sge = 0; + wr.opcode = IB_WR_RDMA_WRITE; + + rc = ib_post_send(ssk->qp, &wr, &bad_wr); + if (rc) { + sdp_dbg(&ssk->isk.sk, "ib_post_keepalive failed with status %d.\n", rc); + sdp_set_error(&ssk->isk.sk, -ECONNRESET); + wake_up(&ssk->wq); + } + + sdp_cnt(sdp_keepalive_probes_sent); +} + void sdp_post_send(struct sdp_sock *ssk, struct sk_buff *skb, u8 mid) { struct sdp_buf *tx_req; @@ -158,7 +189,7 @@ void sdp_post_send(struct sdp_sock *ssk, } ssk->tx_wr.next = NULL; - ssk->tx_wr.wr_id = ssk->tx_head; + ssk->tx_wr.wr_id = ssk->tx_head | SDP_OP_SEND; ssk->tx_wr.sg_list = ssk->ibsge; ssk->tx_wr.num_sge = frags + 1; ssk->tx_wr.opcode = IB_WR_SEND; @@ -604,7 +635,7 @@ 
static void sdp_handle_wc(struct sdp_soc __kfree_skb(skb); } } - } else { + } else if (likely(wc->wr_id & SDP_OP_SEND)) { skb = sdp_send_completion(ssk, wc->wr_id); if (unlikely(!skb)) return; @@ -620,6 +651,22 @@ static void sdp_handle_wc(struct sdp_soc } sk_stream_write_space(&ssk->isk.sk); + } else { + sdp_cnt(sdp_keepalive_probes_sent); + + if (likely(!wc->status)) + return; + + sdp_dbg(&ssk->isk.sk, " %s consumes KEEPALIVE status %d\n", + __func__, wc->status); + + if (wc->status == IB_WC_WR_FLUSH_ERR) + return; + + sdp_set_error(&ssk->isk.sk, -ECONNRESET); + wake_up(&ssk->wq); + + return; } if (likely(!wc->status)) { Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c =================================================================== --- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_cma.c 2007-07-16 19:42:32.000000000 +0300 +++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_cma.c 2007-07-16 23:00:04.000000000 +0300 @@ -270,8 +270,8 @@ static int sdp_response_handler(struct s sk->sk_state = TCP_ESTABLISHED; - /* TODO: If SOCK_KEEPOPEN set, need to reset and start - keepalive timer here */ + if (sock_flag(sk, SOCK_KEEPOPEN)) + sdp_start_keepalive_timer(sk); if (sock_flag(sk, SOCK_DEAD)) return 0; @@ -311,8 +311,8 @@ int sdp_connected_handler(struct sock *s sk->sk_state = TCP_ESTABLISHED; - /* TODO: If SOCK_KEEPOPEN set, need to reset and start - keepalive timer here */ + if (sock_flag(sk, SOCK_KEEPOPEN)) + sdp_start_keepalive_timer(sk); if (sock_flag(sk, SOCK_DEAD)) return 0; Index: ofa_kernel/drivers/infiniband/ulp/sdp/sdp_main.c =================================================================== --- ofa_kernel.orig/drivers/infiniband/ulp/sdp/sdp_main.c 2007-07-16 19:42:38.000000000 +0300 +++ ofa_kernel/drivers/infiniband/ulp/sdp/sdp_main.c 2007-07-21 03:10:14.000000000 +0300 @@ -117,6 +117,11 @@ static int send_poll_thresh = 8192; module_param_named(send_poll_thresh, send_poll_thresh, int, 0644); MODULE_PARM_DESC(send_poll_thresh, "Send message size thresh 
hold over which to start polling."); +static unsigned int sdp_keepalive_time = SDP_KEEPALIVE_TIME; + +module_param_named(sdp_keepalive_time, sdp_keepalive_time, uint, 0644); +MODULE_PARM_DESC(sdp_keepalive_time, "Default idle time in seconds before keepalive probe sent."); + struct workqueue_struct *sdp_workqueue; static struct list_head sock_list; @@ -124,6 +129,11 @@ static spinlock_t sock_list_lock; DEFINE_RWLOCK(device_removal_lock); +static inline unsigned int sdp_keepalive_time_when(const struct sdp_sock *ssk) +{ + return ssk->keepalive_time ? : sdp_keepalive_time * HZ; +} + inline void sdp_add_sock(struct sdp_sock *ssk) { spin_lock_irq(&sock_list_lock); @@ -221,6 +231,86 @@ static void sdp_destroy_qp(struct sdp_so kfree(ssk->tx_ring); } + +static void sdp_reset_keepalive_timer(struct sock *sk, unsigned long len) +{ + struct sdp_sock *ssk = sdp_sk(sk); + + sdp_dbg(sk, "%s\n", __func__); + + ssk->keepalive_tx_head = ssk->tx_head; + ssk->keepalive_rx_head = ssk->rx_head; + + sk_reset_timer(sk, &sk->sk_timer, jiffies + len); +} + +static void sdp_delete_keepalive_timer(struct sock *sk) +{ + struct sdp_sock *ssk = sdp_sk(sk); + + sdp_dbg(sk, "%s\n", __func__); + + ssk->keepalive_tx_head = 0; + ssk->keepalive_rx_head = 0; + + sk_stop_timer(sk, &sk->sk_timer); +} + +static void sdp_keepalive_timer(unsigned long data) +{ + struct sock *sk = (struct sock *)data; + struct sdp_sock *ssk = sdp_sk(sk); + + sdp_dbg(sk, "%s\n", __func__); + + /* Only process if the socket is not in use */ + bh_lock_sock(sk); + if (sock_owned_by_user(sk)) { + sdp_reset_keepalive_timer(sk, HZ / 20); + goto out; + } + + if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_LISTEN || + sk->sk_state == TCP_CLOSE) + goto out; + + if (ssk->keepalive_tx_head == ssk->tx_head && + ssk->keepalive_rx_head == ssk->rx_head) + sdp_post_keepalive(ssk); + + sdp_reset_keepalive_timer(sk, sdp_keepalive_time_when(ssk)); + +out: + bh_unlock_sock(sk); + sock_put(sk); +} + +static void sdp_init_timer(struct 
sock *sk) +{ + init_timer(&sk->sk_timer); + + sk->sk_timer.function = sdp_keepalive_timer; + sk->sk_timer.data = (unsigned long)sk; +} + +static void sdp_set_keepalive(struct sock *sk, int val) +{ + sdp_dbg(sk, "%s %d\n", __func__, val); + + if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) + return; + + if (val && !sock_flag(sk, SOCK_KEEPOPEN)) + sdp_start_keepalive_timer(sk); + else if (!val) + sdp_delete_keepalive_timer(sk); +} + +void sdp_start_keepalive_timer(struct sock *sk) +{ + sdp_reset_keepalive_timer(sk, sdp_keepalive_time_when(sdp_sk(sk))); +} + void sdp_reset_sk(struct sock *sk, int rc) { struct sdp_sock *ssk = sdp_sk(sk); @@ -365,6 +455,8 @@ static void sdp_close(struct sock *sk, l sdp_dbg(sk, "%s\n", __func__); + sdp_delete_keepalive_timer(sk); + sk->sk_shutdown = SHUTDOWN_MASK; if (sk->sk_state == TCP_LISTEN || sk->sk_state == TCP_SYN_SENT) { sdp_set_state(sk, TCP_CLOSE); @@ -818,9 +910,6 @@ static int sdp_setsockopt(struct sock *s int err = 0; sdp_dbg(sk, "%s\n", __func__); - if (level != SOL_TCP) - return -ENOPROTOOPT; - if (optlen < sizeof(int)) return -EINVAL; @@ -829,6 +918,28 @@ static int sdp_setsockopt(struct sock *s lock_sock(sk); + /* SOCK_KEEPALIVE is really a SOL_SOCKET level option but there + * is a problem handling it at that level. In order to start + * the keepalive timer on an SDP socket, we must call an SDP + * specific routine. Since sock_setsockopt() can not be modifed + * to understand SDP, the application must pass that option + * through to us. 
Since SO_KEEPALIVE and TCP_DEFER_ACCEPT both + * use the same optname, the level must not be SOL_TCP or SOL_SOCKET + */ + if (level == PF_INET_SDP && optname == SO_KEEPALIVE) { + sdp_set_keepalive(sk, val); + if (val) + sock_set_flag(sk, SOCK_KEEPOPEN); + else + sock_reset_flag(sk, SOCK_KEEPOPEN); + goto out; + } + + if (level != SOL_TCP) { + err = -ENOPROTOOPT; + goto out; + } + switch (optname) { case TCP_NODELAY: if (val) { @@ -867,11 +978,23 @@ static int sdp_setsockopt(struct sock *s sdp_push_pending_frames(sk); } break; + case TCP_KEEPIDLE: + if (val < 1 || val > MAX_TCP_KEEPIDLE) + err = -EINVAL; + else { + ssk->keepalive_time = val * HZ; + + if (sock_flag(sk, SOCK_KEEPOPEN) && + !((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) + sdp_reset_keepalive_timer(sk, ssk->keepalive_time); + } + break; default: err = -ENOPROTOOPT; break; } +out: release_sock(sk); return err; } @@ -904,6 +1027,9 @@ static int sdp_getsockopt(struct sock *s case TCP_CORK: val = !!(ssk->nonagle&TCP_NAGLE_CORK); break; + case TCP_KEEPIDLE: + val = ssk->keepalive_time ? ssk->keepalive_time / HZ : sdp_keepalive_time; + break; default: return -ENOPROTOOPT; } @@ -1687,6 +1813,8 @@ static int sdp_create_socket(struct sock sk->sk_destruct = sdp_destruct; + sdp_init_timer(sk); + sock->ops = &sdp_proto_ops; sock->state = SS_UNCONNECTED; From monisonlists at gmail.com Tue Jul 31 06:33:20 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 31 Jul 2007 16:33:20 +0300 Subject: [ofa-general] Re: [PATCH V3 7/7] net/bonding: Delay sending of gratuitous ARP to avoid failure In-Reply-To: <19319.1185827384@death> References: <46ADDB89.5030601@voltaire.com> <46ADDFE6.9000609@voltaire.com> <19319.1185827384@death> Message-ID: <46AF3A20.8080700@gmail.com> Jay Vosburgh wrote: > Moni Shoua wrote: > >> Delay sending a gratuitous_arp when LINK_STATE_LINKWATCH_PENDING bit >> in dev->state field is on. This improves the chances for the arp packet to >> be transmitted. 
> > Under what circumstances were you seeing problems that delaying > the gratuitous ARP until linkwatch is done improves things? Is this > really an IB thing, or did you experience problems here over regular > ethernet? > I tried to figure out what is the difference in the state/flags of the device when grat. ARP send succeeds and when it fails. I found exact correlation with the LINK_STATE_LINKWATCH_PENDING bit on. I don't think that this is an IB issue but I can't be sure. I didn't run tests for Ethernet. >> Signed-off-by: Moni Shoua >> --- >> drivers/net/bonding/bond_main.c | 25 +++++++++++++++++++++---- >> drivers/net/bonding/bonding.h | 1 + >> 2 files changed, 22 insertions(+), 4 deletions(-) >> >> Index: net-2.6/drivers/net/bonding/bond_main.c >> =================================================================== >> --- net-2.6.orig/drivers/net/bonding/bond_main.c 2007-07-25 15:33:25.000000000 +0300 >> +++ net-2.6/drivers/net/bonding/bond_main.c 2007-07-26 18:42:59.296296622 +0300 >> @@ -1134,8 +1134,13 @@ void bond_change_active_slave(struct bon >> if (new_active && !bond->do_set_mac_addr) >> memcpy(bond->dev->dev_addr, new_active->dev->dev_addr, >> new_active->dev->addr_len); >> - >> - bond_send_gratuitous_arp(bond); >> + if (bond->curr_active_slave && >> + test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)){ >> + dprintk("delaying gratuitous arp on %s\n",bond->curr_active_slave->dev->name); >> + bond->send_grat_arp=1; >> + }else{ >> + bond_send_gratuitous_arp(bond); >> + } > > Style issues throughout the patch series: many lines are too > long, many things are all smashed together, e.g., "}else{" instead of > "} else {", "send_grat_arp=1" instead of "send_grat_arp = 1", and so on. > OK thanks. I'll fix and repost. >> } >> } >> >> @@ -2120,6 +2125,15 @@ void bond_mii_monitor(struct net_device >> * program could monitor the link itself if needed. 
>> */ >> >> + if (bond->send_grat_arp) { >> + if (bond->curr_active_slave && test_bit(__LINK_STATE_LINKWATCH_PENDING, &bond->curr_active_slave->dev->state)) >> + dprintk("Needs to send gratuitous arp but not yet\n",__FUNCTION__); >> + else { >> + dprintk("sending delayed gratuitous arp on ond->curr_active_slave->dev->name\n"); >> + bond_send_gratuitous_arp(bond); >> + bond->send_grat_arp=0; >> + } >> + } > > >> read_lock(&bond->curr_slave_lock); >> oldcurrent = bond->curr_active_slave; >> read_unlock(&bond->curr_slave_lock); >> @@ -2513,6 +2527,7 @@ static void bond_send_gratuitous_arp(str >> struct slave *slave = bond->curr_active_slave; >> struct vlan_entry *vlan; >> struct net_device *vlan_dev; >> + int i; >> >> dprintk("bond_send_grat_arp: bond %s slave %s\n", bond->dev->name, >> slave ? slave->dev->name : "NULL"); >> @@ -2520,8 +2535,9 @@ static void bond_send_gratuitous_arp(str >> return; >> >> if (bond->master_ip) { >> - bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, >> - bond->master_ip, 0); >> + for (i=0;i<3;i++) >> + bond_arp_send(slave->dev, ARPOP_REPLY, bond->master_ip, >> + bond->master_ip, 0); >> } > > If you delay the grat ARP until linkwatch is done, why is it > also necessary to shotgun several ARPs instead of one? Why are the ARPs > sent for VLANs not also shotgunned in a similar fashion? Besides the linkwatch issue I also noticed that on rare occasions, grat. ARPs that found their way to the slave's xmit function were not xmitted. The 3 times send is just an another attempt to improve chances. I'd like to emphasize here that with IB slaves, grat. ARP is much more crucial to a successful change of slaves and that was my focus. > If shotgunning like this really is useful, would it not make > more sense to space them out a bit, e.g., over successive monitor > passes? > I guess you are right about that. 
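As a rough sketch of the spacing idea (this is not bonding code: the struct, its fields, and both function names are made up for illustration, and the linkwatch check is stubbed with a plain flag), the one-bit flag could become a countdown, with each monitor pass sending at most one ARP once the link has settled:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bits of struct bonding. */
struct bond_sketch {
    int  grat_arps_left;     /* gratuitous ARPs still to be sent */
    bool linkwatch_pending;  /* stand-in for __LINK_STATE_LINKWATCH_PENDING */
    int  arps_sent;          /* running total, kept only for demonstration */
};

/* On failover: request n ARPs instead of sending them back to back. */
void request_grat_arps(struct bond_sketch *b, int n)
{
    b->grat_arps_left = n;
}

/* One monitor pass: send at most one ARP, and only when linkwatch is done. */
void monitor_pass(struct bond_sketch *b)
{
    if (b->grat_arps_left <= 0)
        return;              /* nothing outstanding */
    if (b->linkwatch_pending)
        return;              /* link not settled yet, retry on the next pass */

    b->arps_sent++;          /* real code would call the ARP send routine here */
    b->grat_arps_left--;
}
```

With a count of 3 this would give the same number of ARPs as the loop in the patch, but spread over three monitor intervals rather than sent in one burst.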
>> list_for_each_entry(vlan, &bond->vlan_list, vlan_list) { >> @@ -4331,6 +4347,7 @@ static int bond_init(struct net_device * >> bond->current_arp_slave = NULL; >> bond->primary_slave = NULL; >> bond->dev = bond_dev; >> + bond->send_grat_arp=0; >> INIT_LIST_HEAD(&bond->vlan_list); >> >> /* Initialize the device entry points */ >> Index: net-2.6/drivers/net/bonding/bonding.h >> =================================================================== >> --- net-2.6.orig/drivers/net/bonding/bonding.h 2007-07-25 15:20:10.000000000 +0300 >> +++ net-2.6/drivers/net/bonding/bonding.h 2007-07-26 18:42:43.652087660 +0300 >> @@ -203,6 +203,7 @@ struct bonding { >> struct vlan_group *vlgrp; >> struct packet_type arp_mon_pt; >> s8 do_set_mac_addr; >> + int send_grat_arp; > > This need not be a full int, and (this applies to > do_set_mac_addr, also) could probably be squeezed into gaps already > existing within the struct bonding somewhere. Thanks. Will be fixed. > > -J > > --- > -Jay Vosburgh, IBM Linux Technology Center, fubar at us.ibm.com > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at dev.mellanox.co.il Tue Jul 31 06:37:11 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 31 Jul 2007 16:37:11 +0300 Subject: [ofa-general] [PATCH] rdma/ib_mad.h: add include to linux/list.h Message-ID: <200707311637.11327.dotanb@dev.mellanox.co.il> ib_mad.h uses struct list_head, so while linux/list.h seems to be pulled in indirectly by one of the headers it includes, the right thing is to include linux/list.h directly. 
Signed-off-by: Dotan Barak --- diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 30712dd..8ec3799 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -39,6 +39,8 @@ #if !defined( IB_MAD_H ) #define IB_MAD_H +#include + #include /* Management base version */ From tziporet at mellanox.co.il Tue Jul 31 06:40:06 2007 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 31 Jul 2007 16:40:06 +0300 Subject: [ofa-general] OFED July 30 meeting minutes Message-ID: <6C2C79E72C305246B504CBA17B5500C9015639DD@mtlexch01.mtl.com> OFED July 30 meeting summary ============================ 1. Decided to have only one release in August - OFED 1.2.c. Main reasons are - focus of all companies and saving the verification and QA efforts. 2. Bugzilla: Everybody is requested to review the non-closed bugs and decide what is the action needed. 3. Status update: a. OFED 1.2.c: - 1.2.c-10 will be available tomorrow (Aug-1). - 1.2.c release is targeted for Aug 8. b. OFED 1.3: - Kernel code base was changed to 2.6.23-rc1 - new install scripts and spec files should be ready next week. - Other features - on track for now (no special updates). - Not clear if Open MPI will be ready with support for the new SRC object. Action Items: ============= 1. Chelsio and IBM (that requested 1.2.1 release) - make sure all your changes are committed to 1.2.c branch Reminder: Feature freeze for OFED 1.3 is targeted to Sep 4. Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From monisonlists at gmail.com Tue Jul 31 06:44:08 2007 From: monisonlists at gmail.com (Moni Shoua) Date: Tue, 31 Jul 2007 16:44:08 +0300 Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: References: <46ADDB89.5030601@voltaire.com> Message-ID: <46AF3CA8.6050201@gmail.com> Roland Dreier wrote: > > 1. 
When bonding enslaves an IPoIB device the bonding neighbor holds a > > reference to a cleanup function in the IPoIB drives. This makes it unsafe to > > unload the IPoIB module if there are bonding neighbors in the air. So, to > > avoid this race one must unload bonding before unloading IPoIB. > > I think we really want to resolve this somehow. Getting an oops by > doing "modprobe -r ipoib" isn't that friendly. > You are right and we want to resolve that. One way is to clean the neigh destructor function from all IPoIB neighs. The other way is to prevent ipoib unload if device is a slave or is referenced from somewhere else. I guess I would like an advice here. > Also, what happened to the problem of having an address handle > belonging to the wrong device on bond failover? Did you figure out a > way to fix that one? This is what patch 2 handles. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dotanb at dev.mellanox.co.il Tue Jul 31 06:49:15 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 31 Jul 2007 16:49:15 +0300 Subject: [ofa-general] [PATCH] rdma/ib_verbs.h: add include to linux/list.h and linux/rwsem.h Message-ID: <200707311649.15573.dotanb@dev.mellanox.co.il> ib_verbs.h uses the structs list_head and rw_semaphore, so while the files linux/list.h and linux/rwsem.h seems to be pulled in indirectly by the other header files it includes, the right thing is to include those files directly. 
Signed-off-by: Dotan Barak

---

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0627a6a..7a99f11 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -46,6 +46,8 @@
 #include
 #include
 #include
+#include <linux/list.h>
+#include <linux/rwsem.h>
 
 #include
 #include

From erezz at voltaire.com Tue Jul 31 06:51:31 2007
From: erezz at voltaire.com (Erez Zilber)
Date: Tue, 31 Jul 2007 16:51:31 +0300
Subject: [ofa-general] OFED 1.2.c-9 is available
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com>
Message-ID: <46AF3E63.20204@voltaire.com>

Tziporet,

Does 1.2.c-9 also include FMR support? Else, should we wait for 1.2.c-10?

Thanks,
Erez

Tziporet Koren wrote:
> Hi All,
>
> OFED 1.2.c-9 is available now on the OFA server under:
> _http://www.openfabrics.org/builds/connectx/release/_
> Note: this release was tested with FW 2.1.000 that will soon be
> available on Mellanox web site for download.
>
> Supported Platforms and Operating Systems
> =================================
> o CPU architectures:
>   - x86_64
>   - x86
>   - ppc64
>   - ia64
>
> o Linux Operating Systems:
>   - RedHat EL4 up3: 2.6.9-34.ELsmp
>   - RedHat EL4 up4: 2.6.9-42.ELsmp
>   - RedHat EL4 up5: 2.6.9-55.ELsmp
>   - RedHat EL5: 2.6.18-8.el5
>   - SLES10: 2.6.16.21-0.8-smp
>   - kernel.org: 2.6.20.x
>   - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested)
>
> Main changes from OFED 1.2.c-8:
> =========================
> 1. Kernel oops in IPoIB on restart of the driver.
> 2. IPoIB CM is now the default.
> 3. MPI with SRQ is supported.
> 4. Itanium is now supported.
>
> mlx4 Fixed Bugs and Enhancements
> ===========================
> - Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428.
> - Query QP and query SRQ are now supported.
> - Internal error flow was added.
> - Number of QPs that can be attached to the same multicast group was
>   increased to 56.
> - SRQ is now supported.
> - Fork is now supported.
> > ConnectX specific known issues and limitations > =================================== > - The following commands and/or features are not supported: > o Resize CQ > o FMRs > o APM > o SQD > - ibstat does not present all entries. Use ibv_devinfo instead. > - To load the driver on machines with 64KB default page size UAR bar > must be > enlarged. 64KB page size is the default of PPC with RHEL5 and > Itanium with > 64KB page size enabled. > Perform the following three steps: > 1. Add the following line in the firmware configuration (INI) file > under the > [HCA] section: > log2_uar_bar_megabytes = 5 > 2. Burn a modified firmware image with the changed INI file > 3. Reboot the system > > > > Tziporet Koren > Software Director > Mellanox Technologies > mailto: _tziporet at mellanox.co.il_ > Tel +972-4-9097200, ext 380 > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Tue Jul 31 06:58:40 2007 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 31 Jul 2007 16:58:40 +0300 Subject: [ofa-general] RE: [PATCH][TIRIVIAL] ibdm/src/osm_check.cpp: Add missing include file In-Reply-To: References: Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF1344@mtlexch01.mtl.com> Applied. Thanks Hal. 
Eitan Zahavi

> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Monday, July 30, 2007 8:54 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General
> Subject: [PATCH][TIRIVIAL] ibdm/src/osm_check.cpp: Add
> missing include file
>
> ibdm/src/osm_check.cpp: Add missing include file
>
> Signed-off-by: Hal Rosenstock
>
> diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp
> index 49215c2..f24eec6 100644
> --- a/ibdm/src/osm_check.cpp
> +++ b/ibdm/src/osm_check.cpp
> @@ -35,6 +35,7 @@
>  #include "Fabric.h"
>  #include "SubnMgt.h"
>  #include "CredLoops.h"
> +#include
>  #include
>  #include
>

From dotanb at dev.mellanox.co.il Tue Jul 31 07:04:07 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 17:04:07 +0300
Subject: [ofa-general] should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs?
Message-ID: <46AF4157.3090007@dev.mellanox.co.il>

Hi Roland.

Here is the code of the macro IB_UMEM_MAX_PAGE_CHUNK (from ib_verbs):

#define IB_UMEM_MAX_PAGE_CHUNK \
    ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \
     ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \
      (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))

The problem is that the struct ib_umem_chunk is no longer in this header
file (ib_verbs.h), the declaration of this structure can be found in
ib_umem.h.

This problem can be fixed by one of the following:
1) add include to ib_umem.h in ib_verbs.h
2) move this macro to ib_umem.h

what do you think?

thanks
Dotan

From mst at dev.mellanox.co.il Tue Jul 31 07:04:36 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 31 Jul 2007 17:04:36 +0300 Subject: [ofa-general] Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <46AF3CA8.6050201@gmail.com> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> Message-ID: <20070731140436.GA16015@mellanox.co.il> > Quoting Moni Shoua : > Subject: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for?the bonding driver > > Roland Dreier wrote: > > > 1. When bonding enslaves an IPoIB device the bonding neighbor holds a > > > reference to a cleanup function in the IPoIB drives. This makes it unsafe to > > > unload the IPoIB module if there are bonding neighbors in the air. So, to > > > avoid this race one must unload bonding before unloading IPoIB. > > > > I think we really want to resolve this somehow. Getting an oops by > > doing "modprobe -r ipoib" isn't that friendly. > > > You are right and we want to resolve that. > One way is to clean the neigh destructor function from all IPoIB neighs. > The other way is to prevent ipoib unload if device is a slave or is referenced from > somewhere else. > > I guess I would like an advice here. I had this idea: Maybe we could use hard_header_cache/header_cache_update methods instead of neighbour cleanup calls. To do this, it is possible that we'll have to switch from storing pointers inside the neighbour to keeping an index there, but I expect the performance impact to be minimal. As a result, bonding would not have to copy pointers into ipoib module and module removal would get fixed. -- MST From mst at dev.mellanox.co.il Tue Jul 31 07:05:32 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jul 2007 17:05:32 +0300 Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs? 
In-Reply-To: <46AF4157.3090007@dev.mellanox.co.il> References: <46AF4157.3090007@dev.mellanox.co.il> Message-ID: <20070731140532.GB16015@mellanox.co.il> > Quoting Dotan Barak : > Subject: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs? > > Hi Roland. > > Here is the code of the macro IB_UMEM_MAX_PAGE_CHUNK (from ib_verbs): > > #define IB_UMEM_MAX_PAGE_CHUNK \ > ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \ > ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ > (void *) &((struct ib_umem_chunk *) 0)->page_list[0])) > > > > The problem is that the struct ib_umem_chunk is no longer in this header > file (ib_verbs.h), the declaration > of this structure can be found in ib_umem.h. > > This problem can be fixed by one of the following: > 1) add include to ib_umem.h in ib_verbs.h > 2) move this macro to ib_umem.h > > what do you think? 2 I guess. -- MST From tziporet at dev.mellanox.co.il Tue Jul 31 07:16:12 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 31 Jul 2007 17:16:12 +0300 Subject: [ewg] Re: [ofa-general] OFED 1.2.c-9 is available In-Reply-To: <46AF3E63.20204@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF3E63.20204@voltaire.com> Message-ID: <46AF442C.6020904@mellanox.co.il> Erez Zilber wrote: > Does 1.2.c-9 also include FMR support? Else, should we wait for 1.2.c-10? > FMRs are not supported in 1.2.c-9. 1.2.c-10 supports FMRs and it will be available tomorrow Tziporet From ogerlitz at voltaire.com Tue Jul 31 07:19:12 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jul 2007 17:19:12 +0300 Subject: [ofa-general] Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <20070731140436.GA16015@mellanox.co.il> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> <20070731140436.GA16015@mellanox.co.il> Message-ID: <46AF44E0.50700@voltaire.com> Michael S. 
Tsirkin wrote: > Maybe we could use hard_header_cache/header_cache_update methods instead of > neighbour cleanup calls. > To do this, it is possible that we'll have to switch from storing pointers > inside the neighbour to keeping an index there, but I expect the > performance impact to be minimal. > > As a result, bonding would not have to copy pointers into ipoib module > and module removal would get fixed. To be precise, bonding will copy all the symbols it copies today from the slave module (ipoib), see bond_setup_by_slave() in patch 3/7, except for the neighbour cleanup callback which is copied through coping the neigh_setup function. Or. From ogerlitz at voltaire.com Tue Jul 31 07:21:47 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jul 2007 17:21:47 +0300 Subject: [ofa-general] Re: [ewg] mlx4/fmr support In-Reply-To: <46AF442C.6020904@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF3E63.20204@voltaire.com> <46AF442C.6020904@mellanox.co.il> Message-ID: <46AF457B.9040107@voltaire.com> Tziporet Koren wrote: > 1.2.c-10 supports FMRs and it will be available tomorrow thats very important progress re iser support since unlike srp we can't work without fmr. When are you planning to send the mlx4 FMR code to review on the general list? I guess this is code candidate for 2.6.24, correct? Or. From mst at dev.mellanox.co.il Tue Jul 31 07:22:35 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jul 2007 17:22:35 +0300 Subject: [ofa-general] Re: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <46AF44E0.50700@voltaire.com> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> <20070731140436.GA16015@mellanox.co.il> <46AF44E0.50700@voltaire.com> Message-ID: <20070731142234.GC16015@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: Re: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for?the bonding driver > > Michael S. 
Tsirkin wrote: > >Maybe we could use hard_header_cache/header_cache_update methods instead of > >neighbour cleanup calls. > > >To do this, it is possible that we'll have to switch from storing pointers > >inside the neighbour to keeping an index there, but I expect the > >performance impact to be minimal. > > > >As a result, bonding would not have to copy pointers into ipoib module > >and module removal would get fixed. > > To be precise, bonding will copy all the symbols it copies today from > the slave module (ipoib), > see bond_setup_by_slave() in patch 3/7, except > for the neighbour cleanup callback which is copied through coping the > neigh_setup function. Not really. This copying of symbols is something that you added, isn't it? So with this approach, it won't be needed. It's always wrong to copy symbols from another module without referencing it. -- MST From dotanb at dev.mellanox.co.il Tue Jul 31 07:30:51 2007 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 31 Jul 2007 17:30:51 +0300 Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs? In-Reply-To: <20070731140532.GB16015@mellanox.co.il> References: <46AF4157.3090007@dev.mellanox.co.il> <20070731140532.GB16015@mellanox.co.il> Message-ID: <46AF479B.2000505@dev.mellanox.co.il> Michael S. Tsirkin wrote: >> This problem can be fixed by one of the following: >> 1) add include to ib_umem.h in ib_verbs.h >> 2) move this macro to ib_umem.h >> >> what do you think? >> > > 2 I guess. > I will send a patch with the suggested options in a few minutes. Dotan From rdreier at cisco.com Tue Jul 31 07:29:36 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jul 2007 07:29:36 -0700 Subject: [ofa-general] Re: should the macro IB_UMEM_MAX_PAGE_CHUNK be removed from ib_verbs? 
In-Reply-To: <46AF479B.2000505@dev.mellanox.co.il> (Dotan Barak's message of "Tue, 31 Jul 2007 17:30:51 +0300")
References: <46AF4157.3090007@dev.mellanox.co.il> <20070731140532.GB16015@mellanox.co.il> <46AF479B.2000505@dev.mellanox.co.il>
Message-ID:

 >>> 2) move this macro to ib_umem.h
 >> 2 I guess.
 > I will send a patch with the suggested options in a few minutes.

Actually just move the macro to umem.c since that's the only place it
is (or should be) used anyway.

 - R.

From dotanb at dev.mellanox.co.il Tue Jul 31 07:34:23 2007
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 31 Jul 2007 17:34:23 +0300
Subject: [ofa-general] [PATCH] include/rdma: move the macro IB_UMEM_MAX_PAGE_CHUNK to ib_umem.h
Message-ID: <200707311734.24055.dotanb@dev.mellanox.co.il>

After moving the struct ib_umem_chunk from the file ib_verbs.h to ib_umem.h
there isn't any reason for the macro IB_UMEM_MAX_PAGE_CHUNK to stay in
ib_verbs.h.

Signed-off-by: Dotan Barak

---

diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index c533d6c..69dea83 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -37,6 +37,11 @@
 #include
 #include
 
+#define IB_UMEM_MAX_PAGE_CHUNK \
+    ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \
+     ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \
+      (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
+
 struct ib_ucontext;
 
 struct ib_umem {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0627a6a..43b4c97 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -731,11 +731,6 @@ struct ib_udata {
     size_t outlen;
 };
 
-#define IB_UMEM_MAX_PAGE_CHUNK \
-    ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \
-     ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \
-      (void *) &((struct ib_umem_chunk *) 0)->page_list[0]))
-
 struct ib_pd {
     struct ib_device *device;
     struct ib_uobject *uobject;

From ogerlitz at voltaire.com Tue Jul 31 07:36:05 2007
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 31 Jul 2007 17:36:05 +0300 Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <20070731142234.GC16015@mellanox.co.il> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> <20070731140436.GA16015@mellanox.co.il> <46AF44E0.50700@voltaire.com> <20070731142234.GC16015@mellanox.co.il> Message-ID: <46AF48D5.9000502@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> To be precise, bonding will copy all the symbols it copies today from >> the slave module (ipoib), see bond_setup_by_slave() in patch 3/7 > Not really. > This copying of symbols is something that you added, isn't it? > So with this approach, it won't be needed. > It's always wrong to copy symbols from another module without > referencing it. Its the --first-- time you make this comment, please suggest a different approach, the relevant code is below. > +static void bond_setup_by_slave(struct net_device *bond_dev, > + struct net_device *slave_dev) > +{ > + bond_dev->hard_header = slave_dev->hard_header; > + bond_dev->rebuild_header = slave_dev->rebuild_header; > + bond_dev->hard_header_cache = slave_dev->hard_header_cache; > + bond_dev->header_cache_update = slave_dev->header_cache_update; > + bond_dev->hard_header_parse = slave_dev->hard_header_parse; > + > + bond_dev->neigh_setup = slave_dev->neigh_setup; > + > + bond_dev->type = slave_dev->type; > + bond_dev->hard_header_len = slave_dev->hard_header_len; > + bond_dev->addr_len = slave_dev->addr_len; > + > + memcpy(bond_dev->broadcast, slave_dev->broadcast, > + slave_dev->addr_len); > +} > + > /* enslave device to bond device */ > int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) > { > @@ -1351,6 +1371,24 @@ int bond_enslave(struct net_device *bond > goto err_undo_flags; > } > > + /* set bonding device ether type by slave - bonding netdevices are > + * created with ether_setup, so when the slave type is not ARPHRD_ETHER > + * 
there is a need to override some of the type dependent attribs/funcs. > + * > + * bond ether type mutual exclusion - don't allow slaves of dissimilar > + * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond > + */ > + if (bond->slave_cnt == 0) { > + if (slave_dev->type != ARPHRD_ETHER) > + bond_setup_by_slave(bond_dev, slave_dev); > + } else if (bond_dev->type != slave_dev->type) { > + printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from " > + "other slaves (%d), can not enslave it.\n", slave_dev->name, > + slave_dev->type, bond_dev->type); > + res = -EINVAL; > + goto err_undo_flags; > + } > + From tziporet at dev.mellanox.co.il Tue Jul 31 07:49:58 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 31 Jul 2007 17:49:58 +0300 Subject: [ofa-general] Re: [ewg] mlx4/fmr support In-Reply-To: <46AF457B.9040107@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF3E63.20204@voltaire.com> <46AF442C.6020904@mellanox.co.il> <46AF457B.9040107@voltaire.com> Message-ID: <46AF4C16.802@mellanox.co.il> Or Gerlitz wrote: > Tziporet Koren wrote: >> 1.2.c-10 supports FMRs and it will be available tomorrow > > thats very important progress re iser support since unlike srp we > can't work without fmr. When are you planning to send the mlx4 FMR > code to review on the general list? Jack will send it tomorrow. > I guess this is code candidate for 2.6.24, correct? yes Tziporet From mst at dev.mellanox.co.il Tue Jul 31 07:48:27 2007 From: mst at dev.mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 31 Jul 2007 17:48:27 +0300 Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <46AF48D5.9000502@voltaire.com> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> <20070731140436.GA16015@mellanox.co.il> <46AF44E0.50700@voltaire.com> <20070731142234.GC16015@mellanox.co.il> <46AF48D5.9000502@voltaire.com> Message-ID: <20070731144827.GB17331@mellanox.co.il> > Quoting Or Gerlitz : > Subject: Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for?the?bonding driver > > Michael S. Tsirkin wrote: > >>Quoting Or Gerlitz : > > >>To be precise, bonding will copy all the symbols it copies today from > >>the slave module (ipoib), see bond_setup_by_slave() in patch 3/7 > > >Not really. > >This copying of symbols is something that you added, isn't it? > >So with this approach, it won't be needed. > > >It's always wrong to copy symbols from another module without > >referencing it. > > Its the --first-- time you make this comment, It's really a well known fact. That's where the crash with modprobe -r comes from, right? > please suggest a different approach, I don't know, really - if you want to access a module, you really must get a reference to it, or to the device. How about adding the module pointer to struct net_device? >the relevant code is below. 
>+static void bond_setup_by_slave(struct net_device *bond_dev, >+ struct net_device *slave_dev) >+{ >+ bond_dev->hard_header = slave_dev->hard_header; >+ bond_dev->rebuild_header = slave_dev->rebuild_header; >+ bond_dev->hard_header_cache = slave_dev->hard_header_cache; >+ bond_dev->header_cache_update = slave_dev->header_cache_update; >+ bond_dev->hard_header_parse = slave_dev->hard_header_parse; >+ >+ bond_dev->neigh_setup = slave_dev->neigh_setup; >+ >+ bond_dev->type = slave_dev->type; >+ bond_dev->hard_header_len = slave_dev->hard_header_len; >+ bond_dev->addr_len = slave_dev->addr_len; >+ >+ memcpy(bond_dev->broadcast, slave_dev->broadcast, >+ slave_dev->addr_len); >+} >+ Hmm, it seems that switching to hard_header_cache as I suggested won't help at all. I wonder: is bonding currently broken with devices that implement hard_header_cache/header_cache_update? -- MST From ogerlitz at voltaire.com Tue Jul 31 07:57:46 2007 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 31 Jul 2007 17:57:46 +0300 Subject: [ofa-general] Re: [PATCH V3 0/7] net/bonding: ADD IPoIB support for the bonding driver In-Reply-To: <20070731144827.GB17331@mellanox.co.il> References: <46ADDB89.5030601@voltaire.com> <46AF3CA8.6050201@gmail.com> <20070731140436.GA16015@mellanox.co.il> <46AF44E0.50700@voltaire.com> <20070731142234.GC16015@mellanox.co.il> <46AF48D5.9000502@voltaire.com> <20070731144827.GB17331@mellanox.co.il> Message-ID: <46AF4DEA.9050202@voltaire.com> Michael S. Tsirkin wrote: >> Quoting Or Gerlitz : >> Michael S. Tsirkin wrote: >>> It's always wrong to copy symbols from another module without >>> referencing it. >> Its the --first-- time you make this comment, > It's really a well known fact. That's where the crash > with modprobe -r comes from, right? no, the crash --only-- comes from the neighbour cleanup function being called while ipoib is now probed out of the kernel. The other symbols are not problematic. 
I got positive feedback that this --is-- the problem in the previous posts and from Roland during my Sonoma presentation. >> please suggest a different approach, > I don't know, really - if you want to access a module, you really must get > a reference to it, or to the device. > How about adding the module pointer to struct net_device? I think there used to be there owner field of type struct module and it was removed... we will check that. >> the relevant code is below. > >> +static void bond_setup_by_slave(struct net_device *bond_dev, >> + struct net_device *slave_dev) >> +{ >> + bond_dev->hard_header = slave_dev->hard_header; >> + bond_dev->rebuild_header = slave_dev->rebuild_header; >> + bond_dev->hard_header_cache = slave_dev->hard_header_cache; >> + bond_dev->header_cache_update = slave_dev->header_cache_update; >> + bond_dev->hard_header_parse = slave_dev->hard_header_parse; >> + >> + bond_dev->neigh_setup = slave_dev->neigh_setup; >> + >> + bond_dev->type = slave_dev->type; >> + bond_dev->hard_header_len = slave_dev->hard_header_len; >> + bond_dev->addr_len = slave_dev->addr_len; >> + >> + memcpy(bond_dev->broadcast, slave_dev->broadcast, >> + slave_dev->addr_len); >> +} >> + > > Hmm, it seems that switching to hard_header_cache as I suggested won't help at all. why? please clarify. > I wonder: is bonding currently broken with devices that implement > hard_header_cache/header_cache_update? I don't think so. Note that bond_setup_by_slave is only called for slaves whose ether type is --not-- Ethernet. Or. From swise at opengridcomputing.com Tue Jul 31 08:16:27 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 31 Jul 2007 10:16:27 -0500 Subject: [ofa-general] ofed kernel git trees. 
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com> References: <46AC65E5.5050404@voltaire.com><20070730084616.GE9963@mellanox.co.il><46ADC0FF.2080000@voltaire.com> <20070730141155.GB7360@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com> Message-ID: <46AF524B.60603@opengridcomputing.com> Vlad, Which git tree should I be based against for ofed 1.2 development? I've always used git://git.openfabrics.org/~vlad/ofed_1_2/.git But there is also: git://git.openfabrics.org/ofed_1_2/linux-2.6.git. git://git.openfabrics.org/~vlad/ofed_kernel.git. Which should I use for 1.2 and 1.2.c? Thanks, Steve. From vlad at mellanox.co.il Tue Jul 31 08:21:35 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 31 Jul 2007 18:21:35 +0300 Subject: [ofa-general] RE: ofed kernel git trees. In-Reply-To: <46AF524B.60603@opengridcomputing.com> References: <46AC65E5.5050404@voltaire.com><20070730084616.GE9963@mellanox.co.il><46ADC0FF.2080000@voltaire.com> <20070730141155.GB7360@mellanox.co.il> <6C2C79E72C305246B504CBA17B5500C901FF1044@mtlexch01.mtl.com> <46AF524B.60603@opengridcomputing.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF13C1@mtlexch01.mtl.com> > Which git tree should I be based against for ofed 1.2 development? > > I've always used > > git://git.openfabrics.org/~vlad/ofed_1_2/.git > > But there is also: > > git://git.openfabrics.org/ofed_1_2/linux-2.6.git. ~vlad/ofed_1_2/.git is a symbolic link to ofed_1_2/linux-2.6.git > git://git.openfabrics.org/~vlad/ofed_kernel.git. ~vlad/ofed_kernel.git will be used for OFED-1.3 > > Which should I use for 1.2 and 1.2.c? 
> for 1.2 you should use git://git.openfabrics.org/ofed_1_2/linux-2.6.git (branch ofed_1_2) for 1.2.c you should use git://git.openfabrics.org/ofed_1_2/linux-2.6.git (branch ofed_1_2_c) Regards, Vladimir From sashak at voltaire.com Tue Jul 31 09:02:23 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 31 Jul 2007 19:02:23 +0300 Subject: [ofa-general] QoS RFC In-Reply-To: <46A89608.9010709@dev.mellanox.co.il> References: <46A283B6.1070105@dev.mellanox.co.il> <20070723002010.GU27878@sashak.voltaire.com> <46A89608.9010709@dev.mellanox.co.il> Message-ID: <20070731160223.GF29844@sashak.voltaire.com> Hi Yevgeny, On 15:39 Thu 26 Jul , Yevgeny Kliteynik wrote: > >> > >> * Comments may appear only in a separate line > > Why? What is wrong with: > > port-name: vs1/HCA-1/P1 # my best port > > I can use this too, but then the pound sign, wherever it will > appear, would mean commentary start. No \# or something like this > to include it in some other place - I don't want to complicate the > syntax. Sounds OK? Are we planning to use '#' somewhere? Anyway this comment is minor. > >> end-port-groups > > I agree that proposed syntax has better for human readability than pure > > XML, but isn't stuff like this will be more user-friendly? > > Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ; > > , or > > Storage "Free Text description" { 0x10001, 0x10002, 0x10003 }; > > , or > > Storage "Free Text description": ROUTERS, CAS ; > > GUID list is a good idea. > Not sure about the other stuff. A certain port group can be defined > both by guids and by node-types.
How about this: > > port-group > name: routers_and_mgt_nodes > use: all routers and management nodes > node-type: ROUTER > port-guid: 0x10001, 0x10002, 0x10003 > end-port-group I think it is doable too, like: 0x10001, 0x10002, 0x10003, ROUTER Guess it should be easy to parse GUIDs, names and special names (like ROUTER) in one line. Not sure it must be so, just thought... > >> qos-levels > >> > >> # the first one is just setting SL > >> qos-level > >> use: for the lowest priority communication > >> sl: 15 > >> packet-life: 16 > >> end-qos-level > >> # the second sets SL and QoS Class > >> qos-level > >> use: low latency best bandwidth > >> sl: 0 > >> end-qos-level > >> # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path > >> Bits > >> qos-level > >> use: just an example > >> sl: 0 > >> mtu-limit: 1 > >> rate-limit: 1 > >> packet-life: 12 > >> # Path Bits can be used e.g. to provide a different routes > >> through the > >> # subnet to a particular port > >> path-bits: 2,4,8-32 > >> end-qos-level > >> > >> end-qos-levels > >> > >> > >> # Match rules are scanned in a first-fit manner (like firewall rules > >> table) > >> qos-match-rules > >> > >> # matching by single criteria: class (list of values and ranges) > >> qos-match-rule > >> # just a description > >> use: low latency by class 7-9 or 11 > >> qos-class: 7-9,11 > >> # number of qos-level to apply to the matching PR/MPR > >> qos-level-sn: 1 > > Isn't it better and less error prone to match qos_level by name and not > > by sequential number? > > qos-level can have name, and then qos-match-rule will refer to this name. > But matching qos-level by sequential number makes it really easy to locate > the referred qos-level, which is important, as every PR/MPR request would > go through this process, so saving some runtime in this area is important > IMHO. Sure, it is important, but I'm not about internal data representation, internally this should be fast reference - by index or by directly by pointer. 
But in the file it would be easy for user to have names (numbers could be used as names too) instead of just serial numbering on one side, so an user will not need to count lines. > >> 9. OpenSM features > >> ------------------- > >> The QoS related functionality to be provided by OpenSM can be split into > >> two > >> main parts: > >> > >> 3.1. Fabric Setup > >> During fabric initialization the SM should parse the policy and apply its > >> settings to the discovered fabric elements. The following actions should > >> be > >> performed: > >> * Parsing of policy > >> * Node Group identification. Warning should be provided for each node not > >> specified but found. > >> * SL2VL settings validation should be checked: > >> + A warning will be provided if there are no matching targets for the > >> SL2VL > >> setting statement. > >> + An error message will be printed to the log file if an invalid > >> setting is > >> found. A setting is invalid if it refers to: > >> - Non existing port numbers of the target devices > >> - Unsupported VLs for the target device. In the later case the map to > >> non > >> existing VLs should be replaced to VL15 i.e. packets will be > >> dropped. > > I'm not sure it is optimal. We could have well documented or even > > configurable mapping rule instead, then this will not limit devices with > > higher capabilities. > > I'm open for suggestions. The rule like %(number of OpVLs)? Or even better - configurable mapping rule? > >> * Only PR/MPR fields that have their component mask bit set should be > >> compared. > >> * For a rule to be "matching" a PR/MPR request all the rule fields should > >> be > >> "matching" their PR/MPR fields. Such that a PR/MPR request that does > >> not have a component mask field set for one of the rule defined fields > >> can > >> not match that rule. > >> * A PR/MPR request that have a component mask bit set for one of the > >> fields > >> that is not defined by the rule can match the rule. 
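The three matching bullets quoted above can be sketched in C roughly as follows. This is only an illustration of the rule as worded in the RFC, not OpenSM code: the type names, the field enum (`QOS_F_CLASS`, `QOS_F_PKEY`, `QOS_F_SERVICE_ID`) and the single inclusive range per field are all assumptions made for the example (the RFC allows lists of values and ranges). The first-fit scan follows the "like firewall rules table" remark from the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical field indices; bit i of a mask refers to field i. */
enum qos_field { QOS_F_CLASS, QOS_F_PKEY, QOS_F_SERVICE_ID, QOS_F_COUNT };

struct qos_range { uint64_t lo, hi; };       /* inclusive bounds */

struct qos_match_rule {
    uint32_t defined;                        /* fields the rule constrains */
    struct qos_range range[QOS_F_COUNT];     /* one range per field (simplified) */
    unsigned qos_level_sn;                   /* qos-level applied on a match */
};

struct pr_request {
    uint32_t comp_mask;                      /* fields the client actually set */
    uint64_t value[QOS_F_COUNT];
};

/* A rule matches only if every field it defines is set in the request's
 * component mask and falls inside the rule's range. Fields set by the
 * request but not defined by the rule are ignored. */
static bool qos_rule_matches(const struct qos_match_rule *rule,
                             const struct pr_request *req)
{
    assert(rule != NULL && req != NULL);

    if ((rule->defined & ~req->comp_mask) != 0)
        return false;                        /* rule field missing in request */

    for (int f = 0; f < QOS_F_COUNT; f++) {
        if ((rule->defined & (1u << f)) == 0)
            continue;                        /* not constrained by this rule */
        if (req->value[f] < rule->range[f].lo ||
            req->value[f] > rule->range[f].hi)
            return false;
    }
    return true;
}

/* First-fit scan over the rule table; returns the rule index or -1. */
static int qos_first_match(const struct qos_match_rule *rules, int n,
                           const struct pr_request *req)
{
    for (int i = 0; i < n; i++)
        if (qos_rule_matches(&rules[i], req))
            return i;
    return -1;
}
```

Under this reading a rule with no defined fields acts as a catch-all, so placing it last in the table gives a default qos-level.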
> > Aren't last two too restrictive? SA can just to filter-out paths in > > response to match rest of the rule. No? > > Not sure I'm following. > The last bullet is not restrictive at all Right, but mostly I'm about previous bullet - where client _must_ set component mask to match all fields. Sasha From changquing.tang at hp.com Tue Jul 31 09:12:09 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Tue, 31 Jul 2007 16:12:09 -0000 Subject: [ofa-general] Scalable reliable connection In-Reply-To: <20070730125054.GO9963@mellanox.co.il> References: <20070730125054.GO9963@mellanox.co.il> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net> A send queue can only serve max J jobs within a node. Is it possible to make a single send queue to serve all jobs on all nodes ? --CQ > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > Michael S. Tsirkin > Sent: Monday, July 30, 2007 7:51 AM > To: Gleb Natapov > Cc: Pavel Shamis; ewg at lists.openfabrics.org; Michael S. > Tsirkin; general at lists.openfabrics.org; Ishai Rabinovitz > Subject: [ofa-general] Scalable reliable connection > > > Here's some background on what SRC is. This is basically > slide 6 in Dror's talk, for those that missed the talk. > > * * * > > SRC is an extension supported by recent Mellanox hardware > which is geared toward reducing the number of QPs required > for all-to-all communication on systems with a high number of > jobs per node. > > =================================================================== > Motivation: > =================================================================== > Given N nodes with J jobs per node, number of QPs required > for all-to-all communication is: > > With RC: > O((N * J) ^ 2) > > Since each job out of O(N * J) jobs must create a single QP > to communicate with each one of O(N * J) other jobs. 
>
> With SRC:
> O(N ^ 2 * J)
>
> This is achieved by using a single send queue (per job,
> out of O(N * J) jobs)
> to send data to all J jobs running on a specific node
> (out of O(N) nodes).
> Hardware uses new "SRQ number" field in packet header to
> multiplex receive WRs and WCs to private memory of each job.
>
> This is a similar idea to IB RD.
> Q: Why not use RD then?
> A: Because no hardware supports it.
>
> Details:
>
> ===================================================================
> Verbs extension:
> ===================================================================
>
> - There is a new transport/QP type "SRC".
> - There is a new object type "SRC domain"
> - Each SRQ gets new (optional) attributes:
>     SRC domain
>     SRC SRQ number
>     SRC CQ
>   SRQ must have either all 3 of these or none of these attributes
>
> - QPs of type SRC have all the same attributes as regular RC QPs
>   connected to SRQ, except that:
>   A. Each SRC QP has a new required attribute "SRC domain"
>   B. SRC QPs do *not* have "SRQ" attribute
>      (do not have a specific SRQ associated with them)
>
> ===================================================================
> Protocol extension:
> ===================================================================
> SRC QP behaviour: Requestor
> - Post send WR for this QP type is extended with SRQ number field
>   This number is sent as part of packet header
> - SRC Packets follow rules for RC packets on the wire, exactly
>   What is different is their handling at the responder side
>
> SRC QP behaviour: Responder
> Each incoming packet passes transport checks with respect to
> the SRC QP, following RC rules, exactly.
>
> After this, SRQ number in packet header is used to look up a
> specific SRQ. SRC domain of the resulting SRQ must be equal
> to SRC domain of the QP, otherwise a NAK is sent, and QP
> moves to error state.
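The responder-side lookup just described can be sketched in C as below. This is a simplified model, not driver or hardware code: all type and function names are hypothetical, the SRQ table is a plain array, and treating an unknown SRQ number the same way as a domain mismatch is an assumption the note does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified objects for illustration only. */
struct src_srq { uint32_t srq_num; uint32_t domain; };
struct src_qp  { uint32_t domain;  int in_error; };

enum src_rx_result { SRC_RX_OK, SRC_RX_NAK };

/*
 * Look up the SRQ named in the packet header and check that its SRC domain
 * equals the QP's SRC domain. On mismatch the QP moves to the error state
 * and the caller is expected to send a NAK.
 * Assumption: an unknown SRQ number is handled like a domain mismatch.
 */
static enum src_rx_result
src_lookup_srq(struct src_qp *qp, const struct src_srq *srqs, size_t count,
               uint32_t pkt_srq_num, const struct src_srq **out)
{
    *out = NULL;
    for (size_t i = 0; i < count; i++) {
        if (srqs[i].srq_num != pkt_srq_num)
            continue;
        if (srqs[i].domain == qp->domain) {
            *out = &srqs[i];   /* receive WR, CQ and PD come from this SRQ */
            return SRC_RX_OK;
        }
        break;                 /* SRQ found, but in a different SRC domain */
    }
    qp->in_error = 1;
    return SRC_RX_NAK;
}
```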
> > If the SRC domains match, receive WR and receive WC > processing are as follows: > > - RC Send > - Rather than using SRQ to which the QP is attached, > SRQ is looked up by SRQ number in the packet. > Receive WR is taken from this SRQ. > - Completions are generated on the CQ specified in the SRQ > > - RDMA/Atomic > - Rather than using PD to which the QP is attached, > SRQ is looked up by SRQ number in the packet. > PD of this SRQ is used for protection checks. > =================================================================== > > -- > MST > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Tue Jul 31 09:15:07 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jul 2007 09:15:07 -0700 Subject: [ofa-general] Re: IPoIB path caching In-Reply-To: <46ADAA85.8070106@voltaire.com> References: <46A46A1D.6040000@voltaire.com> <46A4EF00.9070305@ichips.intel.com> <46A5C8E6.5020906@voltaire.com> <46A628D8.4050109@ichips.intel.com> <46A6F50C.5000906@voltaire.com> <46A78146.1090304@ichips.intel.com> <46A846FC.5040704@voltaire.com> <46A8D80C.1090305@ichips.intel.com> <20070726181132.GO19768@obsidianresearch.com> <46AC509B.6020206@voltaire.com> <20070729173232.GA14867@obsidianresearch.com> <46ADAA85.8070106@voltaire.com> Message-ID: <46AF600B.2040904@ichips.intel.com> > Indeed. The argument I was trying to make is that arp cache invalidation > requires IPoIB PR cache invalidation, this handles 100% of the cases, > including the 10% not covered by doing cache invalidation based only on > IB events such as port up / sm lid change / sm reregister / etc ARP cache invalidation does not require, nor does it actually do IPoIB PR cache invalidation. We can argue whether or not it should, but the two are not linked together today. 
The local SA updates paths either in response to an event: LID change, port state change, GID in/out of service, etc., or when refreshed via a module parameter. A refresh can occur in response to an administrative event, when told to by a system administrator, before executing large jobs, periodically based on a timer, or whenever else. That policy is outside the scope of the proposed patches, but covers all other potential cases where the cache must be updated. I like the advantages of keeping the local SA entirely in user space, but there are issues that need to be worked through first. And implementation wise, it's unlikely to give us anything that remains in sync any better than what's already been proposed without the use of non-standard extensions. - Sean From mst at dev.mellanox.co.il Tue Jul 31 09:15:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Tue, 31 Jul 2007 19:15:52 +0300 Subject: [ofa-general] Re: Scalable reliable connection In-Reply-To: <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net> References: <20070730125054.GO9963@mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net> Message-ID: <20070731161552.GB5743@mellanox.co.il> > Quoting Tang, Changqing : > Subject: RE: Scalable reliable connection > > > A send queue can only serve max J jobs within a node. Is it possible to > make a single send queue to serve all jobs on all nodes ? How do you propose to do this? 
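For reference, the QP counts from the SRC background note earlier in this thread can be worked through numerically. The sketch below is only arithmetic under stated assumptions: the function names are made up, RC is counted as one QP per ordered pair of distinct jobs, and SRC as one send queue per job per node, including the job's own node.

```c
#include <stdint.h>

/* RC: each of the N*J jobs creates one QP for every other job. */
static uint64_t rc_qp_count(uint64_t nodes, uint64_t jobs_per_node)
{
    uint64_t jobs = nodes * jobs_per_node;
    return jobs * (jobs - 1);
}

/* SRC: each of the N*J jobs creates one send queue per node. */
static uint64_t src_qp_count(uint64_t nodes, uint64_t jobs_per_node)
{
    return nodes * jobs_per_node * nodes;
}
```

With N = 1024 and J = 8 this gives 67,100,672 QPs for RC against 8,388,608 for SRC, a reduction by roughly a factor of J. Each job still needs N send queues, which is the limit the question above is about.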
-- MST From changquing.tang at hp.com Tue Jul 31 09:21:13 2007 From: changquing.tang at hp.com (Tang, Changqing) Date: Tue, 31 Jul 2007 16:21:13 -0000 Subject: [ofa-general] RE: Scalable reliable connection In-Reply-To: <20070731161552.GB5743@mellanox.co.il> References: <20070730125054.GO9963@mellanox.co.il> <349DCDA352EACF42A0C49FA6DCEA840301EDB30F@G3W0634.americas.hpqcorp.net> <20070731161552.GB5743@mellanox.co.il> Message-ID: <349DCDA352EACF42A0C49FA6DCEA840301EDB33D@G3W0634.americas.hpqcorp.net> In this way, only one send queue is needed for each job(process), and we don't need to track the location of each other job(which is on which node). from a job point of view, either self, or others, all others are "equal"... --CQ > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] > Sent: Tuesday, July 31, 2007 11:16 AM > To: Tang, Changqing > Cc: Michael S. Tsirkin; Gleb Natapov; Pavel Shamis; > ewg at lists.openfabrics.org; general at lists.openfabrics.org; > Ishai Rabinovitz > Subject: Re: Scalable reliable connection > > > Quoting Tang, Changqing : > > Subject: RE: Scalable reliable connection > > > > > > A send queue can only serve max J jobs within a node. Is it > possible > > to make a single send queue to serve all jobs on all nodes ? > > How do you propose to do this? > > -- > MST > From mshefty at ichips.intel.com Tue Jul 31 09:25:37 2007 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 31 Jul 2007 09:25:37 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <46A94657.1020101@ichips.intel.com> References: <46A283B6.1070105@dev.mellanox.co.il> <46A94657.1020101@ichips.intel.com> Message-ID: <46AF6281.10709@ichips.intel.com> FYI - It is my intention to implement the host side portion of QoS support. (It's one of my path forward objectives.) I plan on implementing the host side as outlined below. If anyone has any comments, I would like to get them as soon as possible. - Sean Sean Hefty wrote: >> 2. 
Architecture ---------------- > > This is a higher level approach to the problem, but I came up with the > following QoS relationship hierarchy, where '->' means 'maps to'. > > Application Service -> Service ID (or range) > Service ID -> desired QoS > QoS, SGID, DGID, PKey -> SGID, DGID, TClass, FlowLabel, PKey > SGID, DGID, TC, FL, PKey -> SLID, DLID, SL (set if crossing subnets) > SLID, DLID, SL -> MTU, Rate, VL, PacketLifeTime > > I use these relationships below: > >> 4. IPoIB --------- >> >> IPoIB already query the SA for its broadcast group information. The >> additional functionality required is for IPoIB to provide the >> broadcast group SL, MTU, and RATE in every following PathRecord query >> performed when a new UDAV is needed by IPoIB. We could assign a >> special Service-ID for IPoIB use but since all communication on the >> same IPoIB interface shares the same QoS-Level without the ability to >> differentiate it by target service we can ignore it for simplicity. > > Rather than IPoIB specifying SL, MTU, and rate with PR queries, it > should specify TClass and FlowLabel. This is necessary for IPoIB to > span IB subnets. > >> 5. CMA features ---------------- >> >> The CMA interface supports Service-ID through the notion of port >> space as a prefixes to the port_num which is part of the sockaddr >> provided to rdma_resolve_add(). What is missing is the explicit >> request for a QoS-Class that should allow the ULP (like SDP) to >> propagate a specific request for a class of service. A mechanism for >> providing the QoS-Class is available in the IPv6 address, so we could >> use that address field. Another option is to implement a special >> connection options API for CMA. >> >> Missing functionality by CMA is the usage of the provided QoS-Class >> and Service-ID in the sent PR/MPR. When a response is obtained it is >> an existing requirement for the CMA to use the PR/MPR from the >> response in setting up the QP address vector. 
> > I think the RDMA CM needs two solutions, depending on which address > family is used. For IPv6, the existing interface is sufficient, and > works for both IB and iWarp. The RDMA CM only needs to include the TC > and FL as part of its PR query. For IPv4, to remain transport neutral, > I think we should add an rdma_set_option() routine to specify the QoS > field. The RDMA CM would include the QoS field for PR query under this > condition. > > For IB, this requires changes to the ib_sa to support the new PR > extensions. I don't think we gain anything having the RDMA CM include > service IDs as part of the query. > >> 6. SDP ------- >> >> SDP uses CMA for building its connections. The Service-ID for SDP is >> 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote >> TCP/IP Port Number to connect to. SDP might be provided with >> SO_PRIORITY socket option. In that case the value provided should be >> sent to the CMA as the TClass option of that connection. > > SDP would use specify the QoS through the IPv6 address or > rdma_set_option() routine. > >> 7. SRP ------- >> >> Current SRP implementation uses its own CM callbacks (not CMA). So >> SRP should fill in the Service-ID in the PR/MPR by itself and use >> that information in setting up the QP. The T10 SRP standard defines >> the SRP Service-ID to be defined by the SRP target I/O Controller >> (but they should also comply with IBTA Service- ID rules). Anyway, >> the Service-ID is reported by the I/O Controller in the ServiceEntries >> DMA attribute and should be used in the PR/MPR if the >> SA reports its ability to handle QoS PR/MPRs. > > I agree. > >> 8. iSER -------- iSER uses CMA and thus should be very close to SDP. >> The Service-ID for iSER should be TBD. > > See RDMA CM and SDP. > >> 3.2. PR/MPR query handling: OpenSM should be able to enforce the >> provided policy on client request. 
The overall flow for such requests >> is: first the request is matched against the defined match rules such >> that the target QoS-Level definition is found. Given the QoS-Level a >> path(s) search is performed with the given restrictions imposed by >> that level. The following two sections describe these steps. > > If we use the QoS hierarchy outlined above, I think we can construct > some fairly simple tables to guide our PR selection. The SA may need to > construct the tables starting at the bottom and working up, but I > *think* it could be done. And by distributing the tables, we can > support a more distributed (a la local SA) operation. > > From an administration point, I would be happier seeing something where > the administrator defines a QoS level in terms of latency or bandwidth > requirements and relative priority. Then, if desired, the administrator > could provide more details, such as indicating which nodes would use > which services, minimum required MTUs, etc. It would then be up to the > SA to map these requirements to specific TC, FL, SL, VL values. > > In general, though, I'm personally far less concerned with the QoS > specification interface to the SA, versus the operation that takes place > on the hosts. > > Comments on using this approach on the host side? From hal.rosenstock at gmail.com Tue Jul 31 09:27:58 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 31 Jul 2007 12:27:58 -0400 Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available In-Reply-To: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> Message-ID: On 7/12/07, Tziporet Koren wrote: > > > Hi All, > > OFED 1.2.c-9 is available now on the OFA server under: > http://www.openfabrics.org/builds/connectx/release/ > Note: this release was tested with FW 2.1.000 that will soon be available on > Mellanox web site for download. 
> > Supported Platforms and Operating Systems > ================================= > o CPU architectures: > - x86_64 > - x86 > - ppc64 > - ia64 > > o Linux Operating Systems: > - RedHat EL4 up3: 2.6.9-34.ELsmp > - RedHat EL4 up4: 2.6.9-42.ELsmp > - RedHat EL4 up5: 2.6.9-55.ELsmp > - RedHat EL5: 2.6.18-8.el5 > - SLES10: 2.6.16.21-0.8-smp > - kernel.org: 2.6.20.x > - SLES10 SP1: 2.6.16.46-0.12-smp (partially tested) > > Main changes from OFED 1.2.c-8: > ========================= > 1. Kernel oops in IPoIB on restart of the driver. > 2. IPoIB CM is now the default. > 3. MPI with SRQ is supported. > 4. Itanium is now supported. > > mlx4 Fixed Bugs and Enhancements > =========================== > - Added support for PCI-Ex gen2 devices; device IDs: 26418 and 26428. > - Query QP and query SRQ are now supported. > - Internal error flow was added. > - Number of QPs that can be attached to the same multicast group was > increased to 56. > - SRQ is now supported. > - Fork is now supported. > > ConnectX specific known issues and limitations > =================================== > - The following commands and/or features are not supported: > o Resize CQ > o FMRs > o APM > o SQD > - ibstat does not present all entries. Use ibv_devinfo instead. What is missing from ibstat for ConnectX ? What entries are missing ? -- Hal > - To load the driver on machines with 64KB default page size UAR bar must be > enlarged. 64KB page size is the default of PPC with RHEL5 and Itanium with > 64KB page size enabled. > Perform the following three steps: > 1. Add the following line in the firmware configuration (INI) file under > the > [HCA] section: > log2_uar_bar_megabytes = 5 > 2. Burn a modified firmware image with the changed INI file > 3. 
Reboot the system
>
>
>
> Tziporet Koren
> Software Director
> Mellanox Technologies
> mailto: tziporet at mellanox.co.il
> Tel +972-4-9097200, ext 380
>

From mshefty at ichips.intel.com Tue Jul 31 09:54:04 2007
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 09:54:04 -0700
Subject: [ofa-general] SDP kernel Oops.
In-Reply-To: <004201c7d2de$6d1dca30$47595e90$@rr.com>
References: <46AE183F.5090907@psc.edu> <004201c7d2de$6d1dca30$47595e90$@rr.com>
Message-ID: <46AF692C.4090302@ichips.intel.com>

> It appears that this is an illegal instruction (illegal operand) trap in a
> modified Rhat4U4 kernel. I am not sure about the line number, but perhaps
> sdp_cma_handler() is processing an RDMA_CM_EVENT_ROUTE_RESOLVED event.

Based on the backtrace, this should be an RDMA_CM_EVENT_CONNECT_REQUEST
event. I would verify that whatever structure is associated with a
listening rdma_cm_id remains valid until after the listen has been
destroyed.

- Sean

From sashak at voltaire.com Tue Jul 31 10:26:36 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 20:26:36 +0300
Subject: [ofa-general] [PATCH] opensm: remove lft setup hack
Message-ID: <20070731172636.GH29844@sashak.voltaire.com>

This removes the hack where OpenSM's lfts were updated by ucast_mgr and
not only from the network. It was once needed for the dumping functions,
which use the data from the lft, but the dumping has now moved to the end
of the sweep, when all lfts are up to date.
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_mgr.c | 7 ------- 1 files changed, 0 insertions(+), 7 deletions(-) diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index cfe1a58..b90509a 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -508,13 +508,6 @@ osm_ucast_mgr_set_fwd_table( else { p_mgr->any_change = TRUE; - /* - HACK: for now we will assume we succeeded to send - and set the local DB based on it. This should allow - us to immediatly dump out our routing. - */ - osm_switch_set_ft_block( - p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho ); } } -- 1.5.3.rc2.29.gc4640f From sashak at voltaire.com Tue Jul 31 10:33:20 2007 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 31 Jul 2007 20:33:20 +0300 Subject: [ofa-general] [PATCH] opensm: report new ports before handover mastership In-Reply-To: References: <20070725220204.GI31582@sashak.voltaire.com> <20070727025952.GE6691@sashak.voltaire.com> Message-ID: <20070731173320.GI29844@sashak.voltaire.com> This adds new ports reporting (with trap 64) just before mastership handover - new master does not report new ports in its first sweep. 
Pointed out by: lbt (Lan) Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_state_mgr.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index a6d0e24..1cf6257 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -2168,6 +2168,9 @@ Idle: p_remote_sm = __osm_state_mgr_get_highest_sm( p_mgr ); if( p_remote_sm != NULL ) { + /* report new ports (trap 64) before leaving MASTER */ + __osm_state_mgr_report_new_ports( p_mgr ); + /* need to handover the mastership * to the remote sm, and move to standby */ __osm_state_mgr_send_handover( p_mgr, p_remote_sm ); -- 1.5.3.rc2.29.gc4640f From rdreier at cisco.com Tue Jul 31 10:41:03 2007 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 31 Jul 2007 10:41:03 -0700 Subject: [ofa-general] QoS RFC In-Reply-To: <20070731160223.GF29844@sashak.voltaire.com> (Sasha Khapyorsky's message of "Tue, 31 Jul 2007 19:02:23 +0300") References: <46A283B6.1070105@dev.mellanox.co.il> <20070723002010.GU27878@sashak.voltaire.com> <46A89608.9010709@dev.mellanox.co.il> <20070731160223.GF29844@sashak.voltaire.com> Message-ID: I think that defining a new file format is really going in the wrong direction. XML would make a lot of sense (and you could use something like RELAX NG to define the schema very readably and precisely). XML has the advantage that many parsers, GUI editors, and other tools are already widely available. If you don't like XML for whatever reason, please at least consider something like YAML before you invent something completely new. - R. From Kapil.Dukle at med.ge.com Tue Jul 31 10:49:27 2007 From: Kapil.Dukle at med.ge.com (Dukle, Kapil (GE Healthcare)) Date: Tue, 31 Jul 2007 13:49:27 -0400 Subject: [ofa-general] UDAPL code examples Message-ID: Hi all, Does the OFED distribution have examples/code samples on how UDAPL can be used? The examples I looked at in perftest directly use Verbs API calls. 
Thanks,

From swise at opengridcomputing.com Tue Jul 31 10:57:08 2007
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 31 Jul 2007 12:57:08 -0500
Subject: [ofa-general] Re: [PATCH 2.6.23 1/2] Make the iw_cxgb3 module parameters writable.
In-Reply-To:
References: <20070729201226.31659.85900.stgit@dell3.ogc.int>
Message-ID: <46AF77F4.2000003@opengridcomputing.com>

Roland Dreier wrote:
> ugh, missed these before my last merge...
>
> anyway:
>
> why do we want the parameters writable? a good changelog tells me
> what, why and how, and this changelog just covered the "what". Also,
> I assume you've checked that it's OK for these variables to change at
> any time?

I want to be able to change these parameters at run time. Eventually, we
might want these parameters as rdma connection setup parameters. For now,
it's useful to be able to set them without reloading.

Also, it is safe to change them at any time. All of these are read once
and utilized at connection setup. So changing them is safe in that
existing connections aren't affected, and only subsequent connections
will utilize the new values.

Sorry for the terse changelog...

Steve.

From ardavis at ichips.intel.com Tue Jul 31 11:08:14 2007
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Tue, 31 Jul 2007 11:08:14 -0700
Subject: [ofa-general] UDAPL code examples
In-Reply-To:
References:
Message-ID: <46AF7A8E.8010902@ichips.intel.com>

Dukle, Kapil (GE Healthcare) wrote:
>
> Hi all,
>
> Does the OFED distribution have examples/code samples on how UDAPL can
> be used? The examples I looked at in
> perftest directly use Verbs API calls.
>

Take a look at dtest (dapl/test/dtest/dtest.c) for a simple server/client
example that does message sends, rdma writes, and rdma reads.
-arlin

From sean.hefty at intel.com Tue Jul 31 11:14:03 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 11:14:03 -0700
Subject: [ofa-general] Re: I think that there is a resource leak in the corefile mad_rmpp.c
In-Reply-To: <46AC6B5C.6020702@dev.mellanox.co.il>
Message-ID: <000301c7d39e$9487cdd0$12c8180a@amr.corp.intel.com>

>It seems that the AHs which are being created in alloc_response_msg()
>(which is being called from ack_ds_ack()) are not being destroyed
>because the rmpp_type of this packet is IB_MGMT_RMPP_TYPE_ACK, so the
>destroy AH is not being executed.

Thanks for the clarification. This is a dual-sided RMPP issue involving
the direction switch ACK. ib_rmpp_send_handler() needs to distinguish
this ACK from normal ACKs. I will see if I can come up with a (simple)
fix for this.

- Sean

From sashak at voltaire.com Tue Jul 31 11:41:38 2007
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 31 Jul 2007 21:41:38 +0300
Subject: [ofa-general] QoS RFC
In-Reply-To:
References: <46A283B6.1070105@dev.mellanox.co.il> <20070723002010.GU27878@sashak.voltaire.com> <46A89608.9010709@dev.mellanox.co.il> <20070731160223.GF29844@sashak.voltaire.com>
Message-ID: <20070731184138.GJ29844@sashak.voltaire.com>

On 10:41 Tue 31 Jul , Roland Dreier wrote:
> I think that defining a new file format is really going in the wrong
> direction. XML would make a lot of sense (and you could use something
> like RELAX NG to define the schema very readably and precisely). XML
> has the advantage that many parsers, GUI editors, and other tools are
> already widely available.
>
> If you don't like XML for whatever reason, please at least consider
> something like YAML before you invent something completely new.

We don't have any XML or YAML config files yet. Personally I would prefer
a human- rather than machine-readable/writable file format, simply because
hand editing is still the main option now and we don't have any useful
GUI management infrastructure.
Sasha From swise at opengridcomputing.com Tue Jul 31 12:12:32 2007 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 31 Jul 2007 14:12:32 -0500 Subject: [ofa-general] patches for 1.2.c Message-ID: <46AF89A0.9070805@opengridcomputing.com> Guys, I have 2 more patches to go in ofed_1_2/ofed_1_2_c. Is there some grand scheme to the naming of kernel_patches/fixes/* for 1.2.c? I noticed a slew of new files for the post-2.6.22 fixes, and wondered if there is a naming scheme? Or should I just post a patch for the ofed_1_2 branch and let you all create the ofed_1_2_c kernel_patches/fixes/ patch file ?? Thanks, Steve. From tziporet at dev.mellanox.co.il Tue Jul 31 12:19:05 2007 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 31 Jul 2007 22:19:05 +0300 Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> Message-ID: <46AF8B29.7090906@mellanox.co.il> Hal Rosenstock wrote: >> - ibstat does not present all entries. Use ibv_devinfo instead. >> > > What is missing from ibstat for ConnectX ? What entries are missing ? > > See in the report below. 
If you can fix it it will be great

Tziporet

#> ibstat
CA 'mlx4_0'
        CA type:                  <=== missing
        Number of ports: 2
        Firmware version:         <=== missing
        Hardware version:         <=== missing
        Node GUID: 0x0002c903000004bc
        System image GUID: 0x0002c903000004bf
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02500868
                Port GUID: 0x0002c903000004bd
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02500868
                Port GUID: 0x0002c903000004be

From rdreier at cisco.com Tue Jul 31 12:24:42 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 12:24:42 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: <46AF8B29.7090906@mellanox.co.il> (Tziporet Koren's message of "Tue, 31 Jul 2007 22:19:05 +0300")
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF8B29.7090906@mellanox.co.il>
Message-ID:

> CA type: <=== missing
> Firmware version: <=== missing
> Hardware version: <=== missing

These need sysfs entries from the mlx4_ib driver, I guess.
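As a rough illustration of why a missing sysfs entry shows up as an empty field, here is a minimal C sketch of reading one per-CA attribute file. It is not the libibumad code; the function name and the base-directory parameter are hypothetical, and the attribute names mentioned afterwards are assumed from what other IB drivers expose.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a single-line sysfs-style attribute into buf.
 * base is the class directory, for example "/sys/class/infiniband".
 * Returns 0 on success, or -1 (with buf set to "") if the attribute
 * file is missing or unreadable.
 */
static int read_ca_attr(const char *base, const char *ca, const char *attr,
                        char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    if (len == 0)
        return -1;
    buf[0] = '\0';

    n = snprintf(path, sizeof(path), "%s/%s/%s", base, ca, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;          /* driver does not export this attribute */

    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        buf[0] = '\0';
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

A tool built this way would print an empty "CA type", "Firmware version" or "Hardware version" whenever files such as hca_type, fw_ver or hw_rev (names assumed here) are absent for the device, which fits the output shown above.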
> > I think we have them but under drivers/net and not drivers/infiniband Tziporet From becker at nas.nasa.gov Tue Jul 31 12:43:42 2007 From: becker at nas.nasa.gov (Jeff Becker) Date: Tue, 31 Jul 2007 12:43:42 -0700 Subject: [ofa-general] RE: OFA website edits In-Reply-To: <46AE5901.7010307@ichips.intel.com> References: <46956FF9.50102@ichips.intel.com> <46968448.2000401@ichips.intel.com> <46A536EC.4060201@ichips.intel.com> <46AE5901.7010307@ichips.intel.com> Message-ID: <795c49870707311243l4615b464v3b1b0f1479870684@mail.gmail.com> Hi. Jeff Scott asked me to help with this. I've started thinking about how to implement it, and I may have a first cut by the end of this week. -jeff On 7/30/07, Arlin Davis wrote: > Roland Dreier wrote: > > > > Maintainers: please review the following proposal regarding new public > > > download locations/website links and respond. This request originated > > > from xwg. > > > > > > http://lists.openfabrics.org/pipermail/xwg/2007-June/000018.html > > > >I guess it's OK, but what's the difference between a README and a > >WEB_README? > > > >Would it make sense to have just one file (maybe in a format that is > >easily transformed to HTML, eg reStructuredText) for all purposes? > > > > > > That works for me. I was waiting for to hear back from Jeff regarding a > filename and content. > > Jeff, can you comment? What format will work best for you? 
> > -arlin > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ardavis at ichips.intel.com Tue Jul 31 13:28:44 2007 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 31 Jul 2007 13:28:44 -0700 Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com> References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com> Message-ID: <46AF9B7C.7020906@ichips.intel.com> Sean Hefty wrote: >>OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We >>are running into some cases on larger clusters that require longer timeouts >>then the default. Can you consider this rdma_cm patch for OFED 1.2.1 that adds >>a module parameter for the response timeout? Thanks. >> >> > >What's in it for me? :) > > > >>Signed-off by: Arlin Davis >> >> > >Acked-by: Sean Hefty > >Vlad, can you add this for OFED 1.2.1? > >- Sean > > Did this get added to 1.2.1? From hal.rosenstock at gmail.com Tue Jul 31 13:30:30 2007 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 31 Jul 2007 16:30:30 -0400 Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available In-Reply-To: <46AF90D1.8050000@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF8B29.7090906@mellanox.co.il> <46AF90D1.8050000@mellanox.co.il> Message-ID: On 7/31/07, Tziporet Koren wrote: > Roland Dreier wrote: > > > CA type: <=== missing > > > Firmware version: <=== missing > > > Hardware version: <=== missing > > > > These need sysfs entries from the mlx4_ib driver, I guess. > > > > > I think we have them but under drivers/net and not drivers/infiniband Why under drivers/net rather than drivers/infiniband like all the other drivers ? 
Does this really need special casing (in libibumad) ?

-- Hal

> Tziporet
>

From rdreier at cisco.com Tue Jul 31 13:33:54 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 13:33:54 -0700
Subject: [ofa-general] Re: [ewg] OFED 1.2.c-9 is available
In-Reply-To: (Hal Rosenstock's message of "Tue, 31 Jul 2007 16:30:30 -0400")
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF8B29.7090906@mellanox.co.il> <46AF90D1.8050000@mellanox.co.il>
Message-ID:

> Why under drivers/net rather than drivers/infiniband like all the
> other drivers ? Does this really need special casing (in libibumad) ?

Tziporet is incorrect. There's nothing from the mlx4_core driver either,
and when it is implemented, it should work exactly the same as all other
drivers.

From davem at systemfabricworks.com Tue Jul 31 13:39:27 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Tue, 31 Jul 2007 15:39:27 -0500
Subject: [ofa-general] [PATCH] infiniband-diags: Add common flags -P, -C, and -t
Message-ID: <46AF9DFF.mailEUD1EC05T@systemfabricworks.com>

Add common flags -P, -C, and -t to infiniband-diags programs and scripts
to allow specifying the HCA port number, HCA device name, and query
timeout. These diagnostic programs can now either be directed to
different fabrics attached to the system, or be forced to use different
ports should the fabric fail and comparisons be needed.

Two of these had conflicting prior use of the flags. For the ibcheckerrs
script, -T is now used to specify the threshold file. In the saquery
program, -p and -c are now used to specify getting the PathRecord info
and getting the SA's class port info.

Other than the resolution of the three conflicts, all commands behave
exactly the same as they did before the change if these common flags are
not used.

Signed-off-by: David A.
McMillen --- infiniband-diags/man/dump_lfts.8 | 12 +++++- infiniband-diags/man/dump_mfts.8 | 13 ++++++- infiniband-diags/man/ibcheckerrors.8 | 9 ++++- infiniband-diags/man/ibcheckerrs.8 | 15 ++++++- infiniband-diags/man/ibchecknet.8 | 9 ++++- infiniband-diags/man/ibchecknode.8 | 9 ++++- infiniband-diags/man/ibcheckport.8 | 9 ++++- infiniband-diags/man/ibcheckportstate.8 | 9 ++++- infiniband-diags/man/ibcheckportwidth.8 | 9 ++++- infiniband-diags/man/ibcheckstate.8 | 9 ++++- infiniband-diags/man/ibcheckwidth.8 | 12 +++++- infiniband-diags/man/ibclearcounters.8 | 11 ++++- infiniband-diags/man/ibclearerrors.8 | 11 ++++- infiniband-diags/man/ibdatacounters.8 | 8 +++- infiniband-diags/man/ibdatacounts.8 | 9 ++++- infiniband-diags/man/ibhosts.8 | 12 +++++- infiniband-diags/man/ibnodes.8 | 12 +++++- infiniband-diags/man/ibrouters.8 | 12 +++++- infiniband-diags/man/ibswitches.8 | 12 +++++- infiniband-diags/man/saquery.8 | 15 ++++++- infiniband-diags/scripts/dump_lfts.sh | 49 ++++++++++++++++++---- infiniband-diags/scripts/dump_mfts.sh | 49 ++++++++++++++++++---- infiniband-diags/scripts/ibcheckerrors.in | 34 ++++++++++++---- infiniband-diags/scripts/ibcheckerrs.in | 27 ++++++++++--- infiniband-diags/scripts/ibchecknet.in | 36 ++++++++++++---- infiniband-diags/scripts/ibchecknode.in | 22 ++++++++-- infiniband-diags/scripts/ibcheckport.in | 22 ++++++++-- infiniband-diags/scripts/ibcheckportstate.in | 22 ++++++++-- infiniband-diags/scripts/ibcheckportwidth.in | 22 ++++++++-- infiniband-diags/scripts/ibcheckstate.in | 32 +++++++++++--- infiniband-diags/scripts/ibcheckwidth.in | 32 +++++++++++--- infiniband-diags/scripts/ibclearcounters.in | 32 +++++++++++--- infiniband-diags/scripts/ibclearerrors.in | 30 +++++++++++--- infiniband-diags/scripts/ibdatacounters.in | 34 ++++++++++++---- infiniband-diags/scripts/ibdatacounts.in | 25 +++++++++-- infiniband-diags/scripts/ibhosts.in | 30 +++++++++++-- infiniband-diags/scripts/ibnodes.in | 2 +- infiniband-diags/scripts/ibrouters.in | 
30 +++++++++++-- infiniband-diags/scripts/ibswitches.in | 30 +++++++++++-- infiniband-diags/src/saquery.c | 56 ++++++++++++++++++++------ 40 files changed, 680 insertions(+), 153 deletions(-) diff --git a/infiniband-diags/man/dump_lfts.8 b/infiniband-diags/man/dump_lfts.8 index c1458b3..091de41 100644 --- a/infiniband-diags/man/dump_lfts.8 +++ b/infiniband-diags/man/dump_lfts.8 @@ -5,7 +5,8 @@ dump_lfts.sh \- dump InfiniBand unicast forwarding tables .SH SYNOPSIS .B dump_lfts.sh -[\-h] [\-D] [>/path/to/dump-file] +[\-h] [\-D] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [>/path/to/dump-file] + .SH DESCRIPTION .PP @@ -24,6 +25,15 @@ dump forwarding tables using direct routed rather than LID routed SMPs .TP \fB\-h\fR show help +.TP +\fB\-C\fR +use the specified ca_name. +.TP +\fB\-P\fR +use the specified ca_port. +.TP +\fB\-t\fR +override the default timeout for the solicited mads. .SH SEE ALSO .BR dump_mfts(8), diff --git a/infiniband-diags/man/dump_mfts.8 b/infiniband-diags/man/dump_mfts.8 index fc8bc2e..90dd2ac 100644 --- a/infiniband-diags/man/dump_mfts.8 +++ b/infiniband-diags/man/dump_mfts.8 @@ -5,7 +5,8 @@ dump_lfts.sh \- dump InfiniBand multicast forwarding tables .SH SYNOPSIS .B dump_mfts.sh -[\-h] [\-D] [>/path/to/file] +[\-h] [\-D] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] +[>/path/to/file] .SH DESCRIPTION .PP @@ -21,6 +22,16 @@ dump forwarding tables using direct routed rather than LID routed SMPs .TP \fB\-h\fR show help +.TP +\fB\-C\fR +use the specified ca_name. +.TP +\fB\-P\fR +use the specified ca_port. +.TP +\fB\-t\fR +override the default timeout for the solicited mads. 
+ .SH SEE ALSO .BR dump_lfts(8), diff --git a/infiniband-diags/man/ibcheckerrors.8 b/infiniband-diags/man/ibcheckerrors.8 index 489d531..15b646f 100644 --- a/infiniband-diags/man/ibcheckerrors.8 +++ b/infiniband-diags/man/ibcheckerrors.8 @@ -5,7 +5,8 @@ ibcheckerrors \- validate IB subnet and report errors .SH SYNOPSIS .B ibcheckerrors -[\-h] [\-b] [\-v] [\-N | \-nocolor] [] +[\-h] [\-b] [\-v] [\-N | \-nocolor] [ | \-C ca_name +\-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -21,6 +22,12 @@ errors (from port counters). not what they are. .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibcheckerrs.8 b/infiniband-diags/man/ibcheckerrs.8 index 7b22163..f901889 100644 --- a/infiniband-diags/man/ibcheckerrs.8 +++ b/infiniband-diags/man/ibcheckerrs.8 @@ -5,7 +5,10 @@ ibcheckerrs \- validate IB port (or node) and report errors in counters above th .SH SYNOPSIS .B ibcheckerrs -[\-h] [\-b] [\-v] [\-G] [\-t ] [\-s(how_thresholds)] [\-N | \-nocolor] +[\-h] [\-b] [\-v] [\-G] [\-T ] [\-s(how_thresholds)] +[\-N | \-nocolor] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] + + .SH DESCRIPTION .PP @@ -23,7 +26,7 @@ specified using the -t option. .PP \-s show predefined thresholds .PP -\-t use specified threshold file +\-T use specified threshold file .PP \-v increase the verbosity level .PP @@ -31,6 +34,12 @@ specified using the -t option. present, not what they are. .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH EXAMPLE .PP @@ -38,7 +47,7 @@ ibcheckerrs 2 # check aggregated node counter for lid 2 .PP ibcheckerrs 2 4 # check port counters for lid 2 port 4 .PP -ibcheckerrs -t xxx 2 # check node using xxx threshold file +ibcheckerrs -T xxx 2 # check node using xxx threshold file .SH SEE ALSO .BR perfquery(8), diff --git a/infiniband-diags/man/ibchecknet.8 b/infiniband-diags/man/ibchecknet.8 index ddeccc8..375427b 100644 --- a/infiniband-diags/man/ibchecknet.8 +++ b/infiniband-diags/man/ibchecknet.8 @@ -5,7 +5,8 @@ ibchecknet \- validate IB subnet and report errors .SH SYNOPSIS .B ibchecknet -[\-h] [\-N | \-nocolor] [] +[\-h] [\-N | \-nocolor] [ | \-C ca_name \-P ca_port +\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -16,6 +17,12 @@ reports errors (from port counters). .SH OPTIONS .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibchecknode.8 b/infiniband-diags/man/ibchecknode.8 index ad1e88b..ecd8bf9 100644 --- a/infiniband-diags/man/ibchecknode.8 +++ b/infiniband-diags/man/ibchecknode.8 @@ -5,7 +5,8 @@ ibchecknode \- validate IB node and report errors .SH SYNOPSIS .B ibchecknode -[\-h] [\-v] [\-N | \-nocolor] [\-G] +[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port] +[\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address. \-v increase the verbosity level .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH EXAMPLE .PP diff --git a/infiniband-diags/man/ibcheckport.8 b/infiniband-diags/man/ibcheckport.8 index 3a18f21..08166c3 100644 --- a/infiniband-diags/man/ibcheckport.8 +++ b/infiniband-diags/man/ibcheckport.8 @@ -5,7 +5,8 @@ ibcheckport \- validate IB port and report errors .SH SYNOPSIS .B ibcheckport -[\-h] [\-v] [\-N | \-nocolor] [\-G] +[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port] +[\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address. \-v increase the verbosity level .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH EXAMPLE .PP diff --git a/infiniband-diags/man/ibcheckportstate.8 b/infiniband-diags/man/ibcheckportstate.8 index 139da57..4c70f16 100644 --- a/infiniband-diags/man/ibcheckportstate.8 +++ b/infiniband-diags/man/ibcheckportstate.8 @@ -5,7 +5,8 @@ ibcheckportstate \- validate IB port for LinkUp and not Active state .SH SYNOPSIS .B ibcheckportstate -[\-h] [\-v] [\-N | \-nocolor] [\-G] +[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port] +[\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -22,6 +23,12 @@ Port address is a lid unless -G option is used to specify a GUID address. \-v increase the verbosity level .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH EXAMPLE .PP diff --git a/infiniband-diags/man/ibcheckportwidth.8 b/infiniband-diags/man/ibcheckportwidth.8 index 304e345..541be8a 100644 --- a/infiniband-diags/man/ibcheckportwidth.8 +++ b/infiniband-diags/man/ibcheckportwidth.8 @@ -5,7 +5,8 @@ ibcheckportwidth \- validate IB port for 1x link width .SH SYNOPSIS .B ibcheckport -[\-h] [\-v] [\-N | \-nocolor] [\-G] +[\-h] [\-v] [\-N | \-nocolor] [\-G] [\-C ca_name] [\-P ca_port] +[\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -21,6 +22,12 @@ Port address is a lid unless -G option is used to specify a GUID address. \-v increase the verbosity level .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH EXAMPLE .PP diff --git a/infiniband-diags/man/ibcheckstate.8 b/infiniband-diags/man/ibcheckstate.8 index 5cb41c9..e718979 100644 --- a/infiniband-diags/man/ibcheckstate.8 +++ b/infiniband-diags/man/ibcheckstate.8 @@ -5,7 +5,8 @@ ibcheckstate \- find ports in IB subnet which are link up but not active .SH SYNOPSIS .B ibcheckstate -[\-h] [\-v] [\-N | \-nocolor] [] +[\-h] [\-v] [\-N | \-nocolor] [ | \-C ca_name \-P ca_port +\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -17,6 +18,12 @@ a port physical state other than LinkUp. .SH OPTIONS .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibcheckwidth.8 b/infiniband-diags/man/ibcheckwidth.8 index 5a3b1df..da9a70b 100644 --- a/infiniband-diags/man/ibcheckwidth.8 +++ b/infiniband-diags/man/ibcheckwidth.8 @@ -5,7 +5,9 @@ ibcheckwidth \- find 1x links in IB subnet .SH SYNOPSIS .B ibcheckwidth -[\-h] [\-v] [\-N | \-nocolor] [] +[\-h] [\-v] [\-N | \-nocolor] [ | \-C ca_name +\-P ca_port \-t(imeout) timeout_ms] + .SH DESCRIPTION .PP @@ -15,7 +17,13 @@ reports any 1x links. .SH OPTIONS .PP -\-N | \-nocolor use mono rather than color mode +\-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibclearcounters.8 b/infiniband-diags/man/ibclearcounters.8 index 96ed8fa..d14e038 100644 --- a/infiniband-diags/man/ibclearcounters.8 +++ b/infiniband-diags/man/ibclearcounters.8 @@ -5,7 +5,8 @@ ibclearcounters \- clear port counters in IB subnet .SH SYNOPSIS .B ibclearcounters -[\-h] [\-N | \-nocolor] [] +[\-h] [\-N | \-nocolor] [ | \-C ca_name +\-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -14,7 +15,13 @@ the IB subnet topology or using an already saved topology file. .SH OPTIONS .PP -\-N | \-nocolor use mono rather than color mode +\-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibclearerrors.8 b/infiniband-diags/man/ibclearerrors.8 index 6479f9c..58f73d9 100644 --- a/infiniband-diags/man/ibclearerrors.8 +++ b/infiniband-diags/man/ibclearerrors.8 @@ -5,7 +5,8 @@ ibclearerrors \- clear error counters in IB subnet .SH SYNOPSIS .B ibclearerrors -[\-h] [\-N | \-nocolor] [] +[\-h] [\-N | \-nocolor] [ | \-C ca_name \-P ca_port +\-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -15,7 +16,13 @@ file. .SH OPTIONS .PP -\-N | \-nocolor use mono rather than color mode +\-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibdatacounters.8 b/infiniband-diags/man/ibdatacounters.8 index 7d562a0..309a8f2 100644 --- a/infiniband-diags/man/ibdatacounters.8 +++ b/infiniband-diags/man/ibdatacounters.8 @@ -5,7 +5,7 @@ ibdatacounters \- query IB subnet for data counters .SH SYNOPSIS .B ibdatacounters -[\-h] [\-b] [\-v] [\-N | \-nocolor] [] +[\-h] [\-b] [\-v] [\-N | \-nocolor] [ | \-C ca_name \-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP @@ -21,6 +21,12 @@ the data counters (from port counters). not what they are. .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
.SH SEE ALSO .BR ibnetdiscover(8), diff --git a/infiniband-diags/man/ibdatacounts.8 b/infiniband-diags/man/ibdatacounts.8 index 8a731a6..8b995f8 100644 --- a/infiniband-diags/man/ibdatacounts.8 +++ b/infiniband-diags/man/ibdatacounts.8 @@ -5,7 +5,8 @@ ibdatacounts \- get IB port data counters .SH SYNOPSIS .B ibdatacounts -[\-h] [\-b] [\-v] [\-G] [\-N | \-nocolor] [] +[\-h] [\-b] [\-v] [\-G] [\-N | \-nocolor] [\-C ca_name] [\-P ca_port] +[\-t(imeout) timeout_ms] [] .SH DESCRIPTION .PP @@ -24,6 +25,12 @@ address. \-b brief mode .PP \-N | \-nocolor use mono rather than color mode +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH EXAMPLE .PP diff --git a/infiniband-diags/man/ibhosts.8 b/infiniband-diags/man/ibhosts.8 index 31788fc..9d7fe9a 100644 --- a/infiniband-diags/man/ibhosts.8 +++ b/infiniband-diags/man/ibhosts.8 @@ -5,13 +5,23 @@ ibhosts \- show InfiniBand host nodes in topology .SH SYNOPSIS .B ibhosts -[\-h] [] +[\-h] [ | \-C ca_name \-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP ibhosts is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the CA nodes. +.SH OPTIONS +.PP +\-h show the usage message +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. + .SH SEE ALSO .BR ibnetdiscover(8) diff --git a/infiniband-diags/man/ibnodes.8 b/infiniband-diags/man/ibnodes.8 index fdd394c..dc59ca2 100644 --- a/infiniband-diags/man/ibnodes.8 +++ b/infiniband-diags/man/ibnodes.8 @@ -5,14 +5,24 @@ ibnodes \- show InfiniBand nodes in topology .SH SYNOPSIS .B ibnodes -[] +[\-h] [ | \-C ca_name \-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP ibnodes is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes (CAs and switches). 
+.SH OPTIONS +.PP +\-h show the usage message +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. .SH SEE ALSO + .BR ibnetdiscover(8) .SH AUTHOR diff --git a/infiniband-diags/man/ibrouters.8 b/infiniband-diags/man/ibrouters.8 index 068a2d9..698e0ee 100644 --- a/infiniband-diags/man/ibrouters.8 +++ b/infiniband-diags/man/ibrouters.8 @@ -5,13 +5,23 @@ ibrouters \- show InfiniBand router nodes in topology .SH SYNOPSIS .B ibrouters -[\-h] [] +[\-h] [ | \-C ca_name \-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP ibrouters is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the Rt nodes. +.SH OPTIONS +.PP +\-h show the usage message +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. + .SH SEE ALSO .BR ibnetdiscover(8) diff --git a/infiniband-diags/man/ibswitches.8 b/infiniband-diags/man/ibswitches.8 index c9d3650..0929240 100644 --- a/infiniband-diags/man/ibswitches.8 +++ b/infiniband-diags/man/ibswitches.8 @@ -5,13 +5,23 @@ ibswitches\- show InfiniBand switch nodes in topology .SH SYNOPSIS .B ibswitches -[\-h] [] +[\-h] [ | \-C ca_name \-P ca_port \-t(imeout) timeout_ms] .SH DESCRIPTION .PP ibswitches is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the switch nodes. +.SH OPTIONS +.PP +\-h show the usage message +.PP +\-C use the specified ca_name. +.PP +\-P use the specified ca_port. +.PP +\-t override the default timeout for the solicited mads. 
+ .SH SEE ALSO .BR ibnetdiscover(8) diff --git a/infiniband-diags/man/saquery.8 b/infiniband-diags/man/saquery.8 index 535851f..5558cc9 100644 --- a/infiniband-diags/man/saquery.8 +++ b/infiniband-diags/man/saquery.8 @@ -5,7 +5,10 @@ saquery \- query InfiniBand subnet administration attributes .SH SYNOPSIS .B saquery -[\-h] [\-d] [\-P] [\-N] [\-\-list | \-D] [\-S] [\-I] [\-L] [\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst ] [\-t(imeout) ] [\-\-switch\-map ] [ | | ] +[\-h] [\-d] [\-p] [\-N] [\-\-list | \-D] [\-S] [\-I] [\-L] [\-l] [\-G] [\-O] +[\-U] [\-c] [\-s] [\-g] [\-m] [--src-to-dst ] [\-C ca_name] +[\-P ca_port] [\-t(imeout) ] [\-\-switch\-map ] +[ | | ] .SH DESCRIPTION .PP @@ -15,7 +18,7 @@ saquery issues the selected SA query. Node records are queried by default. .PP .TP -\fB\-P\fR +\fB\-p\fR get PathRecord info .TP \fB\-N\fR @@ -45,7 +48,7 @@ return the name for the Lid specified \fB\-U\fR return the name for the Guid specified .TP -\fB\-C\fR +\fB\-c\fR get the SA's class port info .TP \fB\-s\fR @@ -63,6 +66,12 @@ description for each entry. Example: saquery -m 0xc000 get a PathRecord for where src and dst are either node names or LIDs .TP +\fB\-C\fR +use the specified ca_name. +.TP +\fB\-P\fR +use the specified ca_port. +.TP \fB\-t\fR, \fB\-timeout\fR Specify SA query response timeout in milliseconds. Default is 100 milliseconds. 
You may want to use diff --git a/infiniband-diags/scripts/dump_lfts.sh b/infiniband-diags/scripts/dump_lfts.sh index 49e86da..67a307c 100755 --- a/infiniband-diags/scripts/dump_lfts.sh +++ b/infiniband-diags/scripts/dump_lfts.sh @@ -7,35 +7,66 @@ usage () { - echo "usage: $0 [-D]" + echo Usage: `basename $0` "[-h] [-D] [-C ca_name]" \ + "[-P ca_port] [-t(imeout) timeout_ms]" exit 2 } dump_by_lid () { -for sw_lid in `ibswitches \ +for sw_lid in `ibswitches $ca_info \ | sed -ne 's/^.* lid \([0-9a-f]*\) .*$/\1/p'` ; do - ibroute $sw_lid + ibroute $ca_info $sw_lid done } dump_by_dr_path () { -for sw_dr in `ibnetdiscover -v \ +for sw_dr in `ibnetdiscover $ca_info -v \ | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ | sed -e 's/\]\[/,/g' \ | sort -u` ; do - ibroute -D ${sw_dr} + ibroute $ca_info -D ${sw_dr} done } +use_d="" +ca_info="" -if [ "$1" = "-D" ] ; then +while [ "$1" ]; do + case $1 in + -D) + use_d="-D" + ;; + -h) + usage + ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; + -*) + usage + ;; + *) + usage + ;; + esac + shift +done + +if [ "$use_d" = "-D" ] ; then dump_by_dr_path -elif [ -z "$1" ] ; then - dump_by_lid else - usage + dump_by_lid fi exit diff --git a/infiniband-diags/scripts/dump_mfts.sh b/infiniband-diags/scripts/dump_mfts.sh index 20281e8..39fc5fb 100755 --- a/infiniband-diags/scripts/dump_mfts.sh +++ b/infiniband-diags/scripts/dump_mfts.sh @@ -7,35 +7,66 @@ usage () { - echo "usage: $0 [-D]" + echo Usage: `basename $0` "[-h] [-D] [-C ca_name]" \ + "[-P ca_port] [-t(imeout) timeout_ms]" exit 2 } dump_by_lid () { -for sw_lid in `ibswitches \ +for sw_lid in `ibswitches $ca_info \ | sed -ne 's/^.* lid \([0-9a-f]*\) .*$/\1/p'` ; do - ibroute -M $sw_lid + ibroute $ca_info -M $sw_lid done } dump_by_dr_path () { -for sw_dr in `ibnetdiscover -v \ +for sw_dr in `ibnetdiscover $ca_info -v \ | sed -ne '/^DR path .* switch /s/^DR path 
\[\(.*\)\].*$/\1/p' \ | sed -e 's/\]\[/,/g' \ | sort -u` ; do - ibroute -D ${sw_dr} + ibroute $ca_info -M -D ${sw_dr} done } +use_d="" +ca_info="" -if [ "$1" = "-D" ] ; then +while [ "$1" ]; do + case $1 in + -D) + use_d="-D" + ;; + -h) + usage + ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; + -*) + usage + ;; + *) + usage + ;; + esac + shift +done + +if [ "$use_d" = "-D" ] ; then dump_by_dr_path -elif [ -z "$1" ] ; then - dump_by_lid else - usage + dump_by_lid fi exit diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in index e08eba3..01c7a99 100644 --- a/infiniband-diags/scripts/ibcheckerrors.in +++ b/infiniband-diags/scripts/ibcheckerrors.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-b] [-v] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-b] [-v] [-N | -nocolor]"\ + "[ | -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -21,6 +22,8 @@ v=0 ntype="" nodeguid="" oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -39,20 +42,35 @@ while [ "$1" ]; do brief=-b verbose="" ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -62,12 +80,12 @@ BEGIN { function check_node(lid) { nodechecked=1 - if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) { + if (system("'$IBPATH'/ibchecknode '"$ca_info"' '$gflags' '$verbose' " lid)) { ne++ badnode=1 return } - if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " 255")) + if 
(system("'$IBPATH'/ibcheckerrs '"$ca_info"' '$gflags' '$verbose' '$brief' " lid " 255")) nodeerr=1; } @@ -102,7 +120,7 @@ function check_node(lid) sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (nodeerr) - if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' '$brief' " lid " " port)) { + if (system("'$IBPATH'/ibcheckerrs '"$ca_info"' '$gflags' '$verbose' '$brief' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure" oldlid = lid diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in index ff3256b..99d45cd 100644 --- a/infiniband-diags/scripts/ibcheckerrs.in +++ b/infiniband-diags/scripts/ibcheckerrs.in @@ -3,7 +3,9 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-t ] [-s(how_thresholds)] [-N \| -nocolor] []" + echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-T ]" \ + "[-s(how_thresholds)] [-N \| -nocolor] [-C ca_name] [-P ca_port]" \ + "[-t(imeout) timeout_ms] []" exit -1 } @@ -64,6 +66,7 @@ guid_addr="" bw="" verbose="" brief="" +ca_info="" while [ "$1" ]; do case $1 in @@ -81,7 +84,7 @@ while [ "$1" ]; do brief=yes verbose="" ;; - -t) + -T) if ! [ -r $2 ]; then echo "Can't use threshold file '$2'" usage @@ -93,6 +96,18 @@ while [ "$1" ]; do show_thresholds exit 0 ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -121,7 +136,7 @@ else fi if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 @@ -129,16 +144,16 @@ if [ "$guid_addr" ]; then guid=$1 else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 fi fi -nodename=`smpquery nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` +nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` -if $IBPATH/perfquery $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' +if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' function blue(s) { if (brief == "yes") { diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in index 9f36742..e2f7fb8 100644 --- a/infiniband-diags/scripts/ibchecknet.in +++ b/infiniband-diags/scripts/ibchecknet.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \ + "[ | -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -18,6 +19,8 @@ gflags="" verbose="" v=0 oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -31,20 +34,35 @@ while [ "$1" ]; do verbose=-v v=0 ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -55,12 +73,12 @@ BEGIN { function check_node(lid) { nodechecked=1 - if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) { + if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) { ne++ badnode=1 return } - if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' " lid " 255")) + if (system("'$IBPATH'/ibcheckerrs'"$ca_info"' '$gflags' '$verbose' " lid " 255")) nodeerr=1; 
} @@ -94,7 +112,7 @@ function check_node(lid) } sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) - if (system("'$IBPATH'/ibcheckport '$gflags' '$verbose' " lid " " port)) { + if (system("'$IBPATH'/ibcheckport'"$ca_info"' '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure" oldlid = lid @@ -103,7 +121,7 @@ function check_node(lid) } if (nodeerr) - if (system("'$IBPATH'/ibcheckerrs '$gflags' '$verbose' " lid " " port)) { + if (system("'$IBPATH'/ibcheckerrs'"$ca_info"' '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure" oldlid = lid diff --git a/infiniband-diags/scripts/ibchecknode.in b/infiniband-diags/scripts/ibchecknode.in index 9d3aaba..5eea7b5 100644 --- a/infiniband-diags/scripts/ibchecknode.in +++ b/infiniband-diags/scripts/ibchecknode.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] " + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \ + "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] " exit -1 } @@ -30,6 +31,7 @@ function red() { guid_addr="" bw="" verbose="" +ca_info="" while [ "$1" ]; do case $1 in @@ -42,6 +44,18 @@ while [ "$1" ]; do -v) verbose=yes ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -57,14 +71,14 @@ if [ -z "$1" ]; then fi if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 fi else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 @@ -73,7 +87,7 @@ fi ## For now, check node only checks if node info is replied -if $IBPATH/smpquery nodeinfo $lid > /dev/null 2>&1 ; then +if $IBPATH/smpquery $ca_info nodeinfo $lid > /dev/null 2>&1 ; then if [ "$verbose" = "yes" ]; then echo -n "Node check lid $lid: " green OK diff --git a/infiniband-diags/scripts/ibcheckport.in b/infiniband-diags/scripts/ibcheckport.in index f910fdc..3c7c396 100644 --- a/infiniband-diags/scripts/ibcheckport.in +++ b/infiniband-diags/scripts/ibcheckport.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] " + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \ + "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] " exit -1 } @@ -30,6 +31,7 @@ function red() { guid_addr="" bw="" verbose="" +ca_info="" while [ "$1" ]; do case $1 in @@ -42,6 +44,18 @@ while [ "$1" ]; do -v) verbose=yes ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -59,7 +73,7 @@ fi portnum=$2 if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 @@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then guid=$1 else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 @@ -75,7 +89,7 @@ else fi -if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) diff --git a/infiniband-diags/scripts/ibcheckportstate.in b/infiniband-diags/scripts/ibcheckportstate.in index 3c36601..f3a5f05 100644 --- a/infiniband-diags/scripts/ibcheckportstate.in +++ b/infiniband-diags/scripts/ibcheckportstate.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] " + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \ + "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] " exit -1 } @@ -30,6 +31,7 @@ function red() { guid_addr="" bw="" verbose="" +ca_info="" while [ "$1" ]; do case $1 in @@ -42,6 +44,18 @@ while [ "$1" ]; do -v) verbose=yes ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -59,7 +73,7 @@ fi portnum=$2 if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 @@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then guid=$1 else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 @@ -75,7 +89,7 @@ else fi -if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) diff --git a/infiniband-diags/scripts/ibcheckportwidth.in b/infiniband-diags/scripts/ibcheckportwidth.in index 5f6762e..fdc75d1 100644 --- a/infiniband-diags/scripts/ibcheckportwidth.in +++ b/infiniband-diags/scripts/ibcheckportwidth.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-v] [-N \| -nocolor] [-G] " + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor] [-G]" \ + "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] " exit -1 } @@ -30,6 +31,7 @@ function red() { guid_addr="" bw="" verbose="" +ca_info="" while [ "$1" ]; do case $1 in @@ -42,6 +44,18 @@ while [ "$1" ]; do -v) verbose=yes ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -59,7 +73,7 @@ fi portnum=$2 if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 @@ -67,7 +81,7 @@ if [ "$guid_addr" ]; then guid=$1 else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 @@ -75,7 +89,7 @@ else fi -if $IBPATH/smpquery portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in index 30b5513..944e139 100644 --- a/infiniband-diags/scripts/ibcheckstate.in +++ b/infiniband-diags/scripts/ibcheckstate.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \ + "[ | -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -20,6 +21,8 @@ v=0 ntype="" nodeguid="" oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -33,20 +36,35 @@ while [ "$1" ]; do verbose=-v v=1 ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -57,7 +75,7 @@ BEGIN { function check_node(lid) { nodechecked=1 - if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) { + if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) { ne++ badnode=1 return @@ -93,7 +111,7 @@ function check_node(lid) } sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) - if (system("'$IBPATH'/ibcheckportstate '$gflags' '$verbose' " lid " " port)) { + if (system("'$IBPATH'/ibcheckportstate'"$ca_info"' '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 
0x" nodeguid " with failure" oldlid = lid diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in index 072d433..8ad0f7f 100644 --- a/infiniband-diags/scripts/ibcheckwidth.in +++ b/infiniband-diags/scripts/ibcheckwidth.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-v] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-v] [-N | -nocolor]" \ + "[ \| -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -20,6 +21,8 @@ v=0 ntype="" nodeguid="" oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -33,20 +36,35 @@ while [ "$1" ]; do verbose="-v" v=1 ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -57,7 +75,7 @@ BEGIN { function check_node(lid) { nodechecked=1 - if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) { + if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) { ne++ badnode=1 return @@ -93,7 +111,7 @@ function check_node(lid) } sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) - if (system("'$IBPATH'/ibcheckportwidth '$gflags' '$verbose' " lid " " port)) { + if (system("'$IBPATH'/ibcheckportwidth'"$ca_info"' '$gflags' '$verbose' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure" oldlid = lid diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in index 54551b3..b3c009e 100644 --- a/infiniband-diags/scripts/ibclearcounters.in +++ b/infiniband-diags/scripts/ibclearcounters.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- 
at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-N | -nocolor] [" \ + "| -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -18,6 +19,8 @@ gflags="" verbose="" v=0 oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -27,20 +30,35 @@ while [ "$1" ]; do -N|-nocolor) gflags=-N ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -48,14 +66,14 @@ eval $netcmd | awk ' function clear_counters(lid) { nodecleared=1 - if (system("'$IBPATH'/perfquery '$gflags' -R -a " lid)) + if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R -a " lid)) nodeerr++ } function clear_port_counters(lid, port) { nodecleared=1 - if (system("'$IBPATH'/perfquery '$gflags' -R " lid " " port)) + if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R " lid " " port)) nodeerr++ } diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in index 4a086ae..097c3fe 100644 --- a/infiniband-diags/scripts/ibclearerrors.in +++ b/infiniband-diags/scripts/ibclearerrors.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-N | -nocolor] [" \ + "| -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -18,6 +19,8 @@ gflags="" verbose="" v=0 oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -27,20 +30,35 @@ while [ "$1" ]; do -N|-nocolor) gflags=-N ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + 
ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -48,7 +66,7 @@ eval $netcmd | awk ' function clear_errors(lid, port) { nodecleared=1 - if (system("'$IBPATH'/perfquery '$gflags' -R " lid " " port " 0x0fff")) + if (system("'$IBPATH'/perfquery'"$ca_info"' '$gflags' -R " lid " " port " 0x0fff")) nodeerr++ } diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in index d27149e..bee9bd8 100644 --- a/infiniband-diags/scripts/ibdatacounters.in +++ b/infiniband-diags/scripts/ibdatacounters.in @@ -3,7 +3,8 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [-b] [-v] [-N \| -nocolor] [\] + echo Usage: `basename $0` "[-h] [-b] [-v] [-N | -nocolor]" \ + "[ \| -C ca_name -P ca_port -t(imeout) timeout_ms]" exit -1 } @@ -21,6 +22,8 @@ v=0 ntype="" nodeguid="" oldlid="" +topofile="" +ca_info="" while [ "$1" ]; do case $1 in @@ -39,20 +42,35 @@ while [ "$1" ]; do brief=-b verbose="" ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' @@ -62,12 +80,12 @@ BEGIN { function check_node(lid) { nodechecked=1 - if (system("'$IBPATH'/ibchecknode '$gflags' '$verbose' " lid)) { + if (system("'$IBPATH'/ibchecknode'"$ca_info"' '$gflags' '$verbose' " lid)) { ne++ badnode=1 return } - if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " 
255")) + if (system("'$IBPATH'/ibdatacounts'"$ca_info"' '$gflags' '$verbose' '$brief' " lid " 255")) nodeerr=1; } @@ -102,7 +120,7 @@ function check_node(lid) sub("\\(.*\\)", "", port) gsub("[\\[\\]]", "", port) if (nodeerr) - if (system("'$IBPATH'/ibdatacounts'$gflags' '$verbose' '$brief' " lid " " port)) { + if (system("'$IBPATH'/ibdatacounts'"$ca_info"' '$gflags' '$verbose' '$brief' " lid " " port)) { if (!'$v' && oldlid != lid) { print "# Checked " ntype ": nodeguid 0x" nodeguid " with failure" oldlid = lid diff --git a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in index 668558f..927a978 100644 --- a/infiniband-diags/scripts/ibdatacounts.in +++ b/infiniband-diags/scripts/ibdatacounts.in @@ -3,7 +3,9 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-N \| -nocolor] []" + echo Usage: `basename $0` "[-h] [-b] [-v] [-G] [-N | -nocolor]" \ + "[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] " \ + "[]" exit -1 } @@ -31,6 +33,7 @@ guid_addr="" bw="" verbose="" brief="" +ca_info="" while [ "$1" ]; do case $1 in @@ -48,6 +51,18 @@ while [ "$1" ]; do brief=yes verbose="" ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; @@ -76,7 +91,7 @@ else fi if [ "$guid_addr" ]; then - if ! lid=`$IBPATH/ibaddr -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then + if ! lid=`$IBPATH/ibaddr $ca_info -G -L $1 | awk '/failed/{exit -1} {print $3}'`; then echo -n "guid $1 address resolution: " red "FAILED" exit -1 @@ -84,16 +99,16 @@ if [ "$guid_addr" ]; then guid=$1 else lid=$1 - if ! temp=`$IBPATH/ibaddr -L $1 | awk '/failed/{exit -1} {print $1}'`; then + if ! 
temp=`$IBPATH/ibaddr $ca_info -L $1 | awk '/failed/{exit -1} {print $1}'`; then echo -n "lid $1 address resolution: " red "FAILED" exit -1 fi fi -nodename=`smpquery nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` +nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` -if $IBPATH/perfquery $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' +if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' function blue(s) { if (brief == "yes") { diff --git a/infiniband-diags/scripts/ibhosts.in b/infiniband-diags/scripts/ibhosts.in index b9aadc1..0d6b1bc 100644 --- a/infiniband-diags/scripts/ibhosts.in +++ b/infiniband-diags/scripts/ibhosts.in @@ -3,28 +3,48 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [\] + echo Usage: `basename $0` "[-h] [ | -C ca_name" \ + "-P ca_port -t(imeout) timeout_ms]" exit -1 } +topofile="" +ca_info="" + while [ "$1" ]; do case $1 in -h) usage ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac + shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' diff --git a/infiniband-diags/scripts/ibnodes.in b/infiniband-diags/scripts/ibnodes.in index 32acd9c..5871da8 100644 --- a/infiniband-diags/scripts/ibnodes.in +++ b/infiniband-diags/scripts/ibnodes.in @@ -2,4 +2,4 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} -$IBPATH/ibhosts; $IBPATH/ibswitches +$IBPATH/ibhosts $@; $IBPATH/ibswitches $@ diff --git a/infiniband-diags/scripts/ibrouters.in b/infiniband-diags/scripts/ibrouters.in index 96ebfe0..fea72bb 100644 --- a/infiniband-diags/scripts/ibrouters.in +++ b/infiniband-diags/scripts/ibrouters.in @@ -3,28 
+3,48 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [\] + echo Usage: `basename $0` "[-h] [ | -C ca_name" \ + "-P ca_port -t(imeout) timeout_ms]" exit -1 } +topofile="" +ca_info="" + while [ "$1" ]; do case $1 in -h) usage ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac + shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' diff --git a/infiniband-diags/scripts/ibswitches.in b/infiniband-diags/scripts/ibswitches.in index 2a92360..859aacd 100644 --- a/infiniband-diags/scripts/ibswitches.in +++ b/infiniband-diags/scripts/ibswitches.in @@ -3,28 +3,48 @@ IBPATH=${IBPATH:- at IBSCRIPTPATH@} function usage() { - echo Usage: `basename $0` [-h] [\] + echo Usage: `basename $0` "[-h] [ | -C ca_name" \ + "-P ca_port -t(imeout) timeout_ms]" exit -1 } +topofile="" +ca_info="" + while [ "$1" ]; do case $1 in -h) usage ;; + -P | -C | -t | -timeout) + case $2 in + -*) + usage + ;; + esac + if [ x$2 = x ] ; then + usage + fi + ca_info="$ca_info $1 $2" + shift + ;; -*) usage ;; *) - break + if [ "$topofile" ]; then + usage + fi + topofile="$1" ;; esac + shift done -if [ "$1" ]; then - netcmd="cat $1" +if [ "$topofile" ]; then + netcmd="cat $topofile" else - netcmd="$IBPATH/ibnetdiscover" + netcmd="$IBPATH/ibnetdiscover $ca_info" fi eval $netcmd | awk ' diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index daff824..522399e 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -73,6 +73,8 @@ osm_mad_pool_t mad_pool; osm_vendor_t *vendor = NULL; int osm_debug = 0; uint32_t sa_timeout_ms = DEFAULT_SA_TIMEOUT_MS; +char *sa_hca_name = NULL; +uint32_t 
sa_port_num = 0; enum { ALL, @@ -137,7 +139,7 @@ print_node_record(ib_node_record_t *node_record) if (p_ni->node_type == IB_NODE_TYPE_SWITCH) name = lookup_switch_name(switch_map_fp, cl_ntoh64(p_ni->node_guid), - p_nd->description); + (char *)p_nd->description); else name = clean_nodedesc((char *)p_nd->description); printf("%s\n", name); @@ -956,6 +958,7 @@ get_bind_handle(void) ib_api_status_t status; ib_port_attr_t attr_array[MAX_PORTS]; uint32_t num_ports = MAX_PORTS; + uint32_t ca_name_index = 0; complib_init(); @@ -985,6 +988,16 @@ get_bind_handle(void) } for (i = 0; i < num_ports; i++) { + if (i > 1 && cl_ntoh64(attr_array[i].port_guid) + != (cl_ntoh64(attr_array[i-1].port_guid) + 1)) + ca_name_index++; + if (sa_port_num && sa_port_num != attr_array[i].port_num) + continue; + if (sa_hca_name && i == 0) + continue; + if (sa_hca_name + && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) + continue; if (attr_array[i].link_state == IB_LINK_ACTIVE) port_guid = attr_array[i].port_guid; } @@ -1029,10 +1042,13 @@ clean_up(void) static void usage(void) { - fprintf(stderr, "Usage: %s [-h -d -P -N] [--list | -D] [-S -I -L -l -G -O -U -C -s -g -m --src-to-dst -t(imeout) ] [ | | ]\n", argv0); + fprintf(stderr, "Usage: %s [-h -d -p -N] [--list | -D] [-S -I -L -l -G" + " -O -U -c -s -g -m --src-to-dst -C " + "-P -t(imeout) ] [ | | ]\n", + argv0); fprintf(stderr, " Queries node records by default\n"); fprintf(stderr, " -d enable debugging\n"); - fprintf(stderr, " -P get PathRecord info\n"); + fprintf(stderr, " -p get PathRecord info\n"); fprintf(stderr, " -N get NodeRecord info\n"); fprintf(stderr, " --list | -D the node desc of the CA's\n"); fprintf(stderr, " -S get ServiceRecord info\n"); @@ -1042,15 +1058,21 @@ usage(void) fprintf(stderr, " -G return the Guids of the name specified\n"); fprintf(stderr, " -O return name for the Lid specified\n"); fprintf(stderr, " -U return name for the Guid specified\n"); - fprintf(stderr, " -C get the SA's class port info\n"); 
- fprintf(stderr, " -s return the PortInfoRecords with isSM or isSMdisabled capability mask bit on\n"); + fprintf(stderr, " -c get the SA's class port info\n"); + fprintf(stderr, " -s return the PortInfoRecords with isSM or " + "isSMdisabled capability mask bit on\n"); fprintf(stderr, " -g get multicast group info\n"); fprintf(stderr, " -m get multicast member info\n"); - fprintf(stderr, " (if multicast group specified, list member GIDs only for group specified\n"); + fprintf(stderr, " (if multicast group specified, list member GIDs" + " only for group specified\n"); fprintf(stderr, " specified, for example 'saquery -m 0xC000')\n"); fprintf(stderr, " --src-to-dst get a PathRecord for \n" - " where src amd dst are either node names or LIDs\n"); - fprintf(stderr, " -t | --timeout specify the SA query response timeout (default %u msec)\n", + " where src amd dst are either node " + "names or LIDs\n"); + fprintf(stderr, " -C specify the SA query HCA\n"); + fprintf(stderr, " -P specify the SA query port\n"); + fprintf(stderr, " -t | --timeout specify the SA query " + "response timeout (default %u msec)\n", DEFAULT_SA_TIMEOUT_MS); fprintf(stderr, " --switch-map specify a switch map\n"); exit(-1); @@ -1068,9 +1090,9 @@ main(int argc, char **argv) ib_net16_t dst_lid; ib_api_status_t status; - static char const str_opts[] = "PVNDLlGOUCSIsgmdht:"; + static char const str_opts[] = "pVNDLlGOUcSIsgmdhP:C:t:"; static const struct option long_opts [] = { - {"P", 0, 0, 'P'}, + {"p", 0, 0, 'p'}, {"Version", 0, 0, 'V'}, {"N", 0, 0, 'N'}, {"L", 0, 0, 'L'}, @@ -1082,9 +1104,11 @@ main(int argc, char **argv) {"g", 0, 0, 'g'}, {"m", 0, 0, 'm'}, {"d", 0, 0, 'd'}, - {"C", 0, 0, 'C'}, + {"c", 0, 0, 'c'}, {"S", 0, 0, 'S'}, {"I", 0, 0, 'I'}, + {"P", 1, 0, 'P'}, + {"C", 1, 0, 'C'}, {"help", 0, 0, 'h'}, {"list", 0, 0, 'D'}, {"src-to-dst", 1, 0, 1}, @@ -1118,7 +1142,7 @@ main(int argc, char **argv) case 2: switch_map = strdup(optarg); break; - case 'P': + case 'p': query_type = 
IB_MAD_ATTR_PATH_RECORD;
 			break;
 		case 'V':
@@ -1127,7 +1151,7 @@ main(int argc, char **argv)
 		case 'D':
 			node_print_desc = ALL_DESC;
 			break;
-		case 'C':
+		case 'c':
 			query_type = IB_MAD_ATTR_CLASS_PORT_INFO;
 			break;
 		case 'S':
@@ -1167,6 +1191,12 @@ main(int argc, char **argv)
 		case 'd':
 			osm_debug = 1;
 			break;
+		case 'C':
+			sa_hca_name = optarg;
+			break;
+		case 'P':
+			sa_port_num = strtoul(optarg, NULL, 0);
+			break;
 		case 't':
 			sa_timeout_ms = strtoul(optarg, NULL, 0);
 			break;

From mst at dev.mellanox.co.il Tue Jul 31 14:08:39 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 00:08:39 +0300
Subject: [ofa-general] Re: patches for 1.2.c
In-Reply-To: <46AF89A0.9070805@opengridcomputing.com>
References: <46AF89A0.9070805@opengridcomputing.com>
Message-ID: <20070731210839.GE20859@mellanox.co.il>

> Quoting Steve Wise :
> Subject: patches for 1.2.c
>
> Guys,
>
> I have 2 more patches to go in ofed_1_2/ofed_1_2_c.
>
> Is there some grand scheme to the naming of kernel_patches/fixes/* for
> 1.2.c? I noticed a slew of new files for the post-2.6.22 fixes, and
> wondered if there is a naming scheme?

Not really, just stick the module name in there please
so it's easy to figure that cxgb3 is involved.

> Or should I just post a patch for the ofed_1_2 branch and let you all
> create the ofed_1_2_c kernel_patches/fixes/ patch file ??

It's best if you post the patch that should go into kernel_patches/fixes/,
or clone the ofed_1_2_c branch and add the file there.

-- 
MST

From mst at dev.mellanox.co.il Tue Jul 31 14:18:39 2007
From: mst at dev.mellanox.co.il (Michael S.
Tsirkin)
Date: Wed, 1 Aug 2007 00:18:39 +0300
Subject: [ofa-general] Re: QoS RFC
In-Reply-To:
References: <46A283B6.1070105@dev.mellanox.co.il> <20070723002010.GU27878@sashak.voltaire.com> <46A89608.9010709@dev.mellanox.co.il> <20070731160223.GF29844@sashak.voltaire.com>
Message-ID: <20070731211839.GH20859@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: QoS RFC
>
> I think that defining a new file format is really going in the wrong
> direction. XML would make a lot of sense (and you could use something
> like RELAX NG to define the schema very readably and precisely). XML
> has the advantage that many parsers, GUI editors, and other tools are
> already widely available.
>
> If you don't like XML for whatever reason, please at least consider
> something like YAML before you invent something completely new.

I second that.

-- 
MST

From mst at dev.mellanox.co.il Tue Jul 31 14:56:47 2007
From: mst at dev.mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Aug 2007 00:56:47 +0300
Subject: [ofa-general] Re: OFED 1.2.c-9 is available
In-Reply-To:
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF8B29.7090906@mellanox.co.il> <46AF90D1.8050000@mellanox.co.il>
Message-ID: <20070731215647.GB5290@mellanox.co.il>

> Quoting Roland Dreier :
> Subject: Re: [ofa-general] Re: OFED 1.2.c-9 is available
>
> > Why under drivers/net rather than drivers/infiniband like all the
> > other drivers ? Does this really need special casing (in libibumad) ?
>
> Tziporet is incorrect. There's nothing from the mlx4_core driver
> either, and when it is implemented, it should work exactly the same as
> all other drivers.

At some point you suggested sticking this stuff under the pci device and
adding softlinks under drivers/infiniband, so that
if there's an ethernet device on top of the core these can be shared.
Not sure how to do this though, and no idea why would
just adding the attributes in both places be any worse, either.

Comments?

-- 
MST

From sean.hefty at intel.com Tue Jul 31 15:10:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 15:10:54 -0700
Subject: [ofa-general] [PATCH] ib/mad: fix address handle leak in mad_rmpp
In-Reply-To: <46AC6B5C.6020702@dev.mellanox.co.il>
Message-ID: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com>

The address handle associated with dual-sided RMPP direction switch ACKs is
never destroyed. Free the AH for ACKs which fall into this category.

Problem was reported by Dotan Barak (dotanb at dev.mellanox.co.il).

Signed-off-by: Sean Hefty
---
Dotan, can you verify that this fixes the problem for you? (I tested against
osmtest as you indicated as well.)

Roland, this fix would be for 2.6.23.

 drivers/infiniband/core/mad_rmpp.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
index 3663fd7..d43bc62 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -163,8 +163,10 @@ static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent,
 				 hdr_len, 0, GFP_KERNEL);
 	if (IS_ERR(msg))
 		ib_destroy_ah(ah);
-	else
+	else {
 		msg->ah = ah;
+		msg->context[0] = ah;
+	}
 
 	return msg;
 }
@@ -197,9 +199,7 @@ static void ack_ds_ack(struct ib_mad_agent_private *agent,
 
 void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc)
 {
-	struct ib_rmpp_mad *rmpp_mad = mad_send_wc->send_buf->mad;
-
-	if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_ACK)
+	if (mad_send_wc->send_buf->context[0] == mad_send_wc->send_buf->ah)
 		ib_destroy_ah(mad_send_wc->send_buf->ah);
 	ib_free_send_mad(mad_send_wc->send_buf);
 }

From rdreier at cisco.com Tue Jul 31 15:19:52 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 15:19:52 -0700
Subject: [ofa-general] Re: OFED 1.2.c-9 is available
In-Reply-To: <20070731215647.GB5290@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 1 Aug 2007 00:56:47 +0300")
References: <6C2C79E72C305246B504CBA17B5500C901563925@mtlexch01.mtl.com> <46AF8B29.7090906@mellanox.co.il> <46AF90D1.8050000@mellanox.co.il> <20070731215647.GB5290@mellanox.co.il>
Message-ID:

> At some point you suggested sticking this stuff under the pci device and
> adding softlinks under drivers/infiniband, so that
> if there's an ethernet device on top of the core these can be shared.
> Not sure how to do this though, and no idea why would
> just adding the attributes in both places be any worse, either.

I didn't look at whether it's easy to create symlinks in sysfs. I
don't really see any problem with just having both mlx4_ib and
mlx4_eth export the same data, and in fact that may make sense if
there is a different way mlx4_eth might want to export it (ethtool?).

However I definitely don't think we should force all userspace tools
to look in two different places in sysfs for information. And if
there are some attributes that all devices support, then I guess we
should move the implementation of those attrs into core/sysfs.c.

 - R.

From rdreier at cisco.com Tue Jul 31 15:20:18 2007
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 31 Jul 2007 15:20:18 -0700
Subject: [ofa-general] Re: [PATCH] ib/mad: fix address handle leak in mad_rmpp
In-Reply-To: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Tue, 31 Jul 2007 15:10:54 -0700")
References: <000001c7d3bf$aa5e0090$ff0da8c0@amr.corp.intel.com>
Message-ID:

> Roland, this fix would be for 2.6.23.

OK, I'll wait for Dotan's ACK.

From sean.hefty at intel.com Tue Jul 31 17:04:54 2007
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 31 Jul 2007 17:04:54 -0700
Subject: [ofa-general] RE: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
In-Reply-To: <46ACF9DD.1010509@opengridcomputing.com>
Message-ID: <000001c7d3cf$97a0b050$9c98070a@amr.corp.intel.com>

>The correct solution in my mind is to use the host stack's TCP port
>space for _all_ RDMA_PS_TCP port allocations. The patch below is a
>minimal delta to unify the port spaces by using the kernel stack to
>bind ports. This is done by allocating a kernel socket and binding to
>the appropriate local addr/port. It also allows the kernel stack to
>pick ephemeral ports by virtue of just passing in port 0 on the kernel
>bind operation.

I'm not thrilled with the idea of overlapping port spaces, and I can't come up
with a solution that works for all situations. I understand the overlapping port
space problem, but I consider the ability to use the same port number for both
RDMA and sockets a feature.

What if MPI used a similar mechanism as SDP? That is, if it gets a port number
from sockets, it reserves that same RDMA port number, or vice-versa. The
rdma_cm advertises separate port spaces from TCP/UDP, so IMO any assumption
otherwise, at this point, is a bug in the user's code.

Before merging the port spaces, I'd like a way for an application to use a
single well-known port number that works over both RDMA and sockets.

>RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

Is there any reason to limit this behavior to TCP only, or would we also include
UDP?

>diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>index 9e0ab04..e4d2d7f 100644
>--- a/drivers/infiniband/core/cma.c
>+++ b/drivers/infiniband/core/cma.c
>@@ -111,6 +111,7 @@ struct rdma_id_private {
> 	struct rdma_cm_id	id;
>
> 	struct rdma_bind_list	*bind_list;
>+	struct socket		*sock;

This points off to a rather largish structure...

> 	struct hlist_node	node;
> 	struct list_head	list;
> 	struct list_head	listen_list;
>@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
> 		kfree(bind_list);
> 	}
> 	mutex_unlock(&lock);
>+	if (id_priv->sock)
>+		sock_release(id_priv->sock);
> }
>
> void rdma_destroy_id(struct rdma_cm_id *id)
>@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
> 	return 0;
> }
>
>+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>+{
>+	int ret;
>+	struct socket *sock;
>+
>+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>+	if (ret)
>+		return ret;
>+	ret = sock->ops->bind(sock,
>+		  (struct socketaddr *)&id_priv->id.route.addr.src_addr,
>+		  ip_addr_size(&id_priv->id.route.addr.src_addr));
>+	if (ret) {
>+		sock_release(sock);
>+		return ret;
>+	}
>+	id_priv->sock = sock;
>+	return 0;
>+}
>+
> static int cma_get_port(struct rdma_id_private *id_priv)
> {
> 	struct idr *ps;
>@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
> 		break;
> 	case RDMA_PS_TCP:
> 		ps = &tcp_ps;
>+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
>+		if (ret)
>+			goto out;

Would we need tcp_ps (and udp_ps) anymore? Also, I think SDP maps into the TCP
port space already, so changes to SDP will be needed as well, which may
eliminate its port space.

- Sean

From davem at systemfabricworks.com Tue Jul 31 17:09:54 2007
From: davem at systemfabricworks.com (davem at systemfabricworks.com)
Date: Tue, 31 Jul 2007 19:09:54 -0500
Subject: [ofa-general] [PATCH] infiniband-diags/scripts: Fix Bug 239 Error Reporting
Message-ID: <46AFCF52.mailFVL1I50O7@systemfabricworks.com>

Fix Bug 239 OpenIB diag scripts don't return error when lacking umad
permissions.

Returning the error from the head of a shell pipeline is a problem,
so this fix causes the awk scripts to pass error messages through.
This will pass all standard error messages.
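
The pipeline problem described above can be illustrated with a small sketch.
It is not part of the patch: producer is a hypothetical stand-in for a command
such as ibnetdiscover, and the warning text it prints is simulated. In a plain
shell pipeline the status seen by the caller is that of the last command (awk),
so a failure at the head is lost unless the output and the exit status are
captured first, which appears to be the approach the patch takes.

```shell
#!/bin/sh
# Hypothetical stand-in for a diag command that prints a warning and fails.
# The message text is simulated, not real tool output.
producer() {
	echo "ibwarn: simulated failure"
	return 3
}

# Plain pipeline: without pipefail, the status tested here is awk's,
# so the producer's failure goes unnoticed.
if producer | awk '{ print $0 }' >/dev/null; then
	naive="ok"
else
	naive="failed"
fi

# Capture-first approach: save the output and the exit status of the
# head command, then feed the saved text to awk. The patch writes the
# capture with backticks; $(...) is used here and is equivalent.
rv=0
text=$(producer) || rv=$?
filtered=$(echo "$text" | awk '/ibwarn:/ { print $0 }')

echo "naive pipeline: $naive"
echo "captured status: $rv"
echo "passed through: $filtered"
```

A script can then exit with the saved status in $rv after the awk stage has
run, instead of relying on the status of the pipeline itself.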
This patch needs [ofa-general] [PATCH] infiniband-diags: Add common flags
-P, -C, and -t (posted Tue Jul 31 13:39:27 PDT 2007) applied first.

Signed-off-by: David A. McMillen
---
 infiniband-diags/scripts/ibcheckerrors.in    |   11 +++++++++--
 infiniband-diags/scripts/ibcheckerrs.in      |   13 ++++++++++---
 infiniband-diags/scripts/ibchecknet.in       |   16 ++++++++++++++--
 infiniband-diags/scripts/ibcheckport.in      |   11 +++++++++--
 infiniband-diags/scripts/ibcheckportstate.in |   11 +++++++++--
 infiniband-diags/scripts/ibcheckportwidth.in |   11 +++++++++--
 infiniband-diags/scripts/ibcheckstate.in     |   10 +++++++++-
 infiniband-diags/scripts/ibcheckwidth.in     |   10 +++++++++-
 infiniband-diags/scripts/ibclearcounters.in  |   10 +++++++++-
 infiniband-diags/scripts/ibclearerrors.in    |   10 +++++++++-
 infiniband-diags/scripts/ibdatacounters.in   |   11 +++++++++--
 infiniband-diags/scripts/ibdatacounts.in     |   11 +++++++++--
 infiniband-diags/scripts/ibhosts.in          |    9 ++++++++-
 infiniband-diags/scripts/ibrouters.in        |    9 ++++++++-
 infiniband-diags/scripts/ibswitches.in       |    9 ++++++++-
 15 files changed, 138 insertions(+), 24 deletions(-)

diff --git a/infiniband-diags/scripts/ibcheckerrors.in b/infiniband-diags/scripts/ibcheckerrors.in
index 01c7a99..ebf44ec 100644
--- a/infiniband-diags/scripts/ibcheckerrors.in
+++ b/infiniband-diags/scripts/ibcheckerrors.in
@@ -73,7 +73,9 @@ else
 	netcmd="$IBPATH/ibnetdiscover $ca_info"
 fi
 
-eval $netcmd | awk '
+text="`eval $netcmd`"
+rv=$?
+echo "$text" | awk '
 BEGIN {
 	ne=0
 }
@@ -129,10 +131,15 @@ function check_node(lid)
 	}
 }
 
+/^ib/	{print $0}
+/ibpanic:/	{print $0}
+/ibwarn:/	{print $0}
+/iberror:/	{print $0}
+
 END {
 	printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne
 	printf "## %d ports checked, %d ports have errors beyond threshold\n", nports, pcnterr
 	exit (ne + pcnterr)
 }
 '
-exit $?
+exit $rv diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in index 99d45cd..aa29525 100644 --- a/infiniband-diags/scripts/ibcheckerrs.in +++ b/infiniband-diags/scripts/ibcheckerrs.in @@ -151,9 +151,11 @@ else fi fi -nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` +nodename=`$IBPATH/smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` -if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' +text="`eval $IBPATH/perfquery $ca_info $lid $portnum`" +rv=$? +if echo "$text" | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' function blue(s) { if (brief == "yes") { @@ -184,6 +186,11 @@ BEGIN { /^CounterSelect/ {next} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + /^PortSelect/ { if ($2 != '$portnum') {err = err "error: lid '$lid' port " $2 " does not match query ('$portnum')\n"; exit -1}} $1 ~ "(Xmt|Rcv)(Pkts|Data)" { next } @@ -201,7 +208,7 @@ END { exit -1 } exit 0 -}' 2>&1 ; then +}' 2>&1 && test $rv -eq 0 ; then if [ "$verbose" = "yes" ]; then echo -n "Error check on lid $lid ($nodename) port $portname: " green OK diff --git a/infiniband-diags/scripts/ibchecknet.in b/infiniband-diags/scripts/ibchecknet.in index e2f7fb8..a47ab8e 100644 --- a/infiniband-diags/scripts/ibchecknet.in +++ b/infiniband-diags/scripts/ibchecknet.in @@ -65,7 +65,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' BEGIN { ne=0 pe=0 @@ -130,6 +132,11 @@ function check_node(lid) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne printf "## %d ports checked, %d bad ports found\n", nports, pe @@ -137,4 +144,9 @@ END { exit (ne + pe + pcnterr) } ' -exit $? +av=$? 
+if [ $av -ne 0 ] ; then + exit $av +else + exit $rv +fi diff --git a/infiniband-diags/scripts/ibcheckport.in b/infiniband-diags/scripts/ibcheckport.in index 3c7c396..94cfc6c 100644 --- a/infiniband-diags/scripts/ibcheckport.in +++ b/infiniband-diags/scripts/ibcheckport.in @@ -89,7 +89,9 @@ else fi -if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`" +rv=$? +if echo "$text" | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) @@ -114,6 +116,11 @@ function blue(s) #/^LocalPort/ { if ($2 != '$portnum') {err = err "#error: port " $2 " does not match query ('$portnum')\n"; exit -1}} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { if (err != "") { blue(err) @@ -124,7 +131,7 @@ END { exit -1 } exit 0 -}' 2>&1 ; then +}' 2>&1 && test $rv -eq 0 ; then if [ "$verbose" = "yes" ]; then echo -n "Port check lid $lid port $portnum: " green "OK" diff --git a/infiniband-diags/scripts/ibcheckportstate.in b/infiniband-diags/scripts/ibcheckportstate.in index f3a5f05..2931f06 100644 --- a/infiniband-diags/scripts/ibcheckportstate.in +++ b/infiniband-diags/scripts/ibcheckportstate.in @@ -89,7 +89,9 @@ else fi -if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`" +rv=$? 
+if echo "$text" | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) @@ -106,6 +108,11 @@ function blue(s) /^LinkState/{ if ($2 != "Active") warn = warn "#warn: Logical link state is " $2 " lid '$lid' port '$portnum'\n"} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { if (err != "") { blue(err) @@ -116,7 +123,7 @@ END { exit -1 } exit 0 -}' 2>&1 ; then +}' 2>&1 && test $rv -eq 0 ; then if [ "$verbose" = "yes" ]; then echo -n "Port check lid $lid port $portnum: " green "OK" diff --git a/infiniband-diags/scripts/ibcheckportwidth.in b/infiniband-diags/scripts/ibcheckportwidth.in index fdc75d1..84f1ef7 100644 --- a/infiniband-diags/scripts/ibcheckportwidth.in +++ b/infiniband-diags/scripts/ibcheckportwidth.in @@ -89,7 +89,9 @@ else fi -if $IBPATH/smpquery $ca_info portinfo $lid $portnum | awk -v mono=$bw -F '[.:]*' ' +text="`eval $IBPATH/smpquery $ca_info portinfo $lid $portnum`" +rv=$? +if echo "$text" | awk -v mono=$bw -F '[.:]*' ' function blue(s) { if (mono) @@ -104,6 +106,11 @@ function blue(s) /^LinkWidthSupported/{ if ($2 != "1X") { next } } /^LinkWidthActive/{ if ($2 == "1X") warn = warn "#warn: Link configured as 1X lid '$lid' port '$portnum'\n"} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { if (err != "") { blue(err) @@ -114,7 +121,7 @@ END { exit -1 } exit 0 -}' 2>&1 ; then +}' 2>&1 && test $rv -eq 0 ; then if [ "$verbose" = "yes" ]; then echo -n "Port check lid $lid port $portnum: " green "OK" diff --git a/infiniband-diags/scripts/ibcheckstate.in b/infiniband-diags/scripts/ibcheckstate.in index 944e139..6ce0854 100644 --- a/infiniband-diags/scripts/ibcheckstate.in +++ b/infiniband-diags/scripts/ibcheckstate.in @@ -67,7 +67,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? 
+echo "$text" | awk ' BEGIN { ne=0 pe=0 @@ -120,8 +122,14 @@ function check_node(lid) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne printf "## %d ports checked, %d ports with bad state found\n", nports, pe } ' +exit $rv diff --git a/infiniband-diags/scripts/ibcheckwidth.in b/infiniband-diags/scripts/ibcheckwidth.in index 8ad0f7f..f8f6a8b 100644 --- a/infiniband-diags/scripts/ibcheckwidth.in +++ b/infiniband-diags/scripts/ibcheckwidth.in @@ -67,7 +67,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' BEGIN { ne=0 pe=0 @@ -120,8 +122,14 @@ function check_node(lid) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne printf "## %d ports checked, %d ports with 1x width in error found\n", nports, pe } ' +exit $rv diff --git a/infiniband-diags/scripts/ibclearcounters.in b/infiniband-diags/scripts/ibclearcounters.in index b3c009e..1818c42 100644 --- a/infiniband-diags/scripts/ibclearcounters.in +++ b/infiniband-diags/scripts/ibclearcounters.in @@ -61,7 +61,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? 
+echo "$text" | awk ' function clear_counters(lid) { @@ -100,7 +102,13 @@ function clear_port_counters(lid, port) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes cleared %d errors\n", nnodes, nodeerr } ' +exit $rv diff --git a/infiniband-diags/scripts/ibclearerrors.in b/infiniband-diags/scripts/ibclearerrors.in index 097c3fe..c63283a 100644 --- a/infiniband-diags/scripts/ibclearerrors.in +++ b/infiniband-diags/scripts/ibclearerrors.in @@ -61,7 +61,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' function clear_errors(lid, port) { @@ -93,7 +95,13 @@ function clear_errors(lid, port) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes cleared %d errors\n", nnodes, nodeerr } ' +exit $rv diff --git a/infiniband-diags/scripts/ibdatacounters.in b/infiniband-diags/scripts/ibdatacounters.in index bee9bd8..902a865 100644 --- a/infiniband-diags/scripts/ibdatacounters.in +++ b/infiniband-diags/scripts/ibdatacounters.in @@ -73,7 +73,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' BEGIN { ne=0 } @@ -128,10 +130,15 @@ function check_node(lid) } } +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + END { printf "\n## Summary: %d nodes checked, %d bad nodes found\n", nnodes, ne printf "## %d ports checked\n", nports exit (ne ) } ' -exit $? 
+exit $rv diff --git a/infiniband-diags/scripts/ibdatacounts.in b/infiniband-diags/scripts/ibdatacounts.in index 927a978..bbdff71 100644 --- a/infiniband-diags/scripts/ibdatacounts.in +++ b/infiniband-diags/scripts/ibdatacounts.in @@ -108,7 +108,9 @@ fi nodename=`smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:\.*\(.*\)/\1/"` -if $IBPATH/perfquery $ca_info $lid $portnum | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' +text="`eval $IBPATH/perfquery $ca_info $lid $portnum`" +rv=$? +if echo "$text" | awk -v mono=$bw -v brief=$brief -F '[.:]*' ' function blue(s) { if (brief == "yes") { @@ -128,6 +130,11 @@ function blue(s) /^CounterSelect/ {next} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} + /^PortSelect/ { if ($2 != '$portnum') {err = err "error: lid '$lid' port " $2 " does not match query ('$portnum')\n"; exit -1}} $1 ~ "(Xmt|Rcv)(Pkts|Data)" { print $1 ":........................." $2 } @@ -142,7 +149,7 @@ END { exit -1 } exit 0 -}' 2>&1 ; then +}' 2>&1 && test $rv -eq 0 ; then if [ "$verbose" = "yes" ]; then echo -n "Error on lid $lid ($nodename) port $portname: " green OK diff --git a/infiniband-diags/scripts/ibhosts.in b/infiniband-diags/scripts/ibhosts.in index 0d6b1bc..a287edf 100644 --- a/infiniband-diags/scripts/ibhosts.in +++ b/infiniband-diags/scripts/ibhosts.in @@ -47,7 +47,14 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? 
+echo "$text" | awk ' /^Ca/ {print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ substr($0, match($0, "#[ \t]*")+RLENGTH)} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} ' +exit $rv diff --git a/infiniband-diags/scripts/ibrouters.in b/infiniband-diags/scripts/ibrouters.in index fea72bb..e053794 100644 --- a/infiniband-diags/scripts/ibrouters.in +++ b/infiniband-diags/scripts/ibrouters.in @@ -47,7 +47,14 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' /^Rt/ {print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ substr($0, match($0, "#[ \t]*")+RLENGTH)} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} ' +exit $rv diff --git a/infiniband-diags/scripts/ibswitches.in b/infiniband-diags/scripts/ibswitches.in index 859aacd..0476d0e 100644 --- a/infiniband-diags/scripts/ibswitches.in +++ b/infiniband-diags/scripts/ibswitches.in @@ -47,7 +47,9 @@ else netcmd="$IBPATH/ibnetdiscover $ca_info" fi -eval $netcmd | awk ' +text="`eval $netcmd`" +rv=$? +echo "$text" | awk ' /^Switch/ { l=$0 desc=substr(l, match(l, "#[ \t]*")+RLENGTH) @@ -69,4 +71,9 @@ eval $netcmd | awk ' else print $1 "\t: 0x" substr($3, 4, 16) " ports " $2 " "\ desc " " type " " pinfo} +/^ib/ {print $0} +/ibpanic:/ {print $0} +/ibwarn:/ {print $0} +/iberror:/ {print $0} ' +exit $rv From tamura at osrg.net Tue Jul 31 17:58:51 2007 From: tamura at osrg.net (Yoshiaki Tamura) Date: Wed, 01 Aug 2007 09:58:51 +0900 Subject: [ofa-general] [PATCH] build_env.sh: remove package checking for debian In-Reply-To: <20070730091020.GH9963@mellanox.co.il> References: <20070730091020.GH9963@mellanox.co.il> Message-ID: <46AFDACB.907@osrg.net> Hi, The following patch removes some package checking which stops building OFED 1.2 on debian. Although it's not a complete solution to port OFED on debian, it should help some debian users to build. 
Thanks, Yoshi Signed-off-by: Yoshi Tamura --- a/build_env.sh 2007-06-21 23:38:41.000000000 +0900 +++ b/build_env.sh 2007-07-31 14:21:50.000000000 +0900 @@ -1790,6 +1790,8 @@ debian) libibverbs_BUILD_REQ= libibverbs_devel_BUILD_REQ= + libibverbs_RUN_REQ= + mstflint_BUILD_REQ= ;; redhat5) # start_udev is required to create /dev/infiniband From mst at dev.mellanox.co.il Tue Jul 31 22:41:52 2007 From: mst at dev.mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Aug 2007 08:41:52 +0300 Subject: [ofa-general] Re: RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: <46AF9B7C.7020906@ichips.intel.com> References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com> <46AF9B7C.7020906@ichips.intel.com> Message-ID: <20070801054152.GB17884@mellanox.co.il> > Quoting Arlin Davis : > Subject: Re: RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter > > Sean Hefty wrote: > > >>OFED 1.2 removed the rdma_set_option call used to adjust response > >>timeout. We > >>are running into some cases on larger clusters that require longer > >>timeouts > >>then the default. Can you consider this rdma_cm patch for OFED 1.2.1 that > >>adds > >>a module parameter for the response timeout? Thanks. > >> > >> > > > >What's in it for me? :) > > > > > > > >>Signed-off by: Arlin Davis > >> > >> > > > >Acked-by: Sean Hefty > > > >Vlad, can you add this for OFED 1.2.1? > > > >- Sean > > > > > > Did this get added to 1.2.1?
http://www.openfabrics.org/git/?p=ofed_1_2/linux-2.6.git;a=blob;f=kernel_patches/fixes/cma_response_timeout.patch;hb=HEAD -- MST From vlad at mellanox.co.il Tue Jul 31 22:54:29 2007 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 1 Aug 2007 08:54:29 +0300 Subject: [ofa-general] RE: [PATCH] OFED 1.2.1 rdma_cm response timeout module parameter In-Reply-To: <46AF9B7C.7020906@ichips.intel.com> References: <001301c7c404$4ace0f00$3c98070a@amr.corp.intel.com> <46AF9B7C.7020906@ichips.intel.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C901FF14DA@mtlexch01.mtl.com> > Sean Hefty wrote: > > >>OFED 1.2 removed the rdma_set_option call used to adjust response > timeout. We > >>are running into some cases on larger clusters that require longer > timeouts > >>then the default. Can you consider this rdma_cm patch for OFED 1.2.1 > that adds > >>a module parameter for the response timeout? Thanks. > >> > >> > > > >What's in it for me? :) > > > > > > > >>Signed-off by: Arlin Davis > >> > >> > > > >Acked-by: Sean Hefty > > > >Vlad, can you add this for OFED 1.2.1? > > > >- Sean > > > > > > Did this get added to 1.2.1? Yes, It is in ofed_1_2/linux-2.6.git (both ofed_1_2 and ofed_1_2_c branches) commit 020bfb400c759ba89ffb0b13c41f2ca50181aebe Author: Arlin Davis Date: Thu Jul 12 12:01:39 2007 +0300 OFED 1.2 removed the rdma_set_option call used to adjust response timeout. We are running into some cases on larger clusters that require longer timeouts then the default. Signed-off by: Arlin Davis Regards, Vladimir